Description
Robert Strube
2018-10-23 05:14:00 UTC
Please attach your dmesg output.

Created attachment 142151 [details]
dmesg log booting system with eGPU attached and powered
Starting at line:
[ 11.192733] ATOM BIOS: 401815-171128-QS1
You can see the failure that occurs when trying to initialize the RX 580 as an eGPU over Thunderbolt 3.
Quick question: is it possible to completely disable the Vega M using kernel boot parameters? I did try pci-stub.ids=xxxx:xxxx with the PCI hex ID for my Vega M (1002:694e), but amdgpu was still bound to the device; I'm not sure why. I also thought there was explicit PCI device blacklisting support in the kernel, but I have been unable to find any documentation on it. Ideally I'd like to see whether having the Vega M disabled allows the eGPU to be correctly initialized. I took a look at the documentation for amdgpu, but I didn't see any boot parameters that stood out to me. Blacklisting the amdgpu module wouldn't work either, as I need it to support the RX 580 once it's attached. Thanks!

I decided to apply a hack to 4.19 to see if I could get the eGPU to initialize. I noticed that this code in drivers/gpu/drm/amd/amdgpu/atom.c

    if ((jiffies_to_msecs(cjiffies) > 5000)) {
        DRM_ERROR("atombios stuck in loop for more than 5secs aborting\n");
        ctx->abort = true;
    }

is where the error is being thrown, so I thought I would try giving the eGPU more time. I increased the 5000 value to 15000, recompiled the kernel, and tried to attach the eGPU. Unfortunately I received the same error, just after 15 seconds of trying to initialize the GPU instead of 5. Should I increase the timeout even more? I'm not sure whether the issue is actually related to not having enough time, or whether it's something else entirely. I'll bump it up to 30 seconds in a final last-ditch attempt.

If you can get any of the other methods to work you can remove the vegam device IDs from the driver. That said, I doubt it will make a difference. Usually the problem with Thunderbolt is that PCI BAR resources don't get assigned properly to the devices and the ones the driver needs are not available. That doesn't seem to be the case here, but I might be missing something.
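For reference, a minimal sketch of one way to take only the internal Vega M out of the picture while leaving the amdgpu module available for the eGPU, using the standard PCI sysfs interface; the 0000:01:00.0 address is an assumption taken from a later comment in this thread and should be verified with lspci first:

    # As root: unbind amdgpu from the internal Vega M only; the module stays loaded for the RX 580
    echo 0000:01:00.0 > /sys/bus/pci/drivers/amdgpu/unbind

    # Or drop the device from the PCI bus entirely; a rescan brings it back later
    echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
    echo 1 > /sys/bus/pci/rescan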
Created attachment 142182 [details]
dmesg log booting system with eGPU (Vega M device IDs removed in kernel)
Thanks for the suggestions! I took your advice and commented out the Vega M device IDs located in drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c. These are the lines of code that I commented out:

    /* VEGAM */
    {0x1002, 0x694C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VEGAM},
    {0x1002, 0x694E, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VEGAM},

This did indeed cause my Vega M to not be initialized, *but* the problem I'm having with the eGPU remains. So it appears you were correct, and my hunch that the Vega M is interfering with the eGPU initialization was incorrect. I'm back to square one... I uploaded a new dmesg log for this kernel; perhaps with the Vega M out of the equation you might see something new?

Thanks!
Rob

There does not seem to be enough MMIO space for the BARs on the Thunderbolt bridges:

[ 0.436946] pci 0000:04:00.0: BAR 13: no space for [io size 0x4000]
[ 0.436947] pci 0000:04:00.0: BAR 13: failed to assign [io size 0x4000]
[ 0.436949] pci 0000:04:00.0: BAR 13: assigned [io 0xc000-0xcfff]
[ 0.436950] pci 0000:04:00.0: BAR 13: [io 0xc000-0xcfff] (failed to expand by 0x3000)
[ 0.436951] pci 0000:04:00.0: failed to add 3000 res[13]=[io 0xc000-0xcfff]
[ 0.436955] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 0.436956] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 0.436957] pci 0000:05:01.0: BAR 13: no space for [io size 0x2000]
[ 0.436958] pci 0000:05:01.0: BAR 13: failed to assign [io size 0x2000]
[ 0.436959] pci 0000:05:02.0: BAR 13: assigned [io 0xc000-0xcfff]
[ 0.436960] pci 0000:05:04.0: BAR 13: no space for [io size 0x1000]
[ 0.436961] pci 0000:05:04.0: BAR 13: failed to assign [io size 0x1000]
[ 0.436963] pci 0000:05:01.0: BAR 13: assigned [io 0xc000-0xcfff]
[ 0.436964] pci 0000:05:04.0: BAR 13: no space for [io size 0x1000]
[ 0.436965] pci 0000:05:04.0: BAR 13: failed to assign [io size 0x1000]
[ 0.436967] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 0.436968] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 0.436969] pci 0000:05:02.0: BAR 13: no space for [io size 0x1000]
[ 0.436970] pci 0000:05:02.0: BAR 13: failed to assign [io size 0x1000]
[ 0.436971] pci 0000:05:01.0: BAR 13: [io 0xc000-0xcfff] (failed to expand by 0x1000)
[ 0.436972] pci 0000:05:01.0: failed to add 1000 res[13]=[io 0xc000-0xcfff]

I don't think that should be an issue for the devices behind it, but perhaps it is?

Any suggestion for how I can increase the MMIO space for the BARs on the Thunderbolt bridges? Should I try to disable additional devices in the BIOS, etc.? I'm a little out of my element here.

Thanks!
Rob

(In reply to Robert Strube from comment #9)
> Any suggestion for how I can increase the MMIO space for the BARs on the
> Thunderbolt bridges? Should I try to disable additional devices in the BIOS,
> etc.? I'm a little out of my element here.

Worth a shot if you can.

Alex

I disabled a bunch of devices in the BIOS (sound, SD card reader, etc.) and confirmed that they are no longer showing up in lspci, but I'm still getting the same error. I also found one suggestion to pass in a kernel parameter of hpbussize=4 to increase the bus size made available for hot-pluggable devices; this also didn't help. Thanks for all your assistance BTW! Any other suggestions? I'm starting to run out of options here.

Thanks!
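Regarding the bridge window failures quoted above, a quick sketch of how to check what actually ended up assigned to a bridge despite those messages; 04:00.0 is the Thunderbolt upstream bridge from the log:

    # Show the I/O and memory windows that were actually programmed into the bridge
    sudo lspci -vv -s 04:00.0 | grep -i "behind bridge"

    # And the raw resource ranges the kernel assigned to it
    cat /sys/bus/pci/devices/0000:04:00.0/resource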
Created attachment 142187 [details]
dmesg log booting system *without* eGPU
So I decided to do a sanity check and completely remove the eGPU from the equation. I am still getting the BAR errors in dmesg, so perhaps this isn't the problem after all?!
The one thing I noticed is that the BAR errors are for the pci module and not the pcieport module. One other thing worth mentioning is that I tried with a different GPU (for the eGPU) yesterday. I had a GTX 1060 available, and this *did* work correctly. I haven't double checked if the BAR errors are present with the other GPU.
Perhaps it is a bug with amdgpu after all?
This problem isn't related to the GPU in any way; that the amdgpu driver fails to load is just another symptom. The Thunderbolt bridge doesn't get enough resources assigned for its devices, even when the GPU isn't present at all, for some reason. That could be a problem with the BIOS, with the Linux Thunderbolt driver, or with the resource allocation in the PCI subsystem. Please provide the output of "sudo cat /proc/iomem" and "lspci -t -nn -v" together with an up-to-date dmesg.

Created attachment 142200 [details]
lspci with eGPU *not* connected.
lspci -t -nn -v output when the eGPU is *not* connected.
Created attachment 142201 [details]
sudo cat /proc/iomem when eGPU *not* connected
Created attachment 142202 [details]
fresh dmesg log booting system *without* eGPU
Created attachment 142203 [details]
lspci *with* eGPU attached at boot
Created attachment 142204 [details]
sudo cat /proc/iomem *with* eGPU connected at boot
Created attachment 142205 [details]
fresh dmesg log booting system *with* eGPU connected at boot
One more thing I thought of. Would it help if I posted my dmesg log with the GTX 1060 connected as an eGPU? As I mentioned previously, this card *is* working with nouveau. I haven't tested with the proprietary nvidia drivers. I'd imagine that the PCI resource issues you pointed out are still there, so I'm surprised that the nvidia card is able to work. Perhaps they have some hacks in their drivers to work around issues like this? I also have a friend that has an older RX 290; should I give that a shot as well? It might take me a while to get a hold of that card. I don't doubt that this is most likely a BIOS bug, but I've noticed people on the Windows side of the fence getting the XPS 9575 working with eGPUs, and presumably they have the same BIOS as me.

Hi guys,

Apologies for the deluge of posts here, I've been trying really hard to investigate this issue! So I took a closer look at the PCI resource issues that you mentioned; I've also been looking at Thunderbolt driver issues in general, and I've noticed that this type of log message is quite common. Here's what I'm wondering. These four devices correspond to the TB-to-PCI bridges in the system:

0000:04:00.0
0000:05:01.0
0000:05:02.0
0000:05:04.0

04:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Bus: primary=04, secondary=05, subordinate=6e, sec-latency=0
        Memory behind bridge: bc000000-ea0fffff
        Prefetchable memory behind bridge: 0000002fb0000000-0000002ff9ffffff
        Capabilities: [80] Power Management version 3
        Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
        Capabilities: [c0] Express Upstream Port, MSI 00
        Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
        Capabilities: [200] Advanced Error Reporting
        Capabilities: [300] Virtual Channel
        Capabilities: [400] Power Budgeting <?>
        Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
        Capabilities: [600] Latency Tolerance Reporting
        Capabilities: [700] #19
        Kernel driver in use: pcieport

05:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Bus: primary=05, secondary=06, subordinate=06, sec-latency=0
        Memory behind bridge: ea000000-ea0fffff
        Capabilities: [80] Power Management version 3
        Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
        Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
        Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
        Capabilities: [200] Advanced Error Reporting
        Capabilities: [300] Virtual Channel
        Capabilities: [400] Power Budgeting <?>
        Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
        Capabilities: [700] #19
        Kernel driver in use: pcieport

05:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Bus: primary=05, secondary=07, subordinate=39, sec-latency=0
        Memory behind bridge: bc000000-d3efffff
        Prefetchable memory behind bridge: 0000002fb0000000-0000002fcfffffff
        Capabilities: [80] Power Management version 3
        Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
        Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
        Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
        Capabilities: [200] Advanced Error Reporting
        Capabilities: [300] Virtual Channel
        Capabilities: [400] Power Budgeting <?>
        Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
        Capabilities: [700] #19
        Kernel driver in use: pcieport

05:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Bus: primary=05, secondary=3a, subordinate=3a, sec-latency=0
        Memory behind bridge: d3f00000-d3ffffff
        Capabilities: [80] Power Management version 3
        Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
        Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
        Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
        Capabilities: [200] Advanced Error Reporting
        Capabilities: [300] Virtual Channel
        Capabilities: [400] Power Budgeting <?>
        Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
        Capabilities: [700] #19
        Kernel driver in use: pcieport

05:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Bus: primary=05, secondary=3b, subordinate=6e, sec-latency=0
        Memory behind bridge: d4000000-e9ffffff
        Prefetchable memory behind bridge: 0000002fd0000000-0000002ff9ffffff
        Capabilities: [80] Power Management version 3
        Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
        Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
        Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
        Capabilities: [200] Advanced Error Reporting
        Capabilities: [300] Virtual Channel
        Capabilities: [400] Power Budgeting <?>
        Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
        Capabilities: [700] #19
        Kernel driver in use: pcieport

First you see pci defining the bridge windows for the devices:

[ 104.290143] pci 0000:05:01.0: bridge window [io 0x1000-0x0fff] to [bus 07-39] add_size 1000
[ 104.290152] pci 0000:05:02.0: bridge window [io 0x1000-0x0fff] to [bus 3a] add_size 1000
[ 104.290155] pci 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[ 104.290169] pci 0000:05:04.0: bridge window [io 0x1000-0x0fff] to [bus 3b-6e] add_size 1000
[ 104.290180] pci 0000:04:00.0: bridge window [io 0x1000-0x0fff] to [bus 05-6e] add_size 3000

Then you see a bunch of BAR errors, saying there's no space and that they can't be assigned:

[ 104.290184] pci 0000:04:00.0: BAR 13: no space for [io size 0x3000]
[ 104.290185] pci 0000:04:00.0: BAR 13: failed to assign [io size 0x3000]
[ 104.290187] pci 0000:04:00.0: BAR 13: no space for [io size 0x3000]
[ 104.290188] pci 0000:04:00.0: BAR 13: failed to assign [io size 0x3000]
[ 104.290193] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 104.290194] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 104.290196] pci 0000:05:01.0: BAR 13: no space for [io size 0x1000]
[ 104.290197] pci 0000:05:01.0: BAR 13: failed to assign [io size 0x1000]
[ 104.290198] pci 0000:05:02.0: BAR 13: no space for [io size 0x1000]
[ 104.290199] pci 0000:05:02.0: BAR 13: failed to assign [io size 0x1000]
[ 104.290201] pci 0000:05:04.0: BAR 13: no space for [io size 0x1000]
[ 104.290202] pci 0000:05:04.0: BAR 13: failed to assign [io size 0x1000]
[ 104.290203] pci 0000:05:04.0: BAR 13: no space for [io size 0x1000]
[ 104.290205] pci 0000:05:04.0: BAR 13: failed to assign [io size 0x1000]
[ 104.290207] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 104.290208] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 104.290209] pci 0000:05:02.0: BAR 13: no space for [io size 0x1000]
[ 104.290210] pci 0000:05:02.0: BAR 13: failed to assign [io size 0x1000]
[ 104.290212] pci 0000:05:01.0: BAR 13: no space for [io size 0x1000]
[ 104.290213] pci 0000:05:01.0: BAR 13: failed to assign [io size 0x1000]

But then you see that the PCI bridges seem to initialize for all the devices:

[ 104.290215] pci 0000:05:00.0: PCI bridge to [bus 06]
[ 104.290221] pci 0000:05:00.0: bridge window [mem 0xea000000-0xea0fffff]
[ 104.290231] pci 0000:05:01.0: PCI bridge to [bus 07-39]
[ 104.290237] pci 0000:05:01.0: bridge window [mem 0xbc000000-0xd3efffff]
[ 104.290241] pci 0000:05:01.0: bridge window [mem 0x2fb0000000-0x2fcfffffff 64bit pref]
[ 104.290248] pci 0000:05:02.0: PCI bridge to [bus 3a]
[ 104.290254] pci 0000:05:02.0: bridge window [mem 0xd3f00000-0xd3ffffff]
[ 104.290264] pci 0000:05:04.0: PCI bridge to [bus 3b-6e]
[ 104.290270] pci 0000:05:04.0: bridge window [mem 0xd4000000-0xe9ffffff]
[ 104.290274] pci 0000:05:04.0: bridge window [mem 0x2fd0000000-0x2ff9ffffff 64bit pref]
[ 104.290281] pci 0000:04:00.0: PCI bridge to [bus 05-6e]
[ 104.290286] pci 0000:04:00.0: bridge window [mem 0xbc000000-0xea0fffff]
[ 104.290291] pci 0000:04:00.0: bridge window [mem 0x2fb0000000-0x2ff9ffffff 64bit pref]

Perhaps the BAR errors are just a red herring, and at the end of the process all of the Thunderbolt PCI bridges *are* initialized correctly? As I said, I've probably spent way too much time looking at this; the main thing I keep coming back to is that my other GPU *does* work correctly as an eGPU. It's also a PCIe x16 card (I know it's operating at x4 due to TB3 bandwidth limitations), so theoretically, if there were any PCI resource problems with the Thunderbolt bridge, then this GPU should also fail, correct?

I noticed a couple of other things in my research. I found a bug that points to tlp (specifically power management) as causing the same problems with the atom bios being stuck in a loop: https://bugs.freedesktop.org/show_bug.cgi?id=103783 Perhaps the issue is caused by some sort of aggressive PM? I might try adding some kernel boot parameters: amdgpu.dpm=0, amdgpu.apm=0, etc. I was also thinking that perhaps I should try the AMDGPU-PRO drivers just to see if they would work by chance. Somebody else reported that those drivers worked while the amdgpu driver failed. It's worth a shot.

Thanks for any feedback and/or advice!
Rob

OK! My hunch about the PM was right! The card is fully initialized now, so the issue doesn't appear to be a PCI resource issue! I took the brute-force approach and compiled my own custom kernel that completely disables the Vega M (by commenting out its device IDs). I then passed in the following kernel boot parameters:

acpi=off apm=off amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0

Rebooted the machine and *BAM* the eGPU was initialized! I'm attaching the new dmesg!
I'm just super excited that I was able to get the eGPU initialized! xrandr even sees it!

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x74 cap: 0x9, Source Output, Sink Offload crtcs: 3 outputs: 7 associated providers: 1 name:modesetting
Provider 1: id: 0x4a cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 5 associated providers: 1 name:Radeon RX 580 Series @ pci:0000:09:00.0

Edit: taking a closer look at the dmesg, I see that disabling the PM did indeed eliminate the PCI resource issues. So for some reason having PM enabled affects the PCI resource allocation for the Thunderbolt PCI bridges!
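With the RX 580 showing up as a second provider, one quick way to confirm it actually handles rendering is Mesa's PRIME offload; just a sketch, not something reported in this thread:

    # Ask Mesa to run the client on the secondary GPU and print which device answered
    DRI_PRIME=1 glxinfo | grep "OpenGL renderer"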
Created attachment 142209 [details]
dmesg log booting system with PM *DISABLED* and *WITH* eGPU
acpi=off is the only parameter necessary to get the eGPU up and running. Setting this parameter allows the Thunderbolt PCI bridge to correctly have its resources allocated. This incidentally also completely disables the Vega M (even with a vanilla kernel that does not have the device IDs commented out). I'm wondering where I can report the Thunderbolt controller/bridge bug? Perhaps you fine folks can point me in the right direction?

I don't want to stop your cheering, but that isn't a perfect solution either.

(In reply to Robert Strube from comment #25)
> I'm wondering where I can report the Thunderbolt controller/bridge bug?
> Perhaps you fine folks can point me in the right direction?

That is unfortunately most likely a bug in the BIOS. What happens here is that when you specify acpi=off, the internal Vega M gets disabled and the address space that one used is freed up. This address space is then used for the Thunderbolt controller to handle the Polaris.

What you could try is to blacklist amdgpu from automatically loading and then issue the following commands as root manually:

    #Disable the internal Vega M
    echo 1 > ./bus/pci/devices/0000:01:00.0/remove
    #Manually load amdgpu to initialize the Polaris
    modprobe amdgpu
    #Rescan the PCI bus to find the Vega M again
    echo 1 > ./bus/pci/devices/0000:00:00.0/rescan

It's just a shot in the dark, but that might work as well. Apart from that there isn't much else you could do except to upgrade the BIOS or use different hardware.

Thanks for the response. I think it was just a coincidence that the eGPU started working with acpi=off. Taking a closer look at the issue, it really does appear to be a BIOS problem that prevents the proper PCI resource allocation to one of the TB PCI bridges. In fact, when I took a closer look at the dmesg with acpi=off, I still see the resource issues present. I've opened an official bug report with the kernel ACPI BIOS team here: https://bugzilla.kernel.org/show_bug.cgi?id=201527 I realize the issue should really be solved by the manufacturer, but perhaps the kernel devs can create a workaround and/or have more direct lines of communication with the Dell engineers. Thank you both for your suggestions and comments.

Rob

Quick update: I heard back from the ACPI BIOS kernel developers (see: https://bugzilla.kernel.org/show_bug.cgi?id=201527) and they seem to imply that the PCI resource issues showing up in the dmesg log are *not* a problem. Linux is simply trying to allocate more resources; the failure is OK and it does get the requisite resources. See this comment from Mika: https://bugzilla.kernel.org/show_bug.cgi?id=201527#c8 I'm not sure where this leaves us: is it a BIOS / PCI resource issue, or is it a bug within amdgpu? I've also been in contact with Dell regarding the possibility that there is a BIOS bug causing some of these issues. I'm going to need to conduct some testing on Windows with eGPUs to see if the problem also exists there.

Thanks!
Rob

(In reply to Robert Strube from comment #28)
> Quick update:
>
> I heard back from the ACPI BIOS kernel developers (see:
> https://bugzilla.kernel.org/show_bug.cgi?id=201527) and they seem to imply
> that the PCI resource issues showing up in the dmesg log are *not* a
> problem. Linux is simply trying to allocate more resources, and that the
> failure is OK and it does get the requisite resources required.

It would appear to not be OK, as when it gets the resources, the GPU works. Does disabling dpm make the GPU work?
E.g., append amdgpu.dpm=0 to the kernel command line in grub. The driver needs to query the supported PCIe speeds from the PCIe bridge it is connected to in order to set up the power management controller. Maybe when the resources are not available, the driver is not able to get that information, or it gets garbage.

Hi Alex,

Thanks for the reply. I wanted to clarify an important point: when I disabled PM completely and ACPI completely, I did not see any PCI resource issues AND the eGPU initialized successfully. However, after testing a little more I was able to keep PM enabled and only disable ACPI. In this situation I encountered a scenario where the PCI resource issues were present in the log, BUT the eGPU still initialized. I mentioned that briefly in one of my previous comments but didn't really elaborate. So under certain situations the eGPU did initialize despite seeing PCI BAR resource issues.

I've been working with another user that has the exact same system (XPS 9575) and an RX 580 and is having the same issues. He was actually able to get the eGPU initialized by passing in pci=noacpi rather than completely disabling ACPI as a whole. I'll double-check with him to see if he can post his dmesg log, because I'm not sure if the PCI resource issues are present under those circumstances. Reference: https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210/13

At this point I've had to return my RX 580 - great card, but after about a month of troubleshooting I was running out of time in the return window - so I'm unable to do any more testing with that specific card at this time. Kind of a bummer... I'll probably pick up a Vega early next year and try again.

Rob

(In reply to Robert Strube from comment #30)
> Hi Alex,
>
> Thanks for the reply. I wanted to clarify an important point: When I
> disabled PM completely and ACPI completely, I did not see any PCI resource
> issues AND the eGPU initialized successfully.
>
> acpi=off apm=off amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0

The only relevant items here are acpi=off and amdgpu.dpm=0. Did you test them independently or just together? Setting dpm=0 is irrelevant if acpi=off, since there will be no resource restrictions. You need to test them independently.

> I've been working with another user that has the exact same system (XPS
> 9575) and an RX 580 and is having the same issues. He was actually able to
> get the eGPU initialized by passing in pci=noacpi rather than completely
> disabling ACPI as a whole. I'll double-check with him to see if he can post
> his dmesg log, because I'm not sure if the PCI resource issues are present
> under those circumstances.

Looks like the same issue as yours. PCI resources not getting assigned.

Hi Alex,

I just tested acpi=off independently, not amdgpu.dpm=0 independently. In regard to the other user: yes, he had the exact same PCI resource issues as me. What I'm curious to find out, though, is whether those same PCI resource issues were present when he passed in pci=noacpi and was able to get the card initialized. My hunch is that they were still present AND the card was able to initialize, but I'd be anxious to see. I've also asked him to test amdgpu.dpm=0 independently and report back. Hopefully you're onto something here!

Rob

With respect to PCI(e) devices, acpi=off and pci=noacpi are equivalent, I think.
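Regarding the PCIe speed query mentioned a few comments up, what the driver would see can be read from the bridge's link capability and status registers with lspci; a sketch using the 05:01.0 downstream port from the earlier lspci dump:

    # LnkCap shows the speeds/widths the port supports, LnkSta what was actually negotiated
    sudo lspci -vv -s 05:01.0 | grep -E "LnkCap:|LnkSta:"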
Created attachment 143044 [details]
dmesg log amdgpu.dpm=0 with 580 as eGPU
Another user is reporting a similar problem with a different Dell laptop (the XPS 9370). He provided two dmesg log files. This one has amdgpu=0.
Created attachment 143045 [details]
dmesg log pci=noacpi with 580 as eGPU
Another user has reported a similar problem with a different laptop (XPS 9370). He provided two dmesg log files. This one has pci=noacpi set as a kernel boot parameter.
(In reply to Robert Strube from comment #34)
> Created attachment 143044 [details]
> dmesg log amdgpu.dpm=0 with 580 as eGPU
>
> Another user is reporting a similar problem with a different Dell laptop
> (the XPS 9370). He provided two dmesg log files. This one has amdgpu=0.

Meant to say amdgpu.dpm=0 as a boot parameter. One additional comment: the user with the XPS 9370 was able to get the RX 580 working as an eGPU flawlessly in Windows 10. This lends some credibility to the theory that it might not actually be a BIOS issue - unless the BIOS bug is worked around in the Windows 10 drivers. Please see here for additional information: https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210/30

I'm planning on purchasing a Vega 56 or 64 in the near future so I can continue to attempt to troubleshoot the issue.

Rob

(In reply to Robert Strube from comment #34)
> Created attachment 143044 [details]
> dmesg log amdgpu.dpm=0 with 580 as eGPU
>
> Another user is reporting a similar problem with a different Dell laptop
> (the XPS 9370). He provided two dmesg log files. This one has amdgpu=0.

It would appear that this user is not experiencing the same issue as you. In your case the driver fails to even post the GPU. That happens long before dpm is initialized. The other user can try adding amdgpu.ppfeaturemask=0xfffd3ffb to disable PCIe power management to see if his issue is related to comment 29.

I have the same problem. The computer is a Dell Precision 5530 2-in-1 with a Vega M inside; the eGPU is a Vega 56. The eGPU is not starting, even with acpi=off. Kernel 5.0.4.
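For the amdgpu.ppfeaturemask suggestion above, the currently active mask can be checked at runtime and the suggested value passed at boot; a sketch, assuming the module exposes the parameter read-only as on recent kernels:

    # Current powerplay feature mask
    cat /sys/module/amdgpu/parameters/ppfeaturemask

    # To test the suggested mask, boot with the kernel parameter:
    #   amdgpu.ppfeaturemask=0xfffd3ffb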
Created attachment 143804 [details]
acpi=off amd.dpm=0
Vega 56
Created attachment 143805 [details]
default boot vega 56
Actually, with acpi=off the Vega 56 is not found at all during boot, only the enclosure.

(In reply to Dimitar Atanasov from comment #41)
> Created attachment 143805 [details]
> default boot vega 56

Similar issues on your system:

[ 168.653171] pci 0000:3d:00.0: BAR 0: no space for [mem size 0x00004000]
[ 168.653176] pci 0000:3d:00.0: BAR 0: failed to assign [mem size 0x00004000]

Works with 4.19.32 with acpi=off, but the CPU is single core.

(In reply to Dimitar Atanasov from comment #44)
> Works with 4.19.32 with acpi=off, but the CPU is single core.

Most likely some other device failed to get enumerated in that case and that freed up resources for the bridges.

FWIW, I'm also hit by this on an HP Spectre X360 with a Vega M configuration (though it's notable as the other reports seem to only come from Dell machines?). Disabling the Vega M by doing `echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove` hasn't changed a thing (checked with lspci first, that is the right device), and booting with `acpi=off` causes my touchpad and the Thunderbolt controller to not function at all (apparently?), so there's that. I'm using an RX 470 in a Razer Core X, which works flawlessly on Windows. But I guess the Vega M does the trick on Linux for now; thanks for the great work on amdgpu and Mesa :)

Created attachment 143818 [details]
dmesg-HP-SpectreX360-8705G
Seems that I also hit the same BAR errors:
[13974.780260] pcieport 0000:03:04.0: BAR 13: failed to assign [io size 0x1000]
[13974.780262] pcieport 0000:03:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
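The "checked with lspci first" step mentioned above can be done by matching on the Vega M's vendor:device ID directly; a small sketch (1002:694e is the ID quoted earlier in this thread, and 1002:694c is the other Vega M ID listed in amdgpu_drv.c):

    # Print only devices with a matching vendor:device ID, together with their bus addresses
    lspci -nn -d 1002:694e
    lspci -nn -d 1002:694c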
Maybe the problem is the CPU, because it has only 16 PCIe lanes: 8 for the Vega M, 4 for NVMe, and 4 for the rest. I have seen that the card reader is also connected via PCIe.

(In reply to Dimitar Atanasov from comment #48)
> Maybe the problem is the CPU, because it has only 16 PCIe lanes: 8 for the
> Vega M, 4 for NVMe, and 4 for the rest. I have seen that the card reader is
> also connected via PCIe.

It's the MMIO space in the CPU's address space. The CPU (by way of the sbios) defines a window of address space that is used for device MMIO. By default most platforms put a relatively small MMIO window below 4GB for 32-bit OS compatibility. Having a small MMIO window limits the amount of space for devices, and if there is not enough space some device resources can't be mapped, which is what causes the problem. There is often a feature in the sbios config called ">4GB MMIO" or similar which enables a bigger MMIO window. Some sbioses also enable it dynamically depending on what OS is booted or conditions in the system at boot time (legacy vs. UEFI boot). IIRC, it's a requirement for Windows 10, so there is probably something about the Windows 10 OEM install which causes it to boot with a larger MMIO window set up.

There are two windows on this system: a small one below 4GB, which is 2.5GB, and a bigger one above 4GB, which is 64GB. The address space for Thunderbolt is only 200 MB. As far as I know, amdgpu needs 250 MB in the low 4GB and the rest in the big space. Interestingly enough, I have an XPS 9570, which has the 8750H, and there is a BIOS option for how to assign MMIO space; there is no such option here.

Hello everyone,

I realize it's been a long time since I updated this bug report, apologies in advance. I decided to give up on eGPUs + Linux (over Thunderbolt 3) for a while, and didn't get a chance to really tackle the problem again until more recently. Since my initial report, I have been able to get an eGPU working with my Dell XPS 9575, but only with an Nvidia GPU (specifically an RTX 2070). I did try another AMD card, but ran into the same problems. I'll attach my dmesg and lspci information in the hopes that this might shed some light on why the Nvidia GPU works correctly (albeit with the proprietary driver) and certain AMD GPUs don't.
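Relating to the MMIO window discussion above, the windows the firmware actually set up can be read from /proc/iomem without rebooting; a sketch (root is needed to see real addresses on recent kernels):

    # Each "PCI Bus 0000:xx" line is a window handed down to that bus; the Thunderbolt
    # bridges in this thread live on buses 04/05 (XPS) and 02/03 (Spectre)
    sudo grep -i "pci bus" /proc/iomem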
Created attachment 144090 [details]
dmesg log with kernel 5.0.x with nvidia eGPU
Created attachment 144091 [details]
lspci kernel 5.0.x with nvidia eGPU
Hello, I think I have the same problem as all the others here. My setup - working fine under Windows - is:

* HP Spectre x360 - chxx model with the i7-8705G CPU (so also with a Vega M dGPU)
* Razer Core X with an AMD Vega 56

A normal boot with kernel 5.3.7-2 gives me this in dmesg:

[ 11.672442] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 11.672469] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing CF4E (len 1030, WS 8, PS 0) @ 0xD2C5
[ 11.672492] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC41C
[ 11.672495] amdgpu 0000:09:00.0: gpu post error!
[ 11.672496] amdgpu 0000:09:00.0: Fatal error during GPU init
[ 11.672497] [drm] amdgpu: finishing device.

But I also found some errors like this - do they have something to do with our GPU?

[ 2.732913] ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [D128] at bit offset/length 128/1024 exceeds size of target Buffer (160 bits) (20190703/dsopcode-198)
[ 2.732919] ACPI Error: Aborting method \HWMC due to previous error (AE_AML_BUFFER_LIMIT) (20190703/psparse-529)
[ 2.732927] ACPI Error: Aborting method \_SB.WMID.WMAA due to previous error (AE_AML_BUFFER_LIMIT) (20190703/psparse-529)

I also have some of those BAR errors, but I never have them with the PCI address of the eGPU, only with other addresses:

[ 0.779001] pci 0000:02:00.0: BAR 13: assigned [io 0x2000-0x4fff]
[ 0.779004] pci 0000:03:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 0.779005] pci 0000:03:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 0.779006] pci 0000:03:01.0: BAR 13: assigned [io 0x2000-0x2fff]
[ 0.779006] pci 0000:03:02.0: BAR 13: assigned [io 0x3000-0x3fff]
[ 0.779007] pci 0000:03:04.0: BAR 13: assigned [io 0x4000-0x4fff]
[ 0.779009] pci 0000:03:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 0.779010] pci 0000:03:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 0.779010] pci 0000:03:00.0: PCI bridge to [bus 04]
[ 0.779016] pci 0000:03:00.0: bridge window [mem 0xde000000-0xde0fffff]
[ 0.779025] pci 0000:03:01.0: PCI bridge to [bus 05-37]
[ 0.779027] pci 0000:03:01.0: bridge window [io 0x2000-0x2fff]
[ 0.779032] pci 0000:03:01.0: bridge window [mem 0xb0000000-0xc7efffff]
[ 0.779036] pci 0000:03:01.0: bridge window [mem 0x2f90000000-0x2fafffffff 64bit pref]

Device address 03:02.0 corresponds to my Thunderbolt controller:

03:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)

@Robert Strube: You wanted to buy a Vega and do some research with it. As you can see, there is still something wrong here. Perhaps you are interested and want to help me find more information.

Created attachment 145876 [details]
Spectrex360 / Kernel 5.3.7 / Vega56 / normal boot
More tests:

* pci=noacpi: the PC boots, but it either freezes right after booting to GNOME or it blocks the touchpad/USB mouse/keyboard/touchscreen from working.
* amdgpu.dpm=0 or acpi=off: the screen turns black and nothing happens anymore.

Another thing I noticed: sometimes when booting, GRUB shows on the external monitor hooked up to the eGPU and not on the laptop's internal one. But as soon as Linux starts, it switches back to the internal monitor. If I select Windows, the whole boot continues on the eGPU monitor.

I also made a thread in the Manjaro forum two days ago. There you can also find some more logs from my machine/system/configuration: https://forum.manjaro.org/t/help-setting-up-egpu-stuck-on-boot-when-connected/109583/10

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/566.