Bug 108521 - RX 580 as eGPU amdgpu: gpu post error!
Summary: RX 580 as eGPU amdgpu: gpu post error!
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-23 05:14 UTC by Robert Strube
Modified: 2018-11-30 01:51 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg log booting system with eGPU attached and powered (97.37 KB, text/plain)
2018-10-23 16:00 UTC, Robert Strube
no flags Details
dmesg log booting system with eGPU (Vega M device IDs removed in kernel) (94.21 KB, text/plain)
2018-10-25 01:13 UTC, Robert Strube
no flags Details
dmesg log booting system *without* eGPU (97.96 KB, text/plain)
2018-10-25 03:21 UTC, Robert Strube
no flags Details
lspci with eGPU *not* connected. (1.72 KB, text/plain)
2018-10-25 14:56 UTC, Robert Strube
no flags Details
sudo cat /proc/iomem when eGPU *not* connected (3.97 KB, text/plain)
2018-10-25 15:01 UTC, Robert Strube
no flags Details
fresh dmesg log booting system *without* eGPU (98.00 KB, text/plain)
2018-10-25 15:07 UTC, Robert Strube
no flags Details
lspci *with* eGPU attached at boot (2.32 KB, text/plain)
2018-10-25 15:11 UTC, Robert Strube
no flags Details
sudo cat /proc/iomem *with* eGPU connected at boot (4.87 KB, text/plain)
2018-10-25 15:12 UTC, Robert Strube
no flags Details
fresh dmesg log booting system *with* eGPU connected at boot (98.23 KB, text/plain)
2018-10-25 15:12 UTC, Robert Strube
no flags Details
dmesg log booting system with PM *DISABLED* and *WITH* eGPU (76.18 KB, text/plain)
2018-10-26 05:15 UTC, Robert Strube
no flags Details

Description Robert Strube 2018-10-23 05:14:00 UTC
Hello everyone,

I've been attempting to get my RX 580 working correctly as an eGPU using the Akitio Node eGPU enclosure (over Thunderbolt 3).

I've confirmed that both the Akitio Node and my laptop's Thunderbolt 3 controller are running the most up-to-date firmware. I've also been able to successfully authorize the Thunderbolt eGPU enclosure and see the RX 580 in lspci; see below:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:13.0 Non-VGA unclassified device: Intel Corporation 100 Series/C230 Series Chipset Family Integrated Sensor Hub (rev 31)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation QM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 [Radeon RX Vega M GL] (rev c0)
02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
04:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
06:00.0 System peripheral: Intel Corporation JHL6540 Thunderbolt 3 NHI (C step) [Alpine Ridge 4C 2016] (rev 02)
07:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
08:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
09:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]

Looking at just the RX 580 in more detail using lspci -v we have:

09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) (prog-if 00 [VGA controller])
	Subsystem: XFX Pine Group Inc. Ellesmere [Radeon RX 470/480/570/570X/580/580X]
	Flags: fast devsel, IRQ 18
	Memory at 2fb0000000 (64-bit, prefetchable) [size=256M]
	Memory at 2fc0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 2000 [size=256]
	Memory at bc000000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at bc040000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] #15
	Capabilities: [270] #19
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [370] L1 PM Substates
	Kernel modules: amdgpu

When looking at dmesg I see the following (I've removed non-relevant lines):

[    8.534250] amdgpu 0000:09:00.0: enabling device (0006 -> 0007)
[    8.534756] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1682:0xC580 0xE7).
[    8.537567] [drm] register mmio base: 0xBC000000
[    8.537568] [drm] register mmio size: 262144
[    8.537598] [drm] add ip block number 0 <vi_common>
[    8.537599] [drm] add ip block number 1 <gmc_v8_0>
[    8.537599] [drm] add ip block number 2 <tonga_ih>
[    8.537599] [drm] add ip block number 3 <powerplay>
[    8.537600] [drm] add ip block number 4 <dm>
[    8.537600] [drm] add ip block number 5 <gfx_v8_0>
[    8.537601] [drm] add ip block number 6 <sdma_v3_0>
[    8.537602] [drm] add ip block number 7 <uvd_v6_0>
[    8.537602] [drm] add ip block number 8 <vce_v3_0>
[    8.537608] kfd kfd: skipped device 1002:67df, PCI rejects atomics
[    8.537630] [drm] UVD is enabled in VM mode
[    8.537630] [drm] UVD ENC is enabled in VM mode
[    8.537636] [drm] VCE enabled in VM mode
[    8.614467] ATOM BIOS: 401815-171128-QS1
[    8.614512] [drm] GPU posting now...
[   13.621276] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[   13.621310] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing E650 (len 187, WS 0, PS 4) @ 0xE6FA
[   13.621341] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C53A (len 193, WS 4, PS 4) @ 0xC569
[   13.621359] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC47C
[   13.621361] amdgpu 0000:09:00.0: gpu post error!
[   13.621363] amdgpu 0000:09:00.0: Fatal error during GPU init
[   13.621370] [drm] amdgpu: finishing device.
[   13.621792] amdgpu: probe of 0000:09:00.0 failed with error -22

Here are my system details:

System: Dell XPS 15 2 in 1 (Kaby Lake G)
Kernel: 4.19
Mesa: 18.2.2
Xorg: 1.20.1
Built-in GPUs: Intel iGPU, Vega M
eGPU: RX 580

I'm not sure if I'm having problems because my laptop *also* contains a Vega M, which also uses the amdgpu driver.  Perhaps there's a problem if there are multiple GPUs using amdgpu?  One thing to point out is that the Vega M has worked flawlessly since Kernel 4.18.x.

I ran across several other users posting about this same problem when attempting to run AMD GPUs as eGPUs. Here's a post where a user reports the same issue:

https://egpu.io/forums/thunderbolt-linux-setup/egpus-under-linux-an-advanced-guide/#post-33304

And here's another post:

https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210

I'm comfortable applying and testing kernel patches, so please feel free to ask me to test any fixes.  I'm currently running 4.19, but could also patch a 4.18.x kernel.

Thanks!
Comment 1 Alex Deucher 2018-10-23 14:25:42 UTC
Please attach your dmesg output.
Comment 2 Robert Strube 2018-10-23 16:00:21 UTC
Created attachment 142151 [details]
dmesg log booting system with eGPU attached and powered

Starting at line:

[   11.192733] ATOM BIOS: 401815-171128-QS1

You can see the failure that occurs when trying to initialize the RX 580 as an eGPU over Thunderbolt 3.
Comment 3 Robert Strube 2018-10-23 20:08:22 UTC
Quick question:

Is it possible to completely disable the Vega M using kernel boot parameters? I did try using pci-stub.ids=xxxx:xxxx with the PCI hex ID for my Vega M (1002:694e), but amdgpu was still bound to the device; I'm not sure why. I also thought there was explicit PCI device blacklisting support in the kernel, but I have been unable to find any documentation on it.

Ideally I'd like to see if having the Vega M disabled allows the eGPU to be initialized correctly. I took a look at the documentation for amdgpu, but I didn't see any boot parameters that stood out to me.

Blacklisting the amdgpu module wouldn't work either, as I need that to correctly support the RX 580 once it's attached.

Thanks!
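(For the record, one approach not tried in this thread: the sysfs driver_override attribute can keep amdgpu off a single PCI device without blacklisting the whole module. A minimal sketch; the 0000:01:00.0 address is this system's Vega M, the dry-run echoes are illustrative, and writing a non-existent driver name such as "none" is a common trick to leave the device unbound:)

```shell
#!/bin/sh
# Sketch: keep amdgpu away from one PCI device via driver_override.
# The actual writes need root, so this only prints the commands it
# would run.
DEV=0000:01:00.0   # Vega M on this system; adjust for your machine

# driver_override restricts which driver may bind; "none" matches no
# real driver, so the device stays unbound while amdgpu still loads
# for the eGPU.
OVERRIDE_CMD="echo none > /sys/bus/pci/devices/$DEV/driver_override"

# If amdgpu already bound the device, unbind it first.
UNBIND_CMD="echo $DEV > /sys/bus/pci/drivers/amdgpu/unbind"

echo "$OVERRIDE_CMD"
echo "$UNBIND_CMD"
```

(Clearing the override later is a matter of writing an empty string to the same driver_override file and reprobing.)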
Comment 4 Robert Strube 2018-10-24 06:24:44 UTC
I decided to apply a hack to 4.19 to see if I could get the eGPU to initialize.  I noticed that this code in drivers/gpu/drm/amd/amdgpu/atom.c

if ((jiffies_to_msecs(cjiffies) > 5000)) {
	DRM_ERROR("atombios stuck in loop for more than 5secs aborting\n");
	ctx->abort = true;
}

is where the error is being thrown, so I thought I would try giving the eGPU more time.  I increased the 5000 value to 15000, recompiled the kernel, and tried to attach the eGPU.  Unfortunately I received the same error, but this time after 15 seconds of trying to initialize the GPU.  Should I increase the time even more?

I'm not sure if the issue is actually related to not having enough time, or if it's something else entirely.

I'll bump it up to 30 seconds in a final last ditch attempt.
Comment 5 Alex Deucher 2018-10-24 20:44:00 UTC
If you can get any of the other methods to work, you can remove the Vega M device ID from the driver.  That said, I doubt it will make a difference.  Usually the problem with Thunderbolt is that PCI BAR resources don't get assigned properly to the devices, and the ones the driver needs are not available.  That doesn't seem to be the case here, but I might be missing something.
Comment 6 Robert Strube 2018-10-25 01:13:02 UTC
Created attachment 142182 [details]
dmesg log booting system with eGPU (Vega M device IDs removed in kernel)
Comment 7 Robert Strube 2018-10-25 01:15:41 UTC
Thanks for the suggestions! I took your advice and commented out the Vega M device IDs located here: drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

These are the lines of code that I commented out.
/* VEGAM */
{0x1002, 0x694C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VEGAM},
{0x1002, 0x694E, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VEGAM},

This did indeed cause my Vega M to not be initialized, *but* the problem I'm having with the eGPU remains. So it appears you were correct, and my hunch that the Vega M was interfering with the eGPU initialization was wrong.  I'm back to square one...

I uploaded a new dmesg log for this kernel, perhaps with the Vega M out of the equation you might see something new?

Thanks!
Rob
Comment 8 Alex Deucher 2018-10-25 01:34:33 UTC
There does not seem to be enough MMIO space for the BARs on the Thunderbolt bridges:
[    0.436946] pci 0000:04:00.0: BAR 13: no space for [io  size 0x4000]
[    0.436947] pci 0000:04:00.0: BAR 13: failed to assign [io  size 0x4000]
[    0.436949] pci 0000:04:00.0: BAR 13: assigned [io  0xc000-0xcfff]
[    0.436950] pci 0000:04:00.0: BAR 13: [io  0xc000-0xcfff] (failed to expand by 0x3000)
[    0.436951] pci 0000:04:00.0: failed to add 3000 res[13]=[io  0xc000-0xcfff]
[    0.436955] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[    0.436956] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[    0.436957] pci 0000:05:01.0: BAR 13: no space for [io  size 0x2000]
[    0.436958] pci 0000:05:01.0: BAR 13: failed to assign [io  size 0x2000]
[    0.436959] pci 0000:05:02.0: BAR 13: assigned [io  0xc000-0xcfff]
[    0.436960] pci 0000:05:04.0: BAR 13: no space for [io  size 0x1000]
[    0.436961] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
[    0.436963] pci 0000:05:01.0: BAR 13: assigned [io  0xc000-0xcfff]
[    0.436964] pci 0000:05:04.0: BAR 13: no space for [io  size 0x1000]
[    0.436965] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
[    0.436967] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[    0.436968] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[    0.436969] pci 0000:05:02.0: BAR 13: no space for [io  size 0x1000]
[    0.436970] pci 0000:05:02.0: BAR 13: failed to assign [io  size 0x1000]
[    0.436971] pci 0000:05:01.0: BAR 13: [io  0xc000-0xcfff] (failed to expand by 0x1000)
[    0.436972] pci 0000:05:01.0: failed to add 1000 res[13]=[io  0xc000-0xcfff]
I don't think that should be an issue for the devices behind it, but perhaps it is?
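As an aside, a quick way to see which devices are failing BAR assignment is to filter dmesg for the "failed to assign" lines. A sketch, using a few lines from the log above as sample input (on a live system, feed it real `dmesg` output instead of the heredoc):

```shell
#!/bin/sh
# Sketch: list the unique PCI addresses with failed BAR assignments.
# Field 4 of each line is the device address (fields 1-2 are the
# "[ time]" stamp); the trailing colon is stripped before printing.
summary=$(awk '/failed to assign/ { sub(/:$/, "", $4); print $4 }' <<'EOF' | sort -u
[    0.436947] pci 0000:04:00.0: BAR 13: failed to assign [io  size 0x4000]
[    0.436956] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[    0.436958] pci 0000:05:01.0: BAR 13: failed to assign [io  size 0x2000]
[    0.436961] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
EOF
)
echo "$summary"
```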
Comment 9 Robert Strube 2018-10-25 02:12:46 UTC
Any suggestion for how I can increase the MMIO space for the BARs on the Thunderbolt bridges? Should I try to disable additional devices in the BIOS, etc.?  I'm a little out of my element here.

Thanks!
Rob
Comment 10 Alex Deucher 2018-10-25 02:16:16 UTC
(In reply to Robert Strube from comment #9)
> Any suggestion for how I can increase the MMIO space for the BARs on the
> Thunderbolt bridges? Should I try to disable additional devices in the BIOS,
> etc.?  I'm a little out of my element here.

Worth a shot if you can.

Alex
Comment 11 Robert Strube 2018-10-25 03:06:30 UTC
I disabled a bunch of devices in the BIOS (sound, SD card reader, etc.) and I confirmed that they are no longer showing up in lspci, but I'm still getting the same error.

I also found one suggestion to pass the kernel parameter pci=hpbussize=4 to increase the number of bus numbers reserved for hot-pluggable devices; this didn't help either.

Thanks for all your assistance BTW!

Any other suggestions? I'm starting to run out of options here.

Thanks!
Comment 12 Robert Strube 2018-10-25 03:21:15 UTC
Created attachment 142187 [details]
dmesg log booting system *without* eGPU

So I decided to do a sanity check and completely remove the eGPU from the equation.  I am still getting the BAR errors in dmesg, so perhaps this isn't the problem after all?!

The one thing I noticed is that the BAR errors come from the pci module and not the pcieport module.  One other thing worth mentioning: I tried a different GPU (as the eGPU) yesterday.  I had a GTX 1060 available, and it *did* work correctly.  I haven't double-checked whether the BAR errors are present with that GPU.

Perhaps it is a bug with amdgpu after all?
Comment 13 Christian König 2018-10-25 07:05:51 UTC
This problem isn't related to the GPU in any way; that the amdgpu driver fails to load is just another symptom.

For some reason the Thunderbolt bridge doesn't get enough resources assigned for its devices, even when the GPU isn't present at all.

That could be a problem with the BIOS, with the Linux Thunderbolt driver, or with the resource allocation in the PCI subsystem.

Please provide the output of "sudo cat /proc/iomem" and "lspci -t -nn -v" together with an up to date dmesg.
Comment 14 Robert Strube 2018-10-25 14:56:12 UTC
Created attachment 142200 [details]
lspci with eGPU *not* connected.

lspci -t -nn -v output when the eGPU is *not* connected.
Comment 15 Robert Strube 2018-10-25 15:01:38 UTC
Created attachment 142201 [details]
sudo cat /proc/iomem when eGPU *not* connected

sudo cat /proc/iomem when eGPU *not* connected
Comment 16 Robert Strube 2018-10-25 15:07:11 UTC
Created attachment 142202 [details]
fresh dmesg log booting system *without* eGPU
Comment 17 Robert Strube 2018-10-25 15:11:27 UTC
Created attachment 142203 [details]
lspci *with* eGPU attached at boot
Comment 18 Robert Strube 2018-10-25 15:12:10 UTC
Created attachment 142204 [details]
sudo cat /proc/iomem *with* eGPU connected at boot
Comment 19 Robert Strube 2018-10-25 15:12:54 UTC
Created attachment 142205 [details]
fresh dmesg log booting system *with* eGPU connected at boot
Comment 20 Robert Strube 2018-10-25 20:05:23 UTC
One more thing I thought of.  Would it help if I posted my dmesg log with the GTX 1060 connected as an eGPU?

As I mentioned previously this card *is* working with nouveau.  I haven't tested with the proprietary nvidia drivers.

I'd imagine that the PCI resource issues you pointed out are still there, so I'm surprised that the nvidia card is able to work.  Perhaps they have some hacks in their drivers to work around issues like this?

I also have a friend who has an older R9 290; should I give that a shot as well? It might take me a while to get hold of that card.

I don't doubt that this is most likely a BIOS bug, but I've noticed people on the Windows side of the fence getting the XPS 9575 working with eGPUs, and presumably they have the same BIOS as I do.
Comment 21 Robert Strube 2018-10-26 04:42:39 UTC
Hi guys,

Apologies for the deluge of posts here; I've been trying really hard to investigate this issue!

So I took a closer look at the PCI resource issues you mentioned.  I've also been looking at Thunderbolt driver issues in general, and I've noticed that this type of log message is quite common.  Here's what I'm wondering:

These four devices correspond to the Thunderbolt-to-PCI bridges in the system:

0000:04:00.0
0000:05:01.0
0000:05:02.0
0000:05:04.0

04:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Bus: primary=04, secondary=05, subordinate=6e, sec-latency=0
	Memory behind bridge: bc000000-ea0fffff
	Prefetchable memory behind bridge: 0000002fb0000000-0000002ff9ffffff
	Capabilities: [80] Power Management version 3
	Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
	Capabilities: [c0] Express Upstream Port, MSI 00
	Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] Virtual Channel
	Capabilities: [400] Power Budgeting <?>
	Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
	Capabilities: [600] Latency Tolerance Reporting
	Capabilities: [700] #19
	Kernel driver in use: pcieport

05:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Bus: primary=05, secondary=06, subordinate=06, sec-latency=0
	Memory behind bridge: ea000000-ea0fffff
	Capabilities: [80] Power Management version 3
	Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
	Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
	Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] Virtual Channel
	Capabilities: [400] Power Budgeting <?>
	Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
	Capabilities: [700] #19
	Kernel driver in use: pcieport

05:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Bus: primary=05, secondary=07, subordinate=39, sec-latency=0
	Memory behind bridge: bc000000-d3efffff
	Prefetchable memory behind bridge: 0000002fb0000000-0000002fcfffffff
	Capabilities: [80] Power Management version 3
	Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
	Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
	Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] Virtual Channel
	Capabilities: [400] Power Budgeting <?>
	Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
	Capabilities: [700] #19
	Kernel driver in use: pcieport

05:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 18
	Bus: primary=05, secondary=3a, subordinate=3a, sec-latency=0
	Memory behind bridge: d3f00000-d3ffffff
	Capabilities: [80] Power Management version 3
	Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
	Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
	Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] Virtual Channel
	Capabilities: [400] Power Budgeting <?>
	Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
	Capabilities: [700] #19
	Kernel driver in use: pcieport

05:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Bus: primary=05, secondary=3b, subordinate=6e, sec-latency=0
	Memory behind bridge: d4000000-e9ffffff
	Prefetchable memory behind bridge: 0000002fd0000000-0000002ff9ffffff
	Capabilities: [80] Power Management version 3
	Capabilities: [88] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [ac] Subsystem: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016]
	Capabilities: [c0] Express Downstream Port (Slot+), MSI 00
	Capabilities: [100] Device Serial Number b7-de-04-b0-a6-c9-a0-00
	Capabilities: [200] Advanced Error Reporting
	Capabilities: [300] Virtual Channel
	Capabilities: [400] Power Budgeting <?>
	Capabilities: [500] Vendor Specific Information: ID=1234 Rev=1 Len=0d8 <?>
	Capabilities: [700] #19
	Kernel driver in use: pcieport

First you see the PCI core defining the bridge windows for the devices:

[  104.290143] pci 0000:05:01.0: bridge window [io  0x1000-0x0fff] to [bus 07-39] add_size 1000
[  104.290152] pci 0000:05:02.0: bridge window [io  0x1000-0x0fff] to [bus 3a] add_size 1000
[  104.290155] pci 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[  104.290169] pci 0000:05:04.0: bridge window [io  0x1000-0x0fff] to [bus 3b-6e] add_size 1000
[  104.290180] pci 0000:04:00.0: bridge window [io  0x1000-0x0fff] to [bus 05-6e] add_size 3000

Then you see a bunch of BAR errors, saying there's no space and that they can't be assigned:

[  104.290184] pci 0000:04:00.0: BAR 13: no space for [io  size 0x3000]
[  104.290185] pci 0000:04:00.0: BAR 13: failed to assign [io  size 0x3000]
[  104.290187] pci 0000:04:00.0: BAR 13: no space for [io  size 0x3000]
[  104.290188] pci 0000:04:00.0: BAR 13: failed to assign [io  size 0x3000]
[  104.290193] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  104.290194] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  104.290196] pci 0000:05:01.0: BAR 13: no space for [io  size 0x1000]
[  104.290197] pci 0000:05:01.0: BAR 13: failed to assign [io  size 0x1000]
[  104.290198] pci 0000:05:02.0: BAR 13: no space for [io  size 0x1000]
[  104.290199] pci 0000:05:02.0: BAR 13: failed to assign [io  size 0x1000]
[  104.290201] pci 0000:05:04.0: BAR 13: no space for [io  size 0x1000]
[  104.290202] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
[  104.290203] pci 0000:05:04.0: BAR 13: no space for [io  size 0x1000]
[  104.290205] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
[  104.290207] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  104.290208] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  104.290209] pci 0000:05:02.0: BAR 13: no space for [io  size 0x1000]
[  104.290210] pci 0000:05:02.0: BAR 13: failed to assign [io  size 0x1000]
[  104.290212] pci 0000:05:01.0: BAR 13: no space for [io  size 0x1000]
[  104.290213] pci 0000:05:01.0: BAR 13: failed to assign [io  size 0x1000]
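To put rough numbers on the I/O part of those errors (a back-of-envelope sketch; the sizes come from the log lines above, while the 64 KiB figure is the classic x86 I/O port space limit, not something from this log):

```shell
#!/bin/sh
# Each downstream port (05:01.0, 05:02.0, 05:04.0) wants an extra
# 0x1000 I/O window, and the upstream port 04:00.0 accordingly asks
# for 0x3000 more -- exactly the sum of its children's requests.
children_io=$((0x1000 * 3))
upstream_io=$((0x3000))
echo "children request: $children_io bytes"   # 12288
echo "upstream request: $upstream_io bytes"   # 12288

# x86 legacy I/O port space is only 64 KiB total, shared by every
# device in the system, which is why deep Thunderbolt topologies
# often cannot get I/O windows at all; memory BARs live in a far
# larger address space.
total_io=$((0x10000))
echo "total I/O port space: $total_io bytes"  # 65536
```

Whether the missing I/O windows actually matter for the GPU is a separate question; the later comments in this thread debate exactly that.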

But then you see that the PCI bridges seem to initialize for all the devices:

[  104.290215] pci 0000:05:00.0: PCI bridge to [bus 06]
[  104.290221] pci 0000:05:00.0:   bridge window [mem 0xea000000-0xea0fffff]
[  104.290231] pci 0000:05:01.0: PCI bridge to [bus 07-39]
[  104.290237] pci 0000:05:01.0:   bridge window [mem 0xbc000000-0xd3efffff]
[  104.290241] pci 0000:05:01.0:   bridge window [mem 0x2fb0000000-0x2fcfffffff 64bit pref]
[  104.290248] pci 0000:05:02.0: PCI bridge to [bus 3a]
[  104.290254] pci 0000:05:02.0:   bridge window [mem 0xd3f00000-0xd3ffffff]
[  104.290264] pci 0000:05:04.0: PCI bridge to [bus 3b-6e]
[  104.290270] pci 0000:05:04.0:   bridge window [mem 0xd4000000-0xe9ffffff]
[  104.290274] pci 0000:05:04.0:   bridge window [mem 0x2fd0000000-0x2ff9ffffff 64bit pref]
[  104.290281] pci 0000:04:00.0: PCI bridge to [bus 05-6e]
[  104.290286] pci 0000:04:00.0:   bridge window [mem 0xbc000000-0xea0fffff]
[  104.290291] pci 0000:04:00.0:   bridge window [mem 0x2fb0000000-0x2ff9ffffff 64bit pref]

Perhaps the BAR errors are just a red herring, and at the end of the process all of the Thunderbolt PCI bridges *are* initialized correctly?

As I said, I've probably spent way too much time looking at this.  The main thing I keep coming back to is that my other GPU *does* work correctly as an eGPU.  It's also a PCIe x16 card (I know it's operating at x4 due to TB3 bandwidth limitations), so theoretically, if there were any PCI resource problems with the Thunderbolt bridge, that GPU should also fail, correct?

I noticed a couple other things in my research:

I found a bug that points to tlp (specifically its power management) as causing the same atombios-stuck-in-a-loop problem: https://bugs.freedesktop.org/show_bug.cgi?id=103783
Perhaps the issue is caused by some sort of aggressive PM?  I might try adding some kernel boot parameters: amdgpu.dpm=0, amdgpu.apm=0, etc.

I was also thinking that perhaps I should try the AMDGPU-PRO drivers just to see if they would work by chance.  Somebody else reported that these drivers worked, while the amdgpu drivers failed.  It's worth a shot.

Thanks for any feedback and/or advice!
Rob
Comment 22 Robert Strube 2018-10-26 05:11:46 UTC
OK! My hunch about the PM was right! The card is fully initialized now, so the issue doesn't appear to be a PCI resource issue!

I took the brute-force approach and compiled my own custom kernel that completely disables the Vega M (by commenting out its device IDs). I then passed in the following kernel boot parameters:

acpi=off apm=off amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0

Rebooted the machine and *BAM* the eGPU was initialized!

I'm attaching the new dmesg!

I'm just super excited that I was able to get the eGPU initialized!

xrandr even sees it!

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x74 cap: 0x9, Source Output, Sink Offload crtcs: 3 outputs: 7 associated providers: 1 name:modesetting
Provider 1: id: 0x4a cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 5 associated providers: 1 name:Radeon RX 580 Series @ pci:0000:09:00.0
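With both providers listed, individual applications can be pushed to the eGPU via PRIME render offload. An illustrative pair of commands (the provider name is taken from the xrandr output above; DRI_PRIME selection assumes a reasonably recent Mesa):

```shell
# Make the RX 580 a render offload source for the modesetting
# provider (which drives the laptop's displays):
xrandr --setprovideroffloadsink "Radeon RX 580 Series @ pci:0000:09:00.0" modesetting

# Run a client on the eGPU and confirm which GPU rendered it:
DRI_PRIME=1 glxinfo | grep "OpenGL renderer"
```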
Comment 23 Robert Strube 2018-10-26 05:14:00 UTC
Edit: taking a closer look at the dmesg I see that disabling the PM did indeed eliminate the PCI resource issues.  So for some reason having PM enabled affects the PCI resource allocation for the Thunderbolt PCI bridges!
Comment 24 Robert Strube 2018-10-26 05:15:13 UTC
Created attachment 142209 [details]
dmesg log booting system with PM *DISABLED*  and *WITH* eGPU
Comment 25 Robert Strube 2018-10-26 05:30:51 UTC
acpi=off is the only parameter necessary to get the eGPU up and running. Setting this parameter allows the Thunderbolt PCI bridge to have its resources allocated correctly. Incidentally, this also completely disables the Vega M (even with a vanilla kernel that does not have the device IDs commented out).

I'm wondering where I can report the Thunderbolt Controller / Bridge bug?  Perhaps you fine folks can point me in the right direction?
Comment 26 Christian König 2018-10-26 10:35:17 UTC
I don't want to stop your cheering, but that isn't a perfect solution either.

(In reply to Robert Strube from comment #25)
> I'm wondering where I can report the Thunderbolt Controller / Bridge bug? 
> Perhaps you fine folks can point me in the right direction?

That is unfortunately most likely a bug in the BIOS.

What happens here is that when you specify acpi=off, the internal Vega M gets disabled and the address space it used is freed up.

This address space is then available for the Thunderbolt controller to handle the Polaris.

What you could try is to blacklist amdgpu from automatically loading and then issue the following commands as root manually:

#Disable the internal Vega M
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
#Manually load amdgpu to initialize the Polaris
modprobe amdgpu
#Rescan the PCI bus to find the Vega M again
echo 1 > /sys/bus/pci/devices/0000:00:00.0/rescan

It's just a shot in the dark, but that might work as well.

Apart from that there isn't much else you could do except to upgrade the BIOS or use different hardware.
Comment 27 Robert Strube 2018-10-26 16:49:19 UTC
Thanks for the response.

I think it was just a coincidence that the eGPU started working with acpi=off.  Taking a closer look at the issue, it really does appear to be a BIOS problem that prevents proper PCI resource allocation to one of the Thunderbolt PCI bridges.  In fact, when I took a closer look at the dmesg with acpi=off, I still saw the resource issues present.

I've opened an official bug report with the kernel ACPI/BIOS team here:
https://bugzilla.kernel.org/show_bug.cgi?id=201527

I realize the issue should really be solved by the manufacturer, but perhaps the kernel devs can create a workaround and/or have a more direct line of communication with the Dell engineers.  Thank you both for your suggestions and comments.

Rob
Comment 28 Robert Strube 2018-11-29 22:54:54 UTC
Quick update:

I heard back from the ACPI BIOS kernel developers (see: https://bugzilla.kernel.org/show_bug.cgi?id=201527) and they seem to imply that the PCI resource issues showing up in the dmesg log are *not* a problem.  Linux is simply trying to allocate more resources; the failure is OK and it does get the requisite resources.

See this comment https://bugzilla.kernel.org/show_bug.cgi?id=201527#c8 from Mika.

I'm not sure where this leaves us: is it a BIOS / PCI resource issue, or is it a bug within amdgpu?

I've also been in contact with Dell regarding the possibility that there is a BIOS bug causing some of these issues.  I'm going to need to conduct some testing on Windows with eGPUs to see if the problem also exists there.

Thanks!
Rob
Comment 29 Alex Deucher 2018-11-29 23:47:01 UTC
(In reply to Robert Strube from comment #28)
> Quick update:
> 
> I heard back from the ACPI BIOS kernel developers (see:
> https://bugzilla.kernel.org/show_bug.cgi?id=201527) and they seem to imply
> that the PCI resource issues showing up in the dmesg log are *not* a
> problem.  Linux is simply trying to allocate more resources; the failure
> is OK and it does get the requisite resources.

It would appear not to be OK, since when the driver does get the resources, the GPU works.

Does disabling dpm make the GPU work?  E.g., append amdgpu.dpm=0 to the kernel command line in grub.  The driver needs to query the supported PCIe speeds from the PCIe bridge it is connected to in order to set up the power management controller.  Maybe when the resources are not available, the driver is not able to get that information, or it gets garbage.
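A sketch of making that kernel command line change persistent on a Debian/Ubuntu-style system (paths and the regeneration command vary by distro; the example below works on a scratch copy rather than the real /etc/default/grub):

```shell
# Sketch: append amdgpu.dpm=0 to GRUB_CMDLINE_LINUX_DEFAULT.
# Done on a temporary scratch copy here; on a real system you would
# edit /etc/default/grub as root instead.
cfg=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$cfg"

# Insert the parameter just before the closing quote of the default line
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 amdgpu.dpm=0"/' "$cfg"

newline=$(cat "$cfg")
echo "$newline"
rm -f "$cfg"

# After editing the real file, regenerate the grub config, e.g.:
#   sudo update-grub        (Debian/Ubuntu)
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg   (Fedora/openSUSE)
```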
Comment 30 Robert Strube 2018-11-30 01:26:17 UTC
Hi Alex,

Thanks for the reply.  I wanted to clarify an important point: when I disabled both PM and ACPI completely, I did not see any PCI resource issues AND the eGPU initialized successfully.

However, after testing a little more I was able to keep PM enabled and disable only ACPI.  In that situation I encountered a scenario where the PCI resource issues were present in the log, BUT the eGPU still initialized.  I mentioned that briefly in one of my previous comments but didn't really elaborate.

So under certain situations the eGPU did initialize despite seeing PCI BAR resource issues.

I've been working with another user who has the exact same system (XPS 9575) and an RX 580 and is having the same issues.  He was actually able to get the eGPU initialized by passing in pci=noacpi rather than completely disabling ACPI as a whole.  I'll double check with him to see if he can post his dmesg log, because I'm not sure whether the PCI resource issues are present under those circumstances.

Reference: https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210/13

At this point I've had to return my RX 580 - great card but after about a month of troubleshooting I was running out of time in the return window - so I'm unable to do any more testing with that specific card at this time.  Kind of a bummer... I'll probably pick up a Vega early next year and try again.

Rob
Comment 31 Alex Deucher 2018-11-30 01:35:29 UTC
(In reply to Robert Strube from comment #30)
> Hi Alex,
> 
> Thanks for the reply.  I wanted to clarify an important point: When I
> disabled PM completely and ACPI completely, I did not see any PCI resource
> issues AND the eGPU initialized successfully.
> 
> acpi=off apm=off amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0

The only relevant items here are acpi=off and amdgpu.dpm=0.  Did you test them independently or just together?  Setting dpm=0 is irrelevant if acpi=off, since there will be no resource restrictions.  You need to test them independently.

> 
> I've been working with another user that has the exact same system (XPS
> 9575) and an RX 580 and is having the same issues.  He was actually able to
> get the eGPU initialized by passing in PCI=noacpi rather than completely
> disabling ACPI as a whole.  I'll double check with him to see if he can post
> his dmesg log - because I'm not sure if the PCI resource issues are present
> under those circumstances.

Looks like the same issue as yours: PCI resources not getting assigned.
Comment 32 Robert Strube 2018-11-30 01:43:59 UTC
Hi Alex,

I just tested acpi=off independently; I have not yet tested amdgpu.dpm=0 on its own.

Regarding the other user: yes, he had the exact same PCI resource issues as me.  What I'm curious to find out, though, is whether those same PCI resource issues were still present when he passed in pci=noacpi and got the card initialized.  My hunch is that they were still present AND the card was able to initialize, but I'm eager to find out.

I've also asked him to test out amdgpu.dpm=0 independently and report back.  Hopefully you're onto something here!

Rob
Comment 33 Alex Deucher 2018-11-30 01:51:35 UTC
With respect to PCI(e) devices, acpi=off and pci=noacpi are equivalent, I think.

