Bug 98552

Summary: RX 480 does not work as eGPU (amdgpu crashes at amdgpu_bo_init)
Product: xorg
Component: Driver/AMDgpu
Status: RESOLVED NOTOURBUG
Severity: normal
Priority: medium
Version: git
Hardware: Other
OS: All
Reporter: Gašper Sedej <gsedej>
Assignee: xf86-video-ati maintainers <xorg-driver-ati>
QA Contact: Xorg Project Team <xorg-team>
CC: airlied
Attachments:
Description        Flags
dmesg from crash   none
linux 4.8          none
modprobe amdgpu    none

Description Gašper Sedej 2016-11-02 14:53:49 UTC
Created attachment 127690 [details]
dmesg from crash

I recently got a new laptop with an i7 4700qm CPU. The nvidia 840m is very slow even with the nvidia driver (I can play new Steam Linux games at 720p at low to medium settings).

My idea is to use an "eGPU" solution with an RX 480 (external graphics connected to the laptop via ExpressCard or mini PCIe).

I prepared my system: Ubuntu 16.04, kernel 4.9rc3, mesa 13.1 dev.

If I start the laptop with the RX 480 connected, I get a blank screen after GRUB and can't access it.

When I connect the GPU while the system is running, I can get dmesg using ssh from another computer.

I am using kernel parameters pci=nocrs and pci=realloc



here is probably the problematic part (rest is in the attachment)
[  980.754662]  [<ffffffffa1e705a9>] io_reserve_memtype+0x59/0x130
[  980.754688]  [<ffffffffa1e706af>] arch_io_reserve_memtype_wc+0x2f/0x50
[  980.754780]  [<ffffffffc08f3180>] amdgpu_bo_init+0x20/0x90 [amdgpu]
[  980.754835]  [<ffffffffc092cbca>] gmc_v8_0_sw_init+0x37a/0x5a0 [amdgpu]
[  980.754885]  [<ffffffffc08e12a4>] amdgpu_device_init+0xc64/0x11e0 [amdgpu]
[  980.754920]  [<ffffffffa1fd0584>] ? kmalloc_order_trace+0x24/0xa0
[  980.754967]  [<ffffffffc08e39db>] amdgpu_driver_load_kms+0x5b/0x1f0 [amdgpu]
[  980.755028]  [<ffffffffc00a5657>] drm_dev_register+0xa7/0xd0 [drm]
[  980.755069]  [<ffffffffc00a773c>] drm_get_pci_dev+0x9c/0x1c0 [drm]
[  980.755121]  [<ffffffffc08de49c>] amdgpu_pci_probe+0xbc/0xe0 [amdgpu]
[  980.755150]  [<ffffffffa226e415>] local_pci_probe+0x45/0xa0
[  980.755173]  [<ffffffffa226f8c9>] pci_device_probe+0x109/0x160
[  980.755202]  [<ffffffffa2393fd3>] driver_probe_device+0x223/0x430
[  980.755229]  [<ffffffffa23942bf>] __driver_attach+0xdf/0xf0
[  980.755253]  [<ffffffffa23941e0>] ? driver_probe_device+0x430/0x430
[  980.755279]  [<ffffffffa2391b0c>] bus_for_each_dev+0x6c/0xc0
[  980.755306]  [<ffffffffa239371e>] driver_attach+0x1e/0x20
[  980.755330]  [<ffffffffa2393140>] bus_add_driver+0x170/0x270
[  980.755357]  [<ffffffffc0a25000>] ? 0xffffffffc0a25000
[  980.755383]  [<ffffffffa2394c30>] driver_register+0x60/0xe0
[  980.755407]  [<ffffffffc0a25000>] ? 0xffffffffc0a25000
[  980.755432]  [<ffffffffa226dd0c>] __pci_register_driver+0x4c/0x50
[  980.755468]  [<ffffffffc00a794b>] drm_pci_init+0xeb/0x100 [drm]
[  980.755493]  [<ffffffffc0a25000>] ? 0xffffffffc0a25000
[  980.755519]  [<ffffffffc0a25000>] ? 0xffffffffc0a25000
[  980.755566]  [<ffffffffc0a25079>] amdgpu_init+0x79/0x7b [amdgpu]
[  980.755596]  [<ffffffffa1e02190>] do_one_initcall+0x50/0x180
[  980.755624]  [<ffffffffa1ff0201>] ? __vunmap+0x81/0xd0
[  980.755651]  [<ffffffffa200e742>] ? kmem_cache_alloc_trace+0x142/0x190
[  980.755683]  [<ffffffffa1fa3df1>] do_init_module+0x5f/0x1f7
[  980.755707]  [<ffffffffa1f1392b>] load_module+0x199b/0x1d00
[  980.755732]  [<ffffffffa1f10020>] ? __symbol_put+0x60/0x60
[  980.755759]  [<ffffffffa21bcf8e>] ? ima_post_read_file+0x7e/0xa0
[  980.755790]  [<ffffffffa2175edb>] ? security_kernel_post_read_file+0x6b/0x80
[  980.755824]  [<ffffffffa1f13eff>] SYSC_finit_module+0xdf/0x110
[  980.755853]  [<ffffffffa1f13f4e>] SyS_finit_module+0xe/0x10
[  980.755878]  [<ffffffffa268bbbb>] entry_SYSCALL_64_fastpath+0x1e/0xad
Comment 1 Alex Deucher 2016-11-02 16:16:01 UTC
Can you try with an older kernel (e.g., 4.8)?  This looks to be related to some recent changes in 4.9.
Comment 2 Gašper Sedej 2016-11-02 16:28:55 UTC
Created attachment 127696 [details]
linux 4.8
Comment 3 Gašper Sedej 2016-11-02 16:29:08 UTC
Thanks for the reply.

I tried it with kernel 4.8, but now I get an error:
[   64.395083] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -12
(other log in attachment)

Also note that the laptop I am testing on is an HP EliteBook 8730w (Core 2 Duo) that already has a Mobility Radeon HD 3670.
Comment 4 Gašper Sedej 2016-11-02 16:33:55 UTC
AppArmor crashed, so I disabled it.

Now I only get:

...
[  387.295971] ATOM BIOS: E347
[  387.295994] [drm] GPU not posted. posting now...
Comment 5 Alex Deucher 2016-11-02 16:43:50 UTC
Your system does not seem to be allocating any address space for the PCI BAR:
[   64.395009] [drm] Detected VRAM RAM=8192M, BAR=0M
so the driver has no way to access VRAM with the CPU.
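
For readers checking their own logs for the same symptom, here is a minimal sketch that pulls the VRAM/BAR line out of dmesg text. The function name vram_bar is hypothetical; the pattern is taken only from the log line quoted in this comment, so other driver versions may word it differently.

```shell
# Print the VRAM size and CPU-visible BAR size reported by amdgpu.
# Reads dmesg text on stdin; prints e.g. "VRAM=8192M BAR=0M".
# Prints nothing if the "Detected VRAM" line is not present.
vram_bar() {
    sed -n 's/.*Detected VRAM RAM=\([0-9]*M\), BAR=\([0-9]*M\).*/VRAM=\1 BAR=\2/p'
}
```

Typical use would be `dmesg | vram_bar`. A result of BAR=0M matches the situation described here: no address space was assigned for the VRAM aperture.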
Comment 6 Gašper Sedej 2016-11-02 17:36:03 UTC
What can I do? Any way to debug?

This system DOES run with Nvidia GTX 770 (using prop driver)
Comment 7 Alex Deucher 2016-11-02 18:32:17 UTC
(In reply to Gašper Sedej from comment #6)
> What can I do? Any way to debug?

You'll probably want to ask on the Linux PCI mailing list or file a bug against the linux pci subsystem.

> 
> This system DOES run with Nvidia GTX 770 (using prop driver)

Is the nvidia device physically part of the laptop or connected via some external mechanism?

Generally the sbios handles resource allocation with respect to PCI BARs.  You may have just run out of address space.
Comment 8 Gašper Sedej 2016-11-02 19:09:30 UTC
I am using an "EXP GDC Beast", http://www.banggood.com/EXP-GDC-Laptop-External-PCI-E-Graphics-Card-p-934367.html
The limitation is on PCIe lanes: only 1x or 2x. The PCIe version is probably v2 because of the laptop's age.
The EXP GDC has some CTD and PTD settings, as seen in the link, but I am not sure what they do. (nvidia works with the current settings)

The laptop can boot via BIOS and via EFI. I did EFI installation, since I read it was better for such setups.

I am using kernel parameters pci=nocrs and pci=realloc - I don't know why, but nvidia does not work without those parameters.
I think "pci=realloc" is for kernel memory allocation, but this is different than "PCI BAR" right?

Where can I ask about PCI BAR issue?
Comment 9 Alex Deucher 2016-11-02 19:23:16 UTC
(In reply to Gašper Sedej from comment #8)
> I am using "EXP GDC Beast",
> http://www.banggood.com/EXP-GDC-Laptop-External-PCI-E-Graphics-Card-p-934367.html
> The limitation is on PCIe lanes - only 1x or 2x. The pcie version is
> probably v2 because of laptop age.
> The EXP GDC has some CTD and PTD settings as seen in link, but I am not
> sure what it does. (nvidia works in current settings)
> 
> The laptop can boot via BIOS and via EFI. I did EFI installation, since I
> read it was better for such setups.
> 
> I am using kernel parameters pci=nocrs and pci=realloc - I don't know why,
> but nvidia does not work without those parameters.
> I think "pci=realloc" is for kernel memory allocation, but this is different
> than "PCI BAR" right?
> 
> Where can I ask about PCI BAR issue?


pci=realloc will attempt to handle additional PCI resources like BARs that were not handled by the BIOS on boot.  Whether or not it succeeds will depend on the number and size of the BARs.  Do you also need pci=nocrs?  You may just be getting lucky on the nvidia card if it uses fewer or smaller BARs.  Please attach the full dmesg output.  You should see messages about unassigned resources.
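
When testing with and without these options, it helps to confirm which ones the running kernel was actually booted with. The sketch below is one way to check, assuming a hypothetical helper name has_pci_option. It relies on two things from the kernel's parameter documentation: the pci= parameter takes a comma-separated list, and the option that ignores the PCI host bridge windows from ACPI is spelled nocrs.

```shell
# Succeed (exit 0) if the kernel command line in $1 enables pci= option $2.
# Handles both "pci=nocrs pci=realloc" and "pci=nocrs,realloc".
has_pci_option() {
    for word in $1; do
        case "$word" in
            pci=*)
                # Wrap the option list in commas so whole-option matching works.
                case ",${word#pci=}," in
                    *",$2,"*) return 0 ;;
                esac ;;
        esac
    done
    return 1
}
```

For example, `has_pci_option "$(cat /proc/cmdline)" realloc && echo "realloc is on"` reports whether pci=realloc took effect on the current boot.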
Comment 10 Gašper Sedej 2016-11-02 19:40:03 UTC
By "BIOS" you also mean "EFI"?

I will try to post the full dmesg, but it's quite hard. To get a "live" dmesg, I am using another computer connected via ssh with the command
watch -n 0.05 "dmesg | tail -n 100"
and the console set to "micro" font size, so I can select and copy the text as it comes through...

How can I access logs when the kernel freezes due to a graphics issue?

Also, if I boot the computer with the eGPU (RX 480) enabled, I simply won't get anything, because it crashes too quickly to capture anything through ssh...
(the monitor is blank even with text boot mode)
Comment 11 Tom St Denis 2016-11-02 19:42:27 UTC
Couldn't you run 

ssh foo@box "dmesg -w" | tee dmesg.log
Comment 12 Alex Deucher 2016-11-02 20:21:54 UTC
(In reply to Gašper Sedej from comment #10)
> By "BIOS" you also mean "EFI"?
> 

I mean the system BIOS; whether you are using legacy or EFI mode is largely irrelevant.


> I will try to post full dmesg, but it's quite hard - to get "live" dmesg, I
> am using another computer connected via ssh and command
> watch -n 0.05 "dmesg | tail -n 100"
> and console to "micro" font size, so I can select and copy text it comes
> through...
> 
> How can I access logs when kernel freezes due to graphics issue?
> 
> Also, if I boot computer with enabled eGPU (RX 480), I simply won't get
> anything because it crashes too quick to be able to capture through ssh...
> (the monitor is blank even with text boot mode)

You can blacklist the amdgpu driver so it doesn't load and then dump the dmesg output.  E.g., append the following to your kernel command line in grub:
modprobe.blacklist=amdgpu
If you want to load it manually after the system has booted just run (as root):
modprobe amdgpu
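
To make the blacklist persist across reboots instead of editing the command line at the GRUB prompt each time, the parameter can go into the GRUB defaults file. This is only a sketch for an Ubuntu-style setup: the file path /etc/default/grub and the existing "quiet splash" value are assumptions, so keep whatever your file already contains and append the new parameter.

```shell
# /etc/default/grub (excerpt)
# "quiet splash" is an assumed existing value; only the last word is new.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=amdgpu"
# After saving, regenerate the boot configuration (as root on Ubuntu):
#   update-grub
```

After the next boot, /proc/cmdline should contain modprobe.blacklist=amdgpu, and the driver can then be loaded by hand as described above.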
Comment 13 Gašper Sedej 2016-11-03 08:35:58 UTC
Created attachment 127713 [details]
modprobe amdgpu

I was able to capture more using command
ssh foo@box "dmesg -w" | tee dmesg.log
see the attachment

Blacklisting didn't help. If I boot with the card connected, I just get a blank screen after the BIOS and nothing changes. With the RX 480 I can't boot (I can with nvidia).
It might be a hardware (system) issue...

Still, where can I look into this BAR=0 issue?
Comment 14 Tom St Denis 2016-11-03 11:01:41 UTC
I'm confused: when you say "nvidia card", I see in your dmesg that it's posting an RV635, which is a crazy old Radeon part, not nvidia.  Is there an nvidia dGPU in the system?

Also by looking at your mem you seem to have 4GB of memory with 256M reserved for the "PCI address space."  On a normal PC you'd typically have ~768M reserved for that.

Maybe try the opposite, blacklist radeon and not amdgpu and try booting with the card inserted.
Comment 15 Alex Deucher 2016-11-03 13:35:59 UTC
[  478.856218] pci 0000:04:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
[  478.856221] pci 0000:04:00.0: BAR 0: failed to assign [mem size 0x10000000 64bit pref]

It doesn't matter when/if you load any specific drivers.  You need to fix the PCI resource allocation on your platform.  I'm not sure it's even possible with your current sbios.
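
The sizes in these messages are in hex, which makes it easy to miss how much address space is being requested. As a rough sketch (the function name failed_bars is hypothetical, and the pattern is based only on the two log lines quoted in this comment), the following lists every BAR the kernel failed to assign, with its size converted to MiB:

```shell
# Print "device BAR<n> <size> MiB" for each BAR the kernel failed to assign.
# Reads dmesg text on stdin.
failed_bars() {
    sed -n 's/.*pci \([0-9a-f:.]*\): BAR \([0-9]*\): failed to assign \[mem size \(0x[0-9a-f]*\).*/\1 \2 \3/p' |
    while read -r dev bar size; do
        # Shell arithmetic accepts the 0x-prefixed hex size directly.
        echo "$dev BAR$bar $(( $size / 1048576 )) MiB"
    done
}
```

Run as `dmesg | failed_bars`. For the lines above it reports a single 256 MiB request (0x10000000 bytes) for BAR 0 of device 0000:04:00.0.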
Comment 16 Gašper Sedej 2016-11-03 13:44:55 UTC
Sorry for the confusion. I was not explicit enough.

The laptop I wish to "upgrade" is an HP Envy 15 with an Intel i7-4700qm (iGPU Intel HD 4700) and an nvidia Optimus 840m "dGPU" (not so fast!!!). The Envy lacks an ExpressCard slot.

The "eGPU" is a GDC EXP Beast, which includes an ExpressCard "cable" to connect to the laptop.
(I ordered a mini PCIe cable, but I am waiting for delivery.)

In the meantime I am testing on my company's old HP EliteBook, which has an Intel Core 2 Duo and an AMD RV635 (no Intel GPU), and which HAS an ExpressCard slot.
(This is the only computer that has ExpressCard.)


I also have a few real, full-sized GPUs to test:
- nvidia GTX 770 - it works as an eGPU, but it stops working after a few minutes in game (also, the card is my friend's)
- AMD RX 480 - does not work as an eGPU
- (I also have an AMD HD 5570 and a GeForce 610, which also don't work)


I tried blacklisting radeon, but it didn't help.
So the "PCI resource allocation" is something that the BIOS does? So I need another computer to test...?
Comment 17 Alex Deucher 2016-11-03 16:31:58 UTC
(In reply to Gašper Sedej from comment #16)
> So the "PCI resource allocation" is something that bios is doing? So another
> computer to test...?

It's a combination of the bios and the kernel.  Your best bet is to email the linux-pci mailing list or file a kernel bug against the pci subsystem about the failure to assign pci resources.  There's nothing the gpu driver can do until that is resolved.

linux-pci ML:
http://vger.kernel.org/vger-lists.html#linux-pci
Kernel bugzilla:
https://bugzilla.kernel.org/enter_bug.cgi?product=Drivers
Select PCI from the component list.
