Bug 97313

Summary: amdgpu: crash when PCI rescan discovers new card
Product: xorg
Component: Driver/modesetting
Status: RESOLVED MOVED
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Reporter: jimijames.bove
Assignee: Xorg Project Team <xorg-team>
QA Contact: Xorg Project Team <xorg-team>
CC: jimijames.bove, tjaalton
Attachments:
  dmesg log
  x-0.log
  x-0.log.old
  debug-without-glibc
  backtrace with debug symbols
  naive assert fix

Description jimijames.bove 2016-08-12 00:16:35 UTC
This is semi-related to another bug: https://bugzilla.kernel.org/show_bug.cgi?id=150731
And by semi, I mean it's either completely related or completely not related, and I don't know which yet.

If I boot my computer with my AMD video card bound to a driver besides amdgpu (like vfio-pci for my Windows virtual machine), then unbind it from that driver, intending to bind it to amdgpu for Linux gaming...

EXPECTED BEHAVIOR: I can unbind, remove the card, rescan (echo 1 > /sys/bus/pci/rescan), then bind the card to amdgpu, using DRI3 and the DRI_PRIME variable for games. This is a process that some successfully do with the radeon driver on older AMD cards.

ACTUAL BEHAVIOR: At the moment that I rescan (echo 1 > /sys/bus/pci/rescan), X crashes and restarts, booting me back to the login screen, and something (I guess either the kernel or X) has automatically bound the card to amdgpu on its own, without me entering those echo commands (echo "<card's vendor ID>" "<card's device ID>" > /sys/bus/pci/drivers/amdgpu/new_id) to bind it.

DESIRED BEHAVIOR: I can get my card bound to amdgpu, whether it's automatic or by my own typed bind commands, WITHOUT a boot-me-to-login-screen X crash.
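
For reference, the sequence I'm attempting looks roughly like this (just a sketch: 0000:01:00.0 and the "1002 7300" IDs are placeholders for my card's PCI address and vendor/device IDs, and everything is run as root):

echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
echo 1 > /sys/bus/pci/rescan
# in ACTUAL BEHAVIOR, X crashes right here; the manual bind I never get to is:
echo 1002 7300 > /sys/bus/pci/drivers/amdgpu/new_id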

The X crash log is at https://bugzilla.kernel.org/attachment.cgi?id=228411
The crash, starting at [456.336], has no errors. It seems to be just reconfiguring the graphics because it noticed a new available card, and somehow that resulted in me being booted back to the login screen.
The log that was made after the crash, when I logged back in, is at https://bugzilla.kernel.org/attachment.cgi?id=228421
Comment 1 Alex Deucher 2016-08-12 13:52:38 UTC
Please attach your dmesg output from after you load the amdgpu driver.
Comment 2 jimijames.bove 2016-08-12 20:04:19 UTC
Created attachment 125758 [details]
dmesg log

Here's the full dmesg output. The moment amdgpu starts is [6156.280097], and that output seems to end around [6208.867655].

I should mention at this point, this behavior has happened to me on an R9 380 (Tonga) and my current R9 Fury (Fiji).
Comment 3 Michel Dänzer 2016-08-15 03:41:44 UTC
Since the Xorg log file ends abruptly, we'd probably need to see the Xorg stderr output. It should be captured in a display manager log file / the systemd journal / ...
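
For example (where exactly depends on the setup; these are just two common places to look):

journalctl -b _COMM=Xorg          # systemd journal
less /var/log/lightdm/x-0.log     # LightDM's per-display log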

BTW, for DRI3 PRIME there's no need for X to do anything with the card directly, so you can prevent it from trying and crashing with

Section "ServerFlags"
       Option  "AutoAddGPU" "off"
EndSection

in /etc/X11/xorg.conf as a workaround.
Comment 4 jimijames.bove 2016-08-15 04:14:18 UTC
That's interesting. Turning AutoAddGPU off completely fixed the problem, though not in the way I expected: with it off, my computer still automatically binds the card to amdgpu as soon as I rescan (so I guess that's a quirk of the kernel or amdgpu), but X takes the card in with no crash. It works as seamlessly as it's supposed to, and I can once again use DRI_PRIME to access the card. I'll turn AutoAddGPU back on to get the error output (there is none while I'm using this perfectly fine workaround), which I figured out is not in my Xorg logs (https://bugs.launchpad.net/lightdm/+bug/1322277), and upload that now.
Comment 5 jimijames.bove 2016-08-15 04:27:56 UTC
Created attachment 125786 [details]
x-0.log

Here are the stderr logs. x-0.log.old is the session that crashed from trying to "AutoAdd" the AMD GPU, and x-0.log is the session that started after the crash, with the GPU already loaded.

I paid close attention to differences in these logs between when I had AutoAdd turned off or on in xorg.conf, and there are only 2 different lines: the obvious "using xorg.conf" line (because I just deleted the xorg.conf file when I wasn't turning AutoAdd off), and the line that the .old file ends in if it crashes: "Xorg: privates.c:385: dixRegisterPrivateKey: Assertion `!global_keys[type].created' failed."
Comment 6 jimijames.bove 2016-08-15 04:28:12 UTC
Created attachment 125787 [details]
x-0.log.old
Comment 7 Michel Dänzer 2016-08-16 09:46:07 UTC
Can you:

* Start X with AutoAddGPU enabled
* Attach gdb to the Xorg process and continue its execution
* Reproduce the problem
* At the gdb prompt after SIGABRT, enter "bt full" and attach the resulting output here

See https://www.x.org/wiki/Development/Documentation/ServerDebugging/ for some background information, in particular the "Debug support" section.
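
Roughly, the session could look like this (a sketch only: run gdb as root from another VT or over SSH so it doesn't deadlock with the X server it's debugging, and /tmp/xorg-bt.txt is just an example path for capturing the output):

gdb -p $(pidof Xorg)
(gdb) set logging file /tmp/xorg-bt.txt
(gdb) set logging on
(gdb) continue
   ... reproduce the problem (echo 1 > /sys/bus/pci/rescan); gdb stops on SIGABRT ...
(gdb) bt full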
Comment 8 jimijames.bove 2016-08-16 20:54:19 UTC
It looks like in Arch Linux, I'd have to recompile those packages myself to get debug symbols. At that point, I think it'd be easier to just do this in a LiveUSB of debian. If I have the same issue with this card in debian, would that debug info be as good as if I had done it in Arch?
Comment 9 Michel Dänzer 2016-08-17 00:48:46 UTC
(In reply to jimijames.bove from comment #8)
> If I have the same issue with this card in debian,
> would that debug info be as good as if I had done it in Arch?

Yeah, should be fine.
Comment 10 jimijames.bove 2016-08-17 23:14:59 UTC
Alright, new problem. I have debian all set up to debug this. I even tested the whole debugging process without amdgpu running, and it went swimmingly. But as soon as I put in my 20-amdgpu.conf file, X crashes because it can't grab my card from pci-stub, and in order to test this bug I *need* to boot the machine with the card bound to pci-stub. For some reason, X in Arch Linux is OK with loading amdgpu without finding a usable card, but in debian, not finding a card makes it freak out, even though Ignore is set to true. Here's the amdgpu part of the crash log from debian:

[     4.716] (--) PCI:*(0:0:2:0) 1b36:0100:1af4:1100 rev 4, Mem @ 0x94000000/67108864, 0x90000000/67108864, 0x98248000/8192, I/O @ 0x0000c2a0/32, BIOS @ 0x????????/131072
[     4.716] (--) PCI: (0:0:7:0) 1002:7300:174b:e331 rev 203, Mem @ 0x80000000/268435456, 0x98000000/2097152, 0x98200000/262144, I/O @ 0x0000c000/256, BIOS @ 0x????????/131072
[     4.716] (II) LoadModule: "glx"
[     4.716] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[     4.717] (II) Module glx: vendor="X.Org Foundation"
[     4.717] 	compiled for 1.18.4, module version = 1.0.0
[     4.717] 	ABI class: X.Org Server Extension, version 9.0
[     4.717] (==) AIGLX enabled
[     4.717] (II) LoadModule: "amdgpu"
[     4.717] (II) Loading /usr/lib/xorg/modules/drivers/amdgpu_drv.so
[     4.717] (II) Module amdgpu: vendor="X.Org Foundation"
[     4.717] 	compiled for 1.18.3, module version = 1.1.0
[     4.717] 	Module class: X.Org Video Driver
[     4.717] 	ABI class: X.Org Video Driver, version 20.0
[     4.717] (II) AMDGPU: Driver for AMD Radeon chipsets: BONAIRE, BONAIRE, BONAIRE,
	BONAIRE, BONAIRE, BONAIRE, BONAIRE, BONAIRE, BONAIRE, KABINI, KABINI,
	KABINI, KABINI, KABINI, KABINI, KABINI, KABINI, KABINI, KABINI,
	KABINI, KABINI, KABINI, KABINI, KABINI, KABINI, KAVERI, KAVERI,
	KAVERI, KAVERI, KAVERI, KAVERI, KAVERI, KAVERI, KAVERI, KAVERI,
	KAVERI, KAVERI, KAVERI, KAVERI, KAVERI, KAVERI, KAVERI, KAVERI,
	KAVERI, KAVERI, KAVERI, HAWAII, HAWAII, HAWAII, HAWAII, HAWAII,
	HAWAII, HAWAII, HAWAII, HAWAII, HAWAII, HAWAII, HAWAII, TOPAZ, TOPAZ,
	TOPAZ, TOPAZ, TOPAZ, TONGA, TONGA, TONGA, TONGA, TONGA, TONGA, TONGA,
	TONGA, TONGA, CARRIZO, CARRIZO, CARRIZO, CARRIZO, CARRIZO, FIJI,
	STONEY, POLARIS11, POLARIS11, POLARIS11, POLARIS11, POLARIS11,
	POLARIS11, POLARIS10, POLARIS10
[     4.718] (EE) No devices detected.
[     4.718] (EE) 
Fatal server error:
[     4.718] (EE) no screens found(EE)

And here's my 20-amdgpu.conf file that works fine in Arch but not debian, given a card that's occupied by vfio-pci or pci-stub:

Section "Device"
    Identifier "Radeon"
    Driver "amdgpu"
    Option "DRI" "3"
    Option "Ignore" "true"
EndSection
Comment 11 Michel Dänzer 2016-08-18 00:35:51 UTC
(In reply to jimijames.bove from comment #10)
> Section "Device"
>     Identifier "Radeon"
>     Driver "amdgpu"
>     Option "DRI" "3"
>     Option "Ignore" "true"
> EndSection

Option "Ignore" doesn't have any effect here, both according to the xorg.conf manpage and the xserver code.

I suspect the problem is that this Device section makes Xorg try to use the amdgpu driver for all GPUs, and not fall back to any other drivers. I'm not sure why this section would be necessary anyway; Xorg should automatically try to use the amdgpu driver (if it's installed correctly) for any GPUs controlled by the amdgpu kernel driver, and fall back to other drivers.

If the above doesn't help you overcome this issue, please attach the full Xorg log files from Arch and Debian.
Comment 12 jimijames.bove 2016-08-18 00:45:37 UTC
I forgot to mention, the reason I need that conf file is I need to test this on DRI3, because on the default of DRI2, the crash did not occur, meaning it's either a problem with Arch and not debian, or a problem with DRI3. Is there a section besides "Device" that doesn't disable fallback but still enables DRI3?
Comment 13 Michel Dänzer 2016-08-18 00:49:00 UTC
DRI3 needs to be enabled for the primary GPU. The secondary GPU doesn't matter for that; in fact, as I mentioned in comment 3, Xorg doesn't need to use the secondary GPU directly at all with DRI3. It's all handled by Mesa.
Comment 14 jimijames.bove 2016-08-18 00:51:33 UTC
Oh. I didn't know that. In that case, this got easier to test.

Also, the Ignore option is supposed to make X only use the card for features like DRI_PRIME and not try to use it for any screens, as referenced here: http://arseniyshestakov.com/2016/03/31/how-to-pass-gpu-to-vm-and-back-without-x-restart/ in this comment: http://arseniyshestakov.com/2016/03/31/how-to-pass-gpu-to-vm-and-back-without-x-restart/#comment-2669641933

The Ignore option is precisely why I didn't expect X to crash upon loading the AMD card, and looking back at that link, I just realized that I've been doing it wrong this whole time and should have it in a 30- file instead of my 20- file. I'll test it again on my own Arch system with that set up properly and AutoAddGPU turned back on, then test it in debian on DRI3 but without that amdgpu conf file.
Comment 15 Michel Dänzer 2016-08-18 01:00:48 UTC
According to the xorg.conf manpage and the xserver code, Option "Ignore" only has an effect in "InputClass" and "Monitor" sections.
Comment 16 Michel Dänzer 2016-08-18 01:03:27 UTC
Note that since the crash happens when Xorg tries to use the GPU, testing with DRI3 and Xorg not using the GPU directly cannot trigger the crash.
Comment 17 jimijames.bove 2016-08-18 01:24:11 UTC
But a crash happening anyway with DRI3 and Xorg not using the GPU directly is exactly the original issue I posted this bug report trying to fix. And I just finished testing without any of my own explicit options for amdgpu: no DRI3, no Ignore (and the DRI_PRIME stuff I was doing still works fine, so you were right that those options had no point), and it's still crashing unless AutoAddGPU is turned off. So I'll be grabbing that debug output in debian now.
Comment 18 Michel Dänzer 2016-08-18 01:30:02 UTC
(In reply to jimijames.bove from comment #17)
> But a crash happening anyway with DRI3 and Xorg not using the GPU directly
> is exactly the original issue I posted this bug report trying to fix.

The crash referenced in the original description of this report happens when Xorg tries to use (using the modesetting driver) the GPU controlled by the amdgpu kernel driver.
Comment 19 jimijames.bove 2016-08-18 02:03:39 UTC
Sorry, I think I'm just confusing myself from not having a full grasp of the terminology.
Comment 20 jimijames.bove 2016-08-18 02:29:46 UTC
Argh. I finally got X configured and running in debian, and now I can't get the card to bind to amdgpu no matter what I do. The normal method--a rescan--results in it going straight back to pci-stub. If I unbind, then rmmod pci-stub, then rescan, it still refuses to bind to amdgpu. I think I'll just have to compile X and amdgpu in Arch after all.
Comment 21 jimijames.bove 2016-09-05 00:01:36 UTC
Created attachment 126211 [details]
debug-without-glibc

Sorry that took so long. A whole lot of life stuff happened as soon as I decided to build it in Arch. Now that I have, it was actually really easy (I didn't know about the ABS until doing this). I ran the test once and got some output, but it seemed to want the debug symbols from glibc as well, so I also recompiled glibc with debug symbols and ran the test again. Here's both of those outputs.
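
In case it helps anyone else on Arch, the rebuild went roughly like this (from memory, so treat the tool names, paths, and makepkg.conf options as approximate):

# in /etc/makepkg.conf, keep debug info, e.g. OPTIONS=(... debug !strip)
abs extra/xorg-server                        # sync the PKGBUILD via ABS
cp -r /var/abs/extra/xorg-server ~/build && cd ~/build/xorg-server
makepkg -s                                   # rebuild with debug symbols
sudo pacman -U xorg-server-*.pkg.tar.xz      # install the debug build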
Comment 22 jimijames.bove 2016-09-05 00:18:31 UTC
The "Cannot access memory at address errors" are weird, because yes, they popped up as soon as I ran gdb, but they didn't happen again until the moment of the crash, and there was a good 10 minutes between when I opened gdb and actually ran the test.

The log with glibc is coming soon. I didn't realize glibc takes this long to compile.
Comment 23 jimijames.bove 2016-09-05 02:29:25 UTC
OK, nevermind on that second debug report. Even though pacman tells me /lib64/ld-linux-x86-64.so.2 comes from glibc, reinstalling glibc with debug symbols (which DID make objdump --sym return plenty of symbols for /lib64/ld-linux-x86-64.so.2) did not change the output in any way. So, I don't know how to make gdb able to find debug symbols for /lib64/ld-linux-x86-64.so.2.
Comment 24 dx 2018-04-02 15:00:40 UTC
Hi there! Same issue, same use case, same distro.

I'm using the radeon driver instead of amdgpu but the assert seems to be the same one anyway. I don't have the actual assert message, but I do have a backtrace that doesn't seem to have been posted here before:

Backtrace:
0: /usr/lib/xorg-server/Xorg (OsLookupColor+0x139) [0x55cd08924e99]
1: /usr/lib/libpthread.so.0 (funlockfile+0x50) [0x7f6d28fb2e1f]
2: /usr/lib/libc.so.6 (gsignal+0x110) [0x7f6d28c1e860]
3: /usr/lib/libc.so.6 (abort+0x1c9) [0x7f6d28c1fec9]
4: /usr/lib/libc.so.6 (__assert_fail_base+0x14c) [0x7f6d28c170bc]
5: /usr/lib/libc.so.6 (__assert_fail+0x43) [0x7f6d28c17133]
6: /usr/lib/xorg-server/Xorg (dixRegisterPrivateKey+0x23f) [0x55cd087dd9ff]
7: /usr/lib/xorg/modules/libglamoregl.so (glamor_init+0xca) [0x7f6d1ba9d3fa]
8: /usr/lib/xorg/modules/drivers/modesetting_drv.so (_init+0x39de) [0x7f6d1bcd2d2e]
9: /usr/lib/xorg-server/Xorg (AddGPUScreen+0xf0) [0x55cd087bf630]
10: /usr/lib/xorg-server/Xorg (xf86PlatformMatchDriver+0xa9c) [0x55cd0881f59c]
11: /usr/lib/xorg-server/Xorg (xf86PlatformDeviceCheckBusID+0x211) [0x55cd08824881]
12: /usr/lib/xorg-server/Xorg (config_fini+0x9bd) [0x55cd0882105d]
13: /usr/lib/xorg-server/Xorg (config_fini+0x1340) [0x55cd08822420]
14: /usr/lib/xorg-server/Xorg (OsCleanup+0x621) [0x55cd08925e01]
15: /usr/lib/xorg-server/Xorg (WaitForSomething+0x1fb) [0x55cd0891e70b]
16: /usr/lib/xorg-server/Xorg (SendErrorToClient+0x113) [0x55cd087bf023]
17: /usr/lib/xorg-server/Xorg (InitFonts+0x420) [0x55cd087c32a0]
18: /usr/lib/libc.so.6 (__libc_start_main+0xea) [0x7f6d28c0af4a]
19: /usr/lib/xorg-server/Xorg (_start+0x2a) [0x55cd087acf0a]

Followed by "Caught signal 6 (Aborted). Server aborting".

The assertion message in the attached (but not mine) x-0.log.old is:

>Xorg: privates.c:385: dixRegisterPrivateKey: Assertion `!global_keys[type].created' failed.

BTW, useful stuff in the comments here regarding AutoAddGPU, the (useless) Ignore setting and DRI3 stuff! I think I learnt more from reading a couple of comments here than in the past four hours of fiddling.
Comment 25 dx 2018-04-02 20:44:16 UTC
Created attachment 138507 [details]
backtrace with debug symbols
Comment 26 jimijames.bove 2018-04-02 20:50:34 UTC
I'm afraid I won't be able to help out with this anymore, and I have no idea for how long. I'm currently dealing with another bug in which DRI3 refuses to turn on for my NVidia GT 740, even though there's nothing wrong with DRI3, my hardware and conf files haven't changed, and downgrading my entire system back to a time before that other bug started happening does not make it go away. Because magic or something. Other nouveau users don't seem to have the issue besides one guy on reddit who's using a 730. Consequently, it's been months and I still haven't been able to find the cause. Until it's fixed, I can't test the entire premise that this bug report is based on--adding my AMD card to the current X server.
Comment 27 dx 2018-04-02 20:57:24 UTC
Created attachment 138508 [details] [review]
naive assert fix

This turns the assert() into a return FALSE. Applies against 1.19.6+13+gd0d1a694f-1 (current stable in arch)

Probably not a good idea and probably not even following style correctly. Trivial enough but I surrender any copyright it might have (or license as CC0).

It stops the crash and I don't see any yelling in logs, but I might not have the correct debug level. Haven't checked what consequences it might have but other parts of the function return FALSE too, anyway.

Most noticeable is the fact that the monitor attached to the newly added GPU does not appear in xrandr, so having AutoAddGPU on gets me nothing...
Comment 28 GitLab Migration User 2018-12-13 18:11:34 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/xserver/issues/53.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.