Summary: | [4.18rc2] RX470 dGPU on hybrid laptop freezes screen after use
---|---
Product: | DRI
Component: | DRM/AMDgpu
Reporter: | taijian
Assignee: | Default DRI bug account <dri-devel>
Status: | RESOLVED FIXED
Severity: | critical
Priority: | medium
CC: | andrey.grodzovsky, harry.wentland, nicholas.kazlauskas
Version: | DRI git
Hardware: | x86-64 (AMD64)
OS: | Linux (All)

Description
taijian
2018-06-26 14:35:06 UTC
Created attachment 140344 [details]
relevant xorg.log
When was it last working? Can you bisect?

(In reply to Alex Deucher from comment #2)
> When was it last working? Can you bisect?

It never really did work. As you can see from the first post (and may remember), there have been a number of issues with this card and the dc stack, and they often precluded even booting up. So far I always had a "real" error message that stuck out and that I could sort of hang my bug report on. So I thought that this issue was due to one of them, not something unrelated (which it seems to be).

Also, so far I was always able to at least boot up with dc=0 and have it working, but that seems to be over with 4.18 (does not get past gdm login screen without freezing...).

So no, I cannot really bisect, as there does not seem to have ever been a version that actually did work flawlessly, unfortunately. I am really willing to try and help coax some better debug info out of my machine, though, if anyone can give me any pointers as to how that might be achieved.

(In reply to taijian from comment #3)
> Also, so far I was always able to at least boot up with dc=0 and have it
> working, but that seems to be over with 4.18 (does not
> get past gdm login screen without freezing...).

Can you bisect that? That might lead us to the real issue here.

(In reply to Michel Dänzer from comment #4)
> (In reply to taijian from comment #3)
> > Also, so far I was always able to at least boot up with dc=0 and have it
> > working, but that seems to be over with 4.18 (does not
> > get past gdm login screen without freezing...).
>
> Can you bisect that? That might lead us to the real issue here.

Well, I'd love to, but 4.17 unfortunately suffers from bug #105760 and so is completely unusable for me. So should I try starting from 4.16? That seems like it might take rather long, even if I don't run into said bug in between...

Created attachment 140362 [details]
excerpt from build log
OK, I decided to bite the bullet and try to bisect from 4.16. However, now 4.16 won't build, and I get the attached error message. What am I doing wrong?
For reference, I'm working with Linus' mainline tree.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/tools/lib/str_error_r.c?id=854e55ad289ef8888e7991f0ada85d5846f5afb9 is required for building with GCC 8.

(In reply to Michel Dänzer from comment #7)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/tools/lib/str_error_r.c?id=854e55ad289ef8888e7991f0ada85d5846f5afb9 is
> required for building with GCC 8.

Thank you! I'll try this out and then start by bisecting bug 105760 between 4.16 and 4.17, and report back both here and over there.

Created attachment 140390 [details]
First bad commit, as per git bisect

OK, so I did some git-bisect work in order to find where exactly booting with amdgpu.dc=0 broke. What I found is attached, although it does not make much sense to me... Maybe someone else can figure this out.

Note that when trying to boot this build with dc=1, bug 105760 happens.

(In reply to taijian from comment #9)
> OK, so I did some git-bisect work in order to find where exactly booting
> with amdgpu.dc=0 broke. What I found is attached, although it does not make
> much sense to me...

Indeed, it's pretty much impossible for that change to cause the problem.

Most likely, the issue with amdgpu.dc=0 happens with some probability < 100%, so you'd have to test longer / several times before declaring a commit as good. (A commit where the problem occurs can immediately be marked as bad)

(In reply to Michel Dänzer from comment #10)
> (In reply to taijian from comment #9)
> > OK, so I did some git-bisect work in order to find where exactly booting
> > with amdgpu.dc=0 broke. What I found is attached, although it does not make
> > much sense to me...
>
> Indeed, it's pretty much impossible for that change to cause the problem.
>
> Most likely, the issue with amdgpu.dc=0 happens with some probability <
> 100%, so you'd have to test longer / several times before declaring a commit
> as good. (A commit where the problem occurs can immediately be marked as bad)

That's what I figured... So, back to testing...

OK, so I tried again, and again the result was kinda nonsensical. I think the problem is that there is more than one problem here - while bisecting, bug 105760 went away at some point, but instead something else kept crashing the system seemingly at random. Some builds would load up once or twice, then crash on the next try without ANY usable hint in the journal/dmesg. So I'm all outa ideas.

Aaaand I figured out what I did wrong this time... So hopefully third time IS the charm...

Yeah, screw this. I tried again, but because there are several different bugs interacting and screwing up the boot process, I really can't figure out which one exactly is borking up which build. I've been waiting for more than a year to be able to use my laptop the way it was meant to be, and I'm now ready to declare that I'm never again buying a piece of hardware that hasn't already been confirmed to work with Linux.

Created attachment 140733 [details]
dmesg output 4.18rc5 + drm-fixes-2018-07-20

OK, so I have some new, probably interesting dmesg output with the latest mainline build.

What's happening here is that the system boots up, then my background display brightness service goes to work (see here: https://github.com/FedeDP/Clight and here: https://github.com/FedeDP/Clightd) and tries to adjust screen brightness. This leads to a number of

RIP: 0010:dm_dp_aux_transfer+0xa5/0xb0 [amdgpu]

trace calls and then the system freezes completely. And I mean completely, as in not even sysrq + REISUB does anything. Does this help in any way?

Oh, and this service thingy naturally works just fine with amdgpu either blacklisted or on 4.14-lts with dc=0.
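The point made earlier in this thread about intermittent failures (a commit that freezes can be marked bad at once, but a commit needs several clean runs before it can be called good) can be put into rough numbers. What follows is only a sketch under a loud assumption: that a bad commit fails each boot independently with a fixed probability. The function name `boots_needed` and the 5% default threshold are illustrative choices, not something anyone in this report proposed.

```python
import math


def boots_needed(failure_rate: float, max_false_good: float = 0.05) -> int:
    """Consecutive clean boots to require before running `git bisect good`.

    Assumption: a bad commit fails each boot independently with probability
    `failure_rate`, so it survives n boots with probability
    (1 - failure_rate) ** n. Returns the smallest n for which that
    probability drops to `max_false_good` or below.
    """
    if not 0.0 < failure_rate <= 1.0:
        raise ValueError("failure_rate must be in (0, 1]")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be in (0, 1)")
    if failure_rate == 1.0:
        # A bug that always triggers is exposed by a single boot.
        return 1
    return math.ceil(math.log(max_false_good) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for rate in (0.9, 0.5, 0.3, 0.1):
        print(f"failure rate {rate:.0%}: {boots_needed(rate)} clean boots")
```

Under these assumptions, a bug that hits on about half of all boots needs five clean boots per commit before "good" is reasonably safe. The real failure rate in this report is unknown, so treat the result as a rough guide rather than a guarantee.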
(In reply to taijian from comment #15)

So I tried with kernel 4.18 rc.1 from here - https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

and 2 cards

Provider 0: id: 0x81 cap: 0x9, Source Output, Sink Offload crtcs: 5 outputs: 3 associated providers: 1 name:AMD Radeon (TM) RX 460 Graphics @ pci:0000:0b:00.0
Provider 1: id: 0x49 cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 4 associated providers: 1 name:AMD Radeon (TM) RX 480 Graphics @ pci:0000:08:00.0

Where RX 460 is the default and RX 480 is the secondary. I ran both glxgears and glxinfo multiple times with DRI_PRIME=1 and haven't observed any issues.

From the log I see a "GPU pci config reset" print - where does it come from? Did you trigger a PCI reset for the device manually, or did it happen once you tried to run any application with DRI_PRIME=1? Which device is 0000:01:00.0 - primary or secondary?

(In reply to Andrey Grodzovsky from comment #17)

Disclaimer: I'm on vacation and away from my computer right now, so going from memory. 0000:01:00.0 is probably the RX 470. And the PCI resets happen automatically when invoking an application via DRI_PRIME=1, no manual action necessary.

01:00.0 is indeed the RX470.

$ sudo lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:01.2 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x4) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev c5)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]
3b:00.0 Non-Volatile memory controller: Toshiba America Info Systems XG4 NVMe SSD Controller (rev 01)
3d:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
3e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961

OK, here is a new twist: upon further investigating this, I came across some weird backlight behaviour that I originally reported here: https://gitlab.gnome.org/GNOME/gnome-settings-daemon/issues/53. However, as it turns out, this is not a gnome issue but a kernel one.

Namely, doing

echo X > /sys/class/backlight/intel_backlight/brightness

will wake my dGPU, even though it really shouldn't be involved at all (and does not have a /sys/class/backlight device registered). Furthermore, repeatedly invoking this command before the dGPU has had the chance to go back to sleep will not do anything. Thirdly, this seems to be a very good way to get the system to hard lock up.

So, is this an amdgpu issue? An i915 issue? ACPI? Where should I take this bug report?

(In reply to taijian from comment #20)

What do you mean by 'wake' the dGPU?
This should be a separate ticket which you can submit here + attach dmesg after this 'wake'.

OK, after some further testing, here is some more information on my issue:

1) It turns out that the crashing/hanging behaviour ONLY happens when I'm logged into a graphical session (GNOME in my case). When working exclusively from a tty, the system remains stable.

2) A Wayland session seems to be slightly more stable/resilient to crashing than an X11 session. While the latter crashes almost immediately when doing anything with the backlight, the former goes through a couple of cycles of extreme lag, stuttering and recovery before finally succumbing to whatever the problem is.

3) I can reliably and reproducibly crash my graphical session by messing with screen brightness in any way - either through the tools of the graphical shell or by doing # echo X > /sys/class/backlight/intel_backlight/brightness.

4) Why is this an amdgpu bug then? Because the issue only arises when amdgpu is loaded in DC mode, because in this mode the display connectors directly connected to the dGPU (DP, eDP and HDMI) are being enumerated. With dc=0 they are not recognized and can therefore not create any problems.

5) What I think happens is that the graphical shell tries to adjust brightness on the displays that are enumerated as connectors but not actually connected, and by doing so eventually gets amdgpu to crash irrecoverably.

6) When intel_backlight gets manipulated, the following always shows up in dmesg (the first couple of times the dGPU gets to the 'reset'; at some point it just crashes before that, taking the system with it).

Aug 31 17:58:54 alien-arch kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Aug 31 17:58:54 alien-arch kernel: amdgpu: [powerplay] failed to send message 62 ret is 0
Aug 31 17:58:54 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:58:55 alien-arch kernel: amdgpu: [powerplay] failed to send message 18f ret is 0
Aug 31 17:58:55 alien-arch kernel: [drm] UVD and UVD ENC initialized successfully.
Aug 31 17:58:55 alien-arch kernel: [drm] VCE initialized successfully.
Aug 31 17:58:56 alien-arch kernel: [drm] Cannot find any crtc or sizes
Aug 31 17:58:57 alien-arch kernel: amdgpu: [powerplay] failed to send message 15b ret is 0
Aug 31 17:58:58 alien-arch kernel: amdgpu: [powerplay] failed to send message 155 ret is 0
Aug 31 17:59:06 alien-arch kernel: amdgpu: [powerplay] failed to send message 281 ret is 0
Aug 31 17:59:07 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:07 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:08 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:08 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:09 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:10 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:10 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:11 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:12 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:12 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:13 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:13 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:14 alien-arch kernel: amdgpu: [powerplay] last message was failed ret is 0
Aug 31 17:59:14 alien-arch kernel: amdgpu: [powerplay] failed to send message 261 ret is 0
Aug 31 17:59:14 alien-arch kernel: amdgpu 0000:01:00.0: GPU pci config reset

Oh, and before I forget - GNOME bug is here: https://gitlab.gnome.org/GNOME/gnome-settings-daemon/issues/53

OK, another update after another test, this time booting into a live system without any graphical shell/login manager. Here, manipulating /sys/class/backlight/intel_backlight/brightness does nothing to the dGPU, so that part is clearly a user space bug. However, a simple # lspci DOES crash the system in the manner described above, so that part is still on amdgpu (and likely the root cause of the behaviour described in my OP). Again, this is only an issue with dc=1 and not with dc=0.

Possibly related: https://bugzilla.kernel.org/show_bug.cgi?id=156341

OK, this seems to be fixed on 4.19rc6+
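
For anyone comparing their own journal against the powerplay excerpt in this report, a small tally can make the pattern easier to see: which message IDs fail, how often, and whether the "GPU pci config reset" line was reached before the hang. This is a hedged sketch only. The two regular expressions are modelled on the exact log lines quoted in this thread and may not match other kernel versions, and the `summarize` helper is a made-up name for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this report (an assumption:
# other kernel versions may word these messages differently).
FAILED_MSG = re.compile(r"\[powerplay\] failed to send message ([0-9a-fA-F]+) ret is")
PCI_RESET = re.compile(r"amdgpu (\S+): GPU pci config reset")


def summarize(lines):
    """Count failed powerplay messages per ID and collect PCI reset devices.

    Returns (failures, resets), where `failures` maps a message ID string
    to its failure count and `resets` lists the PCI addresses that logged
    a config reset, in order of appearance.
    """
    failures = Counter()
    resets = []
    for line in lines:
        match = FAILED_MSG.search(line)
        if match:
            failures[match.group(1)] += 1
            continue
        match = PCI_RESET.search(line)
        if match:
            resets.append(match.group(1))
    return failures, resets


if __name__ == "__main__":
    import sys

    # Usage: journalctl -k | python3 this_script.py
    failures, resets = summarize(sys.stdin)
    for msg_id, count in failures.most_common():
        print(f"message {msg_id}: failed {count} time(s)")
    print("PCI config resets:", ", ".join(resets) if resets else "none logged")
```

If the list of resets comes back empty while failures keep accumulating, that would correspond to the case described in point 6, where the dGPU crashes before it gets to the reset.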