Bug 107045

Summary: [4.18rc2] RX470 dGPU on hybrid laptop freezes screen after use
Product: DRI Reporter: taijian
Component: DRM/AMDgpu    Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: critical    
Priority: medium CC: andrey.grodzovsky, harry.wentland, nicholas.kazlauskas
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments:
  - full dmesg output via 'journalctl -kb -1'
  - relevant xorg.log
  - excerpt from build log
  - First bad commit, as per git bisect
  - dmesg output 4.18rc5 + drm-fixes-2018-07-20

Description taijian 2018-06-26 14:35:06 UTC
Created attachment 140343 [details]
full dmesg output via 'journalctl -kb -1'

With kernel 4.18rc dpm on my hybrid laptop is finally working (thanks again to everyone who helped out getting it this far - see bug #104064, bug #106597 and bug #105760). However, it is still not actually usable, because any activation of the dGPU leads to a system freeze shortly thereafter. It does not really matter HOW the dGPU is triggered - be it running a graphical application via DRI_PRIME=1 or just querying it with lshw. The dGPU activates and then tries to power down again, but that seems to be problematic, as the (graphical) system will invariably freeze and become unresponsive.
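
For reference, these are the two ways I typically trigger it (glxinfo is just an example client, any DRI_PRIME=1 application behaves the same; -C display merely trims the lshw output):

  DRI_PRIME=1 glxinfo | grep "OpenGL renderer"   # render offload onto the RX 470
  sudo lshw -C display                           # merely querying the device also wakes it

Either way the dGPU powers up, and the freeze follows when it tries to power back down.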

The best I have been able to reconstruct from the journal is that whenever the dGPU is called, it will first do this:

  [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).

followed by a bunch of this: 

  amdgpu: [powerplay] 
           failed to send message 62 ret is 0 
  amdgpu: [powerplay] 
           last message was failed ret is 0
  amdgpu: [powerplay] 
           failed to send message 18f ret is 0 

and then finally: 

  [drm] UVD and UVD ENC initialized successfully.
  [drm] VCE initialized successfully.
  [drm] Cannot find any crtc or sizes
  amdgpu 0000:01:00.0: GPU pci config reset

Sometimes, this will work more than once, sometimes the first invocation after boot will crash. This seems to be somewhat random. However, the crash will always come in the "[powerplay] ...failed to send message 18f ret is 0" phase.

Any ideas how to tackle that or extract some more usable error message from my system?
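
For reference, one generic thing I could try for more verbose output (just the standard DRM debugging knobs, nothing specific to this bug) would be booting with

  drm.debug=0x1e log_buf_len=4M

on the kernel command line, and/or enabling amdgpu dynamic debug at runtime (needs debugfs mounted):

  echo 'module amdgpu +p' | sudo tee /sys/kernel/debug/dynamic_debug/control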
Comment 1 taijian 2018-06-26 14:37:47 UTC
Created attachment 140344 [details]
relevant xorg.log
Comment 2 Alex Deucher 2018-06-26 14:49:56 UTC
When was it last working?  Can you bisect?
Comment 3 taijian 2018-06-26 16:13:10 UTC
(In reply to Alex Deucher from comment #2)
> When was it last working?  Can you bisect?

It never really did work. As you can see from the first post (and may remember), there have been a number of issues with this card and the DC stack, and they often precluded even booting up. So far I always had a "real" error message that stuck out and that I could sort of hang my bug report on, so I thought this issue was due to one of them, not something unrelated (which it now seems to be). Also, so far I was always able to at least boot up with dc=0 and have it working, but that seems to be over with 4.18 (does not get past gdm login screen without freezing...).

So no, I cannot really bisect, as there does not seem to have ever been a version that actually worked flawlessly, unfortunately. I'm really willing to try and help coax some better debug info out of my machine, though, if anyone can give me any pointers as to how that might be achieved.
Comment 4 Michel Dänzer 2018-06-26 16:21:54 UTC
(In reply to taijian from comment #3)
> Also, so far I was always able to at least boot up with dc=0 and have it
> working, but that seems to be over with 4.18 (does not
> get past gdm login screen without freezing...).

Can you bisect that? That might lead us to the real issue here.
Comment 5 taijian 2018-06-26 21:30:38 UTC
(In reply to Michel Dänzer from comment #4)
> (In reply to taijian from comment #3)
> > Also, so far I was always able to at least boot up with dc=0 and have it
> > working, but that seems to be over with 4.18 (does not
> > get past gdm login screen without freezing...).
> 
> Can you bisect that? That might lead us to the real issue here.

Well, I'd love to, but 4.17 unfortunately suffers from bug #105760 and so is completely unusable for me. So should I try starting from 4.16? That seems like it might take rather long, even if I don't run into said bug in between...
Comment 6 taijian 2018-06-27 12:59:18 UTC
Created attachment 140362 [details]
excerpt from build log

OK, I decided to bite the bullet and try to bisect from 4.16. However, now 4.16 won't build, and I get the attached error message. What am I doing wrong?

For reference, I'm working with Linus' mainline tree.
Comment 8 taijian 2018-06-28 07:24:17 UTC
(In reply to Michel Dänzer from comment #7)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> tools/lib/str_error_r.c?id=854e55ad289ef8888e7991f0ada85d5846f5afb9 is
> required for building with GCC 8.

Thank you! I'll try this out and then start working on bisecting first bug 105760 between 4.16 and 4.17 and report back both here and over there.
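
So at any bisect step that fails to build with GCC 8, the plan would be roughly this (assuming the fix cherry-picks cleanly; it gets dropped again before marking the step so the bisect history stays clean):

  git cherry-pick 854e55ad289ef8888e7991f0ada85d5846f5afb9   # the tools/lib/str_error_r.c fix
  make -j"$(nproc)"
  # ...install, reboot, test...
  git reset --hard HEAD~1                                    # drop the fix again
  git bisect good   # or: git bisect bad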
Comment 9 taijian 2018-06-29 08:37:09 UTC
Created attachment 140390 [details]
First bad commit, as per git bisect

OK, so I did some git-bisect work in order to find where exactly booting with amdgpu.dc=0 broke. What I found is attached, although it does not make much sense to me... Maybe someone else can figure this out. 

Note that when trying to boot this build with dc=1, then bug 105760 happens.
Comment 10 Michel Dänzer 2018-06-29 08:44:31 UTC
(In reply to taijian from comment #9)
> OK, so I did some git-bisect work in order to find where exactly booting
> with amdgpu.dc=0 broke. What I found is attached, although it does not make
> much sense to me...

Indeed, it's pretty much impossible for that change to cause the problem.

Most likely, the issue with amdgpu.dc=0 happens with some probability < 100%, so you'd have to test longer / several times before declaring a commit as good. (A commit where the problem occurs can immediately be marked as bad)
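
Roughly something like this (the endpoint tags are placeholders, use whatever you last saw working and first saw failing with dc=0):

  git bisect start
  git bisect bad v4.18-rc1     # placeholder: first version where booting with dc=0 hangs
  git bisect good v4.16        # placeholder: last version where dc=0 booted fine
  # at each step: build + install as usual, then reboot and test SEVERAL times
  git bisect bad               # a single hang is enough to mark the commit bad
  git bisect good              # only after several consecutive clean boots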
Comment 11 taijian 2018-06-29 08:56:11 UTC
(In reply to Michel Dänzer from comment #10)
> (In reply to taijian from comment #9)
> > OK, so I did some git-bisect work in order to find where exactly booting
> > with amdgpu.dc=0 broke. What I found is attached, although it does not make
> > much sense to me...
> 
> Indeed, it's pretty much impossible for that change to cause the problem.
> 
> Most likely, the issue with amdgpu.dc=0 happens with some probability <
> 100%, so you'd have to test longer / several times before declaring a commit
> as good. (A commit where the problem occurs can immediately be marked as bad)

That's what I figured... So, back to testing...
Comment 12 taijian 2018-06-29 15:10:04 UTC
OK, so I tried again, and again the result was kind of nonsensical. I think the problem is that there is more than one problem here: while bisecting, bug 105760 went away at some point, but instead something else kept crashing the system, seemingly at random. Some builds would boot up once or twice, then crash on the next try without ANY usable hint in the journal/dmesg.

So I'm all out of ideas.
Comment 13 taijian 2018-06-29 16:04:07 UTC
Aaaand I figured out what I did wrong this time... So hopefully third time IS the charm...
Comment 14 taijian 2018-06-29 22:13:58 UTC
Yeah, screw this. 

I tried again, but because there are several different bugs interacting and screwing up the boot process, I really can't figure out which one exactly is breaking which build.

I've been waiting for more than a year to be able to use my laptop the way it was meant to be used, and I'm now ready to declare that I'm never again buying a piece of hardware that hasn't already been confirmed to work with Linux.
Comment 15 taijian 2018-07-20 15:13:49 UTC
Created attachment 140733 [details]
dmesg output 4.18rc5 + drm-fixes-2018-07-20

OK, so I have some new, probably interesting dmesg output with the latest mainline build.

What's happening here is that the system boots up, then my background display brightness service goes to work (see here: https://github.com/FedeDP/Clight and here: https://github.com/FedeDP/Clightd) and tries to adjust screen brightness. This leads to a number of 

  RIP: 0010:dm_dp_aux_transfer+0xa5/0xb0 [amdgpu]

traces and then the system freezes completely. And I mean completely, as in not even SysRq + REISUB does anything. Does this help in any way?
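
For completeness, the surrounding context of those traces can be pulled out of the previous (crashed) boot's journal the same way as the attached dmesg:

  journalctl -kb -1 | grep -B 5 -A 40 dm_dp_aux_transfer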
Comment 16 taijian 2018-07-20 15:14:51 UTC
Oh, and this service naturally works just fine with amdgpu either blacklisted or on 4.14-lts with dc=0.
Comment 17 Andrey Grodzovsky 2018-08-15 22:21:03 UTC
(In reply to taijian from comment #15)
> Created attachment 140733 [details]
> dmesg output 4.18rc5 + drm-fixes-2018-07-20
> 
> OK, so I have some new, probably interesting dmesg output with the latest
> mainline build.
> 
> What's happening here is that the system boots up, then my background
> display brightness service goes to work (see here:
> https://github.com/FedeDP/Clight and here:
> https://github.com/FedeDP/Clightd) and tries to adjust screen brightness.
> This leads to a number of 
> 
>   RIP: 0010:dm_dp_aux_transfer+0xa5/0xb0 [amdgpu]
> 
> trace calls and then the system freeezes completely. And I mean completely,
> as in not even sysrq + REISUB does anything. Does this help in any way?

So I tried with kernel 4.18-rc1 from here - https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

and 2 cards:

Provider 0: id: 0x81 cap: 0x9, Source Output, Sink Offload crtcs: 5 outputs: 3 associated providers: 1 name:AMD Radeon (TM) RX 460 Graphics @ pci:0000:0b:00.0
Provider 1: id: 0x49 cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 4 associated providers: 1 name:AMD Radeon (TM) RX 480 Graphics @ pci:0000:08:00.0

Where the RX 460 is the default and the RX 480 is the secondary. I ran both glxgears and glxinfo multiple times with DRI_PRIME=1 and haven't observed any issues.

From the log I see the "GPU pci config reset" print - where does it come from? Did you trigger a PCI reset for the device manually, or did it happen once you tried to run an application with DRI_PRIME=1? Which device is 0000:01:00.0 - primary or secondary?
Comment 18 taijian 2018-08-16 07:05:54 UTC
(In reply to Andrey Grodzovsky from comment #17)
> (In reply to taijian from comment #15)
> > Created attachment 140733 [details]
> > dmesg output 4.18rc5 + drm-fixes-2018-07-20
> > 
> > OK, so I have some new, probably interesting dmesg output with the latest
> > mainline build.
> > 
> > What's happening here is that the system boots up, then my background
> > display brightness service goes to work (see here:
> > https://github.com/FedeDP/Clight and here:
> > https://github.com/FedeDP/Clightd) and tries to adjust screen brightness.
> > This leads to a number of 
> > 
> >   RIP: 0010:dm_dp_aux_transfer+0xa5/0xb0 [amdgpu]
> > 
> > trace calls and then the system freeezes completely. And I mean completely,
> > as in not even sysrq + REISUB does anything. Does this help in any way?
> 
> So i tried with kernel 4.18 rc.1 from here -
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> 
> and 2 cards 
> 
> Provider 0: id: 0x81 cap: 0x9, Source Output, Sink Offload crtcs: 5 outputs:
> 3 associated providers: 1 name:AMD Radeon (TM) RX 460 Graphics @
> pci:0000:0b:00.0
> Provider 1: id: 0x49 cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs:
> 4 associated providers: 1 name:AMD Radeon (TM) RX 480 Graphics @
> pci:0000:08:00.0
> 
> Where RX 460 is the default and RX 480 is the secondary. I ran both glxgears
> and glxinfo multiple time with DRI_PRIME=1 and haven't observed any issues.
> 
> From the log I see GPU pci config reset print - where does it come from ?
> Did you trigger PCI reset for the device  manually or did it happen once you
> tried to run any application with DRI_PRIME=1 ? Which device is 0000:01:00.0
> - primary or secondary ?

Disclaimer: I'm on vacation and away from my computer right now, so going from memory. 

0000:01:00.0 is probably the RX 470. And the PCI resets happen automatically when invoking an application via DRI_PRIME=1, no manual action necessary.
Comment 19 taijian 2018-08-24 20:48:25 UTC
01:00.0 is indeed the RX470.

$ sudo lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:01.2 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x4) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev c5)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]
3b:00.0 Non-Volatile memory controller: Toshiba America Info Systems XG4 NVMe SSD Controller (rev 01)
3d:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
3e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
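
In case it helps, the card-to-PCI mapping can also be double-checked from sysfs (layout as on my system, card numbering may differ):

  for c in /sys/class/drm/card?; do
      echo "$c -> $(basename "$(readlink -f "$c/device")")"
  done
  lspci -nnk -s 01:00.0   # shows the kernel driver bound to the RX 470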
Comment 20 taijian 2018-08-28 16:48:22 UTC
OK, here is a new twist: upon investigating this further, I came across some weird backlight behaviour that I originally reported here: https://gitlab.gnome.org/GNOME/gnome-settings-daemon/issues/53. However, as it turns out, this is not a GNOME issue but a kernel one.

Namely, doing 

  echo X > /sys/class/backlight/intel_backlight/brightness

will wake my dGPU, even though it really shouldn't be involved at all (and does not have a /sys/class/backlight device registered). Furthermore, repeatedly invoking this command before the dGPU has had the chance to go back to sleep will not do anything. Thirdly, this seems to be a very good way to get the system to hard lock up.

So, is this an amdgpu issue? An i915 issue? ACPI? Where should I take this bug report?
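
For reference, the "wake" is visible in the dGPU's runtime-PM state (PCI address taken from the lspci output above):

  cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status   # "suspended" while idle
  echo 400 > /sys/class/backlight/intel_backlight/brightness   # as root
  cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status   # flips to "resuming"/"active" if the dGPU woke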
Comment 21 Andrey Grodzovsky 2018-08-28 22:14:39 UTC
(In reply to taijian from comment #20)
> OK, here is a new twist: upon further investigating this, I came across some
> weird backlight behaviour that I originally reported here:
> https://gitlab.gnome.org/GNOME/gnome-settings-daemon/issues/53. However, as
> it turns out, this is not a gnome issue but a kernel one.
> 
> Namely, doing 
> 
>   echo X > /sys/class/backlight/intel_backlight/brightness
> 
> will wake my dGPU, even though it really shouldn't be involved at all (and
> does not have a /sys/class/backlight device registered). Furthermore,
> repeatedly invoking this command before the dGPU has had the chance to go
> back to sleep will not do anything. Thirdly, this seems to be a very good
> way to get the system to hard lock up.
> 
> So, is this an amdgpu issue? A i915 issue? ACPI? where should I take this
> bug report?

What do you mean by 'wake' the dGPU? This should be a separate ticket which you can submit here + attach dmesg after this 'wake'.
Comment 22 taijian 2018-08-31 19:21:15 UTC
OK, after some further testing, here is some more information on my issue:

1) It turns out that the crashing/hanging behaviour ONLY happens when I'm logged into a graphical session (GNOME in my case). When working exclusively from a tty, the system remains stable.

2) A Wayland session seems to be slightly more stable/resilient to crashing than an X11 session. While the latter crashes almost immediately when doing anything with the backlight, the former goes through a couple of cycles of extreme lag, stuttering and recovery before finally succumbing to whatever the problem is.

3) I can reliably and reproducibly crash my graphical session by messing with screen brightness in any way - either through the tools of the graphical shell or by doing #echo X > /sys/class/backlight/intel_backlight/brightness.

4) Why is this an amdgpu bug, then? Because the issue only arises when amdgpu is loaded in DC mode, since in this mode the display connectors wired directly to the dGPU (DP, eDP and HDMI) are enumerated (a quick way to check this is sketched after the log below). With dc=0 they are not recognized and therefore cannot create any problems.

5) What I think happens is that the graphical shell tries to adjust brightness on the displays that are enumerated as connectors but not actually connected, and by doing so eventually gets amdgpu to crash irrecoverably.

6) When intel_backlight gets manipulated, the following always shows up in dmesg (the first couple of times the dGPU makes it to the 'reset'; at some point it just crashes before that, taking the system with it).


Aug 31 17:58:54 alien-arch kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Aug 31 17:58:54 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 62 ret is 0 
Aug 31 17:58:54 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:58:55 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 18f ret is 0 
Aug 31 17:58:55 alien-arch kernel: [drm] UVD and UVD ENC initialized successfully.
Aug 31 17:58:55 alien-arch kernel: [drm] VCE initialized successfully.
Aug 31 17:58:56 alien-arch kernel: [drm] Cannot find any crtc or sizes
Aug 31 17:58:57 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 15b ret is 0 
Aug 31 17:58:58 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 155 ret is 0 
Aug 31 17:59:06 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 281 ret is 0 
Aug 31 17:59:07 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:07 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:08 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:08 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:09 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:10 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:10 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:11 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:12 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:12 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:13 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:13 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:14 alien-arch kernel: amdgpu: [powerplay] 
                                    last message was failed ret is 0
Aug 31 17:59:14 alien-arch kernel: amdgpu: [powerplay] 
                                    failed to send message 261 ret is 0 
Aug 31 17:59:14 alien-arch kernel: amdgpu 0000:01:00.0: GPU pci config reset
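
As referenced in point 4, a quick way to compare which connectors get enumerated with dc=1 vs. dc=0 (sysfs layout assumed, card numbering may differ):

  grep . /sys/class/drm/card*-*/status
  # with dc=1 the RX 470's DP/eDP/HDMI connectors should show up here (as "disconnected");
  # with dc=0 they are absent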
Comment 23 taijian 2018-08-31 19:21:47 UTC
Oh, and before I forget - GNOME bug is here: https://gitlab.gnome.org/GNOME/gnome-settings-daemon/issues/53
Comment 24 taijian 2018-09-05 21:46:16 UTC
OK, another update after another test, this time booting into a live system without any graphical shell/login manager.

Here, manipulating /sys/class/backlight/intel_backlight/brightness does nothing to the dGPU, so that part is clearly a user space bug.

However, a simple #lspci DOES crash the system in the manner described above, so that part is still on amdgpu (and likely the root cause of the behaviour described in my OP). Again, this is only an issue with dc=1 and not with dc=0. Possibly related: https://bugzilla.kernel.org/show_bug.cgi?id=156341
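
For completeness, this is how the dc dependence can be reproduced from a plain TTY (the amdgpu.dc= parameter is added to the kernel command line however your bootloader does that):

  # boot once with amdgpu.dc=1 (default) and once with amdgpu.dc=0, then in each case:
  sudo lspci > /dev/null                                      # this alone wakes the dGPU
  dmesg | grep -E 'powerplay|Cannot find any crtc|GPU pci config reset'

With dc=1 this reliably ends in the powerplay errors and the freeze described above; with dc=0 it does not.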
Comment 25 taijian 2018-10-06 19:49:58 UTC
OK, this seems to be fixed on 4.19-rc6+.
