Created attachment 135921 [details]
relevant dmesg output

I am getting repeated GPU resets on 4.15-rc2 with my hybrid laptop (i7-7700HQ/RX 470) with amdgpu.dc=1 set on the command line. Something seems to be off with power management, as the dGPU does not fully power down even when not in use. dmesg attached, lspci follows.
Created attachment 135922 [details] relevant output of #lspci -vv
Created attachment 137103 [details]
dmesg output without amdgpu.dc=1

I'm seemingly hitting this too, on 4.15. Note that I get it with or without amdgpu.dc=1. It might also be relevant that on 4.14 and earlier, I would get kernel oopses on resume from suspend, with a very similar dmesg output (the oops would happen in the same dm_suspend function, with the same stack trace, as far as I can recall). While I don't have any saved dmesg logs from that, I could boot an old kernel to obtain one if it seems useful. I have slightly different hardware (e.g. a Radeon RX 570), but it's also a hybrid laptop.
Probably also worth noting that as of 4.15, my laptop hangs with a black screen on suspend/resume. Magic SysRq seems to be about the only thing the computer responds to if I close and then open the lid. It seems somewhat likely this is related, given my aforementioned kernel oopses on resume and the fact that they had very similar stack traces to these warnings. I'm not sure how to get any relevant logging output from these hangs, though. Any suggestions?
We have a few new patches in our staging trees relating to suspend and driver unload. Would you be able to try amd-staging-drm-next or drm-next-4.17-wip from https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.17-wip and see if the issue is fixed there?
I certainly wouldn't mind. I've never really done anything of the sort before however, so is there an easy way for me to set it up? For reference, I'm on Ubuntu 17.10 and while I know my way around git and compilers fairly well, I've never compiled or installed a custom kernel or kernel module before.
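(For reference, building and installing one of those branches on Ubuntu roughly comes down to the following. This is only a sketch: the clone URL is inferred from the cgit link above, and the package list and the bindeb-pkg target are assumptions about a Debian-based setup, not instructions from this thread.)

$ sudo apt-get install build-essential libncurses-dev libssl-dev bison flex libelf-dev bc
$ git clone -b drm-next-4.17-wip git://anongit.freedesktop.org/~agd5f/linux agd5f-linux
$ cd agd5f-linux
$ cp /boot/config-$(uname -r) .config
$ make olddefconfig
$ make -j$(nproc) bindeb-pkg
$ sudo dpkg -i ../linux-image-*.deb ../linux-headers-*.deb

The new kernel then shows up as an additional GRUB entry on the next boot.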
I'll give 4.17-wip a whirl, cloning and building right now.
Created attachment 137643 [details]
snippet from make protocol

4.17-wip does not compile for me, see the attached error snippet.
(In reply to taijian from comment #7)
> Created attachment 137643 [details]
> snippet from make protocol
>
> 4.17-wip does not compile for me, see the attached error snippet.

Fixed.
Created attachment 137646 [details]
dmesg with kernel amd-staging-4.17-wip-c6bcaec0aa3e

OK, so on boot everything seems fine and dandy, fewer error messages than before. However, please note the 213.xxx and 480.xxx marks: this is where I actually use the dGPU (with DRI_PRIME=1 glxinfo). Something is clearly not right here...

Also, power consumption as per powertop has not improved over 4.15/16; the dGPU clearly does not fully power down when not in use (~25W vs ~12W with amdgpu blacklisted).
Created attachment 137647 [details]
dmesg with kernel amd-staging-4.17-wip-c6bcaec0aa3e

Accidentally uploaded the truncated version... PEBKAC...
Addendum: what DOES work now without crashing the system is dynamically enabling and disabling the dGPU via

# echo "\\_SB.PCI0.PEG0.PEGP._{OFF|ON}" > /proc/acpi/call

So that is definitely progress!
(In reply to taijian from comment #11)
> Addendum: What DOES work now without crashing the system is dynamically en-
> and disabling the dGPU via
>
> # echo "\\_SB.PCI0.PEG0.PEGP._{OFF|ON}" > /proc/acpi/call
>
> So that is definitely progress!

Are you messing with that while the driver is loaded? Doing so will cause problems because you're changing the hw state behind the driver's back.
(In reply to Alex Deucher from comment #12)
> Are you messing with that while the driver is loaded? Doing so will cause
> problems because you're changing the hw state behind the driver's back.

Yeah, I know. The problem is that I cannot unload the driver, because modprobe always complains about it being in use. However, I found that with 4.17-wip the system actually remains stable after disabling the hardware, as long as I don't try to use it - which was not the case before, when it would crash much faster.

But seriously, what I'm doing is that I have two boot entries, one that blacklists amdgpu and does the acpi_call thingy and one that doesn't. So one for mobile usage, one for gaming. It would be nice, though, if I did not need to reboot to switch between the two use cases...
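(A sketch of the two-entry setup described above, for anyone wanting to replicate it; the use of modprobe.blacklist and the acpi_call module are assumptions, not taken verbatim from this thread. The "mobile" GRUB entry gets an extra kernel parameter:

    modprobe.blacklist=amdgpu

and after boot the dGPU is switched off via acpi_call:

# modprobe acpi_call
# echo "\\_SB.PCI0.PEG0.PEGP._OFF" > /proc/acpi/call

The "gaming" entry simply boots without the blacklist parameter, leaving amdgpu loaded.)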
Created attachment 137651 [details] [review]
patch to test

Can you try this patch and append amdgpu.force_atpx=1 to the kernel command line in grub? Does it help?
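(On Ubuntu-style systems, appending the option would look roughly like this; the file path is the usual GRUB default and an assumption, not from this thread. In /etc/default/grub:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.force_atpx=1"

then regenerate the config and reboot:

$ sudo update-grub

Afterwards, $ cat /proc/cmdline should show the new option.)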
(In reply to Alex Deucher from comment #14)
> Can you try this patch and append amdgpu.force_atpx=1 to the kernel command
> line in grub? Does it help?

Nope. To be precise: after invoking 'DRI_PRIME=1 glxinfo', the system crashed before I was able to run dmesg (as in: the screen froze and it became completely unresponsive). So that was not really an improvement.

I'm gonna go ahead and try the latest git head from that repo now; before, I had only applied your patch on top of c6bcaec0aa3e.
Created attachment 137668 [details]
dmesg with kernel amd-staging-4.17-wip-d0d4f398ddc0

Here is the new dmesg with all of today's patches applied (up to d0d4f398ddc0). The usual (DRI_PRIME=1 glxinfo) happens at 66.xxx - and this time around, the system did not crash!
And that was premature... The second invocation of DRI_PRIME=1 glxinfo does crash the system (as in: total freeze, nothing doing except for a hard reboot).

The system starts up OK, power consumption is way down compared to before (~11W), DRI_PRIME=1 glxinfo triggers the dGPU on, then it turns off again. Everything fine so far. But the second invocation causes the system to die, and this is consistent over several attempts now. So we are making progress, but are not quite there yet...
Adding to the above comment after some further testing: I rebuilt earlier today with HEAD at c2637e788da7. Startup went well, error messages were as before. I then launched Steam, which seemed to power up the dGPU for a short time before it turned off again. So far, so good.

I then started a gaming session using the dGPU. Performance was about even with 4.16-rc3 without DC. There were some artifacts/stuttering when scrolling, but I read on Phoronix that DC likely works best with Mesa 18.0/LLVM 6, which are not out yet. So nothing really worrying.

However, when I quit the game and wanted to do another dmesg dump to look at and upload here, the computer froze again and became completely unresponsive before I could get to that. So it seems that the problem is not with the dGPU powering up for the second time, but with it powering down for the second time in a session. Any ideas why that might be, or what kinds of tests I could do to help narrow this down?
Were you using the patch and module option from comment 14?
(In reply to Alex Deucher from comment #19)
> were you using the patch and module option from comment 14?

I was using the drm-next-4.17-wip branch and pulled from commit c2637e788da7a1cd535185e253ecc267b0410aff (current HEAD). I am fairly certain that this does contain the patch you mentioned, amongst other things. Kernel command line options used were amdgpu.dc=1 and amdgpu.force_atpx=1.
Now testing the same build, but with amdgpu.dc=0 amdgpu.force_atpx=1.

Here the system does not crash on repeated invocations of the dGPU (via DRI_PRIME=1 glxinfo), but it also does not fully power down, as confirmed by powertop (26-28W power consumption, vs ~12W with dc=1 and the dGPU not in use).

So the option to force ATPX apparently does nothing without DC. Is that intended? (Sorry, I don't speak C, so I really cannot tell...)
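(For what it's worth, the dGPU's runtime power state can also be read directly instead of inferred from powertop; the card index and PCI addresses below are illustrative assumptions and will differ per system:

$ cat /sys/class/drm/card1/device/power/runtime_status
suspended
# cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0

"suspended"/"DynOff" mean the dGPU has runtime-suspended; "active"/"DynPwr" mean it is still powered.)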
(In reply to taijian from comment #21)
> So the option to force atpx apparently does nothing without dc. Is that
> intended? (Sorry, I don't speak C, so I really cannot tell...)

Please attach your dmesg output. DC or not-DC shouldn't affect the runtime pm.
Created attachment 137732 [details]
dmesg with dc=1

Here is the first dmesg output. This was generated with amdgpu.dc=1 and the command "DRI_PRIME=1 glxgears || dmesg > dmesg.dc1.post". Shortly after writing the file, the computer then crashed again, so I was not able to collect any more debugging data.
Created attachment 137733 [details]
dmesg with dc=0

This is booting with dc=0. Again, the command run was "DRI_PRIME=1 glxgears || dmesg > dmesg.dc0.post". Interestingly, this time the computer did not crash, and there is a message about a GPU crash dump in there, which I unfortunately only saw after I rebooted. I'll do it again to get that and then file a bug against DRI/Intel here, as suggested by dmesg.
Created attachment 137816 [details]
dmesg with 4.17-wip-7f462340284582c0180384c046ddd6dda03888b1 and dc=1

I have rebuilt with the latest commits to 4.17-wip (up to 7f462340284582c0180384c046ddd6dda03888b1) and lo and behold: I was able to run my usual DRI_PRIME=1 glxgears *without* the system crashing afterwards. So that is quite a bit of progress. PM also seems to work, as battery drain is around 11W while composing this message, and noticeably higher only when explicitly invoking the dGPU. Will test with dc=0 soon.
Created attachment 137817 [details]
dmesg with 4.17-wip-7f462340284582c0180384c046ddd6dda03888b1 and dc=1

OK, so maybe my enthusiasm was premature. I decided to do a second invocation of the dGPU within the same session, and this time it did crash shortly thereafter. But I was able to get some dmesg output first. Notice that on the second invocation there are a bunch of snd_hda_intel bug messages; maybe those have something to do with it?
Created attachment 137825 [details]
dmesg with 4.17-wip-7f462340284582c0180384c046ddd6dda03888b1 and dc=0

Here is the promised output with dc=0. Interesting thing happening here: on system startup, the dGPU stays powered off until explicitly invoked. After that, the system does not crash, and the GPU dump from before seems to be gone. However, the dGPU also refuses to fully power down again and can be seen continuing to consume power via powertop. So dc=0 vs dc=1 continues to have an impact wrt PM.
Created attachment 137852 [details]
dmesg with 4.17-wip-4ac51159819d and dc=1

And here is my daily testing report (do you guys actually read them anymore?). Build 4ac51159819d seems to be a clear regression: dc=1 does not even boot - after entering my LUKS passphrase, the screen just freezes, and that is that. dc=0 does boot and does not crash when running DRI_PRIME=1 glxgears; however, glxgears is no longer synced to my monitor's refresh rate, for some reason (48 vs 60 FPS). Also, PM still does not work wrt powering down the dGPU.
Created attachment 137979 [details]
dmesg with 4.17-wip-d1eeebbd78fd and dc=1

OK, I've gotten around to doing some more testing and playing around with some more kernel command line settings, in order to better debug my crashing issues. Here's what I found.

1) I believe that the crashing issues I kept having with force_atpx=1 and dc=1 after repeated or longer-lasting invocations of the dGPU might have been because of the shoddy ACPI/UEFI implementation of my laptop vendor. I have now tried setting acpi.osi=!* acpi.osi='Windows 2015' and voilà, the crashing has vanished. So that seems to be a problem on that end, unfortunately improbable to ever get properly fixed... But at least there is a workaround, and the problem does not appear to be with your code!

2) The inconsistent DPM behaviour between dc=1 and dc=0 is still a thing. force_atpx=1 plus dc=1 gives me wonderful DPM with the dGPU fully powering down between uses. force_atpx=1 plus dc=0 does not. No idea what's up with that.

3) There is a new bug, probably introduced by this commit: https://cgit.freedesktop.org/~agd5f/linux/patch/?id=7c8c32854566dd9fb2ad4108029670604bc77b19. On hybrid GPU laptops, amdgpu does not actually control the brightness of the internal screen, because that is controlled by the iGPU. So amdgpu should probably not try. Maybe put a check in there? There is a crash related to this issue in the attached dmesg output; look starting at [10.234751].

I hope this helps and isn't too late for a last-minute pull to drm-next for 4.17, because I'd really like to see the force_atpx option in there!
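(For anyone wanting to reproduce the workaround from point 1: on GRUB-based systems the OSI override goes on the kernel command line like the other options, e.g. in /etc/default/grub - the exact quoting shown here is an assumption, since nesting quotes there is fiddly:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.dc=1 amdgpu.force_atpx=1 acpi.osi=!* acpi.osi=\"Windows 2015\""

acpi.osi=!* clears the kernel's list of advertised _OSI strings, and the second option adds "Windows 2015" back, so the firmware only takes its Windows 10 code paths.)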
(In reply to taijian from comment #29)
> 3) There is a new bug, probably introduced by this commit:
> https://cgit.freedesktop.org/~agd5f/linux/patch/?id=7c8c32854566dd9fb2ad4108029670604bc77b19.
> On hybrid gpu laptops amdgpu does not actually control the brightness of the
> internal screen, because that is controlled by the iGPU. So amdgpu should
> probably not try. Maybe put a check in there?

The patch you're referencing only applies to the legacy display code (dc=0). The stack trace you show is with the DC display code (dc=1).

This probably happens because we advertise a backlight device if we have an eDP connector available, even if it's disconnected. See https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=drm-next-4.17-wip#n3601 for more info.

Unfortunately I can't comment on the atpx stuff.

Harry
(In reply to Harry Wentland from comment #30)
> This probably happens because we advertise a backlight device if we have an
> eDP connector available, even if it's disconnected. See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=drm-next-4.17-wip#n3601
> for more info.

OK, that makes sense, even if it is unfortunate. I'm fairly certain that my HDMI and DP connectors are connected to the dGPU, even if I never use them. So it makes sense that a backlight device gets created for that, I guess. Too bad that the code comment says there is apparently no way to check whether it is actually needed.
(In reply to taijian from comment #29)
> 2) The inconsistend dpm behaviour between dc=1 and dc=0 is still a thing.
> force_atpx=1 plus dc=1 gives me wonderful dpm with the dGPU fully powering
> down between uses. force_atpx=1 plus dc=0 does not. No idea what's up with
> that.

The platform claims to have an eDP panel connected, apparently even in hybrid graphics mode when it's not actually in use. The non-DC code assumes that if there is an eDP panel, it's always connected, since it's a non-removable display and it shouldn't be present if there isn't one. Because the driver thinks a display is connected, the dGPU is kept on.
Created attachment 138027 [details] [review]
[PATCH] Only register backlight device if embedded panel connected

Can you see if this helps with the backlight device registration?
Created attachment 138028 [details] [review]
[PATCH] Only register backlight device if embedded panel connected

Use this instead.
Created attachment 138032 [details]
dmesg output with the patch from comment #34

OK, I've recompiled the kernel with v2 of your patch (comment #34) and tested for three power cycles. The problem with the backlight seems to be gone, so presumably the patch did what you intended it to do? Also, so far the system seems stable; the dGPU powers up AND down nicely.
Also, here are my backlight devices:

[gunnar@alien-arch ~]$ ls /sys/class/backlight
total 0
drwxr-xr-x  2 root root 0 Mar 12 18:05 .
drwxr-xr-x 62 root root 0 Mar 12 18:05 ..
lrwxrwxrwx  1 root root 0 Mar 12 18:05 intel_backlight -> ../../devices/pci0000:00/0000:00:02.0/drm/card0/card0-eDP-1/intel_backlight
[gunnar@alien-arch ~]$
And suspend-to-RAM and wake-up also work! That was also kinda buggy for me ever since 4.15...
Thanks, taijian. I'll queue the patch up for merge.
(In reply to Harry Wentland from comment #38)
> Thanks, taijian. I'll queue the patch up for merge.

Hey, thank YOU for fixing this!
OK, I hate to be necrobumping this, but I've continued to follow and test drm-next-4.17-wip, and it seems that not only has the backlight patch from #34 not made it in yet, but the forced ATPX option from #14 has also disappeared after all of the Vega 12 additions. Soooo, are these things going to come back?
(In reply to taijian from comment #40)
> ... it seems that not only has the backlight patch from #34 not made it in
> yet, but also that the forced atpx option from #14 has also disappeared
> after all of the vega12 additions. Soooo, are these things going to come
> back?

Sorry, my impression from comment 29 was that messing with the OSI string fixed the issue. Is that not the case? Do you still need to force ATPX? If so, we can just add a quirk so the driver does it automatically.
(In reply to Alex Deucher from comment #41)
> Sorry, my impression from comment 29 was that messing the the OSI string
> fixed the issue. Is that not the case? Do you still need to force ATPX?

Ah, sorry if that was unclear. Messing with the OSI string did help with the intermittent crashing I was experiencing. The backlight patch by Harry in #34 worked even better, though.

Forcing ATPX is still needed for the dGPU to properly power down when not in use. Messing with the OSI string does absolutely nothing in that regard; only forcing ATPX helps with that. So yes, either the option to force this behaviour via the kernel command line or a driver quirk to do this would be very much appreciated!
Created attachment 138267 [details] [review]
add an ATPX quirk

Does this patch work?
Created attachment 138278 [details]
dmesg with 4.17-wip-b6356df3eb9a with aptx.patch

OK, the patch works; dmesg reports that ATPX is force-activated. Power usage is properly down when not in use. I did one power cycle of the dGPU in this test.
Created attachment 138279 [details]
dmesg with 4.17-wip-b6356df3eb9a with aptx.patch and backlight.patch

Here is the dmesg output with both patches applied, both the ATPX quirk and the backlight patch from comment 34. The system certainly feels more stable now, and the dGPU stays powered down more reliably, because it does not try to muck about with the backlight (which it can't affect anyway...).
The backlight patch has been stuck in our submission queue for a while but it's in our internal repo. It should be part of the next set of DC patches, although I don't think it'll make the 4.17 kernel, unless we want to treat it as a bugfix (one could argue either way here).
(In reply to Harry Wentland from comment #46)
> The backlight patch has been stuck in our submission queue for a while but
> it's in our internal repo. It should be part of the next set of DC patches,
> although I don't think it'll make the 4.17 kernel, unless we want to treat
> it as a bugfix (one could argue either way here).

Well, as an affected user, I'd of course argue that they are both clearly bug fixes, because to me that's their effect. No idea how much weight that opinion is going to carry with David Airlie, though...
I managed to compile and test drm-next-4.17-wip, commit a611dd16c69025b6df115427af0a5d63ae9f5145.

I can confirm that I no longer see the amdgpu crash on boot. However, suspend/resume still hangs my laptop.

So I'm not sure if my suspend/resume hangs are related at all to this particular bug, but I was hoping to get some help in figuring this out. E.g., is there any way I can get some useful logging information out of a hung session? In particular, suspend seems to work, as I can hear the fans etc. turn off when I close the lid. If I then open the lid and wait for a bit, they start up again, but my screen remains blank.
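(Two generic techniques for getting data out of a hard hang - general suggestions, not specific to this bug. With persistent journalling enabled (Storage=persistent in /etc/systemd/journald.conf), the previous boot's kernel log survives the reboot:

$ journalctl -k -b -1 | tail -n 100

And if the kernel is built with CONFIG_PM_TRACE_RTC, PM trace can point at the device being resumed when the hang occurred; note that it deliberately scrambles the RTC clock:

# echo 1 > /sys/power/pm_trace
  ... suspend, hang, hard reboot ...
$ dmesg | grep -i 'hash matches')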
(In reply to Bjorn from comment #48)
> So not sure if my suspend/resume hangs are related at all to this particular
> bug, but I was hoping to get some help in figuring this out.

I had similar problems in the past, but with the latest version of drm-next-4.17-wip (09695ad78f1f) PLUS the backlight patch in comment 34 applied on top of that, my system is now very nice and stable! So maybe try that out and let us know how it goes?
That did not help. However, looking closer at my dmesg output on the 4.17-wip branch, I see that there's actually some error loading the Intel graphics firmware. In particular, there are these lines:

[    1.806186] i915 0000:00:02.0: Direct firmware load for i915/kbl_dmc_ver1_04.bin failed with error -2
[    1.806188] i915 0000:00:02.0: Failed to load DMC firmware i915/kbl_dmc_ver1_04.bin. Disabling runtime power management.

This definitely seems to imply the Intel card is having some power management issue. So perhaps my problem was never with the AMD GPU at all... In any case, the original amdgpu crash is fixed for me too, whether it impacted my suspend/resume or not.
(In reply to Bjorn from comment #50)
> [    1.806186] i915 0000:00:02.0: Direct firmware load for
> i915/kbl_dmc_ver1_04.bin failed with error -2
> [    1.806188] i915 0000:00:02.0: Failed to load DMC firmware
> i915/kbl_dmc_ver1_04.bin. Disabling runtime power management.

OK, I'm not seeing that at all, and I also have a Kaby Lake chip. What version of linux-firmware are you using?
(In reply to taijian from comment #51)
> OK, I'm not seeing that at all, and I also have a Kaby Lake chip. What
> version of Linux-Firmware are you using?

1.169.3. I tried going back to an older kernel, and then I no longer see the kbl firmware error (but obviously the amdgpu crash is there again). So perhaps I need to update the firmware package too, to get everything working?