Over on Bug 104064 Alex Deucher and Harry Wentland had heroically assisted me in finally fixing dGPU auto-suspend on my HG laptop (Intel iGPU + AMD dGPU). I was really looking forward to their work going mainline in 4.16 and me being able to enjoy my newly working system. Unfortunately, some other commit into 4.16 broke things again. After overcoming my initial frustration and doing some research, I was now able to track this down to the work by Lukas Wunner on vga_switcheroo.
I am doing my debugging with drm-next-4.18-wip over at https://cgit.freedesktop.org/~agd5f/linux because 4.16 & 4.17 have one other bug that 4.18 doesn't (Bug 105760) that prevent me from using those. Simply compiling drm-next-4.18-wip will result in a kernel that cannot dynamically power down the dGPU. It remains up regardless of use status and draws power.
Now if I revert commits 07f4f97d7b4bf325d9f558c5b58230387e4e57e0, b67ae78efae0d5be5d9c7a507e67cd02971b32e1 and 2b4f44eec2be2688511c2b617d0e1b4f94c45ba4 and recompile, then the resulting kernel will once again fully power down my dGPU when it is not in use.
Now, if I can assist in any way with further debugging or testing patches to figure out what specifically went wrong with those patches, I'd be happy to do so, just let me know!
Does reverting the ATPX quirk from bug 105760 also fix the issue? I wonder if Lukas' patch fixed the underlying issue and the ATPX workaround is no longer necessary.
Yeah, I tried that, but unfortunately that did not work.
Pretty sure there's an issue with the HDA controller on the GPU that prevents the GPU from suspending. Can you attach dmesg output of a drm-next-4.18-wip kernel? There's dmesg output in the bug you've linked but I'm not sure if that's with or without my patches?
What does the following show?
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status # GPU
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_usage # GPU
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_active_kids # GPU
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status # HDA
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_usage # HDA
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_active_kids # HDA
Ok I found this dmesg ouput of a 4.17 kernel...
However the HDA device 0000:01:00.1 is not enumerated! It *is* enumerated in the 4.16 dmesg output you've attached. What gives?
We know there are Nvidia GPUs which hide the HDA controller if no external display is attached:
Maybe something like that is going on here despite this being an AMD GPU?
Also, there are Dell machines which can be switched to one of two modes in the BIOS: In one mode, audio is streamed to external displays via the Intel HDA, in the other mode, it's streamed via the AMD HDA. In the former mode, the AMD HDA serves no purpose and should thus be hidden as per this patch:
It's possible that something similar is needed here.
Ah, never mind, the 4.17 output was attached by Thomas Martitz. So without 4.17 dmesg output of your machine it's pretty hard to tell what's going on.
see also: https://bugzilla.kernel.org/show_bug.cgi?id=199693
Created attachment 139673 [details]
dmesg with 4.18-wip 'vanilla-as-is'
OK, here come three dmesg outputs - first one is 'vanilla' drm-next-4.18-wip, the second one is with the atpx quirk reversed, the third one is with the three vga_switcheroo patches I mentioned in the original bug report reversed. That third one is also the only one where my dGPU properly powers down.
Created attachment 139674 [details]
dmesg with 4.18-wip 'atpx-quirk reversed'
Here the second dmesg - forced atpx method quirk reversed.
Created attachment 139675 [details]
dmesg with 4.18-wip 'vga_switcheroo-patches reversed'
Here the third dmesg output - this one with the vga_switcheroo patches reversed and working power management.
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_usage
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_active_kids
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_usage
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_active_kids
All of these are with the 'vanilla' kernel.
Hm, the first 4.17 output does not show anything suspicious. Could you also post the output of:
$ cat /sys/bus/pci/devices/0000:01:00.0/power/control
$ cat /sys/bus/pci/devices/0000:01:00.1/power/control
again, this is with the 'vanilla' kernel.
(In reply to Lukas Wunner from comment #5)
> Also, there are Dell machines which can be switched to one of two modes in
> the BIOS: In one mode, audio is streamed to external displays via the Intel
> HDA, in the other mode, it's streamed via the AMD HDA. In the former mode,
> the AMD HDA serves no purpose and should thus be hidden as per this patch:
Well, this is a Dell machine, but there is no BIOS option to switch anything regarding video/audio. However, it is possible that there is something wonky about how the chip is wired, because all other Alienware laptops do sport NVIDIA gpus and it could be that the wiring is the same.
(In reply to Alex Deucher from comment #7)
> see also: https://bugzilla.kernel.org/show_bug.cgi?id=199693
I tried that patch, but am not seeing any difference.
Created attachment 139688 [details] [review]
drm-next-4.18-wip debug patch 1
So, the HDA device's usage counter is 0, it has no active children and runtime PM is allowed. Still, it's runtime active. Why is it not suspending? Here's a little patch which litters relevant code with debug printks, hopefully this will get us closer to the answer. Could you apply this on top of drm-next-4.18-wip and attach the resulting dmesg output?
Oh and could you also try:
... to double check that runtime PM is enabled?
Created attachment 139692 [details]
dmesg with the debug symbols requested by Lukas
This is the 'vanilla' kernel with the extra patch for the debug symbols added in.
Also, as requested:
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
(In reply to Lukas Wunner from comment #16)
> Created attachment 139688 [details] [review] [review]
> drm-next-4.18-wip debug patch 1
> So, the HDA device's usage counter is 0, it has no active children and
> runtime PM is allowed. Still, it's runtime active. Why is it not suspending?
Well, AFAIK the both the HDMI port and the DP on my laptop are directly attached to the dGPU and not to the iGPU like the internal panel is. Maybe it has something to do with that?
Created attachment 139706 [details] [review]
drm-next-4.18-wip debug patch 2
Okay at 10.574564, the HDA controller tries to runtime suspend but returns -EBUSY (-16) because the child HDA codec hdaudioC1D0 is still probing. After the child is done with probing at 10.586551, it informs the HDA controller that it may now be idle. However for some reason rpm_idle() doesn't cause the controller to be suspended.
Here's another debug patch which also litters the ->runtime_idle() path with debug printks. Please try this and attach the resulting dmesg.
> Well, AFAIK the both the HDMI port and the DP on my laptop are directly attached
> to the dGPU and not to the iGPU like the internal panel is. Maybe it has
> something to do with that?
You've got two HDMI codecs in your machine, hdaudioC1D0 on the AMD GPU and hdaudioC0D2 in the Intel PCH. It's unclear to me which one is used for external displays.
> Also, as requested:
> $ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
My apologies, I meant runtime_enabled, however it does seem to be enabled, otherwise the idle/suspend machinery wouldn't work. Thanks!
Created attachment 139723 [details]
dmesg with the debug symbols requested by Lukas, version 2
Here you go with another round of dmesg output. Also:
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_enabled
I am fairly certain that the external ports on my machine are wired to the dPGU, however, neither of them are in use.
Okay so the reason the HDA controller doesn't suspend is because the codec_powered bitmask is not null:
snd_hda_intel 0000:01:00.1: azx_runtime_idle: !power_save_controller = 0, !azx_has_pm_runtime(chip) = 0, azx_bus(chip)->codec_powered = 0x1, !chip->running = 0
The lowest bit is set. In other words, the single codec on that HDA bus is considered powered on, so the HDA controller is prevented from suspending.
Bits in the codec_powered bitmask are set or cleared using snd_hdac_codec_link_up() and snd_hdac_codec_link_down(), which are only called during runtime suspend and resume of the codec.
Looking at hda_codec_runtime_suspend(), we can see that the bit is only cleared conditionally:
if (codec_has_clkstop(codec) && codec_has_epss(codec) &&
(state & AC_PWRST_CLK_STOP_OK))
The codec_has_clkstop() and codec_has_epss() macros just query bits in the power_caps bitmask. That bitmask can also be retrieved from sysfs, could you please post the output of:
grep . /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/*/power_caps
I'll prepare another debug patch so that we get a better understanding why the codec isn't powered down.
We've either got a bug in the HDA code here or we're missing a quirk for this specific HDA controller. Sooner or later we'll have to pull in Takashi Iwai (hda maintainer) and/or AMD engineers responsible for the HDA portion of their GPUs.
Previously we just papered over such issues by forcing the HDA controller off when the GPU runtime suspends, but since my commit 07f4f97d7b4b such bugs are now exposed as we deliberately keep the GPU awake as long as the HDA controller is in use. Thanks for helping identify and tracking down this issue.
Hey, thank you guys for working on fixing this - I am kinda the one who stands to benefit if this gets fixed, so...
$ grep . /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/*/power_caps
Created attachment 139735 [details] [review]
drm-next-4.18-wip debug patch 3
Here's another debug patch to understand why the HDA codec refuses to power down. I've removed some of the earlier debug printks as they're no longer needed, so dmesg output should be shorter than before, still you may want to add log_buf_len=10M to your command line to avoid losing messages.
(In reply to taijian from comment #22)
> $ grep . /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/*/power_caps
According to the HDA specification r1.0a, page 209
(https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/high-definition-audio-specification.pdf) this decodes as:
Hm, looks fine at first glance. I think I'll need dmesg output with the 3rd debug patch to understand what's going wrong.
Created attachment 139736 [details]
dmesg with the debug symbols requested by Lukas, version 3
And another round of dmesg output. Enjoy!
So hda_call_codec_suspend() tries to set the codec's power state to D3 by calling hda_set_power_state(). The return value of that is the 32 bit response detailed on page 151 ff. of the HDA spec. hda_codec_runtime_suspend() then checks if the PS-ClkStopOk bit (bit 9) is set. Only then does it allow suspend of the codec.
That bit is not set. It's also not set for the other two codecs of your machine (in the Intel HDA controller).
And there's another oddity, the 32 bit response should contain the actual power state in bits 7:4, but those bits are always cleared, for all codecs and regardless whether the codec was put into D0 or D3. In other words, the codecs all remain in D0 even if we tell them to go to D3.
Moreover, bits 0:3 of the response should contain the last power state set, but those bits are always either set to D0 or D1. But according to the power capabilities of the AMD HDA controller, it only supports D0 and D3, not D1. So this looks totally bogus.
Let me test the debug patch on my own machine and check what the response looks like there.
Okay the real culprit seems to be commit 3b5b899ca67d ("ALSA: hda: Make use of core codec functions to sync power state"):
If you revert that, does the issue go away?
I can see on my MacBook Pro that the return value is consistently 0x233 without this commit when setting the power state to D3 (which is the correct return value), and 0x1 with the commit. The commit went into 4.17. Given how late we are in this cycle, I think a revert is in order. The commit breaks runtime suspend for any HDA controller. Why the hell wasn't this discovered before, this must have been in linux-next for > 4 months!?
Maybe a revert isn't necessary, I think I've spotted the problem:
+static inline bool snd_hda_sync_power_state(struct hda_codec *codec,
... should be unsigned int.
So a patch to try would look like:
-static inline bool snd_hda_sync_power_state(struct hda_codec *codec,
+static unsigned int snd_hda_sync_power_state(struct hda_codec *codec,
Created attachment 139739 [details] [review]
patch that seems to fix things for me
OK, going with Lukas' discovery, I have rebuild my kernel from drm-next-4.18-wip, applying the attached patch, reversing the atpx quirk discussed earlier, and also applying the debug patch (dmesg comes in the next post).
And voilà: It seems to be fixed, AFAIK! YAY!
Created attachment 139740 [details]
the promised dmesg output with the patch
(In reply to taijian from comment #30)
> Created attachment 139739 [details] [review] [review]
> patch that seems to fix things for me
> OK, going with Lukas' discovery, I have rebuild my kernel from
> drm-next-4.18-wip, applying the attached patch, reversing the atpx quirk
> discussed earlier, and also applying the debug patch (dmesg comes in the
> next post).
> And voilà: It seems to be fixed, AFAIK! YAY!
Just to be clear, it works fine without the ATPX quirk if you apply the HDA fix? Should I revert the ATPX quirk?
(In reply to Alex Deucher from comment #32)
> (In reply to taijian from comment #30)
> > Created attachment 139739 [details] [review] [review] [review]
> > patch that seems to fix things for me
> > OK, going with Lukas' discovery, I have rebuild my kernel from
> > drm-next-4.18-wip, applying the attached patch, reversing the atpx quirk
> > discussed earlier, and also applying the debug patch (dmesg comes in the
> > next post).
> > And voilà: It seems to be fixed, AFAIK! YAY!
> Just to be clear, it works fine without the ATPX quirk if you apply the HDA
> fix? Should I revert the ATPX quirk?
Yes, as you suggested earlier in the thread, I reversed the atpx quirk patch when applying the hda fix, just to see if you were right that Lukas' vga_switcheroo work made that obsolete. And from my testing it does seem to be obsolete, so if no one else is objecting, please feel free to revert it.
Fixed with commit 009f8c90f571 ("ALSA: hda - Fix runtime PM") which is now queued for 4.17-rc7: