Bug 106597

Summary: [vga_switcheroo] commit 07f4f97d7b4bf325d9f558c5b58230387e4e57e0 breaks dpm on Alienware 15R3
Product: DRI Reporter: taijian
Component: DRM/otherAssignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: lukas
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg with 4.18-wip 'vanilla-as-is'
none
dmesg with 4.18-wip 'atpx-quirk reversed'
none
dmesg with 4.18-wip 'vga_switcheroo-patches reversed'
none
drm-next-4.18-wip debug patch 1
none
dmesg with the debug symbols requested by Lukas
none
drm-next-4.18-wip debug patch 2
none
dmesg with the debug symbols requested by Lukas, version 2
none
drm-next-4.18-wip debug patch 3
none
dmesg with the debug symbols requested by Lukas, version 3
none
patch that seems to fix things for me
none
the promised dmesg output with the patch none

Description taijian 2018-05-21 20:36:17 UTC
Over on Bug 104064 Alex Deucher and Harry Wentland had heroically assisted me in finally fixing dGPU auto-suspend on my HG laptop (Intel iGPU + AMD dGPU). I was really looking forward to their work going mainline in 4.16 and me being able to enjoy my newly working system. Unfortunately, some other commit into 4.16 broke things again. After overcoming my initial frustration and doing some research, I was now able to track this down to the work by Lukas Wunner on vga_switcheroo.

I am doing my debugging with drm-next-4.18-wip over at https://cgit.freedesktop.org/~agd5f/linux because 4.16 & 4.17 have one other bug that 4.18 doesn't (Bug 105760) that prevent me from using those. Simply compiling drm-next-4.18-wip will result in a kernel that cannot dynamically power down the dGPU. It remains up regardless of use status and draws power.

Now if I revert commits 07f4f97d7b4bf325d9f558c5b58230387e4e57e0, b67ae78efae0d5be5d9c7a507e67cd02971b32e1 and 2b4f44eec2be2688511c2b617d0e1b4f94c45ba4 and recompile, then the resulting kernel will once again fully power down my dGPU when it is not in use. 

Now, if I can assist in any way with further debugging or testing patches to figure out what specifically went wrong with those patches, I'd be happy to do so, just let me know!
Comment 1 Alex Deucher 2018-05-21 20:59:55 UTC
Does reverting the ATPX quirk from bug 105760 also fix the issue?  I wonder if Lukas' patch fixed the underlying issue and the ATPX workaround is no longer necessary.
Comment 2 taijian 2018-05-21 21:17:25 UTC
Yeah, I tried that, but unfortunately that did not work.
Comment 3 Lukas Wunner 2018-05-21 21:26:42 UTC
Pretty sure there's an issue with the HDA controller on the GPU that prevents the GPU from suspending. Can you attach dmesg output of a drm-next-4.18-wip kernel? There's dmesg output in the bug you've linked but I'm not sure if that's with or without my patches?
Comment 4 Lukas Wunner 2018-05-21 21:34:17 UTC
What does the following show?

cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status        # GPU
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_usage         # GPU
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_active_kids   # GPU
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status        # HDA
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_usage         # HDA
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_active_kids   # HDA
Comment 5 Lukas Wunner 2018-05-21 21:47:32 UTC
Ok I found this dmesg ouput of a 4.17 kernel...
https://bugs.freedesktop.org/attachment.cgi?id=139668

However the HDA device 0000:01:00.1 is not enumerated! It *is* enumerated in the 4.16 dmesg output you've attached. What gives?

We know there are Nvidia GPUs which hide the HDA controller if no external display is attached:
https://bugs.freedesktop.org/show_bug.cgi?id=75985

Maybe something like that is going on here despite this being an AMD GPU?

Also, there are Dell machines which can be switched to one of two modes in the BIOS: In one mode, audio is streamed to external displays via the Intel HDA, in the other mode, it's streamed via the AMD HDA. In the former mode, the AMD HDA serves no purpose and should thus be hidden as per this patch:
https://lkml.org/lkml/2018/4/20/244

It's possible that something similar is needed here.
Comment 6 Lukas Wunner 2018-05-21 22:03:46 UTC
Ah, never mind, the 4.17 output was attached by Thomas Martitz. So without 4.17 dmesg output of your machine it's pretty hard to tell what's going on.
Comment 7 Alex Deucher 2018-05-21 22:43:00 UTC
see also: https://bugzilla.kernel.org/show_bug.cgi?id=199693
Comment 8 taijian 2018-05-22 08:16:58 UTC
Created attachment 139673 [details]
dmesg with 4.18-wip 'vanilla-as-is'

OK, here come three dmesg outputs - first one is 'vanilla' drm-next-4.18-wip, the second one is with the atpx quirk reversed, the third one is with the three vga_switcheroo patches I mentioned in the original bug report reversed. That third one is also the only one where my dGPU properly powers down.
Comment 9 taijian 2018-05-22 08:18:18 UTC
Created attachment 139674 [details]
dmesg with 4.18-wip 'atpx-quirk reversed'

Here the second dmesg - forced atpx method quirk reversed.
Comment 10 taijian 2018-05-22 08:19:24 UTC
Created attachment 139675 [details]
dmesg with 4.18-wip 'vga_switcheroo-patches reversed'

Here the third dmesg output - this one with the vga_switcheroo patches reversed and working power management.
Comment 11 taijian 2018-05-22 08:22:19 UTC
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
active
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_usage
1
$ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_active_kids 
0
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
active
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_usage
0
$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_active_kids
0


All of these are with the 'vanilla' kernel.
Comment 12 Lukas Wunner 2018-05-22 09:03:29 UTC
Hm, the first 4.17 output does not show anything suspicious. Could you also post the output of:

cat /sys/bus/pci/devices/0000:01:00.0/power/control
cat /sys/bus/pci/devices/0000:01:00.1/power/control
Comment 13 taijian 2018-05-22 20:40:02 UTC
$ cat /sys/bus/pci/devices/0000:01:00.0/power/control
auto
$ cat /sys/bus/pci/devices/0000:01:00.1/power/control
auto


again, this is with the 'vanilla' kernel.
Comment 14 taijian 2018-05-22 21:18:51 UTC
(In reply to Lukas Wunner from comment #5)

> Also, there are Dell machines which can be switched to one of two modes in
> the BIOS: In one mode, audio is streamed to external displays via the Intel
> HDA, in the other mode, it's streamed via the AMD HDA. In the former mode,
> the AMD HDA serves no purpose and should thus be hidden as per this patch:
> https://lkml.org/lkml/2018/4/20/244
>

Well, this is a Dell machine, but there is no BIOS option to switch anything regarding video/audio. However, it is possible that there is something wonky about how the chip is wired, because all other Alienware laptops do sport NVIDIA gpus and it could be that the wiring is the same.
Comment 15 taijian 2018-05-22 21:49:48 UTC
(In reply to Alex Deucher from comment #7)
> see also: https://bugzilla.kernel.org/show_bug.cgi?id=199693

I tried that patch, but am not seeing any difference.
Comment 16 Lukas Wunner 2018-05-23 06:54:51 UTC
Created attachment 139688 [details] [review]
drm-next-4.18-wip debug patch 1

So, the HDA device's usage counter is 0, it has no active children and runtime PM is allowed. Still, it's runtime active. Why is it not suspending? Here's a little patch which litters relevant code with debug printks, hopefully this will get us closer to the answer. Could you apply this on top of drm-next-4.18-wip and attach the resulting dmesg output?

Oh and could you also try:
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
... to double check that runtime PM is enabled?

Thanks!
Comment 17 taijian 2018-05-23 09:59:24 UTC
Created attachment 139692 [details]
dmesg with the debug symbols requested by Lukas

This is the 'vanilla' kernel with the extra patch for the debug symbols added in.

Also, as requested:

$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
active
Comment 18 taijian 2018-05-23 10:21:47 UTC
(In reply to Lukas Wunner from comment #16)
> Created attachment 139688 [details] [review] [review]
> drm-next-4.18-wip debug patch 1
> 
> So, the HDA device's usage counter is 0, it has no active children and
> runtime PM is allowed. Still, it's runtime active. Why is it not suspending?

Well, AFAIK the both the HDMI port and the DP on my laptop are directly attached to the dGPU and not to the iGPU like the internal panel is. Maybe it has something to do with that?
Comment 19 Lukas Wunner 2018-05-23 12:19:57 UTC
Created attachment 139706 [details] [review]
drm-next-4.18-wip debug patch 2

Okay at 10.574564, the HDA controller tries to runtime suspend but returns -EBUSY (-16) because the child HDA codec hdaudioC1D0 is still probing. After the child is done with probing at 10.586551, it informs the HDA controller that it may now be idle. However for some reason rpm_idle() doesn't cause the controller to be suspended.

Here's another debug patch which also litters the ->runtime_idle() path with debug printks. Please try this and attach the resulting dmesg.


> Well, AFAIK the both the HDMI port and the DP on my laptop are directly attached
> to the dGPU and not to the iGPU like the internal panel is. Maybe it has
> something to do with that?

You've got two HDMI codecs in your machine, hdaudioC1D0 on the AMD GPU and hdaudioC0D2 in the Intel PCH. It's unclear to me which one is used for external displays.


> Also, as requested:
> $ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
> active

My apologies, I meant runtime_enabled, however it does seem to be enabled, otherwise the idle/suspend machinery wouldn't work. Thanks!
Comment 20 taijian 2018-05-23 23:12:58 UTC
Created attachment 139723 [details]
dmesg with the debug symbols requested by Lukas, version 2

Here you go with another round of dmesg output. Also:

$ cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_enabled
enabled


I am fairly certain that the external ports on my machine are wired to the dPGU, however, neither of them are in use.
Comment 21 Lukas Wunner 2018-05-24 08:56:44 UTC
Okay so the reason the HDA controller doesn't suspend is because the codec_powered bitmask is not null:

snd_hda_intel 0000:01:00.1: azx_runtime_idle: !power_save_controller = 0, !azx_has_pm_runtime(chip) = 0, azx_bus(chip)->codec_powered = 0x1, !chip->running = 0

The lowest bit is set. In other words, the single codec on that HDA bus is considered powered on, so the HDA controller is prevented from suspending.

Bits in the codec_powered bitmask are set or cleared using snd_hdac_codec_link_up() and snd_hdac_codec_link_down(), which are only called during runtime suspend and resume of the codec.

Looking at hda_codec_runtime_suspend(), we can see that the bit is only cleared conditionally:

https://elixir.bootlin.com/linux/latest/source/sound/pci/hda/hda_codec.c#L2906

	if (codec_has_clkstop(codec) && codec_has_epss(codec) &&
	    (state & AC_PWRST_CLK_STOP_OK))
		snd_hdac_codec_link_down(&codec->core);

The codec_has_clkstop() and codec_has_epss() macros just query bits in the power_caps bitmask. That bitmask can also be retrieved from sysfs, could you please post the output of:
grep . /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/*/power_caps

I'll prepare another debug patch so that we get a better understanding why the codec isn't powered down.

We've either got a bug in the HDA code here or we're missing a quirk for this specific HDA controller. Sooner or later we'll have to pull in Takashi Iwai (hda maintainer) and/or AMD engineers responsible for the HDA portion of their GPUs.

Previously we just papered over such issues by forcing the HDA controller off when the GPU runtime suspends, but since my commit 07f4f97d7b4b such bugs are now exposed as we deliberately keep the GPU awake as long as the HDA controller is in use. Thanks for helping identify and tracking down this issue.
Comment 22 taijian 2018-05-24 10:01:00 UTC
Hey, thank you guys for working on fixing this - I am kinda the one who stands to benefit if this gets fixed, so...

Also:
$ grep . /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/*/power_caps
/sys/bus/hdaudio/devices/hdaudioC1D0/widgets/01/power_caps:0xc0000009
Comment 23 Lukas Wunner 2018-05-24 10:04:15 UTC
Created attachment 139735 [details] [review]
drm-next-4.18-wip debug patch 3

Here's another debug patch to understand why the HDA codec refuses to power down. I've removed some of the earlier debug printks as they're no longer needed, so dmesg output should be shorter than before, still you may want to add log_buf_len=10M to your command line to avoid losing messages.
Comment 24 Lukas Wunner 2018-05-24 10:13:37 UTC
(In reply to taijian from comment #22)
> $ grep . /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/*/power_caps
> /sys/bus/hdaudio/devices/hdaudioC1D0/widgets/01/power_caps:0xc0000009

According to the HDA specification r1.0a, page 209
(https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/high-definition-audio-specification.pdf) this decodes as:

EPSS supported
CLKSTOP supported
D3 supported
D0 supported

Hm, looks fine at first glance. I think I'll need dmesg output with the 3rd debug patch to understand what's going wrong.
Comment 25 taijian 2018-05-24 10:28:39 UTC
Created attachment 139736 [details]
dmesg with the debug symbols requested by Lukas, version 3

And another round of dmesg output. Enjoy!
Comment 26 Lukas Wunner 2018-05-24 13:07:17 UTC
So hda_call_codec_suspend() tries to set the codec's power state to D3 by calling hda_set_power_state(). The return value of that is the 32 bit response detailed on page 151 ff. of the HDA spec. hda_codec_runtime_suspend() then checks if the PS-ClkStopOk bit (bit 9) is set. Only then does it allow suspend of the codec.

That bit is not set. It's also not set for the other two codecs of your machine (in the Intel HDA controller).

And there's another oddity, the 32 bit response should contain the actual power state in bits 7:4, but those bits are always cleared, for all codecs and regardless whether the codec was put into D0 or D3. In other words, the codecs all remain in D0 even if we tell them to go to D3.

Moreover, bits 0:3 of the response should contain the last power state set, but those bits are always either set to D0 or D1. But according to the power capabilities of the AMD HDA controller, it only supports D0 and D3, not D1. So this looks totally bogus.

Let me test the debug patch on my own machine and check what the response looks like there.
Comment 27 Lukas Wunner 2018-05-24 15:34:05 UTC
Okay the real culprit seems to be commit 3b5b899ca67d ("ALSA: hda: Make use of core codec functions to sync power state"):
https://git.kernel.org/linus/3b5b899ca67d

If you revert that, does the issue go away?

I can see on my MacBook Pro that the return value is consistently 0x233 without this commit when setting the power state to D3 (which is the correct return value), and 0x1 with the commit. The commit went into 4.17. Given how late we are in this cycle, I think a revert is in order. The commit breaks runtime suspend for any HDA controller. Why the hell wasn't this discovered before, this must have been in linux-next for > 4 months!?
Comment 28 Lukas Wunner 2018-05-24 15:43:36 UTC
Maybe a revert isn't necessary, I think I've spotted the problem:

+static inline bool snd_hda_sync_power_state(struct hda_codec *codec,
              ^^^^

... should be unsigned int.
Comment 29 taijian 2018-05-24 16:15:49 UTC
So a patch to try would look like:

-static inline bool snd_hda_sync_power_state(struct hda_codec *codec,
+static unsigned int snd_hda_sync_power_state(struct hda_codec *codec,

??
Comment 30 taijian 2018-05-24 16:40:08 UTC
Created attachment 139739 [details] [review]
patch that seems to fix things for me

OK, going with Lukas' discovery, I have rebuild my kernel from drm-next-4.18-wip, applying the attached patch, reversing the atpx quirk discussed earlier, and also applying the debug patch (dmesg comes in the next post). 

And voilà: It seems to be fixed, AFAIK! YAY!
Comment 31 taijian 2018-05-24 16:40:48 UTC
Created attachment 139740 [details]
the promised dmesg output with the patch
Comment 32 Alex Deucher 2018-05-24 17:18:04 UTC
(In reply to taijian from comment #30)
> Created attachment 139739 [details] [review] [review]
> patch that seems to fix things for me
> 
> OK, going with Lukas' discovery, I have rebuild my kernel from
> drm-next-4.18-wip, applying the attached patch, reversing the atpx quirk
> discussed earlier, and also applying the debug patch (dmesg comes in the
> next post). 
> 
> And voilà: It seems to be fixed, AFAIK! YAY!

Just to be clear, it works fine without the ATPX quirk if you apply the HDA fix?  Should I revert the ATPX quirk?
Comment 33 taijian 2018-05-24 17:42:25 UTC
(In reply to Alex Deucher from comment #32)
> (In reply to taijian from comment #30)
> > Created attachment 139739 [details] [review] [review] [review]
> > patch that seems to fix things for me
> > 
> > OK, going with Lukas' discovery, I have rebuild my kernel from
> > drm-next-4.18-wip, applying the attached patch, reversing the atpx quirk
> > discussed earlier, and also applying the debug patch (dmesg comes in the
> > next post). 
> > 
> > And voilà: It seems to be fixed, AFAIK! YAY!
> 
> Just to be clear, it works fine without the ATPX quirk if you apply the HDA
> fix?  Should I revert the ATPX quirk?

Yes, as you suggested earlier in the thread, I reversed the atpx quirk patch when applying the hda fix, just to see if you were right that Lukas' vga_switcheroo work made that obsolete. And from my testing it does seem to be obsolete, so if no one else is objecting, please feel free to revert it.
Comment 34 Lukas Wunner 2018-05-24 21:08:28 UTC
Fixed with commit 009f8c90f571 ("ALSA: hda - Fix runtime PM") which is now queued for 4.17-rc7:
https://git.kernel.org/tiwai/sound/c/009f8c90f571

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.