Bug 111762

Summary: RX 5700 XT Navi - amdgpu.ppfeaturemask=0xffffffff causes stuttering and does not unlock clock/voltage/power controls
Product: DRI
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: RESOLVED MOVED
Severity: enhancement
Priority: not set
Reporter: tempel.julian
Assignee: Default DRI bug account <dri-devel>
CC: danielkinsman.nospam, ragnaros39216, tempel.julian

Description tempel.julian 2019-09-22 09:54:16 UTC
Hello,
with Polaris and Vega, setting amdgpu.ppfeaturemask=0xffffffff worked without issues here: It unlocked pp_od_clk_voltage and didn't cause any issues for me.

But with Navi, it doesn't work. Even with that flag set, I still can't open
/sys/class/drm/card0/device/pp_od_clk_voltage
as root.

Also, I can't increase the GPU's power consumption, as
/sys/class/drm/card0/device/hwmon/hwmon0/power1_cap_max
only allows the default 100% Powertune limit, meaning I can't set any higher value in
/sys/class/drm/card0/device/hwmon/hwmon0/power1_cap

Apart from not being able to change the aforementioned parameters, setting amdgpu.ppfeaturemask=0xffffffff causes stuttering, even on the desktop and also affects the mouse cursor.

This is with kernel drm-next-5.5-wip 73cdff347343504287feae8b36fa7317f04dcc61
and an MSI 5700 XT Gaming X.
Comment 1 tempel.julian 2019-10-20 11:26:38 UTC
Still happens with current 5.5-wip/drm-next kernels.
I don't know if it is supposed to be implemented, but there seems to be some bug apart from that:
Just reading the sysfs entries in "/sys/class/drm/card0/device/" makes the program doing the reading freeze, e.g. a file browser (even when started as root).

Anyhow, "# cat /sys/class/drm/card0/device/pp_od_clk_voltage" returns nothing.
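
To tell apart the cases reported in this thread (file missing, file not readable, file present but empty), a small helper along these lines may be useful. It is only a sketch: the function name od_status is made up for illustration, and the path to check is passed in by the caller.

```shell
#!/bin/bash
# Classify the state of an overdrive sysfs file.
# od_status is a hypothetical helper name, not part of any tool.
od_status() {
    local f=$1
    if [[ ! -e $f ]]; then
        echo "missing"
    elif [[ ! -r $f ]]; then
        echo "unreadable"
    elif [[ -z $(cat "$f" 2>/dev/null) ]]; then
        # The file exists and can be opened, but the driver printed nothing.
        echo "empty"
    else
        echo "populated"
    fi
}
```

Example call: od_status /sys/class/drm/card0/device/pp_od_clk_voltage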

Could there be an update on this? Not being able to overclock/undervolt almost 4 months after Navi release is a huge disappointment.
Comment 2 Andrew Sheldon 2019-10-21 01:42:16 UTC
As a workaround, use upp instead (it writes to the powerplay binary directly). See: https://github.com/sibradzic/upp

I suggest using 5.4-rcX as AMD's wip kernels (amd-staging-drm-next and drm-next) may still have a bug with pptable writing. Or you can try reverting 3abf8d896f8ac72341677a6cd82662b80943f9c8

drm/amd/powerplay: do proper cleanups on hw_fini

Be aware that this method can cause issues with fan control, so you might also need to manually set the fans after that. You can use fanctl to handle this:
https://gitlab.com/mcoffin/fanctl
Comment 3 zamundaaalp 2019-10-21 17:33:52 UTC
I have the same (or at least a similar) bug. /sys/class/drm/card1/device/hwmon/hwmon3/power1_cap_max in my case gives the default 220W (value: 220000000).
$ cat /sys/class/drm/card0/device/pp_od_clk_voltage
returns nothing.
I don't get any stuttering though, with kernel 5.3.6 or with 5.4rc2.

Dolphin freezes when looking at /sys/class/drm/card1/device/ as well.
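
The hwmon power files report microwatts, which is why 220 W shows up as 220000000. A minimal sketch for printing the current and maximum cap in watts follows. The function names are made up for illustration, and the hwmon directory is passed in because its number differs between systems (hwmon0 and hwmon3 both appear in this thread).

```shell
#!/bin/bash
# Convert a hwmon power value from microwatts to whole watts.
uw_to_w() {
    echo $(( $1 / 1000000 ))
}

# Print the current and maximum power cap for one hwmon directory.
# show_power_cap is a hypothetical helper name.
show_power_cap() {
    local hwmon_dir=$1
    local cap max
    cap=$(cat "$hwmon_dir/power1_cap") || return 1
    max=$(cat "$hwmon_dir/power1_cap_max") || return 1
    echo "cap=$(uw_to_w "$cap")W max=$(uw_to_w "$max")W"
}
```

Example call: show_power_cap /sys/class/drm/card1/device/hwmon/hwmon3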
Comment 4 tempel.julian 2019-10-21 17:38:46 UTC
Thanks for the hint, Andrew Sheldon; I had completely missed that soft PowerPlay tables (SPPT) are possible on Linux. I'll test it with my cheap Polaris card first, which already made me stick with a custom fan curve anyway.

Regarding the stutter with amdgpu.ppfeaturemask=0xffffffff: I'm not sure anymore if it really was related, as hardware cursor support seems to be still a complete mess for Navi with 5.3/5.4 and 5.5 still being incomplete.
Comment 5 L.S.S. 2019-10-24 13:31:31 UTC
I can also confirm the issue exists. Setting amdgpu.ppfeaturemask=0xffffffff doesn't allow me to access the "States Table" section in radeon-profile, as if the parameter was ignored.

As for the stutter issue, I don't know what exactly it is as I don't notice any difference with or without the parameter. On 5.3 kernel, the mouse feels sluggish as if my monitor is running at 30Hz, but it's fine on 5.4 (rc) kernel. This is observed on official Manjaro kernels.
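
To rule out that the parameter really was ignored, the mask seen by the loaded module can be checked directly. The sketch below rests on two assumptions that should be verified against the running kernel: that the parameter is exported under /sys/module/amdgpu/parameters/ppfeaturemask, and that the overdrive feature is bit 14 (0x4000, PP_OVERDRIVE_MASK in amd_shared.h). The function names are made up for illustration.

```shell
#!/bin/bash
# Succeed if the overdrive bit is set in the given mask (decimal or 0x hex).
# ASSUMPTION: bit 14 (0x4000) is PP_OVERDRIVE_MASK; check
# drivers/gpu/drm/amd/include/amd_shared.h for your kernel.
overdrive_bit_set() {
    (( ($1 & 0x4000) != 0 ))
}

# Print the mask the loaded module actually uses.
# ASSUMPTION: the parameter is exported at this sysfs path.
read_ppfeaturemask() {
    cat "${1:-/sys/module/amdgpu/parameters/ppfeaturemask}"
}
```

Example call: overdrive_bit_set "$(read_ppfeaturemask)" && echo "overdrive bit is set"

If the bit is set and pp_od_clk_voltage is still empty, the mask is being honoured and the limitation lies elsewhere in the driver.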
Comment 6 tempel.julian 2019-10-25 15:36:13 UTC
Tested custom soft power play table via UPP on Polaris and it generally seems to work well (might be able to test Navi at a later time).

However, there is the issue that the voltage gets reset when there is a modeline switch. So I've written a script which checks the voltage and restarts UPP when it exceeds values which would not occur with my undervolting:

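#!/bin/bash
# Watchdog: restart UPP whenever the GPU voltage climbs above
# what the undervolt should allow.

while true; do
    sleep 1

    # in0_input reports the GPU core voltage in mV.
    read -r num < /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/hwmon/hwmon0/in0_input
    if [[ "$num" -gt 1030 ]]; then
        systemctl restart amdgpu-oc && systemctl restart amdgpu-fancontrol
    fi
done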
Comment 7 Matt Coffin 2019-11-07 18:47:03 UTC
This patch should take care of the problem by treating navi10's TDPODLimit the same as vega20 does: https://patchwork.freedesktop.org/series/69090/
Comment 8 Matt Coffin 2019-11-07 18:48:40 UTC
(In reply to Matt Coffin from comment #7)
> This patch should take care of the problem by treating navi10's TDPODLimit
> the same as vega20 does: https://patchwork.freedesktop.org/series/69090/

Sorry for the spam. This is in reply to the power1_cap issue, not the whole bug in general.
Comment 9 tempel.julian 2019-11-07 18:50:49 UTC
Thank you, I'll try it out at some point.

I also got an email from fin4478 suggesting that I try amdgpu.ppfeaturemask=0xfffd7fff.
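
To see how 0xfffd7fff differs from 0xffffffff, the two masks can be compared bit by bit. This is plain arithmetic and makes no claim about what each bit means; the mapping from bit position to feature is defined by the PP_FEATURE_MASK enum in the kernel's amd_shared.h. The function name cleared_bits is made up for illustration.

```shell
#!/bin/bash
# List the bit positions that are set in mask $1 but cleared in mask $2.
cleared_bits() {
    local diff bit
    local -a out=()
    diff=$(( $1 & ~$2 & 0xffffffff ))
    for (( bit = 0; bit < 32; bit++ )); do
        if (( (diff >> bit) & 1 )); then
            out+=("$bit")
        fi
    done
    echo "${out[*]}"
}
```

Example call: cleared_bits 0xffffffff 0xfffd7fff prints "15 17", so the suggested mask turns off exactly two features relative to the all-ones mask.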
Comment 10 zamundaaalp 2019-11-08 19:25:41 UTC
I'm already using ppfeaturemask=0xfffd7fff, it doesn't unlock anything - or at least CoreCtrl doesn't show anything.

In the journald log I see a lot of these lines, always grouped together:
08.11.19 20:20	kernel	amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param 0x80
08.11.19 20:20	kernel	amdgpu: [powerplay] Failed to send message 0x20, response 0xfffffffb param 0x2
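
The response value in these lines is easier to read as a signed 32-bit number. The sketch below only does that conversion. Reading the result as a negative errno (-5 would match -EIO on Linux) is an assumption about how the driver fills in this field, not something the log confirms. The function name to_s32 is made up for illustration.

```shell
#!/bin/bash
# Reinterpret an unsigned 32-bit value as a signed 32-bit integer.
to_s32() {
    local v=$(( $1 & 0xffffffff ))
    if (( v >= 0x80000000 )); then
        v=$(( v - 0x100000000 ))
    fi
    echo "$v"
}
```

Example call: to_s32 0xfffffffb prints -5.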
Comment 11 tempel.julian 2019-11-08 19:39:59 UTC
That really looks suspicious.

Looks like the issue I mentioned earlier, where the voltage resets itself to default when using custom power play tables, might not apply to Navi:
https://bugzilla.kernel.org/show_bug.cgi?id=205393

Can anyone share his/her experience with using custom power play tables via upp for Navi?
Now with that fix for Polaris by Alex, it seems to be absolutely flawless for me.
Would be good to know if the same applied to Navi.

It was just a bit inconvenient that for my Polaris card the Vdds were defined as garbage values when parsing the default pp_table. Though specifying custom values in mV worked without issues.
Comment 12 Alex Deucher 2019-11-08 19:46:41 UTC
(In reply to tempel.julian from comment #11)
> It was just a bit inconvenient that for my Polaris card the Vdds were
> defined as garbage values when parsing the default pp_table. Though
> specifying custom values in mV worked without issues.

If you are seeing values like 0xff01, those are not garbage.  They are virtual voltage ids, which tell the driver to look up the proper voltage via a different method.
Comment 13 tempel.julian 2019-11-08 20:13:03 UTC
It might be that, just not in hex. E.g. VddcLookupTable entry 1 returns a Vdd of 65282.
Comment 14 Alex Deucher 2019-11-08 21:00:41 UTC
(In reply to tempel.julian from comment #13)
> It might be that, just not in hex. E.g. VddcLookupTable entry 1 returns a
> Vdd of 65282.

Correct.  65282 is 0xff02 which is a virtual voltage id.  The driver uses that id to look up the real voltage based on the leakage for the board.  Take a look at smu7_get_evv_voltages() or smu7_get_elb_voltages() in smu7_hwmgr.c.
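
When going through a parsed pp_table, values like this can be flagged with a quick check. This is a sketch only: the conversion to hex is exact, but the id range used here (0xff01 to 0xff08) is an assumption based on the 0xff01 and 0xff02 examples in this thread and should be checked against smu7_hwmgr.c. The function names are made up for illustration.

```shell
#!/bin/bash
# Print a decimal table value as 16-bit hex, e.g. 65282 -> 0xff02.
to_hex() {
    printf '0x%04x\n' "$1"
}

# Succeed if the value looks like a virtual voltage id rather than mV.
# ASSUMPTION: ids occupy 0xff01..0xff08; verify against the driver source.
is_virtual_voltage_id() {
    (( $1 >= 0xff01 && $1 <= 0xff08 ))
}
```

Example call: is_virtual_voltage_id 65282 && echo "$(to_hex 65282) is a virtual voltage id"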
Comment 15 Alex Deucher 2019-11-08 21:02:17 UTC
pp_od_clk_voltage isn't implemented yet for navi.  There are patches on the mailing list:
https://patchwork.freedesktop.org/series/69152/
Comment 16 Martin Peres 2019-11-19 09:52:18 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/913.
