Bug 107518

Summary: polaris powerplay init fails: There must be 1 or more PCIE levels defined in PPTable
Product: DRI
Reporter: Shawn Anastasio <shawnanastasio>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: sir
Version: unspecified
Hardware: PowerPC
OS: Linux (All)
Attachments:
Description               Flags
dmesg for 4.17.11-200     none
dmesg for 4.18.0-rc8+     none

Description Shawn Anastasio 2018-08-07 19:15:10 UTC
Created attachment 141002 [details]
dmesg for 4.17.11-200

When booting kernel 4.17 or 4.18-rc8+ (git) on a POWER9 system with an ASUS Rx 580 GPU, the following messages are printed to the kernel log:

[   10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels defined in PPTable.
[   10.398839] amdgpu: [powerplay] Failed to populate SCLK during PopulateNewDPMClocksStates Function!
[   10.398840] amdgpu: [powerplay] Failed to populate and upload SCLK MCLK DPM levels!
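
For anyone checking a saved dmesg for the same failure, here is a minimal Python sketch that pulls out the amdgpu powerplay lines. The function name `powerplay_messages` and the regular expression are illustrative assumptions, not part of any kernel tooling; the sample text is the three lines quoted above.

```python
import re

# Matches kernel log lines of the form:
#   [   10.398837] amdgpu: [powerplay] <message>
POWERPLAY_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+amdgpu: \[powerplay\]\s*(.*)$")


def powerplay_messages(log_text):
    """Return (timestamp, message) pairs for every amdgpu powerplay line."""
    found = []
    for line in log_text.splitlines():
        match = POWERPLAY_LINE.match(line)
        if match:
            found.append((float(match.group(1)), match.group(2)))
    return found


# The three lines reported in this bug, used as sample input.
sample = """\
[   10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels defined in PPTable.
[   10.398839] amdgpu: [powerplay] Failed to populate SCLK during PopulateNewDPMClocksStates Function!
[   10.398840] amdgpu: [powerplay] Failed to populate and upload SCLK MCLK DPM levels!
"""

for stamp, text in powerplay_messages(sample):
    print(f"{stamp:.6f}  {text}")
```

To run it against a real log, pass the contents of the saved dmesg file to `powerplay_messages()` in place of `sample`.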


Note that the system is booted with the kernel argument `amdgpu.dc=0` to work around this issue: https://bugs.freedesktop.org/show_bug.cgi?id=107049

GPU performance seems to be significantly hindered as a result of these errors.

Booting with `amdgpu.dpm=0` silences the errors but does not improve performance.
Comment 1 Shawn Anastasio 2018-08-07 19:55:14 UTC
Created attachment 141003 [details]
dmesg for 4.18.0-rc8+
Comment 2 Shawn Anastasio 2018-08-07 20:34:50 UTC
Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.
Comment 3 Alex Deucher 2018-08-07 21:24:34 UTC
(In reply to Shawn Anastasio from comment #2)
> Upon further testing, the issue seems to go away when the firmware is
> removed from petitboot, preventing it from initializing the card before the
> host OS. This indicates that it may have something to do with the GPU being
> initialized twice.

The hw requires a special reset before it can be initialized again.  This is handled in driver for things like hibernate (S4) support.
Comment 4 Shawn Anastasio 2018-08-08 03:47:24 UTC
(In reply to Alex Deucher from comment #3)
> (In reply to Shawn Anastasio from comment #2)
> > Upon further testing, the issue seems to go away when the firmware is
> > removed from petitboot, preventing it from initializing the card before the
> > host OS. This indicates that it may have something to do with the GPU being
> > initialized twice.
> 
> The hw requires a special reset before it can be initialized again.  This is
> handled in driver for things like hibernate (S4) support.

Does the driver do the reset on a kexec reboot? If so, it seems insufficient to mitigate this issue.
Comment 5 Alex Deucher 2018-08-08 15:04:12 UTC
(In reply to Shawn Anastasio from comment #4)
> Does the driver do the reset on a kexec reboot? If so, it seems insufficient
> to mitigate this issue.

Probably not.  I'm not that familiar with kexec unfortunately.
Comment 6 Shawn Anastasio 2018-08-10 06:06:06 UTC
Could you point me towards the applicable routines that perform the reset on hibernate? They may provide some more insight into the situation.
Comment 7 Alex Deucher 2018-08-10 14:46:08 UTC
amdgpu_pmops_freeze() calls amdgpu_device_suspend(), which calls amdgpu_asic_reset() at the end.  amdgpu_asic_reset() is a macro that calls an ASIC-specific callback to reset the GPU.  vi_asic_reset() in vi.c is the callback for Polaris and other VI family parts.
Comment 8 Timothy Pearson 2018-08-25 19:19:48 UTC
Would it make sense to call amdgpu_asic_reset() as part of module load to ensure that the GPU is in a known good state?
Comment 9 Timothy Pearson 2018-08-26 00:08:24 UTC
(In reply to Timothy Pearson from comment #8)
> Would it make sense to call amdgpu_asic_reset() as part of module load to
> ensure that the GPU is in a known good state?

This didn't fix the problem, but I did note that unloading and reloading the amdgpu module from the host (rmmod followed by modprobe) is a valid workaround.  Something must happen on rmmod-based teardown aside from amdgpu_asic_reset().
Comment 11 Luigi Laurini 2018-08-29 09:09:00 UTC
The same bug is present on the amd64 architecture in a KVM guest with an RX 480 passed through.
The first time I boot the guest, performance is OK, but if I reboot the guest without rebooting the host, the messages appear.


I tried the patch

https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=8242308cc3c4419832126ab78ca409ce7110ab33

The messages are gone, but the performance problem is still present.

The problem affects the memory and GPU clocks:

First boot of the guest:

cat /sys/kernel/debug/dri/0/amdgpu_pm_info 
Clock Gating Flags Mask: 0x37bcf
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: On
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: Off
	Host Data Path Medium Grain Clock Gating: On
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	330 MHz (SCLK)
	300 MHz (PSTATE_SCLK)
	300 MHz (PSTATE_MCLK)
	1000 mV (VDDGFX)
	19.127 W (average GPU)

GPU Temperature: 56 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled

If I reboot the guest:

cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x37bcf
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: On
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: Off
	Host Data Path Medium Grain Clock Gating: On
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	300 MHz (MCLK)
	300 MHz (SCLK)
	300 MHz (PSTATE_SCLK)
	300 MHz (PSTATE_MCLK)
	800 mV (VDDGFX)
	7.162 W (average GPU)

GPU Temperature: 42 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled
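
To make the difference between the two dumps easier to see, here is a minimal Python sketch that parses the "GFX Clocks and Power" readings and lists the ones that changed. The helper names `parse_pm_info` and `changed_readings` are illustrative assumptions; the sample values are copied from the two outputs above.

```python
import re

# Matches reading lines such as "1750 MHz (MCLK)" or "19.127 W (average GPU)".
READING = re.compile(r"^\s*([\d.]+)\s+(MHz|mV|W)\s+\((.+)\)\s*$")


def parse_pm_info(text):
    """Map each reading label to a (value, unit) pair."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line)
        if match:
            readings[match.group(3)] = (float(match.group(1)), match.group(2))
    return readings


def changed_readings(before, after):
    """Return {label: (before, after)} for readings that differ."""
    return {
        label: (before[label], after[label])
        for label in before
        if label in after and before[label] != after[label]
    }


first_boot = """\
GFX Clocks and Power:
    1750 MHz (MCLK)
    330 MHz (SCLK)
    300 MHz (PSTATE_SCLK)
    300 MHz (PSTATE_MCLK)
    1000 mV (VDDGFX)
    19.127 W (average GPU)
"""

after_guest_reboot = """\
GFX Clocks and Power:
    300 MHz (MCLK)
    300 MHz (SCLK)
    300 MHz (PSTATE_SCLK)
    300 MHz (PSTATE_MCLK)
    800 mV (VDDGFX)
    7.162 W (average GPU)
"""

changed = changed_readings(parse_pm_info(first_boot),
                           parse_pm_info(after_guest_reboot))
for label, (old, new) in changed.items():
    print(f"{label}: {old[0]:g} {old[1]} -> {new[0]:g} {new[1]}")
```

Lines that are not readings (the clock gating flags, temperature, load) are simply ignored by the pattern, so the full amdgpu_pm_info output can be passed in unchanged.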
Comment 12 Luigi Laurini 2018-08-29 13:33:55 UTC
I have to correct my previous statement about the patch.

I patched the 4.18.5 kernel, compiled it, and rebooted the guest without rebooting the host.
The guest VM was already in a bad state (300 MHz (MCLK)). Using a patched kernel in a guest VM in that state did not solve the problem.

After this, I tried resetting the host.
The first boot after resetting the host led to normal behavior. Resetting the guest without resetting the host seems to maintain the correct behavior (1750 MHz (MCLK)).

The patch seems to work only if the card is in a "good" state. If, for some reason, the card ends up in a bad state, the patch cannot solve the problem.


I also noticed that the "powerplay init fails" message is gone, but now I get:

[   36.352542] amdgpu: [powerplay] 
                last message was failed ret is 0
[   36.353460] amdgpu: [powerplay] 
                last message was failed ret is 0
[   36.353468] amdgpu: [powerplay] 
                failed to send message 260 ret is 255 
[   36.353471] amdgpu: [powerplay] 
                failed to send message 145 ret is 255 
[   36.353475] amdgpu: [powerplay] 
                last message was failed ret is 255
Comment 13 Timothy Pearson 2018-10-27 05:29:00 UTC
The patch originally at https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=8242308cc3c4419832126ab78ca409ce7110ab33 is no longer available:

> Bad commit reference: 8242308cc3c4419832126ab78ca409ce7110ab33

Is an equivalent now in mainline?  I'd like to try it out on one of our POWER9 boxes.

Thanks!
Comment 15 Martin Peres 2019-11-19 08:46:40 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/474.
