Created attachment 141002 [details]
dmesg for 4.17.11-200

When booting kernel 4.17 or 4.18-rc8+ (git) on a POWER9 system with an ASUS Rx 580 GPU, the following messages are printed to the kernel log:

[ 10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels defined in PPTable.
[ 10.398839] amdgpu: [powerplay] Failed to populate SCLK during PopulateNewDPMClocksStates Function!
[ 10.398840] amdgpu: [powerplay] Failed to populate and upload SCLK MCLK DPM levels!

Note that the system is booted with the kernel argument `amdgpu.dc=0` to work around this issue:
https://bugs.freedesktop.org/show_bug.cgi?id=107049

GPU performance seems to be significantly hindered as a result of these errors. Booting with `amdgpu.dpm=0` silences the errors but does not improve performance.
Created attachment 141003 [details] dmesg for 4.18.0-rc8+
Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.
(In reply to Shawn Anastasio from comment #2)
> Upon further testing, the issue seems to go away when the firmware is
> removed from petitboot, preventing it from initializing the card before the
> host OS. This indicates that it may have something to do with the GPU being
> initialized twice.

The hw requires a special reset before it can be initialized again. This is handled in the driver for things like hibernate (S4) support.
(In reply to Alex Deucher from comment #3)
> (In reply to Shawn Anastasio from comment #2)
> > Upon further testing, the issue seems to go away when the firmware is
> > removed from petitboot, preventing it from initializing the card before the
> > host OS. This indicates that it may have something to do with the GPU being
> > initialized twice.
>
> The hw requires a special reset before it can be initialized again. This is
> handled in driver for things like hibernate (S4) support.

Does the driver do the reset on a kexec reboot? If so, it seems insufficient to mitigate this issue.
(In reply to Shawn Anastasio from comment #4)
> Does the driver do the reset on a kexec reboot? If so, it seems insufficient
> to mitigate this issue.

Probably not. I'm not that familiar with kexec, unfortunately.
Could you point me towards the applicable routines that perform the reset on hibernate? They may provide some more insight into the situation.
amdgpu_pmops_freeze() calls amdgpu_device_suspend(), which calls amdgpu_asic_reset() at the end. amdgpu_asic_reset() is a macro that calls an ASIC-specific callback to reset the GPU. vi_asic_reset() in vi.c is the callback for Polaris and other VI family parts.
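For readers who do not have the driver source at hand, the call chain described above can be pictured with a toy model. This is a plain Python sketch, not kernel code: only the four function names come from the comment above, while `ToyDevice`, `asic_reset_cb`, and `trace` are hypothetical names made up for illustration, and the real suspend work is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyDevice:
    """Hypothetical stand-in for the driver's per-device state."""
    # ASIC-specific reset callback, chosen per GPU family (assumption: one slot).
    asic_reset_cb: Callable[["ToyDevice"], None]
    # Records the order in which the modelled functions run.
    trace: List[str] = field(default_factory=list)


def vi_asic_reset(dev: ToyDevice) -> None:
    # Models the VI-family callback; the real one lives in vi.c.
    dev.trace.append("vi_asic_reset")


def amdgpu_asic_reset(dev: ToyDevice) -> None:
    # Models the macro: it only forwards to the ASIC-specific callback.
    dev.trace.append("amdgpu_asic_reset")
    dev.asic_reset_cb(dev)


def amdgpu_device_suspend(dev: ToyDevice) -> None:
    dev.trace.append("amdgpu_device_suspend")
    # Real suspend work is elided in this model; the reset comes last.
    amdgpu_asic_reset(dev)


def amdgpu_pmops_freeze(dev: ToyDevice) -> None:
    dev.trace.append("amdgpu_pmops_freeze")
    amdgpu_device_suspend(dev)


dev = ToyDevice(asic_reset_cb=vi_asic_reset)
amdgpu_pmops_freeze(dev)
print(" -> ".join(dev.trace))
```

The point of the model is only that the reset is reached through the hibernate freeze path and dispatched per ASIC family; nothing here shows whether other paths such as kexec reach it.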
Would it make sense to call amdgpu_asic_reset() as part of module load to ensure that the GPU is in a known good state?
(In reply to Timothy Pearson from comment #8)
> Would it make sense to call amdgpu_asic_reset() as part of module load to
> ensure that the GPU is in a known good state?

This didn't fix the problem, but I did note that rmmod / modprobing the amdgpu module from the host is a valid workaround. Something must happen on rmmod-based teardown aside from amdgpu_asic_reset().
Does this patch help? https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=8242308cc3c4419832126ab78ca409ce7110ab33
The same bug is present on the amd64 architecture in a KVM guest with an RX 480 passed through. The first time I boot the guest the performance is OK, but if I reboot the guest without rebooting the host, the messages appear.

I tried the patch:
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=8242308cc3c4419832126ab78ca409ce7110ab33

The messages are gone but the performance problem is still present. The problem affects the memory/GPU clocking.

First boot of the guest:

cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x37bcf
Graphics Medium Grain Clock Gating: On
Graphics Medium Grain memory Light Sleep: On
Graphics Coarse Grain Clock Gating: On
Graphics Coarse Grain memory Light Sleep: On
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: On
Graphics Run List Controller Light Sleep: On
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: On
Memory Controller Medium Grain Clock Gating: On
System Direct Memory Access Light Sleep: Off
System Direct Memory Access Medium Grain Clock Gating: On
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: On
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: Off
Host Data Path Medium Grain Clock Gating: On
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: On
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
1750 MHz (MCLK)
330 MHz (SCLK)
300 MHz (PSTATE_SCLK)
300 MHz (PSTATE_MCLK)
1000 mV (VDDGFX)
19.127 W (average GPU)

GPU Temperature: 56 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled

If I reboot the guest:

cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x37bcf
Graphics Medium Grain Clock Gating: On
Graphics Medium Grain memory Light Sleep: On
Graphics Coarse Grain Clock Gating: On
Graphics Coarse Grain memory Light Sleep: On
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: On
Graphics Run List Controller Light Sleep: On
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: On
Memory Controller Medium Grain Clock Gating: On
System Direct Memory Access Light Sleep: Off
System Direct Memory Access Medium Grain Clock Gating: On
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: On
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: Off
Host Data Path Medium Grain Clock Gating: On
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: On
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
300 MHz (MCLK)
300 MHz (SCLK)
300 MHz (PSTATE_SCLK)
300 MHz (PSTATE_MCLK)
800 mV (VDDGFX)
7.162 W (average GPU)

GPU Temperature: 42 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled
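The two dumps above differ only in the "GFX Clocks and Power" section, which is easy to miss by eye. As a rough sketch, the clock lines can be pulled out and compared with a few lines of Python. The function names `parse_clocks` and `changed_clocks` are made up for this example, the regular expression assumes the "<number> MHz (<LABEL>)" layout seen in these two dumps, and the sample strings are copied from this comment rather than read from debugfs.

```python
import re

# Assumes the "<number> MHz (<LABEL>)" layout seen in the dumps in this thread.
CLOCK_RE = re.compile(r"(\d+)\s*MHz\s*\((\w+)\)")


def parse_clocks(pm_info: str) -> dict:
    """Return {label: MHz} for every clock line found in the text."""
    return {label: int(mhz) for mhz, label in CLOCK_RE.findall(pm_info)}


def changed_clocks(before: str, after: str) -> dict:
    """Return {label: (before_mhz, after_mhz)} for clocks whose value differs."""
    a, b = parse_clocks(before), parse_clocks(after)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# Clock lines copied from the two dumps quoted in this comment.
FIRST_BOOT = """1750 MHz (MCLK)
330 MHz (SCLK)
300 MHz (PSTATE_SCLK)
300 MHz (PSTATE_MCLK)"""

AFTER_GUEST_REBOOT = """300 MHz (MCLK)
300 MHz (SCLK)
300 MHz (PSTATE_SCLK)
300 MHz (PSTATE_MCLK)"""

print(changed_clocks(FIRST_BOOT, AFTER_GUEST_REBOOT))
```

On a live system the text would presumably come from reading /sys/kernel/debug/dri/0/amdgpu_pm_info as root. A difference in MCLK alone does not prove the card is in the bad state, since clocks also vary with load.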
I have to correct my earlier statement about the patch.

I patched the 4.18.5 kernel, compiled it, and rebooted the guest without rebooting the host. The guest VM was already in a bad state (300 MHz (MCLK)). Using a patched kernel in a guest VM in that state did not resolve the situation.

After this, I tried resetting the host. The first start after the host reset led to normal behavior. Resetting the guest without resetting the host seems to maintain the correct behavior (1750 MHz (MCLK)).

The patch seems to work only if the card is in a "good" state. If, for some reason, the card ends up in a bad state, the patch cannot solve the problem.

I also noticed that the message "powerplay ini fails" is gone, but now I get:

[ 36.352542] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353460] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353468] amdgpu: [powerplay] failed to send message 260 ret is 255
[ 36.353471] amdgpu: [powerplay] failed to send message 145 ret is 255
[ 36.353475] amdgpu: [powerplay] last message was failed ret is 255
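When comparing logs across reboots, it may help to extract the failing message IDs mechanically. The following is a minimal sketch: `failed_messages` is a name made up for this example, the pattern is written only against the "failed to send message N ret is M" wording quoted in this comment, and other kernel versions may word these lines differently.

```python
import re

# Written against the wording quoted in this comment; an assumption for other kernels.
SEND_FAIL_RE = re.compile(
    r"\[powerplay\]\s+failed to send message (\d+) ret is (\d+)"
)


def failed_messages(log_text: str) -> list:
    """Return [(message_id, ret), ...] in the order they appear in the log."""
    return [(int(msg), int(ret)) for msg, ret in SEND_FAIL_RE.findall(log_text)]


# Log lines copied from this comment.
SAMPLE_LOG = """[ 36.352542] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353460] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353468] amdgpu: [powerplay] failed to send message 260 ret is 255
[ 36.353471] amdgpu: [powerplay] failed to send message 145 ret is 255
[ 36.353475] amdgpu: [powerplay] last message was failed ret is 255"""

print(failed_messages(SAMPLE_LOG))
```

The "last message was failed" lines are ignored on purpose, since they carry no message ID.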
The patch originally at https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=8242308cc3c4419832126ab78ca409ce7110ab33 is no longer available:

> Bad commit reference: 8242308cc3c4419832126ab78ca409ce7110ab33

Is an equivalent now in mainline? I'd like to try it out on one of our POWER9 boxes.

Thanks!
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11a88c2e92feca1ed3fba19fa375f76d3c75f5d5
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/474.