Bug 96964

Summary: R290X stuck at 100% GPU load / full core clock on non-x86 machines
Product: DRI Reporter: Timothy Pearson <kb9vqf>
Component: DRM/Radeon Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: funfunctor, kb9vqf, vedran
Version: XOrg git   
Hardware: Other   
OS: All   
Attachments:
Hack around spurious GPU load indication (flags: none)

Description Timothy Pearson 2016-07-17 10:19:13 UTC
Our twin Radeon 290X cards are stuck at 100% GPU load (according to radeontop and Gallium) and full core clock (according to radeon_pm_info) on non-x86 machines such as our POWER8 compute server.  The identical card does not show this behaviour on a test x86 machine.

Forcibly crashing the GPU (causing a soft reset) fixes the issue.  Relevant dmesg output starts at line 4 in this pastebin: https://bugzilla.kernel.org/show_bug.cgi?id=70651  It is unknown if simply triggering a soft reset without the GPU crash would also resolve the issue.

I suspect this is related to the atombios x86-specific oprom code only executing on x86 machines, and related setup therefore not being finalized by the radeon driver itself on non-x86 machines.  However, this is just an educated guess.

radeontop output of stuck card:
gpu 100.00%, ee 0.00%, vgt 0.00%, ta 0.00%, sx 0.00%, sh 0.00%, spi 0.00%, sc 0.00%, pa 0.00%, db 0.00%, cb 0.00%

radeontop output of "fixed" card after GPU crash / reset, running 3D app:
gpu 4.17%, ee 0.00%, vgt 0.00%, ta 3.33%, sx 3.33%, sh 0.00%, spi 3.33%, sc 3.33%, pa 0.00%, db 3.33%, cb 3.33%, vram 11.72% 479.87mb

Despite the "100% GPU load" indication, there is no sign of actual load being placed on the GPU.  3D-intensive applications function correctly with no apparent performance degradation, so it seems the reading is (a) spurious and (b) causing the core clock to ramp up needlessly.
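
For anyone trying to confirm the same symptom on their own hardware, the idle-card signature shown above (gpu at 100% while every other block reads 0%) can be checked mechanically. The following is only a rough sketch: the helper names parse_radeontop_line and looks_spurious are made up for illustration, and the parser assumes the "name NN.NN%" layout seen in the two radeontop lines quoted in this report. It only covers the idle case; under real load the other blocks are non-zero and this check will not fire.

```python
import re


def parse_radeontop_line(line):
    """Return {block_name: percent} for every 'name NN.NN%' pair in the line."""
    pairs = re.findall(r"(\w+) (\d+(?:\.\d+)?)%", line)
    return {name: float(pct) for name, pct in pairs}


def looks_spurious(fields):
    """True when 'gpu' reads 100% while every other reported block is idle."""
    others = [value for name, value in fields.items() if name != "gpu"]
    return (
        fields.get("gpu") == 100.0
        and bool(others)
        and all(value == 0.0 for value in others)
    )


# The two lines quoted in this report.
stuck = ("gpu 100.00%, ee 0.00%, vgt 0.00%, ta 0.00%, sx 0.00%, sh 0.00%, "
         "spi 0.00%, sc 0.00%, pa 0.00%, db 0.00%, cb 0.00%")
fixed = ("gpu 4.17%, ee 0.00%, vgt 0.00%, ta 3.33%, sx 3.33%, sh 0.00%, "
         "spi 3.33%, sc 3.33%, pa 0.00%, db 3.33%, cb 3.33%, "
         "vram 11.72% 479.87mb")

if __name__ == "__main__":
    print("stuck card spurious?", looks_spurious(parse_radeontop_line(stuck)))
    print("fixed card spurious?", looks_spurious(parse_radeontop_line(fixed)))
```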
Comment 1 Timothy Pearson 2016-07-17 10:22:19 UTC
I should note that after "fixing" the GPU, the radeon driver can be unloaded and loaded repeatedly without the issue reappearing.  However, rebooting the machine (i.e. hard GPU reset with firmware reload) will cause the issue to appear, and it will persist until the GPU is "fixed" (crashed) again.
Comment 2 Timothy Pearson 2016-07-17 10:28:28 UTC
Additional information requested:
Kernel 4.6

Issue appears before X is loaded.  Loading X makes no difference.  Terminating X makes no difference.  Unloading / reloading radeon driver makes no difference.  Forced hard reset through the radeon_gpu_reset device node makes no difference.
Comment 3 Timothy Pearson 2016-07-17 10:56:01 UTC
Corrected pastebin:
https://paste.ee/p/Utp5X
Comment 4 Vedran Miletić 2016-07-17 12:53:22 UTC
Have you confirmed this affecting aarch64 as well?
Comment 5 John Bridgman 2016-07-17 18:52:55 UTC
At the risk of sending things off in the wrong direction, my first thought is some kind of funky data caching thing when reading GRBM_STATUS using POWER hardware. 

If bit 31 were always 1 and the other bits were behaving normally then the idea of being stuck at 100% load would make more sense, but bit 31 stuck at 1 and all the rest stuck at 0 seems really odd.
Comment 6 Timothy Pearson 2016-07-17 19:05:26 UTC
(In reply to John Bridgman from comment #5)
> At the risk of sending things off in the wrong direction, my first thought
> is some kind of funky data caching thing when reading GRBM_STATUS using
> POWER hardware. 
> 
> If bit 31 were always 1 and the other bits were behaving normally then the
> idea of being stuck at 100% load would make more sense, but bit 31 stuck at
> 1 and all the rest stuck at 0 seems really odd.

If it were a data caching issue, how would the GPU crash / soft reset fix it?
Comment 7 Timothy Pearson 2016-07-17 19:05:51 UTC
(In reply to Vedran Miletić from comment #4)
> Have you confirmed this affecting aarch64 as well?

No, I have not.  It is non-trivial to test this using the systems on this end.
Comment 8 John Bridgman 2016-07-17 19:53:54 UTC
Hold on, there is additional info in the radeon IRC log (and in OP's head :)) which is not yet in the ticket:

>radeontop output of stuck card:
>gpu 100.00%, ee 0.00%, vgt 0.00%, ta 0.00%, sx 0.00%, sh 0.00%, spi 0.00%, sc 0.00%, pa 0.00%, db 0.00%, cb 0.00%

The above is only when there is no load on the card. When running a 3D app, the gpu bit stays stuck at 1 (100%) but the other bits behave normally.

I think that pretty much eliminates the caching idea.
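
To illustrate the bit-level picture, here is a small sketch under some assumptions: that the load percentages come from sampling a status word such as GRBM_STATUS many times and counting how often each bit is set, and that bit 31 is the "gpu" bit as discussed above. The second bit position (14) and the sample values are invented purely for the example and are not taken from this ticket.

```python
GPU_BIT = 31          # the "gpu" bit, per the discussion in this ticket
OTHER_BLOCK_BIT = 14  # illustrative position for some other block (assumption)


def busy_percent(samples, bit):
    """Percentage of sampled status words in which the given bit is set."""
    if not samples:
        return 0.0
    hits = sum(1 for word in samples if (word >> bit) & 1)
    return 100.0 * hits / len(samples)


# Hypothetical samples: idle card with bit 31 stuck at 1, everything else 0.
idle_stuck = [0x80000000] * 8

# Hypothetical samples: same stuck bit, but another block toggling under load.
loaded_stuck = [0x80000000, 0x80004000, 0x80000000, 0x80004000]

if __name__ == "__main__":
    for label, samples in (("idle", idle_stuck), ("loaded", loaded_stuck)):
        print(label,
              "gpu=%.2f%%" % busy_percent(samples, GPU_BIT),
              "other=%.2f%%" % busy_percent(samples, OTHER_BLOCK_BIT))
```

With a stuck bit 31, the gpu figure stays pinned at 100% in both cases while the other block's figure follows the real activity, which matches what is described above.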
Comment 9 Timothy Pearson 2016-07-17 23:22:54 UTC
A bit more information:
 * Disabling DPM does not fix the problem (dpm=0 on module load)
 * Using hard reset instead of soft reset just makes a complete mess / host hang
 * It looks like only the CP block needs to be reset (GPU softreset: 0x00000008 corresponds to RADEON_RESET_CP).
 * After reset DPM is broken, but DPM also breaks after unloading / reloading the radeon module so this may be a red herring.
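
As a rough aid for reading the "GPU softreset" mask in dmesg, the value can be decoded bit by bit. This is only a sketch: the 0x00000008 to RADEON_RESET_CP mapping is the one noted above, while the other table entries are recalled from the kernel's radeon.h and should be verified against the actual header before relying on them. The decode_reset_mask helper is hypothetical.

```python
# Bit position -> flag name. Only bit 3 (RADEON_RESET_CP) is confirmed by this
# ticket; the remaining entries are assumptions to be checked against radeon.h.
RESET_FLAGS = {
    0: "RADEON_RESET_GFX",
    1: "RADEON_RESET_COMPUTE",
    2: "RADEON_RESET_DMA",
    3: "RADEON_RESET_CP",
    4: "RADEON_RESET_GRBM",
}


def decode_reset_mask(mask):
    """Return the flag names set in mask; bits not in the table are shown as hex."""
    names = []
    for bit in range(32):
        if mask & (1 << bit):
            names.append(RESET_FLAGS.get(bit, "unknown(0x%08x)" % (1 << bit)))
    return names


if __name__ == "__main__":
    # Mask reported in dmesg for the soft reset that clears the stuck state.
    print(decode_reset_mask(0x00000008))
```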
Comment 10 Timothy Pearson 2016-07-18 03:39:51 UTC
Created attachment 125126
Hack around spurious GPU load indication

This is rather nasty but it does fix the problem.  DPM works perfectly on both cards with this applied.
Comment 11 Timothy Pearson 2016-07-18 04:26:55 UTC
This bug is also triggered on x86 if the BIOS is set to not execute option ROMs on installed PCI/PCIe cards.
Comment 12 Martin Peres 2019-11-19 09:17:34 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/727.
