Summary: | R290X stuck at 100% GPU load / full core clock on non-x86 machines | |
---|---|---|---
Product: | DRI | Reporter: | Timothy Pearson <kb9vqf>
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel>
Status: | RESOLVED MOVED | QA Contact: |
Severity: | normal | |
Priority: | medium | CC: | funfunctor, kb9vqf, vedran
Version: | XOrg git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
Description
Timothy Pearson
2016-07-17 10:19:13 UTC
I should note that after "fixing" the GPU, the radeon driver can be unloaded and loaded repeatedly without the issue reappearing. However, rebooting the machine (i.e. hard GPU reset with firmware reload) will cause the issue to appear, and it will persist until the GPU is "fixed" (crashed) again.

Additional information requested:
* Kernel 4.6
* Issue appears before X is loaded.
* Loading X makes no difference.
* Terminating X makes no difference.
* Unloading / reloading radeon driver makes no difference.
* Forced hard reset through the radeon_gpu_reset device node makes no difference.

Corrected pastebin: https://paste.ee/p/Utp5X

Have you confirmed this affecting aarch64 as well?

At the risk of sending things off in the wrong direction, my first thought is some kind of funky data caching thing when reading GRBM_STATUS using POWER hardware.

If bit 31 were always 1 and the other bits were behaving normally then the idea of being stuck at 100% load would make more sense, but bit 31 stuck at 1 and all the rest stuck at 0 seems really odd.

(In reply to John Bridgman from comment #5)
> At the risk of sending things off in the wrong direction, my first thought
> is some kind of funky data caching thing when reading GRBM_STATUS using
> POWER hardware.
>
> If bit 31 were always 1 and the other bits were behaving normally then the
> idea of being stuck at 100% load would make more sense, but bit 31 stuck at
> 1 and all the rest stuck at 0 seems really odd.

If it were a data caching issue, how would the GPU crash / soft reset fix it?

(In reply to Vedran Miletić from comment #4)
> Have you confirmed this affecting aarch64 as well?

No, I have not. It is non-trivial to test this using the systems on this end.

Hold on, there is additional info on radeon IRC log (and in OP's head :)) which is not yet in the ticket:
>radeontop output of stuck card:
>gpu 100.00%, ee 0.00%, vgt 0.00%, ta 0.00%, sx 0.00%, sh 0.00%, spi 0.00%, sc 0.00%, pa 0.00%, db 0.00%, cb 0.00%
The above is only when no load on the card... when running a 3D app the gpu bit stays stuck at 1 (100%) but other bits behave normally.
I think that pretty much eliminates the caching idea.
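
To make the radeontop reading above concrete, here is a minimal Python sketch of how a sampling tool might turn repeated GRBM_STATUS reads into busy percentages. Only the role of bit 31 as the overall "gpu" busy bit comes from this thread. The bit position used for the per-block example, the helper name `busy_percent`, and the sample register values are assumptions for illustration, not taken from radeontop or from the register documentation.

```python
GPU_BUSY_BIT = 31    # from this thread: bit 31 of GRBM_STATUS drives the "gpu" figure
BLOCK_BUSY_BIT = 30  # ASSUMPTION: illustrative position for one per-block busy bit


def busy_percent(samples, bit):
    """Return the percentage of samples in which the given bit is set."""
    if not samples:
        return 0.0
    hits = sum(1 for value in samples if (value >> bit) & 1)
    return 100.0 * hits / len(samples)


# Hypothetical idle card showing the bug: bit 31 set, every other bit clear.
idle_stuck = [0x80000000] * 4

# Hypothetical card under 3D load: bit 31 still stuck, the block bit toggles.
loaded_stuck = [0x80000000, 0xC0000000, 0x80000000, 0xC0000000]

print(busy_percent(idle_stuck, GPU_BUSY_BIT))      # 100.0
print(busy_percent(idle_stuck, BLOCK_BUSY_BIT))    # 0.0
print(busy_percent(loaded_stuck, GPU_BUSY_BIT))    # 100.0
print(busy_percent(loaded_stuck, BLOCK_BUSY_BIT))  # 50.0
```

The idle case reproduces the "gpu 100.00%, everything else 0.00%" pattern quoted above. The loaded case shows the other bits moving while bit 31 stays pinned, which matches the observation that argues against a stale cached read.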
A bit more information:
* Disabling DPM does not fix the problem (dpm=0 on module load)
* Using hard reset instead of soft reset just makes a complete mess / host hang
* It looks like only the CP block needs to be reset (GPU softreset: 0x00000008 corresponds to RADEON_RESET_CP).
* After reset DPM is broken, but DPM also breaks after unloading / reloading the radeon module so this may be a red herring.

Created attachment 125126
Hack around spurious GPU load indication

This is rather nasty but it does fix the problem. DPM works perfectly on both cards with this applied.

This bug is also triggered on x86 if the BIOS is set to not execute option ROMs on installed PCI/PCIe cards.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/727.