Bug 108854 - [polaris11] - GPU Hang - ring gfx timeout
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2018-11-24 20:41 UTC by Tom Seewald
Modified: 2019-02-21 19:21 UTC

Attachments
dmesg showing the hang and failed gpu reset (118.68 KB, text/plain)
2018-11-24 20:41 UTC, Tom Seewald
dmesg of 4.20-rc5 with drm.debug=0xe (251.35 KB, text/plain)
2018-12-08 18:20 UTC, Tom Seewald
amd-drm-staging-next dmesg as of January 14th 2019 (88.10 KB, text/plain)
2019-01-14 16:54 UTC, Tom Seewald
UMR wave dump as of January 14th 2019 (3.06 KB, text/plain)
2019-01-14 16:55 UTC, Tom Seewald
UMR gfx ring dump as of January 14th 2019 (149.64 KB, text/plain)
2019-01-14 16:55 UTC, Tom Seewald
UMR gpu info (2.41 KB, text/plain)
2019-01-14 16:57 UTC, Tom Seewald
dmesg with xorg 1.20, kernel 5.0-rc2 (82.81 KB, text/plain)
2019-01-14 17:43 UTC, Tom Seewald

Description Tom Seewald 2018-11-24 20:41:30 UTC
Created attachment 142604 [details]
dmesg showing the hang and failed gpu reset

Problem:

While running RuneLite [1] with GPU acceleration enabled, the system hangs after several minutes of seemingly normal operation. Once the GPU hangs, it attempts to reset itself but fails with the following message:

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

This hang locks up the system, and ssh is the only way left to access it. There is no graphical corruption; the displays are simply frozen.
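For reference, the -125 in that message is a negated errno value. On Linux, errno 125 is ECANCELED ("Operation canceled"). The following is a minimal Python sketch for extracting and naming the code from such a log line. The helper name `parser_error_code` is hypothetical and not part of amdgpu or any tool mentioned in this report.

```python
import errno
import os
import re

# Hypothetical helper: pull the numeric code out of a
# "Failed to initialize parser" kernel log line.
PARSER_ERR = re.compile(r"Failed to initialize parser (-?\d+)!")


def parser_error_code(line):
    """Return the integer error code from the log line, or None if absent."""
    match = PARSER_ERR.search(line)
    return int(match.group(1)) if match else None


line = "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!"
code = parser_error_code(line)

# Kernel code reports negated errno values, so look up the absolute value.
# errno numbering is platform specific; run this on Linux for a meaningful name.
name = errno.errorcode.get(abs(code), "unknown")
print(code, name, os.strerror(abs(code)))
```

The lookup uses the errno table of the machine running the script, so the printed name is only meaningful when that machine is also running Linux.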

System Information:

GPU: POLARIS11 - RX 560 4GB (1002:67ff)
Mesa: 18.0.5
X11: 1.19.6
Firmware files should be the latest as I've pulled them from agd5f's repo [2].

Kernel parameters: "quiet splash scsi_mod.use_blk_mq=1 apparmor=2 security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off"

I have reproduced this issue on:

4.20-rc3
amd-staging-drm-next (as of commit 1179994039abc10aab0d2f0ecfc4c65dfbd77438)

[1] https://github.com/runelite/runelite
[2] https://people.freedesktop.org/~agd5f/radeon_ucode/
Comment 1 Tom Seewald 2018-12-01 18:17:02 UTC
I can confirm this is still happening on 4.20-rc4 as well as with more up to date userspace software.

libdrm: 3.27.0
Mesa: 18.2.4

The hangs can be reliably reproduced at least as far back as kernel 4.15 so I am not confident I can bisect this.

Here is a dump of my card's firmware information in case I missed an update.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000f0
CE feature version: 47, firmware version: 0x00000089
RLC feature version: 1, firmware version: 0x00000035
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 47, firmware version: 0x000002cb
MEC2 feature version: 47, firmware version: 0x000002cb
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
SDMA0 feature version: 31, firmware version: 0x00000036
SDMA1 feature version: 0, firmware version: 0x00000036
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C98121-M01
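To check whether a firmware update changed anything, two such dumps can be compared mechanically rather than by eye. The following is a rough Python sketch, assuming the line format shown above. The function names are illustrative only, and lines that do not match the pattern (such as the VBIOS line) are skipped.

```python
import re

# Matches lines such as "ME feature version: 47, firmware version: 0x000000a2".
FW_LINE = re.compile(
    r"^(?P<block>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text):
    """Map each block name to (feature version, firmware version)."""
    info = {}
    for raw in text.splitlines():
        match = FW_LINE.match(raw.strip())
        if match:
            info[match["block"]] = (int(match["feature"]), int(match["fw"], 16))
    return info


def changed_blocks(before, after):
    """Return the block names whose versions differ between two parsed dumps."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


sample = "\n".join([
    "ME feature version: 47, firmware version: 0x000000a2",
    "PFP feature version: 47, firmware version: 0x000000f0",
    "RLC SRLC feature version: 0, firmware version: 0x00000000",
    "VBIOS version: 113-C98121-M01",
])
parsed = parse_firmware_info(sample)
print(parsed)
print(changed_blocks(parsed, parsed))
```

In practice the two inputs would be the contents of amdgpu_firmware_info saved before and after installing new firmware files.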

Would umr[1] be useful here? I have not used it before, so I'd need some guidance on what arguments would produce output relevant to this hang.

Any help is appreciated.

[1] https://cgit.freedesktop.org/amd/umr/
Comment 2 Tom Seewald 2018-12-08 18:20:35 UTC
Created attachment 142754 [details]
dmesg of 4.20-rc5 with drm.debug=0xe
Comment 3 Tom Seewald 2018-12-08 18:27:07 UTC
I installed the new Polaris firmware released on December 3rd; however, it doesn't appear to affect my card, as the content of /sys/kernel/debug/dri/1/amdgpu_firmware_info is unchanged.

Upgraded to Mesa 18.3.0 from 18.2.4 - no change.

Added dmesg of 4.20-rc5 with drm.debug=0xe, showing the hang. It now prints backtraces of hung kernel tasks rather than "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!".

I've also included the power management information before and after the GPU hang.

/sys/kernel/debug/dri/1/amdgpu_pm_info *before* GPU hang:

Clock Gating Flags Mask: 0x3fbcf
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: On
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: On
	Host Data Path Medium Grain Clock Gating: On
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	1196 MHz (SCLK)
	387 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	993 mV (VDDGFX)
	20.30 W (average GPU)

GPU Temperature: 38 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled


/sys/kernel/debug/dri/1/amdgpu_pm_info *after* GPU hang:

Clock Gating Flags Mask: 0x6400
	Graphics Medium Grain Clock Gating: Off
	Graphics Medium Grain memory Light Sleep: Off
	Graphics Coarse Grain Clock Gating: Off
	Graphics Coarse Grain memory Light Sleep: Off
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: Off
	Graphics Run List Controller Light Sleep: Off
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: Off
	Memory Controller Medium Grain Clock Gating: Off
	System Direct Memory Access Light Sleep: On
	System Direct Memory Access Medium Grain Clock Gating: Off
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: Off
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: Off
	Host Data Path Medium Grain Clock Gating: Off
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: Off
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	1196 MHz (SCLK)
	387 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	993 mV (VDDGFX)
	28.186 W (average GPU)

GPU Temperature: 42 C
GPU Load: 100 %

UVD: Disabled

VCE: Disabled
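
Most clock gating features that read On before the hang read Off afterwards, which is easy to miss when comparing the two dumps by eye. The following Python sketch lists only the features whose state flipped, assuming the "Feature: On/Off" line format shown above. The function names are hypothetical, and the sample uses three lines copied from each dump.

```python
def parse_gating(text):
    """Collect 'Feature: On/Off' lines from an amdgpu_pm_info dump."""
    states = {}
    for raw in text.splitlines():
        name, sep, value = raw.strip().rpartition(": ")
        # Lines such as the flags mask or "UVD: Disabled" are ignored.
        if sep and value in ("On", "Off"):
            states[name] = value == "On"
    return states


def gating_changes(before, after):
    """List (feature, before, after) for features whose state flipped."""
    return [
        (name, before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    ]


before = parse_gating("\n".join([
    "Clock Gating Flags Mask: 0x3fbcf",
    "\tGraphics Medium Grain Clock Gating: On",
    "\tSystem Direct Memory Access Light Sleep: Off",
    "\tUnified Video Decoder Medium Grain Clock Gating: On",
]))
after = parse_gating("\n".join([
    "Clock Gating Flags Mask: 0x6400",
    "\tGraphics Medium Grain Clock Gating: Off",
    "\tSystem Direct Memory Access Light Sleep: On",
    "\tUnified Video Decoder Medium Grain Clock Gating: On",
]))

for name, was_on, is_on in gating_changes(before, after):
    print(f"{name}: {'On' if was_on else 'Off'} -> {'On' if is_on else 'Off'}")
```

The sketch compares only the textual On/Off lines and does not attempt to decode the flags mask itself.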
Comment 4 Tom Seewald 2019-01-14 16:54:18 UTC
Created attachment 143107 [details]
amd-drm-staging-next dmesg as of January 14th 2019
Comment 5 Tom Seewald 2019-01-14 16:55:19 UTC
Created attachment 143108 [details]
UMR wave dump as of January 14th 2019
Comment 6 Tom Seewald 2019-01-14 16:55:56 UTC
Created attachment 143109 [details]
UMR gfx ring dump as of January 14th 2019
Comment 7 Tom Seewald 2019-01-14 16:57:02 UTC
Created attachment 143110 [details]
UMR gpu info
Comment 8 Tom Seewald 2019-01-14 17:05:14 UTC
I've reproduced this issue on amd-staging-drm-next and have attached a UMR wave and gfx ring dump, along with a new dmesg.  To clarify, this issue also prevents me from rebooting/shutting down my computer, and I am forced to hold the power button.

Here are the version strings of the relevant software I'm running:

Kernel: amd-staging-drm-next (commit: d2d07f246b126b23d02af0603b83866a3c3e2483)
Mesa: 18.3.1
Xorg: 1.19.6
UMR: 016bc2e93af2cac7a9bd790f7fcacb1ffdadc819

This is my first attempt at using UMR to get information about this system hang.  I'm essentially just copying what Andrey Grodzovsky suggested in a previous thread[0].

Here are the umr commands used to gather the information:

Waves dump: umr -i 1 -O verbose,halt_waves -wa
    
GFX ring dump: umr -i 1 -O verbose,follow -R gfx[.]
    
GFX info: umr -i 1 -e

I've attached the output of these to the bugzilla report.

[0] https://lists.freedesktop.org/archives/amd-gfx/2018-December/029790.html
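
The three umr invocations in this comment can be gathered in one pass. The following is a sketch only: the flags are copied verbatim from the commands above, while the function names and the dictionary keys are made up for illustration. It keeps stderr alongside stdout, since umr may print status messages there.

```python
import subprocess


def umr_commands(instance=1):
    """Build the umr argument lists from this comment for one GPU instance."""
    base = ["umr", "-i", str(instance)]
    return {
        "waves": base + ["-O", "verbose,halt_waves", "-wa"],
        "gfx_ring": base + ["-O", "verbose,follow", "-R", "gfx[.]"],
        "gpu_info": base + ["-e"],
    }


def collect(instance=1, runner=subprocess.run):
    """Run each command and return {name: (stdout, stderr)}.

    Assumes umr is installed and the caller has the privileges it needs.
    """
    outputs = {}
    for name, argv in umr_commands(instance).items():
        result = runner(argv, capture_output=True, text=True)
        outputs[name] = (result.stdout, result.stderr)
    return outputs


for name, argv in umr_commands(1).items():
    print(name, argv)
```

Passing each command as an argument list rather than a shell string means `gfx[.]` reaches umr unchanged, with no risk of the shell treating the brackets as a glob pattern.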
Comment 9 Tom Seewald 2019-01-14 17:42:57 UTC
I temporarily upgraded to Xorg 1.20, and the issue still occurs.
Comment 10 Tom Seewald 2019-01-14 17:43:46 UTC
Created attachment 143113 [details]
dmesg with xorg 1.20, kernel 5.0-rc2
Comment 11 Alex Deucher 2019-01-25 17:24:59 UTC
The reset was actually successful.  The problem is, userspace components need to be aware of the reset and recreate their contexts.  As a workaround, you can kill the problematic app or restart X.
Comment 12 Tom Seewald 2019-02-10 01:01:52 UTC
(In reply to Alex Deucher from comment #11)
> The reset was actually successful.  The problem is, userspace components
> need to be aware of the reset and recreate their contexts.  As a workaround,
> you can kill the problematic app or restart X.

Hmm, but then why will the machine not restart unless I use sysrq keys? I would think a userspace issue wouldn't cause hung kernel tasks like that.

I'm also curious why this program causes the GPU to reset to begin with. I have not seen others reporting issues with this program on other platforms.

Is this ring gfx timeout purely a problem with userspace?

e.g.
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=32203, emitted seq=32205
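
As the field names suggest, "signaled" is the last fence sequence number the ring completed and "emitted" is the last one submitted, so the gap between them is the amount of work still outstanding when the timeout fired. The following Python sketch parses the line on that assumption. The helper name and the returned keys are illustrative.

```python
import re

TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Extract the ring name and sequence numbers from a timeout message."""
    match = TIMEOUT.search(line)
    if match is None:
        return None
    signaled = int(match["signaled"])
    emitted = int(match["emitted"])
    return {
        "ring": match["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Submissions emitted to the ring but not yet signaled as complete.
        "outstanding": emitted - signaled,
    }


line = (
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=32203, emitted seq=32205"
)
print(parse_ring_timeout(line))
```

For the line quoted above, this reports the gfx ring with two sequence numbers outstanding (32205 - 32203).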
Comment 13 Tom St Denis 2019-02-21 17:56:01 UTC
The wave dump seems to be empty... Is that the complete output?  Was there anything printed to stderr (like there are no waves)?
Comment 14 Tom Seewald 2019-02-21 19:21:57 UTC
(In reply to Tom St Denis from comment #13)
> The wave dump seems to be empty... Is that the complete output?  Was there
> anything printed to stderr (like there are no waves)?

Yes, it says "no active waves!", so it makes sense that the dump is empty.  Is there something else you'd like me to try?

Currently I'm running "umr -i 1 -O verbose,halt_waves -wa" immediately after I see the "ring gfx timeout" in dmesg.  I also just rebuilt UMR so I should be up to date.
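
Running the dump by hand leaves a delay between the timeout and the capture. One way to shorten that delay is to follow the kernel log and react to the first matching line. The following is a sketch under several assumptions: `dmesg --follow` is available and readable by the current user, and the helper names and the `on_timeout` callback are hypothetical.

```python
import subprocess

MARKER = "ring gfx timeout"


def first_timeout_line(lines, marker=MARKER):
    """Return the first line containing the marker, or None if none does."""
    for line in lines:
        if marker in line:
            return line.rstrip("\n")
    return None


def watch_dmesg(on_timeout):
    """Follow the kernel log and call on_timeout(line) at the first match."""
    with subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    ) as proc:
        line = first_timeout_line(proc.stdout)
        proc.terminate()
    if line is not None:
        on_timeout(line)
    return line


sample_log = [
    "[  100.0] some unrelated message\n",
    "[  101.5] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=32203, emitted seq=32205\n",
    "[  102.0] another message\n",
]
print(first_timeout_line(sample_log))
```

A call such as `watch_dmesg(lambda line: subprocess.run(["umr", "-i", "1", "-O", "verbose,halt_waves", "-wa"]))` would then start the wave dump as soon as the message appears.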

Some potentially good news, though: after upgrading from mesa 18.3.1 to 18.3.3, I have not been able to reproduce the issue. On mesa 18.3.1 and earlier I can reproduce it in under 20 seconds (I did so today on the latest amd-staging-drm-next), and I have tested mesa 18.3.3 for about an hour now.

But I believe this is still something to look into as user space should probably not be able to hang the entire system, even if the user is running an older version of Mesa.

