Bug 108854

Summary: [polaris11] - GPU Hang - ring gfx timeout
Product: Mesa
Reporter: Tom Seewald <tseewald>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal
Priority: medium
CC: tom.stdenis
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
Attachments: dmesg showing the hang and failed gpu reset
dmesg of 4.20-rc5 with drm.debug=0xe
amd-staging-drm-next dmesg as of January 14th 2019
UMR wave dump as of January 14th 2019
UMR gfx ring dump as of January 14th 2019
UMR gpu info
dmesg with xorg 1.20, kernel 5.0-rc2

Description Tom Seewald 2018-11-24 20:41:30 UTC
Created attachment 142604 [details]
dmesg showing the hang and failed gpu reset

Problem:

While running RuneLite [1] with GPU acceleration enabled, the system hangs after several minutes of seemingly normal operation. Once the GPU hangs, it attempts to reset itself but fails with the following message:

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

This hang causes the system to lock up, and ssh is the only access still possible. There is no graphical corruption; the displays are simply frozen.

System Information:

GPU: POLARIS11 - RX 560 4GB (1002:67ff)
Mesa: 18.0.5
X11: 1.19.6
Firmware files should be the latest, as I've pulled them from agd5f's repo [2].

Kernel parameters: "quiet splash scsi_mod.use_blk_mq=1 apparmor=2 security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off"

I have reproduced this issue on:

4.20-rc3
amd-staging-drm-next (as of commit 1179994039abc10aab0d2f0ecfc4c65dfbd77438)

[1] https://github.com/runelite/runelite
[2] https://people.freedesktop.org/~agd5f/radeon_ucode/
Comment 1 Tom Seewald 2018-12-01 18:17:02 UTC
I can confirm this is still happening on 4.20-rc4, as well as with more up-to-date userspace software.

libdrm: 3.27.0
Mesa: 18.2.4

The hangs can be reliably reproduced at least as far back as kernel 4.15, so I am not confident I can bisect this.

Here is a dump of my card's firmware information in case I missed an update.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000f0
CE feature version: 47, firmware version: 0x00000089
RLC feature version: 1, firmware version: 0x00000035
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 47, firmware version: 0x000002cb
MEC2 feature version: 47, firmware version: 0x000002cb
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
SDMA0 feature version: 31, firmware version: 0x00000036
SDMA1 feature version: 0, firmware version: 0x00000036
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C98121-M01

Would umr[1] be useful here? I have not used it before, so I'd need some guidance on what arguments would produce output relevant to this hang.

Any help is appreciated.

[1] https://cgit.freedesktop.org/amd/umr/
Comment 2 Tom Seewald 2018-12-08 18:20:35 UTC
Created attachment 142754 [details]
dmesg of 4.20-rc5 with drm.debug=0xe
Comment 3 Tom Seewald 2018-12-08 18:27:07 UTC
Installed the new Polaris firmware released on December 3rd; however, it doesn't appear to affect my card, as the content of /sys/kernel/debug/dri/1/amdgpu_firmware_info is unchanged.

Upgraded to Mesa 18.3.0 from 18.2.4 - no change.

Added dmesg of 4.20-rc5 with drm.debug=0xe, showing the hang. It now prints hung kernel task backtraces rather than "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!".

I've also included the power management information before and after the GPU hang (see the snapshot sketch after the dumps below).

/sys/kernel/debug/dri/1/amdgpu_pm_info *before* GPU hang:

Clock Gating Flags Mask: 0x3fbcf
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: On
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: On
	Host Data Path Medium Grain Clock Gating: On
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	1196 MHz (SCLK)
	387 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	993 mV (VDDGFX)
	20.30 W (average GPU)

GPU Temperature: 38 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled


/sys/kernel/debug/dri/1/amdgpu_pm_info *after* GPU hang:

Clock Gating Flags Mask: 0x6400
	Graphics Medium Grain Clock Gating: Off
	Graphics Medium Grain memory Light Sleep: Off
	Graphics Coarse Grain Clock Gating: Off
	Graphics Coarse Grain memory Light Sleep: Off
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: Off
	Graphics Run List Controller Light Sleep: Off
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: Off
	Memory Controller Medium Grain Clock Gating: Off
	System Direct Memory Access Light Sleep: On
	System Direct Memory Access Medium Grain Clock Gating: Off
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: Off
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: Off
	Host Data Path Medium Grain Clock Gating: Off
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: Off
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	1196 MHz (SCLK)
	387 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	993 mV (VDDGFX)
	28.186 W (average GPU)

GPU Temperature: 42 C
GPU Load: 100 %

UVD: Disabled

VCE: Disabled
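
Since the "before" reading otherwise has to be captured by hand, a small watcher like the one below could keep rolling snapshots of amdgpu_pm_info so the last pre-hang state survives the hang. This is just a sketch: the dri/1 index matches the debugfs paths above, and the /tmp file names are arbitrary.

#!/bin/sh
# Rolling snapshots of amdgpu_pm_info (sketch; run as root).
# dri/1 matches the debugfs paths above; the /tmp names are arbitrary.
PM_INFO=/sys/kernel/debug/dri/1/amdgpu_pm_info
OUT=/tmp/amdgpu_pm_info
while true; do
    [ -f "$OUT.cur" ] && mv "$OUT.cur" "$OUT.prev"   # keep the previous reading
    cat "$PM_INFO" > "$OUT.cur"                      # take a fresh snapshot
    sleep 5
done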
Comment 4 Tom Seewald 2019-01-14 16:54:18 UTC
Created attachment 143107 [details]
amd-staging-drm-next dmesg as of January 14th 2019
Comment 5 Tom Seewald 2019-01-14 16:55:19 UTC
Created attachment 143108 [details]
UMR wave dump as of January 14th 2019
Comment 6 Tom Seewald 2019-01-14 16:55:56 UTC
Created attachment 143109 [details]
UMR gfx ring dump as of January 14th 2019
Comment 7 Tom Seewald 2019-01-14 16:57:02 UTC
Created attachment 143110 [details]
UMR gpu info
Comment 8 Tom Seewald 2019-01-14 17:05:14 UTC
I've reproduced this issue on amd-staging-drm-next and have attached a UMR wave and gfx ring dump, along with a new dmesg.  To clarify, this issue also prevents me from rebooting/shutting down my computer, and I am forced to hold the power button.

Here are the version strings of the relevant software I'm running:

Kernel: amd-staging-drm-next (commit: d2d07f246b126b23d02af0603b83866a3c3e2483)
Mesa: 18.3.1
Xorg: 1.19.6
UMR: 016bc2e93af2cac7a9bd790f7fcacb1ffdadc819

This is my first attempt at using UMR to get information about this system hang.  I'm essentially just copying what Andrey Grodzovsky suggested in a previous thread[0].

Here are the umr commands used to gather the information:

Waves dump:    umr -i 1 -O verbose,halt_waves -wa
GFX ring dump: umr -i 1 -O verbose,follow -R gfx[.]
GFX info:      umr -i 1 -e

I've attached the output of these to the bugzilla report.
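
For future runs, a small wrapper along the lines below would make it easier to grab everything in one pass right after the timeout shows up in dmesg. This is only a sketch based on the commands above; the output directory and file names are arbitrary.

#!/bin/sh
# Collect the hang data in one pass (sketch based on the umr commands above;
# run as root once "ring gfx timeout" appears in dmesg).
OUT=/tmp/gpu-hang-$(date +%Y%m%d-%H%M%S)   # arbitrary output directory
mkdir -p "$OUT"
umr -i 1 -O verbose,halt_waves -wa     > "$OUT/waves.txt"    2>&1  # wave dump
umr -i 1 -O verbose,follow -R 'gfx[.]' > "$OUT/gfx-ring.txt" 2>&1  # gfx ring dump
umr -i 1 -e                            > "$OUT/gpu-info.txt" 2>&1  # gfx info
dmesg                                  > "$OUT/dmesg.txt"          # kernel log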

[0] https://lists.freedesktop.org/archives/amd-gfx/2018-December/029790.html
Comment 9 Tom Seewald 2019-01-14 17:42:57 UTC
I temporarily upgraded to Xorg 1.20, and the issue still occurs.
Comment 10 Tom Seewald 2019-01-14 17:43:46 UTC
Created attachment 143113 [details]
dmesg with xorg 1.20, kernel 5.0-rc2
Comment 11 Alex Deucher 2019-01-25 17:24:59 UTC
The reset was actually successful.  The problem is, userspace components need to be aware of the reset and recreate their contexts.  As a workaround, you can kill the problematic app or restart X.
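
E.g., over ssh, something along these lines should do it (just a sketch; the process match pattern and the display-manager unit name are guesses for this setup and may need adjusting):

# Kill the hung GL application (RuneLite runs as a Java process; the match
# pattern below is an assumption).
pkill -f RuneLite
# Or restart the X server via the display manager (unit name varies by distro).
sudo systemctl restart display-manager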
Comment 12 Tom Seewald 2019-02-10 01:01:52 UTC
(In reply to Alex Deucher from comment #11)
> The reset was actually successful.  The problem is, userspace components
> need to be aware of the reset and recreate their contexts.  As a workaround,
> you can kill the problematic app or restart X.

Hmm, but then why won't the machine restart unless I use SysRq keys? I would think a userspace issue wouldn't cause hung kernel tasks like that.

I'm also curious why this program is causing the GPU to reset in the first place; I have not seen others reporting issues with this program on other platforms.

Is this ring gfx timeout purely a problem with userspace?

e.g.
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=32203, emitted seq=32205
Comment 13 Tom St Denis 2019-02-21 17:56:01 UTC
The wave dump seems to be empty... Is that the complete output?  Was there anything printed to stderr (like there are no waves)?
Comment 14 Tom Seewald 2019-02-21 19:21:57 UTC
(In reply to Tom St Denis from comment #13)
> The wave dump seems to be empty... Is that the complete output?  Was there
> anything printed to stderr (like there are no waves)?

Yes, it says "no active waves!", so it makes sense that it is empty. Is there something else you'd like me to try?

Currently I'm running "umr -i 1 -O verbose,halt_waves -wa" immediately after I see the "ring gfx timeout" in dmesg.  I also just rebuilt UMR so I should be up to date.

Some potentially good news, though: after upgrading from Mesa 18.3.1 to 18.3.3, I have not been able to reproduce the issue. On Mesa 18.3.1 and earlier I can reproduce it in under 20 seconds (I did so today on the latest amd-staging-drm-next), and I have now tested Mesa 18.3.3 for about an hour.

But I believe this is still worth looking into, as userspace should probably not be able to hang the entire system, even if the user is running an older version of Mesa.
Comment 15 Tom St Denis 2019-02-22 17:03:26 UTC
If you can't reproduce on a newer version of mesa then it's "been fixed" :-)

Blocking shutdown is simply due to the device deinit being blocked because the device is not in an operational state. Not much can be done from a driver point of view, I don't think.
Comment 16 Tom Seewald 2019-02-22 18:15:41 UTC
(In reply to Tom St Denis from comment #15)
> If you can't reproduce on a newer version of mesa then it's "been fixed" :-)

My (probably incorrect) understanding is roughly this:

    +-----------------+
1.) |   Application   |
    +-----------------+
       |
       | Possibly sending bad commands/calls to Mesa
       |
       v
    +-----------------+
2.) |      Mesa       |
    +-----------------+
       |
       | Passing on bad calls from the application
       |     or
       | There is a bug in Mesa itself where it is sending bad calls/commands to the kernel
       v
    +-----------------+
3.) |  Kernel/amdgpu  |
    +-----------------+
       |
       | amdgpu puts the physical device in a bad state due to bad commands from Mesa
       v
    +-----------------+
4.) |       GPU       |
    +-----------------+

Given that mesa 18.3.3+ "fixes" the issue, it sounds like a specific case of mesa sending garbage to the kernel (step 2 to 3) has been fixed.

But in general shouldn't the kernel driver (ideally) be able to handle mesa passing malformed/bad commands rather than freezing the device (step 3 to 4)?  I understand not every case can be covered, and I also understand that GPU resets need to be supported in user space for seamless recovery, but shouldn't the driver "unstick" itself enough so the computer can be rebooted normally?

Thanks for your time and patience.
Comment 17 Alex Deucher 2019-02-22 18:31:37 UTC
(In reply to Tom Seewald from comment #16)

> But in general shouldn't the kernel driver (ideally) be able to handle mesa
> passing malformed/bad commands rather than freezing the device (step 3 to
> 4)?  I understand not every case can be covered, and I also understand that
> GPU resets need to be supported in user space for seamless recovery, but
> shouldn't the driver "unstick" itself enough so the computer can be rebooted
> normally?

This is generally not bad data from Mesa per se. There's not really a good way to validate whether all combinations of state sent to the GPU are valid. There are hundreds of registers and state buffers that the GPU uses to process the 3D pipeline. It's impossible to test every combination of state, dispatch, and ordering. The hangs are generally due to a deadlock in the hardware caused by a bad interaction of states set by the application. E.g., some hw block is waiting on a signal from another hw block that will never be sent because the user submitted another state update that stops that signal.

The GPU reset should generally be able to recover the GPU, but in some cases you may end up with a deadlock in software in the kernel somewhere.
Comment 18 Tom Seewald 2019-02-22 18:46:04 UTC
Thanks Tom and Alex, I'll trust your judgement on this.
