Bug 108854 - [polaris11] - Failed GPU reset after hang
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2018-11-24 20:41 UTC by Tom Seewald
Modified: 2018-12-08 18:27 UTC (History)

Attachments
dmesg showing the hang and failed gpu reset (118.68 KB, text/plain)
2018-11-24 20:41 UTC, Tom Seewald
dmesg of 4.20-rc5 with drm.debug=0xe (251.35 KB, text/plain)
2018-12-08 18:20 UTC, Tom Seewald

Description Tom Seewald 2018-11-24 20:41:30 UTC
Created attachment 142604
dmesg showing the hang and failed gpu reset

Problem:

While running RuneLite [1] with GPU acceleration enabled, the system hangs after several minutes of seemingly normal operation. Once the GPU hangs, the driver attempts a reset, but the reset fails with the following message:

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

This hang causes the system to lock up and ssh is the only access possible. There is no graphical corruption, the displays are simply frozen.
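
For anyone triaging similar logs, the numeric code at the end of the message can be pulled out and mapped to a symbolic errno name. The following is only a minimal sketch: the helper names `extract_errno` and `errno_name` are made up for illustration, and the name lookup depends on the platform the script runs on (on Linux, errno 125 is ECANCELED). It says nothing about why the driver returned that code.

```python
import errno
import re

# Matches the trailing negative errno in lines such as
# "... *ERROR* Failed to initialize parser -125!"
_ERR_RE = re.compile(r"\*ERROR\*.*?(-\d+)!?\s*$")


def extract_errno(line):
    """Return the positive errno value from a DRM *ERROR* line, or None."""
    m = _ERR_RE.search(line)
    return -int(m.group(1)) if m else None


def errno_name(code):
    """Map an errno value to its symbolic name on the running platform."""
    return errno.errorcode.get(code, "unknown")


# The message quoted in this report.
line = "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!"
code = extract_errno(line)
print(code, errno_name(code))
```

Running this on the affected Linux machine (rather than on another OS) gives the name that matches the kernel's numbering.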

System Information:

GPU: POLARIS11 - RX 560 4GB (1002:67ff)
Mesa: 18.0.5
X11: 1.19.6
Firmware files should be the latest, as I pulled them from agd5f's repo [2].

Kernel parameters: "quiet splash scsi_mod.use_blk_mq=1 apparmor=2 security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off"

I have reproduced this issue on:

4.20-rc3
amd-staging-drm-next (as of commit 1179994039abc10aab0d2f0ecfc4c65dfbd77438)

[1] https://github.com/runelite/runelite
[2] https://people.freedesktop.org/~agd5f/radeon_ucode/
Comment 1 Tom Seewald 2018-12-01 18:17:02 UTC
I can confirm this still happens on 4.20-rc4, as well as with more up-to-date userspace software.

libdrm: 3.27.0
Mesa: 18.2.4

The hangs can be reliably reproduced at least as far back as kernel 4.15, so I am not confident I can bisect this.

Here is a dump of my card's firmware information in case I missed an update.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000f0
CE feature version: 47, firmware version: 0x00000089
RLC feature version: 1, firmware version: 0x00000035
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 47, firmware version: 0x000002cb
MEC2 feature version: 47, firmware version: 0x000002cb
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
SDMA0 feature version: 31, firmware version: 0x00000036
SDMA1 feature version: 0, firmware version: 0x00000036
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C98121-M01
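
To check whether a firmware update actually changed anything, two saved copies of this dump can be compared programmatically. This is a rough sketch that assumes the line format shown above; the function names `parse_firmware_info` and `changed_blocks` are hypothetical, and the sample text is a few lines copied from the dump.

```python
import re

# One line per IP block: "<block> feature version: N, firmware version: 0x..."
_FW_RE = re.compile(
    r"^(?P<block>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text):
    """Return {block: (feature_version, firmware_version)} from a dump."""
    info = {}
    for raw in text.splitlines():
        m = _FW_RE.match(raw.strip())
        if m:  # lines such as "VBIOS version: ..." do not match and are skipped
            info[m.group("block")] = (
                int(m.group("feature")),
                int(m.group("fw"), 16),
            )
    return info


def changed_blocks(before, after):
    """List blocks whose versions differ between two parsed dumps."""
    blocks = before.keys() | after.keys()
    return sorted(b for b in blocks if before.get(b) != after.get(b))


sample = """\
ME feature version: 47, firmware version: 0x000000a2
RLC SRLC feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
VBIOS version: 113-C98121-M01
"""
parsed = parse_firmware_info(sample)
print(changed_blocks(parsed, parsed))
```

In practice the two inputs would be the contents of amdgpu_firmware_info saved before and after installing new firmware files; an empty list means no block reported a different version.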

Would umr[1] be useful here? I have not used it before, so I'd need some guidance on what arguments would produce output relevant to this hang.

Any help is appreciated.

[1] https://cgit.freedesktop.org/amd/umr/
Comment 2 Tom Seewald 2018-12-08 18:20:35 UTC
Created attachment 142754
dmesg of 4.20-rc5 with drm.debug=0xe
Comment 3 Tom Seewald 2018-12-08 18:27:07 UTC
Installed the new Polaris firmware released on December 3rd. It doesn't appear to affect my card, as the content of /sys/kernel/debug/dri/1/amdgpu_firmware_info is unchanged.

Upgraded to Mesa 18.3.0 from 18.2.4 - no change.

Added dmesg of 4.20-rc5 with drm.debug=0xe, showing the hang. It now prints backtraces of hung kernel tasks rather than "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!".

I've also included the power management information before and after the GPU hang.

/sys/kernel/debug/dri/1/amdgpu_pm_info *before* GPU hang:

Clock Gating Flags Mask: 0x3fbcf
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: On
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: On
	Host Data Path Medium Grain Clock Gating: On
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	1196 MHz (SCLK)
	387 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	993 mV (VDDGFX)
	20.30 W (average GPU)

GPU Temperature: 38 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled


/sys/kernel/debug/dri/1/amdgpu_pm_info *after* GPU hang:

Clock Gating Flags Mask: 0x6400
	Graphics Medium Grain Clock Gating: Off
	Graphics Medium Grain memory Light Sleep: Off
	Graphics Coarse Grain Clock Gating: Off
	Graphics Coarse Grain memory Light Sleep: Off
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: Off
	Graphics Run List Controller Light Sleep: Off
	Graphics 3D Coarse Grain Clock Gating: Off
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: Off
	Memory Controller Medium Grain Clock Gating: Off
	System Direct Memory Access Light Sleep: On
	System Direct Memory Access Medium Grain Clock Gating: Off
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: Off
	Unified Video Decoder Medium Grain Clock Gating: On
	Video Compression Engine Medium Grain Clock Gating: On
	Host Data Path Light Sleep: Off
	Host Data Path Medium Grain Clock Gating: Off
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: Off
	Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
	1750 MHz (MCLK)
	1196 MHz (SCLK)
	387 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	993 mV (VDDGFX)
	28.186 W (average GPU)

GPU Temperature: 42 C
GPU Load: 100 %

UVD: Disabled

VCE: Disabled
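
The two dumps above differ mostly in the clock gating section, which is easier to see as a diff. Below is a small sketch for comparing two saved copies of amdgpu_pm_info; it assumes the "Name: On/Off" line format shown above, the helper names `parse_gating_flags` and `diff_flags` are invented for illustration, and the sample strings are three lines copied from each dump.

```python
def parse_gating_flags(text):
    """Return {feature: "On" or "Off"} for the clock gating lines of a dump."""
    flags = {}
    for raw in text.splitlines():
        name, sep, state = raw.strip().rpartition(": ")
        # Only On/Off lines are gating flags; clocks, temperature, etc. are skipped.
        if sep and state in ("On", "Off"):
            flags[name] = state
    return flags


def diff_flags(before, after):
    """Return {feature: (before_state, after_state)} for flags that changed."""
    return {
        name: (state, after.get(name))
        for name, state in before.items()
        if after.get(name) != state
    }


before_text = """\
\tGraphics Medium Grain Clock Gating: On
\tSystem Direct Memory Access Light Sleep: Off
\tUnified Video Decoder Medium Grain Clock Gating: On
GPU Load: 0 %
"""
after_text = """\
\tGraphics Medium Grain Clock Gating: Off
\tSystem Direct Memory Access Light Sleep: On
\tUnified Video Decoder Medium Grain Clock Gating: On
GPU Load: 100 %
"""
changes = diff_flags(parse_gating_flags(before_text), parse_gating_flags(after_text))
for name, (old, new) in sorted(changes.items()):
    print(f"{name}: {old} -> {new}")
```

Feeding it the full before and after files would list every gating flag that flipped across the hang while leaving unchanged entries, such as the UVD and VCE clock gating lines, out of the output.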

