Created attachment 142604 [details]
dmesg showing the hang and failed gpu reset

Problem: While running RuneLite [1] with GPU acceleration enabled, the system hangs after several minutes of seemingly normal operation. Once the GPU hangs, it attempts to reset itself but fails with the following message:

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

This hang causes the system to lock up, and SSH is the only access still possible. There is no graphical corruption; the displays are simply frozen.

System information:
GPU: POLARIS11 - RX 560 4GB (1002:67ff)
Mesa: 18.0.5
X11: 1.19.6

The firmware files should be the latest, as I've pulled them from agd5f's repo [2].

Kernel parameters: "quiet splash scsi_mod.use_blk_mq=1 apparmor=2 security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off"

I have reproduced this issue on:
4.20-rc3
amd-staging-drm-next (as of commit 1179994039abc10aab0d2f0ecfc4c65dfbd77438)

[1] https://github.com/runelite/runelite
[2] https://people.freedesktop.org/~agd5f/radeon_ucode/
I can confirm this is still happening on 4.20-rc4, as well as with more up to date userspace software:

libdrm: 3.27.0
Mesa: 18.2.4

The hangs can be reliably reproduced at least as far back as kernel 4.15, so I am not confident I can bisect this.

Here is a dump of my card's firmware information in case I missed an update:

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000f0
CE feature version: 47, firmware version: 0x00000089
RLC feature version: 1, firmware version: 0x00000035
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 47, firmware version: 0x000002cb
MEC2 feature version: 47, firmware version: 0x000002cb
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
SDMA0 feature version: 31, firmware version: 0x00000036
SDMA1 feature version: 0, firmware version: 0x00000036
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C98121-M01

Would umr [1] be useful here? I have not used it before, so I'd need some guidance on what arguments would produce output relevant to this hang. Any help is appreciated.

[1] https://cgit.freedesktop.org/amd/umr/
Created attachment 142754 [details] dmesg of 4.20-rc5 with drm.debug=0xe
Installed the new Polaris firmware released on December 3rd; however, it doesn't appear to affect my card, as the content of /sys/kernel/debug/dri/1/amdgpu_firmware_info is unchanged.

Upgraded to Mesa 18.3.0 from 18.2.4 - no change.

Added a dmesg of 4.20-rc5 with drm.debug=0xe, showing the hang. It now prints hung kernel task backtraces rather than "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!".

I've also included the power management information before and after the GPU hang.

/sys/kernel/debug/dri/1/amdgpu_pm_info *before* GPU hang:

Clock Gating Flags Mask: 0x3fbcf
Graphics Medium Grain Clock Gating: On
Graphics Medium Grain memory Light Sleep: On
Graphics Coarse Grain Clock Gating: On
Graphics Coarse Grain memory Light Sleep: On
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: On
Graphics Run List Controller Light Sleep: On
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: On
Memory Controller Medium Grain Clock Gating: On
System Direct Memory Access Light Sleep: Off
System Direct Memory Access Medium Grain Clock Gating: On
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: On
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: On
Host Data Path Medium Grain Clock Gating: On
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: On
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
1750 MHz (MCLK)
1196 MHz (SCLK)
387 MHz (PSTATE_SCLK)
625 MHz (PSTATE_MCLK)
993 mV (VDDGFX)
20.30 W (average GPU)

GPU Temperature: 38 C
GPU Load: 0 %
UVD: Disabled
VCE: Disabled

/sys/kernel/debug/dri/1/amdgpu_pm_info *after* GPU hang:

Clock Gating Flags Mask: 0x6400
Graphics Medium Grain Clock Gating: Off
Graphics Medium Grain memory Light Sleep: Off
Graphics Coarse Grain Clock Gating: Off
Graphics Coarse Grain memory Light Sleep: Off
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: Off
Graphics Run List Controller Light Sleep: Off
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: Off
Memory Controller Medium Grain Clock Gating: Off
System Direct Memory Access Light Sleep: On
System Direct Memory Access Medium Grain Clock Gating: Off
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: Off
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: Off
Host Data Path Medium Grain Clock Gating: Off
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: Off
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
1750 MHz (MCLK)
1196 MHz (SCLK)
387 MHz (PSTATE_SCLK)
625 MHz (PSTATE_MCLK)
993 mV (VDDGFX)
28.186 W (average GPU)

GPU Temperature: 42 C
GPU Load: 100 %
UVD: Disabled
VCE: Disabled
Created attachment 143107 [details] amd-drm-staging-next dmesg as of January 14th 2019
Created attachment 143108 [details] UMR wave dump as of January 14th 2019
Created attachment 143109 [details] UMR gfx ring dump as of January 14th 2019
Created attachment 143110 [details] UMR gpu info
I've reproduced this issue on amd-staging-drm-next and have attached a UMR wave dump and GFX ring dump, along with a new dmesg.

To clarify, this issue also prevents me from rebooting/shutting down my computer, and I am forced to hold the power button.

Here are the version strings of the relevant software I'm running:

Kernel: amd-staging-drm-next (commit: d2d07f246b126b23d02af0603b83866a3c3e2483)
Mesa: 18.3.1
Xorg: 1.19.6
UMR: 016bc2e93af2cac7a9bd790f7fcacb1ffdadc819

This is my first attempt at using UMR to get information about this system hang; I'm essentially just copying what Andrey Grodzovsky suggested in a previous thread [0]. Here are the umr commands used to gather the information:

Wave dump: umr -i 1 -O verbose,halt_waves -wa
GFX ring dump: umr -i 1 -O verbose,follow -R gfx[.]
GFX info: umr -i 1 -e

I've attached the output of these to the bugzilla report.

[0] https://lists.freedesktop.org/archives/amd-gfx/2018-December/029790.html
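For anyone else reproducing this, here is a minimal shell sketch of collecting the same dumps in one go (it assumes the card is DRI instance 1 as above, that umr is in PATH, and that it is run as root; the output directory name is just an example):

#!/bin/sh
# Minimal sketch: collect the three umr dumps plus dmesg after a hang.
# Assumes DRI instance 1 (as on this system) and umr available in PATH;
# run as root since umr and the dri debugfs files require it.
out="gpu-hang-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"
umr -i 1 -O verbose,halt_waves -wa     > "$out/waves.txt"    2>&1
umr -i 1 -O verbose,follow -R 'gfx[.]' > "$out/ring-gfx.txt" 2>&1
umr -i 1 -e                            > "$out/gpu-info.txt" 2>&1
dmesg                                  > "$out/dmesg.txt"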
I temporarily upgraded to Xorg 1.20, and the issue still occurs.
Created attachment 143113 [details] dmesg with xorg 1.20, kernel 5.0-rc2
The reset was actually successful. The problem is, userspace components need to be aware of the reset and recreate their contexts. As a workaround, you can kill the problematic app or restart X.
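For example (the process and display-manager unit names below are only illustrative and will vary per system):

# Example only: restart the hung GL client so it recreates its contexts
# (RuneLite runs inside a JVM; adjust the pattern for your setup)...
pkill -f runelite
# ...or restart X entirely via the display manager (the unit name varies:
# gdm, sddm, lightdm; many systemd distros alias it as display-manager.service):
sudo systemctl restart display-manager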
(In reply to Alex Deucher from comment #11)
> The reset was actually successful. The problem is, userspace components
> need to be aware of the reset and recreate their contexts. As a workaround,
> you can kill the problematic app or restart X.

Hmm, but then why will the machine not restart unless I use sysrq keys? I would think a userspace issue wouldn't cause hung kernel tasks like that.

I'm also curious why this program causes the GPU to reset in the first place; I have not seen others reporting issues with this program on other platforms.

Is this "ring gfx timeout" purely a problem with userspace? e.g.

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=32203, emitted seq=32205
The wave dump seems to be empty... Is that the complete output? Was there anything printed to stderr (like there are no waves)?
(In reply to Tom St Denis from comment #13)
> The wave dump seems to be empty... Is that the complete output? Was there
> anything printed to stderr (like there are no waves)?

Yes, it says "no active waves!" - so it makes sense that it is empty. Is there something else you'd like me to try? Currently I'm running "umr -i 1 -O verbose,halt_waves -wa" immediately after I see the "ring gfx timeout" in dmesg. I also just rebuilt UMR, so I should be up to date.

Some potentially good news, though: after upgrading from Mesa 18.3.1 to 18.3.3, I have not been able to reproduce the issue. On Mesa 18.3.1 and earlier I can reproduce it in under 20 seconds (I did so today on the latest amd-staging-drm-next), whereas I have now tested Mesa 18.3.3 for about an hour.

But I believe this is still something to look into, as userspace should probably not be able to hang the entire system, even if the user is running an older version of Mesa.
If you can't reproduce on a newer version of mesa then it's "been fixed" :-)

Blocking shutdown is simply due to the device deinit being blocked because the device is not in an operational state. Not much to be done from a driver point of view, I don't think.
(In reply to Tom St Denis from comment #15)
> If you can't reproduce on a newer version of mesa then it's "been fixed" :-)

My (probably incorrect) understanding is roughly this:

     +-----------------+
1.)  |   Application   |
     +-----------------+
              |
              | Possibly sending bad commands/calls to Mesa
              v
     +-----------------+
2.)  |      Mesa       |
     +-----------------+
              |
              | Passing on bad calls from the application
              |   or
              | There is a bug in Mesa itself where it is sending
              | bad calls/commands to the kernel
              v
     +-----------------+
3.)  |  Kernel/amdgpu  |
     +-----------------+
              |
              | amdgpu puts the physical device in a bad state
              | due to bad commands from Mesa
              v
     +-----------------+
4.)  |       GPU       |
     +-----------------+

Given that mesa 18.3.3+ "fixes" the issue, it sounds like a specific case of mesa sending garbage to the kernel (step 2 to 3) has been fixed.

But in general shouldn't the kernel driver (ideally) be able to handle mesa passing malformed/bad commands rather than freezing the device (step 3 to 4)? I understand not every case can be covered, and I also understand that GPU resets need to be supported in user space for seamless recovery, but shouldn't the driver "unstick" itself enough so the computer can be rebooted normally?

Thanks for your time and patience.
(In reply to Tom Seewald from comment #16)
> But in general shouldn't the kernel driver (ideally) be able to handle mesa
> passing malformed/bad commands rather than freezing the device (step 3 to
> 4)? I understand not every case can be covered, and I also understand that
> GPU resets need to be supported in user space for seamless recovery, but
> shouldn't the driver "unstick" itself enough so the computer can be rebooted
> normally?

These are not generally bad data from mesa per se. There's not really a good way to validate whether all combinations of state sent to the GPU are valid or not. There are hundreds of registers and state buffers that the GPU uses to process the 3D pipeline. It's impossible to test every combination of state, dispatch, and ordering.

The hangs are generally due to a deadlock in the hw caused by a bad interaction of states set by the application. E.g., some hw block is waiting on a signal from another hw block which won't get sent because the user sent another state update which stops that signal.

The GPU reset should generally be able to recover the GPU, but in some cases you may end up with a deadlock in sw in the kernel somewhere.
Thanks Tom and Alex, I'll trust your judgement on this.