Bug 107991

Summary: RX580 ~ ring gfx timeout ~ particular shaders created by a dolphin-emu game can crash AMDGPU, with both RadeonSI and RADV ~ attached apitrace for RadeonSI
Product: Mesa Reporter: Kyle De'Vir <kyle.devir>
Component: Other Assignee: mesa-dev
Status: RESOLVED MOVED QA Contact: mesa-dev
Severity: major    
Priority: medium CC: glencoesmith, john.ettedgui, keramidasceid, kyle.devir, PrincessRiikka, robert.stolarz
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments: dolphin-emu apitrace
apitrace using twilight princess
twilight princess savestate

Description Kyle De'Vir 2018-09-19 18:29:00 UTC
Created attachment 141652 [details]
dolphin-emu apitrace

I have successfully captured an apitrace that intercepted the crashy frame before the driver crashed.

The issues occur at very specific parts of Fire Emblem: Path of Radiance, where the frames / shaders / textures generated by dolphin-emu cause the driver to crash.
Comment 1 Kyle De'Vir 2018-09-19 18:33:24 UTC
I'm running Mesa master compiled with LLVM master, and kernel 4.18.8.
Comment 2 Michel Dänzer 2018-09-20 08:04:33 UTC
FWIW, the apitrace doesn't hang the Bonaire card in my development box.
Comment 3 Kyle De'Vir 2018-09-20 08:07:52 UTC
Hi Michel,

What kernel and Mesa drivers, and versions, are you using?
Comment 4 Michel Dänzer 2018-09-20 08:11:12 UTC
(In reply to kyle.devir from comment #3)
> What kernel and Mesa drivers, and versions, are you using?

amdgpu from kernel 4.18.8 + DRM changes for 4.19, Mesa Git master + LLVM trunk.

It's likely a GPU specific issue in radeonsi.
Comment 5 Kyle De'Vir 2018-09-20 08:20:48 UTC
> amdgpu from kernel 4.18.8 + DRM changes for 4.19

So, stable upstream plus patches, or is this the AMD Staging kernel?

> It's likely a GPU specific issue in radeonsi.

I wonder if there's any way to narrow down this issue any further?

Apart from testing with your kernel config, anyways.
Comment 6 Rob Stolarz 2018-10-01 01:54:26 UTC
I get the same crash (drm amdgpu ring gfx timeout) on an RX 480 in The Legend of Zelda: Twilight Princess (GameCube) with any combination of Arch's linux and linux-git packages, regular Mesa packages + vulkan-radeon and mesa-git, libdrm and libdrm-git, and dolphin-emu and dolphin-emu-git. This crash happens in both the Vulkan and OpenGL render modes. It's also easy to reproduce: just start a new game and watch the first cutscene to completion.

Kyle's apitrace crashes my card as well.

I don't mind getting more data about this, I just don't know how to capture it.

If this happens in both Vulkan and OpenGL render modes, is it really a radeonsi issue?
Comment 7 Kyle De'Vir 2018-10-01 03:54:46 UTC
Ah, that'
Comment 8 Kyle De'Vir 2018-10-01 04:01:00 UTC
That's right. Thanks for reminding me...

I attached an apitrace for the OpenGL backend; however, the issue also happens on RADV, so this cannot be a RadeonSI-specific issue.

In other threads on the "ring gfx timeout" issue, one developer mused that Mesa might be issuing malformed commands to AMDGPU during very particular events, resulting in the crash.
Comment 9 Kyle De'Vir 2018-10-01 04:08:40 UTC
Hi Rob,

You can capture a dolphin-emu apitrace with:

MESA_EXTENSION_OVERRIDE="-GL_AMD_pinned_memory -GL_ARB_buffer_storage" apitrace trace dolphin-emu

You need to override these because apitrace doesn't like them.

I got my original advice from here: https://forums.dolphin-emu.org/Thread-dumping-shaders-to-diagnose-gpu-kernel-driver-crashes

Just make a save state right before where it crashes, exit, then run the above and load the save state. apitrace should capture the crashy frame before the driver carks it, and REISUB should let you properly commit the trace to disk.

Then, try replaying it to see if it crashes. If so, compress it with the heaviest level of XZ compression you can, because it'll be rather large.

I'll try running it on my system, to see if it crashes. ;)
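
The capture, replay, and compress steps above can be sketched as a few shell helpers. This is only a sketch under assumptions: it assumes apitrace and xz are installed, it uses apitrace's `--output` option to pick the trace file name, and the file name `fe-por.trace` in the usage note is a made-up placeholder, not something from this report.

```shell
#!/bin/sh
# Hypothetical helpers for the capture / replay / compress workflow.
# Nothing runs until one of the functions is called.

capture_trace() {
    # $1: trace file to write.
    # Uses the same extension override as the command above,
    # because apitrace doesn't like these two extensions.
    MESA_EXTENSION_OVERRIDE="-GL_AMD_pinned_memory -GL_ARB_buffer_storage" \
        apitrace trace --output="$1" dolphin-emu
}

replay_trace() {
    # $1: trace file to replay. If the trace is a good reproducer,
    # this is expected to hang the GPU again, so save your work first.
    apitrace replay "$1"
}

compress_trace() {
    # $1: trace file to compress.
    # -9e is xz's heaviest preset; -k keeps the uncompressed original.
    xz -9e -k "$1"
}
```

For example, `capture_trace fe-por.trace`, then `replay_trace fe-por.trace`, and if the replay also hangs the GPU, `compress_trace fe-por.trace` leaves `fe-por.trace.xz` next to the original, ready to attach.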
Comment 10 Rob Stolarz 2018-10-02 00:59:02 UTC
Created attachment 141828 [details]
apitrace using twilight princess
Comment 11 Rob Stolarz 2018-10-02 01:02:24 UTC
Created attachment 141829 [details]
twilight princess savestate

this is a savestate of gc twilight princess running on dolphin 5.0-8775 that reproduces the crash using both the vulkan and opengl renderers
Comment 12 Rob Stolarz 2018-10-02 01:10:42 UTC
I added a smaller apitrace that reproduces the issue under OpenGL; I wasn't able to use vktrace to capture the Vulkan GPU crash. vktrace seems to produce traces that only segfault when my GPU crashes (even on version 1.1.82.0 of vulkantools). To that end, I've added a Dolphin savestate that anyone with a copy of the NTSC GC version of The Legend of Zelda: Twilight Princess can use to replicate the issue in either the Vulkan or OpenGL renderers.

Thanks for the detailed capture instructions, Kyle.
Comment 13 John 2018-11-18 02:21:09 UTC
This seems the same as my bug https://bugs.freedesktop.org/show_bug.cgi?id=108771, also using Dolphin.

I hope we can get some information.
Comment 14 Kyle De'Vir 2018-11-18 07:39:39 UTC
Hi John,

Out of morbid curiosity, can you see if my attached apitrace causes a freeze?

As of 4.19.2, it no longer causes freezing on my RX 580, but I was reluctant to close this issue just in case others are suffering from the issue, like you.
Comment 15 Rob Stolarz 2018-11-19 00:41:59 UTC
I too am no longer seeing crashes on any apitrace file here or from the savestate, as of linux 4.19.2 / mesa 19.0.0_devel.105705.b4380cb070. Kyle, would you agree that this is resolved?
Comment 16 Kyle De'Vir 2018-11-19 05:14:11 UTC
Well, not quite.

John, from above, is still having issues, apparently, so I'm happy to wait.
Comment 17 John 2018-12-05 03:59:34 UTC
My apologies for the delay, I had not seen your question before.

Your trace replays fine here, but truthfully I did not try it before so I don't know if anything changed on my end.

My dolphin save still crashes the system, and I'm guessing it's related to your issue somehow, but I can continue that on the bug I created. Either way is fine with me.

Thank you for waiting!

(In case you're curious, here's my trace: https://mega.nz/#!plBngY4B!zQ8P24a84PsHWym-5hAGUMjiMKv1CKQB7EFnlPorrx4 I used the command you provided here, but the trace ended up all black; it still freezes the system, though.)
Comment 18 e88z4 2018-12-06 04:00:21 UTC
Hi Everyone, 

I can confirm that this bug is still affecting the amdgpu driver. I ran a yuzu-canary build playing Super Mario Odyssey. The bug can be reproduced very consistently at the beginning of the game, when Mario jumps for the first time after being woken up by Cappy.

My system is RX580
Kernel 4.19.7
I tried different combinations of Mesa 18.2.6 / 18.3.0-rc6 / 19.0-development.

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77668, emitted seq=77671
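
When comparing reports like the one above, it can help to pull the sequence numbers out of the kernel log. The following is a minimal sketch, assuming the log line keeps the format quoted above; the function name `ring_timeout_gap` is made up for illustration, and reading the difference as "fences emitted but not yet signaled" is an interpretation, not something stated in this report.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and, for each
# amdgpu "ring ... timeout" line, print the ring name, both sequence
# numbers, and the difference between them.
ring_timeout_gap() {
    sed -n 's/.*ring \([a-z0-9_.]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/\1 \2 \3/p' |
    while read -r ring signaled emitted; do
        echo "ring=$ring signaled=$signaled emitted=$emitted outstanding=$((emitted - signaled))"
    done
}
```

Typical use would be `dmesg | ring_timeout_gap`. For the line quoted above it prints `ring=gfx signaled=77668 emitted=77671 outstanding=3`.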
Comment 19 GitLab Migration User 2019-09-18 20:18:55 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/927.
