Created attachment 141652
I have successfully captured an apitrace that intercepted the crashy frame before the driver crashed.
The issues occur at very specific parts of Fire Emblem: Path of Radiance, where the frames / shaders / textures generated by dolphin-emu cause the driver to crash.
I'm running Mesa master compiled with LLVM master, and kernel 4.18.8.
FWIW, the apitrace doesn't hang the Bonaire card in my development box.
What kernel and Mesa drivers, and versions, are you using?
(In reply to kyle.devir from comment #3)
> What kernel and Mesa drivers, and versions, are you using?
amdgpu from kernel 4.18.8 + DRM changes for 4.19, Mesa Git master + LLVM trunk.
It's likely a GPU specific issue in radeonsi.
> amdgpu from kernel 4.18.8 + DRM changes for 4.19
So, stable upstream plus patches, or is this the AMD Staging kernel?
> It's likely a GPU specific issue in radeonsi.
I wonder if there's any way to narrow down this issue any further?
Apart from testing with your kernel config, anyways.
I get the same crash (drm amdgpu ring gfx timeout) on an RX 480 in The Legend of Zelda: Twilight Princess (GameCube) with any combination of Arch's linux and linux-git packages, the regular Mesa packages plus vulkan-radeon and mesa-git, libdrm and libdrm-git, and dolphin-emu and dolphin-emu-git. This crash happens in both the Vulkan and OpenGL render modes. It's also easy to reproduce: just start a new game and watch the first cutscene to completion.
Kyle's apitrace crashes my card as well.
I don't mind getting more data about this, I just don't know how to capture it.
If this happens in both Vulkan and OpenGL render modes, is it really a radeonsi issue?
That's right. Thanks for reminding me...
I attached an apitrace for the OpenGL backend; however, the issue also happens on RADV, so this cannot be a RadeonSI-specific issue.
In other threads on the "ring gfx timeout" issue, one developer mused that Mesa might be issuing malformed commands to AMDGPU during very particular events, resulting in the crash.
You can capture a dolphin-emu apitrace with:
MESA_EXTENSION_OVERRIDE="-GL_AMD_pinned_memory -GL_ARB_buffer_storage" apitrace trace dolphin-emu
You need to disable these two extensions because apitrace doesn't handle them.
I got my original advice from here: https://forums.dolphin-emu.org/Thread-dumping-shaders-to-diagnose-gpu-kernel-driver-crashes
Just make a save state right before the point where it crashes, exit, then run the above and load the save state. apitrace should capture the crashy frame before the driver carks it, and the Magic SysRq REISUB sequence should then let you sync the trace to disk and reboot cleanly.
Then, try replaying it to see if it crashes. If so, compress it with the heaviest level of XZ compression you can, because it'll be rather large.
I'll try running it on my system to see if it crashes. ;)
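The capture, replay and compress steps above can be sketched as a small shell helper. This is only an illustration: the trace file name, the function names and the DRY_RUN switch are made up here, and it assumes apitrace, xz and dolphin-emu are on your PATH.

```shell
#!/bin/sh
# Sketch of the capture / replay / compress workflow.
# TRACE_FILE, DRY_RUN and the function names are hypothetical helpers,
# not part of apitrace or Dolphin.
TRACE_FILE="${TRACE_FILE:-dolphin-crash.trace}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Trace Dolphin with the two problematic extensions disabled.
capture_trace() {
    run env MESA_EXTENSION_OVERRIDE="-GL_AMD_pinned_memory -GL_ARB_buffer_storage" \
        apitrace trace --output="$TRACE_FILE" dolphin-emu
}

# Replay the captured trace to check whether it reproduces the hang.
replay_trace() {
    run apitrace replay "$TRACE_FILE"
}

# Compress with the strongest xz preset, keeping the original file.
compress_trace() {
    run xz -9e --keep "$TRACE_FILE"
}
```

After sourcing the file, call capture_trace, then replay_trace, then compress_trace. Setting DRY_RUN=1 first only prints the commands, which is handy for checking them before risking another hang.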
Created attachment 141828
apitrace using twilight princess
Created attachment 141829
twilight princess savestate
this is a savestate of gc twilight princess running on dolphin 5.0-8775 that reproduces the crash using both the vulkan and opengl renderers
I added a smaller apitrace that reproduces the issue under OpenGL; I wasn't able to use vktrace to capture the Vulkan GPU crash. vktrace seems to produce traces that only segfault when my GPU crashes (even on version 126.96.36.199 of vulkantools). To that end, I've added a Dolphin savestate that anyone with a copy of the NTSC GC version of The Legend of Zelda: Twilight Princess can use to replicate the issue in either the Vulkan or OpenGL renderers.
Thanks for the detailed capture instructions, Kyle.
This seems the same as my bug https://bugs.freedesktop.org/show_bug.cgi?id=108771, also using Dolphin.
I hope we can get some information.
Out of morbid curiosity, can you see if my attached apitrace causes a freeze?
As of 4.19.2, it no longer causes freezing on my RX 580, but I was reluctant to close this issue just in case others are suffering from the issue, like you.
I too am no longer seeing crashes on any apitrace file here or from the savestate, as of linux 4.19.2 / mesa 19.0.0_devel.105705.b4380cb070. Kyle, would you agree that this is resolved?
Well, not quite.
John, from above, is still having issues, apparently, so I'm happy to wait.
My apologies for the delay, I had not seen your question before.
Your trace replays fine here, but truthfully I did not try it before so I don't know if anything changed on my end.
My dolphin save still crashes the system, and I'm guessing it's related to your issue somehow, but I can continue that on the bug I created. Either way is fine with me.
Thank you for waiting!
(In case you're curious, here's my trace: https://mega.nz/#!plBngY4B!zQ8P24a84PsHWym-5hAGUMjiMKv1CKQB7EFnlPorrx4 I used the command you provided here, but it ended up all black; it still freezes the system, though.)
I can confirm that this bug still affects the amdgpu driver. I ran a yuzu-canary build playing Super Mario Odyssey. The bug can be reproduced very consistently at the beginning of the game, when Mario jumps for the first time after being woken up by Cappy.
My GPU is an RX 580.
I tried different combinations of Mesa 18.2.6/18.3.0-rc6/19.0-development.
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77668, emitted seq=77671
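For anyone collecting more reports, a small filter can pull these timeout lines out of a kernel log. This is a rough sketch: the function name is made up, it assumes the message format matches the line above, and the "pending" figure is just the difference between the two sequence numbers, which I take to be the jobs submitted but not yet completed.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and summarise
# every amdgpu "ring ... timeout" message found.
parse_ring_timeouts() {
    # Keep only matching lines, reduced to "<ring> <signaled> <emitted>".
    sed -n 's/.*ring \([^ ]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/\1 \2 \3/p' |
    while read -r ring signaled emitted; do
        printf '%s signaled=%s emitted=%s pending=%s\n' \
            "$ring" "$signaled" "$emitted" "$((emitted - signaled))"
    done
}
```

After sourcing it, something like `dmesg | parse_ring_timeouts` should work on a live system. After a hard hang you would need the previous boot's kernel log instead, for example from a persistent journal.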