When playing Minecraft, being in a certain area of my world at night causes my GPU to hang. I'm using OptiFine and Sildur's shaders.

Sep 12 01:38:42 xxx kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=19965, emitted seq=19967
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 1375 thread java:cs0 pid 1433

CPU: 3700X
GPU: Sapphire 5700XT (reference)
Motherboard: Gigabyte X570-I (BIOS F4)
Kernel: 5.3.0-rc8-mainline
Mesa: 19.3.0_devel.115190.f83f9d7daa0
LLVM: 10.0.0_r326348.d7d8bb937ad
OpenGL string (as seen in-game): 4.5 (Compatibility Profile) Mesa 19.3.0-devel (git-f83f9d7daa), X.Org, AMD NAVI10 (DRM 3.33.0, 5.3.0-rc8-mainline, LLVM 10.0.0)

I get the hang extremely reliably in this specific spot at night, but of the apitraces I captured, only this one recreates the hang on replay. Apologies for the file size.
https://drive.google.com/open?id=16wAmCa27o2xxv3bFXnR6rGXAum0Wci_5

When the hang occurs, my screen freezes but everything keeps running in the background, and I have to use the REISUB (Magic SysRq) key sequence to reboot. It occurs with the PCIe mode set to either 4.0 or 3.0 in the BIOS.

Please let me know if any more info is needed. Thank you.
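When collecting logs from several hangs, the ring name and fence sequence numbers in the timeout line can be pulled out with a small script. The following is a hypothetical helper (not part of amdgpu or any existing tool) that assumes the exact message format quoted above. The difference between the emitted and signaled sequence numbers is the number of fences that had been emitted on that ring but not yet signaled when the timeout fired.

```python
import re

# Assumes the amdgpu job-timeout message format quoted in this report, e.g.
# "... *ERROR* ring gfx_0.0.0 timeout, signaled seq=19965, emitted seq=19967"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return (ring, signaled, emitted, pending) for every timeout line found."""
    results = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m is None:
            continue  # not a ring-timeout line
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        # Fences emitted on the ring but not yet signaled when the timeout fired.
        results.append((m.group("ring"), signaled, emitted, emitted - signaled))
    return results


log = [
    "Sep 12 01:38:47 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=19965, emitted seq=19967",
    "Sep 12 01:38:47 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "Process information: process java pid 1375 thread java:cs0 pid 1433",
]
print(parse_ring_timeouts(log))
```

For the log above this reports two outstanding fences on gfx_0.0.0.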
Thanks for the bug report and the trace. I can reproduce the hang. There's always a page fault before it, e.g.:

amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32772, for process glretrace pid 8616 thread glretrace:cs0 pid 8617)
amdgpu 0000:0b:00.0: in page starting at address 0x0000000000f03000 from client 27
amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301031
amdgpu 0000:0b:00.0: MORE_FAULTS: 0x1
amdgpu 0000:0b:00.0: WALKER_ERROR: 0x0
amdgpu 0000:0b:00.0: PERMISSION_FAULTS: 0x3
amdgpu 0000:0b:00.0: MAPPING_ERROR: 0x0
amdgpu 0000:0b:00.0: RW: 0x0

I haven't found the root cause yet.
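The individual fields the kernel prints are just bit fields of the GCVM_L2_PROTECTION_FAULT_STATUS word, so they can be recomputed from 0x00301031. The sketch below assumes the gfx10 bit layout (MORE_FAULTS at bit 0, WALKER_ERROR at bits 1-3, PERMISSION_FAULTS at bits 4-7, MAPPING_ERROR at bit 8, RW at bit 18, VMID at bits 20-23); that layout is an assumption to verify against the kernel's register headers. With it, the decoded values match the ones printed above, and the VMID field comes out as 3, matching vmid:3 in the first fault line.

```python
# ASSUMED gfx10 (Navi10) layout of GCVM_L2_PROTECTION_FAULT_STATUS:
# field name -> (bit shift, mask after shifting). Verify against the
# kernel's register headers before relying on it.
FAULT_STATUS_FIELDS = {
    "MORE_FAULTS": (0, 0x1),
    "WALKER_ERROR": (1, 0x7),
    "PERMISSION_FAULTS": (4, 0xF),
    "MAPPING_ERROR": (8, 0x1),
    "RW": (18, 0x1),
    "VMID": (20, 0xF),
}


def decode_fault_status(status):
    """Split a raw fault-status word into its named bit fields."""
    return {
        name: (status >> shift) & mask
        for name, (shift, mask) in FAULT_STATUS_FIELDS.items()
    }


decoded = decode_fault_status(0x00301031)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```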
The kernel patch from https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 seems to prevent the hang here. Could you try it as well and report the results?
Thanks for the response. Unfortunately, it's still hanging. While the patch lets me replay the first apitrace without a hang, I'm still hanging in the same spot in-game, with the same messages in journalctl. I've captured a new apitrace that recreates the hang for me with the patch applied:
https://drive.google.com/open?id=1WMeuCoZnOOqD0Tbjix6nNpFyVkzzbd94

As suggested in the other thread, AMD_DEBUG=nodma seems to prevent the hang. I'm not sure whether you can see it in the apitrace, but there are usually some artifacts shortly before the hang: stretched vertices and sheep textures turning blue. These are also absent with nodma.

It's worth noting that I'm also getting some general desktop instability and SDMA hangs like in the other thread you linked. While compiling the kernel patch I got a hang trying to watch a video in Firefox (this has happened a couple of times before), and previously I've also gotten hangs while loading Half-Life 2 maps and closing GIMP. I'm not sure whether any of these are related; they happen so irregularly that I haven't been able to reproduce them or capture apitraces. Occasionally images on web pages load corrupted and don't display, though I can't tell whether this is a GPU problem or a browser/network problem.

The card works great on my Windows dual boot, so I'm fairly sure it's not a hardware problem (though there I have to use 19.7.5, as anything newer causes Firefox to blue-screen me).
Thanks for the test and the new trace. I can reproduce the hang, and it seems to go away with AMD_DEBUG=nodma. Another workaround is the kernel parameter amdgpu.vm_update_mode=3 (although this sometimes introduces another problem, see https://bugs.freedesktop.org/show_bug.cgi?id=111682).
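When testing amdgpu.vm_update_mode=3, it helps to confirm after rebooting that the parameter actually reached the driver. A minimal sketch, assuming the amdgpu module exposes the parameter read-only at /sys/module/amdgpu/parameters/vm_update_mode (as amdgpu module parameters normally are); the helper names are hypothetical. It reports what was requested on the kernel command line and what the module reports, returning None for anything it cannot read.

```python
from pathlib import Path


def cmdline_param(cmdline, name):
    """Return the value of `name=value` on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


def vm_update_mode_status():
    """Return (requested, effective) vm_update_mode; None where unavailable."""
    cmdline = Path("/proc/cmdline")
    requested = (
        cmdline_param(cmdline.read_text(), "amdgpu.vm_update_mode")
        if cmdline.exists()
        else None
    )
    # Assumed sysfs location of the amdgpu module parameter.
    param = Path("/sys/module/amdgpu/parameters/vm_update_mode")
    effective = param.read_text().strip() if param.exists() else None
    return requested, effective


if __name__ == "__main__":
    print(vm_update_mode_status())
```

If the requested value is "3" but the module reports something else (or nothing), the parameter did not take effect.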
Another env variable to test is: AMD_DEBUG=nongg Using AMD_DEBUG=nongg and a kernel with the patch from https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 I could replay both traces multiple times without a single hang.
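To replay the traces with different RadeonSI debug flags without changing the whole session's environment, AMD_DEBUG can be set only for the replayer process. A sketch, assuming apitrace's glretrace is on PATH and that several flags can be combined in one comma-separated AMD_DEBUG value; the helper names are hypothetical.

```python
import os
import subprocess


def replay_env(flags, base=None):
    """Return a copy of the environment with AMD_DEBUG set to the given flags."""
    env = dict(os.environ if base is None else base)
    env["AMD_DEBUG"] = ",".join(flags)
    return env


def replay(trace_path, flags):
    """Replay an apitrace capture with glretrace and return its exit code."""
    # glretrace is apitrace's OpenGL replayer; assumed installed and on PATH.
    result = subprocess.run(["glretrace", trace_path], env=replay_env(flags))
    return result.returncode
```

For example, replay(path_to_trace, ["nongg"]) tests the flag suggested here, and ["nodma"] the earlier workaround. Only the child process sees the flag, so the desktop session itself is unaffected.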
Unfortunately I'm still getting the hang with the kernel patch + AMD_DEBUG=nongg, both in-game and when replaying the apitraces, with the same messages in journalctl. Not sure how useful it'll be, but I've made another apitrace with patch + nongg:
https://drive.google.com/open?id=1NSMBW-GKHMAMOjrHS_cD-CvvUkvviqx5

Is there anything more I can do to help debug this? Is there a specific firmware I should be using? Currently using:
Linux 5.3 (both rc8 and now the stable release, compiled with the patch)
llvm-git 10.0.0_r326744.bfb5b0cb86c-1
mesa-git 1:19.3.0_devel.115313.f812cbfd884-1
Latest firmware (9/13) from https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/ (previously using 7/14 from Fedora's linux-firmware)

Only AMD_DEBUG=nodma stops the hang for me. No luck with amdgpu.vm_update_mode=3 either.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1429.