Bug 111669 - Navi GPU hang in Minecraft
Summary: Navi GPU hang in Minecraft
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set major
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2019-09-12 08:11 UTC by Doug Ty
Modified: 2019-09-25 18:50 UTC

Description Doug Ty 2019-09-12 08:11:43 UTC
When playing Minecraft, being in a certain area of my world at night causes my GPU to hang. I'm using Optifine and Sildur's shaders.

Sep 12 01:38:42 xxx kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=19965, emitted seq=19967
Sep 12 01:38:47 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 1375 thread java:cs0 pid 1433
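
For anyone scanning a longer journal for the same signature, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of the amdgpu_job_timedout line above. The function names are made up for illustration. Reading the gap between "emitted" and "signaled" as the number of fences still outstanding on the ring is my assumption; the report itself does not say so.

```python
import re

# Matches the "ring ... timeout" message quoted in the log above.
# The field names come from the log text; everything else is illustrative.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted), or None if the line is not a ring timeout."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return (
        match.group("ring"),
        int(match.group("signaled")),
        int(match.group("emitted")),
    )


def outstanding_fences(line):
    """Emitted minus signaled sequence number, or None for unrelated lines.

    Assumption: this difference is the number of submitted jobs whose
    fences had not signaled when the timeout fired.
    """
    parsed = parse_ring_timeout(line)
    if parsed is None:
        return None
    _, signaled, emitted = parsed
    return emitted - signaled
```

Fed the gfx_0.0.0 line from this report, the sketch yields a difference of 2 (19967 emitted, 19965 signaled).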


CPU: 3700X
GPU: Sapphire 5700XT (reference)
Motherboard: Gigabyte X570-I (BIOS F4)
Kernel: 5.3.0-rc8-mainline
Mesa: 19.3.0_devel.115190.f83f9d7daa0
LLVM: 10.0.0_r326348.d7d8bb937ad
OpenGL string (as seen ingame): 4.5 (Compatibility Profile) Mesa 19.3.0-devel (git-f83f9d7daa), X.Org, AMD NAVI10 (DRM 3.33.0, 5.3.0-rc8-mainline, LLVM 10.0.0)

I get the hang extremely reliably when in this specific spot at night, but only this one apitrace recreates the hang when I replay it. Apologies for the filesize.

https://drive.google.com/open?id=16wAmCa27o2xxv3bFXnR6rGXAum0Wci_5

When the hangs occur, my screen freezes but everything is still running in the background, and I need to use REISUB hotkeys in order to reboot. Occurs with both PCIe 4.0 and 3.0 set in the BIOS.

Please let me know if any more info is needed.
Thank you.
Comment 1 Pierre-Eric Pelloux-Prayer 2019-09-12 13:50:42 UTC
Thanks for the bug report and the trace.

I can reproduce the hang. There's always a page fault just before it, e.g.:

amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32772, for process glretrace pid 8616 thread glretrace:cs0 pid 8617)
amdgpu 0000:0b:00.0:   in page starting at address 0x0000000000f03000 from client 27
amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301031
amdgpu 0000:0b:00.0: 	 MORE_FAULTS: 0x1
amdgpu 0000:0b:00.0: 	 WALKER_ERROR: 0x0
amdgpu 0000:0b:00.0: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:0b:00.0: 	 MAPPING_ERROR: 0x0
amdgpu 0000:0b:00.0: 	 RW: 0x0
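
As a cross-check on the decoded fields, here is a small Python sketch that splits the raw status word into the values the kernel printed. The bit layout in FIELDS is an assumption on my part and is not stated anywhere in this report. I picked it because it reproduces the five printed fields, and the VMID it yields agrees with vmid:3 on the first line of the fault message.

```python
# Assumed (shift, width) layout of GCVM_L2_PROTECTION_FAULT_STATUS.
# This is NOT taken from the report; it is a guess that happens to
# reproduce every decoded value in the log excerpt above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into named bitfields under the assumed layout."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }
```

With status 0x00301031 this gives MORE_FAULTS 0x1, WALKER_ERROR 0x0, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x0 and RW 0x0, matching the log.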

I haven't found the root cause yet.
Comment 2 Pierre-Eric Pelloux-Prayer 2019-09-13 12:17:34 UTC
The kernel patch from https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 seems to prevent the hang here.

Could you try it as well and report the results?
Comment 3 Doug Ty 2019-09-13 13:54:29 UTC
Thanks for the response.
Still hanging, unfortunately.

While the patch now lets me replay the first apitrace without a hang, the game still hangs in the same spot, with the same messages in journalctl.

I've captured a new apitrace that recreates the hang with the patch for me.

https://drive.google.com/open?id=1WMeuCoZnOOqD0Tbjix6nNpFyVkzzbd94

As suggested in the other thread, AMD_DEBUG=nodma seems to successfully prevent the hang. I'm unsure whether you can see it in the apitrace, but there are usually some artifacts shortly before the hang: stretched vertices and sheep textures turning blue. These are also absent with nodma.


It's worth noting that I am also getting some general desktop instability and sdma hangs, as in the other thread you linked. While compiling the kernel patch I got a hang trying to watch a video in Firefox (this has happened a couple of times before), and previously I've also gotten hangs while loading Half Life 2 maps and closing GIMP. I'm not sure whether any of these are related. They happen so irregularly that I've been unable to reproduce them or capture apitraces for them. Occasionally, images on web pages also load corrupted and fail to display, though I can't tell if this is a GPU problem or a browser/network problem.

The card works great on my Windows dual boot, so I'm pretty sure it's not a hardware problem (though I have to use 19.7.5, as anything newer causes Firefox to blue screen me).
Comment 4 Pierre-Eric Pelloux-Prayer 2019-09-13 15:49:57 UTC
Thanks for the test and new trace.

I can reproduce the hang and it seems to go away with AMD_DEBUG=nodma.

Another workaround is the kernel parameter amdgpu.vm_update_mode=3, although it sometimes introduces another problem (see https://bugs.freedesktop.org/show_bug.cgi?id=111682).
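
To confirm that the parameter actually reached the kernel, here is a rough Python sketch that looks for it on a kernel command line string. The helper names are hypothetical. The meaning of the bits (bit 0 for graphics, bit 1 for compute, so 3 means both use CPU-based VM updates) is from my recollection of the amdgpu module parameter description and should be treated as an assumption.

```python
def vm_update_mode_from_cmdline(cmdline):
    """Return the value of amdgpu.vm_update_mode on a command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "amdgpu.vm_update_mode" and sep:
            # Base 0 accepts both decimal ("3") and hex ("0x3") spellings.
            return int(value, 0)
    return None


def describe_vm_update_mode(mode):
    """Interpret the value as a bitmask.

    Assumption: bit 0 selects CPU page table updates for graphics and
    bit 1 selects them for compute.
    """
    return {"graphics_cpu": bool(mode & 1), "compute_cpu": bool(mode & 2)}
```

On a running system the string could be read from /proc/cmdline and passed to vm_update_mode_from_cmdline.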
Comment 5 Pierre-Eric Pelloux-Prayer 2019-09-16 12:40:58 UTC
Another env variable to test is: AMD_DEBUG=nongg

Using AMD_DEBUG=nongg and a kernel with the patch from https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 I could replay both traces multiple times without a single hang.
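
For repeating this kind of test, here is a minimal Python sketch that replays a trace several times with a chosen AMD_DEBUG value. The function names and the default run count are made up for illustration; glretrace is the replayer that shows up in the page fault log in comment 1. Note that a GPU hang will not necessarily show up in the exit code, so the kernel log still has to be checked after each run.

```python
import os
import subprocess


def replay_env(amd_debug=None, base_env=None):
    """Copy an environment, setting AMD_DEBUG or clearing it when amd_debug is None."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("AMD_DEBUG", None)
    if amd_debug:
        env["AMD_DEBUG"] = amd_debug
    return env


def replay_command(trace_path):
    """Command line for replaying one apitrace capture."""
    return ["glretrace", trace_path]


def replay(trace_path, amd_debug=None, runs=3):
    """Replay the trace `runs` times and return the list of exit codes.

    The exit code alone does not prove the GPU survived; check the
    kernel log (for example with journalctl) after each run.
    """
    codes = []
    for _ in range(runs):
        result = subprocess.run(replay_command(trace_path), env=replay_env(amd_debug))
        codes.append(result.returncode)
    return codes
```

For example, replay("minecraft.trace", amd_debug="nongg") would run a (hypothetically named) trace three times with NGG disabled.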
Comment 6 Doug Ty 2019-09-17 00:46:01 UTC
Unfortunately, I'm still getting the hang with the kernel patch + AMD_DEBUG=nongg, both ingame and when replaying the apitraces. Same messages in journalctl.

Not sure how useful it'll be, but I've made another apitrace with patch + nongg:
https://drive.google.com/open?id=1NSMBW-GKHMAMOjrHS_cD-CvvUkvviqx5

Is there anything more I can do to help debug this? A specific firmware I should be using?

Currently using:
Linux 5.3 (both rc8 and now stable release, compiled with the patch)
llvm-git 10.0.0_r326744.bfb5b0cb86c-1
mesa-git 1:19.3.0_devel.115313.f812cbfd884-1
Latest firmware (9/13) from https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/ (was previously using 7/14 from Fedora's linux-firmware)

Only AMD_DEBUG=nodma stops the hang for me.
No luck with amdgpu.vm_update_mode=3.
Comment 7 GitLab Migration User 2019-09-25 18:50:48 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1429.

