Hello,

I have been experiencing a worrying number of these hangs ever since I got my RX 570 a few months ago. I can reproduce the hang quite reliably with some 3D workloads: for instance, Unigine Superposition run on High quality or Witcher 3 (through WINE) crashes the GPU within minutes.

Once that happens I can always SSH into the machine and try to get at least some debugging information. Unfortunately, there does not seem to be much to go on. dmesg does not tell me more than this:

[ 254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=103742, last emitted seq=103745
[ 254.704586] [drm] IP block:gfx_v8_0 is hung!
[ 254.704629] [drm] GPU recovery disabled.

Here are a few things I have tried so far:
- Boot with amdgpu.dc=0
- Boot with amdgpu.vm_update_mode=3
- Force the GPU to max power state
- Disable IOMMU (both by iommu=off and by disabling VT-d in BIOS)
- Boot with amdgpu.gpu_recovery=1 (does not produce any additional info)

I grabbed the umr tool to try to get the state of the GPU when it crashes, but it does not seem to be able to read anything. Running:

umr -R gfx[.]

leaves me with:

[ERROR]: Could not open ring debugfs file

I checked that the entries in /sys/kernel/debug/amdgpu that look relevant are there, but cat'ing them gives me "Operation not permitted". Yes, I am doing it as root.

Once this happens the only way out is a hard reboot. I am running up-to-date Fedora 28, kernel 4.17.2, Mesa 18.0 series, LLVM 6.0.1.

Is there anything else I can do? Thanks.
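For anyone comparing hangs across several dmesg captures, the ring timeout line quoted above can be picked apart mechanically. The following is only a minimal sketch: the regular expression is written against the exact message wording in this report (other kernel versions may phrase it differently), and the names `parse_ring_timeout` and `RingTimeout` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the message format matches the line quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    """Fields extracted from one amdgpu ring timeout message."""
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        # Difference between the last emitted and last signaled sequence numbers.
        return self.emitted - self.signaled


def parse_ring_timeout(line: str) -> Optional[RingTimeout]:
    """Return the parsed timeout, or None if the line is not a ring timeout."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return RingTimeout(m["ring"], int(m["signaled"]), int(m["emitted"]))


# Sample line copied from the dmesg output in this report.
SAMPLE = (
    "[ 254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=103742, last emitted seq=103745"
)

if __name__ == "__main__":
    hit = parse_ring_timeout(SAMPLE)
    print(hit.ring, hit.signaled, hit.emitted, hit.pending)
```

For the line in this report, the difference between the emitted and signaled sequence numbers is 3.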
Can you try latest Mesa / LLVM? Please attach the corresponding Xorg log file and output of dmesg.
I remember I tried with an RC of Mesa 18.2 and kernel 4.18-rc6, which didn't help in any way. If you want me to try the latest code from git/SVN I'll see what I can do (I can't exactly mess up my production box). In the meantime, is there any way I can get some more useful debugging output?
Created attachment 141125: dmesg right after the GPU hanged
Created attachment 141126: Xorg log
Requested logs attached, I'm afraid they do not contain anything particularly revealing though. Just FTR, my exact version of mesa is 18.0.5, libdrm 2.4.93.
I believe I have the exact same or at least a very similar issue, though I have an RX 480. I can reproduce this very reliably with Witcher 3 as well, unless I use dxvk (a Vulkan-based DX11 implementation for Wine): with dxvk I can play for hours without any issues, compared to a few minutes without it. That makes me think the issue might be somewhere in the OpenGL machinery. "Normal" usage, i.e. browsing and watching videos, does occasionally trigger it too.
Relevant software versions:
linux: 4.17.14
mesa: 18.1.5
llvm: 6.0.1
I'm trying to compile current git/svn versions of llvm and mesa right now, but it will take some time. Let's see if that helps.
I don't think this is isolated to OpenGL, as I got the very same hang in the Vulkan beta of The Talos Principle, although it happened only once. If it is any help, I believe that the Unigine Superposition benchmark always crashes the GPU at a specific point during the benchmark. Reducing the image quality level to "medium" makes the benchmark finish correctly.
Reassigning this to Mesa for now; GFX ring hangs are indeed most likely triggered by userspace issues. Beware that there might be multiple separate issues with similar symptoms but different causes. It's better to track each issue separately until it's clear that some of them have the same cause. In particular, issues that can be reliably reproduced with a certain application should be kept separate from those that happen randomly.
I just tried running Witcher 3 with Wine's own DX11 implementation and svn/git versions of llvm and mesa, and it hung again.
I'm using an RX 480 and experiencing the same kind of problems. Running Unigine Superposition crashes the GPU 4 times out of 5. I can also reproduce these crashes by playing Euro Truck Simulator 2, where the crash rate depends directly on how high I set the resolution scale in the game settings: a larger scale causes crashes more often. When I boot my machine into Win10 (I'm running dual boot), everything works fine.
System info:
CPU: Intel i7-3770K
GPU: AMD RX480
Arch Linux
Linux: 4.17.14
Mesa: 18.1.6
LLVM: 6.0.1
Just out of curiosity, do either of you have a card that is supposed to have some small overclocking done by the manufacturer? My RX570 is supposed to have this and I’m wondering if it could be responsible in any way.
I'm using a reference RX480 with default clocks.
I'm having this issue too. I thought it might be 105733, but there is no vmfault in dmesg.
For the last few kernel releases I've been checking for the bug by running Obduction under Wine using dxvk; the GPU hangs before the game loads. IIRC the first time I launched Obduction it was without dxvk, and it did run.
Is there something like apitrace for Vulkan? Maybe the hang could be reproduced using one.
Asus GL702ZC, BIOS 305
CPU: Ryzen 1700
GPU: RX580
Fedora
Kernel: 4.17.14-202.fc28.x86_64
Mesa: 18.0.1
llvm: 6.0.1
@Andrew: Could you check that you can reproduce the crash with Unigine Superposition run at High or Ultra quality in 1920x1080? This is what crashes my GPU very reliably. It would be good to have some kind of freely available baseline for this. Note that U:S depends on the older OpenSSL 1.0.2 so a bit of manual library juggling is needed to get it going on F28.
Unigine locked up right at the end of the benchmark, but it also prints a vmfault:
kernel: gmc_v8_0_process_interrupt: 71 callbacks suppressed
kernel: amdgpu 0000:0c:00.0: GPU fault detected: 146 0x0048080c
kernel: amdgpu 0000:0c:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100009
kernel: amdgpu 0000:0c:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0800800C
kernel: amdgpu 0000:0c:00.0: VM fault (0x0c, vmid 4, pasid 32790) at page 1048585, read from 'TC0' (0x54433000) (8)
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1413785, last emitted seq=1413787
kernel: [drm] IP block:gfx_v8_0 is hung!
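Two of the values in the VM fault message above are the same information printed twice, which a few lines of Python can confirm. This is only a sketch for reading the log: the helper names are hypothetical, and the 4 KiB page size used to turn the page number into a byte address is an assumption, not something the log states.

```python
def decode_client_id(raw: int) -> str:
    """Read a 32-bit value as big-endian ASCII and drop trailing NUL padding."""
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", errors="replace")


def page_to_byte_address(page: int, page_size: int = 4096) -> int:
    """Convert a page number to a byte address (page size is an assumption)."""
    return page * page_size


# Values copied from the dmesg excerpt in this comment.
FAULT_ADDR = 0x00100009   # VM_CONTEXT1_PROTECTION_FAULT_ADDR
FAULT_PAGE = 1048585      # "at page 1048585"
CLIENT_RAW = 0x54433000   # "read from 'TC0' (0x54433000)"

if __name__ == "__main__":
    print(FAULT_ADDR == FAULT_PAGE)               # register value equals the page number
    print(decode_client_id(CLIENT_RAW))           # ASCII name of the faulting client
    print(hex(page_to_byte_address(FAULT_PAGE)))  # byte address under the 4 KiB assumption
```

In other words, the FAULT_ADDR register holds the page number in hex, and the client field is just the ASCII string 'TC0' packed into a 32-bit word.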
https://github.com/ValveSoftware/Proton/blob/proton_3.7/PREREQS.md#directx-11-games suggests using LLVM 7 to avoid GPU hangs. Is someone able to test that?
In addition, is it expected for userspace to be capable of hanging the GPU? This really seems like something the kernel should prevent.
I just ran a few tests with git/svn versions of LLVM 8.0 and mesa 18.3 and the problem is still there. I attached a dmesg log of the crash in Unigine Superposition. Just FTR the crash with LLVM 8.0/mesa 18.3 happens only on the Extreme settings, High settings survive without a hitch.
Created attachment 141261: dmesg log of the crash in Unigine Superposition
I tried again using the debug kernel in Fedora. I couldn't reproduce the Unigine crash. Obduction crashed in the same way, with nothing new in dmesg.
Kernel: 4.17.19-200.fc28.x86_64+debug
I ran some Unigine tests with different kernels. No crashes with 4.13.12 and older kernels. Maybe somebody could try to run these tests too and confirm this?
I just tried to run Unigine Superposition with llvm-6.0.1-7 and kernel 4.18.5 as they arrived in F28, and it finished fine twice. Witcher 3 still crashes though.
Installed this: https://copr.fedorainfracloud.org/coprs/jerbear64/mesa_dxvk/ which is Mesa 18.2, and the Obduction crash seems to have disappeared.
OK, I just tried Mesa 18.2 from the Copr suggested by Andrew, but it does not fix Witcher 3 for me. Unigine Superposition seems to have been fixed by the 4.18 kernel, as I just ran it multiple times, even with the 4K profile, and it always finished successfully. The only thing I cannot try easily is LLVM 7 because it breaks too many dependencies on my Fedora box.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1323.