107572 – Unrecoverable GPU hang with IP block:gfx_v8_0 is hung

Summary: Unrecoverable GPU hang with IP block:gfx_v8_0 is hung

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	18.2
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Reported:	2018-08-14 23:45 UTC by madcatx
Modified:	2019-09-25 18:09 UTC
CC List:	3 users

Attachments
dmesg right after the GPU hanged (73.63 KB, text/plain) 2018-08-15 20:21 UTC, madcatx
Xorg log (48.59 KB, text/plain) 2018-08-15 20:21 UTC, madcatx
dmesg log of the crash in Unigine Superposition (74.15 KB, text/plain) 2018-08-23 18:46 UTC, madcatx

Description madcatx 2018-08-14 23:45:33 UTC

Hello,

I have been experiencing a worrying number of these hangs ever since I got my RX 570 a few months ago. I can reproduce the hang quite reliably with some 3D workloads; for instance, Unigine Superposition run on High quality or Witcher 3 (through WINE) crashes the GPU within minutes.

Once that happens I can always SSH into the machine and try to get at least some debugging information. Unfortunately, there does not seem to be much to go on.

dmesg does not tell me more than this:
[  254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=103742, last emitted seq=103745
[  254.704586] [drm] IP block:gfx_v8_0 is hung!
[  254.704629] [drm] GPU recovery disabled.
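
For anyone collecting several of these logs, a minimal Python sketch for pulling the numbers out of the timeout line may help when comparing hangs. It assumes the line has exactly the format shown above; `parse_ring_timeout` is a hypothetical helper name, and reading the difference between the two sequence numbers as "jobs emitted but not yet signaled" is an interpretation, not something the log states.

```python
import re

# Matches the "ring ... timeout" line in the format quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # pending = sequence numbers emitted after the last one that signaled
    return m.group("ring"), signaled, emitted, emitted - signaled


if __name__ == "__main__":
    sample = (
        "[  254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
        "last signaled seq=103742, last emitted seq=103745"
    )
    print(parse_ring_timeout(sample))
```

For the line quoted above this yields the ring name `gfx` and a gap of 3 between the signaled and emitted sequence numbers.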

Here are a few things I have tried so far:
- Boot with amdgpu.dc=0
- Boot with amdgpu.vm_update_mode=3
- Force the GPU to max power state
- Disable IOMMU (both by iommu=off and by disabling VT-d in BIOS)
- Boot with amdgpu.gpu_recovery=1 (does not produce any additional info)
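
With boot parameters like these, one thing worth ruling out is that a value never reached the driver. A minimal Python sketch, assuming the amdgpu module exposes its parameters as readable files under /sys/module/amdgpu/parameters (common for module parameters, but not verified for the kernels in this report); `read_module_params` and its arguments are hypothetical names.

```python
from pathlib import Path


def read_module_params(names, module="amdgpu", sysfs_root="/sys/module"):
    """Return {name: value or None} for the given module parameters.

    None means the parameter file is missing or could not be read.
    """
    base = Path(sysfs_root) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (base / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    wanted = ["dc", "vm_update_mode", "gpu_recovery"]
    for name, value in read_module_params(wanted).items():
        print(f"amdgpu.{name} = {value}")
```

Comparing the printed values against the kernel command line shows whether each option was actually applied on that boot.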

I grabbed the umr tool to try to get the state of the GPU when it crashes, but it does not seem to be able to read anything. Running:

umr -R gfx[.]

Leaves me with:

[ERROR]: Could not open ring debugfs file

I checked that the entries in /sys/kernel/debug/amdgpu that look relevant are there, but cat'ing them gives me "Operation not permitted". Yes, I am doing it as root.
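
To narrow down which entries fail and with which error code, a small Python sketch along these lines could be run as root. `probe_readable` is a hypothetical helper, and the directory to probe is passed in rather than assumed. "Operation not permitted" corresponds to EPERM, whereas a plain file-mode problem would normally show up as EACCES ("Permission denied"); seeing EPERM as root may point to a kernel-level restriction rather than file permissions, but that is a guess and nothing in this report confirms it.

```python
import errno
import os
import sys


def probe_readable(directory):
    """Try to read one byte from each regular file directly under `directory`.

    Returns a list of (filename, status), where status is "ok" or the
    symbolic errno name (for example "EPERM" or "EACCES").
    """
    results = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "rb") as fh:
                fh.read(1)
            results.append((entry, "ok"))
        except OSError as exc:
            results.append((entry, errno.errorcode.get(exc.errno, str(exc.errno))))
    return results


if __name__ == "__main__":
    # Pass the debugfs directory to inspect as the first argument.
    for name, status in probe_readable(sys.argv[1]):
        print(f"{status:8} {name}")
```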

Once this happens the only way out is a hard reboot.

I am running up-to-date Fedora 28, kernel 4.17.2, Mesa 18.0 series, LLVM 6.0.1.

Is there anything else I can do?

Thanks.

Comment 1 Michel Dänzer 2018-08-15 07:58:32 UTC

Can you try latest Mesa / LLVM?

Please attach the corresponding Xorg log file and output of dmesg.

Comment 2 madcatx 2018-08-15 08:13:31 UTC

I remember I tried with an RC of Mesa 18.2 and kernel 4.18-rc6, which didn't help in any way. If you want me to try the latest code from git/SVN I'll see what I can do (I can't exactly mess up my production box). In the meantime, is there any way I can get some more useful debugging output?

Comment 3 madcatx 2018-08-15 20:21:04 UTC

Created attachment 141125
dmesg right after the GPU hanged

Comment 4 madcatx 2018-08-15 20:21:29 UTC

Created attachment 141126
Xorg log

Comment 5 madcatx 2018-08-15 20:23:48 UTC

Requested logs attached. I'm afraid they do not contain anything particularly revealing, though. Just FTR, my exact version of Mesa is 18.0.5, libdrm 2.4.93.

Comment 6 Asseon 2018-08-16 12:16:10 UTC

I believe I have the exact same or at least a very similar issue. I have an RX 480 though. I can reproduce this very reliably with Witcher 3 as well, unless I use dxvk (a Vulkan-based DX11 implementation for Wine); with dxvk I can play for hours without any issues, compared to a few minutes without it. This makes me think that the issue might be somewhere in the OpenGL machinery.
"Normal" usage, i.e. browsing and watching videos, does occasionally trigger it too.

Relevant software Versions: 
linux: 4.17.14
mesa: 18.1.5
llvm: 6.0.1


I'm trying to compile current git/svn versions of llvm and mesa right now, but it will take some time. Let's see if that helps.

Comment 7 madcatx 2018-08-16 12:25:36 UTC

I don't think this is isolated to OpenGL, as I got the very same hang in the Vulkan beta of The Talos Principle, although it happened only once. If it is any help, I believe that the Unigine Superposition benchmark always crashes the GPU at a specific point during the benchmark. Reducing the image quality level to "medium" makes the benchmark finish correctly.

Comment 8 Michel Dänzer 2018-08-16 12:31:29 UTC

Reassigning this to Mesa for now; GFX ring hangs are indeed most likely triggered by userspace issues.

Beware that there might be multiple separate issues with similar symptoms, but different causes. It's better to track each issue separately until it's clear that some of them have the same cause. In particular, those issues which can be reliably reproduced with a certain application vs those which happen randomly.

Comment 9 Asseon 2018-08-16 14:16:39 UTC

I just tried running Witcher 3 with Wine's own DX11 implementation and the svn/git versions of LLVM and Mesa, and it hung again.

Comment 10 Paju 2018-08-16 15:29:23 UTC

I'm using an RX 480 and experiencing the same kind of problems. Running Unigine Superposition crashes the GPU 4 times out of 5. I can also reproduce these crashes by playing Euro Truck Simulator 2, but there it depends directly on how high I set the resolution scale in the game settings: a larger scale causes crashes to occur more often. When I boot my machine into Win10 (I'm running dual boot), everything works fine.

System info:

CPU: Intel i7-3770K
GPU: AMD RX480
Arch Linux
Linux: 4.17.14
Mesa: 18.1.6
LLVM: 6.0.1

Comment 11 madcatx 2018-08-17 06:38:51 UTC

Just out of curiosity, do either of you have a card that is supposed to have some small overclocking done by the manufacturer? My RX570 is supposed to have this and I’m wondering if it could be responsible in any way.

Comment 12 Paju 2018-08-17 09:22:40 UTC

I'm using reference RX480 with default clocks.

Comment 13 Andrew Cook 2018-08-18 11:51:20 UTC

Having this issue too. I thought it might be 105733, but there is no vmfault in dmesg.

For the last few kernel releases I've been checking the bug by running Obduction under Wine using dxvk; the GPU hangs before the game loads.
IIRC the first time I launched Obduction it was without dxvk, and it did run.

Is there something like apitrace for Vulkan? Maybe it can be reproduced using one.

Asus GL702ZC, Bios 305
CPU: Ryzen 1700
GPU: RX580
Fedora
Kernel: 4.17.14-202.fc28.x86_64
Mesa: 18.0.1
llvm: 6.0.1

Comment 14 madcatx 2018-08-18 14:46:07 UTC

@Andrew: Could you check that you can reproduce the crash with Unigine Superposition run at High or Ultra quality in 1920x1080? This is what crashes my GPU very reliably. It would be good to have some kind of freely available baseline for this. Note that U:S depends on the older OpenSSL 1.0.2 so a bit of manual library juggling is needed to get it going on F28.

Comment 15 Andrew Cook 2018-08-18 23:05:53 UTC

Unigine locked up right at the end of the benchmark, but it also prints a vmfault

kernel: gmc_v8_0_process_interrupt: 71 callbacks suppressed
kernel: amdgpu 0000:0c:00.0: GPU fault detected: 146 0x0048080c
kernel: amdgpu 0000:0c:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100009
kernel: amdgpu 0000:0c:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0800800C
kernel: amdgpu 0000:0c:00.0: VM fault (0x0c, vmid 4, pasid 32790) at page 1048585, read from 'TC0' (0x54433000) (8)
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1413785, last emitted seq=1413787
kernel: [drm] IP block:gfx_v8_0 is hung!
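
The fields in the VM fault line can be cross-checked against the register dump above it. A minimal Python sketch, assuming the line format quoted here and a 4 KiB GPU page size (an assumption, not stated in the log); `parse_vm_fault` and `decode_client_id` are hypothetical helper names. The page number 1048585 is the same value as the fault address register 0x00100009, and the client ID 0x54433000 reads as the ASCII bytes "TC0" followed by a NUL.

```python
import re

# Matches the "VM fault" line in the format quoted in this comment.
VM_FAULT_RE = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+), pasid (?P<pasid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"'(?P<client>[^']*)' \((?P<client_id>0x[0-9a-fA-F]+)\)"
)


def decode_client_id(client_id):
    """Interpret a 32-bit client ID as big-endian ASCII, dropping NUL padding."""
    raw = client_id.to_bytes(4, "big").rstrip(b"\x00")
    return raw.decode("ascii", errors="replace")


def parse_vm_fault(line, page_size=4096):
    """Return a dict of decoded fields, or None if the line does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"))
    return {
        "vmid": int(m.group("vmid")),
        "pasid": int(m.group("pasid")),
        "access": m.group("access"),
        "page": page,
        "address": page * page_size,  # byte address, assuming 4 KiB pages
        "client": decode_client_id(int(m.group("client_id"), 16)),
    }


if __name__ == "__main__":
    sample = (
        "kernel: amdgpu 0000:0c:00.0: VM fault (0x0c, vmid 4, pasid 32790) "
        "at page 1048585, read from 'TC0' (0x54433000) (8)"
    )
    info = parse_vm_fault(sample)
    print(info)
    print(hex(info["page"]), hex(info["address"]))
```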

Comment 16 Andrew Cook 2018-08-23 02:25:43 UTC

https://github.com/ValveSoftware/Proton/blob/proton_3.7/PREREQS.md#directx-11-games

Suggests using LLVM 7 to avoid GPU hangs. Is someone able to test that?

In addition, is it expected for userspace to be capable of hanging the gpu? Really seems like something the kernel should prevent

Comment 17 madcatx 2018-08-23 18:45:54 UTC

I just ran a few tests with git/svn versions of LLVM 8.0 and mesa 18.3 and the problem is still there. I attached a dmesg log of the crash in Unigine Superposition. Just FTR the crash with LLVM 8.0/mesa 18.3 happens only on the Extreme settings, High settings survive without a hitch.

Comment 18 madcatx 2018-08-23 18:46:35 UTC

Created attachment 141261
dmesg log of the crash in Unigine Superposition

Comment 19 Andrew Cook 2018-09-01 00:07:24 UTC

Tried again using the debug kernel in fedora

Couldn't reproduce the Unigine crash
Obduction crashed in the same way, nothing new in dmesg

Kernel: 4.17.19-200.fc28.x86_64+debug

Comment 20 Paju 2018-09-03 17:58:22 UTC

I ran some Unigine tests with different kernels. No crashes with 4.13.12 and older kernels. Maybe somebody could try to run these tests too and confirm this?

Comment 21 madcatx 2018-09-12 05:18:13 UTC

I just tried to run Unigine Superposition with llvm-6.0.1-7 and kernel 4.18.5 as they arrived in F28, and it finished fine twice. Witcher 3 still crashes though.

Comment 22 Andrew Cook 2018-09-14 05:46:21 UTC

Installed this:
https://copr.fedorainfracloud.org/coprs/jerbear64/mesa_dxvk/

Which is mesa 18.2 and the obduction crash seems to have disappeared

Comment 23 madcatx 2018-09-16 21:24:09 UTC

OK, I just tried Mesa 18.2 from the Copr suggested by Andrew, but it does not fix Witcher 3 for me. Unigine Superposition seems to have been fixed by the 4.18 kernel, as I just ran it multiple times, even at the 4K profile, and it always finished successfully. The only thing I cannot try easily is LLVM 7, because it breaks too many dependencies on my Fedora box.

Comment 24 GitLab Migration User 2019-09-25 18:09:34 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1323.