Bug 109805 - GPU hangs with error VM_CONTEXT1_PROTECTION_FAULT_STATUS
Status: RESOLVED WORKSFORME
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon (show other bugs)
Version: 18.3
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
Reported: 2019-03-01 03:03 UTC by rainbowsforthewin@gmail.com
Modified: 2019-04-01 07:14 UTC (History)

Attachments
dmesg output (87.52 KB, text/plain)
2019-03-01 03:03 UTC, rainbowsforthewin@gmail.com
glxinfo output (144.56 KB, text/plain)
2019-03-01 03:03 UTC, rainbowsforthewin@gmail.com
The Witcher 3 errors with RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears (4.22 KB, text/plain)
2019-03-03 23:33 UTC, rainbowsforthewin@gmail.com

Description rainbowsforthewin@gmail.com 2019-03-01 03:03:03 UTC
Created attachment 143503 [details]
dmesg output

Seemingly at random during intensive games, the graphics will become distorted or corrupted, and this is accompanied by variations on the following errors:

[ 7935.967417] amdgpu 0000:23:00.0: GPU fault detected: 146 0x0020c40c for process GTA5.exe pid 17653 thread GTA5.exe pid 17653
[ 7935.967427] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000004
[ 7935.967430] amdgpu 0000:23:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C400C
[ 7935.967433] amdgpu 0000:23:00.0: VM fault (0x0c, vmid 5, pasid 32781) at page 4, read from 'TC3' (0x54433300) (196)
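For anyone collecting these faults from a long dmesg log, the summary line can be broken into fields with a small script. This is only a sketch based on the single sample line above: the regular expression and the `parse_vm_fault` name are illustrative assumptions, and other kernel versions may format the line differently. It also shows that the raw value 0x54433300 is simply the client name 'TC3' packed as ASCII bytes.

```python
import re

# Pattern modelled on the one sample line in this report (an assumption,
# not a format guaranteed by the amdgpu driver).
VM_FAULT_RE = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+), pasid (?P<pasid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"'(?P<client>[^']*)' \((?P<client_raw>0x[0-9a-fA-F]+)\) \((?P<client_id>\d+)\)"
)


def parse_vm_fault(line):
    """Return a dict of fields from a VM fault line, or None if it does not match."""
    match = VM_FAULT_RE.search(line)
    if match is None:
        return None
    raw = int(match.group("client_raw"), 16)
    # Unpack the 32-bit value as big-endian ASCII and drop the trailing NUL.
    decoded = raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", "replace")
    return {
        "vmid": int(match.group("vmid")),
        "pasid": int(match.group("pasid")),
        "page": int(match.group("page")),
        "access": match.group("access"),
        "client": match.group("client"),
        "client_decoded": decoded,
        "client_id": int(match.group("client_id")),
    }


if __name__ == "__main__":
    sample = (
        "[ 7935.967433] amdgpu 0000:23:00.0: VM fault (0x0c, vmid 5, pasid 32781) "
        "at page 4, read from 'TC3' (0x54433300) (196)"
    )
    print(parse_vm_fault(sample))
```

Grouping the parsed faults by `vmid` or `client` may make it easier to see whether the variations of the error share a common source.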

Once this happens, the entire system will eventually lock up after a seemingly random amount of time while the GPU is being used. For example, if I hit the error in GTA V using Proton, then close it and launch a completely different game, that game will eventually freeze as well, whereas if I have not hit the error, the game will run fine indefinitely.

System specs:

CPU: AMD Ryzen R5 1600
Motherboard: MSI B350 Tomahawk
RAM: 16GB Corsair Vengeance LPX 3200MHz
GPU: 8GB Sapphire RX 480 Nitro+
PSU: 750W Corsair CS750M

Distribution: Anarchy Linux rolling 64-bit
Kernel: 4.20.13-arch1-1-ARCH
OpenGL renderer string: AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.27.0, 4.20.13-arch1-1-ARCH, LLVM 7.0.1)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.4

Proton version: 3.16-7 Beta
DXVK version: 0.96
Comment 1 rainbowsforthewin@gmail.com 2019-03-01 03:03:33 UTC
Created attachment 143504 [details]
glxinfo output
Comment 2 Michel Dänzer 2019-03-01 09:22:39 UTC
Reassigning assuming it's using Vulkan via DXVK.
Comment 3 Samuel Pitoiset 2019-03-01 17:37:51 UTC
Random GPU hangs are really hard to fix.

Can you reproduce the problem with
export RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears ?

Also, LLVM 7 is quite old. I would recommend upgrading, although VM faults are probably unrelated to LLVM.
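If exporting the variable in the shell is inconvenient, it can also be set for a single process only. The following is a minimal sketch, assuming the game or test program can be started from a command line; the helper names `radv_debug_env` and `run_with_radv_debug` are hypothetical, and the command passed in is a placeholder.

```python
import os
import subprocess

# Flags taken from the request above.
RADV_DEBUG_FLAGS = ["zerovram", "nodcc", "nohiz", "nofastclears"]


def radv_debug_env(flags, base_env=None):
    """Return a copy of the environment with RADV_DEBUG set to the given flags."""
    env = dict(os.environ if base_env is None else base_env)
    env["RADV_DEBUG"] = ",".join(flags)
    return env


def run_with_radv_debug(command, flags=RADV_DEBUG_FLAGS):
    """Run a command (list of arguments) with RADV_DEBUG set only for that process."""
    return subprocess.run(command, env=radv_debug_env(flags), check=False)
```

For example, `run_with_radv_debug(["vkcube"])` would start a Vulkan test program with the flags applied, assuming such a program is installed; the rest of the session keeps its normal environment.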
Comment 4 rainbowsforthewin@gmail.com 2019-03-03 20:51:25 UTC
(In reply to Michel Dänzer from comment #2)
> Reassigning assuming it's using Vulkan via DXVK.

I have been able to produce the error in the past with OpenGL apps, but very inconsistently. The only games that I know are guaranteed to cause me issues are GTA V (specifically Online), and The Witcher 3 (specifically at 1440p).

(In reply to Samuel Pitoiset from comment #3)
> Random GPU hangs are really hard to fix.
> 
> Can you reproduce the problem with
> export RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears ?
> 
> Also, LLVM 7 is quite old. I would recommend upgrading, although VM faults
> are probably unrelated to LLVM.

I was unable to reproduce the issue in GTA Online with that. I played for several hours until the game itself crashed, without any errors appearing in dmesg or any graphical corruption observed. The game did, however, run poorly and render particle effects very strangely right from the beginning, but I assume this is expected.

I am going to try RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears with The Witcher 3 at 1440p to see if I can reproduce the issue there.
Comment 5 rainbowsforthewin@gmail.com 2019-03-03 23:33:08 UTC
The Witcher 3 threw the error for me with RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears after less than five minutes.

I will attach a text file with the amdgpu errors from dmesg.
Comment 6 rainbowsforthewin@gmail.com 2019-03-03 23:33:59 UTC
Created attachment 143517 [details]
The Witcher 3 errors with RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears
Comment 7 Samuel Pitoiset 2019-03-04 08:47:53 UTC
Can you try to boot your kernel with amdgpu.vm_debug=1? Do the VM faults happen more often with that option?
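After rebooting, it is worth confirming that the option actually reached the kernel before testing. A minimal sketch, assuming the parameter was added to the boot command line and that /proc/cmdline is readable; the function names `kernel_param` and `read_cmdline` are hypothetical:

```python
def kernel_param(cmdline, name):
    """Return the value of a kernel command-line parameter, or None if absent.

    A parameter given without '=' yields an empty string. If the parameter
    appears more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    print("amdgpu.vm_debug =", kernel_param(read_cmdline(), "amdgpu.vm_debug"))
```

If this prints `None`, the boot loader configuration was probably not updated and the test results would not reflect the option.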
Comment 8 rainbowsforthewin@gmail.com 2019-03-06 09:34:26 UTC
Apologies for taking so long to reply; I've spent the last couple of days swapping out my motherboard and doing some (non-GPU) stress testing.

I'm now on a Gigabyte B450 Aorus Elite, and kernel 5.0.

(In reply to Samuel Pitoiset from comment #7)
> Can you try to boot your kernel with amdgpu.vm_debug=1? Do the VM faults
> happen more often with that option?

I will admit that I haven't done very extensive testing, but it seems as though that option has actually made things significantly more stable.

I was able to run The Witcher 3 for a good 15 to 20 minutes at 1440p without any issues. I tried disabling Vsync, increasing graphics options, and even enabling HairWorks in an attempt to provoke an error, but no dice. Considering I'm used to The Witcher 3 freaking out violently after about five minutes or less, it's a hell of an improvement already.

I will note, however, that when I got the error previously (the one I have attached), that was during the tutorial section of the game, and I am now in White Orchard, which may or may not be making some difference. I will also mention that without any special parameters or environment variables, I can run The Witcher 3 perfectly fine with Wine D3D11, just not with good performance.

GTA Online will be a difficult one to make a judgement on, however, as I can sometimes go hours without an error, while other times it can happen very quickly.
Comment 9 rainbowsforthewin@gmail.com 2019-03-06 09:40:20 UTC
Forgot to mention as well, I have tested The Witcher 3 on kernel 5.0 without amdgpu.vm_debug=1, and it was still as unstable as ever, so it's definitely not the new kernel making a difference.
Comment 10 rainbowsforthewin@gmail.com 2019-03-06 20:19:52 UTC
Okay, I just experienced a full system lockup with The Witcher 3, with amdgpu.vm_debug=1. It happened shortly after some very minor texture flickering started on a character's face, which seemed to be caused by enabling anti-aliasing. This time, I observed no errors in dmesg, and I see nothing in the system journal.
Comment 11 Samuel Pitoiset 2019-03-06 21:07:58 UTC
Is the GPU hang with TW3 reproducible now?
Comment 12 Samuel Pitoiset 2019-03-07 10:47:20 UTC
Does the patch at https://patchwork.freedesktop.org/patch/290846/?series=57689&rev=1 help?
Comment 13 Samuel Pitoiset 2019-03-27 09:40:13 UTC
Are you able to reproduce the VM faults with mesa 19.0?
Comment 14 rainbowsforthewin@gmail.com 2019-03-29 11:54:12 UTC
(In reply to Samuel Pitoiset from comment #13)
> Are you able to reproduce the VM faults with mesa 19.0?

With Mesa 19.0.1, LLVM 8.0.0, Kernel 5.0.5 and Proton 4.2-1, it seems as though The Witcher 3 is stable now. I've been playing for well over an hour without any special kernel parameters or environment variables. No error messages or instability to report.
Comment 15 Samuel Pitoiset 2019-04-01 07:14:03 UTC
Thanks for your response. Feel free to re-open if the problem happens again. Closing.

