Created attachment 143503 [details]
dmesg output

Seemingly at random during intensive games, the graphics will become distorted or corrupted, and this is accompanied by variations on the following errors:

[ 7935.967417] amdgpu 0000:23:00.0: GPU fault detected: 146 0x0020c40c for process GTA5.exe pid 17653 thread GTA5.exe pid 17653
[ 7935.967427] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000004
[ 7935.967430] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C400C
[ 7935.967433] amdgpu 0000:23:00.0: VM fault (0x0c, vmid 5, pasid 32781) at page 4, read from 'TC3' (0x54433300) (196)

Once this happens, the entire system will eventually lock up after a seemingly random amount of time while the GPU is being used. For example, I could experience the error in GTA V using Proton, then close it and launch a completely different game, and it will eventually freeze, whereas if I have not experienced the error, the game will run fine indefinitely.

System specs:
CPU: AMD Ryzen R5 1600
Motherboard: MSI B350 Tomahawk
RAM: 16GB Corsair Vengeance LPX 3200MHz
GPU: 8GB Sapphire RX 480 Nitro+
PSU: 750W Corsair CS750M
Distribution: Anarchy Linux rolling 64-bit
Kernel: 4.20.13-arch1-1-ARCH
OpenGL renderer string: AMD Radeon (TM) RX 480 Graphics (POLARIS10, DRM 3.27.0, 4.20.13-arch1-1-ARCH, LLVM 7.0.1)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.4
Proton version: 3.16-7 Beta
DXVK version: 0.96
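For anyone comparing several of these faults across runs, the summary line can be pulled apart mechanically. The following is a minimal sketch, assuming the "VM fault" line always has the shape shown in the log above. The function name, the dictionary keys, and the regular expression are illustrative and not part of any amdgpu tooling. The meaning of the trailing number is not stated in this report, so it is kept as an opaque field.

```python
import re

# Pattern modelled only on the single "VM fault" line quoted in this report;
# other kernel versions may print a different layout (assumption).
VM_FAULT_RE = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+), "
    r"pasid (?P<pasid>\d+)\) at page (?P<page>\d+), (?P<access>\w+) from "
    r"'(?P<client>[^']*)' \((?P<client_raw>0x[0-9a-fA-F]+)\) \((?P<last>\d+)\)"
)


def parse_vm_fault(line):
    """Return the fields of a 'VM fault' dmesg line, or None if it does not match."""
    match = VM_FAULT_RE.search(line)
    if match is None:
        return None
    raw = int(match.group("client_raw"), 16)
    # The raw value is four ASCII bytes, most significant first, NUL-padded:
    # 0x54433300 -> b"TC3\x00" -> "TC3".
    decoded = raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", errors="replace")
    return {
        "status": int(match.group("status"), 16),
        "vmid": int(match.group("vmid")),
        "pasid": int(match.group("pasid")),
        "page": int(match.group("page")),
        "access": match.group("access"),
        "client": match.group("client"),
        "client_decoded": decoded,
        "last_field": int(match.group("last")),  # meaning not stated in this report
    }
```

Feeding it the fourth log line above yields vmid 5, pasid 32781, page 4, a read access, and the client tag 'TC3', which matches the hexadecimal value printed next to it.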
Created attachment 143504 [details] glxinfo output
Reassigning assuming it's using Vulkan via DXVK.
Random GPU hangs are really hard to fix.

Can you reproduce the problem with
export RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears ?

Also, LLVM 7 is quite old; I would recommend upgrading, although VM faults are probably unrelated to LLVM.
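If exporting the variable in the shell is inconvenient, the same setting can be applied to a single launched process. This is a minimal sketch, assuming the test program can be started from a command line. The helper names are hypothetical, and only the RADV_DEBUG value itself comes from the request above.

```python
import os
import subprocess

# The flags requested above, joined with commas when exported.
RADV_FLAGS = ["zerovram", "nodcc", "nohiz", "nofastclears"]


def radv_debug_env(flags, base_env=None):
    """Return a copy of the environment with RADV_DEBUG set to the given flags."""
    env = dict(os.environ if base_env is None else base_env)
    env["RADV_DEBUG"] = ",".join(flags)
    return env


def run_with_radv_debug(command, flags=RADV_FLAGS):
    """Run `command` (a list of arguments) with RADV_DEBUG set; return its exit code."""
    return subprocess.run(command, env=radv_debug_env(flags)).returncode
```

Only the child process sees the variable, so other applications started afterwards run with the default RADV behaviour.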
(In reply to Michel Dänzer from comment #2)
> Reassigning assuming it's using Vulkan via DXVK.

I have been able to produce the error in the past with OpenGL apps, but very inconsistently. The only games that I know are guaranteed to cause me issues are GTA V (specifically Online), and The Witcher 3 (specifically at 1440p).

(In reply to Samuel Pitoiset from comment #3)
> Random GPU hangs are really hard to fix.
>
> Can you reproduce the problem with
> export RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears ?
>
> Also, LLVM 7 is quite old; I would recommend upgrading, although VM faults
> are probably unrelated to LLVM.

I was unable to reproduce the issue in GTA Online with that. I played for several hours and the game itself crashed, without any errors appearing in dmesg, or any graphical corruption observed. The game did, however, run poorly and render particle effects very strangely right from the beginning, but I assume this is expected.

I am going to try RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears with The Witcher 3 at 1440p to see if I can reproduce the issue there.
The Witcher 3 threw the error for me with RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears after less than five minutes. I will attach a text file with the amdgpu errors from dmesg.
Created attachment 143517 [details] The Witcher 3 errors with RADV_DEBUG=zerovram,nodcc,nohiz,nofastclears
Can you try to boot your kernel with amdgpu.vm_debug=1? Do the VM faults happen more often with that option?
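Before testing, it can be worth confirming that the option actually reached the kernel. The sketch below checks a kernel command line string for the parameter. It assumes the running kernel's command line is readable from /proc/cmdline and that the last occurrence of a repeated parameter is the one that counts. The function names are illustrative.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    Simplification: quoted values containing spaces are not handled.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            # Keep scanning so that a later occurrence overrides an earlier one.
            value = val if sep else ""
    return value


def vm_debug_enabled(cmdline):
    """True if amdgpu.vm_debug=1 appears on the given command line."""
    return kernel_param(cmdline, "amdgpu.vm_debug") == "1"


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the machine under test, `vm_debug_enabled(read_cmdline())` should return True after rebooting with the option added to the boot loader entry.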
Apologies for taking so long to reply, I've spent the last couple of days swapping out my motherboard and doing some (non-GPU) stress testing. I'm now on a Gigabyte B450 Aorus Elite, and kernel 5.0.

(In reply to Samuel Pitoiset from comment #7)
> Can you try to boot your kernel with amdgpu.vm_debug=1? Do the VM faults
> happen more often with that option?

I will admit that I haven't done very extensive testing, but it seems as though that option has actually made things significantly more stable. I was able to run The Witcher 3 for a good 15 to 20 minutes at 1440p without any issues. I tried disabling Vsync, increasing graphics options, and even enabling HairWorks in an attempt to provoke an error, but no dice. Considering I'm used to The Witcher 3 freaking out violently after about five minutes or less, it's a hell of an improvement already. I will note, however, that when I got the error previously (the one I have attached), that was during the tutorial section of the game, and I am now in White Orchard, which may or may not be making some difference.

I will also mention that without any special parameters or environment variables, I can run The Witcher 3 perfectly fine with Wine D3D11, just not with good performance.

GTA Online will be a difficult one to make a judgement on, however, as I can sometimes go hours without an error, while other times it can happen very quickly.
Forgot to mention as well, I have tested The Witcher 3 on kernel 5.0 without amdgpu.vm_debug=1, and it was still as unstable as ever, so it's definitely not the new kernel making a difference.
Okay, I just experienced a full system lockup with The Witcher 3, with amdgpu.vm_debug=1. It happened shortly after some very minor texture flickering started happening on a character's face, which seemed to be caused by enabling anti-aliasing. This time, I observed no errors in dmesg, and I see nothing in the system journal.
Is the GPU hang with TW3 reproducible now?
Does the attached patch help https://patchwork.freedesktop.org/patch/290846/?series=57689&rev=1 ?
Are you able to reproduce the VM faults with mesa 19.0?
(In reply to Samuel Pitoiset from comment #13)
> Are you able to reproduce the VM faults with mesa 19.0?

With Mesa 19.0.1, LLVM 8.0.0, Kernel 5.0.5 and Proton 4.2-1, it seems as though The Witcher 3 is stable now. I've been playing for well over an hour without any special kernel parameters or environment variables. No error messages or instability to report.
Thanks for your response, feel free to re-open if the problem happens again. Closing.