Bug 109955 - amdgpu [RX Vega 64] system freeze while gaming
Summary: amdgpu [RX Vega 64] system freeze while gaming
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2019-03-11 07:05 UTC by Mauro Gaspari
Modified: 2019-08-20 18:02 UTC

Attachments
syslog lines relevant to the crash (3.78 MB, text/plain)
2019-03-22 20:01 UTC, Mauro Gaspari
full dmesg after crash (87.19 KB, text/plain)
2019-03-22 20:02 UTC, Mauro Gaspari
dmesg from the freeze which didn't completely bork everything. It starts on line 1181 (987.98 KB, text/plain)
2019-06-13 21:04 UTC, Sam
Dmesg after crash (88.25 KB, text/plain)
2019-07-19 00:12 UTC, Hadet

Description Mauro Gaspari 2019-03-11 07:05:19 UTC
Symptoms:
During gaming sessions the system locks up and freezes completely. Audio keeps playing for a few more seconds, but the whole desktop is frozen and neither mouse nor keyboard responds. A hard reset is the only option on the local PC. I have not tried to SSH into the PC from another box.
Sometimes I can play for 20 minutes, sometimes for a few hours. The freezes seem unrelated to whatever is happening in-game. All system temperatures are under control.
Outside of 3D gaming the system is very stable, including playing videos, encoding videos, and regular desktop usage.

Further testing done:
1. Installed Windows 10 on the same hardware with the same BIOS settings. Running the same games causes no issues at all: no hangs, no problems.
2. Ran the same games on my NVIDIA+Intel based laptop, on the same distributions and kernels. No issues at all: no hangs, no problems.

Additional information:
This issue has been going on for a while now. It comes and goes with Mesa versions (or Mesa+kernel combinations). Sometimes an update arrives and I have no freezes for weeks. Then the next update gets installed and the issue comes back.
I have tested this mainly on openSUSE Tumbleweed, Ubuntu 18.04 and Ubuntu 18.10. 

-- Ubuntu testing:
Ubuntu 18.04 was running well for months, then the latest Mesa updates that landed 2 weeks ago reintroduced the issue and the system started freezing again. I tried updating to 18.10 but had the same issue. I enabled the oibaf PPA for video drivers and the issue disappeared. Then after a few days a new Mesa came in and the issue came back. I am now running the Padoka unstable PPA with Mesa 19 and LLVM 9. The issue still happens.

-- Tumbleweed testing:
I am adding the previous bug report I filed against Tumbleweed, which covers a couple of occurrences with system logs. I will post more as I collect them.

OS: OpenSUSE tumbleweed x86_64 updated (2018 04 21)
Kernel: 4.16.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.0 Mesa 18.0.0
GPU: AMD Radeon RX Vega 64 8GB

System Logs:

Apr 21 17:08:34 STUDIO kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Apr 21 17:08:34 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
Apr 21 17:08:44 STUDIO kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=128859, last emitted seq=128861
Apr 21 17:08:44 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
-- Reboot --


Dmesg lines relevant to amdgpu:

[    3.407020] [drm] amdgpu kernel modesetting enabled.
[    3.411462] fb: switching to amdgpudrmfb from VESA VGA
[    3.426163] amdgpu 0000:04:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.426261] amdgpu 0000:04:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.426263] amdgpu 0000:04:00.0: GTT: 256M 0x000000F600000000 - 0x000000F60FFFFFFF
[    3.426371] [drm] amdgpu: 8176M of VRAM memory ready
[    3.426372] [drm] amdgpu: 8176M of GTT memory ready.
[    4.031665] fbcon: amdgpudrmfb (fb0) is primary device
[    4.083803] amdgpu 0000:04:00.0: fb0: amdgpudrmfb frame buffer device
[    4.096086] amdgpu 0000:04:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.096088] amdgpu 0000:04:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.096089] amdgpu 0000:04:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.096090] amdgpu 0000:04:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.096091] amdgpu 0000:04:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.096093] amdgpu 0000:04:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.096094] amdgpu 0000:04:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.096095] amdgpu 0000:04:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.096096] amdgpu 0000:04:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.096098] amdgpu 0000:04:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.096099] amdgpu 0000:04:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.096100] amdgpu 0000:04:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.096101] amdgpu 0000:04:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.096103] amdgpu 0000:04:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.096104] amdgpu 0000:04:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.096105] amdgpu 0000:04:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.096107] amdgpu 0000:04:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.096108] amdgpu 0000:04:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.096662] [drm] Initialized amdgpu 3.23.0 20150101 for 0000:04:00.0 on minor 0



The issue was later identified here   https://bugs.freedesktop.org/show_bug.cgi?id=105317 and fixed with Mesa 18.0.1. 



Then the issue was noticed again after a few months:
OS: OpenSUSE tumbleweed x86_64 updated (2018 08 10)
Kernel: 4.17.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.1 Mesa 18.1.5
GPU: AMD Radeon RX Vega 64 8GB


Relevant log lines I found during freeze:

2018-08-09T23:16:53.103775+08:00 MGDT-Tumbleweed kernel: [ 6305.852703] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1745163, last emitted seq=1745165
2018-08-09T23:16:53.103795+08:00 MGDT-Tumbleweed kernel: [ 6305.852704] [drm] No hardware hang detected. Did some blocks stall?
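
To compare the timeout messages across the reports in this thread, the two sequence numbers can be pulled out of each line. The following is only a sketch: the function name `parse_ring_timeout` is made up for illustration, and reading the difference between the two numbers as "fences emitted but not yet signaled" is my interpretation of the message, not something confirmed by the driver developers here.

```python
import re

# Matches both message variants that appear in this report:
#   "ring gfx timeout, last signaled seq=128859, last emitted seq=128861"
#   "ring gfx timeout, signaled seq=25317, emitted seq=25319"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match.

    'pending' is emitted minus signaled (an interpretation, see the text above).
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), signaled, emitted, emitted - signaled


if __name__ == "__main__":
    sample = (
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
        "last signaled seq=1745163, last emitted seq=1745165"
    )
    print(parse_ring_timeout(sample))
```

In the timeout lines quoted in this description, the gap between the two numbers is 2 each time.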


Dmesg lines relevant to amdgpu:

[    3.130759] [drm] amdgpu kernel modesetting enabled.
[    3.135770] fb: switching to amdgpudrmfb from EFI VGA
[    3.136106] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.136171] amdgpu 0000:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.136173] amdgpu 0000:03:00.0: GTT: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[    3.136494] [drm] amdgpu: 8176M of VRAM memory ready
[    3.136495] [drm] amdgpu: 8176M of GTT memory ready.
[    4.114469] fbcon: amdgpudrmfb (fb0) is primary device
[    4.141179] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[    4.164072] amdgpu 0000:03:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.164074] amdgpu 0000:03:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.164076] amdgpu 0000:03:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.164077] amdgpu 0000:03:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.164078] amdgpu 0000:03:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.164080] amdgpu 0000:03:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.164081] amdgpu 0000:03:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.164082] amdgpu 0000:03:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.164083] amdgpu 0000:03:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.164084] amdgpu 0000:03:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.164086] amdgpu 0000:03:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.164087] amdgpu 0000:03:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.164553] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:03:00.0 on minor 0
Comment 1 Mauro Gaspari 2019-03-22 20:01:01 UTC
Created attachment 143759 [details]
syslog lines relevant to the crash
Comment 2 Mauro Gaspari 2019-03-22 20:02:04 UTC
Created attachment 143760 [details]
full dmesg after crash
Comment 3 Mauro Gaspari 2019-03-22 20:02:15 UTC
New reports as the issue is still happening:

I found a link on phoronix that describes with pictures exactly what is happening:
https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1049483-amd-devs-error-ring-gfx-timeout


OS: OpenSUSE tumbleweed x86_64 updated (2019 03 23)
Kernel: 5.0.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.0
GPU: AMD Radeon RX Vega 64 8GB

Attaching log files and dmesg after crash.
Comment 4 Mauro Gaspari 2019-04-11 06:37:46 UTC
The issue still happens despite kernel and Mesa updates on openSUSE Tumbleweed. The same happens on Kubuntu with the oibaf PPA, and on Arch.

This bug seems to affect many people on Linux using AMD GPUs, and I found some interesting workarounds. I had a look at the kernel options and applied some via GRUB. So far, after 2 weeks of extensive testing, I have not had a single system freeze or hang.

-> BEGIN KERNEL PARAMETERS <-
This is what I am using now. Please note that some of those settings are to
enable debugging and should not be left there forever. I will remove those once
I am confident with the stability of the system.

AMDGPU
amdgpu.dc=1 amdgpu.vm_update_mode=0 amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2 amdgpu.vm_debug=1
amdgpu.gpu_recovery=0


- Kernel parameters explained from:
https://www.kernel.org/doc/html/latest/gpu/amdgpu.html

--- dc (int)
Disable/Enable Display Core driver for debugging (1 = enable, 0 = disable).
The default is -1 (automatic for each asic).


--- dpm (int)
Override for dynamic power management setting (1 = enable, 0 = disable). The
default is -1 (auto).

--- vm_update_mode (int)
Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics
only, 2 = Compute only, 3 = Both). The default is -1 (Only in large BAR(LB)
systems Compute VM tables will be updated by CPU, otherwise 0, never).

--- ppfeaturemask (uint)
Override power features enabled. See enum PP_FEATURE_MASK in
drivers/gpu/drm/amd/include/amd_shared.h. The default is the current set of
stable power features.

--- vm_fault_stop (int)
Stop on VM fault for debugging (0 = never, 1 = print first, 2 = always). The
default is 0 (No stop).

--- vm_debug (int)
Debug VM handling (0 = disabled, 1 = enabled). The default is 0 (Disabled).

--- gpu_recovery (int)
Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default
is -1 (auto, disabled except SRIOV).

-> END KERNEL PARAMETERS <-
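
To double-check which of the amdgpu options above actually reached the kernel after a reboot, the boot command line in `/proc/cmdline` can be inspected. This is a minimal sketch, not an official tool: the helper names `amdgpu_options` and `describe_vm_update_mode` are invented for illustration, and the value table is copied from the documentation quoted above.

```python
# Meaning of vm_update_mode values, taken from the kernel documentation quoted above.
VM_UPDATE_MODE = {
    -1: "auto",
    0: "never",
    1: "graphics only",
    2: "compute only",
    3: "both",
}


def amdgpu_options(cmdline):
    """Collect amdgpu.<name>=<value> tokens from a kernel command line string."""
    options = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            options[name] = value
    return options


def describe_vm_update_mode(value):
    """Translate a vm_update_mode value (given as a string) into words."""
    return VM_UPDATE_MODE.get(int(value), "unknown")


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            found = amdgpu_options(handle.read())
    except OSError:
        found = {}  # not on Linux, or /proc is not readable
    for name, value in sorted(found.items()):
        print(f"amdgpu.{name} = {value}")
```

Values are kept as strings on purpose, so a hexadecimal mask such as `0xffffffff` is reported exactly as it was typed.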
Comment 5 Jaap Buurman 2019-04-12 21:37:54 UTC
I have the exact same problem with my Vega 64. Crashes when playing games. Happens with Vulkan games (RADV), OpenGL games (RadeonSI) and DirectX 9 games via Wine (Gallium9). It happens only for some games, presumably because it depends on the workload.

I also suspect power management issues. This might be a long shot, but it is worth a try. I know for a fact that power management works slightly differently when multiple monitors are connected, as memory isn't clocked back as much in that case. For the people also experiencing this issue: are you running multiple monitors like I am?
Comment 6 Jaap Buurman 2019-04-12 22:10:26 UTC
Another question: What is the output of the following command for you guys?

cat /sys/class/drm/card0/device/vbios_version 

I am running the following version:

113-D0500100-103

According to the techpowerup GPU bios database, this is a vega bios that was replaced two days (!) later by a new version. Perhaps issues were found that required another bios update? I might install Windows on a spare HDD and try to flash my Vega to see if that changes anything.
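
For systems with more than one GPU, the same sysfs file can be read for every card in one go. A small sketch, assuming the usual `/sys/class/drm/cardN/device/vbios_version` layout used by the `cat` command above; the function name `read_vbios_versions` and its `drm_root` parameter are hypothetical and exist only so the path can be swapped out for testing.

```python
import glob
import os


def read_vbios_versions(drm_root="/sys/class/drm"):
    """Map each cardN directory name to the contents of its vbios_version file."""
    versions = {}
    pattern = os.path.join(drm_root, "card*", "device", "vbios_version")
    for path in sorted(glob.glob(pattern)):
        card = os.path.basename(os.path.dirname(os.path.dirname(path)))
        # Skip connector entries such as "card0-DP-1"; keep only "card<number>".
        if not card[len("card"):].isdigit():
            continue
        try:
            with open(path) as handle:
                versions[card] = handle.read().strip()
        except OSError:
            continue  # unreadable entry, ignore it
    return versions


if __name__ == "__main__":
    for card, version in read_vbios_versions().items():
        print(f"{card}: {version}")
```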
Comment 7 Mauro Gaspari 2019-04-13 09:34:26 UTC
@ Jaap Buurman 
I run a single monitor, ultra-wide 3440x1440 @ 100 Hz.

my bios version: 113-D0500100-103
Comment 8 Jaap Buurman 2019-04-13 09:41:44 UTC
I guess we can rule out a multi-monitor issue then. But I find it VERY interesting that you also run the exact same BIOS version. It was replaced two days later, so it should be fairly rare. Perhaps it is buggy and was therefore replaced only 2 days after it was released? I am going to try to flash my GPU in Windows on a separate HDD and see if that fixes anything.
Comment 9 Mauro Gaspari 2019-04-13 09:49:00 UTC
Interesting catch about the card's BIOS.

I have a separate SSD with Windows 10 that I use to test this card's stability. I will check my Windows MSI update tool and see if it offers me an updated BIOS. If it does, I will temporarily remove my workarounds and see how it goes.
Comment 10 Jaap Buurman 2019-04-13 09:52:32 UTC
You will have to flash using Atiflash:

https://www.techpowerup.com/download/ati-atiflash/

You will also need to download the latest BIOS for your card from TechPowerUp:

https://www.techpowerup.com/vgabios/

Bios updates are usually not supported directly by the vendor, but I have never worked with MSI update tool, so I am not 100% sure.

Make sure you are very careful when picking the bios. Some bioses are for the watercooling variant, variants with aftermarket coolers, or overclocked ones.
Comment 11 Mauro Gaspari 2019-04-13 11:34:47 UTC
You are right. MSI tools do not offer any BIOS update for GPU.

I downloaded the utility and filtered the BIOS list by vendor and device ID. I saw the 3 BIOS versions, including the one that, as you said, was released 2 days after the one we are using.

I do not have high hopes, because with current BIOS, all games on windows run fine. But well, cannot hurt to try the upgrade. Worst case I will re-introduce my workarounds. I had zero freezes with those enabled in the last 2 weeks. 

And if I end up bricking my GPU out of warranty, I have the excuse to get a new RadeonVII :D
Comment 12 Jaap Buurman 2019-04-13 13:19:33 UTC
My Vega64 was also 100% stable on the exact same build under Windows 10. So I am also not getting my hopes up, but I am really frustrated. I am hoping it is some kind of incompatibility problem. I have honestly tried so many things, that I am willing to give the long-shots a chance as well. 

Since my Switch to Linux ~1.5 years ago, stability with the Vega64 has been very finicky. Some games run fine, while some games cause this crash pretty reliably. Very, very frustrating.
Comment 13 Mauro Gaspari 2019-04-13 13:45:06 UTC
Status update: I updated the BIOS and now disabled all kernel parameters I previously used. It might take some time to make sure the system is stable. 

Regarding your frustrations,
AMD released open source drivers and that is a major improvement for people on Linux. I got the RX Vega 64 to support that. I expected a few bumps in the road but, well, it is taking longer than anticipated.

Having said that, here are all the kernel parameters I enabled. With those, as I said, I have not had a single freeze. They are not fixes, most likely optimizations and workarounds. Still, they work pretty well for me.

CPU
rcu_nocbs=0-15 (adjust to the number of cores of your cpu)
idle=nomwait
processor.max_cstate=5
pcie_aspm=off 

GPU
amdgpu.dc=1
amdgpu.vm_update_mode=0
amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff
amdgpu.vm_fault_stop=2
amdgpu.vm_debug=1
amdgpu.gpu_recovery=0
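
Since `rcu_nocbs` has to be adjusted to the CPU, the list above can be assembled into a single string for the boot loader. This is just a sketch: `build_parameters` is a made-up helper, it assumes the range should cover every logical CPU reported by the OS (0-15 on a 16-thread CPU), and it assumes the result is pasted by hand into something like `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, which may differ between distributions.

```python
import os

# GPU options exactly as listed in this comment.
GPU_OPTIONS = [
    "amdgpu.dc=1",
    "amdgpu.vm_update_mode=0",
    "amdgpu.dpm=-1",
    "amdgpu.ppfeaturemask=0xffffffff",
    "amdgpu.vm_fault_stop=2",
    "amdgpu.vm_debug=1",
    "amdgpu.gpu_recovery=0",
]


def build_parameters(cpu_count):
    """Return the CPU and GPU workaround options as one space-separated string."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    cpu_options = [
        f"rcu_nocbs=0-{cpu_count - 1}",  # CPU indices start at 0
        "idle=nomwait",
        "processor.max_cstate=5",
        "pcie_aspm=off",
    ]
    return " ".join(cpu_options + GPU_OPTIONS)


if __name__ == "__main__":
    # os.cpu_count() can return None; fall back to 1 in that case.
    print(build_parameters(os.cpu_count() or 1))
```

The script only prints the string. It does not touch any boot loader configuration.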
Comment 14 Mauro Gaspari 2019-04-15 12:51:58 UTC
Quick update.


OS: OpenSUSE tumbleweed x86_64 updated (2019 04 15)
Kernel: 5.0.7-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.1
GPU: AMD Radeon RX Vega 64 8GB


GPU firmware upgrade did not change much. 
I disabled the kernel parameters in GRUB, upgraded the BIOS, and ran some games. The same old system freeze came back.

After that, I re-enabled the kernel parameters in GRUB and rebooted. No more system freezes.
Comment 15 Jaap Buurman 2019-04-25 19:44:19 UTC
That's bad to hear :( Worth a try though. How often do you experience freezes by the way? And is this for all games, or are some games completely stable? For me, I am getting crashes in Kerbal Space Program, but not in Final Fantasy XII or World of Warcraft, even after hundreds of hours in both of these stable games.

Also, have you ever figured out which kernel parameter in particular makes your setup stable? It might help identify where the problem exists. Or do you need that exact combination of all those parameters to get your system stable?
Comment 16 Jaap Buurman 2019-04-28 16:33:39 UTC
Just got a crash in World of Warcraft as well, running via vkd3d. It happens instantly after trying to log into the game world, so the issue is nicely reproducible for me. If you want me to get any traces, please let me know what you would like me to run to get them. dmesg logs for now:

[   78.450637] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450641] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d4b000 from 27
[   78.450642] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450648] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450650] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92553000 from 27
[   78.450652] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450656] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450658] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d4e000 from 27
[   78.450660] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450665] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450666] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92542000 from 27
[   78.450668] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450673] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450674] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d42000 from 27
[   78.450676] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450680] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450682] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92552000 from 27
[   78.450683] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450688] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450690] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d40000 from 27
[   78.450691] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450696] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450697] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92552000 from 27
[   78.450699] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450703] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450705] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d49000 from 27
[   78.450706] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450711] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450713] amdgpu 0000:09:00.0:   in page starting at address 0x0000850ea1eb2000 from 27
[   78.450714] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.454307] amdgpu 0000:09:00.0: IH ring buffer overflow (0x000BEDC0, 0x0003EEC0, 0x0003EDE0)
[   88.570062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=25317, emitted seq=25319
[   88.570099] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370
[   88.570102] amdgpu 0000:09:00.0: GPU reset begin!
[   88.831392] amdgpu 0000:09:00.0: GPU reset
[   89.356679] [drm] psp mode1 reset succeed 
[   89.475356] amdgpu 0000:09:00.0: GPU reset succeeded, trying to resume
[   89.475465] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[   89.475508] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[   89.475642] [drm] PSP is resuming...
[   89.623052] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[   89.806625] [drm] SADs count is: -2, don't need to read it
[   89.856619] [drm] SADs count is: -2, don't need to read it
[   89.938255] [drm] UVD and UVD ENC initialized successfully.
[   90.038674] [drm] VCE initialized successfully.
[   90.039672] [drm] recover vram bo from shadow start
[   90.047496] [drm] recover vram bo from shadow done
[   90.047497] [drm] Skip scheduling IBs!
[   90.047499] [drm] Skip scheduling IBs!
[   90.047511] [drm] Skip scheduling IBs!
[   90.047518] [drm] Skip scheduling IBs!
[   90.047523] [drm] Skip scheduling IBs!
[   90.047524] [drm] Skip scheduling IBs!
[   90.047530] [drm] Skip scheduling IBs!
[   90.047531] [drm] Skip scheduling IBs!
[   90.047533] [drm] Skip scheduling IBs!
[   90.047535] [drm] Skip scheduling IBs!
[   90.047536] [drm] Skip scheduling IBs!
[   90.047538] [drm] Skip scheduling IBs!
[   90.047539] [drm] Skip scheduling IBs!
[   90.047555] amdgpu 0000:09:00.0: GPU reset(2) succeeded!
[   90.047796] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.049377] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.050524] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.051990] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.055576] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.136508] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.180374] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.181405] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.246698] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.313258] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.380264] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.446291] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.513947] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.579552] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.218785] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.218976] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.219571] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.219745] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.221821] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.221969] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.222145] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.222360] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.229911] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.230213] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231183] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231328] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231487] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231703] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.233480] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.247154] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.249213] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.249437] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.250924] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.251258] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.251320] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252417] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252532] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252739] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252994] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.254745] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.265835] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.265974] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266056] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266222] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266342] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266436] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266516] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266646] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266796] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266997] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.271605] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274639] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274699] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274747] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274794] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274869] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274929] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274981] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.275033] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.275373] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.284443] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.286591] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.286881] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.302782] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.319311] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.335908] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.353111] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.369124] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.385670] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.402801] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.421232] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.737933] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.738054] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.742378] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.742737] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.742845] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.744592] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.744806] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.751833] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752108] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752371] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752475] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752604] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752762] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.754128] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.765700] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.766154] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.766250] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.767140] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.767447] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789098] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789205] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789293] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789364] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789473] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789598] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789675] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789745] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.790301] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.803790] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.811866] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.821133] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.837593] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.841186] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.854467] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.870915] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.871297] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.887676] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.901326] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.902101] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.903913] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.927724] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.938301] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.941050] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.952885] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.975232] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.975468] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.986053] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.005910] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.018771] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.036370] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.052090] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.067194] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.067901] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.068016] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081081] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081359] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081525] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081618] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081721] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081845] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082026] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082151] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082246] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082329] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082439] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082579] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082757] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.086543] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.098769] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.102700] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.445931] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.446590] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.946103] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.946823] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.446237] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.446803] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.946107] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.946642] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.445541] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.446075] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.946163] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.946730] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.446040] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.446555] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.945513] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.945951] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.437414] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.437827] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.946771] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.947166] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.446585] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.447008] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.937954] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.938407] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.445966] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.446429] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.945528] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.945999] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.445983] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.446405] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.946131] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.946642] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.446428] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.446960] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.946992] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.947500] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.445052] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.445477] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.533707] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.946108] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.946604] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.445730] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.446232] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.943308] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.943823] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.036544] kauditd_printk_skb: 16509 callbacks suppressed
[  111.036545] audit: type=1006 audit(1556468881.509:99): pid=2590 uid=0 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=4 res=1
[  111.446470] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.446899] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.945982] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.946413] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Comment 17 Alex Deucher 2019-04-29 01:15:49 UTC
(In reply to Jaap Buurman from comment #16)
> Just got a crash in World of Warcraft as well, running via vkd3d. It happens
> instantly after trying to log into the game world, so the issue is nicely
> reproducible for me. If you want me to get any traces, please let me know
> what you would like me to run to get them. dmesg logs for now:
> 
> [   78.450637] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450641] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d4b000 from 27
> [   78.450642] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450648] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450650] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92553000 from 27
> [   78.450652] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450656] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450658] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d4e000 from 27
> [   78.450660] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450665] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450666] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92542000 from 27
> [   78.450668] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450673] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450674] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d42000 from 27
> [   78.450676] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450680] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450682] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92552000 from 27
> [   78.450683] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450688] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450690] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d40000 from 27
> [   78.450691] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450696] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450697] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92552000 from 27
> [   78.450699] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450703] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450705] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d49000 from 27
> [   78.450706] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450711] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450713] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850ea1eb2000 from 27
> [   78.450714] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.454307] amdgpu 0000:09:00.0: IH ring buffer overflow (0x000BEDC0,
> 0x0003EEC0, 0x0003EDE0)
> [   88.570062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> signaled seq=25317, emitted seq=25319
> [   88.570099] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370
> [   88.570102] amdgpu 0000:09:00.0: GPU reset begin!
> [   88.831392] amdgpu 0000:09:00.0: GPU reset
> [   89.356679] [drm] psp mode1 reset succeed 
> [   89.475356] amdgpu 0000:09:00.0: GPU reset succeeded, trying to resume
> [   89.475465] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
> [   89.475508] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
> [   89.475642] [drm] PSP is resuming...
> [   89.623052] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
> [   89.806625] [drm] SADs count is: -2, don't need to read it
> [   89.856619] [drm] SADs count is: -2, don't need to read it
> [   89.938255] [drm] UVD and UVD ENC initialized successfully.
> [   90.038674] [drm] VCE initialized successfully.
> [   90.039672] [drm] recover vram bo from shadow start
> [   90.047496] [drm] recover vram bo from shadow done
> [   90.047497] [drm] Skip scheduling IBs!
> [   90.047499] [drm] Skip scheduling IBs!
> [   90.047511] [drm] Skip scheduling IBs!
> [   90.047518] [drm] Skip scheduling IBs!
> [   90.047523] [drm] Skip scheduling IBs!
> [   90.047524] [drm] Skip scheduling IBs!
> [   90.047530] [drm] Skip scheduling IBs!
> [   90.047531] [drm] Skip scheduling IBs!
> [   90.047533] [drm] Skip scheduling IBs!
> [   90.047535] [drm] Skip scheduling IBs!
> [   90.047536] [drm] Skip scheduling IBs!
> [   90.047538] [drm] Skip scheduling IBs!
> [   90.047539] [drm] Skip scheduling IBs!
> [   90.047555] amdgpu 0000:09:00.0: GPU reset(2) succeeded!

The GPU reset succeeded.  You'll need to restart your desktop manager to recover because currently no desktop managers handle GPU reset errors and re-initialize their contexts.
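
For anyone comparing their own dmesg against the log quoted above, the sequence to look for seems to be: VMC page faults, a ring timeout, a reset with "VRAM is lost!", and then rejected submissions ("Failed to initialize parser -125", where 125 is ECANCELED on Linux), which appear to come from contexts that were created before the reset. Below is a minimal Python sketch that tallies those messages. The function name and the patterns are assumptions derived from the messages in this report, not an official amdgpu tool, and other kernel versions may word the messages differently.

```python
import re

# Pattern based on the amdgpu timeout message quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (\w+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def summarize_gpu_events(lines):
    """Tally amdgpu fault, timeout and reset messages from dmesg lines."""
    summary = {
        "page_faults": 0,
        "timeouts": [],  # (ring, signaled seq, emitted seq)
        "reset_succeeded": False,
        "vram_lost": False,
        "rejected_submissions": 0,
    }
    for line in lines:
        if "VMC page fault" in line:
            summary["page_faults"] += 1
        match = TIMEOUT_RE.search(line)
        if match:
            ring, signaled, emitted = match.groups()
            summary["timeouts"].append((ring, int(signaled), int(emitted)))
        # Matches the final "GPU reset(N) succeeded!" line only.
        if re.search(r"GPU reset\(\d+\) succeeded", line):
            summary["reset_succeeded"] = True
        if "VRAM is lost" in line:
            summary["vram_lost"] = True
        if "Failed to initialize parser -125" in line:
            summary["rejected_submissions"] += 1
    return summary


# Sample lines copied (unwrapped) from the logs in this report.
sample = [
    "[   78.450637] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)",
    "[   88.570062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=25317, emitted seq=25319",
    "[   89.475508] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!",
    "[   90.047555] amdgpu 0000:09:00.0: GPU reset(2) succeeded!",
    "[   99.754128] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!",
]
print(summarize_gpu_events(sample))
```

If the summary shows a successful reset followed by a growing count of rejected submissions, that matches the situation described here: the GPU is back, but the running desktop session has not re-created its contexts.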
Comment 18 Jaap Buurman 2019-04-29 10:41:42 UTC
I was aware of that. I was more curious if the bug that is causing the crash can be identified and hopefully fixed. I can provide traces if required, since it seems I can easily reproduce the crash.
Comment 19 Mauro Gaspari 2019-04-29 11:35:27 UTC
(In reply to Jaap Buurman from comment #15)
> That's bad to hear :( Worth a try though. How often do you experience
> freezes by the way? And is this for all games, or are some games completely
> stable? For me, I am getting crashes in Kerbal Space Program, but not in
> Final Fantasy XII or World of Warcraft, even after hundreds of hours in both
> of these stable games.
> 
> Also, have you ever figured out which kernel parameter in particular makes
> your setup stable? It might help identify where the problem exists. Or do
> you need that exact combination of all those parameters to get your system
> stable?

Hi, regarding the parameters I am using.
Unfortunately for me the issue is not easy to reproduce. Without the parameters enabled, it still takes hours for a crash to happen. On top of that, mesa and kernel updates are really frequent on Tumbleweed, which is another variable that makes it harder to troubleshoot unless I can find a really fast way to reproduce the issue.

Regarding which games crash: with those kernel parameters applied, the only crashes I noticed were when I tried to run games through Wine in DX11 mode with DXVK, which I believe needs at least LLVM8 to be stable on Vega GPUs. Currently on my Tumbleweed I have LLVM7, so I just stick to non-DXVK games, or even better native ones, until LLVM8 is available for Tumbleweed.

If you want to give it a try and you run on ubuntu, you can check this article: https://github.com/lutris/lutris/wiki/Installing-drivers

If you do so, I recommend you run a full system backup first using Clonezilla or similar software, because those PPAs are marked as unstable.
Comment 20 Jaap Buurman 2019-04-29 11:37:13 UTC
I already run LLVM 8.0.0, since it's the latest stable in Arch's repository. Thanks for the tip though :)
Comment 21 Mauro Gaspari 2019-04-29 13:52:33 UTC
(In reply to Jaap Buurman from comment #20)
> I already run LLVM 8.0.0, since it's the latest stable in Arch's repository.
> Thanks for the tip though :)

Since it is very easy for you to reproduce the freeze, it would be great if you could add those kernel parameters, and see if they help.
Comment 22 Mauro Gaspari 2019-05-24 05:12:18 UTC
I ran more tests:

1. Installed Arch Linux, vulkan, llvm8 and ran wine games with DXVK. With same kernel parameters on grub, no freezes, no crashes. Great performance.

2. Installed Ubuntu Budgie 19.04, Oibaf ppa, updated mesa and llvm8. Same as with Arch Linux: With same kernel parameters on grub, no freezes, no crashes. Great performance.

Because I cannot reproduce the issue quickly, the only problem I have is telling clearly when the issue has been resolved by Mesa. It sometimes takes hours for me to get the freeze.
If someone has a quick way to trigger system freeze, I am happy to run more tests.
Comment 23 Sylvain BERTRAND 2019-05-24 12:25:11 UTC
It seems I get the same freezes as you. It takes hours of gaming to get some
random hard hang (no log). I thought I was overheating, but realized that my system is on
"vacation" while playing.
linux amd-staging-drm-new/x11 native/mesa/llvm(erk...), all git no older than a
week.
playing mostly dota2 vulkan on AMD TAHITI XT
Comment 24 Mauro Gaspari 2019-05-24 13:44:27 UTC
(In reply to Sylvain BERTRAND from comment #23)
> It seems I get the same freezes than you. It takes hours of gaming to get
> some
> random hard hang (no log). I thought I was overheating, but realized that my
> system is on
> "vacation" while playing.
> linux amd-staging-drm-new/x11 native/mesa/llvm(erk...), all git no older
> than a
> week.
> playing mostly dota2 vulkan on AMD TAHITI XT

Hi, a bit frustrating eh? :)
I have been asking around and it seems that RadeonVII and RX590 do not suffer those issues. Probably related to default clock speeds by manufacturers.

Anyway, if you try the kernel parameters I mentioned above, those should help. I have not had crashes in weeks since I enabled them in my grub. This does not depend on the distribution either: those grub kernel settings worked for me on Tumbleweed, Arch, and Ubuntu Budgie.

I hope it helps.
Comment 25 Matt Coffin 2019-06-03 08:07:58 UTC
(In reply to Mauro Gaspari from comment #24)

> Hi, a bit frustrating eh? :)
> I have been asking around and it seems that RadeonVII and RX590 do not
> suffer those issues. Probably related to default clock speeds by
> manufacturers.

FWIW, I'm seeing this exact same issue, and I'm on an RX590.
Comment 26 Matt Coffin 2019-06-03 20:10:26 UTC
For reproducibility, here's what I've been using. (I can reproduce this crash on both the RADV and AMDVLK Vulkan implementations, and can reproduce it both on top of sway 1.1 (wayland), and xfce4 (X11)).

* 5.1.3-arch2-1-ARCH
* LLVM 8.0.0
* mesa/vulkan-radeon: 19.0.4
* AMDVLK: (dev branch from nighttime Mountain time 20190602)
* DXVK: winelib version - release 1.2.1

I run "House Flipper" from Steam with DXVK_FILTER_DEVICE_NAME=590.

On 1080p@60Hz with v-sync, it runs quite well and stable (for hours). If I disable v-sync and framerate limiting, the crash occurs within a minute usually.

At 2560x1440 resolution, no refresh rate is stable; I have tried both 60Hz and 144Hz.

With the game rendering 1080p but scaling up to a 2560x1440 display, I saw it crash once, but was unable to duplicate it again.

I'm new to low-level development, and would like to help. If I can provide any information since I can reliably reproduce the issue, I'd love to. Let me know what would be useful and I'd be happy to get it out to you.

I've also seen the bugs listed in my other comment on the other bug here: https://bugs.freedesktop.org/show_bug.cgi?id=102322#c82
Comment 27 Sam 2019-06-04 21:43:38 UTC
Hello! I can confirm that I have the same issues. I am using a Vega 56 and openSUSE Tumbleweed (X11 and KDE) with:

Kernel Version:  5.1.5-1-default
X Server Release:  12004000
Driver:  X.Org Radeon RX Vega (VEGA10, DRM 3.30.0, 5.1.5-1-default, LLVM 7.0.1)


I have been having the same freezes exactly as described here since, as far as I can remember, mesa 19.0.4 and kernel 5.0.13 (based on the Tumbleweed snapshots from when this started happening).

This was definitely not happening before on mesa 18.x/LLVM 6 and 7 and kernel 4.20. I neither run overclocks nor have I ever messed with firmware/BIOS, etc. Everything has been running as-is since Oct. 2018, so firmware or BIOS issues can probably be ruled out, I guess.

In my case, I have also experienced this issue when running non-demanding OpenGL games and even desktop applications (I had a crash happen on the desktop with just WxMaxima, a computer algebra system GUI, open and doing nothing).

The easiest way for me to reproduce it is by simply leaving Pillars of Eternity (an OpenGL Unity game) open and idle for an hour or so. I have tried setting up Kdump and catching some error messages in the logs, with no luck. I'm definitely open to directions on how to get more info if this can help.
Comment 28 Mauro Gaspari 2019-06-05 06:34:02 UTC
Thanks all for adding comments and testing to this bug. I believe if we prove there are enough people affected on different cards, it will get the attention it needs, and hopefully a permanent mesa fix can be found and implemented.

For those affected, if you don't mind testing the kernel parameters workaround I described above and posting your results, that would be a nice start.
If you need help on how to do that you can reach out to me via PM or email.
Comment 29 Sam 2019-06-09 18:46:37 UTC
I have been trying myself for the moment to get some info with just debug parameters:

amdgpu.dc=1 
amdgpu.vm_fault_stop=2 
amdgpu.vm_debug=1 
amdgpu.gpu_recovery=0 

Incidentally I couldn't get any freeze to happen after running two troublesome games for about two hours each (left idle but on load, Pillars of Eternity and Surviving Mars) but this could mean anything as they happen completely randomly. 

Perhaps someone who can reproduce the issue instantly can test the parameters more reliably?
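
Before drawing conclusions from a test run, it may be worth confirming that the options actually reached the kernel, to rule out a typo or a bootloader config that was not regenerated. Below is a minimal Python sketch, assuming the options are passed on the kernel command line (readable from /proc/cmdline). The helper names are made up for illustration, and the option list is simply the four debug parameters above.

```python
def amdgpu_params(cmdline):
    """Return the amdgpu.* options found in a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token.split("=", 1)
            params[key] = value
    return params


def missing_params(cmdline, wanted):
    """List the wanted options that are absent or set to another value."""
    active = amdgpu_params(cmdline)
    return sorted(k for k, v in wanted.items() if active.get(k) != v)


# The debug parameters listed in this comment.
WANTED = {
    "amdgpu.dc": "1",
    "amdgpu.vm_fault_stop": "2",
    "amdgpu.vm_debug": "1",
    "amdgpu.gpu_recovery": "0",
}

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    print("missing or different:", missing_params(current, WANTED))
```

An empty list means every option is present with the expected value. Anything listed was either not passed or was passed with a different value.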
Comment 30 Sam 2019-06-10 17:13:57 UTC
Update: I can now confirm, at least in my case, that the freezes DO occur using the parameters above, and also with all of them (shown below), while doing another test round on Pillars of Eternity.

amdgpu.dc=1 
amdgpu.vm_update_mode=0 
amdgpu.dpm=-1 
amdgpu.ppfeaturemask=0xffffffff 
amdgpu.vm_fault_stop=2 
amdgpu.vm_debug=1 
amdgpu.gpu_recovery=0 

I was continuously writing dmesg to a file but yet again I didn't get any messages/warnings/errors.
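
One thing that might matter when capturing dmesg to a file before a hard freeze: lines that are still sitting in a write buffer when the machine locks up never reach the disk. Below is a minimal Python sketch that syncs after every line, assuming `dmesg --follow` from util-linux is available (it may need root). The function names and the log file name are hypothetical. It cannot help if the kernel hangs before it prints anything.

```python
import os
import subprocess


def append_durably(path, text):
    """Append text to path and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write it out to the disk


def follow_dmesg(path):
    """Copy new kernel messages to path, syncing after every line."""
    # "dmesg --follow" keeps running and prints messages as they arrive.
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        append_durably(path, line)


# Usage (runs until interrupted; the file name is only an example):
# follow_dmesg("amdgpu-freeze.log")
```

Syncing per line is slow, but kernel messages are infrequent enough that this should not matter in practice.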
Comment 31 Sam 2019-06-13 21:04:11 UTC
I have attached another trace I managed to get today at 22:24 while playing Pillars Of Eternity (OpenGL) 

It didn't freeze the whole system as usual, just the whole Plasma and X sessions, so the other TTYs were accessible. This is the first time this has happened. I was using the latest default kernel from the openSUSE Kernel:stable repo (5.1.9-5.1), as requested on https://bugzilla.opensuse.org/show_bug.cgi?id=1136293

Note that, as in the other dmesgs attached, the crash seems to be caused by amdgpu. Should the bug category be moved there?
Comment 32 Sam 2019-06-13 21:04:35 UTC
Created attachment 144535 [details]
dmesg from the freeze which didn't completely bork everything. It starts on line 1181
Comment 33 Jiri Slaby 2019-06-14 05:48:33 UTC
(In reply to Sam from comment #32)
> Created attachment 144535 [details]
> dmesg from the freeze which didn't completely bork everything. It starts on
> line 1181

Attaching the relevant part inline:

> [drm:amdgpu_dm_commit_planes.isra.0 [amdgpu]] *ERROR* Waiting for fences timed out.
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=726226, emitted seq=726228
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process PillarsOfEterni pid 12250 thread PillarsOfE:cs0 pid 12254
> amdgpu 0000:1e:00.0: GPU reset begin!
> [drm:amdgpu_dm_commit_planes.isra.0 [amdgpu]] *ERROR* Waiting for fences timed out.
> amdgpu 0000:1e:00.0: GPU BACO reset
> amdgpu 0000:1e:00.0: GPU reset succeeded, trying to resume
> [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
> [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
> [drm] PSP is resuming...
> [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
> [drm] UVD and UVD ENC initialized successfully.
> [drm] VCE initialized successfully.
> [drm] recover vram bo from shadow start
> [drm] recover vram bo from shadow done
> [drm] Skip scheduling IBs!
> [drm] Skip scheduling IBs!
> amdgpu 0000:1e:00.0: GPU reset(2) succeeded!
> [drm] Skip scheduling IBs!
> ...
> [drm] Skip scheduling IBs!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [drm] Skip scheduling IBs!
> ...
> [drm] Skip scheduling IBs!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Comment 34 Alex Deucher 2019-06-14 14:33:47 UTC
(In reply to Jiri Slaby from comment #33)
> > amdgpu 0000:1e:00.0: GPU reset(2) succeeded!
> > [drm] Skip scheduling IBs!
> > ...
> > [drm] Skip scheduling IBs!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> > [drm] Skip scheduling IBs!
> > ...
> > [drm] Skip scheduling IBs!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

The GPU reset was successful.  You need to restart your desktop environment to recover.
Comment 35 shadow.archemage 2019-07-06 09:30:35 UTC
(In reply to Mauro Gaspari from comment #22)

> The only issue I have not being able to reproduce the issue quickly, is to
> clearly understand when the issue is resolved by Mesa. It takes hours for me
> to get the freeze sometimes. 
> If someone has a quick way to trigger system freeze, I am happy to run more
> tests.

Hi Mauro,

The issue happened to me much more frequently when I opted into Steam beta and ran Monster Hunter: World. Before opting in, the crashes happened around 1-2 hours after the game started. With Steam beta though, it happens less than 5 minutes in.

The only change that I noted when I opted into Steam beta was that the games suddenly downloaded some shader pre-caching stuff. Unfortunately, I'm not too familiar with it, and I'm not too sure if it is related to the problem.

I am running Manjaro, Gnome 3.32.2, Kernel version 5.1.15-1, Mesa 19.1.1.
Let me know if I missed something.

Thanks,
Eph
Comment 36 Mauro Gaspari 2019-07-07 05:31:34 UTC
(In reply to shadow.archemage from comment #35)
> (In reply to Mauro Gaspari from comment #22)
> 
> > The only issue I have not being able to reproduce the issue quickly, is to
> > clearly understand when the issue is resolved by Mesa. It takes hours for me
> > to get the freeze sometimes. 
> > If someone has a quick way to trigger system freeze, I am happy to run more
> > tests.
> 
> Hi Mauro,
> 
> The issue happened to me much more frequently when I opted into Steam beta
> and ran Monster Hunter: World. Before opting in, the crashes happen around
> 1-2 hours after the game starts. With Steam beta though, it happens around
> <5 minutes in.
> 
> The only change that I noted when I opted into Steam beta was that the games
> suddenly downloaded some shader pre-caching stuff. Unfortunately, I'm not
> too familiar with it, and I'm not too sure if it is related to the problem.
> 
> I am running Manjaro, Gnome 3.32.2, Kernel version 5.1.15-1, Mesa 19.1.1.
> Let me know if I missed something.
> 
> Thanks,
> Eph

I am not an expert, but I am quite sure shaders have a big part in this. If you can, disable shader caching.
There are a few tests you can do:
1. Did you try with the kernel parameters I posted above? I always ran all the parameters together (GPU+CPU), and at the time I did not have crashes for weeks on my Vega64. I am using a RadeonVII now and it seems those parameters are not needed.
2. Valve sponsored an interesting project that removes the dependency of AMD Mesa on LLVM and uses ACO instead. Valve made this available for Arch-based systems via AUR, and for Ubuntu-based systems via PPA. If you want to test it, you can check the posts below. I am going to test this myself on both Arch and Ubuntu.
https://steamcommunity.com/games/221410/announcements/detail/1602634609636894200
https://steamcommunity.com/app/221410/discussions/0/1640915206474070669/
Comment 37 shadow.archemage 2019-07-07 10:55:49 UTC
(In reply to Mauro Gaspari from comment #36)
> (In reply to shadow.archemage from comment #35) 
> I am not an expert, but I am quite sure shaders have a big part in this. If
> you can, disable shader caching.
> There are a few tests you can do:
> 1. Did you try with the kernel parameters I posted above? I always ran all
> the parameters together. GPU+CPU and at the time, I did not have crashes for
> weeks on my Vega64. I am using a RadeonVII now and it seems those parameters
> are not needed.

I tried the kernel parameters above, and the game still crashed for me.

> 2. Valve sponsored an interesting project that removes dependency of AMD
> Mesa from LLVM. And instead uses ACO. Valve made this available for Arch
> based systems via AUR, and Ubuntu based system via PPA. If you want to test
> it, you can check the posts below. I am going to test this myself on both
> Arch and Ubuntu. 
> https://steamcommunity.com/games/221410/announcements/detail/
> 1602634609636894200
> https://steamcommunity.com/app/221410/discussions/0/1640915206474070669/

Will check this out, but will also keep an eye on this thread about the results of your tests. Thanks!
Comment 38 Sylvain BERTRAND 2019-07-07 17:42:14 UTC Comment hidden (spam)
Comment 39 Samuel Sieb 2019-07-08 05:29:56 UTC
(In reply to shadow.archemage from comment #37)
> I tried the kernel parameters above, and the game still crashed for me.

Are you saying that the game is crashing or the graphics device is?
Comment 40 Wilko Bartels 2019-07-09 14:29:41 UTC
Since I have been experiencing the same issue since June (I didn't game much), I want to share my system info.
I am on Ryzen 2600X, Vega 56 Pulse, Strix B450. Using Arch 5.1.
I tested every window manager I know, and also tested 60Hz and 144Hz. The crashes are totally random. I only play Dota 2. Last Friday I played like 6 games in a row without a single issue. The day after I crashed like 7 times per game. I always have to press reset on my PC.
Is it known whether this issue is related to a kernel or mesa update? I mean, it wasn't always like this, was it?
Comment 41 Sylvain BERTRAND 2019-07-09 18:06:21 UTC
Guys,

I am getting freezes on tahiti xt/fx9590 recently... But I am not logging a bug yet
because I think the reason is summer heat.

Try to game with an opened computer case with a big fan blowing
into it.
Comment 42 Wilko Bartels 2019-07-10 06:29:33 UTC
(In reply to Wilko Bartels from comment #40)
> Since i experience the same issue since june (didnt game much) i want to
> share my system info.
> I am on Ryzen 2600X, Vega 56 Pulse, Strix B450. Using Arch 5.1.
> Tested every Windowmanager i know , tested also 60Hz and 144Hz. The crashes
> are totally random. I only play Dota 2. Last friday i played like 6 games in
> a row without a single issue. The day after i crashed like 7 times per game.
> Always have to press reset on my PC. 
> Is it know that hits issue related to a kernel or mesa update? I mean it
> wasnt always like this no?

Tested yesterday with the new 5.2 Linux kernel from Arch testing, and also tested without the variable refresh rate setting and without the TearFree setting in Xorg. It crashed three times.
Comment 43 Mauro Gaspari 2019-07-10 07:25:35 UTC
Hi,
No, it was not always like this. I was using Kubuntu and my games were really smooth for months. Zero crashes. Then after a mesa update (I do not recall the exact version, but it was around 18.5 or something like that) it all got worse.

Same game on the same PC, same hardware, same power supply, same cooling, but on Windows: zero crashes.
Same game on the same PC with an NVIDIA GPU: zero crashes.

I wish we could get the attention of someone @AMD because there is clearly some issue going on. I would be very happy to help troubleshooting, if only we had some contact with AMD. 

I have not used AMDGPU-PRO in ages. Does anyone here have it installed, to check whether the same issue happens with the proprietary drivers?
Comment 44 Wilko Bartels 2019-07-10 08:03:07 UTC
(In reply to Mauro Gaspari from comment #43)
> Hi,
> No it was not always like this. I was using Kubuntu and my games were really
> smooth for months. Zero crashes. Then after a mesa update, I do not recall
> exactly the version but was around 18.5 or something like that, it all got
> worse. 
> 
> Same game on same PC same hardware same power supply, same cooling, but on
> windows, zero crashes.
> same game on same PC with NVIDIA gpu, zero crashes.
> 
> I wish we could get the attention of someone @AMD because there is clearly
> some issue going on. I would be very happy to help troubleshooting, if only
> we had some contact with AMD. 
> 
> I have not used AMDGPU-PRO in ages, anyone here got that one to check if the
> same issue happens with proprietary drivers?

I was also thinking about AMDGPU-PRO, but I would want to install Ubuntu LTS on another disk for that. It might take several weeks or even longer for me to test, and I am not even sure that's super helpful. I'm pretty sure that, at least on Arch at the end of 2018, I had zero problems. At least with my Vega ;-)
Maybe I was wrong switching from green to red after 10 years. hehe
Comment 45 Wilko Bartels 2019-07-10 08:19:30 UTC
(In reply to Mauro Gaspari from comment #43)
> Hi,
> No it was not always like this. I was using Kubuntu and my games were really
> smooth for months. Zero crashes. Then after a mesa update, I do not recall
> exactly the version but was around 18.5 or something like that, it all got
> worse. 
But is it proven that Mesa is the problem here? There was once an issue with the linux-firmware package in early 2018, if I remember correctly. Users had to roll back back then.
Regardless, I might roll back to Mesa 18.3 to test, if I can manage that.
Comment 46 Mauro Gaspari 2019-07-10 08:26:23 UTC
This is exactly the reason why I wish we could get more attention to this issue. 
I have seen so many people in forums on the internet replacing their AMD cards with NVIDIA due to similar issues. Or switching back to windows. 

I do not have proof that the issue is just Mesa; it could be a combination of Mesa, kernel, and firmware for all I know.

I opened this bug to see if I could get help troubleshooting the issue and finding a permanent fix for all affected users. If there is a better place to report this, I am happy to open as many tickets and send as many emails as needed :)

Also, it would be extremely helpful if we had a script or something to trigger the freeze quickly and consistently, so that troubleshooting Mesa, kernel, and firmware combinations would be much easier and more reliable.
If anyone has a test suite, script, or some automated check that can trigger the issue quickly, please share.
Comment 47 Sam 2019-07-10 09:41:22 UTC
The relevant issue and bug report here (the system freezing completely or if lucky just killing the X session, NOT games crashing) seems to be related exclusively to AMDGPU, and not to mesa. Whereas I got the same issues over and over after trying out several versions of mesa, switching to older versions of the kernel "fixes" it for me (the latest version I tried out which didn't have these issues is Kernel 4.20.13, in my case from https://download.opensuse.org/repositories/home:/tiwai:/kernel:/4.20/standard/x86_64/)

There is also a report from another user who temporarily fixed it by forcing the GPU to run at the maximum power setting (https://bugzilla.opensuse.org/show_bug.cgi?id=1136293):

# echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
# echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

and then to reset back to normal:

# echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level
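
For anyone who wants to script this, here is a minimal sketch of the same steps as shell functions. It assumes the GPU is card0 and that 7 is the highest sclk level, as in the commands above; both can differ per system (cat pp_dpm_sclk lists the available levels). The function names and the DEV variable are only illustrative, and the writes need root.

```shell
#!/bin/sh
# Sketch only: DEV defaults to card0 as in the commands quoted above.
# Function names are illustrative, not part of any tool.
DEV="${DEV:-/sys/class/drm/card0/device}"

# Switch to manual control and pin the GPU to one sclk level (default 7).
force_sclk_level() {
    level="${1:-7}"
    echo manual > "$DEV/power_dpm_force_performance_level" || return 1
    echo "$level" > "$DEV/pp_dpm_sclk"
}

# Hand clock selection back to the driver.
reset_dpm() {
    echo auto > "$DEV/power_dpm_force_performance_level"
}
```

Source the file as root, call force_sclk_level before starting the game and reset_dpm after quitting it. The setting does not survive a reboot.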
Comment 48 Mauro Gaspari 2019-07-10 14:44:21 UTC
@Sam,

Thank you, this is helpful. Since it is not distribution specific and not Mesa related, do you think we should keep the bug here, merge it with other similar bugs, or create one on another bug tracker?
Happy to help and troubleshoot more from my side, and/or push for this to be resolved once and for all, for all AMDGPU users.

Thanks
Mauro
Comment 49 Wilko Bartels 2019-07-10 18:42:53 UTC
(In reply to Sam from comment #47)
> The relevant issue and bug report here (the system freezing completely or if
> lucky just killing the X session, NOT games crashing) seems to be related
> exclusively to AMDGPU, and not to mesa. Whereas I got the same issues over
> and over after trying out several versions of mesa, switching to older
> versions of the kernel "fixes" it for me (the latest version I tried out
> which didn't have these issues is Kernel 4.20.13, in my case from
> https://download.opensuse.org/repositories/home:/tiwai:/kernel:/4.20/
> standard/x86_64/)
> 
> There is also a report from another user which temporarily fixed it by
> forcing the gpu to run at the maximum power setting
> (https://bugzilla.opensuse.org/show_bug.cgi?id=1136293):
> 
> # echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
> # echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
> 
> and then to reset back to normal:
> 
> # echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level

I am currently on my 4th game of Dota in a row with the performance level set to manual and the sclk level set to 7. Working so far. Everyone should test this now so we have more reliable data. As we all know, the issue can be gone for several hours, so my experience means nothing yet.
It would be amazing if we could pin down the issue to the performance level of the cards.
Comment 50 shadow.archemage 2019-07-12 15:26:39 UTC
(In reply to Samuel Sieb from comment #39)
> (In reply to shadow.archemage from comment #37)
> > I tried the kernel parameters above, and the game still crashed for me.
> 
> Are you saying that the game is crashing or the graphics device is?

Apologies, what I meant by this is that my system locks up, not just the game crashing. I can't recover from it except by resetting my PC using the power button.
Comment 51 shadow.archemage 2019-07-13 17:22:41 UTC
(In reply to Wilko Bartels from comment #49)
> (In reply to Sam from comment #47)
> > The relevant issue and bug report here (the system freezing completely or if
> > lucky just killing the X session, NOT games crashing) seems to be related
> > exclusively to AMDGPU, and not to mesa. Whereas I got the same issues over
> > and over after trying out several versions of mesa, switching to older
> > versions of the kernel "fixes" it for me (the latest version I tried out
> > which didn't have these issues is Kernel 4.20.13, in my case from
> > https://download.opensuse.org/repositories/home:/tiwai:/kernel:/4.20/
> > standard/x86_64/)
> > 
> > There is also a report from another user which temporarily fixed it by
> > forcing the gpu to run at the maximum power setting
> > (https://bugzilla.opensuse.org/show_bug.cgi?id=1136293):
> > 
> > # echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
> > # echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
> > 
> > and then to reset back to normal:
> > 
> > # echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level
> 
> I am currently on my 4th game of dota in a row when setting performance
> level manual to 7. working so far. Everyone should test this now so we have
> more reliable data. As we all now the issue can be gone for several hours so
> my experience means nothing yet. 
> Would be amazing if we can pin down the issue to the  performance level of
> the cards.

Played Monster Hunter and Dota 2 for quite a long time, and I didn't experience any system freezes with the max performance settings. Will test again tomorrow to see if the workaround is consistent enough.
Comment 52 Wilko Bartels 2019-07-16 08:28:22 UTC
I played about 30 Dota 2 matches without a single freeze. It's safe to say this is it. Where is the right place to report this issue?
Comment 53 Mauro Gaspari 2019-07-17 03:34:31 UTC
Thank you all for the great work.
I will post on AMD support forums and add the link of this and other AMDGPU related bugs.
Comment 54 Sylvain BERTRAND 2019-07-17 16:02:32 UTC
Power management code lives in amdgpu, so the right place is here, plus the
"dri-devel" and "amd-gfx" mailing lists (aka the Linux GPU driver mailing lists).

As far as I am concerned, when I play Dota 2, I always switch the GPU DPM to
high and the CPU frequency governor to performance (because all those things
steal a significant amount of fps... actually, I switch my GPU DPM to high just
in case it is as nasty as the CPU governor).
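
A rough sketch of that setup, assuming the GPU is card0 and the standard cpufreq sysfs layout. GPU_DEV, CPU_ROOT and the function names are made up for illustration; run as root.

```shell
#!/bin/sh
# Sketch only: paths are the usual sysfs locations; adjust the card index.
GPU_DEV="${GPU_DEV:-/sys/class/drm/card0/device}"
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Force the GPU DPM level ("high" by default; "auto" restores the default).
set_gpu_level() {
    echo "${1:-high}" > "$GPU_DEV/power_dpm_force_performance_level"
}

# Write the given governor ("performance" by default) for every CPU
# that exposes a cpufreq directory.
set_cpu_governor() {
    gov="${1:-performance}"
    for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue
        echo "$gov" > "$f"
    done
}
```

Call set_gpu_level high and set_cpu_governor performance before a session, and set_gpu_level auto afterwards. The previous governor name depends on the distro, so check it with cat before changing it if you want to restore it.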
Comment 55 Hadet 2019-07-18 02:30:29 UTC
So I think this might have something to do with something Xorg is doing, because I tried Wayland on a whim and it has not happened since, even after gaming for many hours. I now have 21 hours of uptime with no random crashes.
Comment 56 Sylvain BERTRAND 2019-07-18 13:44:29 UTC
Playing dota2 vulkan or GL?

I guess it's Vulkan, and there I don't know how Vulkan deals with multiple WSIs,
or how dota2 selects the one it will use.

The idea is to clearly identify the code paths which would be "buggy".

(my custom distro is x11 native)

That said, I don't know the status of wayland: did they reach the same "cluster
f*ck" level that x11 is at? (irony, since wayland reason to exist is to be
orders of magnitude less kludgy than x11)
Comment 57 Hadet 2019-07-19 00:12:59 UTC
Created attachment 144821 [details]
Dmesg after crash

I spoke too soon. It's happening on Wayland now too, just a lot less frequently.
Comment 58 Mauro Gaspari 2019-07-22 05:19:29 UTC
After a long time without crashes on Tumbleweed, I wanted to prepare a test setup for Valve's Mesa build with ACO. So I installed Ubuntu Budgie 18.04 LTS with the hardware enablement stack, and I noticed the OS freezes are now back, even on the Radeon VII.

Here is what I noticed in the game's behavior. This is a game running on CrossOver (Wine) with DX11 and DXVK. I want to point out that I do alt-tab out of games to do other things, so this might be a factor to consider. But again, I do the same on my NVIDIA-GPU laptop and I never had a single freeze or fps drop.
Not sure if points 2 and 3 below are related, I just wanted to share my observations.

1. Game starts with excellent FPS. I can hear GPU fans spinning.
2. After a while, game loses a lot of FPS starts to become slow and sluggish, GPU seems to be no longer doing much and I can no longer hear the fans spinning.
3. After a while longer, the whole OS freezes as described in my first post.


What I am going to do next:
1. Use the workaround of comment #47 and test for a few days.
2. Install Valve mesa-aco with ubuntu PPA and test (without workarounds) for a few days.

I will report back when I have more details on my tests.

System info:
OS: Ubuntu 18.04.2 LTS x86_64 
Kernel: 5.0.0-21-generic
Resolution: 3440x1440
CPU: AMD Ryzen 7 2700X (16) @ 3.700G 
GPU: AMD Vega 20 
Memory: 2650MiB / 64398MiB
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.2
Comment 59 wedens13 2019-07-23 16:25:04 UTC
I have similar issues with Sapphire Pulse Vega 56.
Arch Linux
Kernel versions: 4.19.60-1-lts, 5.2.1-1
mesa: 19.1.3-1, mesa with ACO (f9b38efdda166f2b79562525e72fe135c6b23d54)
llvm: 8.0.0

I've also tried booting with integrated video and using DRI_PRIME=1 to offload to vega. It crashes similarly (after 5min of playing witcher 3 with dxvk 1.3.1):

Jul 23 22:44:01 wedens-pc kernel: amdgpu 0000:03:00.0: [mmhub] VMC page fault (src_id:0 ring:154 vmid:1 pasid:32771, for process  pid 0 thread  pid 0)
Jul 23 22:44:01 wedens-pc kernel: amdgpu 0000:03:00.0:   at address 0x0000800100a00000 from 18
Jul 23 22:44:01 wedens-pc kernel: amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00100134
Jul 23 22:44:11 wedens-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=230, emitted seq=233
Jul 23 22:44:11 wedens-pc kernel: [drm] GPU recovery disabled.
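
To pull lines like these out of a long log, a small filter along these lines may help. The patterns are only the message fragments that show up in this thread, not an exhaustive list of amdgpu errors, and the function name is made up.

```shell
#!/bin/sh
# Sketch only: keep amdgpu fault/timeout/reset lines from a log on stdin.
# Patterns are taken from messages quoted in this bug report.
amdgpu_errors() {
    grep -E 'amdgpu.*(page fault|PROTECTION_FAULT|timeout|GPU reset)|GPU recovery|\[powerplay\] Failed'
}
```

Typical use would be dmesg | amdgpu_errors right after a recoverable hang, or journalctl -k -b -1 | amdgpu_errors after a hard reset to read the kernel log of the previous boot (this needs a persistent journal).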


I'm going to try mesa master and manual power level workaround (when should I use "reset to normal" command?).
Comment 60 wedens13 2019-07-23 16:30:05 UTC
A couple of relevant log fragments with crashes: https://paste.ee/p/rtDEg
Comment 61 wedens13 2019-07-23 17:14:25 UTC
I've tried starting witcher 3 after executing
echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

and it still crashes immediately.

log: https://paste.ee/p/thvXf
Comment 62 Sylvain BERTRAND 2019-07-23 20:18:12 UTC
Could it be unstable power supply lines to the GPU, if overheating is excluded?
Comment 63 Mauro Gaspari 2019-07-24 04:14:21 UTC
(In reply to Sylvain BERTRAND from comment #62)
> unstable power supply lines to the gpu if overheating is excluded?

I cannot speak for others. In my case, I would say no. I installed Windows 10 on a separate SSD just to check that there was no hardware issue of any kind.
On Windows 10 with the latest AMD drivers, I have no freezes or any other issues running the same games.
Comment 64 Sylvain BERTRAND 2019-07-24 13:09:23 UTC
> I cannot speak for others. In my case,U would say no. I installed windows10 in
> a separate ssd, just to check there was no hardware issue of any kind. 
> On windows10 with latest amd drivers, I have no freezes or any other issue
> running same games.

Native gnu/linux game or going through wine/dxvk?
Comment 65 wedens13 2019-07-24 14:27:33 UTC
(In reply to Sylvain BERTRAND from comment #62)
> unstable power supply lines to the gpu if overheating is excluded?

It's not overheating in my case, but my PSU is pretty old (I'm waiting for components for my new build to arrive, including a new PSU). I've lowered the power limit (to 80W) and I haven't had any crashes yet.

So, in my case the problem *might be* related to PSU. But I can't exclude (nor confirm) possibility of driver problems with higher power states (until I have a better PSU).

I'll report back if I have any crashes with new PSU or lowered PL.
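
For others who want to try a lower power limit: the comment above does not say how the limit was set, so the following is only one possible way, via amdgpu's hwmon power1_cap file, which takes microwatts. It assumes the GPU is card0; DEV and the function name are illustrative, and the write needs root.

```shell
#!/bin/sh
# Sketch only: set the GPU power cap through hwmon (value is in microwatts).
DEV="${DEV:-/sys/class/drm/card0/device}"

set_power_cap_watts() {
    watts="$1"
    for cap in "$DEV"/hwmon/hwmon*/power1_cap; do
        [ -e "$cap" ] || continue
        echo $((watts * 1000000)) > "$cap"
        return
    done
    echo "no power1_cap file found under $DEV/hwmon" >&2
    return 1
}
```

For example, set_power_cap_watts 80 would request the 80W limit mentioned above. The allowed range can be read from power1_cap_min and power1_cap_max in the same directory.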
Comment 66 Hadet 2019-07-24 14:41:33 UTC
I don't think it's faulty hardware in any of our cases, to be perfectly honest. It's a bad instruction set: this didn't happen with older kernels or firmware. The issue now is that there are so few of us with Vega cards that we're really on our own trying to troubleshoot this situation.

Since switching to Wayland my crashing has been a lot less frequent: I'd say once every couple of days, as opposed to once every few hours, when gaming with Vulkan/DXVK.
Comment 67 Sylvain BERTRAND 2019-07-24 14:56:22 UTC
> ...
> Vulkan/DXVK

The bugs may be in wine/DXVK then. You should report a bug to them and link
this bug to theirs.
Comment 68 Mauro Gaspari 2019-07-27 11:28:28 UTC
(In reply to Sylvain BERTRAND from comment #67)
> > ...
> > Vulkan/DXVK
> 
> The bugs may be in wine/DXVK then. You should report to a bug to them and
> link
> this bug to theirs.

If any of you opened bugs on other bug trackers, please post a link here so we can all contribute to both.

I did some tests on my end and can report the following:

System info:
OS: Ubuntu 18.04.2 LTS x86_64 
Kernel: 5.0.0-21-generic
Resolution: 3440x1440
CPU: AMD Ryzen 7 2700X (16) @ 3.700G 
GPU: AMD Vega 20 
Memory: 2650MiB / 64398MiB
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.2

1. Power profile set to manual did not help
2. Mesa-ACO from Valve seems to have helped quite a bit. So far, no system freezes

I installed Arch on another SSD and will try to reproduce the same tests:
1. Plain Arch - crash or not ?
2. Arch with forced power profile - crash or not ?
3. Arch with mesa-ACO - crash or not ?
Comment 69 Sylvain BERTRAND 2019-07-27 13:19:59 UTC
Don't forget to provide the software stack used:

Which software (game, CAD...)? wine/dxvk? Native?
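
Something like the following could make those reports uniform. It is only a sketch: the function name and output format are made up, and the Mesa line relies on glxinfo being installed and a display being available.

```shell
#!/bin/sh
# Sketch only: print the basics of the software stack for a bug comment.
# $1 = application name, $2 = how it runs (e.g. native, wine+dxvk).
stack_info() {
    echo "kernel: $(uname -r)"
    if command -v glxinfo >/dev/null 2>&1; then
        glxinfo 2>/dev/null | grep 'OpenGL version string' \
            || echo "mesa: glxinfo failed (no display?)"
    else
        echo "mesa: glxinfo not installed"
    fi
    echo "app: ${1:-unknown}"
    echo "api: ${2:-unknown}"
}
```

For example, stack_info "Dota 2" native prints the kernel release, the OpenGL/Mesa version string if available, and the two labels passed in.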
Comment 70 Mauro Gaspari 2019-07-27 17:32:53 UTC
(In reply to Sylvain BERTRAND from comment #69)
> Don't forget to provide the software stack used:
> 
> which sofware (game, cad...)? wine/dxvk? native?

Good point. Games being tested:

Pillars of Eternity - Native
Battletech - Native
Eve Online - Wine+DXVK
Comment 71 Yury Zhuravlev 2019-07-28 03:14:23 UTC
Can somebody try games without any fps limits?
For example, with vblank_mode=0 and with in-game limits disabled.
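
A tiny wrapper for that test, as a sketch. vblank_mode is an environment variable read by Mesa's OpenGL drivers; Vulkan or DXVK titles may not be affected by it and would need their own settings. The function name is made up.

```shell
#!/bin/sh
# Sketch only: run any command with Mesa's vsync override disabled.
run_uncapped() {
    env vblank_mode=0 "$@"
}
```

For example, run_uncapped glxgears should show a frame rate well above the monitor refresh rate. For Steam games the equivalent launch option would be vblank_mode=0 %command%.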
Comment 72 Mauro Gaspari 2019-08-03 13:35:55 UTC
After a few weeks without crashes on Ubuntu Budgie 18.04 LTS with valve mesa-aco, I moved to another distribution that does not have valve mesa-aco to cross check.

This is what I am using:
OS: openSUSE Tumbleweed x86_64 
Kernel: 5.2.2-1-default
Resolution: 3440x1440
DE: Xfce
WM: Xfwm4
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: AMD ATI Radeon VII
Memory: 1644MiB / 64387MiB 
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.3
No kernel parameters configured, just out of the box openSUSE

I had 3 incidents:

1. As I was playing Albion Online (native). No full system freeze; I was able to drop to a tty and noticed this error: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

2. As I closed down Albion Online (native) and returned to the desktop. Full system freeze.

3. As I was doing regular desktop operations on XFCE. No 3d gaming going on. Please see below logs:

DMESG after crash:

ilvipero@MGDT-ROG:~> dmesg | grep amdgpu
[    5.758450] [drm] amdgpu kernel modesetting enabled.
[    5.758569] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[    5.758570] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[    5.758571] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfcd00000 -> 0xfcd7ffff
[    5.758573] fb0: switching to amdgpudrmfb from EFI VGA
[    5.758646] amdgpu 0000:0a:00.0: vgaarb: deactivate vga console
[    5.758826] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[    5.758870] amdgpu 0000:0a:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    5.758871] amdgpu 0000:0a:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    5.758872] amdgpu 0000:0a:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    5.758936] [drm] amdgpu: 16368M of VRAM memory ready
[    5.758938] [drm] amdgpu: 16368M of GTT memory ready.
[    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    6.855053] fbcon: amdgpudrmfb (fb0) is primary device
[    6.913835] amdgpu 0000:0a:00.0: fb0: amdgpudrmfb frame buffer device
[    6.928054] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[    6.928055] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    6.928056] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    6.928056] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    6.928057] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    6.928058] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    6.928059] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    6.928059] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    6.928060] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    6.928060] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    6.928061] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    6.928062] amdgpu 0000:0a:00.0: ring page0 uses VM inv eng 1 on hub 1
[    6.928063] amdgpu 0000:0a:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    6.928063] amdgpu 0000:0a:00.0: ring page1 uses VM inv eng 5 on hub 1
[    6.928064] amdgpu 0000:0a:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    6.928064] amdgpu 0000:0a:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    6.928065] amdgpu 0000:0a:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    6.928066] amdgpu 0000:0a:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    6.928066] amdgpu 0000:0a:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    6.928067] amdgpu 0000:0a:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    6.928067] amdgpu 0000:0a:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    6.928068] amdgpu 0000:0a:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    6.928068] amdgpu 0000:0a:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    7.609167] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:0a:00.0 on minor 0

system logs:

2019-08-03T18:51:21.779695+08:00 MGDT-ROG kernel: [11817.727681] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
2019-08-03T18:51:21.779730+08:00 MGDT-ROG kernel: [11817.771355] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
2019-08-03T18:51:21.779735+08:00 MGDT-ROG kernel: [11817.771358] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00003100/00006000
2019-08-03T18:51:21.779737+08:00 MGDT-ROG kernel: [11817.771361] pcieport 0000:00:03.1: AER:    [ 8] Rollover              
2019-08-03T18:51:21.779738+08:00 MGDT-ROG kernel: [11817.771371] pcieport 0000:00:03.1: AER:    [12] Timeout               
2019-08-03T18:51:26.721833+08:00 MGDT-ROG sudo: pam_unix(sudo:session): session closed for user root
2019-08-03T18:51:31.983837+08:00 MGDT-ROG kernel: [11827.971739] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2324984, emitted seq=2324986
2019-08-03T18:51:31.983851+08:00 MGDT-ROG kernel: [11827.971800] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process X pid 2132 thread X:cs0 pid 2139
2019-08-03T18:51:31.983853+08:00 MGDT-ROG kernel: [11827.971804] amdgpu 0000:0a:00.0: GPU reset begin!
2019-08-03T18:51:32.751834+08:00 MGDT-ROG kernel: [11828.741066] amdgpu: [powerplay] Failed to send message 0x47, response 0xffffffff
2019-08-03T18:51:32.751846+08:00 MGDT-ROG kernel: [11828.741077] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:32.751849+08:00 MGDT-ROG kernel: [11828.741078] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
2019-08-03T18:51:32.751850+08:00 MGDT-ROG kernel: [11828.741090] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:32.751852+08:00 MGDT-ROG kernel: [11828.741091] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
2019-08-03T18:51:32.751854+08:00 MGDT-ROG kernel: [11828.741102] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:32.751855+08:00 MGDT-ROG kernel: [11828.741102] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
2019-08-03T18:51:32.751856+08:00 MGDT-ROG kernel: [11828.741113] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
2019-08-03T18:51:32.751858+08:00 MGDT-ROG kernel: [11828.741114] amdgpu: [powerplay] Failed to set soft min gfxclk !
2019-08-03T18:51:32.751859+08:00 MGDT-ROG kernel: [11828.741114] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
2019-08-03T18:51:32.787843+08:00 MGDT-ROG kernel: [11828.775671] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:951
2019-08-03T18:51:32.787852+08:00 MGDT-ROG kernel: [11828.775672] ------------[ cut here ]------------
2019-08-03T18:51:32.787853+08:00 MGDT-ROG kernel: [11828.775778] WARNING: CPU: 1 PID: 10195 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:329 generic_reg_wait.cold+0x31/0x53 [amdgpu]
2019-08-03T18:51:32.787855+08:00 MGDT-ROG kernel: [11828.775779] Modules linked in: tun fuse af_packet ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device media joydev scsi_transport_iscsi msr nls_iso8859_1 nls_cp437 vfat fat edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic crc32_pclmul ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap snd_pcm crypto_simd rfkill cryptd video glue_helper wmi_bmof mxm_wmi igb snd_timer sp5100_tco snd ptp pcspkr i2c_piix4 pps_core dca k10temp ccp soundcore gpio_amdpt gpio_generic pcc_cpufreq button acpi_cpufreq btrfs libcrc32c xor hid_generic usbhid amdgpu raid6_pq amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm
2019-08-03T18:51:32.787858+08:00 MGDT-ROG kernel: [11828.775807]  crc32c_intel xhci_hcd usbcore sr_mod cdrom wmi pinctrl_amd l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
2019-08-03T18:51:32.787860+08:00 MGDT-ROG kernel: [11828.775817] CPU: 1 PID: 10195 Comm: kworker/1:0 Not tainted 5.2.3-1-default #1 openSUSE Tumbleweed (unreleased)
2019-08-03T18:51:32.787861+08:00 MGDT-ROG kernel: [11828.775818] Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5007 06/17/2019
2019-08-03T18:51:32.787862+08:00 MGDT-ROG kernel: [11828.775822] Workqueue: events drm_sched_job_timedout [gpu_sched]
2019-08-03T18:51:32.787863+08:00 MGDT-ROG kernel: [11828.775897] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
2019-08-03T18:51:32.787864+08:00 MGDT-ROG kernel: [11828.775899] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 68 7c 75 c0 e8 e9 71 84 f4 83 7b 20 01 0f 84 2b 1b fe ff 48 c7 c7 d8 7b 75 c0 e8 d3 71 84 f4 <0f> 0b e9 18 1b fe ff 48 c7 c7 d8 7b 75 c0 89 54 24 04 e8 bc 71 84
2019-08-03T18:51:32.787866+08:00 MGDT-ROG kernel: [11828.775901] RSP: 0018:ffffab7acdeb77e8 EFLAGS: 00010282
2019-08-03T18:51:32.787867+08:00 MGDT-ROG kernel: [11828.775902] RAX: 0000000000000024 RBX: ffff960e92c3c880 RCX: 0000000000000006
2019-08-03T18:51:32.787868+08:00 MGDT-ROG kernel: [11828.775903] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff960e9e659a10
2019-08-03T18:51:32.787869+08:00 MGDT-ROG kernel: [11828.775903] RBP: 000000000000000a R08: 00000000000004da R09: 0000000000000001
2019-08-03T18:51:32.787870+08:00 MGDT-ROG kernel: [11828.775904] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000004ee2
2019-08-03T18:51:32.787871+08:00 MGDT-ROG kernel: [11828.775905] R13: 0000000000000bb9 R14: 0000000000000000 R15: 0000000000000bb8
2019-08-03T18:51:32.787872+08:00 MGDT-ROG kernel: [11828.775906] FS:  0000000000000000(0000) GS:ffff960e9e640000(0000) knlGS:0000000000000000
2019-08-03T18:51:32.787874+08:00 MGDT-ROG kernel: [11828.775907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2019-08-03T18:51:32.787874+08:00 MGDT-ROG kernel: [11828.775907] CR2: 000055d4170da000 CR3: 0000000f03cd6000 CR4: 00000000003406e0
2019-08-03T18:51:32.787875+08:00 MGDT-ROG kernel: [11828.775908] Call Trace:
2019-08-03T18:51:32.787876+08:00 MGDT-ROG kernel: [11828.775982]  dce110_stream_encoder_dp_blank+0xda/0x120 [amdgpu]
2019-08-03T18:51:32.787877+08:00 MGDT-ROG kernel: [11828.776049]  core_link_disable_stream+0x32/0x260 [amdgpu]
2019-08-03T18:51:32.787878+08:00 MGDT-ROG kernel: [11828.776054]  ? printk+0x48/0x4a
2019-08-03T18:51:32.787879+08:00 MGDT-ROG kernel: [11828.776119]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
2019-08-03T18:51:32.787881+08:00 MGDT-ROG kernel: [11828.776192]  ? vega20_dpm_force_dpm_level.cold+0x5b/0x90 [amdgpu]
2019-08-03T18:51:32.787882+08:00 MGDT-ROG kernel: [11828.776256]  dce110_apply_ctx_to_hw+0x3a/0x470 [amdgpu]
2019-08-03T18:51:32.787883+08:00 MGDT-ROG kernel: [11828.776318]  ? hwmgr_handle_task+0x66/0xc0 [amdgpu]
2019-08-03T18:51:32.787884+08:00 MGDT-ROG kernel: [11828.776322]  ? mutex_lock+0xe/0x30
2019-08-03T18:51:32.787885+08:00 MGDT-ROG kernel: [11828.776385]  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
2019-08-03T18:51:32.787886+08:00 MGDT-ROG kernel: [11828.776450]  ? dm_pp_apply_display_requirements+0x1a1/0x1c0 [amdgpu]
2019-08-03T18:51:32.787887+08:00 MGDT-ROG kernel: [11828.776513]  dc_commit_state_no_check+0x200/0x530 [amdgpu]
2019-08-03T18:51:32.787888+08:00 MGDT-ROG kernel: [11828.776516]  ? get_page_from_freelist+0x289/0x380
2019-08-03T18:51:32.787889+08:00 MGDT-ROG kernel: [11828.776579]  dc_commit_state+0x8f/0xb0 [amdgpu]
2019-08-03T18:51:32.787889+08:00 MGDT-ROG kernel: [11828.776644]  amdgpu_dm_atomic_commit_tail+0x3a6/0xd30 [amdgpu]
2019-08-03T18:51:32.787890+08:00 MGDT-ROG kernel: [11828.776709]  ? bw_calcs+0x8ac/0x1440 [amdgpu]
2019-08-03T18:51:32.787892+08:00 MGDT-ROG kernel: [11828.776711]  ? __ww_mutex_lock.isra.0+0x2a/0x780
2019-08-03T18:51:32.787893+08:00 MGDT-ROG kernel: [11828.776714]  ? _raw_spin_unlock_irqrestore+0x24/0x40
2019-08-03T18:51:32.787893+08:00 MGDT-ROG kernel: [11828.776717]  ? __wake_up_common_lock+0x7c/0xa0
2019-08-03T18:51:32.787894+08:00 MGDT-ROG kernel: [11828.776719]  ? wait_for_completion_timeout+0xf3/0x110
2019-08-03T18:51:32.787895+08:00 MGDT-ROG kernel: [11828.776720]  ? wait_for_completion_interruptible+0x10b/0x150
2019-08-03T18:51:32.787896+08:00 MGDT-ROG kernel: [11828.776728]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:32.787897+08:00 MGDT-ROG kernel: [11828.776735]  commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:32.787898+08:00 MGDT-ROG kernel: [11828.776742]  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
2019-08-03T18:51:32.787899+08:00 MGDT-ROG kernel: [11828.776749]  drm_atomic_helper_disable_all+0x144/0x160 [drm_kms_helper]
2019-08-03T18:51:32.787900+08:00 MGDT-ROG kernel: [11828.776756]  drm_atomic_helper_suspend+0x4c/0xe0 [drm_kms_helper]
2019-08-03T18:51:32.787901+08:00 MGDT-ROG kernel: [11828.776820]  dm_suspend+0x20/0x60 [amdgpu]
2019-08-03T18:51:32.787902+08:00 MGDT-ROG kernel: [11828.776861]  amdgpu_device_ip_suspend_phase1+0x8b/0xc0 [amdgpu]
2019-08-03T18:51:32.787903+08:00 MGDT-ROG kernel: [11828.776903]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
2019-08-03T18:51:32.787904+08:00 MGDT-ROG kernel: [11828.776975]  amdgpu_device_pre_asic_reset+0x1f4/0x209 [amdgpu]
2019-08-03T18:51:32.787905+08:00 MGDT-ROG kernel: [11828.777047]  amdgpu_device_gpu_recover+0x67/0x765 [amdgpu]
2019-08-03T18:51:32.787906+08:00 MGDT-ROG kernel: [11828.777106]  amdgpu_job_timedout+0xf7/0x120 [amdgpu]
2019-08-03T18:51:32.787906+08:00 MGDT-ROG kernel: [11828.777110]  drm_sched_job_timedout+0x3a/0x70 [gpu_sched]
2019-08-03T18:51:32.787907+08:00 MGDT-ROG kernel: [11828.777113]  process_one_work+0x1df/0x3c0
2019-08-03T18:51:32.787908+08:00 MGDT-ROG kernel: [11828.777115]  worker_thread+0x4d/0x400
2019-08-03T18:51:32.787909+08:00 MGDT-ROG kernel: [11828.777117]  kthread+0xf9/0x130
2019-08-03T18:51:32.787910+08:00 MGDT-ROG kernel: [11828.777119]  ? process_one_work+0x3c0/0x3c0
2019-08-03T18:51:32.787911+08:00 MGDT-ROG kernel: [11828.777120]  ? kthread_park+0x80/0x80
2019-08-03T18:51:32.787912+08:00 MGDT-ROG kernel: [11828.777122]  ret_from_fork+0x27/0x50
2019-08-03T18:51:32.787913+08:00 MGDT-ROG kernel: [11828.777125] ---[ end trace 9aaf1f62ae398b4b ]---
2019-08-03T18:51:37.791882+08:00 MGDT-ROG kernel: [11833.780084] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
2019-08-03T18:51:37.791896+08:00 MGDT-ROG kernel: [11833.780129] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B0B0 (len 2971, WS 4, PS 0) @ 0xB963
2019-08-03T18:51:37.791898+08:00 MGDT-ROG kernel: [11833.780172] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing AFB0 (len 255, WS 4, PS 0) @ 0xB089
2019-08-03T18:51:37.791899+08:00 MGDT-ROG kernel: [11833.780240] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
2019-08-03T18:51:37.791901+08:00 MGDT-ROG kernel: [11833.780240] ------------[ cut here ]------------
2019-08-03T18:51:37.791902+08:00 MGDT-ROG kernel: [11833.780328] WARNING: CPU: 1 PID: 10195 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1096 dce110_link_encoder_disable_output+0x13d/0x150 [amdgpu]
2019-08-03T18:51:37.791903+08:00 MGDT-ROG kernel: [11833.780329] Modules linked in: tun fuse af_packet ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device media joydev scsi_transport_iscsi msr nls_iso8859_1 nls_cp437 vfat fat edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic crc32_pclmul ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap snd_pcm crypto_simd rfkill cryptd video glue_helper wmi_bmof mxm_wmi igb snd_timer sp5100_tco snd ptp pcspkr i2c_piix4 pps_core dca k10temp ccp soundcore gpio_amdpt gpio_generic pcc_cpufreq button acpi_cpufreq btrfs libcrc32c xor hid_generic usbhid amdgpu raid6_pq amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm
2019-08-03T18:51:37.791905+08:00 MGDT-ROG kernel: [11833.780356]  crc32c_intel xhci_hcd usbcore sr_mod cdrom wmi pinctrl_amd l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
2019-08-03T18:51:37.791907+08:00 MGDT-ROG kernel: [11833.780365] CPU: 1 PID: 10195 Comm: kworker/1:0 Tainted: G        W         5.2.3-1-default #1 openSUSE Tumbleweed (unreleased)
2019-08-03T18:51:37.791908+08:00 MGDT-ROG kernel: [11833.780366] Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5007 06/17/2019
2019-08-03T18:51:37.791910+08:00 MGDT-ROG kernel: [11833.780370] Workqueue: events drm_sched_job_timedout [gpu_sched]
2019-08-03T18:51:37.791911+08:00 MGDT-ROG kernel: [11833.780435] RIP: 0010:dce110_link_encoder_disable_output+0x13d/0x150 [amdgpu]
2019-08-03T18:51:37.791912+08:00 MGDT-ROG kernel: [11833.780437] Code: ff ff 48 83 c4 38 5b 5d 41 5c c3 48 c7 c6 c0 c8 6f c0 48 c7 c7 d8 d9 74 c0 e8 cf bb de ff 48 c7 c7 70 d9 74 c0 e8 61 13 8c f4 <0f> 0b eb d4 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
2019-08-03T18:51:37.791913+08:00 MGDT-ROG kernel: [11833.780438] RSP: 0018:ffffab7acdeb77f8 EFLAGS: 00010282
2019-08-03T18:51:37.791914+08:00 MGDT-ROG kernel: [11833.780439] RAX: 0000000000000024 RBX: ffff960e96034a80 RCX: 0000000000000006
2019-08-03T18:51:37.791915+08:00 MGDT-ROG kernel: [11833.780440] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff960e9e659a10
2019-08-03T18:51:37.791917+08:00 MGDT-ROG kernel: [11833.780441] RBP: 0000000000000020 R08: 0000000000000518 R09: 0000000000000001
2019-08-03T18:51:37.791918+08:00 MGDT-ROG kernel: [11833.780441] R10: 0000000000000000 R11: 0000000000000001 R12: ffffab7acdeb77fc
2019-08-03T18:51:37.791919+08:00 MGDT-ROG kernel: [11833.780442] R13: ffff95ffc13c1000 R14: 0000000000000000 R15: ffff9601c92c8188
2019-08-03T18:51:37.791920+08:00 MGDT-ROG kernel: [11833.780443] FS:  0000000000000000(0000) GS:ffff960e9e640000(0000) knlGS:0000000000000000
2019-08-03T18:51:37.791921+08:00 MGDT-ROG kernel: [11833.780444] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2019-08-03T18:51:37.791922+08:00 MGDT-ROG kernel: [11833.780445] CR2: 000055d4170da000 CR3: 0000000f03cd6000 CR4: 00000000003406e0
2019-08-03T18:51:37.791923+08:00 MGDT-ROG kernel: [11833.780446] Call Trace:
2019-08-03T18:51:37.791924+08:00 MGDT-ROG kernel: [11833.780512]  dp_disable_link_phy+0x73/0x110 [amdgpu]
2019-08-03T18:51:37.791925+08:00 MGDT-ROG kernel: [11833.780576]  core_link_disable_stream+0xb6/0x260 [amdgpu]
2019-08-03T18:51:37.791926+08:00 MGDT-ROG kernel: [11833.780580]  ? printk+0x48/0x4a
2019-08-03T18:51:37.791927+08:00 MGDT-ROG kernel: [11833.780642]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
2019-08-03T18:51:37.791928+08:00 MGDT-ROG kernel: [11833.780716]  ? vega20_dpm_force_dpm_level.cold+0x5b/0x90 [amdgpu]
2019-08-03T18:51:37.791929+08:00 MGDT-ROG kernel: [11833.780779]  dce110_apply_ctx_to_hw+0x3a/0x470 [amdgpu]
2019-08-03T18:51:37.791930+08:00 MGDT-ROG kernel: [11833.780840]  ? hwmgr_handle_task+0x66/0xc0 [amdgpu]
2019-08-03T18:51:37.791931+08:00 MGDT-ROG kernel: [11833.780843]  ? mutex_lock+0xe/0x30
2019-08-03T18:51:37.791933+08:00 MGDT-ROG kernel: [11833.780905]  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
2019-08-03T18:51:37.791934+08:00 MGDT-ROG kernel: [11833.780969]  ? dm_pp_apply_display_requirements+0x1a1/0x1c0 [amdgpu]
2019-08-03T18:51:37.791935+08:00 MGDT-ROG kernel: [11833.781032]  dc_commit_state_no_check+0x200/0x530 [amdgpu]
2019-08-03T18:51:37.791936+08:00 MGDT-ROG kernel: [11833.781036]  ? get_page_from_freelist+0x289/0x380
2019-08-03T18:51:37.791937+08:00 MGDT-ROG kernel: [11833.781098]  dc_commit_state+0x8f/0xb0 [amdgpu]
2019-08-03T18:51:37.791938+08:00 MGDT-ROG kernel: [11833.781162]  amdgpu_dm_atomic_commit_tail+0x3a6/0xd30 [amdgpu]
2019-08-03T18:51:37.791939+08:00 MGDT-ROG kernel: [11833.781227]  ? bw_calcs+0x8ac/0x1440 [amdgpu]
2019-08-03T18:51:37.791940+08:00 MGDT-ROG kernel: [11833.781229]  ? __ww_mutex_lock.isra.0+0x2a/0x780
2019-08-03T18:51:37.791941+08:00 MGDT-ROG kernel: [11833.781231]  ? _raw_spin_unlock_irqrestore+0x24/0x40
2019-08-03T18:51:37.791942+08:00 MGDT-ROG kernel: [11833.781234]  ? __wake_up_common_lock+0x7c/0xa0
2019-08-03T18:51:37.791943+08:00 MGDT-ROG kernel: [11833.781236]  ? wait_for_completion_timeout+0xf3/0x110
2019-08-03T18:51:37.791944+08:00 MGDT-ROG kernel: [11833.781237]  ? wait_for_completion_interruptible+0x10b/0x150
2019-08-03T18:51:37.791945+08:00 MGDT-ROG kernel: [11833.781245]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:37.791946+08:00 MGDT-ROG kernel: [11833.781251]  commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:37.791947+08:00 MGDT-ROG kernel: [11833.781258]  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
2019-08-03T18:51:37.791948+08:00 MGDT-ROG kernel: [11833.781265]  drm_atomic_helper_disable_all+0x144/0x160 [drm_kms_helper]
2019-08-03T18:51:37.791949+08:00 MGDT-ROG kernel: [11833.781272]  drm_atomic_helper_suspend+0x4c/0xe0 [drm_kms_helper]
2019-08-03T18:51:37.791950+08:00 MGDT-ROG kernel: [11833.781335]  dm_suspend+0x20/0x60 [amdgpu]
2019-08-03T18:51:37.791951+08:00 MGDT-ROG kernel: [11833.781377]  amdgpu_device_ip_suspend_phase1+0x8b/0xc0 [amdgpu]
2019-08-03T18:51:37.791952+08:00 MGDT-ROG kernel: [11833.781418]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
2019-08-03T18:51:37.791953+08:00 MGDT-ROG kernel: [11833.781490]  amdgpu_device_pre_asic_reset+0x1f4/0x209 [amdgpu]
2019-08-03T18:51:37.791954+08:00 MGDT-ROG kernel: [11833.781561]  amdgpu_device_gpu_recover+0x67/0x765 [amdgpu]
2019-08-03T18:51:37.791955+08:00 MGDT-ROG kernel: [11833.781620]  amdgpu_job_timedout+0xf7/0x120 [amdgpu]
2019-08-03T18:51:37.791956+08:00 MGDT-ROG kernel: [11833.781624]  drm_sched_job_timedout+0x3a/0x70 [gpu_sched]
2019-08-03T18:51:37.791957+08:00 MGDT-ROG kernel: [11833.781627]  process_one_work+0x1df/0x3c0
2019-08-03T18:51:37.791958+08:00 MGDT-ROG kernel: [11833.781629]  worker_thread+0x4d/0x400
2019-08-03T18:51:37.791959+08:00 MGDT-ROG kernel: [11833.781631]  kthread+0xf9/0x130
2019-08-03T18:51:37.791960+08:00 MGDT-ROG kernel: [11833.781633]  ? process_one_work+0x3c0/0x3c0
2019-08-03T18:51:37.791961+08:00 MGDT-ROG kernel: [11833.781634]  ? kthread_park+0x80/0x80
2019-08-03T18:51:37.791962+08:00 MGDT-ROG kernel: [11833.781636]  ret_from_fork+0x27/0x50
2019-08-03T18:51:37.791963+08:00 MGDT-ROG kernel: [11833.781639] ---[ end trace 9aaf1f62ae398b4c ]---
2019-08-03T18:51:42.796019+08:00 MGDT-ROG kernel: [11838.784083] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
2019-08-03T18:51:42.796034+08:00 MGDT-ROG kernel: [11838.784127] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing A048 (len 62, WS 0, PS 0) @ 0xA064
2019-08-03T18:51:42.796035+08:00 MGDT-ROG kernel: [11838.784208] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796036+08:00 MGDT-ROG kernel: [11838.784219] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796038+08:00 MGDT-ROG kernel: [11838.784233] amdgpu: [powerplay] Failed to send message 0x47, response 0xffffffff
2019-08-03T18:51:42.796039+08:00 MGDT-ROG kernel: [11838.784245] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796040+08:00 MGDT-ROG kernel: [11838.784245] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
2019-08-03T18:51:42.796041+08:00 MGDT-ROG kernel: [11838.784258] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796042+08:00 MGDT-ROG kernel: [11838.784258] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
2019-08-03T18:51:42.796044+08:00 MGDT-ROG kernel: [11838.784269] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796045+08:00 MGDT-ROG kernel: [11838.784270] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
2019-08-03T18:51:42.796046+08:00 MGDT-ROG kernel: [11838.784281] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
2019-08-03T18:51:42.796047+08:00 MGDT-ROG kernel: [11838.784282] amdgpu: [powerplay] Failed to set soft min gfxclk !
2019-08-03T18:51:42.796048+08:00 MGDT-ROG kernel: [11838.784282] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
2019-08-03T18:51:43.656061+08:00 MGDT-ROG kernel: [11839.645436] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
2019-08-03T18:51:43.656078+08:00 MGDT-ROG kernel: [11839.645438] amdgpu: [powerplay] Failed to set soft min gfxclk !
2019-08-03T18:51:43.656080+08:00 MGDT-ROG kernel: [11839.645438] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
2019-08-03T18:51:43.656081+08:00 MGDT-ROG kernel: [11839.645449] amdgpu: [powerplay] Failed to send message 0x7, response 0xffffffff
2019-08-03T18:51:43.656082+08:00 MGDT-ROG kernel: [11839.645450] amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
2019-08-03T18:51:43.656083+08:00 MGDT-ROG kernel: [11839.645450] amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
2019-08-03T18:51:43.656084+08:00 MGDT-ROG kernel: [11839.645451] amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
2019-08-03T18:51:43.656086+08:00 MGDT-ROG kernel: [11839.645497] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
2019-08-03T18:51:43.911990+08:00 MGDT-ROG kernel: [11839.902893] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
2019-08-03T18:51:43.912001+08:00 MGDT-ROG kernel: [11839.902947] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
2019-08-03T18:51:44.167806+08:00 MGDT-ROG kernel: [11840.159797] [drm] Timeout wait for RLC serdes 0,0
2019-08-03T18:51:44.191826+08:00 MGDT-ROG kernel: [11840.180793] amdgpu 0000:0a:00.0: GPU mode1 reset
2019-08-03T18:51:44.451982+08:00 MGDT-ROG kernel: [11840.442308] [drm] psp is not working correctly before mode1 reset!
2019-08-03T18:51:44.451993+08:00 MGDT-ROG kernel: [11840.442310] amdgpu 0000:0a:00.0: GPU mode1 reset failed
2019-08-03T18:51:44.719056+08:00 MGDT-ROG kernel: [11840.710967] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* ASIC reset failed with error, -22 for drm dev, 0000:0a:00.0
2019-08-03T18:51:44.719066+08:00 MGDT-ROG kernel: [11840.711014] amdgpu 0000:0a:00.0: GPU reset(1) failed
2019-08-03T18:51:44.719068+08:00 MGDT-ROG kernel: [11840.711033] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719068+08:00 MGDT-ROG kernel: [11840.711038] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719070+08:00 MGDT-ROG kernel: [11840.711040] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719071+08:00 MGDT-ROG kernel: [11840.711043] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719072+08:00 MGDT-ROG kernel: [11840.711045] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719073+08:00 MGDT-ROG kernel: [11840.711049] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719075+08:00 MGDT-ROG kernel: [11840.711051] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719076+08:00 MGDT-ROG kernel: [11840.711053] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719077+08:00 MGDT-ROG kernel: [11840.711057] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719078+08:00 MGDT-ROG kernel: [11840.711059] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719079+08:00 MGDT-ROG kernel: [11840.711061] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719080+08:00 MGDT-ROG kernel: [11840.711064] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719081+08:00 MGDT-ROG kernel: [11840.711066] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719082+08:00 MGDT-ROG kernel: [11840.711068] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719083+08:00 MGDT-ROG kernel: [11840.711072] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719084+08:00 MGDT-ROG kernel: [11840.711075] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719085+08:00 MGDT-ROG kernel: [11840.711077] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719086+08:00 MGDT-ROG kernel: [11840.711080] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719087+08:00 MGDT-ROG kernel: [11840.711083] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719088+08:00 MGDT-ROG kernel: [11840.711085] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719089+08:00 MGDT-ROG kernel: [11840.711087] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719090+08:00 MGDT-ROG kernel: [11840.711090] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719091+08:00 MGDT-ROG kernel: [11840.711092] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719092+08:00 MGDT-ROG kernel: [11840.711094] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719093+08:00 MGDT-ROG kernel: [11840.711096] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719094+08:00 MGDT-ROG kernel: [11840.711097] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719095+08:00 MGDT-ROG kernel: [11840.711100] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719096+08:00 MGDT-ROG kernel: [11840.711102] amdgpu 0000:0a:00.0: GPU reset end with ret = -22
2019-08-03T18:51:44.719097+08:00 MGDT-ROG kernel: [11840.711102] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719098+08:00 MGDT-ROG kernel: [11840.711104] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719099+08:00 MGDT-ROG kernel: [11840.711106] [drm] Skip scheduling IBs!
2019-08-03T18:51:54.767980+08:00 MGDT-ROG kernel: [11850.756186] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2324986, emitted seq=2324986
2019-08-03T18:51:54.767994+08:00 MGDT-ROG kernel: [11850.756247] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process X pid 2132 thread X:cs0 pid 2139
2019-08-03T18:51:54.767996+08:00 MGDT-ROG kernel: [11850.756251] amdgpu 0000:0a:00.0: GPU reset begin!
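The log above records both the start of a reset and its outcome. As a minimal sketch for tallying reset attempts in a saved log, the snippet below matches the message formats visible in this report. The function name and regular expressions are illustrative assumptions, and other kernel versions may word these messages differently.

```python
import re

# Patterns copied from the amdgpu messages quoted in this report; other
# kernel versions may word them differently (assumption).
RESET_BEGIN = re.compile(r"GPU reset begin!")
RESET_RESULT = re.compile(r"GPU reset\((\d+)\) (failed|succeeded)")


def summarize_resets(lines):
    """Count reset attempts and collect (reset_number, outcome) pairs."""
    begins = 0
    results = []
    for line in lines:
        if RESET_BEGIN.search(line):
            begins += 1
        match = RESET_RESULT.search(line)
        if match:
            results.append((int(match.group(1)), match.group(2)))
    return begins, results


# Sample lines taken from logs posted in this report.
sample = [
    "[11840.711014] amdgpu 0000:0a:00.0: GPU reset(1) failed",
    "[11850.756251] amdgpu 0000:0a:00.0: GPU reset begin!",
    "[20617.742259] amdgpu 0000:0a:00.0: GPU reset(2) succeeded!",
]
print(summarize_resets(sample))
```

Feeding it the lines of a full dmesg or syslog file gives a quick count of how many resets were attempted and how each one ended.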
Comment 73 Sylvain BERTRAND 2019-08-03 16:54:17 UTC
On Sat, Aug 03, 2019 at 01:35:55PM +0000, bugzilla-daemon@freedesktop.org wrote:
> [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> amdgpu/vega20_ta.bin failed with error -2
> [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> "amdgpu/vega20_ta.bin"

Did you get the latest and "greatest" amdgpu firmware package?
Comment 74 Mauro Gaspari 2019-08-03 17:43:01 UTC
(In reply to Sylvain BERTRAND from comment #73)
> On Sat, Aug 03, 2019 at 01:35:55PM +0000, bugzilla-daemon@freedesktop.org
> wrote:
> > [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> > amdgpu/vega20_ta.bin failed with error -2
> > [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > "amdgpu/vega20_ta.bin"
> 
> Did you get the latest and "greatest" amdgpu firmware package?

This is a fresh install I made to test this issue, so for now I have only installed the packages listed on the openSUSE wiki: https://en.opensuse.org/SDB:AMDGPU

I have taken a snapper btrfs snapshot, so if there is anything you want me to test, I am ready.
Comment 75 Sylvain BERTRAND 2019-08-03 18:46:19 UTC
On Sat, Aug 03, 2019 at 05:43:01PM +0000, bugzilla-daemon@freedesktop.org wrote:
> > > [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> > > amdgpu/vega20_ta.bin failed with error -2
> > > [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > > "amdgpu/vega20_ta.bin"

It seems you have a corrupted/old/missing vega20_ta.bin firmware file.
It looks like outdated distro files.
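For reference, the "error -2" in the firmware message is a negated Linux errno value. 2 is ENOENT, which points to a file that was not found rather than one that was found but corrupted. The sketch below decodes the negative codes that appear in the logs of this report; the table is hardcoded with the Linux values, and the function name is only illustrative.

```python
# Linux errno values for the negative codes that appear in the logs of this
# report. Hardcoded so the sketch does not depend on the host platform.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    5: ("EIO", "Input/output error"),
    22: ("EINVAL", "Invalid argument"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_kernel_error(code):
    """Translate a negative kernel return code such as -2 into its errno name."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


for code in (-2, -5, -22, -110):
    print(decode_kernel_error(code))
```

The same lookup covers the -5 from the powerplay suspend failure, the -110 from the kiq ring test and the -22 from the failed ASIC reset.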
Comment 76 Mauro Gaspari 2019-08-04 05:05:52 UTC
(In reply to Sylvain BERTRAND from comment #75)
> On Sat, Aug 03, 2019 at 05:43:01PM +0000, bugzilla-daemon@freedesktop.org
> wrote:
> > > > [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> > > > amdgpu/vega20_ta.bin failed with error -2
> > > > [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > > > "amdgpu/vega20_ta.bin"
> 
> It seems you have a corrupted/old/missing vega20_ta.bin firmware file.
> It looks like outdated distro files.

Hello,
I did a quick search online and this seems to be a common problem for many amdgpu users. In other reports these messages are dismissed as warnings, and the firmware file is treated as not mandatory. I am not an expert and I do not want to dismiss it here, just report what I see.

By the way, it is interesting to see that even my Ubuntu Budgie LTS install, with Valve's Mesa-ACO and a different kernel, shows the same warning.

[    5.435346] [drm] amdgpu kernel modesetting enabled.
[    5.435500] fb0: switching to amdgpudrmfb from EFI VGA
[    5.735058] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[    5.735102] amdgpu 0000:0a:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    5.735103] amdgpu 0000:0a:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    5.735104] amdgpu 0000:0a:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    5.735185] [drm] amdgpu: 16368M of VRAM memory ready
[    5.735186] [drm] amdgpu: 16368M of GTT memory ready.
[    5.739656] amdgpu 0000:0a:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    5.739659] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    6.354308] fbcon: amdgpudrmfb (fb0) is primary device
[    6.354490] amdgpu 0000:0a:00.0: fb0: amdgpudrmfb frame buffer device
[    6.384079] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[    6.384080] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    6.384081] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    6.384082] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    6.384083] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    6.384084] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    6.384084] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    6.384085] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    6.384086] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    6.384087] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    6.384088] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    6.384089] amdgpu 0000:0a:00.0: ring page0 uses VM inv eng 1 on hub 1
[    6.384089] amdgpu 0000:0a:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    6.384090] amdgpu 0000:0a:00.0: ring page1 uses VM inv eng 5 on hub 1
[    6.384090] amdgpu 0000:0a:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    6.384091] amdgpu 0000:0a:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    6.384092] amdgpu 0000:0a:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    6.384092] amdgpu 0000:0a:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    6.384093] amdgpu 0000:0a:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    6.384094] amdgpu 0000:0a:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    6.384094] amdgpu 0000:0a:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    6.384095] amdgpu 0000:0a:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    6.384096] amdgpu 0000:0a:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    7.067068] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:0a:00.0 on minor 0
Comment 77 Sylvain BERTRAND 2019-08-04 14:18:56 UTC
On Sun, Aug 04, 2019 at 05:05:52AM +0000, bugzilla-daemon@freedesktop.org wrote:
> By the way, Interesting to see that even my ubuntu budgie LTS with valve
> mesa-aco and different kernel, has the same warning.
> [    5.739656] amdgpu 0000:0a:00.0: Direct firmware load for
> amdgpu/vega20_ta.bin failed with error -2
> [    5.739659] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> "amdgpu/vega20_ta.bin"

I don't know of an AMD GPU part able to run without properly loaded firmware.

That would have to be confirmed by official AMD devs which are the sole ppl
with that knowledge.

In the very probable case that the firmware _must_ be loaded for proper gpu
operations, you have to tell the maintainers of the distros you use to update
their linux/amdgpu firmware package.
Comment 78 Mauro Gaspari 2019-08-04 16:17:41 UTC
(In reply to Sylvain BERTRAND from comment #77)
> On Sun, Aug 04, 2019 at 05:05:52AM +0000, bugzilla-daemon@freedesktop.org
> wrote:
> > By the way, Interesting to see that even my ubuntu budgie LTS with valve
> > mesa-aco and different kernel, has the same warning.
> > [    5.739656] amdgpu 0000:0a:00.0: Direct firmware load for
> > amdgpu/vega20_ta.bin failed with error -2
> > [    5.739659] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > "amdgpu/vega20_ta.bin"
> 
> I don't know of an AMD GPU part able to run without properly loaded firmware.
> 
> That would have to be confirmed by official AMD devs which are the sole ppl
> with that knowledge.
> 
> In the very probable case that the firmware _must_ be loaded for proper gpu
> operations, you have to tell the maintainers of the distros you use to update
> their linux/amdgpu firmware package.

I believe so, and yes, it makes total sense that a piece of hardware needs the correct firmware to work properly.
I will open bugs for openSUSE and Ubuntu, ask the question there, and point to this bug report. Let's see what comes out. I will report back when I hear from the distribution maintainers.

I am using a Radeon VII at the moment. Is there anyone with a Vega 64 or Vega 56 who can run the same tests and let me know if they see the same issue? I am happy to include those cards in the same bug reports if someone can confirm.
Comment 79 Alex Deucher 2019-08-05 05:54:44 UTC
The ta bin is optional. It's only used for server cards with xgmi and ras features. Consumer cards don't support those features and don't use it.
Comment 80 Mauro Gaspari 2019-08-05 06:16:32 UTC
(In reply to Alex Deucher from comment #79)
> the ta bin is optional.  It's only used for server cards with xgmi and ras
> features.  Consumer cards don't support those features and don't use it.

Alex,
Thank you for confirming this. Good to know.
Regarding the logs and dmesg I posted above in comment #72, do you see anything useful? Are there any other specific tests I can do to help pinpoint the issue?
Comment 81 Pierre-Eric Pelloux-Prayer 2019-08-07 09:53:53 UTC
Can anyone provide an apitrace/renderdoc capture that can reliably reproduce the crash/freeze?
Comment 82 Mauro Gaspari 2019-08-11 09:31:41 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #81)
> Can anyone provide a apitrace/renderdoc capture that can reliably reproduce
> the crash/freeze?

Hello, Sadly my freezes are hard to reproduce. Sometimes I can play for a day with no freeze, sometimes it freezes in 10 minutes, one hour, and so on.

I had another freeze today:

OS: openSUSE Tumbleweed x86_64 
Kernel: 5.2.5-1-default
Resolution: 3440x1440
DE: Xfce
WM: Xfwm4
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: AMD ATI Radeon VII
Memory: 3791MiB / 64387MiB 
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.3

Game: EVE Online: Wine+DXVK. (Crossover 18.5.0) vsync off frame limiter off
Problem description: After roughly 1 hour of gameplay, the desktop froze for a few seconds but managed to recover. The game did not recover and I killed the process.

DMESG:

[20612.721860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12880412, emitted seq=12880414
[20612.721921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process exefile.exe pid 1980 thread exefile.ex:cs0 pid 2057
[20612.721925] amdgpu 0000:0a:00.0: GPU reset begin!
[20613.526448] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20613.526502] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[20613.547524] amdgpu 0000:0a:00.0: GPU mode1 reset
[20614.055810] [drm] psp mode1 reset succeed 
[20614.128815] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[20614.128943] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[20614.129304] [drm] PSP is resuming...
[20614.192202] [drm] reserve 0x400000 from 0x8000c00000 for PSP TMR SIZE
[20614.649220] [drm] UVD and UVD ENC initialized successfully.
[20614.748872] [drm] VCE initialized successfully.
[20615.271942] [drm] Fence fallback timer expired on ring gfx
[20615.783826] [drm] Fence fallback timer expired on ring comp_1.0.0
[20616.616023] [drm] Fence fallback timer expired on ring uvd_1
[20617.127844] [drm] Fence fallback timer expired on ring uvd_enc_1.0
[20617.639836] [drm] Fence fallback timer expired on ring uvd_enc_1.1
[20617.739606] [drm] recover vram bo from shadow start
[20617.742231] [drm] recover vram bo from shadow done
[20617.742233] [drm] Skip scheduling IBs!
[20617.742234] [drm] Skip scheduling IBs!
[20617.742259] amdgpu 0000:0a:00.0: GPU reset(2) succeeded!
[20617.742289] [drm] Skip scheduling IBs!
[20617.742309] [drm] Skip scheduling IBs!
[20617.742314] [drm] Skip scheduling IBs!
[20617.742316] [drm] Skip scheduling IBs!
[20617.742318] [drm] Skip scheduling IBs!
[20617.742320] [drm] Skip scheduling IBs!
[20617.743840] [drm] Skip scheduling IBs!
[20617.744006] [drm] Skip scheduling IBs!
[20617.744180] [drm] Skip scheduling IBs!
[20617.744450] [drm] Skip scheduling IBs!

System Logs:

2019-08-11T17:13:10.377029+08:00 MGDT-ROG kernel: [20612.721860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12880412, emitted seq=12880414
2019-08-11T17:13:10.377046+08:00 MGDT-ROG kernel: [20612.721921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process exefile.exe pid 1980 thread exefile.ex:cs0 pid 2057
2019-08-11T17:13:10.377047+08:00 MGDT-ROG kernel: [20612.721925] amdgpu 0000:0a:00.0: GPU reset begin!
2019-08-11T17:13:11.182763+08:00 MGDT-ROG kernel: [20613.526448] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
2019-08-11T17:13:11.182776+08:00 MGDT-ROG kernel: [20613.526502] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
2019-08-11T17:13:11.202766+08:00 MGDT-ROG kernel: [20613.547524] amdgpu 0000:0a:00.0: GPU mode1 reset
2019-08-11T17:13:11.714757+08:00 MGDT-ROG kernel: [20614.055810] [drm] psp mode1 reset succeed 
2019-08-11T17:13:11.786740+08:00 MGDT-ROG kernel: [20614.128815] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
2019-08-11T17:13:11.786749+08:00 MGDT-ROG kernel: [20614.128943] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
2019-08-11T17:13:11.786751+08:00 MGDT-ROG kernel: [20614.129304] [drm] PSP is resuming...
2019-08-11T17:13:11.850739+08:00 MGDT-ROG kernel: [20614.192202] [drm] reserve 0x400000 from 0x8000c00000 for PSP TMR SIZE
2019-08-11T17:13:12.306756+08:00 MGDT-ROG kernel: [20614.649220] [drm] UVD and UVD ENC initialized successfully.
2019-08-11T17:13:12.406756+08:00 MGDT-ROG kernel: [20614.748872] [drm] VCE initialized successfully.
2019-08-11T17:13:12.926899+08:00 MGDT-ROG kernel: [20615.271942] [drm] Fence fallback timer expired on ring gfx
2019-08-11T17:13:13.438783+08:00 MGDT-ROG kernel: [20615.783826] [drm] Fence fallback timer expired on ring comp_1.0.0
2019-08-11T17:13:14.274773+08:00 MGDT-ROG kernel: [20616.616023] [drm] Fence fallback timer expired on ring uvd_1
2019-08-11T17:13:14.671435+08:00 MGDT-ROG tracker-store[4801]: OK
2019-08-11T17:13:14.672970+08:00 MGDT-ROG systemd[2481]: tracker-store.service: Succeeded.
2019-08-11T17:13:14.782896+08:00 MGDT-ROG kernel: [20617.127844] [drm] Fence fallback timer expired on ring uvd_enc_1.0
2019-08-11T17:13:15.294768+08:00 MGDT-ROG kernel: [20617.639836] [drm] Fence fallback timer expired on ring uvd_enc_1.1
2019-08-11T17:13:15.394759+08:00 MGDT-ROG kernel: [20617.739606] [drm] recover vram bo from shadow start
2019-08-11T17:13:15.397215+08:00 MGDT-ROG kernel: [20617.742231] [drm] recover vram bo from shadow done
2019-08-11T17:13:15.397227+08:00 MGDT-ROG kernel: [20617.742233] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397228+08:00 MGDT-ROG kernel: [20617.742234] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397231+08:00 MGDT-ROG kernel: [20617.742259] amdgpu 0000:0a:00.0: GPU reset(2) succeeded!
2019-08-11T17:13:15.397233+08:00 MGDT-ROG kernel: [20617.742289] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397235+08:00 MGDT-ROG kernel: [20617.742309] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397242+08:00 MGDT-ROG kernel: [20617.742314] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397262+08:00 MGDT-ROG kernel: [20617.742316] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397265+08:00 MGDT-ROG kernel: [20617.742318] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397268+08:00 MGDT-ROG kernel: [20617.742320] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402744+08:00 MGDT-ROG kernel: [20617.743840] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402753+08:00 MGDT-ROG kernel: [20617.744006] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402755+08:00 MGDT-ROG kernel: [20617.744180] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402757+08:00 MGDT-ROG kernel: [20617.744450] [drm] Skip scheduling IBs!
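Each "ring ... timeout" line in these logs carries a signaled and an emitted sequence number. As a rough sketch, the snippet below extracts them and computes the difference. Reading that difference as the number of submitted fences that had not completed at the time of the timeout is an assumption on my part, not something confirmed in this thread. The function name and regular expression are illustrative.

```python
import re

# Matches the amdgpu_job_timedout message format seen in this report.
TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return (ring, signaled, emitted, emitted - signaled) per timeout line."""
    found = []
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            found.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return found


# Two timeout lines quoted in this report.
sample = [
    "[20612.721860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=12880412, emitted seq=12880414",
    "[11850.756186] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, "
    "signaled seq=2324986, emitted seq=2324986",
]
for entry in parse_ring_timeouts(sample):
    print(entry)
```

In the two samples, the gfx timeout from this comment shows a difference of 2, while the earlier sdma0 timeout shows 0.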
Comment 83 J. Andrew Lanz-O'Brien 2019-08-12 02:50:02 UTC
Can confirm that this bug is still present as of August 11, 2019 on kernel 5.2.8 with Mesa 19.1.4. Borderlands 2 hard locked my system about 5 times tonight. Manually setting the power profile didn't help either, i.e. these two commands:

echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
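The two writes above can also be scripted. The following is only a sketch: the helper names are hypothetical, `card0` is taken from the commands above and may differ on other systems, and writing to the real sysfs files needs root. The demo at the end runs against a scratch directory so nothing on the real system is touched.

```python
import tempfile
from pathlib import Path


def force_sclk_level(device_dir, level):
    """Write the same two values as the echo commands above.

    device_dir is the directory that holds the files, for example
    /sys/class/drm/card0/device on the system from the comment.
    """
    device = Path(device_dir)
    (device / "power_dpm_force_performance_level").write_text("manual\n")
    (device / "pp_dpm_sclk").write_text(f"{level}\n")


def read_back(device_dir):
    """Read both files back; on real hardware pp_dpm_sclk lists the levels."""
    device = Path(device_dir)
    return {
        name: (device / name).read_text().strip()
        for name in ("power_dpm_force_performance_level", "pp_dpm_sclk")
    }


# Demo against a scratch directory instead of the real sysfs tree.
scratch = tempfile.mkdtemp()
force_sclk_level(scratch, 7)
print(read_back(scratch))
```

On a real system the call would be `force_sclk_level("/sys/class/drm/card0/device", 7)`, run as root.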
Comment 84 Pierre-Eric Pelloux-Prayer 2019-08-12 08:16:49 UTC
(In reply to Mauro Gaspari from comment #82)
> (In reply to Pierre-Eric Pelloux-Prayer from comment #81)
> > Can anyone provide a apitrace/renderdoc capture that can reliably reproduce
> > the crash/freeze?
> 
> Hello, Sadly my freezes are hard to reproduce. Sometimes I can play for a
> day with no freeze, sometimes it freezes in 10 minutes, one hour, and so on.
> 

Ok.

This patch https://patchwork.freedesktop.org/series/64792/ might help: it won't fix any issue, but when a timeout is detected it should allow the soft recovery of the GPU.

Other things worth trying: setting flags in the AMD_DEBUG environment variable. I'd suggest:

   AMD_DEBUG=zerovram,nodma,nodpbb

There are others (see mesa/src/gallium/drivers/radeonsi/si_pipe.c) to try if these don't help.
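A minimal sketch of launching a game with these flags from a script: it builds an environment with AMD_DEBUG set and, as an optional extra that is my own suggestion, lists runs that first try all flags and then each flag alone, to narrow down which one matters. The helper names are hypothetical.

```python
import os


def amd_debug_env(flags, base_env=None):
    """Return a copy of the environment with AMD_DEBUG set to the given flags."""
    env = dict(os.environ if base_env is None else base_env)
    env["AMD_DEBUG"] = ",".join(flags)
    return env


def single_flag_runs(flags):
    """Yield one flag list per test run: all flags first, then each flag alone."""
    yield list(flags)
    for flag in flags:
        yield [flag]


suggested = ["zerovram", "nodma", "nodpbb"]
for run in single_flag_runs(suggested):
    print(amd_debug_env(run, base_env={})["AMD_DEBUG"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with whatever command starts the game; that command depends on the setup and is not shown here.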
Comment 85 Mauro Gaspari 2019-08-12 14:10:11 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #84)
> (In reply to Mauro Gaspari from comment #82)
> > (In reply to Pierre-Eric Pelloux-Prayer from comment #81)
> > > Can anyone provide a apitrace/renderdoc capture that can reliably reproduce
> > > the crash/freeze?
> > 
> > Hello, Sadly my freezes are hard to reproduce. Sometimes I can play for a
> > day with no freeze, sometimes it freezes in 10 minutes, one hour, and so on.
> > 
> 
> Ok.
> 
> This patch https://patchwork.freedesktop.org/series/64792/ might help: it
> won't fix any issue, but when a timeout is detected it should allow the soft
> recovery of the GPU.
> 
> Other things worth trying: setting AMD_DEBUG environment variables. I'd
> suggest:
> 
>    AMD_DEBUG=zerovram,nodma,nodpbb
> 
> There are others (see mesa/src/gallium/drivers/radeonsi/si_pipe.c) to try if
> these don't help.

Thank you.

I will first try to reintroduce the kernel parameters I previously used. Do you think those can help at all?

CPU
rcu_nocbs=0-15 (adjust to the number of cores of your cpu)
idle=nomwait
processor.max_cstate=5
pcie_aspm=off 

GPU
amdgpu.dc=1
amdgpu.vm_update_mode=0
amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff
amdgpu.vm_fault_stop=2
amdgpu.vm_debug=1
amdgpu.gpu_recovery=0
Comment 86 Pierre-Eric Pelloux-Prayer 2019-08-13 15:59:27 UTC
(In reply to Mauro Gaspari from comment #85)
> I will first try to reintroduce the kernel parameters I previously used.
> Do you think those can help at all?
> [...]
> GPU
> amdgpu.dc=1

Not needed: dc will be automatically enabled on recent GPUs

> amdgpu.vm_update_mode=0

Shouldn't be needed since it should be the default value. 

> amdgpu.dpm=-1

Not needed: this is the default value

> amdgpu.ppfeaturemask=0xffffffff

The only difference from the default value is that you're enabling Overdrive.
I'd suggest keeping the default value here.

> amdgpu.vm_fault_stop=2

I think this one isn't helpful (it's a debugging tool)

> amdgpu.vm_debug=1

This one can help.

> amdgpu.gpu_recovery=0

No opinion on this one :)
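To check which amdgpu parameters were actually passed to the running kernel, the boot command line in /proc/cmdline can be parsed. This is only a sketch: the function name is hypothetical and the sample command line below is made up for illustration.

```python
def amdgpu_cmdline_params(cmdline):
    """Extract amdgpu.<name>=<value> pairs from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


# Made-up command line for illustration; on a real system read /proc/cmdline.
sample = (
    "BOOT_IMAGE=/vmlinuz root=/dev/sda2 "
    "amdgpu.vm_debug=1 amdgpu.gpu_recovery=0 pcie_aspm=off"
)
print(amdgpu_cmdline_params(sample))
```

On a real system, `amdgpu_cmdline_params(open("/proc/cmdline").read())` returns only the amdgpu options that were given at boot, which makes it easy to confirm that a parameter was dropped or kept.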
Comment 87 Mauro Gaspari 2019-08-13 16:19:27 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #86)
> (In reply to Mauro Gaspari from comment #85)
> > I will first try to reintroduce the kernel parameters I previously used.
> > Do you think those can help at all?
> > [...]
> > GPU
> > amdgpu.dc=1
> 
> Not needed: dc will be automatically enabled on recent GPU
> 
> > amdgpu.vm_update_mode=0
> 
> Shouldn't be needed since it should be the default value. 
> 
> > amdgpu.dpm=-1
> 
> Not needed: this is the default value
> 
> > amdgpu.ppfeaturemask=0xffffffff
> 
> The only difference with the default value is that you're enabling Overdrive.
> I'd suggest to keep the default parameter here.
> 
> > amdgpu.vm_fault_stop=2
> 
> I think this one isn't helpful (it's a debugging tool)
> 
> > amdgpu.vm_debug=1
> 
> This one can help.
> 
> > amdgpu.gpu_recovery=0
> 
> No opinion on this one :)

Thank you!

I am currently testing on Ubuntu Budgie with the Valve-released Mesa-ACO, and so far I have had no freezes or crashes: a couple of days without incidents. But as I posted previously, it is all a bit random, so I think I will need to run this for at least a week.

I will report back soon with my findings.

