109955 – amdgpu [RX Vega 64] system freeze while gaming (VSYNC enabled)

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2019-03-11 07:05 UTC by Mauro Gaspari
Modified:	2019-11-20 07:52 UTC
CC List:	15 users

Attachments
syslog lines relevant to the crash (3.78 MB, text/plain) 2019-03-22 20:01 UTC, Mauro Gaspari
full dmesg after crash (87.19 KB, text/plain) 2019-03-22 20:02 UTC, Mauro Gaspari
dmesg from the freeze which didn't completely bork everything. It starts on line 1181 (987.98 KB, text/plain) 2019-06-13 21:04 UTC, Sam
Dmesg after crash (88.25 KB, text/plain) 2019-07-19 00:12 UTC, Hadet
dmesg for crash (189.98 KB, text/plain) 2019-09-07 03:50 UTC, Rodney A Morris
apitrace of Hearts of Iron IV hard lock (672.11 MB, application/octet-stream) 2019-09-15 01:16 UTC, Rodney A Morris
Full dmesg from Stellaris crash (141.00 KB, text/plain) 2019-09-15 04:35 UTC, Rodney A Morris
dmesg from Stellaris crash 2019-09-20 (177.08 KB, text/plain) 2019-09-23 03:06 UTC, Rodney A Morris
Full dmesg from crash (169.06 KB, text/plain) 2019-10-19 21:27 UTC, Rodney A Morris
Full journal from start to crash (684.44 KB, text/plain) 2019-10-19 21:28 UTC, Rodney A Morris
proposed fix for crashes, caused by frequent mclk level 0/1 switches (886 bytes, patch) 2019-11-06 10:23 UTC, haro41

Description Mauro Gaspari 2019-03-11 07:05:19 UTC

Symptoms:
During gaming sessions, the system locks up and freezes completely. Audio seems to keep working for a few more seconds, but the whole desktop is frozen and mouse and keyboard input is ignored. A hard reset is the only possible action on the local PC. I have not tried to SSH into the PC from another box.
Sometimes I can play for 20 minutes, sometimes for a few hours. Freezes seem unrelated to any activity running in-game. All system temperatures are under control.
Outside of 3D gaming the system is very stable, including playing videos, encoding videos, and regular desktop usage.

Further testing done:
1. Installed Windows 10 on the same hardware with the same BIOS settings. Running the same games causes no issues at all: no hangs, no problems.
2. Ran the same games on my NVIDIA+Intel based laptop, with the same distributions and kernels. No issues at all: no hangs, no problems.

Additional information:
This issue has been going on for a while now. It comes and goes with Mesa versions (or Mesa+kernel combinations). Sometimes an update arrives and I have no freezes for weeks. Then the next update gets installed and the issue comes back.
I have tested this mainly on openSUSE Tumbleweed, Ubuntu 18.04 and Ubuntu 18.10.

-- Ubuntu testing:
Ubuntu 18.04 was running well for months, then the latest Mesa updates, which arrived 2 weeks ago, re-introduced the issue and the system started freezing again. I tried updating to 18.10 but had the same issue. I enabled the oibaf PPA for video drivers and the issue disappeared. Then after a few days a new Mesa came in and the issue came back. I am now running the Padoka unstable PPA with Mesa 19 and LLVM 9. The issue still happens.

-- Tumbleweed testing:
I am including the previous bug report I filed with Tumbleweed, which covers a couple of occurrences with system logs. I will post more as I collect them.

OS: OpenSUSE tumbleweed x86_64 updated (2018 04 21)
Kernel: 4.16.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.0 Mesa 18.0.0
GPU: AMD Radeon RX Vega 64 8GB

System Logs:

Apr 21 17:08:34 STUDIO kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Apr 21 17:08:34 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
Apr 21 17:08:44 STUDIO kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=128859, last emitted seq=128861
Apr 21 17:08:44 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
-- Reboot --


Dmesg lines relevant to amdgpu:

[    3.407020] [drm] amdgpu kernel modesetting enabled.
[    3.411462] fb: switching to amdgpudrmfb from VESA VGA
[    3.426163] amdgpu 0000:04:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.426261] amdgpu 0000:04:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.426263] amdgpu 0000:04:00.0: GTT: 256M 0x000000F600000000 - 0x000000F60FFFFFFF
[    3.426371] [drm] amdgpu: 8176M of VRAM memory ready
[    3.426372] [drm] amdgpu: 8176M of GTT memory ready.
[    4.031665] fbcon: amdgpudrmfb (fb0) is primary device
[    4.083803] amdgpu 0000:04:00.0: fb0: amdgpudrmfb frame buffer device
[    4.096086] amdgpu 0000:04:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.096088] amdgpu 0000:04:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.096089] amdgpu 0000:04:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.096090] amdgpu 0000:04:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.096091] amdgpu 0000:04:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.096093] amdgpu 0000:04:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.096094] amdgpu 0000:04:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.096095] amdgpu 0000:04:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.096096] amdgpu 0000:04:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.096098] amdgpu 0000:04:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.096099] amdgpu 0000:04:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.096100] amdgpu 0000:04:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.096101] amdgpu 0000:04:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.096103] amdgpu 0000:04:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.096104] amdgpu 0000:04:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.096105] amdgpu 0000:04:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.096107] amdgpu 0000:04:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.096108] amdgpu 0000:04:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.096662] [drm] Initialized amdgpu 3.23.0 20150101 for 0000:04:00.0 on minor 0



The issue was later identified in https://bugs.freedesktop.org/show_bug.cgi?id=105317 and fixed with Mesa 18.0.1.



The issue was then noticed again after a few months:
OS: OpenSUSE tumbleweed x86_64 updated (2018 08 10)
Kernel: 4.17.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.1 Mesa 18.1.5
GPU: AMD Radeon RX Vega 64 8GB


Relevant log lines I found during freeze:

2018-08-09T23:16:53.103775+08:00 MGDT-Tumbleweed kernel: [ 6305.852703] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1745163, last emitted seq=1745165
2018-08-09T23:16:53.103795+08:00 MGDT-Tumbleweed kernel: [ 6305.852704] [drm] No hardware hang detected. Did some blocks stall?
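
The "ring gfx timeout" lines carry two sequence numbers, and the gap between them is a rough indication of how many submissions were still outstanding when the timeout fired. Below is a minimal sketch for pulling those numbers out of a log line. The function name parse_ring_timeout is made up for this illustration, and reading "emitted minus signaled" as the number of pending jobs is an assumption about the counters, not something stated in this report.

```python
import re

# Matches both spellings that appear in this report:
#   "ring gfx timeout, last signaled seq=128859, last emitted seq=128861"
#   "ring gfx timeout, signaled seq=25317, emitted seq=25319"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=\s*(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match.

    "pending" is simply emitted - signaled (assumed meaning: jobs submitted
    to the ring that had not completed when the timeout was reported).
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), signaled, emitted, emitted - signaled
```

Feeding it each line of a saved journal and keeping the non-None results gives a quick list of every timeout in a session.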


Dmesg lines relevant to amdgpu:

[    3.130759] [drm] amdgpu kernel modesetting enabled.
[    3.135770] fb: switching to amdgpudrmfb from EFI VGA
[    3.136106] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    3.136171] amdgpu 0000:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    3.136173] amdgpu 0000:03:00.0: GTT: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[    3.136494] [drm] amdgpu: 8176M of VRAM memory ready
[    3.136495] [drm] amdgpu: 8176M of GTT memory ready.
[    4.114469] fbcon: amdgpudrmfb (fb0) is primary device
[    4.141179] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[    4.164072] amdgpu 0000:03:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[    4.164074] amdgpu 0000:03:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[    4.164075] amdgpu 0000:03:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[    4.164076] amdgpu 0000:03:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[    4.164077] amdgpu 0000:03:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[    4.164078] amdgpu 0000:03:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[    4.164079] amdgpu 0000:03:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[    4.164080] amdgpu 0000:03:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[    4.164081] amdgpu 0000:03:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[    4.164082] amdgpu 0000:03:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[    4.164083] amdgpu 0000:03:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[    4.164084] amdgpu 0000:03:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[    4.164085] amdgpu 0000:03:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[    4.164086] amdgpu 0000:03:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[    4.164087] amdgpu 0000:03:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[    4.164553] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:03:00.0 on minor 0

Comment 1 Mauro Gaspari 2019-03-22 20:01:01 UTC

Created attachment 143759 [details]
syslog lines relevant to the crash

Comment 2 Mauro Gaspari 2019-03-22 20:02:04 UTC

Created attachment 143760 [details]
full dmesg after crash

Comment 3 Mauro Gaspari 2019-03-22 20:02:15 UTC

New reports as the issue is still happening:

I found a thread on Phoronix that describes, with pictures, exactly what is happening:
https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1049483-amd-devs-error-ring-gfx-timeout


OS: OpenSUSE tumbleweed x86_64 updated (2019 03 23)
Kernel: 5.0.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.0
GPU: AMD Radeon RX Vega 64 8GB

Attaching log files and dmesg after crash.

Comment 4 Mauro Gaspari 2019-04-11 06:37:46 UTC

The issue still happens despite kernel and Mesa updates on openSUSE Tumbleweed. The same happens on Kubuntu with the oibaf PPA, and on Arch.

It seems this bug affects many people on Linux using AMD GPUs, and I found some interesting workarounds. I looked at the kernel options and applied the ones below via GRUB; after 2 weeks of extensive testing I have not had a single system freeze or hang.

-> BEGIN KERNEL PARAMETERS <-
This is what I am using now. Please note that some of those settings are to
enable debugging and should not be left there forever. I will remove those once
I am confident with the stability of the system.

AMDGPU
amdgpu.dc=1 amdgpu.vm_update_mode=0 amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2 amdgpu.vm_debug=1
amdgpu.gpu_recovery=0


- Kernel parameters explained from:
https://www.kernel.org/doc/html/latest/gpu/amdgpu.html

--- dc (int)
Disable/Enable Display Core driver for debugging (1 = enable, 0 = disable).
The default is -1 (automatic for each asic).


--- dpm (int)
Override for dynamic power management setting (1 = enable, 0 = disable). The
default is -1 (auto).

--- vm_update_mode (int)
Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics
only, 2 = Compute only, 3 = Both). The default is -1 (Only in large BAR(LB)
systems Compute VM tables will be updated by CPU, otherwise 0, never).

--- ppfeaturemask (uint)
Override power features enabled. See enum PP_FEATURE_MASK in
drivers/gpu/drm/amd/include/amd_shared.h. The default is the current set of
stable power features.

--- vm_fault_stop (int)
Stop on VM fault for debugging (0 = never, 1 = print first, 2 = always). The
default is 0 (No stop).

--- vm_debug (int)
Debug VM handling (0 = disabled, 1 = enabled). The default is 0 (Disabled).

--- gpu_recovery (int)
Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default
is -1 (auto, disabled except SRIOV).

-> END KERNEL PARAMETERS <-
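
To confirm that the parameters above actually reached the kernel after editing the boot configuration, the amdgpu options can be picked out of the kernel command line. This is a minimal sketch, not part of the original report; the function name amdgpu_params is invented for the example, and it only parses a string that the caller supplies (for instance the contents of /proc/cmdline).

```python
def amdgpu_params(cmdline):
    """Extract amdgpu.<name>=<value> options from a kernel command line string.

    Returns a dict mapping the option name (without the "amdgpu." prefix)
    to its value, kept as a string exactly as it was written.
    """
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params


# Example use on a running system (reads the booted command line):
#     with open("/proc/cmdline") as f:
#         print(amdgpu_params(f.read()))
```

The values the loaded module is using can usually also be read one file per parameter under /sys/module/amdgpu/parameters/, which is a useful cross-check when a parameter was mistyped and silently ignored.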

Comment 5 Jaap Buurman 2019-04-12 21:37:54 UTC

I have the exact same problem with my Vega 64. Crashes when playing games. Happens with Vulkan games (RADV), OpenGL games (RadeonSI) and DirectX 9 games via Wine (Gallium9). It happens only for some games, presumably because it depends on the workload.

I also suspect power management issues. This might be a long shot, but it is worth a try. I know for a fact that power management works slightly differently when multiple monitors are connected, as the memory isn't clocked down as much in that case. For the people also experiencing this issue, are you running multiple monitors like I am?

Comment 6 Jaap Buurman 2019-04-12 22:10:26 UTC

Another question: What is the output of the following command for you guys?

cat /sys/class/drm/card0/device/vbios_version 

I am running the following version:

113-D0500100-103

According to the TechPowerUp GPU BIOS database, this is a Vega BIOS that was replaced two days (!) later by a new version. Perhaps issues were found that required another BIOS update? I might install Windows on a spare HDD and try to flash my Vega to see if that changes anything.
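
For anyone comparing VBIOS versions across several machines or cards, the same sysfs file as in the cat command above can be read for every card at once. This is a minimal sketch; the function name vbios_versions and the drm_root parameter are assumptions made for the example, and cards whose driver does not expose vbios_version are simply skipped.

```python
from pathlib import Path


def vbios_versions(drm_root="/sys/class/drm"):
    """Map each DRM card directory name to its VBIOS version string.

    Only cards that expose device/vbios_version are included.
    """
    versions = {}
    for path in sorted(Path(drm_root).glob("card*/device/vbios_version")):
        card = path.parts[-3]  # directory name such as "card0"
        versions[card] = path.read_text().strip()
    return versions
```

On the reporter's system this would be expected to return something like {"card0": "113-D0500100-103"}, matching the output quoted above.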

Comment 7 Mauro Gaspari 2019-04-13 09:34:26 UTC

@ Jaap Buurman 
I run a single monitor, ultra-wide 3440x1440 @ 100 Hz.

My BIOS version: 113-D0500100-103

Comment 8 Jaap Buurman 2019-04-13 09:41:44 UTC

I guess we can rule out a multi-monitor issue then. But I find it VERY interesting that you also run the exact same BIOS version, which was replaced two days later and so should be fairly rare. Perhaps it is buggy and was therefore replaced only 2 days after it was released? I am going to try to flash my GPU in Windows on a separate HDD and see if that fixes anything.

Comment 9 Mauro Gaspari 2019-04-13 09:49:00 UTC

Interesting catch about the card's BIOS.

I have a separate SSD with Windows 10 that I use to test this card's stability. I will check my Windows MSI update tool to see if it offers an updated BIOS. If it does, I will temporarily remove my workarounds and see how it goes.

Comment 10 Jaap Buurman 2019-04-13 09:52:32 UTC

You will have to flash using Atiflash:

https://www.techpowerup.com/download/ati-atiflash/

And download the latest BIOS for your card from TechPowerUp as well:

https://www.techpowerup.com/vgabios/

BIOS updates are usually not supported directly by the vendor, but I have never worked with the MSI update tool, so I am not 100% sure.

Make sure you are very careful when picking the BIOS. Some BIOSes are for the watercooled variant, for variants with aftermarket coolers, or for overclocked ones.

Comment 11 Mauro Gaspari 2019-04-13 11:34:47 UTC

You are right. The MSI tools do not offer any BIOS update for the GPU.

I downloaded the utility and filtered the BIOS list by vendor and device ID. I saw the 3 BIOS versions, including the one that, as you said, was released 2 days after the one we are using.

I do not have high hopes, because with the current BIOS all games on Windows run fine. But it cannot hurt to try the upgrade. Worst case, I will re-introduce my workarounds. I have had zero freezes with those enabled in the last 2 weeks.

And if I end up bricking my GPU out of warranty, I have an excuse to get a new Radeon VII :D

Comment 12 Jaap Buurman 2019-04-13 13:19:33 UTC

My Vega64 was also 100% stable on the exact same build under Windows 10. So I am also not getting my hopes up, but I am really frustrated. I am hoping it is some kind of incompatibility problem. I have honestly tried so many things, that I am willing to give the long-shots a chance as well. 

Since my switch to Linux ~1.5 years ago, stability with the Vega 64 has been very finicky. Some games run fine, while others cause this crash pretty reliably. Very, very frustrating.

Comment 13 Mauro Gaspari 2019-04-13 13:45:06 UTC

Status update: I updated the BIOS and have now disabled all the kernel parameters I previously used. It might take some time to make sure the system is stable.

Regarding your frustrations,
AMD released open source drivers and that is a major improvement for people on Linux. I bought the RX Vega 64 to support that. I expected a few bumps in the road, but it is taking longer than anticipated.

Having said that, here are all the kernel parameters I enabled; with those, as I said, I have not had a single freeze. They are not fixes, most likely just optimizations and workarounds. Still, they work pretty well for me.

CPU
rcu_nocbs=0-15 (adjust to the number of cores of your cpu)
idle=nomwait
processor.max_cstate=5
pcie_aspm=off 

GPU
amdgpu.dc=1
amdgpu.vm_update_mode=0
amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff
amdgpu.vm_fault_stop=2
amdgpu.vm_debug=1
amdgpu.gpu_recovery=0

Comment 14 Mauro Gaspari 2019-04-15 12:51:58 UTC

Quick update.


OS: OpenSUSE tumbleweed x86_64 updated (2019 04 15)
Kernel: 5.0.7-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.1
GPU: AMD Radeon RX Vega 64 8GB


The GPU firmware upgrade did not change much.
I disabled the kernel parameters in GRUB, upgraded the BIOS, and ran some games. The same old system freeze came back.

After that, I re-enabled the kernel parameters in GRUB and rebooted. No more system freezes.

Comment 15 Jaap Buurman 2019-04-25 19:44:19 UTC

That's bad to hear :( Worth a try though. How often do you experience freezes by the way? And is this for all games, or are some games completely stable? For me, I am getting crashes in Kerbal Space Program, but not in Final Fantasy XII or World of Warcraft, even after hundreds of hours in both of these stable games.

Also, have you ever figured out which kernel parameter in particular makes your setup stable? It might help identify where the problem exists. Or do you need that exact combination of all those parameters to get your system stable?
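
One way to approach the question of which parameter matters is a leave-one-out pass: boot with every workaround parameter except one, play until it either freezes or stays stable for a long session, and repeat for each parameter. The sketch below only generates the command-line variants to try. Nobody in this thread reports having done this, the function name leave_one_out is invented here, and the parameter list is copied from comment 13.

```python
def leave_one_out(params):
    """Yield (dropped, remaining) pairs, removing one parameter at a time."""
    for index, dropped in enumerate(params):
        yield dropped, params[:index] + params[index + 1:]


# GPU-related workaround parameters as listed in comment 13.
WORKAROUND = [
    "amdgpu.dc=1",
    "amdgpu.vm_update_mode=0",
    "amdgpu.dpm=-1",
    "amdgpu.ppfeaturemask=0xffffffff",
    "amdgpu.vm_fault_stop=2",
    "amdgpu.vm_debug=1",
    "amdgpu.gpu_recovery=0",
]

if __name__ == "__main__":
    for dropped, remaining in leave_one_out(WORKAROUND):
        print(f"without {dropped}: {' '.join(remaining)}")
```

If the freeze returns only when one specific parameter is dropped, that parameter is the likely candidate; if it never returns, the stability may depend on a combination or on something else entirely. Because the freezes are intermittent, each variant needs a long enough test period to mean anything.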

Comment 16 Jaap Buurman 2019-04-28 16:33:39 UTC

Just got a crash in World of Warcraft as well, running via vkd3d. It happens instantly after trying to log into the game world, so the issue is nicely reproducible for me. If you want me to get any traces, please let me know what you would like me to run to get them. dmesg logs for now:

[   78.450637] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450641] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d4b000 from 27
[   78.450642] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450648] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450650] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92553000 from 27
[   78.450652] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450656] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450658] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d4e000 from 27
[   78.450660] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450665] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450666] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92542000 from 27
[   78.450668] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450673] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450674] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d42000 from 27
[   78.450676] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450680] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450682] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92552000 from 27
[   78.450683] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450688] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450690] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d40000 from 27
[   78.450691] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450696] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450697] amdgpu 0000:09:00.0:   in page starting at address 0x0000850e92552000 from 27
[   78.450699] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450703] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450705] amdgpu 0000:09:00.0:   in page starting at address 0x0000984ec2d49000 from 27
[   78.450706] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.450711] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370)
[   78.450713] amdgpu 0000:09:00.0:   in page starting at address 0x0000850ea1eb2000 from 27
[   78.450714] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
[   78.454307] amdgpu 0000:09:00.0: IH ring buffer overflow (0x000BEDC0, 0x0003EEC0, 0x0003EDE0)
[   88.570062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=25317, emitted seq=25319
[   88.570099] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370
[   88.570102] amdgpu 0000:09:00.0: GPU reset begin!
[   88.831392] amdgpu 0000:09:00.0: GPU reset
[   89.356679] [drm] psp mode1 reset succeed 
[   89.475356] amdgpu 0000:09:00.0: GPU reset succeeded, trying to resume
[   89.475465] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[   89.475508] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[   89.475642] [drm] PSP is resuming...
[   89.623052] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[   89.806625] [drm] SADs count is: -2, don't need to read it
[   89.856619] [drm] SADs count is: -2, don't need to read it
[   89.938255] [drm] UVD and UVD ENC initialized successfully.
[   90.038674] [drm] VCE initialized successfully.
[   90.039672] [drm] recover vram bo from shadow start
[   90.047496] [drm] recover vram bo from shadow done
[   90.047497] [drm] Skip scheduling IBs!
[   90.047499] [drm] Skip scheduling IBs!
[   90.047511] [drm] Skip scheduling IBs!
[   90.047518] [drm] Skip scheduling IBs!
[   90.047523] [drm] Skip scheduling IBs!
[   90.047524] [drm] Skip scheduling IBs!
[   90.047530] [drm] Skip scheduling IBs!
[   90.047531] [drm] Skip scheduling IBs!
[   90.047533] [drm] Skip scheduling IBs!
[   90.047535] [drm] Skip scheduling IBs!
[   90.047536] [drm] Skip scheduling IBs!
[   90.047538] [drm] Skip scheduling IBs!
[   90.047539] [drm] Skip scheduling IBs!
[   90.047555] amdgpu 0000:09:00.0: GPU reset(2) succeeded!
[   90.047796] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.049377] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.050524] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.051990] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.055576] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.136508] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.180374] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.181405] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.246698] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.313258] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.380264] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.446291] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.513947] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   90.579552] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.218785] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.218976] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.219571] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.219745] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.221821] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.221969] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.222145] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.222360] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.229911] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.230213] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231183] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231328] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231487] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.231703] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.233480] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.247154] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.249213] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.249437] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.250924] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.251258] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.251320] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252417] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252532] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252739] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.252994] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.254745] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.265835] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.265974] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266056] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266222] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266342] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266436] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266516] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266646] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266796] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.266997] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.271605] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274639] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274699] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274747] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274794] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274869] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274929] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.274981] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.275033] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.275373] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.284443] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.286591] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.286881] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.302782] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.319311] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.335908] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.353111] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.369124] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.385670] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.402801] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.421232] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.737933] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.738054] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.742378] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.742737] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.742845] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.744592] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.744806] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.751833] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752108] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752371] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752475] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752604] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.752762] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.754128] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.765700] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.766154] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.766250] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.767140] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.767447] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789098] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789205] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789293] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789364] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789473] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789598] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789675] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.789745] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.790301] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.803790] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.811866] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.821133] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.837593] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.841186] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.854467] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.870915] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.871297] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.887676] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.901326] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.902101] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.903913] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.927724] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.938301] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.941050] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.952885] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.975232] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.975468] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[   99.986053] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.005910] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.018771] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.036370] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.052090] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.067194] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.067901] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.068016] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081081] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081359] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081525] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081618] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081721] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.081845] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082026] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082151] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082246] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082329] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082439] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082579] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.082757] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.086543] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.098769] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.102700] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.445931] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.446590] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.946103] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  100.946823] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.446237] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.446803] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.946107] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  101.946642] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.445541] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.446075] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.946163] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  102.946730] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.446040] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.446555] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.945513] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  103.945951] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.437414] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.437827] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.946771] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  104.947166] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.446585] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.447008] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.937954] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  105.938407] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.445966] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.446429] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.945528] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  106.945999] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.445983] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.446405] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.946131] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  107.946642] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.446428] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.446960] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.946992] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  108.947500] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.445052] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.445477] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.533707] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.946108] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  109.946604] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.445730] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.446232] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.943308] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  110.943823] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.036544] kauditd_printk_skb: 16509 callbacks suppressed
[  111.036545] audit: type=1006 audit(1556468881.509:99): pid=2590 uid=0 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=4 res=1
[  111.446470] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.446899] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.945982] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  111.946413] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Comment 17 Alex Deucher 2019-04-29 01:15:49 UTC

(In reply to Jaap Buurman from comment #16)
> Just got a crash in World of Warcraft as well, running via vkd3d. It happens
> instantly after trying to log into the game world, so the issue is nicely
> reproducible for me. If you want me to get any traces, please let me know
> what you would like me to run to get them. dmesg logs for now:
> 
> [   78.450637] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450641] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d4b000 from 27
> [   78.450642] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450648] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450650] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92553000 from 27
> [   78.450652] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450656] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450658] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d4e000 from 27
> [   78.450660] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450665] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450666] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92542000 from 27
> [   78.450668] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450673] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450674] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d42000 from 27
> [   78.450676] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450680] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450682] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92552000 from 27
> [   78.450683] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450688] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450690] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d40000 from 27
> [   78.450691] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450696] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450697] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850e92552000 from 27
> [   78.450699] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450703] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450705] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000984ec2d49000 from 27
> [   78.450706] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.450711] amdgpu 0000:09:00.0: [gfxhub] VMC page fault (src_id:0
> ring:158 vmid:1 pasid:32769, for process WoW.exe pid 2349 thread WoW.exe:cs0
> pid 2370)
> [   78.450713] amdgpu 0000:09:00.0:   in page starting at address
> 0x0000850ea1eb2000 from 27
> [   78.450714] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0010113D
> [   78.454307] amdgpu 0000:09:00.0: IH ring buffer overflow (0x000BEDC0,
> 0x0003EEC0, 0x0003EDE0)
> [   88.570062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> signaled seq=25317, emitted seq=25319
> [   88.570099] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process WoW.exe pid 2349 thread WoW.exe:cs0 pid 2370
> [   88.570102] amdgpu 0000:09:00.0: GPU reset begin!
> [   88.831392] amdgpu 0000:09:00.0: GPU reset
> [   89.356679] [drm] psp mode1 reset succeed 
> [   89.475356] amdgpu 0000:09:00.0: GPU reset succeeded, trying to resume
> [   89.475465] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
> [   89.475508] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
> [   89.475642] [drm] PSP is resuming...
> [   89.623052] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
> [   89.806625] [drm] SADs count is: -2, don't need to read it
> [   89.856619] [drm] SADs count is: -2, don't need to read it
> [   89.938255] [drm] UVD and UVD ENC initialized successfully.
> [   90.038674] [drm] VCE initialized successfully.
> [   90.039672] [drm] recover vram bo from shadow start
> [   90.047496] [drm] recover vram bo from shadow done
> [   90.047497] [drm] Skip scheduling IBs!
> [   90.047499] [drm] Skip scheduling IBs!
> [   90.047511] [drm] Skip scheduling IBs!
> [   90.047518] [drm] Skip scheduling IBs!
> [   90.047523] [drm] Skip scheduling IBs!
> [   90.047524] [drm] Skip scheduling IBs!
> [   90.047530] [drm] Skip scheduling IBs!
> [   90.047531] [drm] Skip scheduling IBs!
> [   90.047533] [drm] Skip scheduling IBs!
> [   90.047535] [drm] Skip scheduling IBs!
> [   90.047536] [drm] Skip scheduling IBs!
> [   90.047538] [drm] Skip scheduling IBs!
> [   90.047539] [drm] Skip scheduling IBs!
> [   90.047555] amdgpu 0000:09:00.0: GPU reset(2) succeeded!

The GPU reset succeeded.  You'll need to restart your desktop manager to recover because currently no desktop managers handle GPU reset errors and re-initialize their contexts.
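
To tell a recovered reset apart from a full hang in logs like the one quoted above, the relevant lines can be counted mechanically. The following is a minimal sketch, not part of the driver or any official tool: the function name `summarize_dmesg` and the pattern labels are hypothetical, and the regular expressions are simply copied from the messages quoted in this thread. On Linux, errno 125 is ECANCELED, which fits submissions being rejected after the reset.

```python
import errno
import re
from collections import Counter

# Patterns taken from the log lines quoted in this thread (labels are hypothetical).
PATTERNS = {
    "vm_page_fault": re.compile(r"VMC page fault"),
    "ring_timeout": re.compile(r"ring \w+ timeout"),
    "reset_succeeded": re.compile(r"GPU reset\(\d+\) succeeded"),
    "vram_lost": re.compile(r"VRAM is lost"),
    "cs_rejected": re.compile(r"Failed to initialize parser (-\d+)"),
}


def summarize_dmesg(text):
    """Count interesting amdgpu events and collect the parser error codes."""
    counts = Counter()
    codes = set()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "cs_rejected":
                    # The log prints a negative errno; store it as positive.
                    codes.add(-int(match.group(1)))
    return counts, codes


sample = """\
[   88.570062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=25317, emitted seq=25319
[   89.475508] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[   90.047555] amdgpu 0000:09:00.0: GPU reset(2) succeeded!
[   99.742378] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
"""

counts, codes = summarize_dmesg(sample)
print(dict(counts))
for code in sorted(codes):
    # Symbolic errno name as known to the local platform.
    print(code, errno.errorcode.get(code, "unknown"))
```

A log with `reset_succeeded` followed only by `cs_rejected` lines matches the situation described here: the GPU came back, but the existing contexts are no longer usable.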

Comment 18 Jaap Buurman 2019-04-29 10:41:42 UTC

I was aware of that. I was more curious if the bug that is causing the crash can be identified and hopefully fixed. I can provide traces if required, since it seems I can easily reproduce the crash.

Comment 19 Mauro Gaspari 2019-04-29 11:35:27 UTC

(In reply to Jaap Buurman from comment #15)
> That's bad to hear :( Worth a try though. How often do you experience
> freezes by the way? And is this for all games, or are some games completely
> stable? For me, I am getting crashes in Kerbal Space Program, but not in
> Final Fantasy XII or World of Warcraft, even after hundreds of hours in both
> of these stable games.
> 
> Also, have you ever figured out which kernel parameter in particular makes
> your setup stable? It might help identify where the problem exists. Or do
> you need that exact combination of all those parameters to get your system
> stable?

Hi, regarding the parameters I am using.
Unfortunately, the issue is not easy for me to reproduce. Without the parameters enabled, it still takes hours for a crash to happen. On top of that, Mesa and kernel updates are really frequent on Tumbleweed, which is another variable that makes troubleshooting harder unless I can find a really fast way to reproduce the issue.

Regarding which games crash: with those kernel parameters applied, the only crashes I noticed were when I ran games through Wine in DX11 mode with DXVK. I believe DXVK needs at least LLVM 8 to be stable on Vega GPUs. My Tumbleweed install currently has LLVM 7, so I stick to non-DXVK games, or even better native ones, until LLVM 8 is available for Tumbleweed.

If you want to give it a try and you run Ubuntu, you can check this article: https://github.com/lutris/lutris/wiki/Installing-drivers

If you do so, I recommend you run a full system backup using Clonezilla or similar software, since those PPAs are marked as unstable.

Comment 20 Jaap Buurman 2019-04-29 11:37:13 UTC

I already run LLVM 8.0.0, since it's the latest stable in Arch's repository. Thanks for the tip though :)

Comment 21 Mauro Gaspari 2019-04-29 13:52:33 UTC

(In reply to Jaap Buurman from comment #20)
> I already run LLVM 8.0.0, since it's the latest stable in Arch's repository.
> Thanks for the tip though :)

Since it is very easy for you to reproduce the freeze, it would be great if you could add those kernel parameters, and see if they help.

Comment 22 Mauro Gaspari 2019-05-24 05:12:18 UTC

I ran more tests:

1. Installed Arch Linux, Vulkan, and LLVM 8, and ran Wine games with DXVK. With the same kernel parameters in GRUB: no freezes, no crashes, great performance.

2. Installed Ubuntu Budgie 19.04 with the Oibaf PPA, and updated Mesa and LLVM 8. Same as with Arch Linux: with the same kernel parameters in GRUB, no freezes, no crashes, great performance.

Because I cannot reproduce the issue quickly, it is hard to tell whether a given Mesa update has resolved it. It sometimes takes hours for me to get the freeze.
If someone has a quick way to trigger the system freeze, I am happy to run more tests.

Comment 23 Sylvain BERTRAND 2019-05-24 12:25:11 UTC

It seems I get the same freezes as you. It takes hours of gaming to get some
random hard hang (no log). I thought I was overheating, but realized that my system is on
"vacation" while playing.
linux amd-staging-drm-new/x11 native/mesa/llvm(erk...), all git no older than a
week.
playing mostly dota2 vulkan on AMD TAHITI XT

Comment 24 Mauro Gaspari 2019-05-24 13:44:27 UTC

(In reply to Sylvain BERTRAND from comment #23)
> It seems I get the same freezes as you. It takes hours of gaming to get
> some
> random hard hang (no log). I thought I was overheating, but realized that my
> system is on
> "vacation" while playing.
> linux amd-staging-drm-new/x11 native/mesa/llvm(erk...), all git no older
> than a
> week.
> playing mostly dota2 vulkan on AMD TAHITI XT

Hi, a bit frustrating eh? :)
I have been asking around and it seems that the Radeon VII and RX 590 do not suffer from those issues. This is probably related to the manufacturers' default clock speeds.

Anyway, if you try the kernel parameters I mentioned above, those should help. I have not had crashes in weeks since I enabled them in GRUB. This is not distribution-specific either: those kernel settings worked for me on Tumbleweed, Arch, and Ubuntu Budgie.

I hope it helps.

Comment 25 Matt Coffin 2019-06-03 08:07:58 UTC

(In reply to Mauro Gaspari from comment #24)

> Hi, a bit frustrating eh? :)
> I have been asking around and it seems that RadeonVII and RX590 do not
> suffer those issues. Probably related to default clock speeds by
> manufacturers.

FWIW, I'm seeing this exact same issue, and I'm on an RX590.

Comment 26 Matt Coffin 2019-06-03 20:10:26 UTC

For reproducibility, here's what I've been using. I can reproduce this crash on both the RADV and AMDVLK Vulkan implementations, and on top of both sway 1.1 (Wayland) and xfce4 (X11).

* 5.1.3-arch2-1-ARCH
* LLVM 8.0.0
* mesa/vulkan-radeon: 19.0.4
* AMDVLK: (dev branch from nighttime Mountain time 20190602)
* DXVK: winelib version - release 1.2.1

I run "House Flipper" from Steam with DXVK_FILTER_DEVICE_NAME=590.

On 1080p@60Hz with v-sync, it runs quite well and stable (for hours). If I disable v-sync and framerate limiting, the crash occurs within a minute usually.

At 2560x1440 resolution, no refresh rate is stable; I have tried both 60Hz and 144Hz.

With the game rendering 1080p but scaling up to a 2560x1440 display, I saw it crash once, but was unable to duplicate it again.

I'm new to low-level development, and would like to help. If I can provide any information since I can reliably reproduce the issue, I'd love to. Let me know what would be useful and I'd be happy to get it out to you.

I've also seen the bugs listed in my other comment on the other bug here: https://bugs.freedesktop.org/show_bug.cgi?id=102322#c82

Comment 27 Sam 2019-06-04 21:43:38 UTC

Hello! I can confirm that I have the same issues. I am using a Vega 56 and openSUSE Tumbleweed (X11 and KDE) with:

Kernel Version:  5.1.5-1-default
X Server Release:  12004000
Driver:  X.Org Radeon RX Vega (VEGA10, DRM 3.30.0, 5.1.5-1-default, LLVM 7.0.1)


I have been having the same freezes exactly as described here since, as far as I can remember, Mesa 19.0.4 and kernel 5.0.13 (based on the Tumbleweed snapshots from when this started happening).

This was definitely not happening before on Mesa 18.x with LLVM 6 and 7 and kernel 4.20. I neither run overclocks nor have I ever changed firmware or BIOS settings. Everything has been running as-is since Oct. 2018, so firmware or BIOS issues can probably be ruled out.

In my case, I have also experienced this issue when running non-demanding OpenGL games and even desktop applications (I had a crash happen on the desktop with just WxMaxima, a computer algebra system GUI, open and doing nothing).

The easiest way for me to reproduce it is by simply leaving Pillars of Eternity (an OpenGL Unity game) open and idle for an hour or so. I have tried setting up Kdump and trying to catch some error messages in the logs with no luck. I'm definitely open to directions on how to get more info if this can help.

Comment 28 Mauro Gaspari 2019-06-05 06:34:02 UTC

Thanks all for adding comments and testing to this bug. I believe that if we show there are enough people affected on different cards, it will get the attention it needs, and hopefully a permanent Mesa fix can be found and implemented.

For those affected, if you don't mind testing the kernel parameters workaround I described above and posting your results, that would be a nice start.
If you need help on how to do that you can reach out to me via PM or email.

Comment 29 Sam 2019-06-09 18:46:37 UTC

I have been trying myself for the moment to get some info with just debug parameters:

amdgpu.dc=1 
amdgpu.vm_fault_stop=2 
amdgpu.vm_debug=1 
amdgpu.gpu_recovery=0 

Incidentally I couldn't get any freeze to happen after running two troublesome games for about two hours each (left idle but on load, Pillars of Eternity and Surviving Mars) but this could mean anything as they happen completely randomly. 

Perhaps someone who can reproduce the issue instantly can test the parameters more reliably?
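
When comparing results between testers, it helps to confirm which `amdgpu.*` options the running kernel was actually booted with. The following is a minimal sketch, assuming the options were passed on the kernel command line as listed above; the helper names `amdgpu_params` and `read_cmdline` and the example boot line are hypothetical, while `/proc/cmdline` is the standard Linux location for the boot command line.

```python
def amdgpu_params(cmdline):
    """Return {name: value} for every amdgpu.<name>=<value> token."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical boot line containing the debug parameters listed above.
example = (
    "BOOT_IMAGE=/vmlinuz root=/dev/sda2 "
    "amdgpu.dc=1 amdgpu.vm_fault_stop=2 amdgpu.vm_debug=1 amdgpu.gpu_recovery=0"
)
print(amdgpu_params(example))
# {'dc': '1', 'vm_fault_stop': '2', 'vm_debug': '1', 'gpu_recovery': '0'}
```

On a live system, `amdgpu_params(read_cmdline())` gives the same view for the booted kernel, which can then be pasted alongside a crash report.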

Comment 30 Sam 2019-06-10 17:13:57 UTC

Update: I can now confirm, at least in my case, that the freezes DO occur using the parameters above, and also with all of them (shown below), while doing another test round on Pillars of Eternity.

amdgpu.dc=1 
amdgpu.vm_update_mode=0 
amdgpu.dpm=-1 
amdgpu.ppfeaturemask=0xffffffff 
amdgpu.vm_fault_stop=2 
amdgpu.vm_debug=1 
amdgpu.gpu_recovery=0 

I was continuously writing dmesg to a file but yet again I didn't get any messages/warnings/errors.
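
Continuously writing dmesg to a file can be sketched as follows. This is a minimal sketch, assuming `dmesg --follow` from util-linux is available and the user is allowed to read the kernel log; the function names and the log file name are hypothetical. Forcing each line to disk with `fsync` gives it a better chance of surviving a hard freeze, although it cannot capture messages the kernel never emits.

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to path and force it to disk before reading the next."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()              # push Python's buffer to the OS
            os.fsync(out.fileno())   # ask the OS to write it to storage
            written += 1
    return written


def follow_dmesg(path):
    """Stream new kernel messages into path until interrupted."""
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    try:
        persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling `follow_dmesg("amdgpu-dmesg.log")` in a terminal before starting the game keeps the capture running until it is interrupted or the machine hangs.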

Comment 31 Sam 2019-06-13 21:04:11 UTC

I have attached another trace I managed to get today at 22:24 while playing Pillars Of Eternity (OpenGL) 

It didn't freeze the whole system as usual, just the Plasma and X sessions, so the other TTYs were accessible. This is the first time this has happened. I was using the latest default kernel from the openSUSE Kernel:stable repo (5.1.9-5.1), as requested on https://bugzilla.opensuse.org/show_bug.cgi?id=1136293

Note that, as in the other dmesgs attached, the crash seems to be caused by amdgpu. Should the bug category be moved there?

Comment 32 Sam 2019-06-13 21:04:35 UTC

Created attachment 144535 [details]
dmesg from the freeze which didn't completely bork everything. It starts on line 1181

Comment 33 Jiri Slaby 2019-06-14 05:48:33 UTC

(In reply to Sam from comment #32)
> Created attachment 144535 [details]
> dmesg from the freeze which didn't completely bork everything. It starts on
> line 1181

Attaching the relevant part inline:

> [drm:amdgpu_dm_commit_planes.isra.0 [amdgpu]] *ERROR* Waiting for fences timed out.
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=726226, emitted seq=726228
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process PillarsOfEterni pid 12250 thread PillarsOfE:cs0 pid 12254
> amdgpu 0000:1e:00.0: GPU reset begin!
> [drm:amdgpu_dm_commit_planes.isra.0 [amdgpu]] *ERROR* Waiting for fences timed out.
> amdgpu 0000:1e:00.0: GPU BACO reset
> amdgpu 0000:1e:00.0: GPU reset succeeded, trying to resume
> [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
> [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
> [drm] PSP is resuming...
> [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
> [drm] UVD and UVD ENC initialized successfully.
> [drm] VCE initialized successfully.
> [drm] recover vram bo from shadow start
> [drm] recover vram bo from shadow done
> [drm] Skip scheduling IBs!
> [drm] Skip scheduling IBs!
> amdgpu 0000:1e:00.0: GPU reset(2) succeeded!
> [drm] Skip scheduling IBs!
> ...
> [drm] Skip scheduling IBs!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [drm] Skip scheduling IBs!
> ...
> [drm] Skip scheduling IBs!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
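
The ring timeout line above carries two sequence numbers that can be compared across reports. The following is a minimal sketch with a hypothetical helper name, `pending_jobs`; it assumes that the difference between the emitted and signaled sequence numbers approximates how many submissions were still outstanding when the timeout fired, which is an interpretation of the message and not something confirmed in this thread.

```python
import re

# Matches the amdgpu_job_timedout message format quoted in this thread.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_jobs(line):
    """Return (ring name, emitted - signaled), or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    gap = int(match.group("emitted")) - int(match.group("signaled"))
    return match.group("ring"), gap


line = (
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=726226, emitted seq=726228"
)
print(pending_jobs(line))  # ('gfx', 2)
```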

Comment 34 Alex Deucher 2019-06-14 14:33:47 UTC

(In reply to Jiri Slaby from comment #33)
> > amdgpu 0000:1e:00.0: GPU reset(2) succeeded!
> > [drm] Skip scheduling IBs!
> > ...
> > [drm] Skip scheduling IBs!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> > [drm] Skip scheduling IBs!
> > ...
> > [drm] Skip scheduling IBs!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> > [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

The GPU reset was successful.  You need to restart your desktop environment to recover.

Comment 35 shadow.archemage 2019-07-06 09:30:35 UTC

(In reply to Mauro Gaspari from comment #22)

> Because I cannot reproduce the issue quickly, it is hard to tell whether a
> given Mesa update has resolved it. It sometimes takes hours for me to get
> the freeze.
> If someone has a quick way to trigger system freeze, I am happy to run more
> tests.

Hi Mauro,

The issue happened to me much more frequently when I opted into Steam beta and ran Monster Hunter: World. Before opting in, the crashes happened around 1-2 hours after the game started. With Steam beta, though, it happens less than 5 minutes in.

The only change that I noted when I opted into Steam beta was that the games suddenly downloaded some shader pre-caching stuff. Unfortunately, I'm not too familiar with it, and I'm not too sure if it is related to the problem.

I am running Manjaro, Gnome 3.32.2, Kernel version 5.1.15-1, Mesa 19.1.1.
Let me know if I missed something.

Thanks,
Eph

Comment 36 Mauro Gaspari 2019-07-07 05:31:34 UTC

(In reply to shadow.archemage from comment #35)
> (In reply to Mauro Gaspari from comment #22)
> 
> > Because I cannot reproduce the issue quickly, it is hard to tell whether a
> > given Mesa update has resolved it. It sometimes takes hours for me to get
> > the freeze.
> > If someone has a quick way to trigger system freeze, I am happy to run more
> > tests.
> 
> Hi Mauro,
> 
> The issue happened to me much more frequently when I opted into Steam beta
> and ran Monster Hunter: World. Before opting in, the crashes happen around
> 1-2 hours after the game starts. With Steam beta though, it happens around
> <5 minutes in.
> 
> The only change that I noted when I opted into Steam beta was that the games
> suddenly downloaded some shader pre-caching stuff. Unfortunately, I'm not
> too familiar with it, and I'm not too sure if it is related to the problem.
> 
> I am running Manjaro, Gnome 3.32.2, Kernel version 5.1.15-1, Mesa 19.1.1.
> Let me know if I missed something.
> 
> Thanks,
> Eph

I am not an expert, but I am quite sure shaders play a big part in this. If you can, disable shader caching.
There are a few tests you can do:
1. Did you try the kernel parameters I posted above? I always ran all the parameters together (GPU+CPU), and at the time I did not have crashes for weeks on my Vega64. I am using a Radeon VII now and it seems those parameters are not needed.
2. Valve sponsored an interesting project that removes Mesa's dependency on LLVM for AMD GPUs and uses ACO instead. Valve made this available for Arch-based systems via AUR, and for Ubuntu-based systems via PPA. If you want to test it, you can check the posts below. I am going to test this myself on both Arch and Ubuntu.
https://steamcommunity.com/games/221410/announcements/detail/1602634609636894200
https://steamcommunity.com/app/221410/discussions/0/1640915206474070669/
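
One way to try the "disable shader caching" suggestion for Mesa's own on-disk cache is to launch the game with the cache-disabling environment variable set. The following is a minimal sketch under stated assumptions: Mesa releases from around this period read `MESA_GLSL_CACHE_DISABLE`, while later releases use `MESA_SHADER_CACHE_DISABLE`, so both are set here; the helper name `env_without_shader_cache` is hypothetical. This does not control the Steam client's own shader pre-caching, which is a separate Steam setting.

```python
import os


def env_without_shader_cache(base=None):
    """Return a copy of the environment with Mesa's disk shader cache disabled."""
    env = dict(os.environ if base is None else base)
    # Older Mesa variable name (assumed to apply to 19.x).
    env["MESA_GLSL_CACHE_DISABLE"] = "true"
    # Newer Mesa variable name, set as well so the sketch is version-agnostic.
    env["MESA_SHADER_CACHE_DISABLE"] = "true"
    return env


env = env_without_shader_cache({"HOME": "/home/user"})
print(sorted(env))
# ['HOME', 'MESA_GLSL_CACHE_DISABLE', 'MESA_SHADER_CACHE_DISABLE']
```

The returned mapping can be passed as `env=` to `subprocess.run` when starting a native game from a script. For Steam titles, the equivalent is a launch option of the form `MESA_GLSL_CACHE_DISABLE=true %command%`.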

Comment 37 shadow.archemage 2019-07-07 10:55:49 UTC

(In reply to Mauro Gaspari from comment #36)
> (In reply to shadow.archemage from comment #35) 
> I am not an expert, but I am quite sure shaders have a big part in this. If
> you can, disable shader caching.
> There are a few tests you can do:
> 1. Did you try with the kernel parameters I posted above? I always ran all
> the parameters together. GPU+CPU and at the time, I did not have crashes for
> weeks on my Vega64. I am using a RadeonVII now and it seems those parameters
> are not needed.

I tried the kernel parameters above, and the game still crashed for me.

> 2. Valve sponsored an interesting project that removes dependency of AMD
> Mesa from LLVM. And instead uses ACO. Valve made this available for Arch
> based systems via AUR, and Ubuntu based system via PPA. If you want to test
> it, you can check the posts below. I am going to test this myself on both
> Arch and Ubuntu. 
> https://steamcommunity.com/games/221410/announcements/detail/
> 1602634609636894200
> https://steamcommunity.com/app/221410/discussions/0/1640915206474070669/

Will check this out, but will also keep an eye on this thread about the results of your tests. Thanks!

Comment 38 Sylvain BERTRAND 2019-07-07 17:42:14 UTC Comment hidden (spam)

On Sun, Jul 07, 2019 at 05:31:34AM +0000, bugzilla-daemon@freedesktop.org wrote:
> 2. Valve sponsored an interesting project that removes dependency of AMD Mesa
> from LLVM. And instead uses ACO. Valve made this available for Arch based
> systems via AUR, and Ubuntu based system via PPA. If you want to test it, you
> can check the posts below. I am going to test this myself on both Arch and
> Ubuntu. 
> https://steamcommunity.com/games/221410/announcements/detail/1602634609636894200
> https://steamcommunity.com/app/221410/discussions/0/1640915206474070669/

Huho!

Cons:
    - it's c++
    - only GFX8 and GFX9 (I have GFX6 :( )
    - some nasty python scripts (there are tons in mesa)

Pros:
    - it's several orders of magnitude less brain f*cked than llvm.
    - it is actual working code which does disjoint mesa from llvm.

conclusion:
    - for GFX8 and GFX9, it's less worse than llvm.
    - I was asking for a clean GCN ABI definition document from shaders
      perspective, maybe this code will help to write one (or it is an AMD
      confidential document??).

Comment 39 Samuel Sieb 2019-07-08 05:29:56 UTC

(In reply to shadow.archemage from comment #37)
> I tried the kernel parameters above, and the game still crashed for me.

Are you saying that the game is crashing or the graphics device is?

Comment 40 Wilko Bartels 2019-07-09 14:29:41 UTC

I have been experiencing the same issue since June (I didn't game much before that), so I want to share my system info.
I am on a Ryzen 2600X, Vega 56 Pulse, and Strix B450, using Arch with kernel 5.1.
I tested every window manager I know, and also tested 60Hz and 144Hz. The crashes are totally random. I only play Dota 2. Last Friday I played about 6 games in a row without a single issue. The day after, I crashed about 7 times per game. I always have to press reset on my PC.
Is it known whether this issue is related to a kernel or Mesa update? I mean, it wasn't always like this, was it?

Comment 41 Sylvain BERTRAND 2019-07-09 18:06:21 UTC

Guys,

I am getting freezes on tahiti xt/fx9590 recently... But I am not logging a bug yet
because I think the reason is summer heat.

Try to game with an opened computer case with a big fan blowing
into it.

Comment 42 Wilko Bartels 2019-07-10 06:29:33 UTC

(In reply to Wilko Bartels from comment #40)
> I have been experiencing the same issue since June (I didn't game much), so I
> want to share my system info.
> I am on a Ryzen 2600X, Vega 56 Pulse, Strix B450, using Arch with kernel 5.1.
> Tested every window manager I know, and also tested 60Hz and 144Hz. The
> crashes are totally random. I only play Dota 2. Last Friday I played about 6
> games in a row without a single issue. The day after, I crashed about 7 times
> per game. I always have to press reset on my PC.
> Is it known whether this issue is related to a kernel or mesa update? I mean,
> it wasn't always like this, was it?

Tested yesterday with the new 5.2 Linux kernel from Arch testing, and also tested without the variable refresh rate setting and without the TearFree setting in Xorg. It crashed three times.

Comment 43 Mauro Gaspari 2019-07-10 07:25:35 UTC

Hi,
No it was not always like this. I was using Kubuntu and my games were really smooth for months. Zero crashes. Then after a mesa update, I do not recall exactly the version but was around 18.5 or something like that, it all got worse. 

Same game on same PC same hardware same power supply, same cooling, but on windows, zero crashes.
same game on same PC with NVIDIA gpu, zero crashes.

I wish we could get the attention of someone @AMD because there is clearly some issue going on. I would be very happy to help troubleshooting, if only we had some contact with AMD. 

I have not used AMDGPU-PRO in ages, anyone here got that one to check if the same issue happens with proprietary drivers?

Comment 44 Wilko Bartels 2019-07-10 08:03:07 UTC

(In reply to Mauro Gaspari from comment #43)
> Hi,
> No it was not always like this. I was using Kubuntu and my games were really
> smooth for months. Zero crashes. Then after a mesa update, I do not recall
> exactly the version but was around 18.5 or something like that, it all got
> worse. 
> 
> Same game on same PC same hardware same power supply, same cooling, but on
> windows, zero crashes.
> same game on same PC with NVIDIA gpu, zero crashes.
> 
> I wish we could get the attention of someone @AMD because there is clearly
> some issue going on. I would be very happy to help troubleshooting, if only
> we had some contact with AMD. 
> 
> I have not used AMDGPU-PRO in ages, anyone here got that one to check if the
> same issue happens with proprietary drivers?

I was also thinking about AMDGPU-PRO, but I would want to install Ubuntu LTS on another disk for that. It might take me several weeks or even longer to test, and I am not even sure that's super helpful. I'm pretty sure that, at least on Arch at the end of 2018, I had zero problems. At least with my Vega ;-)
Maybe I was wrong to switch from green to red after 10 years. hehe

Comment 45 Wilko Bartels 2019-07-10 08:19:30 UTC

(In reply to Mauro Gaspari from comment #43)
> Hi,
> No it was not always like this. I was using Kubuntu and my games were really
> smooth for months. Zero crashes. Then after a mesa update, I do not recall
> exactly the version but was around 18.5 or something like that, it all got
> worse. 
But is it proven that Mesa is the problem here? There was once an issue with the linux-firmware package in early 2018, if I remember correctly. Users had to roll back back then.
I might roll back to mesa 18.3 to test, if I can manage that, regardless.

Comment 46 Mauro Gaspari 2019-07-10 08:26:23 UTC

This is exactly the reason why I wish we could get more attention to this issue. 
I have seen so many people in forums on the internet replacing their AMD cards with NVIDIA due to similar issues. Or switching back to windows. 

I do not have the proof that the issue is just Mesa, could be a combination of mesa, kernel, firmware for all I know. 

I opened this bug to see if I could get help troubleshooting the issue and finding a permanent fix for all affected users. If there is a better place to report this, I am happy to open as many tickets and send as many emails as needed :)

Also, it would be extremely helpful if we had a script or something to trigger the freeze quickly and consistently, so that troubleshooting mesa, kernel, and firmware combinations would be much easier and more reliable.
If anyone has a test suite, script, or some automated check that can trigger the issue quickly, please share.

Comment 47 Sam 2019-07-10 09:41:22 UTC

The relevant issue and bug report here (the system freezing completely or if lucky just killing the X session, NOT games crashing) seems to be related exclusively to AMDGPU, and not to mesa. Whereas I got the same issues over and over after trying out several versions of mesa, switching to older versions of the kernel "fixes" it for me (the latest version I tried out which didn't have these issues is Kernel 4.20.13, in my case from https://download.opensuse.org/repositories/home:/tiwai:/kernel:/4.20/standard/x86_64/)

There is also a report from another user which temporarily fixed it by forcing the gpu to run at the maximum power setting (https://bugzilla.opensuse.org/show_bug.cgi?id=1136293):

# echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
# echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

and then to reset back to normal:

# echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level
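
The two steps above can be wrapped in a small script. This is only a rough sketch under assumptions: card0 is the Vega card, the script runs as root, and the highest sclk level is read from the last line of pp_dpm_sclk instead of hard-coding 7. The function names (highest_sclk_level, force_max_sclk, restore_auto) are made up for illustration; only the sysfs files come from the commands above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. The sysfs files are the ones
# used in the commands above. DEV defaults to card0, which is an assumption.
DEV="${DEV:-/sys/class/drm/card0/device}"

# Print the index of the highest sclk level, e.g. "7" for a line "7: 1630Mhz".
highest_sclk_level() {
    tail -n 1 "$1/pp_dpm_sclk" | cut -d: -f1
}

# Switch to manual mode and pin sclk to the highest level.
force_max_sclk() {
    level=$(highest_sclk_level "$1") || return 1
    echo manual > "$1/power_dpm_force_performance_level" &&
    echo "$level" > "$1/pp_dpm_sclk"
}

# Hand clock control back to the driver.
restore_auto() {
    echo auto > "$1/power_dpm_force_performance_level"
}

# Usage (as root): force_max_sclk "$DEV"; play; restore_auto "$DEV"
```

Taking the directory as an argument is just there so the functions can be tried against a scratch directory before touching the real device.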

Comment 48 Mauro Gaspari 2019-07-10 14:44:21 UTC

@Sam,

Thank you, this is helpful. Since it is not distribution specific and not mesa related, do you think we should keep the bug here, merge it with other similar bugs, or create one on another bug tracker?
Happy to help and troubleshoot more from my side, and/or push for this to be resolved once and for all, for all AMDGPU users.

Thanks
Mauro

Comment 49 Wilko Bartels 2019-07-10 18:42:53 UTC

(In reply to Sam from comment #47)
> The relevant issue and bug report here (the system freezing completely or if
> lucky just killing the X session, NOT games crashing) seems to be related
> exclusively to AMDGPU, and not to mesa. Whereas I got the same issues over
> and over after trying out several versions of mesa, switching to older
> versions of the kernel "fixes" it for me (the latest version I tried out
> which didn't have these issues is Kernel 4.20.13, in my case from
> https://download.opensuse.org/repositories/home:/tiwai:/kernel:/4.20/
> standard/x86_64/)
> 
> There is also a report from another user which temporarily fixed it by
> forcing the gpu to run at the maximum power setting
> (https://bugzilla.opensuse.org/show_bug.cgi?id=1136293):
> 
> # echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
> # echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
> 
> and then to reset back to normal:
> 
> # echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level

I am currently on my 4th game of Dota in a row after setting the performance level to manual and sclk to level 7. Working so far. Everyone should test this now so we have more reliable data. As we all know, the issue can be gone for several hours, so my experience means nothing yet.
It would be amazing if we could pin the issue down to the performance level of the cards.

Comment 50 shadow.archemage 2019-07-12 15:26:39 UTC

(In reply to Samuel Sieb from comment #39)
> (In reply to shadow.archemage from comment #37)
> > I tried the kernel parameters above, and the game still crashed for me.
> 
> Are you saying that the game is crashing or the graphics device is?

Apologies, what I meant by this is that my system locks up, not just the game crashing. I can't recover from it except by resetting my PC using the power button.

Comment 51 shadow.archemage 2019-07-13 17:22:41 UTC

(In reply to Wilko Bartels from comment #49)
> (In reply to Sam from comment #47)
> > The relevant issue and bug report here (the system freezing completely or if
> > lucky just killing the X session, NOT games crashing) seems to be related
> > exclusively to AMDGPU, and not to mesa. Whereas I got the same issues over
> > and over after trying out several versions of mesa, switching to older
> > versions of the kernel "fixes" it for me (the latest version I tried out
> > which didn't have these issues is Kernel 4.20.13, in my case from
> > https://download.opensuse.org/repositories/home:/tiwai:/kernel:/4.20/
> > standard/x86_64/)
> > 
> > There is also a report from another user which temporarily fixed it by
> > forcing the gpu to run at the maximum power setting
> > (https://bugzilla.opensuse.org/show_bug.cgi?id=1136293):
> > 
> > # echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
> > # echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
> > 
> > and then to reset back to normal:
> > 
> > # echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level
> 
> I am currently on my 4th game of Dota in a row after setting the performance
> level to manual and sclk to level 7. Working so far. Everyone should test this
> now so we have more reliable data. As we all know, the issue can be gone for
> several hours, so my experience means nothing yet.
> It would be amazing if we could pin the issue down to the performance level
> of the cards.

Played Monster Hunter and Dota 2 for quite a long time, and I didn't experience any system freezes with the max performance settings. Will test again tomorrow to see if the workaround is consistent enough.

Comment 52 Wilko Bartels 2019-07-16 08:28:22 UTC

I played about 30 Dota 2 matches without a single freeze. It's safe to say this is it. Where is the right place to report this issue?

Comment 53 Mauro Gaspari 2019-07-17 03:34:31 UTC

Thank you all for the great work.
I will post on AMD support forums and add the link of this and other AMDGPU related bugs.

Comment 54 Sylvain BERTRAND 2019-07-17 16:02:32 UTC

Power management related code is in amdgpu, so the right place is here, plus the
"dri-devel" and "amd-gfx" mailing lists (aka the Linux GPU driver mailing lists).

As far as I am concerned, when I play dota2, I always switch the GPU dpm to
high and the CPU freq governor to perf (because, all those things steal a
significant amount of fps... actually, I do switch my GPU dpm to high just in
case it would be nasty like the cpu governor).
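
For reference, that switch could be scripted roughly as below. This is a sketch, not a recommendation: it assumes card0 is the GPU, the usual cpufreq sysfs layout, and root privileges. The function name set_high_perf is made up; "high" and "performance" are the values written to the standard power_dpm_force_performance_level and scaling_governor files.

```shell
# Sketch only: set the GPU dpm level to "high" and every CPU governor to
# "performance". The function name and the two directory arguments are
# illustrative; pass the real sysfs paths when running as root.
set_high_perf() {
    gpu_dev="$1"
    cpu_root="$2"
    echo high > "$gpu_dev/power_dpm_force_performance_level"
    for gov in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        if [ -e "$gov" ]; then
            echo performance > "$gov"
        fi
    done
}

# Usage (as root):
#   set_high_perf /sys/class/drm/card0/device /sys/devices/system/cpu
```

Writing "auto" back to power_dpm_force_performance_level undoes the GPU part; the CPU governor has to be set back to whatever the distribution used before.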

Comment 55 Hadet 2019-07-18 02:30:29 UTC

So I think this might have something to do with something Xorg is doing: I tried Wayland on a whim, and since then it has not happened while gaming for many hours. I now have 21 hours of uptime with no random crashes.

Comment 56 Sylvain BERTRAND 2019-07-18 13:44:29 UTC

Playing dota2 vulkan or GL?

I guess it's vulkan: and there I don't know how vulkan deals with multiple WSIs,
and how dota2 selects the one it will use.

The idea is to clearly identify the code paths which would be "buggy".

(my custom distro is x11 native)

That said, I don't know the status of wayland: did they reach the same "cluster
f*ck" level that x11 is at? (ironic, since wayland's reason to exist is to be
orders of magnitude less kludgy than x11)

Comment 57 Hadet 2019-07-19 00:12:59 UTC

Created attachment 144821 [details]
Dmesg after crash

I spoke too soon. It's happening on Wayland now too, just a lot less frequently.

Comment 58 Mauro Gaspari 2019-07-22 05:19:29 UTC

After a long time without crashes on Tumbleweed, I wanted to prepare a test setup for Valve's mesa build with ACO. So I installed Ubuntu Budgie 18.04 LTS with the hardware enablement stack, and I noticed the OS freezes are now back, even on the Radeon VII.

What I noticed in the game behavior is this. This is a game running on crossover (wine) with DX11 and DXVK. I want to point out that I do alt-tab out of games to do other things, so this might be a factor to consider. But again, I do the same on my NVIDIA-GPU laptop and I never had a single freeze or fps drop.
Not sure if points 2 and 3 below are related, I just wanted to share my observations.

1. Game starts with excellent FPS. I can hear GPU fans spinning.
2. After a while, the game loses a lot of FPS and starts to become slow and sluggish. The GPU seems to be no longer doing much and I can no longer hear the fans spinning.
3. After a while longer, the whole OS freezes as described in my first post.


What I am going to do next:
1. Use the workaround of comment #47 and test for a few days.
2. Install Valve mesa-aco with ubuntu PPA and test (without workarounds) for a few days.

I will report back when I have more details on my tests.

System info:
OS: Ubuntu 18.04.2 LTS x86_64 
Kernel: 5.0.0-21-generic
Resolution: 3440x1440
CPU: AMD Ryzen 7 2700X (16) @ 3.700G 
GPU: AMD Vega 20 
Memory: 2650MiB / 64398MiB
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.2

Comment 59 wedens13 2019-07-23 16:25:04 UTC

I have similar issues with Sapphire Pulse Vega 56.
Arch Linux
Kernel versions: 4.19.60-1-lts, 5.2.1-1
mesa: 19.1.3-1, mesa with ACO (f9b38efdda166f2b79562525e72fe135c6b23d54)
llvm: 8.0.0

I've also tried booting with integrated video and using DRI_PRIME=1 to offload to vega. It crashes similarly (after 5min of playing witcher 3 with dxvk 1.3.1):

Jul 23 22:44:01 wedens-pc kernel: amdgpu 0000:03:00.0: [mmhub] VMC page fault (src_id:0 ring:154 vmid:1 pasid:32771, for process  pid 0 thread  pid 0)
Jul 23 22:44:01 wedens-pc kernel: amdgpu 0000:03:00.0:   at address 0x0000800100a00000 from 18
Jul 23 22:44:01 wedens-pc kernel: amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00100134
Jul 23 22:44:11 wedens-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=230, emitted seq=233
Jul 23 22:44:11 wedens-pc kernel: [drm] GPU recovery disabled.
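
For anyone comparing logs across machines, lines like the ones above can be picked out mechanically. A minimal sketch, assuming the kernel log is available as plain text on stdin; the function name gpu_hang_lines is made up, and the patterns are simply the strings that appear in the log excerpt above, not a complete list of amdgpu failure messages.

```shell
# Sketch only: print kernel log lines that match the fault/timeout strings
# seen in the excerpt above. Reads the log from stdin.
gpu_hang_lines() {
    grep -E 'amdgpu_job_timedout|VMC page fault|VM_L2_PROTECTION_FAULT_STATUS|GPU recovery disabled'
}

# Example, kernel messages of the previous boot:
#   journalctl -k -b -1 | gpu_hang_lines
```

Like grep itself, the function exits non-zero when nothing matches, which makes it usable in a quick "did the last boot end in a GPU hang" check.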


I'm going to try mesa master and manual power level workaround (when should I use "reset to normal" command?).

Comment 60 wedens13 2019-07-23 16:30:05 UTC

A couple of relevant log fragments with crashes: https://paste.ee/p/rtDEg

Comment 61 wedens13 2019-07-23 17:14:25 UTC

I've tried starting witcher 3 after executing
echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

and it still crashes immediately.

log: https://paste.ee/p/thvXf

Comment 62 Sylvain BERTRAND 2019-07-23 20:18:12 UTC

unstable power supply lines to the gpu if overheating is excluded?

Comment 63 Mauro Gaspari 2019-07-24 04:14:21 UTC

(In reply to Sylvain BERTRAND from comment #62)
> unstable power supply lines to the gpu if overheating is excluded?

I cannot speak for others. In my case, I would say no. I installed Windows 10 on a separate SSD, just to check that there was no hardware issue of any kind.
On Windows 10 with the latest AMD drivers, I have no freezes or any other issues running the same games.

Comment 64 Sylvain BERTRAND 2019-07-24 13:09:23 UTC

> I cannot speak for others. In my case, I would say no. I installed Windows 10
> on a separate SSD, just to check that there was no hardware issue of any kind.
> On Windows 10 with the latest AMD drivers, I have no freezes or any other
> issues running the same games.

Native gnu/linux game or going through wine/dxvk?

Comment 65 wedens13 2019-07-24 14:27:33 UTC

(In reply to Sylvain BERTRAND from comment #62)
> unstable power supply lines to the gpu if overheating is excluded?

It's not overheating in my case, but my PSU is pretty old (I'm waiting for components for my new build to arrive, including new PSU). I've lowered power limit (to 80W) and I haven't had any crashes yet. 

So, in my case the problem *might be* related to PSU. But I can't exclude (nor confirm) possibility of driver problems with higher power states (until I have a better PSU).

I'll report back if I have any crashes with new PSU or lowered PL.

Comment 66 Hadet 2019-07-24 14:41:33 UTC

I don't think it's faulty hardware in any of our cases, to be perfectly honest; it's a bad instruction set. This didn't happen with older kernels or firmware, and the issue now is that there are so few of us with Vega cards that we're really on our own trying to troubleshoot this situation.

Since switching to Wayland my crashing has been a lot less frequent. I'd say once every couple of days, as opposed to once every few hours, when gaming with Vulkan/DXVK.

Comment 67 Sylvain BERTRAND 2019-07-24 14:56:22 UTC

> ...
> Vulkan/DXVK

The bugs may be in wine/DXVK then. You should report a bug to them and link
this bug to theirs.

Comment 68 Mauro Gaspari 2019-07-27 11:28:28 UTC

(In reply to Sylvain BERTRAND from comment #67)
> > ...
> > Vulkan/DXVK
> 
> The bugs may be in wine/DXVK then. You should report a bug to them and link
> this bug to theirs.

If any of you opened bugs on other bug trackers, please post a link here so we can all contribute to both.

I did some tests on my end and I can report the following:

System info:
OS: Ubuntu 18.04.2 LTS x86_64 
Kernel: 5.0.0-21-generic
Resolution: 3440x1440
CPU: AMD Ryzen 7 2700X (16) @ 3.700G 
GPU: AMD Vega 20 
Memory: 2650MiB / 64398MiB
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.2

1. Power profile set to manual did not help
2. Mesa-ACO from Valve seems to have helped quite a bit. So far, no system freezes

I installed Arch on another SSD and will try to reproduce the same tests:
1. Plain Arch - crash or not ?
2. Arch with forced power profile - crash or not ?
3. Arch with mesa-ACO - crash or not ?

Comment 69 Sylvain BERTRAND 2019-07-27 13:19:59 UTC

Don't forget to provide the software stack used:

which software (game, cad...)? wine/dxvk? native?

Comment 70 Mauro Gaspari 2019-07-27 17:32:53 UTC

(In reply to Sylvain BERTRAND from comment #69)
> Don't forget to provide the software stack used:
> 
> which software (game, cad...)? wine/dxvk? native?

Good point. Games being tested:

Pillars of Eternity - Native
Battletech - Native
Eve Online - Wine+DXVK

Comment 71 Yury Zhuravlev 2019-07-28 03:14:23 UTC

Can somebody try games without any fps limits?
That is, with vblank_mode=0 and with any in-game limits turned off.
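
For OpenGL titles on Mesa, that could look like the sketch below. It assumes the game is started from a shell; run_unsynced is a made-up wrapper name and glxgears is only a stand-in for the game binary. As far as I know, vblank_mode is an OpenGL-side Mesa setting, so it should not be expected to change anything for Vulkan or DXVK titles.

```shell
# Sketch only: run a command with Mesa's OpenGL vsync turned off by setting
# vblank_mode=0 in that command's environment. The wrapper name is made up.
run_unsynced() {
    vblank_mode=0 "$@"
}

# Example: run_unsynced glxgears
```

For Steam games the same variable can be put in the launch options, I believe as `vblank_mode=0 %command%`.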

Comment 72 Mauro Gaspari 2019-08-03 13:35:55 UTC

After a few weeks without crashes on Ubuntu Budgie 18.04 LTS with valve mesa-aco, I moved to another distribution that does not have valve mesa-aco to cross check.

This is what I am using:
OS: openSUSE Tumbleweed x86_64 
Kernel: 5.2.2-1-default
Resolution: 3440x1440
DE: Xfce
WM: Xfwm4
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: AMD ATI Radeon VII
Memory: 1644MiB / 64387MiB 
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.3
No kernel parameters configured, just out of the box openSUSE

I had 3 freezes (the first was not a full OS freeze):

1. As I was playing Albion Online (Native). No full system freeze; I was able to drop to a tty and noticed this error: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

2. As I closed down Albion Online (Native) and returned to desktop. Full System Freeze

3. As I was doing regular desktop operations on XFCE. No 3d gaming going on. Please see below logs:

DMESG after crash:

ilvipero@MGDT-ROG:~> dmesg | grep amdgpu
[    5.758450] [drm] amdgpu kernel modesetting enabled.
[    5.758569] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[    5.758570] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[    5.758571] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfcd00000 -> 0xfcd7ffff
[    5.758573] fb0: switching to amdgpudrmfb from EFI VGA
[    5.758646] amdgpu 0000:0a:00.0: vgaarb: deactivate vga console
[    5.758826] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[    5.758870] amdgpu 0000:0a:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    5.758871] amdgpu 0000:0a:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    5.758872] amdgpu 0000:0a:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    5.758936] [drm] amdgpu: 16368M of VRAM memory ready
[    5.758938] [drm] amdgpu: 16368M of GTT memory ready.
[    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    6.855053] fbcon: amdgpudrmfb (fb0) is primary device
[    6.913835] amdgpu 0000:0a:00.0: fb0: amdgpudrmfb frame buffer device
[    6.928054] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[    6.928055] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    6.928056] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    6.928056] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    6.928057] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    6.928058] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    6.928059] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    6.928059] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    6.928060] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    6.928060] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    6.928061] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    6.928062] amdgpu 0000:0a:00.0: ring page0 uses VM inv eng 1 on hub 1
[    6.928063] amdgpu 0000:0a:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    6.928063] amdgpu 0000:0a:00.0: ring page1 uses VM inv eng 5 on hub 1
[    6.928064] amdgpu 0000:0a:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    6.928064] amdgpu 0000:0a:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    6.928065] amdgpu 0000:0a:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    6.928066] amdgpu 0000:0a:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    6.928066] amdgpu 0000:0a:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    6.928067] amdgpu 0000:0a:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    6.928067] amdgpu 0000:0a:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    6.928068] amdgpu 0000:0a:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    6.928068] amdgpu 0000:0a:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    7.609167] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:0a:00.0 on minor 0

system logs:

2019-08-03T18:51:21.779695+08:00 MGDT-ROG kernel: [11817.727681] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
2019-08-03T18:51:21.779730+08:00 MGDT-ROG kernel: [11817.771355] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
2019-08-03T18:51:21.779735+08:00 MGDT-ROG kernel: [11817.771358] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00003100/00006000
2019-08-03T18:51:21.779737+08:00 MGDT-ROG kernel: [11817.771361] pcieport 0000:00:03.1: AER:    [ 8] Rollover              
2019-08-03T18:51:21.779738+08:00 MGDT-ROG kernel: [11817.771371] pcieport 0000:00:03.1: AER:    [12] Timeout               
2019-08-03T18:51:26.721833+08:00 MGDT-ROG sudo: pam_unix(sudo:session): session closed for user root
2019-08-03T18:51:31.983837+08:00 MGDT-ROG kernel: [11827.971739] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2324984, emitted seq=2324986
2019-08-03T18:51:31.983851+08:00 MGDT-ROG kernel: [11827.971800] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process X pid 2132 thread X:cs0 pid 2139
2019-08-03T18:51:31.983853+08:00 MGDT-ROG kernel: [11827.971804] amdgpu 0000:0a:00.0: GPU reset begin!
2019-08-03T18:51:32.751834+08:00 MGDT-ROG kernel: [11828.741066] amdgpu: [powerplay] Failed to send message 0x47, response 0xffffffff
2019-08-03T18:51:32.751846+08:00 MGDT-ROG kernel: [11828.741077] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:32.751849+08:00 MGDT-ROG kernel: [11828.741078] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
2019-08-03T18:51:32.751850+08:00 MGDT-ROG kernel: [11828.741090] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:32.751852+08:00 MGDT-ROG kernel: [11828.741091] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
2019-08-03T18:51:32.751854+08:00 MGDT-ROG kernel: [11828.741102] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:32.751855+08:00 MGDT-ROG kernel: [11828.741102] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
2019-08-03T18:51:32.751856+08:00 MGDT-ROG kernel: [11828.741113] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
2019-08-03T18:51:32.751858+08:00 MGDT-ROG kernel: [11828.741114] amdgpu: [powerplay] Failed to set soft min gfxclk !
2019-08-03T18:51:32.751859+08:00 MGDT-ROG kernel: [11828.741114] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
2019-08-03T18:51:32.787843+08:00 MGDT-ROG kernel: [11828.775671] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:951
2019-08-03T18:51:32.787852+08:00 MGDT-ROG kernel: [11828.775672] ------------[ cut here ]------------
2019-08-03T18:51:32.787853+08:00 MGDT-ROG kernel: [11828.775778] WARNING: CPU: 1 PID: 10195 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:329 generic_reg_wait.cold+0x31/0x53 [amdgpu]
2019-08-03T18:51:32.787855+08:00 MGDT-ROG kernel: [11828.775779] Modules linked in: tun fuse af_packet ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device media joydev scsi_transport_iscsi msr nls_iso8859_1 nls_cp437 vfat fat edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic crc32_pclmul ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap snd_pcm crypto_simd rfkill cryptd video glue_helper wmi_bmof mxm_wmi igb snd_timer sp5100_tco snd ptp pcspkr i2c_piix4 pps_core dca k10temp ccp soundcore gpio_amdpt gpio_generic pcc_cpufreq button acpi_cpufreq btrfs libcrc32c xor hid_generic usbhid amdgpu raid6_pq amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm
2019-08-03T18:51:32.787858+08:00 MGDT-ROG kernel: [11828.775807]  crc32c_intel xhci_hcd usbcore sr_mod cdrom wmi pinctrl_amd l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
2019-08-03T18:51:32.787860+08:00 MGDT-ROG kernel: [11828.775817] CPU: 1 PID: 10195 Comm: kworker/1:0 Not tainted 5.2.3-1-default #1 openSUSE Tumbleweed (unreleased)
2019-08-03T18:51:32.787861+08:00 MGDT-ROG kernel: [11828.775818] Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5007 06/17/2019
2019-08-03T18:51:32.787862+08:00 MGDT-ROG kernel: [11828.775822] Workqueue: events drm_sched_job_timedout [gpu_sched]
2019-08-03T18:51:32.787863+08:00 MGDT-ROG kernel: [11828.775897] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
2019-08-03T18:51:32.787864+08:00 MGDT-ROG kernel: [11828.775899] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 68 7c 75 c0 e8 e9 71 84 f4 83 7b 20 01 0f 84 2b 1b fe ff 48 c7 c7 d8 7b 75 c0 e8 d3 71 84 f4 <0f> 0b e9 18 1b fe ff 48 c7 c7 d8 7b 75 c0 89 54 24 04 e8 bc 71 84
2019-08-03T18:51:32.787866+08:00 MGDT-ROG kernel: [11828.775901] RSP: 0018:ffffab7acdeb77e8 EFLAGS: 00010282
2019-08-03T18:51:32.787867+08:00 MGDT-ROG kernel: [11828.775902] RAX: 0000000000000024 RBX: ffff960e92c3c880 RCX: 0000000000000006
2019-08-03T18:51:32.787868+08:00 MGDT-ROG kernel: [11828.775903] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff960e9e659a10
2019-08-03T18:51:32.787869+08:00 MGDT-ROG kernel: [11828.775903] RBP: 000000000000000a R08: 00000000000004da R09: 0000000000000001
2019-08-03T18:51:32.787870+08:00 MGDT-ROG kernel: [11828.775904] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000004ee2
2019-08-03T18:51:32.787871+08:00 MGDT-ROG kernel: [11828.775905] R13: 0000000000000bb9 R14: 0000000000000000 R15: 0000000000000bb8
2019-08-03T18:51:32.787872+08:00 MGDT-ROG kernel: [11828.775906] FS:  0000000000000000(0000) GS:ffff960e9e640000(0000) knlGS:0000000000000000
2019-08-03T18:51:32.787874+08:00 MGDT-ROG kernel: [11828.775907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2019-08-03T18:51:32.787874+08:00 MGDT-ROG kernel: [11828.775907] CR2: 000055d4170da000 CR3: 0000000f03cd6000 CR4: 00000000003406e0
2019-08-03T18:51:32.787875+08:00 MGDT-ROG kernel: [11828.775908] Call Trace:
2019-08-03T18:51:32.787876+08:00 MGDT-ROG kernel: [11828.775982]  dce110_stream_encoder_dp_blank+0xda/0x120 [amdgpu]
2019-08-03T18:51:32.787877+08:00 MGDT-ROG kernel: [11828.776049]  core_link_disable_stream+0x32/0x260 [amdgpu]
2019-08-03T18:51:32.787878+08:00 MGDT-ROG kernel: [11828.776054]  ? printk+0x48/0x4a
2019-08-03T18:51:32.787879+08:00 MGDT-ROG kernel: [11828.776119]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
2019-08-03T18:51:32.787881+08:00 MGDT-ROG kernel: [11828.776192]  ? vega20_dpm_force_dpm_level.cold+0x5b/0x90 [amdgpu]
2019-08-03T18:51:32.787882+08:00 MGDT-ROG kernel: [11828.776256]  dce110_apply_ctx_to_hw+0x3a/0x470 [amdgpu]
2019-08-03T18:51:32.787883+08:00 MGDT-ROG kernel: [11828.776318]  ? hwmgr_handle_task+0x66/0xc0 [amdgpu]
2019-08-03T18:51:32.787884+08:00 MGDT-ROG kernel: [11828.776322]  ? mutex_lock+0xe/0x30
2019-08-03T18:51:32.787885+08:00 MGDT-ROG kernel: [11828.776385]  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
2019-08-03T18:51:32.787886+08:00 MGDT-ROG kernel: [11828.776450]  ? dm_pp_apply_display_requirements+0x1a1/0x1c0 [amdgpu]
2019-08-03T18:51:32.787887+08:00 MGDT-ROG kernel: [11828.776513]  dc_commit_state_no_check+0x200/0x530 [amdgpu]
2019-08-03T18:51:32.787888+08:00 MGDT-ROG kernel: [11828.776516]  ? get_page_from_freelist+0x289/0x380
2019-08-03T18:51:32.787889+08:00 MGDT-ROG kernel: [11828.776579]  dc_commit_state+0x8f/0xb0 [amdgpu]
2019-08-03T18:51:32.787889+08:00 MGDT-ROG kernel: [11828.776644]  amdgpu_dm_atomic_commit_tail+0x3a6/0xd30 [amdgpu]
2019-08-03T18:51:32.787890+08:00 MGDT-ROG kernel: [11828.776709]  ? bw_calcs+0x8ac/0x1440 [amdgpu]
2019-08-03T18:51:32.787892+08:00 MGDT-ROG kernel: [11828.776711]  ? __ww_mutex_lock.isra.0+0x2a/0x780
2019-08-03T18:51:32.787893+08:00 MGDT-ROG kernel: [11828.776714]  ? _raw_spin_unlock_irqrestore+0x24/0x40
2019-08-03T18:51:32.787893+08:00 MGDT-ROG kernel: [11828.776717]  ? __wake_up_common_lock+0x7c/0xa0
2019-08-03T18:51:32.787894+08:00 MGDT-ROG kernel: [11828.776719]  ? wait_for_completion_timeout+0xf3/0x110
2019-08-03T18:51:32.787895+08:00 MGDT-ROG kernel: [11828.776720]  ? wait_for_completion_interruptible+0x10b/0x150
2019-08-03T18:51:32.787896+08:00 MGDT-ROG kernel: [11828.776728]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:32.787897+08:00 MGDT-ROG kernel: [11828.776735]  commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:32.787898+08:00 MGDT-ROG kernel: [11828.776742]  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
2019-08-03T18:51:32.787899+08:00 MGDT-ROG kernel: [11828.776749]  drm_atomic_helper_disable_all+0x144/0x160 [drm_kms_helper]
2019-08-03T18:51:32.787900+08:00 MGDT-ROG kernel: [11828.776756]  drm_atomic_helper_suspend+0x4c/0xe0 [drm_kms_helper]
2019-08-03T18:51:32.787901+08:00 MGDT-ROG kernel: [11828.776820]  dm_suspend+0x20/0x60 [amdgpu]
2019-08-03T18:51:32.787902+08:00 MGDT-ROG kernel: [11828.776861]  amdgpu_device_ip_suspend_phase1+0x8b/0xc0 [amdgpu]
2019-08-03T18:51:32.787903+08:00 MGDT-ROG kernel: [11828.776903]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
2019-08-03T18:51:32.787904+08:00 MGDT-ROG kernel: [11828.776975]  amdgpu_device_pre_asic_reset+0x1f4/0x209 [amdgpu]
2019-08-03T18:51:32.787905+08:00 MGDT-ROG kernel: [11828.777047]  amdgpu_device_gpu_recover+0x67/0x765 [amdgpu]
2019-08-03T18:51:32.787906+08:00 MGDT-ROG kernel: [11828.777106]  amdgpu_job_timedout+0xf7/0x120 [amdgpu]
2019-08-03T18:51:32.787906+08:00 MGDT-ROG kernel: [11828.777110]  drm_sched_job_timedout+0x3a/0x70 [gpu_sched]
2019-08-03T18:51:32.787907+08:00 MGDT-ROG kernel: [11828.777113]  process_one_work+0x1df/0x3c0
2019-08-03T18:51:32.787908+08:00 MGDT-ROG kernel: [11828.777115]  worker_thread+0x4d/0x400
2019-08-03T18:51:32.787909+08:00 MGDT-ROG kernel: [11828.777117]  kthread+0xf9/0x130
2019-08-03T18:51:32.787910+08:00 MGDT-ROG kernel: [11828.777119]  ? process_one_work+0x3c0/0x3c0
2019-08-03T18:51:32.787911+08:00 MGDT-ROG kernel: [11828.777120]  ? kthread_park+0x80/0x80
2019-08-03T18:51:32.787912+08:00 MGDT-ROG kernel: [11828.777122]  ret_from_fork+0x27/0x50
2019-08-03T18:51:32.787913+08:00 MGDT-ROG kernel: [11828.777125] ---[ end trace 9aaf1f62ae398b4b ]---
2019-08-03T18:51:37.791882+08:00 MGDT-ROG kernel: [11833.780084] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
2019-08-03T18:51:37.791896+08:00 MGDT-ROG kernel: [11833.780129] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B0B0 (len 2971, WS 4, PS 0) @ 0xB963
2019-08-03T18:51:37.791898+08:00 MGDT-ROG kernel: [11833.780172] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing AFB0 (len 255, WS 4, PS 0) @ 0xB089
2019-08-03T18:51:37.791899+08:00 MGDT-ROG kernel: [11833.780240] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
2019-08-03T18:51:37.791901+08:00 MGDT-ROG kernel: [11833.780240] ------------[ cut here ]------------
2019-08-03T18:51:37.791902+08:00 MGDT-ROG kernel: [11833.780328] WARNING: CPU: 1 PID: 10195 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1096 dce110_link_encoder_disable_output+0x13d/0x150 [amdgpu]
2019-08-03T18:51:37.791903+08:00 MGDT-ROG kernel: [11833.780329] Modules linked in: tun fuse af_packet ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device media joydev scsi_transport_iscsi msr nls_iso8859_1 nls_cp437 vfat fat edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic crc32_pclmul ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap snd_pcm crypto_simd rfkill cryptd video glue_helper wmi_bmof mxm_wmi igb snd_timer sp5100_tco snd ptp pcspkr i2c_piix4 pps_core dca k10temp ccp soundcore gpio_amdpt gpio_generic pcc_cpufreq button acpi_cpufreq btrfs libcrc32c xor hid_generic usbhid amdgpu raid6_pq amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm
2019-08-03T18:51:37.791905+08:00 MGDT-ROG kernel: [11833.780356]  crc32c_intel xhci_hcd usbcore sr_mod cdrom wmi pinctrl_amd l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
2019-08-03T18:51:37.791907+08:00 MGDT-ROG kernel: [11833.780365] CPU: 1 PID: 10195 Comm: kworker/1:0 Tainted: G        W         5.2.3-1-default #1 openSUSE Tumbleweed (unreleased)
2019-08-03T18:51:37.791908+08:00 MGDT-ROG kernel: [11833.780366] Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5007 06/17/2019
2019-08-03T18:51:37.791910+08:00 MGDT-ROG kernel: [11833.780370] Workqueue: events drm_sched_job_timedout [gpu_sched]
2019-08-03T18:51:37.791911+08:00 MGDT-ROG kernel: [11833.780435] RIP: 0010:dce110_link_encoder_disable_output+0x13d/0x150 [amdgpu]
2019-08-03T18:51:37.791912+08:00 MGDT-ROG kernel: [11833.780437] Code: ff ff 48 83 c4 38 5b 5d 41 5c c3 48 c7 c6 c0 c8 6f c0 48 c7 c7 d8 d9 74 c0 e8 cf bb de ff 48 c7 c7 70 d9 74 c0 e8 61 13 8c f4 <0f> 0b eb d4 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
2019-08-03T18:51:37.791913+08:00 MGDT-ROG kernel: [11833.780438] RSP: 0018:ffffab7acdeb77f8 EFLAGS: 00010282
2019-08-03T18:51:37.791914+08:00 MGDT-ROG kernel: [11833.780439] RAX: 0000000000000024 RBX: ffff960e96034a80 RCX: 0000000000000006
2019-08-03T18:51:37.791915+08:00 MGDT-ROG kernel: [11833.780440] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff960e9e659a10
2019-08-03T18:51:37.791917+08:00 MGDT-ROG kernel: [11833.780441] RBP: 0000000000000020 R08: 0000000000000518 R09: 0000000000000001
2019-08-03T18:51:37.791918+08:00 MGDT-ROG kernel: [11833.780441] R10: 0000000000000000 R11: 0000000000000001 R12: ffffab7acdeb77fc
2019-08-03T18:51:37.791919+08:00 MGDT-ROG kernel: [11833.780442] R13: ffff95ffc13c1000 R14: 0000000000000000 R15: ffff9601c92c8188
2019-08-03T18:51:37.791920+08:00 MGDT-ROG kernel: [11833.780443] FS:  0000000000000000(0000) GS:ffff960e9e640000(0000) knlGS:0000000000000000
2019-08-03T18:51:37.791921+08:00 MGDT-ROG kernel: [11833.780444] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2019-08-03T18:51:37.791922+08:00 MGDT-ROG kernel: [11833.780445] CR2: 000055d4170da000 CR3: 0000000f03cd6000 CR4: 00000000003406e0
2019-08-03T18:51:37.791923+08:00 MGDT-ROG kernel: [11833.780446] Call Trace:
2019-08-03T18:51:37.791924+08:00 MGDT-ROG kernel: [11833.780512]  dp_disable_link_phy+0x73/0x110 [amdgpu]
2019-08-03T18:51:37.791925+08:00 MGDT-ROG kernel: [11833.780576]  core_link_disable_stream+0xb6/0x260 [amdgpu]
2019-08-03T18:51:37.791926+08:00 MGDT-ROG kernel: [11833.780580]  ? printk+0x48/0x4a
2019-08-03T18:51:37.791927+08:00 MGDT-ROG kernel: [11833.780642]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
2019-08-03T18:51:37.791928+08:00 MGDT-ROG kernel: [11833.780716]  ? vega20_dpm_force_dpm_level.cold+0x5b/0x90 [amdgpu]
2019-08-03T18:51:37.791929+08:00 MGDT-ROG kernel: [11833.780779]  dce110_apply_ctx_to_hw+0x3a/0x470 [amdgpu]
2019-08-03T18:51:37.791930+08:00 MGDT-ROG kernel: [11833.780840]  ? hwmgr_handle_task+0x66/0xc0 [amdgpu]
2019-08-03T18:51:37.791931+08:00 MGDT-ROG kernel: [11833.780843]  ? mutex_lock+0xe/0x30
2019-08-03T18:51:37.791933+08:00 MGDT-ROG kernel: [11833.780905]  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
2019-08-03T18:51:37.791934+08:00 MGDT-ROG kernel: [11833.780969]  ? dm_pp_apply_display_requirements+0x1a1/0x1c0 [amdgpu]
2019-08-03T18:51:37.791935+08:00 MGDT-ROG kernel: [11833.781032]  dc_commit_state_no_check+0x200/0x530 [amdgpu]
2019-08-03T18:51:37.791936+08:00 MGDT-ROG kernel: [11833.781036]  ? get_page_from_freelist+0x289/0x380
2019-08-03T18:51:37.791937+08:00 MGDT-ROG kernel: [11833.781098]  dc_commit_state+0x8f/0xb0 [amdgpu]
2019-08-03T18:51:37.791938+08:00 MGDT-ROG kernel: [11833.781162]  amdgpu_dm_atomic_commit_tail+0x3a6/0xd30 [amdgpu]
2019-08-03T18:51:37.791939+08:00 MGDT-ROG kernel: [11833.781227]  ? bw_calcs+0x8ac/0x1440 [amdgpu]
2019-08-03T18:51:37.791940+08:00 MGDT-ROG kernel: [11833.781229]  ? __ww_mutex_lock.isra.0+0x2a/0x780
2019-08-03T18:51:37.791941+08:00 MGDT-ROG kernel: [11833.781231]  ? _raw_spin_unlock_irqrestore+0x24/0x40
2019-08-03T18:51:37.791942+08:00 MGDT-ROG kernel: [11833.781234]  ? __wake_up_common_lock+0x7c/0xa0
2019-08-03T18:51:37.791943+08:00 MGDT-ROG kernel: [11833.781236]  ? wait_for_completion_timeout+0xf3/0x110
2019-08-03T18:51:37.791944+08:00 MGDT-ROG kernel: [11833.781237]  ? wait_for_completion_interruptible+0x10b/0x150
2019-08-03T18:51:37.791945+08:00 MGDT-ROG kernel: [11833.781245]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:37.791946+08:00 MGDT-ROG kernel: [11833.781251]  commit_tail+0x3c/0x70 [drm_kms_helper]
2019-08-03T18:51:37.791947+08:00 MGDT-ROG kernel: [11833.781258]  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
2019-08-03T18:51:37.791948+08:00 MGDT-ROG kernel: [11833.781265]  drm_atomic_helper_disable_all+0x144/0x160 [drm_kms_helper]
2019-08-03T18:51:37.791949+08:00 MGDT-ROG kernel: [11833.781272]  drm_atomic_helper_suspend+0x4c/0xe0 [drm_kms_helper]
2019-08-03T18:51:37.791950+08:00 MGDT-ROG kernel: [11833.781335]  dm_suspend+0x20/0x60 [amdgpu]
2019-08-03T18:51:37.791951+08:00 MGDT-ROG kernel: [11833.781377]  amdgpu_device_ip_suspend_phase1+0x8b/0xc0 [amdgpu]
2019-08-03T18:51:37.791952+08:00 MGDT-ROG kernel: [11833.781418]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
2019-08-03T18:51:37.791953+08:00 MGDT-ROG kernel: [11833.781490]  amdgpu_device_pre_asic_reset+0x1f4/0x209 [amdgpu]
2019-08-03T18:51:37.791954+08:00 MGDT-ROG kernel: [11833.781561]  amdgpu_device_gpu_recover+0x67/0x765 [amdgpu]
2019-08-03T18:51:37.791955+08:00 MGDT-ROG kernel: [11833.781620]  amdgpu_job_timedout+0xf7/0x120 [amdgpu]
2019-08-03T18:51:37.791956+08:00 MGDT-ROG kernel: [11833.781624]  drm_sched_job_timedout+0x3a/0x70 [gpu_sched]
2019-08-03T18:51:37.791957+08:00 MGDT-ROG kernel: [11833.781627]  process_one_work+0x1df/0x3c0
2019-08-03T18:51:37.791958+08:00 MGDT-ROG kernel: [11833.781629]  worker_thread+0x4d/0x400
2019-08-03T18:51:37.791959+08:00 MGDT-ROG kernel: [11833.781631]  kthread+0xf9/0x130
2019-08-03T18:51:37.791960+08:00 MGDT-ROG kernel: [11833.781633]  ? process_one_work+0x3c0/0x3c0
2019-08-03T18:51:37.791961+08:00 MGDT-ROG kernel: [11833.781634]  ? kthread_park+0x80/0x80
2019-08-03T18:51:37.791962+08:00 MGDT-ROG kernel: [11833.781636]  ret_from_fork+0x27/0x50
2019-08-03T18:51:37.791963+08:00 MGDT-ROG kernel: [11833.781639] ---[ end trace 9aaf1f62ae398b4c ]---
2019-08-03T18:51:42.796019+08:00 MGDT-ROG kernel: [11838.784083] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
2019-08-03T18:51:42.796034+08:00 MGDT-ROG kernel: [11838.784127] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing A048 (len 62, WS 0, PS 0) @ 0xA064
2019-08-03T18:51:42.796035+08:00 MGDT-ROG kernel: [11838.784208] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796036+08:00 MGDT-ROG kernel: [11838.784219] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796038+08:00 MGDT-ROG kernel: [11838.784233] amdgpu: [powerplay] Failed to send message 0x47, response 0xffffffff
2019-08-03T18:51:42.796039+08:00 MGDT-ROG kernel: [11838.784245] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796040+08:00 MGDT-ROG kernel: [11838.784245] amdgpu: [powerplay] [SetUclkToHightestDpmLevel] Set hard min uclk failed!
2019-08-03T18:51:42.796041+08:00 MGDT-ROG kernel: [11838.784258] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796042+08:00 MGDT-ROG kernel: [11838.784258] amdgpu: [powerplay] Attempt to set Hard Min for DCEFCLK Failed!
2019-08-03T18:51:42.796044+08:00 MGDT-ROG kernel: [11838.784269] amdgpu: [powerplay] Failed to send message 0x28, response 0xffffffff
2019-08-03T18:51:42.796045+08:00 MGDT-ROG kernel: [11838.784270] amdgpu: [powerplay] [SetHardMinFreq] Set hard min uclk failed!
2019-08-03T18:51:42.796046+08:00 MGDT-ROG kernel: [11838.784281] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
2019-08-03T18:51:42.796047+08:00 MGDT-ROG kernel: [11838.784282] amdgpu: [powerplay] Failed to set soft min gfxclk !
2019-08-03T18:51:42.796048+08:00 MGDT-ROG kernel: [11838.784282] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
2019-08-03T18:51:43.656061+08:00 MGDT-ROG kernel: [11839.645436] amdgpu: [powerplay] Failed to send message 0x26, response 0xffffffff
2019-08-03T18:51:43.656078+08:00 MGDT-ROG kernel: [11839.645438] amdgpu: [powerplay] Failed to set soft min gfxclk !
2019-08-03T18:51:43.656080+08:00 MGDT-ROG kernel: [11839.645438] amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
2019-08-03T18:51:43.656081+08:00 MGDT-ROG kernel: [11839.645449] amdgpu: [powerplay] Failed to send message 0x7, response 0xffffffff
2019-08-03T18:51:43.656082+08:00 MGDT-ROG kernel: [11839.645450] amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
2019-08-03T18:51:43.656083+08:00 MGDT-ROG kernel: [11839.645450] amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
2019-08-03T18:51:43.656084+08:00 MGDT-ROG kernel: [11839.645451] amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
2019-08-03T18:51:43.656086+08:00 MGDT-ROG kernel: [11839.645497] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5
2019-08-03T18:51:43.911990+08:00 MGDT-ROG kernel: [11839.902893] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
2019-08-03T18:51:43.912001+08:00 MGDT-ROG kernel: [11839.902947] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
2019-08-03T18:51:44.167806+08:00 MGDT-ROG kernel: [11840.159797] [drm] Timeout wait for RLC serdes 0,0
2019-08-03T18:51:44.191826+08:00 MGDT-ROG kernel: [11840.180793] amdgpu 0000:0a:00.0: GPU mode1 reset
2019-08-03T18:51:44.451982+08:00 MGDT-ROG kernel: [11840.442308] [drm] psp is not working correctly before mode1 reset!
2019-08-03T18:51:44.451993+08:00 MGDT-ROG kernel: [11840.442310] amdgpu 0000:0a:00.0: GPU mode1 reset failed
2019-08-03T18:51:44.719056+08:00 MGDT-ROG kernel: [11840.710967] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* ASIC reset failed with error, -22 for drm dev, 0000:0a:00.0
2019-08-03T18:51:44.719066+08:00 MGDT-ROG kernel: [11840.711014] amdgpu 0000:0a:00.0: GPU reset(1) failed
2019-08-03T18:51:44.719068+08:00 MGDT-ROG kernel: [11840.711033] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719068+08:00 MGDT-ROG kernel: [11840.711038] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719070+08:00 MGDT-ROG kernel: [11840.711040] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719071+08:00 MGDT-ROG kernel: [11840.711043] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719072+08:00 MGDT-ROG kernel: [11840.711045] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719073+08:00 MGDT-ROG kernel: [11840.711049] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719075+08:00 MGDT-ROG kernel: [11840.711051] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719076+08:00 MGDT-ROG kernel: [11840.711053] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719077+08:00 MGDT-ROG kernel: [11840.711057] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719078+08:00 MGDT-ROG kernel: [11840.711059] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719079+08:00 MGDT-ROG kernel: [11840.711061] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719080+08:00 MGDT-ROG kernel: [11840.711064] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719081+08:00 MGDT-ROG kernel: [11840.711066] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719082+08:00 MGDT-ROG kernel: [11840.711068] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719083+08:00 MGDT-ROG kernel: [11840.711072] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719084+08:00 MGDT-ROG kernel: [11840.711075] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719085+08:00 MGDT-ROG kernel: [11840.711077] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719086+08:00 MGDT-ROG kernel: [11840.711080] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719087+08:00 MGDT-ROG kernel: [11840.711083] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719088+08:00 MGDT-ROG kernel: [11840.711085] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719089+08:00 MGDT-ROG kernel: [11840.711087] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719090+08:00 MGDT-ROG kernel: [11840.711090] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719091+08:00 MGDT-ROG kernel: [11840.711092] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719092+08:00 MGDT-ROG kernel: [11840.711094] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719093+08:00 MGDT-ROG kernel: [11840.711096] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719094+08:00 MGDT-ROG kernel: [11840.711097] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719095+08:00 MGDT-ROG kernel: [11840.711100] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719096+08:00 MGDT-ROG kernel: [11840.711102] amdgpu 0000:0a:00.0: GPU reset end with ret = -22
2019-08-03T18:51:44.719097+08:00 MGDT-ROG kernel: [11840.711102] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719098+08:00 MGDT-ROG kernel: [11840.711104] [drm] Skip scheduling IBs!
2019-08-03T18:51:44.719099+08:00 MGDT-ROG kernel: [11840.711106] [drm] Skip scheduling IBs!
2019-08-03T18:51:54.767980+08:00 MGDT-ROG kernel: [11850.756186] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2324986, emitted seq=2324986
2019-08-03T18:51:54.767994+08:00 MGDT-ROG kernel: [11850.756247] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process X pid 2132 thread X:cs0 pid 2139
2019-08-03T18:51:54.767996+08:00 MGDT-ROG kernel: [11850.756251] amdgpu 0000:0a:00.0: GPU reset begin!
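
For anyone reading the log above: the negative numbers in the failure messages are Linux kernel error codes. Below is a rough Python sketch that pulls them out of log lines and names them. The table covers only the codes seen in this thread, and the helper name and the regular expression are made up for illustration; they are not part of any amdgpu tooling.

```python
import re

# Linux errno values for the codes that show up in the logs in this thread.
LINUX_ERRNO = {
    2: "ENOENT (no such file or directory)",
    5: "EIO (input/output error)",
    22: "EINVAL (invalid argument)",
    110: "ETIMEDOUT (timed out)",
}

# Matches e.g. "failed -5", "failed (-110)", "error, -22", "ret = -22", "error -2".
_CODE_RE = re.compile(r"(?:failed|error,?|ret =)\s*\(?-(\d+)\)?")


def decode_errors(lines):
    """Return (code, name) for each log line that carries a negative error code."""
    found = []
    for line in lines:
        match = _CODE_RE.search(line)
        if match:
            code = int(match.group(1))
            found.append((-code, LINUX_ERRNO.get(code, "unknown")))
    return found
```

Fed the lines above, this would report -5 for the powerplay suspend, -110 for the kiq ring test and -22 for the ASIC reset. It says nothing about the root cause; it only makes the numbers easier to read.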

Comment 73 Sylvain BERTRAND 2019-08-03 16:54:17 UTC

On Sat, Aug 03, 2019 at 01:35:55PM +0000, bugzilla-daemon@freedesktop.org wrote:
> [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> amdgpu/vega20_ta.bin failed with error -2
> [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> "amdgpu/vega20_ta.bin"

Did you get the latest and "greatest" amdgpu firmware package?

Comment 74 Mauro Gaspari 2019-08-03 17:43:01 UTC

(In reply to Sylvain BERTRAND from comment #73)
> On Sat, Aug 03, 2019 at 01:35:55PM +0000, bugzilla-daemon@freedesktop.org
> wrote:
> > [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> > amdgpu/vega20_ta.bin failed with error -2
> > [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > "amdgpu/vega20_ta.bin"
> 
> Did you get the latest and "greatest" amdgpu firmware package?

This is a fresh install I made to test this issue, so for now I only installed the packages per openSUSE wiki: https://en.opensuse.org/SDB:AMDGPU

I have taken a snapper btrfs snapshot, so if there is anything you want me to test, I am ready.

Comment 75 Sylvain BERTRAND 2019-08-03 18:46:19 UTC

On Sat, Aug 03, 2019 at 05:43:01PM +0000, bugzilla-daemon@freedesktop.org wrote:
> > > [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> > > amdgpu/vega20_ta.bin failed with error -2
> > > [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > > "amdgpu/vega20_ta.bin"

It seems you have a corrupted/old/missing vega20_ta.bin firmware file.
It looks like outdated distro files.
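
Error -2 in the quoted lines is ENOENT, which suggests the kernel did not find the file at all. A quick way to check is sketched below in Python. The directory /lib/firmware/amdgpu and the .xz/.zst compressed suffixes are assumptions about how the distro ships firmware, and the function name is only illustrative.

```python
from pathlib import Path


def find_firmware(name, fw_dir="/lib/firmware/amdgpu"):
    """Return the files under fw_dir that match name, plain or compressed."""
    base = Path(fw_dir)
    # Some distros ship compressed firmware, so look for those variants too.
    candidates = [base / name, base / (name + ".xz"), base / (name + ".zst")]
    return [path for path in candidates if path.is_file()]
```

Calling find_firmware("vega20_ta.bin") and getting an empty list back would be consistent with the "failed with error -2" message.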

Comment 76 Mauro Gaspari 2019-08-04 05:05:52 UTC

(In reply to Sylvain BERTRAND from comment #75)
> On Sat, Aug 03, 2019 at 05:43:01PM +0000, bugzilla-daemon@freedesktop.org
> wrote:
> > > > [    5.759204] amdgpu 0000:0a:00.0: Direct firmware load for
> > > > amdgpu/vega20_ta.bin failed with error -2
> > > > [    5.759205] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > > > "amdgpu/vega20_ta.bin"
> 
> It seems you have a corrupted/old/missing vega20_ta.bin firmware file.
> It looks like outdated distro files.

Hello,
I did a quick search online, and this seems to be a common message for many amdgpu users. In the other reports I found, it seems to be dismissed as a harmless warning, with the firmware described as not mandatory. I am not an expert and I do not want to dismiss it here; I am just reporting what I see.

By the way, it is interesting that even my Ubuntu Budgie LTS install, with Valve's mesa-aco and a different kernel, shows the same warning.

[    5.435346] [drm] amdgpu kernel modesetting enabled.
[    5.435500] fb0: switching to amdgpudrmfb from EFI VGA
[    5.735058] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[    5.735102] amdgpu 0000:0a:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[    5.735103] amdgpu 0000:0a:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    5.735104] amdgpu 0000:0a:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    5.735185] [drm] amdgpu: 16368M of VRAM memory ready
[    5.735186] [drm] amdgpu: 16368M of GTT memory ready.
[    5.739656] amdgpu 0000:0a:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2
[    5.739659] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware "amdgpu/vega20_ta.bin"
[    6.354308] fbcon: amdgpudrmfb (fb0) is primary device
[    6.354490] amdgpu 0000:0a:00.0: fb0: amdgpudrmfb frame buffer device
[    6.384079] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[    6.384080] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    6.384081] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    6.384082] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    6.384083] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    6.384084] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    6.384084] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    6.384085] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    6.384086] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    6.384087] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    6.384088] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    6.384089] amdgpu 0000:0a:00.0: ring page0 uses VM inv eng 1 on hub 1
[    6.384089] amdgpu 0000:0a:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    6.384090] amdgpu 0000:0a:00.0: ring page1 uses VM inv eng 5 on hub 1
[    6.384090] amdgpu 0000:0a:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    6.384091] amdgpu 0000:0a:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    6.384092] amdgpu 0000:0a:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    6.384092] amdgpu 0000:0a:00.0: ring uvd_1 uses VM inv eng 9 on hub 1
[    6.384093] amdgpu 0000:0a:00.0: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[    6.384094] amdgpu 0000:0a:00.0: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[    6.384094] amdgpu 0000:0a:00.0: ring vce0 uses VM inv eng 12 on hub 1
[    6.384095] amdgpu 0000:0a:00.0: ring vce1 uses VM inv eng 13 on hub 1
[    6.384096] amdgpu 0000:0a:00.0: ring vce2 uses VM inv eng 14 on hub 1
[    7.067068] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:0a:00.0 on minor 0

Comment 77 Sylvain BERTRAND 2019-08-04 14:18:56 UTC

On Sun, Aug 04, 2019 at 05:05:52AM +0000, bugzilla-daemon@freedesktop.org wrote:
> By the way, Interesting to see that even my ubuntu budgie LTS with valve
> mesa-aco and different kernel, has the same warning.
> [    5.739656] amdgpu 0000:0a:00.0: Direct firmware load for
> amdgpu/vega20_ta.bin failed with error -2
> [    5.739659] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> "amdgpu/vega20_ta.bin"

I don't know of an AMD GPU part able to run without properly loaded firmware.

That would have to be confirmed by official AMD devs, who are the only people
with that knowledge.

In the very probable case that the firmware _must_ be loaded for proper gpu
operations, you have to tell the maintainers of the distros you use to update
their linux/amdgpu firmware package.

Comment 78 Mauro Gaspari 2019-08-04 16:17:41 UTC

(In reply to Sylvain BERTRAND from comment #77)
> On Sun, Aug 04, 2019 at 05:05:52AM +0000, bugzilla-daemon@freedesktop.org
> wrote:
> > By the way, Interesting to see that even my ubuntu budgie LTS with valve
> > mesa-aco and different kernel, has the same warning.
> > [    5.739656] amdgpu 0000:0a:00.0: Direct firmware load for
> > amdgpu/vega20_ta.bin failed with error -2
> > [    5.739659] amdgpu 0000:0a:00.0: psp v11.0: Failed to load firmware
> > "amdgpu/vega20_ta.bin"
> 
> I don't know of an AMD GPU part able to run without properly loaded firmware.
> 
> That would have to be confirmed by official AMD devs which are the sole ppl
> with that knowledge.
> 
> In the very probable case that the firmware _must_ be loaded for proper gpu
> operations, you have to tell the maintainers of the distros you use to update
> their linux/amdgpu firmware package.

I believe so, and yes, it makes total sense that you need the correct firmware for a piece of hardware to work properly.
I will open bugs for openSUSE and Ubuntu, ask the question there, and point to this bug tracker. Let's see what comes of it. I will report back when I hear from the distribution maintainers.

I am using a Radeon VII at the moment. Is there anyone with a Vega 64 or Vega 56 who can run the same tests and let me know if they see the same issue? I am happy to include those cards in the same bug reports if someone can confirm.

Comment 79 Alex Deucher 2019-08-05 05:54:44 UTC

The ta bin is optional. It's only used for server cards with XGMI and RAS features. Consumer cards don't support those features and don't use it.

Comment 80 Mauro Gaspari 2019-08-05 06:16:32 UTC

(In reply to Alex Deucher from comment #79)
> the ta bin is optional.  It's only used for server cards with xgmi and ras
> features.  Consumer cards don't support those features and don't use it.

Alex,
Thank you for confirming this. Good to know.
Regarding the logs and dmesg I posted above in comment #72, do you see anything useful? Are there any other specific tests I can do to help pinpoint the issue?

Comment 81 Pierre-Eric Pelloux-Prayer 2019-08-07 09:53:53 UTC

Can anyone provide an apitrace/renderdoc capture that can reliably reproduce the crash/freeze?

Comment 82 Mauro Gaspari 2019-08-11 09:31:41 UTC

(In reply to Pierre-Eric Pelloux-Prayer from comment #81)
> Can anyone provide a apitrace/renderdoc capture that can reliably reproduce
> the crash/freeze?

Hello, sadly my freezes are hard to reproduce. Sometimes I can play for a day with no freeze; sometimes it freezes after 10 minutes, one hour, and so on.

I had another freeze today:

OS: openSUSE Tumbleweed x86_64 
Kernel: 5.2.5-1-default
Resolution: 3440x1440
DE: Xfce
WM: Xfwm4
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: AMD ATI Radeon VII
Memory: 3791MiB / 64387MiB 
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.3

Game: EVE Online, Wine+DXVK (Crossover 18.5.0), vsync off, frame limiter off
Problem description: After roughly 1 hour of gameplay, the desktop froze for a few seconds but managed to recover. The game did not recover and I killed the process.

DMESG:

[20612.721860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12880412, emitted seq=12880414
[20612.721921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process exefile.exe pid 1980 thread exefile.ex:cs0 pid 2057
[20612.721925] amdgpu 0000:0a:00.0: GPU reset begin!
[20613.526448] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20613.526502] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[20613.547524] amdgpu 0000:0a:00.0: GPU mode1 reset
[20614.055810] [drm] psp mode1 reset succeed 
[20614.128815] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[20614.128943] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[20614.129304] [drm] PSP is resuming...
[20614.192202] [drm] reserve 0x400000 from 0x8000c00000 for PSP TMR SIZE
[20614.649220] [drm] UVD and UVD ENC initialized successfully.
[20614.748872] [drm] VCE initialized successfully.
[20615.271942] [drm] Fence fallback timer expired on ring gfx
[20615.783826] [drm] Fence fallback timer expired on ring comp_1.0.0
[20616.616023] [drm] Fence fallback timer expired on ring uvd_1
[20617.127844] [drm] Fence fallback timer expired on ring uvd_enc_1.0
[20617.639836] [drm] Fence fallback timer expired on ring uvd_enc_1.1
[20617.739606] [drm] recover vram bo from shadow start
[20617.742231] [drm] recover vram bo from shadow done
[20617.742233] [drm] Skip scheduling IBs!
[20617.742234] [drm] Skip scheduling IBs!
[20617.742259] amdgpu 0000:0a:00.0: GPU reset(2) succeeded!
[20617.742289] [drm] Skip scheduling IBs!
[20617.742309] [drm] Skip scheduling IBs!
[20617.742314] [drm] Skip scheduling IBs!
[20617.742316] [drm] Skip scheduling IBs!
[20617.742318] [drm] Skip scheduling IBs!
[20617.742320] [drm] Skip scheduling IBs!
[20617.743840] [drm] Skip scheduling IBs!
[20617.744006] [drm] Skip scheduling IBs!
[20617.744180] [drm] Skip scheduling IBs!
[20617.744450] [drm] Skip scheduling IBs!

System Logs:

2019-08-11T17:13:10.377029+08:00 MGDT-ROG kernel: [20612.721860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12880412, emitted seq=12880414
2019-08-11T17:13:10.377046+08:00 MGDT-ROG kernel: [20612.721921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process exefile.exe pid 1980 thread exefile.ex:cs0 pid 2057
2019-08-11T17:13:10.377047+08:00 MGDT-ROG kernel: [20612.721925] amdgpu 0000:0a:00.0: GPU reset begin!
2019-08-11T17:13:11.182763+08:00 MGDT-ROG kernel: [20613.526448] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
2019-08-11T17:13:11.182776+08:00 MGDT-ROG kernel: [20613.526502] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
2019-08-11T17:13:11.202766+08:00 MGDT-ROG kernel: [20613.547524] amdgpu 0000:0a:00.0: GPU mode1 reset
2019-08-11T17:13:11.714757+08:00 MGDT-ROG kernel: [20614.055810] [drm] psp mode1 reset succeed 
2019-08-11T17:13:11.786740+08:00 MGDT-ROG kernel: [20614.128815] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
2019-08-11T17:13:11.786749+08:00 MGDT-ROG kernel: [20614.128943] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
2019-08-11T17:13:11.786751+08:00 MGDT-ROG kernel: [20614.129304] [drm] PSP is resuming...
2019-08-11T17:13:11.850739+08:00 MGDT-ROG kernel: [20614.192202] [drm] reserve 0x400000 from 0x8000c00000 for PSP TMR SIZE
2019-08-11T17:13:12.306756+08:00 MGDT-ROG kernel: [20614.649220] [drm] UVD and UVD ENC initialized successfully.
2019-08-11T17:13:12.406756+08:00 MGDT-ROG kernel: [20614.748872] [drm] VCE initialized successfully.
2019-08-11T17:13:12.926899+08:00 MGDT-ROG kernel: [20615.271942] [drm] Fence fallback timer expired on ring gfx
2019-08-11T17:13:13.438783+08:00 MGDT-ROG kernel: [20615.783826] [drm] Fence fallback timer expired on ring comp_1.0.0
2019-08-11T17:13:14.274773+08:00 MGDT-ROG kernel: [20616.616023] [drm] Fence fallback timer expired on ring uvd_1
2019-08-11T17:13:14.671435+08:00 MGDT-ROG tracker-store[4801]: OK
2019-08-11T17:13:14.672970+08:00 MGDT-ROG systemd[2481]: tracker-store.service: Succeeded.
2019-08-11T17:13:14.782896+08:00 MGDT-ROG kernel: [20617.127844] [drm] Fence fallback timer expired on ring uvd_enc_1.0
2019-08-11T17:13:15.294768+08:00 MGDT-ROG kernel: [20617.639836] [drm] Fence fallback timer expired on ring uvd_enc_1.1
2019-08-11T17:13:15.394759+08:00 MGDT-ROG kernel: [20617.739606] [drm] recover vram bo from shadow start
2019-08-11T17:13:15.397215+08:00 MGDT-ROG kernel: [20617.742231] [drm] recover vram bo from shadow done
2019-08-11T17:13:15.397227+08:00 MGDT-ROG kernel: [20617.742233] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397228+08:00 MGDT-ROG kernel: [20617.742234] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397231+08:00 MGDT-ROG kernel: [20617.742259] amdgpu 0000:0a:00.0: GPU reset(2) succeeded!
2019-08-11T17:13:15.397233+08:00 MGDT-ROG kernel: [20617.742289] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397235+08:00 MGDT-ROG kernel: [20617.742309] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397242+08:00 MGDT-ROG kernel: [20617.742314] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397262+08:00 MGDT-ROG kernel: [20617.742316] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397265+08:00 MGDT-ROG kernel: [20617.742318] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.397268+08:00 MGDT-ROG kernel: [20617.742320] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402744+08:00 MGDT-ROG kernel: [20617.743840] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402753+08:00 MGDT-ROG kernel: [20617.744006] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402755+08:00 MGDT-ROG kernel: [20617.744180] [drm] Skip scheduling IBs!
2019-08-11T17:13:15.402757+08:00 MGDT-ROG kernel: [20617.744450] [drm] Skip scheduling IBs!
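
To compare timeouts across reports, it can help to extract the ring name and the two sequence numbers from the amdgpu_job_timedout line. Here is a small Python sketch; the function name and dictionary keys are invented for illustration. As far as I understand it, the difference between emitted and signaled is the number of fences the ring had submitted but not completed when the timeout fired.

```python
import re

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" message.
_TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return one dict per ring-timeout line, with the pending fence count."""
    timeouts = []
    for line in lines:
        match = _TIMEOUT_RE.search(line)
        if match:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            timeouts.append({
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                "pending": emitted - signaled,
            })
    return timeouts
```

For the gfx timeout above this gives a pending count of 2 (12880414 - 12880412). It works on both the plain dmesg lines and the syslog lines with wall-clock timestamps, since it only searches for the message text.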

Comment 83 J. Andrew Lanz-O'Brien 2019-08-12 02:50:02 UTC

Can confirm that this bug is still present as of August 11, 2019 on kernel 5.2.8 with mesa 19.1.4. Borderlands 2 hard locked my system about 5 times tonight. Manually setting the power profile didn't help either, i.e. these two commands:

echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

Comment 84 Pierre-Eric Pelloux-Prayer 2019-08-12 08:16:49 UTC

(In reply to Mauro Gaspari from comment #82)
> (In reply to Pierre-Eric Pelloux-Prayer from comment #81)
> > Can anyone provide a apitrace/renderdoc capture that can reliably reproduce
> > the crash/freeze?
> 
> Hello, Sadly my freezes are hard to reproduce. Sometimes I can play for a
> day with no freeze, sometimes it freezes in 10 minutes, one hour, and so on.
> 

Ok.

This patch https://patchwork.freedesktop.org/series/64792/ might help: it won't fix any issue, but when a timeout is detected it should allow the soft recovery of the GPU.

Other things worth trying: setting AMD_DEBUG environment variables. I'd suggest:

   AMD_DEBUG=zerovram,nodma,nodpbb

There are others (see mesa/src/gallium/drivers/radeonsi/si_pipe.c) to try if these don't help.
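A minimal sketch of how the suggested flags could be applied to a single program run instead of being exported for the whole session. The wrapper name `run_with_amd_debug` is hypothetical; only the `AMD_DEBUG` variable and the three flag names come from the comment above.

```shell
#!/bin/sh
# Sketch only: run one command with AMD_DEBUG set, leaving the rest of the
# session untouched. run_with_amd_debug is a hypothetical helper name.
run_with_amd_debug() {
    flags=$1      # comma-separated list, e.g. zerovram,nodma,nodpbb
    shift
    # env sets the variable only for the command that follows.
    env AMD_DEBUG="$flags" "$@"
}

# Example (not executed here):
#   run_with_amd_debug zerovram,nodma,nodpbb glxgears
```

For a Steam title, the equivalent is normally the launch option `AMD_DEBUG=zerovram,nodma,nodpbb %command%`. Since the flag list lives in the radeonsi source, it should only be expected to affect OpenGL rendering.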

Comment 85 Mauro Gaspari 2019-08-12 14:10:11 UTC

(In reply to Pierre-Eric Pelloux-Prayer from comment #84)
> (In reply to Mauro Gaspari from comment #82)
> > (In reply to Pierre-Eric Pelloux-Prayer from comment #81)
> > > Can anyone provide a apitrace/renderdoc capture that can reliably reproduce
> > > the crash/freeze?
> > 
> > Hello, Sadly my freezes are hard to reproduce. Sometimes I can play for a
> > day with no freeze, sometimes it freezes in 10 minutes, one hour, and so on.
> > 
> 
> Ok.
> 
> This patch https://patchwork.freedesktop.org/series/64792/ might help: it
> won't fix any issue, but when a timeout is detected it should allow the soft
> recovery of the GPU.
> 
> Other things worth trying: setting AMD_DEBUG environment variables. I'd
> suggest:
> 
>    AMD_DEBUG=zerovram,nodma,nodpbb
> 
> There are others (see mesa/src/gallium/drivers/radeonsi/si_pipe.c) to try if
> these don't help.

Thank you.

I will first try to reintroduce the kernel parameters I previously used. Do you think those can help at all?

CPU
rcu_nocbs=0-15 (adjust to the number of cores of your cpu)
idle=nomwait
processor.max_cstate=5
pcie_aspm=off 

GPU
amdgpu.dc=1
amdgpu.vm_update_mode=0
amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff
amdgpu.vm_fault_stop=2
amdgpu.vm_debug=1
amdgpu.gpu_recovery=0

Comment 86 Pierre-Eric Pelloux-Prayer 2019-08-13 15:59:27 UTC

(In reply to Mauro Gaspari from comment #85)
> I will first try to reintroduce the kernel parameters I previously used.
> Do you think those can help at all?
> [...]
> GPU
> amdgpu.dc=1

Not needed: dc will be automatically enabled on recent GPUs

> amdgpu.vm_update_mode=0

Shouldn't be needed since it should be the default value. 

> amdgpu.dpm=-1

Not needed: this is the default value

> amdgpu.ppfeaturemask=0xffffffff

The only difference from the default value is that you're enabling Overdrive.
I'd suggest keeping the default parameter here.

> amdgpu.vm_fault_stop=2

I think this one isn't helpful (it's a debugging tool)

> amdgpu.vm_debug=1

This one can help.

> amdgpu.gpu_recovery=0

No opinion on this one :)
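To check which of these values are actually in effect, the loaded module's parameters can be read back from sysfs. This is a minimal sketch, assuming the standard `/sys/module/<module>/parameters` layout; `show_amdgpu_params` is a hypothetical helper name, and a parameter the running kernel does not expose is reported as such rather than guessed.

```shell
#!/bin/sh
# Sketch only: print the current value of the amdgpu parameters discussed
# above. show_amdgpu_params is a hypothetical helper name; the directory
# argument exists so the function can be pointed at another location.
show_amdgpu_params() {
    param_dir=${1:-/sys/module/amdgpu/parameters}
    for p in dc vm_update_mode dpm ppfeaturemask vm_fault_stop vm_debug gpu_recovery; do
        if [ -r "$param_dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$param_dir/$p")"
        else
            printf '%s=<not exposed>\n' "$p"
        fi
    done
}

# Example (not executed here):
#   show_amdgpu_params
```

Comparing this output before and after removing a parameter from the kernel command line shows whether that parameter was only restating a default.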

Comment 87 Mauro Gaspari 2019-08-13 16:19:27 UTC

(In reply to Pierre-Eric Pelloux-Prayer from comment #86)
> (In reply to Mauro Gaspari from comment #85)
> > I will first try to reintroduce the kernel parameters I previously used.
> > Do you think those can help at all?
> > [...]
> > GPU
> > amdgpu.dc=1
> 
> Not needed: dc will be automatically enabled on recent GPU
> 
> > amdgpu.vm_update_mode=0
> 
> Shouldn't be needed since it should be the default value. 
> 
> > amdgpu.dpm=-1
> 
> Not needed: this is the default value
> 
> > amdgpu.ppfeaturemask=0xffffffff
> 
> The only difference with the default value is that you're enabling Overdrive.
> I'd suggest to keep the default parameter here.
> 
> > amdgpu.vm_fault_stop=2
> 
> I think this one isn't helpful (it's a debugging tool)
> 
> > amdgpu.vm_debug=1
> 
> This one can help.
> 
> > amdgpu.gpu_recovery=0
> 
> No opinion on this one :)

Thank you!

I am currently testing on Ubuntu Budgie with the Valve-released Mesa-ACO, and so far I have had no freezes or crashes: a couple of days without incidents. But as I posted previously, it is all a bit random, so I think I will need to use this for at least a week.

I will report back soon with my findings.

Comment 88 Sam 2019-08-30 19:01:52 UTC

I have recently started to get more frequent freezes, now even on Vulkan, on kernel 5.2.10.

The workaround of the power profile still works (for me) and is the only way to avoid them:

# echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
# echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
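The two commands above can be wrapped so that they fail loudly when run on the wrong card or without root. This is a minimal sketch: `force_sclk_level` is a hypothetical helper name, `card0` is assumed to be the Vega card (it may not be on multi-GPU systems), and level 7 is simply the value used in this thread.

```shell
#!/bin/sh
# Sketch only: pin the GPU to one sclk level through the sysfs files quoted
# in this thread. force_sclk_level is a hypothetical helper name.
force_sclk_level() {
    level=$1
    device_dir=${2:-/sys/class/drm/card0/device}

    # Refuse to touch anything unless both files are present and writable.
    for f in power_dpm_force_performance_level pp_dpm_sclk; do
        if [ ! -w "$device_dir/$f" ]; then
            echo "cannot write $device_dir/$f (not root, or wrong card?)" >&2
            return 1
        fi
    done

    # Same order as in the comment: manual mode first, then the level.
    echo manual > "$device_dir/power_dpm_force_performance_level" || return 1
    echo "$level" > "$device_dir/pp_dpm_sclk"
}

# Example (not executed here, needs root):
#   force_sclk_level 7
```

Writing `auto` to `power_dpm_force_performance_level` should return the card to automatic clock selection. The setting is not persistent, so it has to be reapplied after each boot.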

Comment 89 Jaap Buurman 2019-08-31 01:00:23 UTC

Freezes are getting way more frequent for me as well :(

Comment 90 Mauro Gaspari 2019-08-31 05:21:20 UTC

@Sam and @Jaap Buurman

Can you please help and post system info regarding your crash? I hope that with more detailed reports, we can get better help.

Example:

OS Info can be taken from neofetch:
System info:
OS: openSUSE Tumbleweed
Kernel: 5.2.10-1-default
Resolution: 3440x1440
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: AMD ATI Radeon VII 
Memory: 6308MiB / 64387MiB 


Mesa info can be taken from this command:
glxinfo | grep "OpenGL version" 
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.5


Game being played: Eve Online
Native or Wine or Wine+DXVK: Wine+DXVK Directx11


Crash type: Game crash? Full System freeze? System freeze but still can drop to tty?



DMESG output after the crash:
sudo dmesg | grep amdgpu



systemd logs output after the crash (If your system did not freeze and you can get it before reboot):
sudo journalctl -b | grep amdgpu


systemd logs output after the crash (If your system froze and you get logs after reboot):
sudo journalctl -b -1 | grep amdgpu

If your distribution does not keep persistent systemd logs by default, you can enable them following your distribution's documentation. Example for openSUSE:
https://www.suse.com/documentation/sles-12/book_sle_admin/data/journalctl_persistent.html
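The checklist above could be gathered into one file with a small script. This is a minimal sketch under stated assumptions: `filter_gpu_lines` and `collect_gpu_report` are hypothetical helper names, the output file name is arbitrary, and the filter is deliberately wider than `grep amdgpu` because some related kernel lines (for example `[drm] Skip scheduling IBs!`) do not contain the word amdgpu.

```shell
#!/bin/sh
# Sketch only: collect the information requested above into one text file.
# Both function names are hypothetical.

# Keep kernel lines that mention the driver, the DRM core, the GPU
# scheduler, or PCIe AER errors. Reads stdin, writes stdout.
filter_gpu_lines() {
    grep -E 'amdgpu|\[drm|gpu_sched|pcieport.*AER'
}

# Run as root so that dmesg and the journal are readable.
collect_gpu_report() {
    out=${1:-gpu-report.txt}
    {
        echo "== kernel =="
        uname -r
        echo "== mesa =="
        glxinfo 2>/dev/null | grep "OpenGL version"
        echo "== dmesg (current boot) =="
        dmesg 2>/dev/null | filter_gpu_lines
        echo "== journal (previous boot) =="
        journalctl -k -b -1 2>/dev/null | filter_gpu_lines
    } > "$out"
}

# Example (not executed here):
#   collect_gpu_report /tmp/gpu-report.txt
```

The previous-boot section stays empty unless the journal is persistent. Game name, Wine/DXVK versions, and crash type still have to be added by hand.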

Comment 91 Wilko Bartels 2019-08-31 22:38:24 UTC

How big are your swap partitions, guys? Just toying around here :-)

Comment 92 Jaap Buurman 2019-09-01 22:49:57 UTC

(In reply to Mauro Gaspari from comment #90)
> @Sam and @Jaap Buurman
> 
> Can you please help and post system info regarding your crash? I hope that
> with more detailed reports, we can get better help.

OS: Arch Linux x86_64
Host: AB350-Gaming 3
Kernel: 5.2.11-arch1-1-ARCH
Uptime: 1 min
Packages: 1229 (pacman)
Shell: bash 5.0.9
Terminal: /dev/pts/0
CPU: AMD Ryzen 7 1800X (16) @ 3.600GHz
GPU: AMD ATI Radeon RX Vega 56/64
Memory: 1178MiB / 48304MiB



> Mesa info can be taken from this command:
> glxinfo | grep "OpenGL version" 

[jaap@Jaap-Desktop ~]$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.3.0-devel (git-db73bde35c)

I am running this version because I was trying out the mesa-aco from the AUR. I experienced the same crashes with the regular mesa drivers from Arch's official repositories.

> Game being played: 

World of Warcraft: Classic Wine/DXVK 1.3.2

> Crash type: Game crash? Full System freeze? System freeze but still can drop
> to tty?

GPU doesn't successfully reset. Cannot drop to a different tty. However, I am able to access logs via SSH. Full dmesg log: https://pastebin.com/E2071wHF
 
> DMESG output after the crash:
> sudo dmesg | grep amdgpu

https://pastebin.com/2kWpeP1y

> systemd logs output after the crash (If your system did not freeze and you
> can get it before reboot):
> sudo journalctl -b | grep amdgpu

https://pastebin.com/4e1PkJ39

> systemd logs output after the crash (If your system froze and you get logs
> after reboot):
> sudo journalctl -b -1 | grep amdgpu

https://pastebin.com/4mqXNsNQ



Hopefully this information is detailed enough to assist in tracking down the root cause of the issue!

Comment 93 Wilko Bartels 2019-09-02 07:48:19 UTC

(In reply to Wilko Bartels from comment #91)
> how big are your swap partitions guys? just toying around here :-)

I'd also like to know whether anyone else on Arch has tested amdgpu-pro yet.
I have only played for 3 hours so far. We all know that doesn't mean anything :-)
But fingers crossed.
I also have no idea how to confirm it's even being used. The kernel module shows as amdgpu in both cases, right?

Comment 94 Mauro Gaspari 2019-09-02 10:07:42 UTC

(In reply to Wilko Bartels from comment #93)
> (In reply to Wilko Bartels from comment #91)
> > how big are your swap partitions guys? just toying around here :-)
> 
> also i wanna know if anyone else on arch tested the amdgpu-pro yet?
> i played only 3 hours now. we all know that doesnt mean anything :-)
> but fingers crossed.
> i also have no idea how to confirm its even used. the kernel module showing
> amdgpu in both circumstances right?

Hello,
I am testing on multiple distributions with different Mesa drivers. Swap size is 2GB to 8GB depending on the distro. With 64GB of RAM, my swap is constantly empty.
So far the best performance I have had is on Ubuntu Budgie 18.04 with the Mesa-ACO build released by Valve. I have had no crashes in quite some time, but I have not had much time to play lately, so I need more time to test.

Regarding AMDGPU-PRO, I tested it on Ubuntu a very long time ago, and it was quite bad. But I think it makes sense to test and compare. I will install another Ubuntu Budgie 18.04 on a separate SSD, use it with AMDGPU-PRO, and see whether the same issues occur as with AMDGPU.

Thanks, and let me know how AMDGPU-PRO works on arch.

Comment 95 koala_man 2019-09-04 20:41:33 UTC

I am also seeing this issue on my stock Ubuntu. 

>OS Info can be taken from neofetch
OS: Ubuntu 19.04 x86_64
Host: All Series
Kernel: 5.0.0-27-generic
Uptime: 8 mins
Packages: 2671 (dpkg), 6 (flatpak), 10 (snap)
Shell: bash 5.0.3
Terminal: /dev/pts/1
CPU: Intel i5-4690 (4) @ 3.900GHz
GPU: Intel HD Graphics
GPU: AMD ATI Radeon RX Vega 64
Memory: 861MiB / 23976MiB

> glxinfo | grep "OpenGL version" 
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.8

>Game being played
glxgears in a window, no other applications running

>Native or Wine or Wine+DXVK
Native

> Crash type: 
X crashed with colorful pattern, stopped responding to Ctrl-Alt-Fx. `ssh` still works. X server does not accept new commands, e.g. `DISPLAY=:0 glxgears`

>sudo dmesg | grep amdgpu
[    2.328917] [drm] amdgpu kernel modesetting enabled.
[    2.331916] fb0: switching to amdgpudrmfb from EFI VGA
[    2.333325] amdgpu 0000:03:00.0: No more image in the PCI ROM
[    2.333400] amdgpu 0000:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[    2.333401] amdgpu 0000:03:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    2.333403] amdgpu 0000:03:00.0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[    2.333866] [drm] amdgpu: 8176M of VRAM memory ready
[    2.333870] [drm] amdgpu: 8176M of GTT memory ready.
[    2.871622] fbcon: amdgpudrmfb (fb0) is primary device
[    2.929315] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[    2.944233] amdgpu 0000:03:00.0: ring gfx uses VM inv eng 0 on hub 0
[    2.944249] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    2.944264] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    2.944279] amdgpu 0000:03:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    2.944294] amdgpu 0000:03:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    2.944308] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    2.944323] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    2.944338] amdgpu 0000:03:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    2.944353] amdgpu 0000:03:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    2.944368] amdgpu 0000:03:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    2.944382] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[    2.944396] amdgpu 0000:03:00.0: ring page0 uses VM inv eng 1 on hub 1
[    2.944410] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[    2.944424] amdgpu 0000:03:00.0: ring page1 uses VM inv eng 5 on hub 1
[    2.944438] amdgpu 0000:03:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[    2.944452] amdgpu 0000:03:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[    2.944467] amdgpu 0000:03:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[    2.944482] amdgpu 0000:03:00.0: ring vce0 uses VM inv eng 9 on hub 1
[    2.944496] amdgpu 0000:03:00.0: ring vce1 uses VM inv eng 10 on hub 1
[    2.944510] amdgpu 0000:03:00.0: ring vce2 uses VM inv eng 11 on hub 1
[    2.945073] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:03:00.0 on minor 1
[  288.676190] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=72560, emitted seq=72562
[  288.676350] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process glxgears pid 2963 thread glxgears:cs0 pid 2964
[  288.676358] amdgpu 0000:03:00.0: GPU reset begin!
[  288.759763] amdgpu 0000:03:00.0: GPU reset
[  289.208563] RIP: 0010:amdgpu_cs_ioctl+0xaa3/0x1320 [amdgpu]
[  289.208604]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[  289.208647]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[  289.208673]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  289.208690] Modules linked in: aufs overlay cmac bnep binfmt_misc nls_iso8859_1 snd_hda_codec_ca0132 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi btusb input_leds btrtl btbcm btintel bluetooth eeepc_wmi asus_wmi snd_seq ecdh_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp sparse_keymap kvm_intel intel_cstate intel_rapl_perf snd_seq_device snd_timer wmi_bmof snd soundcore mei_me mei tpm_infineon mac_hid acpi_pad sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 algif_skcipher af_alg hid_generic usbhid hid dm_crypt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel amdgpu i915 kvmgt vfio_mdev mdev chash aes_x86_64 amd_iommu_v2 crypto_simd vfio_iommu_type1 gpu_sched cryptd glue_helper ttm vfio ahci libahci i2c_i801 kvm mxm_wmi lpc_ich irqbypass i2c_algo_bit pata_acpi e1000e drm_kms_helper syscopyarea sysfillrect
[  289.208743] RIP: 0010:amdgpu_cs_ioctl+0xaa3/0x1320 [amdgpu]
[  289.395715] amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
[  289.395813] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[  289.969158] amdgpu 0000:03:00.0: GPU reset(2) succeeded!
[  289.969333] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  289.969519] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!


>sudo journalctl -b | grep amdgpu

Same as dmesg output (after dropping timestamps), verified by vimdiff.

>Other

No swap, 144hz monitor, GPU was very hot to the touch considering it had only run glxgears @ 144 fps for 5 minutes.
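When comparing logs from different reporters, it can help to reduce each `ring ... timeout` line to the ring name and the gap between the two sequence numbers. This is a minimal sketch, assuming the message format shown in the dmesg output above and assuming that "emitted minus signaled" can be read as the number of fences still outstanding when the timeout fired; `summarize_ring_timeouts` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: summarise amdgpu ring timeout lines read from stdin.
# summarize_ring_timeouts is a hypothetical helper name.
summarize_ring_timeouts() {
    # Pull out "<ring> <signaled> <emitted>" from each matching line.
    sed -n 's/.*ring \([^ ]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/\1 \2 \3/p' |
    while read -r ring signaled emitted; do
        printf '%s pending=%s\n' "$ring" "$((emitted - signaled))"
    done
}

# Example (not executed here):
#   sudo dmesg | summarize_ring_timeouts
```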

Comment 96 Rodney A Morris 2019-09-07 03:48:56 UTC

(In reply to Mauro Gaspari from comment #90)

I am experiencing periodic lockups with various games, including Hearts of Iron IV, BATTLETECH, and Stellaris all being played through Steam.  Below is the most recent crash from playing less than 5 minutes of Hearts of Iron IV.



> 
> OS Info can be taken from neofetch:
> System info:

OS: Fedora release 30 (Thirty) x86_64
Kernel: 5.2.11-200.fc30.x86_64+debug
Uptime: 11 mins
Packages: 2198 (rpm), 27 (flatpak)
Shell: bash 5.0.7
Resolution: 2560x1440
DE: GNOME 3.32.2
WM: GNOME Shell
WM Theme: Adwaita
Theme: Adapta-Nokto-Eta [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: tilix
CPU: Intel i7-6850K (12) @ 4.000GHz
GPU: AMD ATI Radeon RX Vega 56/64
Memory: 1666MiB / 32045MiB
 
> 
> Mesa info can be taken from this command:
> glxinfo | grep "OpenGL version" 

OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.5
 
> 
> Game being played: 

Hearts of Iron IV through Steam for Linux

> Native or Wine or Wine+DXVK:

Native

> 
> Crash type: Game crash? Full System freeze? System freeze but still can drop
> to tty?

The screen suddenly goes black while the music continues to play for less than a minute; the music then begins to loop, and the computer reboots.

> 
> DMESG output after the crash:
> sudo dmesg | grep amdgpu

Here is the pertinent part of dmesg with kernel debugging turned on. Some of the information about the crash would not be captured by grepping for amdgpu. The entire dmesg is provided as an attachment.

[46957.810300] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[46962.941366] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2446766, emitted seq=2446767
[46962.941453] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hoi4 pid 24014 thread hoi4:cs0 pid 24015
[46962.941459] amdgpu 0000:06:00.0: GPU reset begin!

[46962.942698] ======================================================
[46962.942700] WARNING: possible circular locking dependency detected
[46962.942702] 5.2.11-200.fc30.x86_64+debug #1 Not tainted
[46962.942704] ------------------------------------------------------
[46962.942705] kworker/3:0/20416 is trying to acquire lock:
[46962.942708] 00000000a4a3593f (&(&ring->fence_drv.lock)->rlock){-.-.}, at: dma_fence_remove_callback+0x1a/0x60
[46962.942717] 
               but task is already holding lock:
[46962.942718] 00000000d45cbf2b (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x130 [gpu_sched]
[46962.942724] 
               which lock already depends on the new lock.

[46962.942725] 
               the existing dependency chain (in reverse order) is:
[46962.942727] 
               -> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
[46962.942735]        _raw_spin_lock_irqsave+0x49/0x83
[46962.942738]        drm_sched_process_job+0x4d/0x180 [gpu_sched]
[46962.942741]        dma_fence_signal+0x111/0x1a0
[46962.942794]        amdgpu_fence_process+0xa3/0x100 [amdgpu]
[46962.942858]        sdma_v4_0_process_trap_irq+0x8d/0xa0 [amdgpu]
[46962.942918]        amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
[46962.942978]        amdgpu_ih_process+0x8d/0x110 [amdgpu]
[46962.943038]        amdgpu_irq_handler+0x1b/0x50 [amdgpu]
[46962.943043]        __handle_irq_event_percpu+0x3f/0x290
[46962.943046]        handle_irq_event_percpu+0x31/0x80
[46962.943048]        handle_irq_event+0x34/0x51
[46962.943053]        handle_edge_irq+0x83/0x1a0
[46962.943057]        handle_irq+0x1c/0x30
[46962.943059]        do_IRQ+0x61/0x120
[46962.943063]        ret_from_intr+0x0/0x22
[46962.943067]        cpuidle_enter_state+0xc9/0x450
[46962.943069]        cpuidle_enter+0x29/0x40
[46962.943074]        do_idle+0x1ec/0x280
[46962.943076]        cpu_startup_entry+0x19/0x20
[46962.943079]        start_secondary+0x189/0x1e0
[46962.943083]        secondary_startup_64+0xa4/0xb0
[46962.943087] 
               -> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
[46962.943095]        lock_acquire+0xa2/0x1b0
[46962.943105]        _raw_spin_lock_irqsave+0x49/0x83
[46962.943109]        dma_fence_remove_callback+0x1a/0x60
[46962.943114]        drm_sched_stop+0x59/0x130 [gpu_sched]
[46962.943225]        amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
[46962.943338]        amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[46962.943413]        amdgpu_job_timedout+0x109/0x130 [amdgpu]
[46962.943418]        drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[46962.943421]        process_one_work+0x272/0x5e0
[46962.943423]        worker_thread+0x50/0x3b0
[46962.943427]        kthread+0x108/0x140
[46962.943431]        ret_from_fork+0x3a/0x50
[46962.943432] 
               other info that might help us debug this:

[46962.943435]  Possible unsafe locking scenario:

[46962.943437]        CPU0                    CPU1
[46962.943438]        ----                    ----
[46962.943439]   lock(&(&sched->job_list_lock)->rlock);
[46962.943441]                                lock(&(&ring->fence_drv.lock)->rlock);
[46962.943443]                                lock(&(&sched->job_list_lock)->rlock);
[46962.943445]   lock(&(&ring->fence_drv.lock)->rlock);
[46962.943447] 
                *** DEADLOCK ***

[46962.943449] 5 locks held by kworker/3:0/20416:
[46962.943450]  #0: 0000000043c92b99 ((wq_completion)events){+.+.}, at: process_one_work+0x1e9/0x5e0
[46962.943456]  #1: 000000000c360f0c ((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at: process_one_work+0x1e9/0x5e0
[46962.943459]  #2: 000000007a135814 (&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x39 [amdgpu]
[46962.943543]  #3: 00000000e83f7d6b (&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
[46962.943614]  #4: 00000000d45cbf2b (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x130 [gpu_sched]
[46962.943620] 
               stack backtrace:
[46962.943629] CPU: 3 PID: 20416 Comm: kworker/3:0 Not tainted 5.2.11-200.fc30.x86_64+debug #1
[46962.943631] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
[46962.943636] Workqueue: events drm_sched_job_timedout [gpu_sched]
[46962.943638] Call Trace:
[46962.943648]  dump_stack+0x85/0xc0
[46962.943654]  print_circular_bug.cold+0x15c/0x195
[46962.943658]  __lock_acquire+0x167c/0x1c90
[46962.943664]  lock_acquire+0xa2/0x1b0
[46962.943668]  ? dma_fence_remove_callback+0x1a/0x60
[46962.943674]  _raw_spin_lock_irqsave+0x49/0x83
[46962.943677]  ? dma_fence_remove_callback+0x1a/0x60
[46962.943680]  dma_fence_remove_callback+0x1a/0x60
[46962.943684]  drm_sched_stop+0x59/0x130 [gpu_sched]
[46962.943764]  amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
[46962.943847]  amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[46962.943923]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[46962.943930]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[46962.943934]  process_one_work+0x272/0x5e0
[46962.943938]  worker_thread+0x50/0x3b0
[46962.943942]  kthread+0x108/0x140
[46962.943945]  ? process_one_work+0x5e0/0x5e0
[46962.943948]  ? kthread_park+0x80/0x80
[46962.943952]  ret_from_fork+0x3a/0x50
[46962.961034] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[46962.961044] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[46962.961048] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[46962.961051] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[46962.961149] pcieport 0000:00:03.0: AER: Device recovery failed
[46963.955209] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=95391072, emitted seq=95391072
[46963.955328] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[46963.955336] amdgpu 0000:06:00.0: GPU reset begin!
[46968.050083] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[46973.170223] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
[46983.410080] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[46993.650098] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
[46993.962192] amdgpu: [powerplay] No response from smu
[46993.962195] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[46994.277773] amdgpu: [powerplay] No response from smu
[46994.593416] amdgpu: [powerplay] No response from smu
[46994.593420] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[46994.908354] amdgpu: [powerplay] No response from smu
[46995.223718] amdgpu: [powerplay] No response from smu
[46995.223722] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[46995.286504] [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:634
[46995.286506] ------------[ cut here ]------------
[46995.286605] WARNING: CPU: 3 PID: 20416 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:329 generic_reg_wait.cold+0x31/0x53 [amdgpu]
[46995.286606] Modules linked in: vhost_net vhost tap rfcomm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables bnep nct6775 hwmon_vid intel_rapl vfat fat arc4 x86_pkg_temp_thermal intel_powerclamp coretemp fuse kvm_intel kvm iwlmvm irqbypass iTCO_wdt iTCO_vendor_support mac80211 crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel intel_cstate snd_hda_codec_generic iwlwifi snd_hda_codec_hdmi ledtrig_audio intel_uncore snd_hda_intel intel_rapl_perf cfg80211 snd_hda_codec btusb mxm_wmi snd_hda_core btrtl btbcm snd_hwdep btintel snd_seq i2c_i801 lpc_ich bluetooth
[46995.286626]  snd_seq_device joydev snd_pcm ecdh_generic snd_timer rfkill ecc mei_me snd mei soundcore pcc_cpufreq binfmt_misc auth_rpcgss sunrpc amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper crc32c_intel igb uas drm usb_storage dca mpt3sas i2c_algo_bit e1000e nvme raid_class nvme_core scsi_transport_sas wmi
[46995.286638] CPU: 3 PID: 20416 Comm: kworker/3:0 Not tainted 5.2.11-200.fc30.x86_64+debug #1
[46995.286639] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
[46995.286643] Workqueue: events drm_sched_job_timedout [gpu_sched]
[46995.286682] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
[46995.286684] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 78 93 80 c0 e8 45 fd a0 ca 83 7b 20 01 0f 84 27 11 fe ff 48 c7 c7 70 92 80 c0 e8 2f fd a0 ca <0f> 0b e9 14 11 fe ff 48 c7 c7 70 92 80 c0 89 54 24 04 e8 18 fd a0
[46995.286685] RSP: 0018:ffff9cd009b3f728 EFLAGS: 00010246
[46995.286687] RAX: 0000000000000024 RBX: ffff8ada6be8a780 RCX: 0000000000000006
[46995.286688] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8ada7ebd9c80
[46995.286689] RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000
[46995.286690] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000035af
[46995.286691] R13: 0000000000000dad R14: 0000000000000001 R15: 0000000000000dac
[46995.286692] FS:  0000000000000000(0000) GS:ffff8ada7ea00000(0000) knlGS:0000000000000000
[46995.286694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46995.286695] CR2: 0000085777c78000 CR3: 00000003cb612005 CR4: 00000000003606e0
[46995.286696] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[46995.286697] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[46995.286698] Call Trace:
[46995.286741]  dce_mi_free_dmif+0xef/0x150 [amdgpu]
[46995.286780]  dce110_reset_hw_ctx_wrap+0x14a/0x1e0 [amdgpu]
[46995.286819]  dce110_apply_ctx_to_hw+0x4a/0x490 [amdgpu]
[46995.286843]  ? amdgpu_pm_compute_clocks.part.0+0xcb/0x610 [amdgpu]
[46995.286882]  ? dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
[46995.286920]  dc_commit_state+0x262/0x580 [amdgpu]
[46995.286925]  ? vsnprintf+0x3aa/0x4f0
[46995.286965]  amdgpu_dm_atomic_commit_tail+0xc34/0x1970 [amdgpu]
[46995.286971]  ? console_unlock+0x363/0x5d0
[46995.286976]  ? __irq_work_queue_local+0x50/0x60
[46995.286977]  ? irq_work_queue+0x4d/0x60
[46995.286979]  ? wake_up_klogd+0x37/0x40
[46995.286984]  ? wait_for_completion_timeout+0x4c/0x190
[46995.286987]  ? _raw_spin_unlock_irq+0x29/0x40
[46995.286989]  ? wait_for_completion_timeout+0x75/0x190
[46995.287016]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[46995.287021]  commit_tail+0x3c/0x70 [drm_kms_helper]
[46995.287026]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[46995.287031]  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[46995.287035]  drm_atomic_helper_suspend+0x66/0x100 [drm_kms_helper]
[46995.287076]  dm_suspend+0x20/0x60 [amdgpu]
[46995.287098]  amdgpu_device_ip_suspend_phase1+0x91/0xc0 [amdgpu]
[46995.287123]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[46995.287164]  amdgpu_device_pre_asic_reset+0x1f7/0x20c [amdgpu]
[46995.287204]  amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[46995.287242]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[46995.287246]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[46995.287249]  process_one_work+0x272/0x5e0
[46995.287252]  worker_thread+0x50/0x3b0
[46995.287256]  kthread+0x108/0x140
[46995.287258]  ? process_one_work+0x5e0/0x5e0
[46995.287260]  ? kthread_park+0x80/0x80
[46995.287263]  ret_from_fork+0x3a/0x50
[46995.287267] irq event stamp: 6288284
[46995.287269] hardirqs last  enabled at (6288283): [<ffffffff8bb04d8b>] _raw_spin_unlock_irqrestore+0x4b/0x60
[46995.287271] hardirqs last disabled at (6288284): [<ffffffff8bb05533>] _raw_spin_lock_irqsave+0x23/0x83
[46995.287273] softirqs last  enabled at (6288276): [<ffffffff8be0035d>] __do_softirq+0x35d/0x468
[46995.287276] softirqs last disabled at (6288269): [<ffffffff8b0f07a2>] irq_exit+0x102/0x110
[46995.287277] ---[ end trace 6a2158c4cfef5172 ]---
[46995.603082] amdgpu: [powerplay] No response from smu
[46995.918767] amdgpu: [powerplay] No response from smu
[46995.918770] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0x0
[46996.233769] amdgpu: [powerplay] No response from smu
[46996.549255] amdgpu: [powerplay] No response from smu
[46996.549258] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0x0
[46996.865320] amdgpu: [powerplay] No response from smu
[46997.181203] amdgpu: [powerplay] No response from smu
[46997.181206] amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0x0
[46997.495804] amdgpu: [powerplay] No response from smu
[46997.811227] amdgpu: [powerplay] No response from smu
[46997.811231] amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xa0b000, error code: 0x0
[46998.126794] amdgpu: [powerplay] No response from smu
[46998.442559] amdgpu: [powerplay] No response from smu
[46998.442561] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[46998.756884] amdgpu: [powerplay] No response from smu
[46999.072680] amdgpu: [powerplay] No response from smu
[46999.072684] amdgpu: [powerplay] Failed message: 0x4, input parameter: 0x400, error code: 0x0
[46999.388310] amdgpu: [powerplay] No response from smu
[46999.704067] amdgpu: [powerplay] No response from smu
[46999.704069] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[47000.019626] amdgpu: [powerplay] No response from smu
[47000.334247] amdgpu: [powerplay] No response from smu
[47000.334251] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[47000.350026] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.350043] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.350052] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.350061] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.350202] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.367437] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.367443] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.367444] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.367446] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.367486] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.384977] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.384982] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.384983] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.384985] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.385055] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.402521] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.402530] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.402532] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.402535] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.402578] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.420068] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.420079] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.420085] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.420090] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.420186] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.437608] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.437617] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.437621] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.437625] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.437726] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.455143] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.455151] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.455154] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.455157] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.455209] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.472688] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.472698] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.472703] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.472708] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.472826] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.490225] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.490232] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.490236] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.490239] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.490289] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.507760] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
[47000.735787] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.735791] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[47000.735793] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[47000.735824] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.735826] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
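
For anyone reading the AER lines above: the `error status/mask=00004000/00000000` value is a bit field, and the kernel has already named the one set bit (`[14] CmpltTO`). Below is a minimal sketch of how that value can be decoded by hand. The function name and the table are my own illustration, not kernel code; the bit descriptions are the PCIe Uncorrectable Error Status names as I understand them, and only bit 14 is confirmed by the log itself.

```python
# Hypothetical helper: decode a PCIe AER "error status/mask" pair from dmesg.
# Bit descriptions are assumed from the PCIe Uncorrectable Error Status
# register layout; only bit 14 (CmpltTO) is confirmed by the log above.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_aer_status(status_hex, mask_hex="00000000"):
    """Return (bit, description) for every set bit that is not masked."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit in range(32):
        flag = 1 << bit
        if status & flag and not mask & flag:
            decoded.append((bit, UNCORRECTABLE_BITS.get(bit, "unknown/reserved")))
    return decoded


if __name__ == "__main__":
    # Values copied from the "error status/mask=" line in the log.
    for bit, name in decode_aer_status("00004000", "00000000"):
        print(f"bit {bit}: {name}")
```

With the values from the log this reports only bit 14, which matches the `CmpltTO` line the kernel printed: the root port at 0000:00:03.0 timed out waiting for a completion.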


> systemd logs output after the crash (If your system froze and you get logs
> after reboot):

Sep 06 08:36:58 ezra.blanchardmorris.net kernel: Command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:36:58 ezra.blanchardmorris.net kernel: Kernel command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:36:59 ezra.blanchardmorris.net dracut-cmdline[361]: Using kernel command line parameters: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: [drm] amdgpu kernel modesetting enabled.
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfb600000 -> 0xfb67ffff
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: fb0: switching to amdgpudrmfb from EFI VGA
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: vgaarb: deactivate vga console
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: No more image in the PCI ROM
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: [drm] amdgpu: 8176M of VRAM memory ready
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: [drm] amdgpu: 8176M of GTT memory ready.
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: fbcon: amdgpudrmfb (fb0) is primary device
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: fb0: amdgpudrmfb frame buffer device
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring gfx uses VM inv eng 0 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring sdma0 uses VM inv eng 0 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring page0 uses VM inv eng 1 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring sdma1 uses VM inv eng 4 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring page1 uses VM inv eng 5 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce0 uses VM inv eng 9 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce1 uses VM inv eng 10 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce2 uses VM inv eng 11 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: [drm] Initialized amdgpu 3.32.0 20150101 for 0000:06:00.0 on minor 0
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: Kernel command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]:         loading driver: amdgpu
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (==) Matched amdgpu as autoconfigured driver 0
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (II) LoadModule: "amdgpu"
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (II) Loading /usr/lib64/xorg/modules/drivers/amdgpu_drv.so
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (II) Module amdgpu: vendor="X.Org Foundation"
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]:         All GPUs supported by the amdgpu kernel driver
Sep 06 16:13:18 ezra.blanchardmorris.net net.lutris.Lutris.desktop[2234]: 2019-09-06 16:13:18,530: GPU: 1002:687F 1002:0B36 using amdgpu drivers
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2446766, emitted seq=2446767
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hoi4 pid 24014 thread hoi4:cs0 pid 24015
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset begin!
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_fence_process+0xa3/0x100 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        sdma_v4_0_process_trap_irq+0x8d/0xa0 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_ih_process+0x8d/0x110 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_irq_handler+0x1b/0x50 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:        amdgpu_job_timedout+0x109/0x130 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  #2: 000000007a135814 (&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x39 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  #3: 00000000e83f7d6b (&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  amdgpu_job_timedout+0x109/0x130 [amdgpu]
Sep 06 21:39:40 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=95391072, emitted seq=95391072
Sep 06 21:39:40 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 06 21:39:40 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset begin!
Sep 06 21:39:49 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
Sep 06 21:40:11 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
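
To make the ring timeout lines above easier to compare between crashes, here is a rough sketch that pulls out the ring name and the two sequence numbers. The regex and the `pending` field are my own illustration; I am assuming that `emitted - signaled` is the number of fences submitted but not yet signaled, which the log does not state explicitly.

```python
import re

# Matches lines like:
#   "... ring gfx timeout, signaled seq=2446766, emitted seq=2446767"
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Extract ring timeouts; 'pending' is emitted minus signaled (assumed meaning)."""
    results = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append(
            {
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                "pending": emitted - signaled,
            }
        )
    return results


if __name__ == "__main__":
    # Two lines copied from the journal excerpt above.
    sample = [
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
        "signaled seq=2446766, emitted seq=2446767",
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, "
        "signaled seq=95391072, emitted seq=95391072",
    ]
    for entry in parse_ring_timeouts(sample):
        print(entry)
```

On the two timeouts in this excerpt, the gfx ring has one fence outstanding while page1 has none.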

I will try to run apitrace on Hearts of Iron IV to capture more information.  Please let me know if I can be of further assistance in squashing this annoying bug, for example by providing crash information with the mesa debug packages installed.

Comment 97 Rodney A Morris 2019-09-07 03:50:40 UTC

Created attachment 145290 [details]
dmesg for crash

dmesg from crash while playing Hearts of Iron IV using Steam.  Related to comment #96.

Comment 98 koala_man 2019-09-12 20:08:21 UTC

(In reply to koala_man from comment #95)
> I am also seeing this issue on my stock Ubuntu. 

In my case it appears to have been faulty hardware. I tried it on Windows 10 with the latest drivers and still got crashes and reboots. Performance throttling did not help. I swapped out the GPU and haven't seen any crashes since.

Comment 99 Rodney A Morris 2019-09-15 01:16:19 UTC

Created attachment 145366 [details]
apitrace of Hearts of Iron IV hard lock

Apitrace from a hard lock while playing Hearts of Iron IV without Steam.  Replaying this trace will hard lock the computer, though inconsistently.  I've replayed the trace three times, and the replay hard locked the computer once.

Comment 100 Rodney A Morris 2019-09-15 01:20:36 UTC

(In reply to Rodney A Morris from comment #99)
> Created attachment 145366 [details]
> apitrace of Hearts of Iron IV hard lock
> 
> Apitrace from hard lock playing Hearts of Iron IV without Steam.  The replay
> from this trace will hard lock the computer, though inconsistently.  I've
> replayed the trace three times. The replay hard locked computer one time.

neofetch from hardlock:

          /:-------------:\          
       :-------------------::        -------------------------------- 
     :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
   /-----------omMMMNNNMMD  ---:     Kernel: 5.2.13-200.fc30.x86_64 
  :-----------sMMMMNMNMP.    ---:    Uptime: 25 mins 
 :-----------:MMMdP-------    ---\   Packages: 2202 (rpm), 27 (flatpak) 
,------------:MMMd--------    ---:   Shell: bash 5.0.7 
:------------:MMMd-------    .---:   Resolution: 2560x1440 
:----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
:--     .+shhhMMMmhhy++   .------/   WM: GNOME Shell 
:-    -------:MMMd--------------:    WM Theme: Adwaita 
:-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
:-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
:-- :dMNdhhdNMMNo------------;       Terminal: tilix 
:---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
:------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
:---------------------://            Memory: 2478MiB / 32084MiB 

OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.6

Note:  the hard lock on replay occurred while the Discord flatpak was also running.

Comment 102 Rodney A Morris 2019-09-15 04:35:43 UTC

Created attachment 145367 [details]
Full dmesg from Stellaris crash

I had another crash and soft lockup tonight playing Stellaris through Steam.  Unfortunately, while I had the mesa debuginfo packages installed, I did not have the debug kernel installed.

          /:-------------:\          
       :-------------------::        -------------------------------- 
     :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
   /-----------omMMMNNNMMD  ---:     Kernel: 5.2.13-200.fc30.x86_64 
  :-----------sMMMMNMNMP.    ---:    Uptime: 25 mins 
 :-----------:MMMdP-------    ---\   Packages: 2202 (rpm), 27 (flatpak) 
,------------:MMMd--------    ---:   Shell: bash 5.0.7 
:------------:MMMd-------    .---:   Resolution: 2560x1440 
:----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
:--     .+shhhMMMmhhy++   .------/   WM: GNOME Shell 
:-    -------:MMMd--------------:    WM Theme: Adwaita 
:-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
:-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
:-- :dMNdhhdNMMNo------------;       Terminal: tilix 
:---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
:------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
:---------------------://            Memory: 2478MiB / 32084MiB 

OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.6

> Game being played: 


Stellaris through Steam for Linux.  As with the previous crashes, Discord was running.

> Native or Wine or Wine+DXVK:


Native

> 
> Crash type: Game crash? Full System freeze? System freeze but still can drop
> to tty?


The screen suddenly goes black while the music continues to play for less than a minute; the music then begins to loop, and the computer reboots.

> 
> DMESG output after the crash:
Below are the pertinent dmesg messages.  The full file is attached.

[ 5292.563342] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 5297.683350] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=97861046, emitted seq=97861048
[ 5297.683465] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[ 5297.683470] amdgpu 0000:06:00.0: GPU reset begin!
[ 5297.693302] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1321512, emitted seq=1321513
[ 5297.693406] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process stellaris pid 5624 thread stellaris:cs0 pid 5625
[ 5297.693409] amdgpu 0000:06:00.0: GPU reset begin!
[ 5297.709624] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5297.709631] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5297.709634] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5297.709637] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5297.709706] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5302.803236] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 5307.923355] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
[ 5318.163235] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 5328.403235] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
[ 5328.717149] amdgpu: [powerplay] No response from smu
[ 5328.717151] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[ 5329.031482] amdgpu: [powerplay] No response from smu
[ 5329.345845] amdgpu: [powerplay] No response from smu
[ 5329.345847] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[ 5329.659470] amdgpu: [powerplay] No response from smu
[ 5329.973320] amdgpu: [powerplay] No response from smu
[ 5329.973322] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[ 5330.044255] [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:634
[ 5330.044255] ------------[ cut here ]------------
[ 5330.044355] WARNING: CPU: 9 PID: 7317 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:329 generic_reg_wait.cold+0x31/0x53 [amdgpu]
[ 5330.044356] Modules linked in: rfcomm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables bnep nct6775 hwmon_vid intel_rapl arc4 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat kvm fuse irqbypass iwlmvm iTCO_wdt iTCO_vendor_support mac80211 crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel intel_cstate btusb iwlwifi snd_hda_codec_generic btrtl btbcm btintel ledtrig_audio snd_hda_codec_hdmi intel_uncore bluetooth snd_hda_intel intel_rapl_perf snd_hda_codec cfg80211 snd_hda_core snd_hwdep mxm_wmi i2c_i801 joydev snd_seq snd_seq_device xpad ecdh_generic
[ 5330.044372]  ff_memless snd_pcm rfkill ecc snd_timer mei_me snd mei soundcore lpc_ich pcc_cpufreq auth_rpcgss binfmt_misc sunrpc amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper drm mpt3sas igb crc32c_intel e1000e nvme raid_class nvme_core dca i2c_algo_bit scsi_transport_sas wmi uas usb_storage
[ 5330.044380] CPU: 9 PID: 7317 Comm: kworker/9:0 Not tainted 5.2.13-200.fc30.x86_64 #1
[ 5330.044381] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
[ 5330.044384] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 5330.044424] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
[ 5330.044425] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 b8 e2 7b c0 e8 fb d4 a2 fc 83 7b 20 01 0f 84 8d 14 fe ff 48 c7 c7 28 e2 7b c0 e8 e5 d4 a2 fc <0f> 0b e9 7a 14 fe ff 48 c7 c7 28 e2 7b c0 89 54 24 04 e8 ce d4 a2
[ 5330.044426] RSP: 0000:ffffb980493f37b8 EFLAGS: 00010246
[ 5330.044426] RAX: 0000000000000024 RBX: ffff911f70720780 RCX: 0000000000000006
[ 5330.044427] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff911f7fa57900
[ 5330.044427] RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000737
[ 5330.044428] R10: 0000000000026ddc R11: 0000000000000003 R12: 00000000000035af
[ 5330.044428] R13: 0000000000000dad R14: 0000000000000001 R15: 0000000000000dac
[ 5330.044429] FS:  0000000000000000(0000) GS:ffff911f7fa40000(0000) knlGS:0000000000000000
[ 5330.044429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5330.044430] CR2: 000006af3a9fb000 CR3: 00000007ab40a003 CR4: 00000000003606e0
[ 5330.044430] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5330.044431] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5330.044431] Call Trace:
[ 5330.044487]  dce_mi_free_dmif+0xef/0x150 [amdgpu]
[ 5330.044524]  dce110_reset_hw_ctx_wrap+0x14a/0x1e0 [amdgpu]
[ 5330.044562]  dce110_apply_ctx_to_hw+0x4a/0x490 [amdgpu]
[ 5330.044588]  ? amdgpu_pm_compute_clocks.part.0+0xcb/0x610 [amdgpu]
[ 5330.044590]  ? _cond_resched+0x15/0x30
[ 5330.044629]  ? dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
[ 5330.044666]  dc_commit_state+0x27b/0x5c0 [amdgpu]
[ 5330.044669]  ? number+0x31c/0x360
[ 5330.044707]  amdgpu_dm_atomic_commit_tail+0xc15/0x1930 [amdgpu]
[ 5330.044710]  ? va_format.isra.0+0x6e/0xa0
[ 5330.044713]  ? sched_clock+0x5/0x10
[ 5330.044716]  ? sched_clock_cpu+0xc/0xa0
[ 5330.044719]  ? up+0x12/0x60
[ 5330.044721]  ? __irq_work_queue_local+0x50/0x60
[ 5330.044722]  ? irq_work_queue+0x46/0x50
[ 5330.044725]  ? wake_up_klogd+0x30/0x40
[ 5330.044726]  ? vprintk_emit+0x17c/0x260
[ 5330.044727]  ? printk+0x58/0x6f
[ 5330.044728]  ? __next_timer_interrupt+0xd0/0xd0
[ 5330.044736]  ? drm_atomic_helper_wait_for_dependencies+0x1e4/0x1f0 [drm_kms_helper]
[ 5330.044748]  ? drm_err+0x72/0x90 [drm]
[ 5330.044749]  ? _cond_resched+0x15/0x30
[ 5330.044750]  ? wait_for_completion_timeout+0x38/0x170
[ 5330.044754]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[ 5330.044791]  ? amdgpu_dm_atomic_check+0x6d0/0x6d0 [amdgpu]
[ 5330.044795]  commit_tail+0x3c/0x70 [drm_kms_helper]
[ 5330.044799]  drm_atomic_helper_commit+0x108/0x110 [drm_kms_helper]
[ 5330.044803]  drm_atomic_helper_disable_all+0x144/0x160 [drm_kms_helper]
[ 5330.044807]  drm_atomic_helper_suspend+0x60/0xf0 [drm_kms_helper]
[ 5330.044844]  dm_suspend+0x20/0x60 [amdgpu]
[ 5330.044867]  amdgpu_device_ip_suspend_phase1+0x8b/0xc0 [amdgpu]
[ 5330.044890]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 5330.044927]  amdgpu_device_pre_asic_reset+0x1f4/0x209 [amdgpu]
[ 5330.044965]  amdgpu_device_gpu_recover+0x77/0x785 [amdgpu]
[ 5330.044998]  amdgpu_job_timedout+0xf7/0x120 [amdgpu]
[ 5330.045000]  drm_sched_job_timedout+0x3a/0x70 [gpu_sched]
[ 5330.045003]  process_one_work+0x19d/0x380
[ 5330.045005]  worker_thread+0x50/0x3b0
[ 5330.045007]  kthread+0xfb/0x130
[ 5330.045008]  ? process_one_work+0x380/0x380
[ 5330.045009]  ? kthread_park+0x80/0x80
[ 5330.045010]  ret_from_fork+0x35/0x40
[ 5330.045012] ---[ end trace 7beee32e6101e37d ]---
[ 5330.358847] amdgpu: [powerplay] No response from smu
[ 5330.673262] amdgpu: [powerplay] No response from smu
[ 5330.673263] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0x0
[ 5330.987579] amdgpu: [powerplay] No response from smu
[ 5331.302073] amdgpu: [powerplay] No response from smu
[ 5331.302074] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0x0
[ 5331.616202] amdgpu: [powerplay] No response from smu
[ 5331.929678] amdgpu: [powerplay] No response from smu
[ 5331.929681] amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0x0
[ 5332.243534] amdgpu: [powerplay] No response from smu
[ 5332.557383] amdgpu: [powerplay] No response from smu
[ 5332.557384] amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xa0b000, error code: 0x0
[ 5332.871126] amdgpu: [powerplay] No response from smu
[ 5333.185009] amdgpu: [powerplay] No response from smu
[ 5333.185011] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[ 5333.498596] amdgpu: [powerplay] No response from smu
[ 5333.812147] amdgpu: [powerplay] No response from smu
[ 5333.812155] amdgpu: [powerplay] Failed message: 0x4, input parameter: 0x400, error code: 0x0
[ 5334.126013] amdgpu: [powerplay] No response from smu
[ 5334.440194] amdgpu: [powerplay] No response from smu
[ 5334.440197] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[ 5334.753930] amdgpu: [powerplay] No response from smu
[ 5335.067603] amdgpu: [powerplay] No response from smu
[ 5335.067605] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[ 5335.083579] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.083589] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.083599] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.083603] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.083694] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.101028] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.101034] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.101036] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.101039] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.101085] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.118568] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.118573] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.118575] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.118577] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.118621] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.136108] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.136113] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.136116] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.136118] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.136189] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.153649] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.153654] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.153656] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.153658] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.153702] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.171189] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.171194] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.171196] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.171199] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.171242] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.188769] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.188774] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.188776] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.188778] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.188819] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.206263] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.206266] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.206267] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.206268] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.206286] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.223806] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.223809] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.223811] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.223812] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.223837] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.241348] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 5335.469372] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5335.469374] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 5335.469375] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 5335.469405] pcieport 0000:00:03.0: AER: Device recovery failed
[ 5335.469406] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
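
For readers unfamiliar with the AER lines above: `error status/mask=00004000/00000000` is a bit field, and the `[14] CmpltTO` line is the kernel naming the one bit that is set (a PCIe completion timeout). A minimal sketch of that decoding follows. The function name and table are illustrative only; bit 14 is confirmed by the log, while the other bit labels are an assumption recalled from the PCIe AER uncorrectable status layout and should be checked against the spec.

```python
# Short labels for some bits of the PCIe AER Uncorrectable Error Status
# register. Only bit 14 (CmpltTO) is confirmed by the log in this thread;
# the remaining entries are assumptions to be verified against the PCIe spec.
UNCORRECTABLE_BITS = {
    4: "DLP",         # Data Link Protocol error
    12: "TLP",        # Poisoned TLP
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    18: "MalfTLP",    # Malformed TLP
    20: "UnsupReq",   # Unsupported Request
}


def decode_aer_status(status_hex, mask_hex="00000000"):
    """Return [(bit, label)] for every unmasked bit set in the status word."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    reported = status & ~mask  # assumption: masked bits are ignored
    return [
        (bit, UNCORRECTABLE_BITS.get(bit, "unknown"))
        for bit in range(32)
        if reported & (1 << bit)
    ]


if __name__ == "__main__":
    # Values copied from the "error status/mask=" line in the log.
    print(decode_aer_status("00004000", "00000000"))
```

A completion timeout on the root port is what one would expect to see if the GPU stopped answering PCIe reads, so it is plausibly a symptom of the hang rather than its cause.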

> systemd logs output after the crash (If your system froze and you get logs
> after reboot):

Sep 14 20:52:48 ezra.blanchardmorris.net kernel: Command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.13-200.fc30.x86_64 root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 14 20:52:48 ezra.blanchardmorris.net kernel: Kernel command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.13-200.fc30.x86_64 root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 14 20:52:49 ezra.blanchardmorris.net dracut-cmdline[363]: Using kernel command line parameters: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.13-200.fc30.x86_64 root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: [drm] amdgpu kernel modesetting enabled.
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfb600000 -> 0xfb67ffff
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: fb0: switching to amdgpudrmfb from EFI VGA
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: vgaarb: deactivate vga console
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: No more image in the PCI ROM
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: [drm] amdgpu: 8176M of VRAM memory ready
Sep 14 20:52:49 ezra.blanchardmorris.net kernel: [drm] amdgpu: 8176M of GTT memory ready.
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: fbcon: amdgpudrmfb (fb0) is primary device
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: fb0: amdgpudrmfb frame buffer device
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring gfx uses VM inv eng 0 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring sdma0 uses VM inv eng 0 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring page0 uses VM inv eng 1 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring sdma1 uses VM inv eng 4 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring page1 uses VM inv eng 5 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce0 uses VM inv eng 9 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce1 uses VM inv eng 10 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce2 uses VM inv eng 11 on hub 1
Sep 14 20:52:50 ezra.blanchardmorris.net kernel: [drm] Initialized amdgpu 3.32.0 20150101 for 0000:06:00.0 on minor 0
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]: Kernel command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.13-200.fc30.x86_64 root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]:         loading driver: amdgpu
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]: (==) Matched amdgpu as autoconfigured driver 0
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]: (II) LoadModule: "amdgpu"
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]: (II) Loading /usr/lib64/xorg/modules/drivers/amdgpu_drv.so
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]: (II) Module amdgpu: vendor="X.Org Foundation"
Sep 14 20:53:20 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1928]:         All GPUs supported by the amdgpu kernel driver
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=97861046, emitted seq=97861048
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset begin!
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1321512, emitted seq=1321513
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process stellaris pid 5624 thread stellaris:cs0 pid 5625
Sep 14 22:21:05 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset begin!
Sep 14 22:21:15 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
Sep 14 22:21:36 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Sep 14 22:21:36 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
Sep 14 22:21:36 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Sep 14 22:21:37 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
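
When comparing crashes across reports, it can help to pull the ring name and fence sequence numbers out of the `amdgpu_job_timedout` lines. The sketch below does that with a regular expression; the function name is hypothetical, and reading `emitted - signaled` as "fences submitted but not yet completed" is an interpretation of the log, not something the driver states.

```python
import re

# Matches the ring-timeout lines printed by amdgpu_job_timedout.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return a list of (ring, signaled, emitted, pending) tuples."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return results


# Two lines copied from the journal excerpt (timestamp and host trimmed).
SAMPLE = """\
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=97861046, emitted seq=97861048
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1321512, emitted seq=1321513
"""

if __name__ == "__main__":
    for ring, signaled, emitted, pending in parse_ring_timeouts(SAMPLE):
        print(f"{ring}: {pending} fence(s) emitted but not signaled")
```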

Comment 103 Mauro Gaspari 2019-09-21 02:05:52 UTC

(In reply to Rodney A Morris from comment #101)
> (In reply to Rodney A Morris from comment #99)
> > Created attachment 145366 [details]
> > apitrace of Hearts of Iron IV hard lock
> > 
> > Apitrace from hard lock playing Hearts of Iron IV without Steam.  The replay
> > from this trace will hard lock the computer, though inconsistently.  I've
> > replayed the trace three times. The replay hard locked computer one time.
> 
> neofetch from hardlock:
> 
>           /:-------------:\          
>        :-------------------::        -------------------------------- 
>      :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
>    /-----------omMMMNNNMMD  ---:     Kernel: 5.2.13-200.fc30.x86_64 
>   :-----------sMMMMNMNMP.    ---:    Uptime: 25 mins 
>  :-----------:MMMdP-------    ---\   Packages: 2202 (rpm), 27 (flatpak) 
> ,------------:MMMd--------    ---:   Shell: bash 5.0.7 
> :------------:MMMd-------    .---:   Resolution: 2560x1440 
> :----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
> :--     .+shhhMMMmhhy++   .------/   WM: GNOME Shell 
> :-    -------:MMMd--------------:    WM Theme: Adwaita 
> :-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
> :-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
> :-- :dMNdhhdNMMNo------------;       Terminal: tilix 
> :---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
> :------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
> :---------------------://            Memory: 2478MiB / 32084MiB 
> 
> OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.6
> 
> Note:  hard lock replayed occurred when the Discord flatpak is also running.

I also noticed some errors that pointed to Discord in my logs. In my case Discord was installed via the .deb package.
Could you please try disabling hardware acceleration in Discord (Settings, Appearance menu)? Please let me know if it helps or changes anything.
Thanks!

Comment 104 Rodney A Morris 2019-09-23 02:49:08 UTC

(In reply to Mauro Gaspari from comment #103)
> I also noticed some errors that pointed to discord in my logs. In my case
> discord was installed via .deb package. 
> Could you please try and disable hardware acceleration in discord settings -
> appearance menu? Please let me know if it helps or changes anything. 
> Thanks!

I have disabled hardware acceleration in the Discord settings to see if that improves my experience, and I will report back with my results.  I am doubtful that it will help much.  At least on the 5.2.11 kernel, I had lockups with or without Discord running.  Having Discord running just seemed to make the problem appear more consistently.

Comment 105 Rodney A Morris 2019-09-23 03:06:55 UTC

Created attachment 145462 [details]
dmesg from Stellaris crash 2019-09-20

I had another lockup on Friday while playing Stellaris again.  This time I had the debug kernel running and the mesa debug packages installed.  I do not plan to post dmesg and journalctl dumps for future crashes unless the logs indicate a new problem or I can obtain more information than I previously provided.  Like the crash I reported for Hearts of Iron IV, this Stellaris crash seems to be caused by a circular lock dependency.

If someone believes my problems are caused by faulty hardware, please let me know.  As an FYI, this problem does not seem to manifest under Windows 10, playing the same game.

Card:

Sapphire Radeon Vega 64

OS Info:

          /:-------------:\           
       :-------------------::        -------------------------------- 
     :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
   /-----------omMMMNNNMMD  ---:     Kernel: 5.2.15-200.fc30.x86_64 
  :-----------sMMMMNMNMP.    ---:    Uptime: 1 day, 22 hours, 37 mins 
 :-----------:MMMdP-------    ---\   Packages: 2211 (rpm), 30 (flatpak) 
,------------:MMMd--------    ---:   Shell: bash 5.0.7 
:------------:MMMd-------    .---:   Resolution: 2560x1440 
:----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
:--     .+shhhMMMmhhy++   .------/   WM: Mutter 
:-    -------:MMMd--------------:    WM Theme: Adwaita 
:-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
:-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
:-- :dMNdhhdNMMNo------------;       Terminal: tilix 
:---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
:------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
:---------------------://            Memory: 3097MiB / 32084MiB 

Mesa info:

OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.6

Game being played:

Stellaris through steam for Linux

Native or Wine:

Native

Crash Type:

Screen goes black suddenly while the music continues to play for less than a minute; the music then begins to loop, and the computer reboots.

Full dmesg attached.  Pertinent part of dmesg with debug kernel:

[ 2383.732727] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 2923.530873] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 2928.651952] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=51954680, emitted seq=51954682
[ 2928.652090] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[ 2928.652098] amdgpu 0000:06:00.0: GPU reset begin!
[ 2928.661852] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=734676, emitted seq=734677
[ 2928.661898] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process stellaris pid 5395 thread stellaris:cs0 pid 5397
[ 2928.661901] amdgpu 0000:06:00.0: GPU reset begin!

[ 2928.661997] ======================================================
[ 2928.661999] WARNING: possible circular locking dependency detected
[ 2928.662003] 5.2.15-200.fc30.x86_64+debug #1 Not tainted
[ 2928.662005] ------------------------------------------------------
[ 2928.662007] kworker/10:2/974 is trying to acquire lock:
[ 2928.662010] 00000000d514cf70 (&(&ring->fence_drv.lock)->rlock){-.-.}, at: dma_fence_remove_callback+0x1a/0x60
[ 2928.662021] 
               but task is already holding lock:
[ 2928.662023] 00000000e6ce7c0d (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x130 [gpu_sched]
[ 2928.662031] 
               which lock already depends on the new lock.

[ 2928.662033] 
               the existing dependency chain (in reverse order) is:
[ 2928.662035] 
               -> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
[ 2928.662044]        _raw_spin_lock_irqsave+0x49/0x83
[ 2928.662049]        drm_sched_process_job+0x4d/0x180 [gpu_sched]
[ 2928.662052]        dma_fence_signal+0x111/0x1a0
[ 2928.662128]        amdgpu_fence_process+0xa3/0x100 [amdgpu]
[ 2928.662223]        sdma_v4_0_process_trap_irq+0x8d/0xa0 [amdgpu]
[ 2928.662310]        amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
[ 2928.662398]        amdgpu_ih_process+0x8d/0x110 [amdgpu]
[ 2928.662482]        amdgpu_irq_handler+0x1b/0x50 [amdgpu]
[ 2928.662487]        __handle_irq_event_percpu+0x3f/0x290
[ 2928.662491]        handle_irq_event_percpu+0x31/0x80
[ 2928.662495]        handle_irq_event+0x34/0x51
[ 2928.662498]        handle_edge_irq+0x83/0x1a0
[ 2928.662502]        handle_irq+0x1c/0x30
[ 2928.662507]        do_IRQ+0x61/0x120
[ 2928.662511]        ret_from_intr+0x0/0x22
[ 2928.662517]        cpuidle_enter_state+0xc9/0x450
[ 2928.662519]        cpuidle_enter+0x29/0x40
[ 2928.662524]        do_idle+0x1ec/0x280
[ 2928.662528]        cpu_startup_entry+0x19/0x20
[ 2928.662531]        start_secondary+0x189/0x1e0
[ 2928.662537]        secondary_startup_64+0xa4/0xb0
[ 2928.662539] 
               -> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
[ 2928.662548]        lock_acquire+0xa2/0x1b0
[ 2928.662551]        _raw_spin_lock_irqsave+0x49/0x83
[ 2928.662555]        dma_fence_remove_callback+0x1a/0x60
[ 2928.662560]        drm_sched_stop+0x59/0x130 [gpu_sched]
[ 2928.662709]        amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
[ 2928.662866]        amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[ 2928.663007]        amdgpu_job_timedout+0x109/0x130 [amdgpu]
[ 2928.663018]        drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[ 2928.663024]        process_one_work+0x272/0x5e0
[ 2928.663029]        worker_thread+0x50/0x3b0
[ 2928.663037]        kthread+0x108/0x140
[ 2928.663045]        ret_from_fork+0x3a/0x50
[ 2928.663048] 
               other info that might help us debug this:

[ 2928.663051]  Possible unsafe locking scenario:

[ 2928.663055]        CPU0                    CPU1
[ 2928.663059]        ----                    ----
[ 2928.663062]   lock(&(&sched->job_list_lock)->rlock);
[ 2928.663068]                                lock(&(&ring->fence_drv.lock)->rlock);
[ 2928.663072]                                lock(&(&sched->job_list_lock)->rlock);
[ 2928.663076]   lock(&(&ring->fence_drv.lock)->rlock);
[ 2928.663080] 
                *** DEADLOCK ***

[ 2928.663085] 5 locks held by kworker/10:2/974:
[ 2928.663090]  #0: 0000000057c9a435 ((wq_completion)events){+.+.}, at: process_one_work+0x1e9/0x5e0
[ 2928.663100]  #1: 00000000aadd5dda ((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at: process_one_work+0x1e9/0x5e0
[ 2928.663108]  #2: 0000000007db378b (&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x39 [amdgpu]
[ 2928.663261]  #3: 000000001e0a2926 (&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
[ 2928.663392]  #4: 00000000e6ce7c0d (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x130 [gpu_sched]
[ 2928.663403] 
               stack backtrace:
[ 2928.663409] CPU: 10 PID: 974 Comm: kworker/10:2 Not tainted 5.2.15-200.fc30.x86_64+debug #1
[ 2928.663413] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
[ 2928.663423] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 2928.663428] Call Trace:
[ 2928.663442]  dump_stack+0x85/0xc0
[ 2928.663453]  print_circular_bug.cold+0x15c/0x195
[ 2928.663462]  __lock_acquire+0x167c/0x1c90
[ 2928.663475]  lock_acquire+0xa2/0x1b0
[ 2928.663482]  ? dma_fence_remove_callback+0x1a/0x60
[ 2928.663494]  _raw_spin_lock_irqsave+0x49/0x83
[ 2928.663499]  ? dma_fence_remove_callback+0x1a/0x60
[ 2928.663506]  dma_fence_remove_callback+0x1a/0x60
[ 2928.663515]  drm_sched_stop+0x59/0x130 [gpu_sched]
[ 2928.663663]  amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
[ 2928.663818]  amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[ 2928.663960]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[ 2928.663974]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[ 2928.663981]  process_one_work+0x272/0x5e0
[ 2928.663991]  worker_thread+0x50/0x3b0
[ 2928.664000]  kthread+0x108/0x140
[ 2928.664005]  ? process_one_work+0x5e0/0x5e0
[ 2928.664011]  ? kthread_park+0x80/0x80
[ 2928.664021]  ret_from_fork+0x3a/0x50
[ 2928.681831] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 2928.681846] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 2928.681851] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 2928.681857] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 2928.681963] pcieport 0000:00:03.0: AER: Device recovery failed
[ 2933.771664] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 2938.890758] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
[ 2939.118467] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 2939.118475] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 2939.118477] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 2939.118479] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 2939.118536] pcieport 0000:00:03.0: AER: Device recovery failed
[ 2939.141034] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 2939.369014] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 2939.369018] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 2939.369021] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 2939.369072] pcieport 0000:00:03.0: AER: Device recovery failed
[ 2939.369075] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 2939.597051] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 2939.597055] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 2939.597057] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 2939.597103] pcieport 0000:00:03.0: AER: Device recovery failed
[ 2939.597106] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0

systemd logs:

Nothing interesting appears in the logs, not even the information from dmesg.  I'm unsure if systemd captured anything from the crash.
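
The lockdep warning in the dmesg excerpt reduces to two code paths taking the same two locks in opposite order: the interrupt path holds `ring->fence_drv.lock` and then takes `sched->job_list_lock`, while the reset path holds `sched->job_list_lock` and then takes `ring->fence_drv.lock`. The sketch below shows the general idea of detecting such an inversion as a cycle in a "held, then acquired" graph. It is an illustration of the concept, not lockdep's actual implementation, and the function name is hypothetical.

```python
def find_lock_cycle(order_edges):
    """Return one cycle in a lock-order graph as a list of names, or None.

    order_edges is a list of (held, acquired) pairs, meaning 'acquired'
    was taken while 'held' was already held.
    """
    graph = {}
    for held, acquired in order_edges:
        graph.setdefault(held, []).append(acquired)
        graph.setdefault(acquired, [])

    visiting = []  # current depth-first path
    done = set()   # nodes fully explored without finding a cycle

    def visit(node):
        if node in visiting:
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in graph[node]:
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# The two orderings named in the lockdep report (chains #1 and #0).
EDGES = [
    ("ring->fence_drv.lock", "sched->job_list_lock"),  # IRQ fence-signal path
    ("sched->job_list_lock", "ring->fence_drv.lock"),  # drm_sched_stop path
]

if __name__ == "__main__":
    print(" -> ".join(find_lock_cycle(EDGES)))
```

Note that lockdep reports a possible deadlock as soon as it has seen both orderings, even if the two paths never actually raced, so this warning alone does not prove that the hard lock was this deadlock.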

Comment 106 jeroenimo 2019-09-26 10:37:39 UTC

This is quite a severe bug.
I have a reasonably stable system with Mint 19.2 (it runs for hours without a crash):
uname -a
Linux jeroenimo-amd 4.15.0-64-generic #73-Ubuntu SMP Thu Sep 12 13:16:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


(X)ubuntu 18.04 LTS crashes a lot faster (within 1 or 2 minutes) with the 5.0.0.29 kernel.

I can reproduce the bug with glmark2 instantly, 100% of the time

(https://launchpad.net/glmark2) or sudo apt install glmark2

I'm not very good at debugging, but this is what my dmesg looks like when I ssh in and run glmark2:

[ 6619.587749] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:45:crtc-1] flip_done timed out

And that's it, no more info.

Comment 107 jeroenimo 2019-09-26 12:56:08 UTC

I have a workaround that at least makes the system workable.

After some testing I managed to run glmark2 at the lowest and second-lowest clock speeds on my RX560.

As root:
echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 1 > /sys/class/drm/card0/device/pp_dpm_sclk

giving me this:
cat /sys/class/drm/card0/device/pp_dpm_sclk 
0: 214Mhz 
1: 387Mhz *
2: 843Mhz 
3: 995Mhz 
4: 1062Mhz 
5: 1108Mhz 
6: 1149Mhz 
7: 1176Mhz 

Obviously this decreases performance big time, but I don't really game, so it makes my system usable.

Any clock speed above level 4 (1062Mhz) crashes my system immediately.
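
To check which level is actually in effect after applying the workaround, the `pp_dpm_sclk` listing can be parsed; the active level is the one marked with `*`. The sketch below is a minimal parser for the format shown above. The function names are hypothetical, and the sysfs path is simply the one used in this comment (`card0` may differ on other systems).

```python
import re

# One line of pp_dpm_sclk output, e.g. "1: 387Mhz *".
LEVEL_RE = re.compile(r"^(\d+):\s*(\d+)Mhz\s*(\*?)\s*$")


def parse_dpm_levels(text):
    """Return a list of (index, mhz, is_active) tuples."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line.strip())
        if match:
            levels.append(
                (int(match.group(1)), int(match.group(2)), match.group(3) == "*")
            )
    return levels


def active_level(levels):
    """Return (index, mhz) of the level marked with '*', or None."""
    for index, mhz, is_active in levels:
        if is_active:
            return index, mhz
    return None


def read_levels(path="/sys/class/drm/card0/device/pp_dpm_sclk"):
    """Read and parse the live sysfs file (path taken from the comment above)."""
    with open(path) as handle:
        return parse_dpm_levels(handle.read())


# Output copied from the comment above.
SAMPLE = """\
0: 214Mhz
1: 387Mhz *
2: 843Mhz
3: 995Mhz
4: 1062Mhz
5: 1108Mhz
6: 1149Mhz
7: 1176Mhz
"""

if __name__ == "__main__":
    print(active_level(parse_dpm_levels(SAMPLE)))
```

On an affected machine, `active_level(read_levels())` could be polled while a benchmark runs to confirm that the card stays at the forced level.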

Comment 108 Wilko Bartels 2019-09-28 07:02:48 UTC

Did you try the amdgpu-pro driver as well?
I just did four runs of glmark2 and it went through without problems for me, going up to a 1600 MHz shader clock. I tested both the closed and open-source drivers. Vega Pulse here.

mesa result:

=======================================================
    glmark2 2014.03
=======================================================
    OpenGL Information
    GL_VENDOR:     X.Org
    GL_RENDERER:   Radeon RX Vega (VEGA10, DRM 3.33.0, 5.3.1-arch1-1-ARCH, LLVM 8.0.1)
    GL_VERSION:    4.5 (Compatibility Profile) Mesa 19.1.7
=======================================================
[build] use-vbo=false: FPS: 8617 FrameTime: 0.116 ms
[build] use-vbo=true: FPS: 10534 FrameTime: 0.095 ms
[texture] texture-filter=nearest: FPS: 11214 FrameTime: 0.089 ms
[texture] texture-filter=linear: FPS: 11274 FrameTime: 0.089 ms
[texture] texture-filter=mipmap: FPS: 10197 FrameTime: 0.098 ms
[shading] shading=gouraud: FPS: 9790 FrameTime: 0.102 ms
[shading] shading=blinn-phong-inf: FPS: 10979 FrameTime: 0.091 ms
[shading] shading=phong: FPS: 10167 FrameTime: 0.098 ms
[shading] shading=cel: FPS: 9662 FrameTime: 0.103 ms
[bump] bump-render=high-poly: FPS: 9830 FrameTime: 0.102 ms
[bump] bump-render=normals: FPS: 10151 FrameTime: 0.099 ms
[bump] bump-render=height: FPS: 10870 FrameTime: 0.092 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 12008 FrameTime: 0.083 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 10876 FrameTime: 0.092 ms
[pulsar] light=false:quads=5:texture=false: FPS: 10232 FrameTime: 0.098 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 6842 FrameTime: 0.146 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] effect=shadow:windows=4: FPS: 7934 FrameTime: 0.126 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1770 FrameTime: 0.565 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 2308 FrameTime: 0.433 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1875 FrameTime: 0.533 ms
[ideas] speed=duration: FPS: 4475 FrameTime: 0.223 ms
[jellyfish] <default>: FPS: 9499 FrameTime: 0.105 ms
[terrain] <default>: FPS: 2593 FrameTime: 0.386 ms
[shadow] <default>: FPS: 9423 FrameTime: 0.106 ms
[refract] <default>: FPS: 6008 FrameTime: 0.166 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 11364 FrameTime: 0.088 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 10816 FrameTime: 0.092 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 12000 FrameTime: 0.083 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 10932 FrameTime: 0.091 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 11690 FrameTime: 0.086 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 11119 FrameTime: 0.090 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 11003 FrameTime: 0.091 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 12886 FrameTime: 0.078 ms
=======================================================
                                  glmark2 Score: 9119 
=======================================================

amdgpu-pro result:

=======================================================
    glmark2 2014.03
=======================================================
    OpenGL Information
    GL_VENDOR:     ATI Technologies Inc.
    GL_RENDERER:   Radeon RX Vega
    GL_VERSION:    4.6.13572 Compatibility Profile Context
=======================================================
[build] use-vbo=false: FPS: 3727 FrameTime: 0.268 ms
[build] use-vbo=true: FPS: 9516 FrameTime: 0.105 ms
[texture] texture-filter=nearest: FPS: 7346 FrameTime: 0.136 ms
[texture] texture-filter=linear: FPS: 9236 FrameTime: 0.108 ms
[texture] texture-filter=mipmap: FPS: 9161 FrameTime: 0.109 ms
[shading] shading=gouraud: FPS: 9184 FrameTime: 0.109 ms
[shading] shading=blinn-phong-inf: FPS: 9363 FrameTime: 0.107 ms
[shading] shading=phong: FPS: 9424 FrameTime: 0.106 ms
[shading] shading=cel: FPS: 9060 FrameTime: 0.110 ms
[bump] bump-render=high-poly: FPS: 9047 FrameTime: 0.111 ms
[bump] bump-render=normals: FPS: 8804 FrameTime: 0.114 ms
[bump] bump-render=height: FPS: 9156 FrameTime: 0.109 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 9121 FrameTime: 0.110 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 8866 FrameTime: 0.113 ms
[pulsar] light=false:quads=5:texture=false: FPS: 8286 FrameTime: 0.121 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 3789 FrameTime: 0.264 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] effect=shadow:windows=4: FPS: 4491 FrameTime: 0.223 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1026 FrameTime: 0.975 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 2228 FrameTime: 0.449 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1275 FrameTime: 0.784 ms
[ideas] speed=duration: FPS: 4038 FrameTime: 0.248 ms
[jellyfish] <default>: FPS: 7342 FrameTime: 0.136 ms
[terrain] <default>: FPS: 790 FrameTime: 1.266 ms
[shadow] <default>: FPS: 6002 FrameTime: 0.167 ms
[refract] <default>: FPS: 4273 FrameTime: 0.234 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 9208 FrameTime: 0.109 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 8964 FrameTime: 0.112 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 8984 FrameTime: 0.111 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 9360 FrameTime: 0.107 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 9214 FrameTime: 0.109 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 8945 FrameTime: 0.112 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 9218 FrameTime: 0.108 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 9077 FrameTime: 0.110 ms
=======================================================
                                  glmark2 Score: 7197 
=======================================================

Comment 109 jeroenimo 2019-09-28 11:05:09 UTC

(In reply to Wilko Bartels from comment #108)
> Did u try the amdgpu-pro driver as well?
> i just did four runs of glmark and it just went through for me. going up to
> 1600mhz shader clock. tested both closed and opensource drivers. vega pulse
> here.
> 
Yes, I tried all versions. I'm pretty sure it's not the driver, as they all give the same result. Any higher clock speed just crashed.

I have an NVIDIA 1030 installed now, which is also buggy, but at least it doesn't crash.

Comment 110 Rodney A Morris 2019-09-28 12:25:32 UTC

(In reply to Rodney A Morris from comment #104)
> (In reply to Mauro Gaspari from comment #103)
> > (In reply to Rodney A Morris from comment #101)
> > > (In reply to Rodney A Morris from comment #99)
> > > > Created attachment 145366 [details]
> > > > apitrace of Hearts of Iron IV hard lock
> > > > 
> > > > Apitrace from hard lock playing Hearts of Iron IV without Steam.  The replay
> > > > from this trace will hard lock the computer, though inconsistently.  I've
> > > > replayed the trace three times. The replay hard locked computer one time.
> > > 
> > > neofetch from hardlock:
> > > 
> > >           /:-------------:\          
> > >        :-------------------::        -------------------------------- 
> > >      :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
> > >    /-----------omMMMNNNMMD  ---:     Kernel: 5.2.13-200.fc30.x86_64 
> > >   :-----------sMMMMNMNMP.    ---:    Uptime: 25 mins 
> > >  :-----------:MMMdP-------    ---\   Packages: 2202 (rpm), 27 (flatpak) 
> > > ,------------:MMMd--------    ---:   Shell: bash 5.0.7 
> > > :------------:MMMd-------    .---:   Resolution: 2560x1440 
> > > :----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
> > > :--     .+shhhMMMmhhy++   .------/   WM: GNOME Shell 
> > > :-    -------:MMMd--------------:    WM Theme: Adwaita 
> > > :-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
> > > :-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
> > > :-- :dMNdhhdNMMNo------------;       Terminal: tilix 
> > > :---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
> > > :------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
> > > :---------------------://            Memory: 2478MiB / 32084MiB 
> > > 
> > > OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.6
> > > 
> > > Note:  hard lock replayed occurred when the Discord flatpak is also running.
> > 
> > I also noticed some errors that pointed to discord in my logs. In my case
> > discord was installed via .deb package. 
> > Could you please try and disable hardware acceleration in discord settings -
> > appearance menu? Please let me know if it helps or changes anything. 
> > Thanks!
> 
> I have disabled hardware acceleration in discord settings to see if that
> improves my experience and report back my results.  I am doubtful that it
> will help much.  At least on the 5.2.11 kernel, I had lockups with or
> without discord running.  Discord running just seemed to make the problem
> appear more consistently.

Another lockup and crash of Stellaris last night, with dmesg kernel output identical to that in comment 105.

Kernel for this crash: 5.2.17.

Unlike previous attempts, I also had cpupower configured to run the CPU in performance mode and was running Feral GameMode. Although I still wonder if my hardware has an issue, I am able to run Stellaris without issue under Windows.

Final note: Getting an apitrace of my crash under Stellaris is not feasible for two reasons. First, the crash typically happens between 30 and 40 minutes into gameplay, resulting in a monster trace file. Second, I cannot get apitrace to run correctly with Steam and a 64-bit game, which is necessary since the crashes happen most frequently in multiplayer.

I am happy to provide more data if someone can point me in the direction to capture it.  Aside from trying the amdgpu-pro drivers, is there anything else I can try?

Comment 111 Yury Zhuravlev 2019-10-03 09:57:41 UTC

OK, this has been mentioned here many times:
echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

This also helped me. Without it, many games freeze my PC, with nothing in the logs and no working SSH.

Something is wrong with the PowerPlay system on Vega cards. Can anybody open a ticket on the kernel bug tracker?
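
For anyone who wants to script the workaround above, here is a minimal Python sketch. It assumes the card is card0 and that the kernel only accepts writes to pp_dpm_sclk after power_dpm_force_performance_level has been set to "manual". The function name force_sclk_level and its device_dir parameter are made up for illustration; they are not part of any amdgpu tool.

```python
from pathlib import Path

def force_sclk_level(level, device_dir="/sys/class/drm/card0/device"):
    """Pin the shader clock to a single DPM level (needs root on real hardware)."""
    dev = Path(device_dir)
    # Assumption: writes to pp_dpm_sclk are only honoured in "manual" mode,
    # so switch the performance level to manual first.
    (dev / "power_dpm_force_performance_level").write_text("manual\n")
    # Same effect as: echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
    (dev / "pp_dpm_sclk").write_text(f"{level}\n")
```

Running force_sclk_level(7) as root should match the echo command quoted above. The setting does not survive a reboot, so it would have to be reapplied at boot.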

Comment 112 Jan Orsag 2019-10-05 10:12:00 UTC

screenfetch
 ██████████████████  ████████     johanides@johanides-manjaro
 ██████████████████  ████████     OS: Manjaro 18.1.0 Juhraya
 ██████████████████  ████████     Kernel: x86_64 Linux 4.19.69-1-MANJARO
 ██████████████████  ████████     Uptime: 18m
 ████████            ████████     Packages: 1186
 ████████  ████████  ████████     Shell: bash
 ████████  ████████  ████████     Resolution: 2560x1440
 ████████  ████████  ████████     DE: GNOME 3.32.2
 ████████  ████████  ████████     WM: Mutter
 ████████  ████████  ████████     WM Theme: Adapta-Nokto-Eta-Maia
 ████████  ████████  ████████     GTK Theme: Adapta-Nokto-Eta-Maia [GTK2/3]
 ████████  ████████  ████████     Icon Theme: Papirus-Adapta-Nokto-Maia
 ████████  ████████  ████████     Font: Noto Sans 10
 ████████  ████████  ████████     Disk: 565G / 1,2T (50%)
                                  CPU: AMD Ryzen 5 1600X Six-Core @ 12x 3.6GHz
                                  GPU: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.19.69-1-MANJARO, LLVM 8.0.1)
                                  RAM: 2320MiB / 16050MiB

System hard freezes after some playtime in Civilization 6 (black/green/gray screen, music playing, need to use reset button)

Errors in system logs:
sep 19 16:39:13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=7763335, emitted seq=7763337
sep 19 16:39:13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=7703731, emitted seq=7703733
sep 19 16:41:11 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=374796, emitted seq=374798
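
To compare logs between reporters, the ring timeout lines above can be pulled apart with a small Python sketch like the one below. The function name parse_ring_timeouts is hypothetical, and reading "emitted minus signaled" as the number of jobs still outstanding on the ring is my assumption, not something the driver states.

```python
import re

# Matches the amdgpu_job_timedout message format seen in the logs above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeouts(log_text):
    """Return (ring, signaled, emitted, outstanding) for each timeout line."""
    results = []
    for m in RING_TIMEOUT.finditer(log_text):
        signaled = int(m["signaled"])
        emitted = int(m["emitted"])
        results.append((m["ring"], signaled, emitted, emitted - signaled))
    return results

sample = """\
sep 19 16:39:13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=7763335, emitted seq=7763337
sep 19 16:39:13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=7703731, emitted seq=7703733
sep 19 16:41:11 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=374796, emitted seq=374798
"""

for ring, signaled, emitted, outstanding in parse_ring_timeouts(sample):
    print(f"{ring}: emitted={emitted} signaled={signaled} outstanding={outstanding}")
```

In this sample all three rings (gfx, sdma1, sdma0) are two sequence numbers behind.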

On my computer, however, the crash/freeze occurs sooner with kernels 5.x and higher than with kernel 4.19: approximately 1 hour of playtime (kernel 5+) vs. 8 hours (kernel 4.19). It doesn't matter which Mesa I use; I tried mesa-aco-git 19.3 and mesa 19.1.

Comment 113 Jason Playne 2019-10-05 12:02:13 UTC

As others have noted, with powerplay doing its thing we get system freezes.

Just had a successful 6+ hour gaming session on kernel 5.3.2-050302-generic with the following being done:
 * Forcing high perf state
 * Undervolt/Overclock
 * Higher fan curve (https://github.com/grmat/amdgpu-fancontrol)

I know I have changed several things at once here, but I think it suggests that PowerPlay may be at fault when my system *does* crash (which is all the time without forcing the high performance state).

All details below:

# Forcing High Perf
echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

# Undervolt / Overclock
I also have done some messing around with voltages/clocks

$ cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1084Mhz        940mV
3:       1138Mhz        990mV
4:       1200Mhz       1040mV
5:       1401Mhz       1090mV
6:       1536Mhz       1140mV
7:       1630Mhz       1190mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        850Mhz        940mV
3:       1000Mhz       1100mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV
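
If you want to diff these tables between cards or kernels, a rough Python sketch for parsing the pp_od_clk_voltage output is below. The function name parse_od_table is hypothetical, and the parser only assumes the layout shown in this comment (OD_SCLK and OD_MCLK sections of "level: clock voltage" rows); other ASICs may print a different layout.

```python
import re

# One state row, e.g. "7:       1630Mhz       1190mV" (the case of "MHz" varies).
STATE = re.compile(r"^(\d+):\s+(\d+)MHz\s+(\d+)mV$", re.IGNORECASE)

def parse_od_table(text):
    """Parse OD_SCLK / OD_MCLK sections into {section: [(level, mhz, mv), ...]}."""
    tables = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("OD_") and line.endswith(":"):
            section = line[:-1]
            if section != "OD_RANGE":   # the range block has a different layout
                tables[section] = []
            continue
        m = STATE.match(line)
        if m and section in tables:
            tables[section].append(tuple(int(g) for g in m.groups()))
    return tables

sample = """\
OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1084Mhz        940mV
3:       1138Mhz        990mV
4:       1200Mhz       1040mV
5:       1401Mhz       1090mV
6:       1536Mhz       1140mV
7:       1630Mhz       1190mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        850Mhz        940mV
3:       1000Mhz       1100mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV
"""

tables = parse_od_table(sample)
print(tables["OD_SCLK"][-1])   # highest shader clock state
print(tables["OD_MCLK"][-1])   # highest memory clock state
```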


# Settings for AMDGPU Fancontrol
TEMPS=( 35000 70000 80000 )
PWMS=(     70   180   255 )
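
To show what this fan curve works out to, here is a small Python sketch. It assumes the tool interpolates linearly between the listed points and clamps outside them, which is how I understand amdgpu-fancontrol to behave but have not verified against its source. TEMPS are in millidegrees Celsius (the hwmon convention) and PWMS are duty values from 0 to 255. The function name pwm_for_temp is hypothetical.

```python
TEMPS = [35000, 70000, 80000]   # millidegrees Celsius
PWMS = [70, 180, 255]           # PWM duty, 0-255

def pwm_for_temp(temp, temps=TEMPS, pwms=PWMS):
    """Map a temperature to a PWM value by piecewise-linear interpolation."""
    if temp <= temps[0]:
        return pwms[0]          # clamp below the first point
    if temp >= temps[-1]:
        return pwms[-1]         # clamp above the last point
    for i in range(1, len(temps)):
        if temp <= temps[i]:
            lo_t, hi_t = temps[i - 1], temps[i]
            lo_p, hi_p = pwms[i - 1], pwms[i]
            # Integer arithmetic, as a shell script would do it.
            return lo_p + (temp - lo_t) * (hi_p - lo_p) // (hi_t - lo_t)

for t in (30000, 52500, 75000, 85000):
    print(t // 1000, "C ->", pwm_for_temp(t))
```

Under these assumptions the fan sits at PWM 70 up to 35 C, reaches 180 at 70 C, and is at full speed (255) from 80 C upward.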

Comment 114 Rodney A Morris 2019-10-19 21:26:58 UTC

To rule out possible hardware issues, I purchased another Vega 64 card, this time a factory-overclocked one. Since installing the card, I have experienced three lockups: two while playing Stellaris and one while playing a YouTube video. After playing Stellaris without issue two weeks ago, the computer locked up twice last night. While my previous problems seemed to be linked, in part, to a circular locking dependency, the latest logs indicate something different. I'm seeing a lot of powerplay errors after the fence timeout. I hope this new information provides some insight into the problem.

         /:-------------:\          rmorris@ezra.blanchardmorris.net 
       :-------------------::        -------------------------------- 
     :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
   /-----------omMMMNNNMMD  ---:     Kernel: 5.3.6-200.fc30.x86_64 
  :-----------sMMMMNMNMP.    ---:    Uptime: 16 hours, 21 mins 
 :-----------:MMMdP-------    ---\   Packages: 2214 (rpm), 36 (flatpak) 
,------------:MMMd--------    ---:   Shell: bash 5.0.7 
:------------:MMMd-------    .---:   Resolution: 2560x1440 
:----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
:--     .+shhhMMMmhhy++   .------/   WM: Mutter 
:-    -------:MMMd--------------:    WM Theme: Adwaita 
:-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
:-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
:-- :dMNdhhdNMMNo------------;       Terminal: tilix 
:---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
:------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
:---------------------://            Memory: 2814MiB / 32036MiB 


Card:

MSI Vega 64 OC (card works fine under Windows 10)

Game being played:

Stellaris

Native Game

Description of Event:
The screen goes blank while music and sound continue to play, then the computer locks up or reboots.

relevant dmesg from crash:
[ 4244.670269] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 4298.241156] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 4304.385587] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=60549844, emitted seq=60549846
[ 4304.385634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[ 4304.385637] amdgpu 0000:06:00.0: GPU reset begin!
[ 4304.402938] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4304.402945] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4304.402947] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4304.402948] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4304.404006] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4308.481068] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 4314.625180] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
[ 4324.865057] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[ 4335.105035] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
[ 4336.695112] amdgpu: [powerplay] No response from smu
[ 4336.695128] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[ 4338.307125] amdgpu: [powerplay] No response from smu
[ 4339.922039] amdgpu: [powerplay] No response from smu
[ 4339.922043] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[ 4341.541675] amdgpu: [powerplay] No response from smu
[ 4343.162102] amdgpu: [powerplay] No response from smu
[ 4343.162105] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[ 4343.221953] [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:634
[ 4343.221962] ------------[ cut here ]------------
[ 4343.222070] WARNING: CPU: 0 PID: 16500 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:332 generic_reg_wait.cold+0x31/0x53 [amdgpu]
[ 4343.222072] Modules linked in: rfcomm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables cmac bnep nct6775 hwmon_vid intel_rapl_msr intel_rapl_common vfat fat fuse x86_pkg_temp_thermal intel_powerclamp coretemp iwlmvm kvm_intel iTCO_wdt iTCO_vendor_support mac80211 kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic snd_hda_codec_hdmi libarc4 ledtrig_audio crct10dif_pclmul snd_hda_intel crc32_pclmul iwlwifi snd_hda_codec snd_hda_core btusb ghash_clmulni_intel btrtl intel_cstate snd_hwdep btbcm btintel intel_uncore snd_seq snd_seq_device intel_rapl_perf bluetooth
[ 4343.222099]  mxm_wmi cfg80211 snd_pcm joydev ecdh_generic ecc mei_me snd_timer rfkill snd mei i2c_i801 soundcore lpc_ich binfmt_misc auth_rpcgss sunrpc amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper crc32c_intel uas mpt3sas igb drm e1000e nvme usb_storage dca i2c_algo_bit raid_class nvme_core scsi_transport_sas wmi
[ 4343.222114] CPU: 0 PID: 16500 Comm: kworker/0:1 Not tainted 5.3.6-200.fc30.x86_64+debug #1
[ 4343.222115] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
[ 4343.222119] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 4343.222167] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
[ 4343.222169] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 f8 9d 73 c0 e8 60 46 b0 fa 83 7b 20 01 0f 84 02 ee fd ff 48 c7 c7 f0 9c 73 c0 e8 4a 46 b0 fa <0f> 0b e9 ef ed fd ff 48 c7 c7 f0 9c 73 c0 89 54 24 04 e8 33 46 b0
[ 4343.222170] RSP: 0018:ffffabda8729b690 EFLAGS: 00010246
[ 4343.222172] RAX: 0000000000000024 RBX: ffff9ceeab58f700 RCX: 0000000000000006
[ 4343.222173] RDX: 0000000000000000 RSI: ffff9ceeb50c8e50 RDI: ffff9ceebe5d9e00
[ 4343.222174] RBP: 000000000000000a R08: 000003f33c33ca38 R09: 0000000000000000
[ 4343.222175] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000035af
[ 4343.222176] R13: 0000000000000dad R14: 0000000000000001 R15: 0000000000000dac
[ 4343.222178] FS:  0000000000000000(0000) GS:ffff9ceebe400000(0000) knlGS:0000000000000000
[ 4343.222179] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4343.222180] CR2: 00007f1480ef70c0 CR3: 0000000703f30002 CR4: 00000000003606f0
[ 4343.222182] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4343.222183] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4343.222184] Call Trace:
[ 4343.222237]  dce_mi_free_dmif+0xef/0x150 [amdgpu]
[ 4343.222285]  dce110_reset_hw_ctx_wrap+0x15f/0x200 [amdgpu]
[ 4343.222333]  dce110_apply_ctx_to_hw+0x4b/0x530 [amdgpu]
[ 4343.222365]  ? amdgpu_pm_compute_clocks+0xc9/0x5f0 [amdgpu]
[ 4343.222414]  ? dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
[ 4343.222461]  dc_commit_state+0x26b/0x590 [amdgpu]
[ 4343.222514]  amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu]
[ 4343.222521]  ? __lock_acquire+0x247/0x1910
[ 4343.222525]  ? find_held_lock+0x32/0x90
[ 4343.222529]  ? find_held_lock+0x32/0x90
[ 4343.222533]  ? sched_clock+0x5/0x10
[ 4343.222536]  ? mark_held_locks+0x50/0x80
[ 4343.222540]  ? __lock_acquire+0x247/0x1910
[ 4343.222545]  ? wake_up_klogd+0x37/0x40
[ 4343.222549]  ? find_held_lock+0x32/0x90
[ 4343.222552]  ? mark_held_locks+0x50/0x80
[ 4343.222556]  ? _raw_spin_unlock_irq+0x29/0x40
[ 4343.222559]  ? lockdep_hardirqs_on+0xf0/0x180
[ 4343.222561]  ? _raw_spin_unlock_irq+0x29/0x40
[ 4343.222564]  ? wait_for_completion_timeout+0x75/0x190
[ 4343.222576]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[ 4343.222622]  ? amdgpu_dm_audio_eld_notify+0x60/0x60 [amdgpu]
[ 4343.222628]  commit_tail+0x3c/0x70 [drm_kms_helper]
[ 4343.222634]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[ 4343.222640]  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[ 4343.222647]  drm_atomic_helper_suspend+0x66/0x100 [drm_kms_helper]
[ 4343.222698]  dm_suspend+0x20/0x60 [amdgpu]
[ 4343.222726]  amdgpu_device_ip_suspend_phase1+0x91/0xc0 [amdgpu]
[ 4343.222755]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 4343.222801]  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[ 4343.222849]  amdgpu_device_gpu_recover+0x260/0x934 [amdgpu]
[ 4343.222893]  amdgpu_job_timedout+0x115/0x140 [amdgpu]
[ 4343.222899]  drm_sched_job_timedout+0x44/0xa0 [gpu_sched]
[ 4343.222903]  process_one_work+0x272/0x5a0
[ 4343.222908]  worker_thread+0x50/0x3b0
[ 4343.222915]  kthread+0x108/0x140
[ 4343.222916]  ? process_one_work+0x5a0/0x5a0
[ 4343.222918]  ? kthread_park+0x80/0x80
[ 4343.222921]  ret_from_fork+0x3a/0x50
[ 4343.222929] irq event stamp: 82808
[ 4343.222931] hardirqs last  enabled at (82807): [<ffffffffbb1716eb>] console_unlock+0x46b/0x5d0
[ 4343.222935] hardirqs last disabled at (82808): [<ffffffffbb0038da>] trace_hardirqs_off_thunk+0x1a/0x20
[ 4343.222938] softirqs last  enabled at (82794): [<ffffffffbbe0035d>] __do_softirq+0x35d/0x45d
[ 4343.222942] softirqs last disabled at (82787): [<ffffffffbb0f2077>] irq_exit+0xf7/0x100
[ 4343.222943] ---[ end trace 71731c9cc205c24d ]---
[ 4344.758203] amdgpu: [powerplay] No response from smu
[ 4346.363061] amdgpu: [powerplay] No response from smu
[ 4346.363065] amdgpu: [powerplay] Failed to send message: 0x26, ret value: 0x0
[ 4347.973948] amdgpu: [powerplay] No response from smu
[ 4349.588168] amdgpu: [powerplay] No response from smu
[ 4349.588173] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0x0
[ 4351.152764] amdgpu: [powerplay] No response from smu
[ 4352.722063] amdgpu: [powerplay] No response from smu
[ 4352.722068] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0x0
[ 4354.325541] amdgpu: [powerplay] No response from smu
[ 4355.924138] amdgpu: [powerplay] No response from smu
[ 4355.924141] amdgpu: [powerplay] Failed to send message: 0x63, ret value: 0x0
[ 4357.537736] amdgpu: [powerplay] No response from smu
[ 4359.154141] amdgpu: [powerplay] No response from smu
[ 4359.154146] amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0x0
[ 4360.760856] amdgpu: [powerplay] No response from smu
[ 4362.372410] amdgpu: [powerplay] No response from smu
[ 4362.372414] amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xa0b000, error code: 0x0
[ 4363.985961] amdgpu: [powerplay] No response from smu
[ 4365.599325] amdgpu: [powerplay] No response from smu
[ 4365.599331] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[ 4367.214945] amdgpu: [powerplay] No response from smu
[ 4368.829650] amdgpu: [powerplay] No response from smu
[ 4368.829655] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[ 4370.443783] amdgpu: [powerplay] No response from smu
[ 4372.057288] amdgpu: [powerplay] No response from smu
[ 4372.057293] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[ 4372.074301] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.074308] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.074310] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.074312] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.074569] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.091832] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.091837] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.091839] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.091840] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.091889] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.109371] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.109376] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.109378] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.109380] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.126998] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.127002] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.127009] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.127021] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.127024] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.127083] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.144452] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.144457] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.144458] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.144460] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.144514] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.161992] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.161997] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.161999] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.162001] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.162086] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.179534] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.179538] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.179540] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.179542] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.179674] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.197074] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.197079] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 4372.197081] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[ 4372.197082] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[ 4372.197131] pcieport 0000:00:03.0: AER: Device recovery failed
[ 4372.214616] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:03.0
[ 4372.267239] amdgpu: [powerplay] Failed to send message: 0x61, ret value: 0xffffffff
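
For reading the AER lines above, the "error status/mask=00004000/00000000" field can be decoded with a short Python sketch. The bit names are the abbreviations the Linux AER driver uses for uncorrectable errors, reproduced here from memory as a partial table, so treat them as an assumption and check them against the kernel source. Only bit 14 (CmpltTO, completion timeout) is confirmed by this log. The function name decode_aer is hypothetical, and ignoring masked bits is a choice made in this sketch.

```python
# Partial table of uncorrectable error status bits (assumed, see note above).
UNCORRECTABLE_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
}

def decode_aer(status_mask):
    """Decode a 'status/mask' hex pair into a list of (bit, name) tuples."""
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    reported = status & ~mask   # this sketch ignores bits that are masked
    return [
        (bit, UNCORRECTABLE_BITS.get(bit, "unknown"))
        for bit in range(32)
        if (reported >> bit) & 1
    ]

print(decode_aer("00004000/00000000"))
```

For the value in this log, only bit 14 is set, which matches the "[14] CmpltTO" line printed right after it: the root port at 0000:00:03.0 timed out waiting for a completion.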

Relevant journalctl messages:

Oct 18 21:49:47 ezra.blanchardmorris.net kernel: perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=60549844, emitted seq=60549846
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset begin!
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
Oct 18 21:50:47 ezra.blanchardmorris.net kernel: pcieport 0000:00:03.0: AER: Device recovery failed
Oct 18 21:50:51 ezra.blanchardmorris.net kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Oct 18 21:50:57 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
Oct 18 21:51:07 ezra.blanchardmorris.net kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Oct 18 21:51:18 ezra.blanchardmorris.net kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
Oct 18 21:51:19 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:19 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
Oct 18 21:51:21 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:22 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:22 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
Oct 18 21:51:24 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:634
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: ------------[ cut here ]------------
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: WARNING: CPU: 0 PID: 16500 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:332 generic_reg_wait.cold+0x31/0x53 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: Modules linked in: rfcomm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables cmac bnep nct6775 hwmon_vid intel_rapl_msr intel_rapl_common vfat fat fuse x86_pkg_temp_thermal intel_powerclamp coretemp iwlmvm kvm_intel iTCO_wdt iTCO_vendor_support mac80211 kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic snd_hda_codec_hdmi libarc4 ledtrig_audio crct10dif_pclmul snd_hda_intel crc32_pclmul iwlwifi snd_hda_codec snd_hda_core btusb ghash_clmulni_intel btrtl intel_cstate snd_hwdep btbcm btintel intel_uncore snd_seq snd_seq_device intel_rapl_perf bluetooth
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  mxm_wmi cfg80211 snd_pcm joydev ecdh_generic ecc mei_me snd_timer rfkill snd mei i2c_i801 soundcore lpc_ich binfmt_misc auth_rpcgss sunrpc amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper crc32c_intel uas mpt3sas igb drm e1000e nvme usb_storage dca i2c_algo_bit raid_class nvme_core scsi_transport_sas wmi
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: CPU: 0 PID: 16500 Comm: kworker/0:1 Not tainted 5.3.6-200.fc30.x86_64+debug #1
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 f8 9d 73 c0 e8 60 46 b0 fa 83 7b 20 01 0f 84 02 ee fd ff 48 c7 c7 f0 9c 73 c0 e8 4a 46 b0 fa <0f> 0b e9 ef ed fd ff 48 c7 c7 f0 9c 73 c0 89 54 24 04 e8 33 46 b0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: RSP: 0018:ffffabda8729b690 EFLAGS: 00010246
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: RAX: 0000000000000024 RBX: ffff9ceeab58f700 RCX: 0000000000000006
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: RDX: 0000000000000000 RSI: ffff9ceeb50c8e50 RDI: ffff9ceebe5d9e00
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: RBP: 000000000000000a R08: 000003f33c33ca38 R09: 0000000000000000
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000035af
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: R13: 0000000000000dad R14: 0000000000000001 R15: 0000000000000dac
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: FS:  0000000000000000(0000) GS:ffff9ceebe400000(0000) knlGS:0000000000000000
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: CR2: 00007f1480ef70c0 CR3: 0000000703f30002 CR4: 00000000003606f0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: Call Trace:
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  dce_mi_free_dmif+0xef/0x150 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  dce110_reset_hw_ctx_wrap+0x15f/0x200 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  dce110_apply_ctx_to_hw+0x4b/0x530 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? amdgpu_pm_compute_clocks+0xc9/0x5f0 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  dc_commit_state+0x26b/0x590 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? __lock_acquire+0x247/0x1910
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? find_held_lock+0x32/0x90
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? find_held_lock+0x32/0x90
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? sched_clock+0x5/0x10
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? mark_held_locks+0x50/0x80
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? __lock_acquire+0x247/0x1910
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? wake_up_klogd+0x37/0x40
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? find_held_lock+0x32/0x90
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? mark_held_locks+0x50/0x80
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? _raw_spin_unlock_irq+0x29/0x40
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? lockdep_hardirqs_on+0xf0/0x180
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? _raw_spin_unlock_irq+0x29/0x40
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? wait_for_completion_timeout+0x75/0x190
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? commit_tail+0x3c/0x70 [drm_kms_helper]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? amdgpu_dm_audio_eld_notify+0x60/0x60 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  commit_tail+0x3c/0x70 [drm_kms_helper]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  drm_atomic_helper_suspend+0x66/0x100 [drm_kms_helper]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  dm_suspend+0x20/0x60 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  amdgpu_device_ip_suspend_phase1+0x91/0xc0 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  amdgpu_device_gpu_recover+0x260/0x934 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  amdgpu_job_timedout+0x115/0x140 [amdgpu]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  drm_sched_job_timedout+0x44/0xa0 [gpu_sched]
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  process_one_work+0x272/0x5a0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  worker_thread+0x50/0x3b0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  kthread+0x108/0x140
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? process_one_work+0x5a0/0x5a0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ? kthread_park+0x80/0x80
Oct 18 21:51:26 ezra.blanchardmorris.net kernel:  ret_from_fork+0x3a/0x50
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: irq event stamp: 82808
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: hardirqs last  enabled at (82807): [<ffffffffbb1716eb>] console_unlock+0x46b/0x5d0
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: hardirqs last disabled at (82808): [<ffffffffbb0038da>] trace_hardirqs_off_thunk+0x1a/0x20
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: softirqs last  enabled at (82794): [<ffffffffbbe0035d>] __do_softirq+0x35d/0x45d
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: softirqs last disabled at (82787): [<ffffffffbb0f2077>] irq_exit+0xf7/0x100
Oct 18 21:51:26 ezra.blanchardmorris.net kernel: ---[ end trace 71731c9cc205c24d ]---
Oct 18 21:51:27 ezra.blanchardmorris.net abrt-dump-journal-oops[1493]: abrt-dump-journal-oops: Found oopses: 1
Oct 18 21:51:27 ezra.blanchardmorris.net abrt-dump-journal-oops[1493]: abrt-dump-journal-oops: Creating problem directories
Oct 18 21:51:27 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:28 ezra.blanchardmorris.net abrt-dump-journal-oops[1493]: Reported 1 kernel oopses to Abrt
Oct 18 21:51:29 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:29 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed to send message: 0x26, ret value: 0x0
Oct 18 21:51:30 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:32 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:32 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0x0
Oct 18 21:51:34 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:35 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:35 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0x0
Oct 18 21:51:37 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:38 ezra.blanchardmorris.net abrt-server[16691]: Can't find a meaningful backtrace for hashing in '.'
Oct 18 21:51:38 ezra.blanchardmorris.net abrt-server[16691]: Option 'DropNotReportableOopses' is not configured
Oct 18 21:51:38 ezra.blanchardmorris.net abrt-server[16691]: Preserving oops '.' because DropNotReportableOopses is 'no'
Oct 18 21:51:38 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:38 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed to send message: 0x63, ret value: 0x0
Oct 18 21:51:40 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:42 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:42 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0x0
Oct 18 21:51:42 ezra.blanchardmorris.net abrt-notification[16713]: System encountered a non-fatal error in ??()
Oct 18 21:51:43 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
Oct 18 21:51:45 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No response from smu
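
The powerplay failures in a log like the one above can be tallied per message ID, which makes it easier to compare crashes across reports. This is only a minimal sketch: the function name `tally_smu_failures` is made up for illustration, and it assumes the two line formats seen in this thread ("Failed message: 0x.." and "Failed to send message: 0x..").

```shell
#!/bin/bash
# Hypothetical helper: count failed SMU message IDs in a kernel log read from stdin.
# Matches the two formats seen in this thread:
#   "Failed message: 0x4c, input parameter: ..."
#   "Failed to send message: 0x26, ret value: ..."
tally_smu_failures() {
  grep -oE 'Failed (to send )?message: 0x[0-9a-fA-F]+' \
    | sed 's/.*: //' \
    | sort | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{ print $2, $1 }'   # print "ID count", most frequent first
}
```

Possible usage, for example on the kernel log of the previous boot: `journalctl -k -b -1 | tally_smu_failures`.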

Comment 115 Rodney A Morris 2019-10-19 21:27:39 UTC

Created attachment 145776
Full dmesg from crash

Full dmesg from crash

Comment 116 Rodney A Morris 2019-10-19 21:28:18 UTC

Created attachment 145777
Full journal from start to crash

Full journalctl from start to crash.

Comment 117 haro41 2019-10-21 16:24:35 UTC

...are these crashes more frequent with VSYNC enabled?

If so, it could be the same issue as this bug:

https://bugs.freedesktop.org/show_bug.cgi?id=110777

Comment 118 Rodney A Morris 2019-10-23 01:52:58 UTC

(In reply to haro41 from comment #117)
> ...are this craches more frequently with VSYNC enabled?
> 
> If yes, it could be the same thing like this bug:
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=110777

vsync is definitely on for both Stellaris and Hearts of Iron.

I looked over the bug report you linked to.  It is very interesting and I will follow with interest.  The next time I play Stellaris or Hearts of Iron IV, I will have to see if I can record my memory frequency values to see if they are indeed not moving off the base frequency under low load with v-sync enabled.  The problem manifesting under low load would explain why I cannot replicate the problem while running Unigine Superposition.

I began to wonder if powerplay and the frequency at which the chip and memory were operating were not the problem after reading the following bug report for Vega 20:

https://bugs.freedesktop.org/show_bug.cgi?id=110674

Last Friday, I attempted to capture the operating frequency and temps, but my attempt utterly failed.

I will disable v-sync, see if that improves things, and report back here.  If I manage to capture frequency data, I will report back here and maybe in your thread.

Comment 119 haro41 2019-10-23 08:51:29 UTC

Below is a simple script I use to record dpm data in the background:

######################################################
#!/bin/bash

# adapt this sample interval (seconds)
SLEEP_INTERVAL=0.05

# adapt the paths to your need
FILE_SCLK=/sys/class/drm/card0/device/hwmon/hwmon0/freq1_input
FILE_MCLK=/sys/class/drm/card0/device/hwmon/hwmon0/freq2_input
FILE_PWM=/sys/class/drm/card0/device/hwmon/hwmon0/pwm1
FILE_TEMP=/sys/class/drm/card0/device/hwmon/hwmon0/temp1_input
FILE_FAN=/sys/class/drm/card0/device/hwmon/hwmon0/fan1_input
FILE_GFXVDD=/sys/class/drm/card0/device/hwmon/hwmon0/in0_input
FILE_POW=/sys/class/drm/card0/device/hwmon/hwmon0/power1_average
FILE_BUS=/sys/class/drm/card0/device/gpu_busy_percent

# checking for privileges
if [ $UID -ne 0 ]
then
  echo "Writing to sysfs requires privileges, relaunch as root"
  exit 1
fi

function read_output {
  
  SCLK=$(cat $FILE_SCLK)
  MCLK=$(cat $FILE_MCLK)
  TEMP=$(cat $FILE_TEMP)
  FAN=$(cat $FILE_FAN)
  GFXVDD=$(cat $FILE_GFXVDD)
  POW=$(cat $FILE_POW)
  BUS=$(cat $FILE_BUS)

#  echo "sclk: $SCLK mclk: $MCLK gfx_vdd: $GFXVDD"
  echo "sclk: $SCLK mclk: $MCLK temp: $TEMP fan: $FAN gfx_vdd: $GFXVDD pow: $POW bus: $BUS"
}

function run_daemon {
  while :; do
    read_output
    sleep $SLEEP_INTERVAL
  done
}

# finally start the loop
run_daemon

######################################################
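
If the output of the script above is redirected to a file, the number of memory clock changes can be counted afterwards. The following is a rough sketch, not part of the original script: `count_mclk_switches` is a hypothetical name, and it assumes the "mclk: <value>" field layout that the script's echo line produces.

```shell
#!/bin/bash
# Hypothetical helper: count how often the mclk value changes between
# consecutive samples. Expects the logging script's output on stdin,
# i.e. lines containing "... mclk: <value> ...".
count_mclk_switches() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "mclk:") {
          # Compare with the previous sample, skipping the very first one.
          if (seen && $(i + 1) != prev) switches++
          prev = $(i + 1)
          seen = 1
          break
        }
      }
    }
    END { print switches + 0 }
  '
}
```

Possible usage, assuming the samples were written to a file named dpm.log (a name chosen only for this example): `count_mclk_switches < dpm.log`. A high count during a vsync-limited game would fit the mclk switching theory discussed in this thread.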

Comment 120 blppt 2019-10-24 03:12:28 UTC

I don't have anything to attach here, but I see the same issue: Ubuntu 19.04, kernel 5.4-rc3, Vega 64 W/C, Mesa 19.3.0. It only seems to occur with DXVK and not D9VK for some reason.

Example: GW2 (DX9 game) will work perfectly under heavy load in WvW with massive zergs for hours with no crash, but FFXIV (DX11) will always lock the entire system up after a time.

That being said, when you force the top clock using

echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level

and

echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk

FFXIV no longer locks the system at all. It does eat up a good deal more watts according to my UPS meter though, so resetting to auto is necessary IMHO.
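
Resetting to auto after a gaming session can be scripted. The following is a minimal sketch under a few assumptions: `set_perf_level` is a hypothetical helper name, it only accepts a subset of the values the driver knows (auto, low, high, manual), and it has to run as root on the real sysfs path.

```shell
#!/bin/bash
# Hypothetical helper: write a performance level to
# DEVICE_DIR/power_dpm_force_performance_level.
# Only a subset of the driver's levels is accepted here.
set_perf_level() {
  local dev=$1 level=$2
  case $level in
    auto|low|high|manual) ;;
    *) echo "unsupported level: $level" >&2; return 1 ;;
  esac
  printf '%s\n' "$level" > "$dev/power_dpm_force_performance_level"
}
```

Possible usage as root: `set_perf_level /sys/class/drm/card0/device manual` before forcing the clock, and `set_perf_level /sys/class/drm/card0/device auto` afterwards to hand control back to the driver.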

So, it sounds like you guys are on the right track with the whole "power management" thing being the culprit. Just wanted to add my experience to this.

(And yes, echoing the guy above, the exact same system is stable in Windows 10, so it's not a hardware issue.)

Comment 121 Mauro Gaspari 2019-10-24 04:58:21 UTC

(In reply to blppt from comment #120)
> I dont have anything to attach here, but same issue here, ubuntu 19.04,
> kernel 5.4-rc3, vega64 W/C, Mesa 19.3.0 -- it only seems to occur with DXVK
> and not D9VK for some reason.
> 
> Example: GW2 (DX9 game) will work perfectly under heavy load in WvW with
> massive zergs for hours with no crash, but FFXIV (DX11) will always lock the
> entire system up after a time.
> 
> That being said, when you force the top clock using
> 
> echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
> 
> and
> 
> echo 7 > /sys/class/drm/card0/device/pp_dpm_sclk
> 
> FFXIV no longer locks the system at all. It does eat up a good deal more
> watts according to my UPS meter though, so resetting to auto is necessary
> IMHO.
> 
> So, it sounds like you guys are on the right track with the whole "power
> management" thing being the culprit. Just wanted to add my experience to
> this.
> 
> (and yes, echoing the guy above, the exact same system is stable in windows
> 10, so its not a hardware issue).

I agree with this. I am having a much better experience myself, even without the commands to force the power performance level, by doing the following:
- change game to windowed or full-screen borderless (fixed window)
- disable vsync
- disable frame limiter

With all three of the above, it seems that the GPU is forced into its max power state all the time while playing. I have been using this method for a few days with DXVK games and have had no freezes so far.

But again this is just a temporary workaround. So is the command to manually force high power performance level. Hopefully a permanent fix comes with AMDGPU/Kernel updates.

Comment 122 haro41 2019-10-24 09:09:14 UTC

In my experience, this issue is related to mclk switching and it affects the lowest mclk level only.

So you guys can save a lot of power if, instead of switching to the highest gfx level or disabling vsync, you just disable the lowest mclk level with:

echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk

If you are building your kernel locally, look in this thread for a driver code modification that works without disabling the lowest mclk level (saves a few watts at idle).
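
The "1 2 3" list depends on how many mclk levels the card exposes, so it can also be derived from the pp_dpm_mclk output instead of being hard-coded. This is just a sketch: `mclk_levels_without_lowest` is a hypothetical name, and it assumes the "N: <freq>Mhz" line format and that level 0 is the lowest level.

```shell
#!/bin/bash
# Hypothetical helper: read pp_dpm_mclk-style text on stdin
# (lines such as "0: 167Mhz" or "1: 500Mhz *") and print all
# level indices except 0, separated by spaces.
mclk_levels_without_lowest() {
  awk -F: '
    /^[0-9]+:/ && $1 + 0 != 0 { printf "%s%s", sep, $1; sep = " " }
    END { print "" }
  '
}
```

Possible usage as root, after switching to manual mode: `mclk_levels_without_lowest < /sys/class/drm/card0/device/pp_dpm_mclk > /tmp/levels` and then `cat /tmp/levels > /sys/class/drm/card0/device/pp_dpm_mclk` (the temporary file name is only an example).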

Comment 123 haro41 2019-10-24 09:10:34 UTC

... i forgot the link to a related thread:


https://bugs.freedesktop.org/show_bug.cgi?id=110777

Comment 124 blppt 2019-10-29 19:00:25 UTC

(In reply to haro41 from comment #122)
> In my experience, this issue is related to mclk switching and it affects the
> lowest mclk level only.
> 
> So you guy's can save a lot of power, if you, insteed of switching to
> highest gfxlevel or to disable vsync, just disable the lowest mclk level by:
> 
> echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk
> 
> If you are building your kernel locally, look in this thread for a driver
> code modification that works, without disabling the lowest mclk level (saves
> a few watt on idle).

Ooh, that seems to have solved it. Haven't had a crash yet, ran The Outer Worlds for hours (addicting game!), ran FFXIV, ran GW2, no lockups. And, if there is much of a difference at idle in watt usage, I don't see it on the UPS meter.

Thanks a million!

(Also of note: when using Valve's ACO, as others have noted, you don't even have to do the above to (apparently) solve the problem. Unfortunately, that has other issues: my V64 won't clock up high enough when using ACO for some reason, so I don't use it.)

Comment 125 haro41 2019-11-05 18:01:08 UTC

... thanks for your feedback, so it seems we are faced with the same bug ...

By the way, I also got crashes with at least one Vulkan game with the ACO compiler backend enabled.
I think it really depends on the load pattern. Enabled vsync triggers the typical load pattern, with at least one transient (from high to low load) per frame.

Is anyone here who is affected by this bug usually building the kernel from source locally?

Comment 126 Rodney A Morris 2019-11-06 02:46:02 UTC

(In reply to haro41 from comment #125)
> ... thanks for your feedback, so it seems we are faced with the same bug ...
> 
> Btw, i got crashes with at least one vulkan game and ACO compiler backend
> enabled too.
> I think it really depends of the load pattern. And enabled vsync is
> triggering the typical load pattern, with at least one transient (from high
> to low load) per frame.
> 
> Is someone affected with this bug here, usually building the kernel from
> source locally?

If you want someone to apply your changes from bug report no. 110777 to the kernel for testing, I can do so, but I will not be able to get to it until this weekend.

As a side note, I've had great success manually limiting the memory clock to level 1,2,3 on my Vega 64.  I've played over 7 hours of Stellaris without a crash.

Comment 127 haro41 2019-11-06 09:49:49 UTC

(In reply to Rodney A Morris from comment #126)
> If you want someone to apply your changes in bug report no. 110777 to the
> kernel for testing, I can so but will not be to it until this weekend. 
 
... thanks for your reply. Yes, that was the idea, and it would be very nice ...

Since I think the proposed fix is more relevant to this very thread, let me repeat the proposed patch here:

in 'drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c':

static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
                bool has_disp)
{
	smum_send_msg_to_smc_with_parameter(hwmgr,
	                                    PPSMC_MSG_SetUclkFastSwitch,
	                                    has_disp ? 1 : 0);
/* proposed fix for crashes caused by frequent mclk level 0/1 switching */
	smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkDownHyst, 1);
}

Only the module 'amdgpu.ko' needs to be rebuilt and copied, like this:

$ cd /home/user/linux-5.x.x && make -j8 -C . M=drivers/gpu/drm/amd/amdgpu

# cp /home/user/linux-5.x.x/drivers/gpu/drm/amd/amdgpu/amdgpu.ko /lib/modules/5.x.x/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko && update-initramfs -u

... 'user' and 'x.x' have to be adapted, most likely ...
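
The 'x.x' part of the destination can be taken from the running kernel instead of being typed by hand. A small sketch follows, with some caveats: `amdgpu_module_path` is a hypothetical helper, some distributions install compressed modules (for example amdgpu.ko.xz), in which case the file name differs, and `update-initramfs` is the Debian/Ubuntu tool, so other distributions (for example Fedora with dracut) regenerate the initramfs differently.

```shell
#!/bin/bash
# Hypothetical helper: print the install path of an uncompressed amdgpu.ko
# for a given kernel release (default: the running kernel, via uname -r).
amdgpu_module_path() {
  local kver=${1:-$(uname -r)}
  printf '/lib/modules/%s/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko\n' "$kver"
}
```

Possible usage as root, from the top of the kernel source tree: `cp drivers/gpu/drm/amd/amdgpu/amdgpu.ko "$(amdgpu_module_path)"`. This only makes sense when the running kernel is the one the module was built against.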

Comment 128 haro41 2019-11-06 10:23:39 UTC

Created attachment 145901
proposed fix for crashes, caused by frequent mclk level 0/1 switches

At least one of the causes of the crashes occurs more frequently if vsync is enabled.

In this case, memory clock levels are usually switched more frequently.
By experiment I found that the transition between level 1 and level 0 in particular is critical. The fact that disabling memory level 0 helps as a workaround confirms that this approach points in the right direction.

Result of further experiments:
By sending a 'PPSMC_MSG_SetUclkDownHyst' message to the SMC (enabling a hysteresis feature?), the crashes can be avoided, even with mclk level 0 enabled and vsync activated.

Comment 129 Wilko Bartels 2019-11-06 17:32:50 UTC

(In reply to haro41 from comment #122)
> In my experience, this issue is related to mclk switching and it affects the
> lowest mclk level only.
> 
> So you guy's can save a lot of power, if you, insteed of switching to
> highest gfxlevel or to disable vsync, just disable the lowest mclk level by:
> 
> echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk
> 
> If you are building your kernel locally, look in this thread for a driver
> code modification that works, without disabling the lowest mclk level (saves
> a few watt on idle).

Do you have any suggestion to automate this? So far I can only run these commands after su. Not even sudo works with scripts running these commands, nor do systemd files.
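
One common reason for sudo failing here (an assumption, since the exact command is not shown): in `sudo echo manual > file` the redirection is performed by the calling, unprivileged shell before sudo even runs, so the write is denied. Piping into `sudo tee` avoids that. A minimal sketch, with `write_sysfs` as a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: write VALUE to PATH. If the current user cannot
# write the file directly, pipe the value through "sudo tee" so that the
# file is opened by a privileged process rather than by this shell.
write_sysfs() {
  local path=$1 value=$2
  if [ -w "$path" ]; then
    printf '%s\n' "$value" > "$path"
  else
    printf '%s\n' "$value" | sudo tee "$path" > /dev/null
  fi
}
```

Possible usage: `write_sysfs /sys/class/drm/card0/device/power_dpm_force_performance_level manual` followed by `write_sysfs /sys/class/drm/card0/device/pp_dpm_mclk "1 2 3"`.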

Comment 130 haro41 2019-11-06 18:32:31 UTC

> > 
> > echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> > echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk
> > 
> 
> do you have any suggestion to automate this? so far i can strictly run these
> commands after su. not even sudo works with scripts running these commands.
> or systemd files.

Currently I use my patch (see above) to work around the crashes.
If you prefer not to touch your kernel, you could create a systemd service: 

# cat /etc/systemd/system/amd-pp.service: 

[Unit]
Description=AMD PP adjust service
[Service]
User=root
Group=root
GuessMainPID=no
ExecStart=/srv/amdgpu-pp.sh
[Install]
WantedBy=multi-user.target
---------------------------------------------------------------
# cat /srv/amdgpu-pp.sh:

#!/bin/bash
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk
---------------------------------------------------------------
#systemctl enable amd-pp.service
#systemctl start amd-pp.service
---------------------------------------------------------------

... assuming you have 'amdgpu.ppfeaturemask=0xffffffff' set ...

Comment 131 Wilko Bartels 2019-11-06 19:26:11 UTC

(In reply to haro41 from comment #130)
> > > 
> > > echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> > > echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk
> > > 
> > 
> > do you have any suggestion to automate this? so far i can strictly run these
> > commands after su. not even sudo works with scripts running these commands.
> > or systemd files.
> 
> Currently i use my patch (see above) to workaround the crashes.
> If you prefer not to touch your kernel, you could create a systemd service: 
> 
> # cat /etc/systemd/system/amd-pp.service: 
> 
> [Unit]
> Description=AMD PP adjust service
> [Service]
> User=root
> Group=root
> GuessMainPID=no
> ExecStart=/srv/amdgpu-pp.sh
> [Install]
> WantedBy=multi-user.target
> ---------------------------------------------------------------
> # cat /srv/amdgpu-pp.sh:
> 
> #!/bin/bash
> echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> echo "1 2 3" > /sys/class/drm/card0/device/pp_dpm_mclk
> ---------------------------------------------------------------
> #systemctl enable amd-pp.service
> #systemctl start amd-pp.service
> ---------------------------------------------------------------
> 
> ... assuming you have 'amdgpu.ppfeaturemask=0xffffffff' set ...

Thank you. I already tried exactly that, and the unit is unable to autostart (permission denied). Only a manual systemctl start works. I don't know why.

I would try to patch the kernel instead if I had any clue how to do the steps.

Comment 132 haro41 2019-11-07 10:25:58 UTC

(In reply to Wilko Bartels from comment #131)
> Thank you. I already tried exactly that. And the unit unable to autostart
> (permission denied). Only manual systemctl start works. Dont know why. 

If you have double-checked the permissions of both the .service and the .sh files,
you could try delaying the automatic service start, for example by replacing:

'WantedBy=multi-user.target' with 'WantedBy=graphical.target'

and maybe insert a line in the [Unit] section: 'After=multi-user.target'

Comment 133 Wilko Bartels 2019-11-07 16:50:10 UTC

(In reply to haro41 from comment #132)
> (In reply to Wilko Bartels from comment #131)
> > Thank you. I already tried exactly that. And the unit unable to autostart
> > (permission denied). Only manual systemctl start works. Dont know why. 
> 
> If you double checked the permissions of both, the .service and the .sh
> files,
> you could try delay the automatic service start, for example by replacing:
> 
> 'WantedBy=multi-user.target' with 'WantedBy=graphical.target'
> 
> and maybe insert a line in the [Unit] section: 'After=multi-user.target'

Sadly, that doesn't change a thing:
line 2: /sys/class/drm/card0/device/power_dpm_force_performance_level: Permission denied

line 3: /sys/class/drm/card0/device/pp_dpm_mclk: Permission denied
amd-pp.service: Main process exited, code=exited, status=1/FAILURE

-rw-r--r-- 1 root root 4,0K  7. Nov 17:45 /sys/class/drm/card0/device/power_dpm_force_performance_level

-rw-r--r-- 1 root root 4,0K  7. Nov 17:45 /sys/class/drm/card0/device/pp_dpm_mclk

Again, after logging in (i3/xinit or plasma/sddm), I get no errors with systemctl start and it works.

[jason@behemoth ~]$ cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 167Mhz 
1: 500Mhz *
2: 700Mhz 
3: 800Mhz
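
To check whether the workaround is in effect, the active level (the line marked with `*`) can be extracted from this output. The snippet below is only a sketch: `active_mclk_level` is a hypothetical name, and it assumes the line format shown above.

```shell
#!/bin/bash
# Hypothetical helper: read pp_dpm_mclk-style text on stdin and print the
# index of the level whose line ends with "*" (the currently active one).
active_mclk_level() {
  awk -F: '/\*[ \t]*$/ { print $1; exit }'
}
```

Possible usage: `active_mclk_level < /sys/class/drm/card0/device/pp_dpm_mclk`. With the lowest level disabled, repeated calls should never print 0.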

Comment 134 Wilko Bartels 2019-11-12 11:03:54 UTC

(In reply to Wilko Bartels from comment #133)
> (In reply to haro41 from comment #132)
> > (In reply to Wilko Bartels from comment #131)
> > > Thank you. I already tried exactly that. And the unit unable to autostart
> > > (permission denied). Only manual systemctl start works. Dont know why. 
> > 
> > If you double checked the permissions of both, the .service and the .sh
> > files,
> > you could try delay the automatic service start, for example by replacing:
> > 
> > 'WantedBy=multi-user.target' with 'WantedBy=graphical.target'
> > 
> > and maybe insert a line in the [Unit] section: 'After=multi-user.target'
> 
> sadly that doesnt change a thing
> line 2: /sys/class/drm/card0/device/power_dpm_force_performance_level:
> Permission denied
> 
> line 3: /sys/class/drm/card0/device/pp_dpm_mclk: Permission denied
> amd-pp.service: Main process exited, code=exited, status=1/FAILURE
> 
> -rw-r--r-- 1 root root 4,0K  7. Nov 17:45
> /sys/class/drm/card0/device/power_dpm_force_performance_level
> 
> -rw-r--r-- 1 root root 4,0K  7. Nov 17:45
> /sys/class/drm/card0/device/pp_dpm_mclk
> 
> again after logging (i3/xinit or plasma/sddm i have no errors with systemctl
> start and it works
> 
> [jason@behemoth ~]$ cat /sys/class/drm/card0/device/pp_dpm_mclk
> 0: 167Mhz 
> 1: 500Mhz *
> 2: 700Mhz 
> 3: 800Mhz

I am now running a script at Plasma login, with no password required for that command in sudoers. I also run it after sleep.

Comment 135 Rodney A Morris 2019-11-17 14:24:39 UTC

(In reply to haro41 from comment #127)
> (In reply to Rodney A Morris from comment #126)
> > If you want someone to apply your changes in bug report no. 110777 to the
> > kernel for testing, I can so but will not be to it until this weekend. 
>  
> ... thanks for you reply. Yes, that was the idea and would be very nice...
> 
> Since i thing the proposed fix is more relevant to this very thread, let me
> repeat the proposed patch here:
> 
> in 'drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c':
> 
> static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
>                 bool has_disp)
> {
> 	smum_send_msg_to_smc_with_parameter(hwmgr,
> 	                                    PPSMC_MSG_SetUclkFastSwitch,
> 	                                    has_disp ? 1 : 0);
> /* proposed fix for crashes because of frequently mclk level 0/1 switching */
> 	smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkDownHyst, 1);
> }
> 
> Only module 'amdgpu.ko' needs to be rebuild and copied, like this:
> 
> $ cd /home/user/linux-5.x.x && make -j8 -C . M=drivers/gpu/drm/amd/amdgpu
> 
> # cp /home/user/linux-5.x.x/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
> /lib/modules/5.x.x/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko &&
> update-initramfs -u
> 
> ... 'user' and 'x.x' have to be adapted, most likely ...

I applied the patch and recompiled the kernel with the modified amdgpu driver.  Unfortunately, the patch did not resolve my issues.  I experienced a crash with the same symptoms as before within 20 minutes of playing Battletech and within 40 minutes of playing Stellaris.  Again, limiting the HBM memory clock to levels 1, 2, and 3 prevents the system from crashing, indicating that something about the memory clock switching between level 0 and levels 1, 2, and 3 is causing the crash.

Interestingly, the debug output indicates a possible problem in amdgpu/../display/dc/dc_helper.c at, I am guessing, line 332.  If I have time later this week, I may take a look at the code in that file.  Here are the pertinent details from the Stellaris crash.

Distro:  Fedora
Kernel:  5.3.11

dmesg crash output:

[19792.781681] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3875204, emitted seq=3875205
[19792.781727] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process stellaris pid 13309 thread stellaris:cs0 pid 13310
[19792.781731] amdgpu 0000:06:00.0: GPU reset begin!
[19792.798997] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19792.799004] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19792.799006] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19792.799007] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19792.800004] pcieport 0000:00:03.0: AER: Device recovery failed
[19794.419525] amdgpu: [powerplay] No response from smu
[19794.419542] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[19796.043441] amdgpu: [powerplay] No response from smu
[19797.665903] amdgpu: [powerplay] No response from smu
[19797.665907] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[19799.287749] amdgpu: [powerplay] No response from smu
[19800.910845] amdgpu: [powerplay] No response from smu
[19800.910850] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[19800.977846] [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:634
[19800.977855] ------------[ cut here ]------------
[19800.977967] WARNING: CPU: 10 PID: 15123 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:332 generic_reg_wait.cold+0x31/0x53 [amdgpu]
[19800.977968] Modules linked in: rfcomm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep nct6775 hwmon_vid vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass iwlmvm crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi ghash_clmulni_intel mac80211 snd_hda_intel intel_cstate snd_hda_codec libarc4 intel_uncore snd_hda_core btusb snd_hwdep btrtl intel_rapl_perf btbcm iwlwifi snd_seq btintel snd_seq_device
[19800.977994]  bluetooth joydev mxm_wmi snd_pcm cfg80211 snd_timer ecdh_generic ecc rfkill snd mei_me soundcore i2c_i801 lpc_ich mei binfmt_misc auth_rpcgss sunrpc ip_tables amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel mpt3sas igb nvme e1000e dca raid_class i2c_algo_bit scsi_transport_sas nvme_core wmi usb_storage fuse
[19800.978009] CPU: 10 PID: 15123 Comm: kworker/10:1 Not tainted 5.3.11-300.RAM.local.fc31.x86_64+debug #1
[19800.978011] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Taichi, BIOS P1.80 04/06/2018
[19800.978014] Workqueue: events drm_sched_job_timedout [gpu_sched]
[19800.978082] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
[19800.978084] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 a8 ee 7e c0 e8 82 00 a5 fa 83 7b 20 01 0f 84 94 ee fd ff 48 c7 c7 a0 ed 7e c0 e8 6c 00 a5 fa <0f> 0b e9 81 ee fd ff 48 c7 c7 a0 ed 7e c0 89 54 24 04 e8 55 00 a5
[19800.978086] RSP: 0018:ffff957a0520f690 EFLAGS: 00010246
[19800.978087] RAX: 0000000000000024 RBX: ffff88d6a8030780 RCX: 0000000000000006
[19800.978089] RDX: 0000000000000000 RSI: ffff88d645a10e50 RDI: ffff88d6bf9d9e00
[19800.978090] RBP: 000000000000000a R08: 0000120246405906 R09: 0000000000000000
[19800.978091] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000035af
[19800.978092] R13: 0000000000000dad R14: 0000000000000001 R15: 0000000000000dac
[19800.978093] FS:  0000000000000000(0000) GS:ffff88d6bf800000(0000) knlGS:0000000000000000
[19800.978095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19800.978096] CR2: 0000289e30054000 CR3: 0000000278612003 CR4: 00000000003606e0
[19800.978097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19800.978098] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[19800.978100] Call Trace:
[19800.978152]  dce_mi_free_dmif+0xef/0x150 [amdgpu]
[19800.978200]  dce110_reset_hw_ctx_wrap+0x15f/0x200 [amdgpu]
[19800.978261]  dce110_apply_ctx_to_hw+0x4b/0x530 [amdgpu]
[19800.978316]  ? amdgpu_pm_compute_clocks+0xc9/0x5f0 [amdgpu]
[19800.978383]  ? dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
[19800.978429]  dc_commit_state+0x26b/0x590 [amdgpu]
[19800.978479]  amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu]
[19800.978486]  ? check_irq_usage+0xa7/0x460
[19800.978488]  ? find_held_lock+0x32/0x90
[19800.978494]  ? check_path+0x22/0x40
[19800.978496]  ? check_noncircular+0xaf/0x1b0
[19800.978501]  ? __lock_acquire+0x247/0x1910
[19800.978507]  ? find_held_lock+0x32/0x90
[19800.978511]  ? mark_held_locks+0x50/0x80
[19800.978513]  ? _raw_spin_unlock_irq+0x29/0x40
[19800.978516]  ? lockdep_hardirqs_on+0xf0/0x180
[19800.978518]  ? _raw_spin_unlock_irq+0x29/0x40
[19800.978521]  ? wait_for_completion_timeout+0x75/0x190
[19800.978534]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[19800.978578]  ? amdgpu_dm_audio_eld_notify+0x60/0x60 [amdgpu]
[19800.978583]  commit_tail+0x3c/0x70 [drm_kms_helper]
[19800.978588]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[19800.978595]  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[19800.978601]  drm_atomic_helper_suspend+0x66/0x100 [drm_kms_helper]
[19800.978652]  dm_suspend+0x20/0x60 [amdgpu]
[19800.978679]  amdgpu_device_ip_suspend_phase1+0x91/0xc0 [amdgpu]
[19800.978707]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[19800.978753]  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[19800.978799]  amdgpu_device_gpu_recover+0x260/0x934 [amdgpu]
[19800.978843]  amdgpu_job_timedout+0x115/0x140 [amdgpu]
[19800.978848]  drm_sched_job_timedout+0x44/0xa0 [gpu_sched]
[19800.978852]  process_one_work+0x272/0x5a0
[19800.978858]  worker_thread+0x50/0x3b0
[19800.978863]  kthread+0x108/0x140
[19800.978865]  ? process_one_work+0x5a0/0x5a0
[19800.978867]  ? kthread_park+0x80/0x80
[19800.978870]  ret_from_fork+0x3a/0x50
[19800.978878] irq event stamp: 211500
[19800.978881] hardirqs last  enabled at (211499): [<ffffffffbb1715db>] console_unlock+0x46b/0x5d0
[19800.978885] hardirqs last disabled at (211500): [<ffffffffbb0038da>] trace_hardirqs_off_thunk+0x1a/0x20
[19800.978887] softirqs last  enabled at (211486): [<ffffffffbbe0035d>] __do_softirq+0x35d/0x45d
[19800.978889] softirqs last disabled at (211479): [<ffffffffbb0f20c7>] irq_exit+0xf7/0x100
[19800.978891] ---[ end trace 722d34fe8b4d4012 ]---
[19802.595549] amdgpu: [powerplay] No response from smu
[19804.214995] amdgpu: [powerplay] No response from smu
[19804.215000] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0x0
[19805.837985] amdgpu: [powerplay] No response from smu
[19807.458610] amdgpu: [powerplay] No response from smu
[19807.458614] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0x0
[19809.078189] amdgpu: [powerplay] No response from smu
[19810.698831] amdgpu: [powerplay] No response from smu
[19810.698835] amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0x0
[19812.321202] amdgpu: [powerplay] No response from smu
[19813.938039] amdgpu: [powerplay] No response from smu
[19813.938043] amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xa0b000, error code: 0x0
[19815.558461] amdgpu: [powerplay] No response from smu
[19817.179965] amdgpu: [powerplay] No response from smu
[19817.179969] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0x0
[19818.790507] amdgpu: [powerplay] No response from smu
[19820.409551] amdgpu: [powerplay] No response from smu
[19820.409555] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0x0
[19822.030397] amdgpu: [powerplay] No response from smu
[19823.648860] amdgpu: [powerplay] No response from smu
[19823.648864] amdgpu: [powerplay] Failed message: 0x43, input parameter: 0x1, error code: 0x0
[19825.269615] amdgpu: [powerplay] No response from smu
[19826.890755] amdgpu: [powerplay] No response from smu
[19826.890760] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0x0
[19826.907783] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19826.907789] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19826.907791] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19826.907793] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19826.907853] pcieport 0000:00:03.0: AER: Device recovery failed
[19826.925319] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19826.925325] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19826.925326] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19826.925328] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19826.925371] pcieport 0000:00:03.0: AER: Device recovery failed
[19826.942858] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19826.942863] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19826.942865] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19826.942867] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19826.942922] pcieport 0000:00:03.0: AER: Device recovery failed
[19826.960471] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19826.960477] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19826.960480] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19826.960483] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19826.960532] pcieport 0000:00:03.0: AER: Device recovery failed
[19826.977940] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19826.977945] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19826.977947] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19826.977949] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19826.977988] pcieport 0000:00:03.0: AER: Device recovery failed
[19826.995481] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19826.995486] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19826.995487] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19826.995489] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19826.995529] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.013021] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.013026] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.013027] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.013029] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.013091] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.030562] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.030567] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.030568] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.030570] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.030610] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.048102] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.048106] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.048108] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.048110] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.048148] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.065644] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.065648] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.065650] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.065652] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.065692] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.083183] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.083188] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.083190] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.083192] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.083231] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.100724] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.100729] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.100731] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.100732] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.100772] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.118264] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.118269] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.118270] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.118272] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.118310] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.135804] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.135809] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.135811] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.135812] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.135852] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.153345] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.153350] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.153352] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.153353] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.153393] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.170887] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.170892] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.170893] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.170895] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.170934] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.188426] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.188431] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.188433] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.188435] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.188473] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.205966] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.205971] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.205973] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.205974] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.206013] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.223507] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.223512] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.223514] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.223515] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.223554] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.241053] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.241058] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.241059] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.241061] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.241120] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.258589] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.258594] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.258595] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.258597] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.258637] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.276129] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.276134] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.276135] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.276137] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.276176] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.293670] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.293675] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.293676] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.293678] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.293718] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.311211] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.311215] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.311217] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.311219] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.311259] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.328751] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.328756] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.328758] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.328759] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.328800] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.346291] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.346295] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.346297] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.346299] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.346344] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.363831] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.363836] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.363838] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.363839] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.363886] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.381372] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.381376] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.381378] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.381380] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.381425] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.398913] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.398917] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.398919] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.398921] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.398959] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.416453] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.416458] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.416460] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.416467] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.416507] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.433994] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.433999] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.434001] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.434002] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.434042] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.451536] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.451542] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.451544] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.451545] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.451588] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.469085] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.469091] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.469092] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.469094] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.469136] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.486616] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.486626] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.486628] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.486630] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.486670] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.504161] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.504167] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.504170] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.504171] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.504218] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.521697] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.521702] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.521704] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.521706] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.521934] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.539242] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.539247] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.539249] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.539250] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.539290] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.556778] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.556782] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.556784] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.556786] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.556836] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.574325] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.574330] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.574332] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.574334] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.574373] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.591858] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.591863] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.591865] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.591867] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.591908] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.609401] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.609405] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.609407] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.609409] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.609448] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.626939] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.626944] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.626946] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.626947] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.626986] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.644481] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.644486] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.644488] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.644489] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.644528] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.662021] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.662026] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.662028] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.662029] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.662087] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.679561] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.679566] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.679568] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.679570] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.679608] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.697101] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.697106] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.697108] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.697110] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.697149] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.714648] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.714653] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.714655] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.714656] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.714703] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.732183] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.732188] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.732190] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.732191] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.732230] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.749724] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.749729] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.749730] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.749732] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.767327] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.767330] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.767335] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.767336] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.767338] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.767364] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.784805] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.784810] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.784812] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.784813] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.784853] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.802345] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.802350] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.802352] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.802354] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.802394] pcieport 0000:00:03.0: AER: Device recovery failed
[19827.819886] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.0
[19827.819891] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[19827.819893] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00004000/00000000
[19827.819894] pcieport 0000:00:03.0: AER:    [14] CmpltTO                (First)
[19827.819934] pcieport 0000:00:03.0: AER: Device recovery failed

Comment 136 haro41 2019-11-17 17:13:10 UTC

Thank you for testing and reporting back.

I think the crashes are caused by voltage drops, followed by a hardware failure.
That would also explain the many different kernel logs, because from the driver's point of view the failure is random.

If vsync is enabled, the mclk level is switched at least twice per frame (down/up).
In some cases I have seen more switches within a single frame.
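
For anyone who wants to check this on their own card, here is a minimal sketch in Python. It assumes the amdgpu sysfs file `pp_dpm_mclk` is available (typically at `/sys/class/drm/card0/device/pp_dpm_mclk`; the card index is an assumption and may differ) and that the active level is the line marked with `*`. The function names are made up for illustration. Polling from userspace is much slower than the switching itself, so the count is only a lower bound, not an exact per-frame figure.

```python
import re
import time


def active_level(pp_dpm_text):
    """Return the index of the level marked with '*', or None if none is marked."""
    for line in pp_dpm_text.splitlines():
        # Expected line shape (assumption): "1: 500Mhz *"
        m = re.match(r"\s*(\d+):\s*\S+\s*\*", line)
        if m:
            return int(m.group(1))
    return None


def count_switches(levels):
    """Count changes between consecutive samples, ignoring failed reads (None)."""
    seen = [lvl for lvl in levels if lvl is not None]
    return sum(1 for a, b in zip(seen, seen[1:]) if a != b)


def sample_levels(path, samples, interval_s):
    """Poll the sysfs file `samples` times, sleeping `interval_s` between reads."""
    levels = []
    for _ in range(samples):
        with open(path) as f:
            levels.append(active_level(f.read()))
        time.sleep(interval_s)
    return levels
```

Example use while a game is running with vsync enabled: `count_switches(sample_levels("/sys/class/drm/card0/device/pp_dpm_mclk", 2000, 0.001))`.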

I am not sure whether switching the memory clock level this fast, multiple times during a frame, is really useful. It does not save much power, but it apparently makes the system unstable.

I don't think this is intended behavior; it looks more like a firmware bug, imo.

Maybe an open-source driver developer can help us understand?

Comment 137 Martin Peres 2019-11-20 07:52:11 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/716.
