This bug occurred spontaneously (while I was just using a text editor):
Aug 09 22:23:34 ryzen kernel: [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Aug 09 22:23:34 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 22:23:38 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=193874, emitted seq=193874
Aug 09 22:23:38 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 22:23:44 ryzen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:44:crtc-0] hw_done or flip_done timed out
Kernel was compiled from amd-staging-drm-next as of commit bf1fd52b0632cd17ac875432a36d3e92be96d8cb.
The RX 460 GPU was (a day before) manually set to lowest mclk/sclk with
cd /sys/class/drm/card0/device ; echo manual >power_dpm_force_performance_level ;
echo 0 >pp_dpm_mclk ; echo 0 >pp_dpm_sclk
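For anyone repeating this setup, here is a minimal sketch of the same commands wrapped in a function that also reads the values back. The function name `set_lowest_clocks` and the optional directory argument are illustrative assumptions, not part of the original report. The sysfs file names are the ones used above.

```shell
# Force the lowest DPM states on an amdgpu device and print the result.
# The optional first argument overrides the device directory; it defaults
# to the card0 path from the report (assumed to be the RX 460).
set_lowest_clocks() {
    dev="${1:-/sys/class/drm/card0/device}"
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo 0 > "$dev/pp_dpm_mclk" || return 1
    echo 0 > "$dev/pp_dpm_sclk" || return 1
    # Read each file back so the applied state is visible in the terminal.
    # On real hardware the pp_dpm_* files list all states and mark the
    # active one with '*'.
    for f in power_dpm_force_performance_level pp_dpm_mclk pp_dpm_sclk; do
        printf '%s: ' "$f"
        cat "$dev/$f"
    done
}
```

It needs to run as root, e.g. by calling `set_lowest_clocks` from a root shell. The setting does not persist across reboots.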
Created attachment 141028
dmesg, ending at crash
Created attachment 141029
Is this reproducible or was it a one time event?
So far it has been a one-time event.
It was probably unrelated to the "echo manual >power_dpm_force_performance_level" setting I mentioned above: I still need that setting to keep the kernel from crashing every few minutes (that is the subject of https://bugs.freedesktop.org/show_bug.cgi?id=102322 ).
I can reproduce this in a very specific way (discovered while reproducing bug 102322).
It happens with the amdgpu driver and the RADV Vulkan implementation, with DXVK 1.2.1, running "House Flipper" from Steam (wine-staging 4.8) on a 2560x1440 144Hz display (DisplayPort). It crashes with the AMDVLK implementation as well, but with a different message.
It usually happens within 2 minutes of starting the game. Notably, this *does not* occur if I render the game at 1080p and scale it up to the screen.
* LLVM 8.0.0
* vulkan-radeon/mesa 19.0.4
The register whose access it complains about flips between TC1 and TC2, seemingly nondeterministically.
I'm sorry for the poor information, but I'm not used to developing/debugging software at the kernel level. Let me know what information I can provide to be helpful, and I'd be happy to fish it out for you. Thanks in advance for your work and the help.
I also tried to reproduce with amdgpu.vm_update_mode=3, but I can't get Xorg to launch with that setting. The kernel (not the GPU) fails on a page request with it enabled, but that might be due to a lower amount of RAM combined with the fact that I'm running an RX 590 with 8GB of GDDR5, so it might just be trying to allocate too much memory.
The failures do NOT occur if I disable dynamic power management with amdgpu.dpm=0, but obviously, performance sucks with those low clock speeds. The game gets about 14fps.
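When comparing runs with different boot options, it can help to confirm which values the loaded module actually picked up. A small sketch, assuming the parameters are exposed read-only under /sys/module/amdgpu/parameters (the function name `show_amdgpu_params` and the directory override are illustrative only):

```shell
# Print selected amdgpu module parameters as "name=value" lines.
# The optional first argument overrides the parameter directory.
show_amdgpu_params() {
    dir="${1:-/sys/module/amdgpu/parameters}"
    for p in dpm vm_update_mode; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            # The parameter file is missing or unreadable on this kernel.
            printf '%s=unavailable\n' "$p"
        fi
    done
}
```

Including the output of `show_amdgpu_params` next to each log excerpt makes it clear which configuration a given crash belongs to.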
Manual power management fared no better, but some quick debugging showed that it might be getting overridden by DXVK's DXGI implementation.
I also logged `sensors` output, which showed that the failures often occur shortly after the card reaches its maximum power draw of a little over 190W. I thought about increasing that limit, but I didn't want to fry my hardware since I don't have much experience mucking around with overclocking/overvolting GPUs.
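To line up power draw with the kernel log timestamps, a timestamped sample per second is usually enough. The following is a rough sketch that reads the hwmon files directly instead of parsing `sensors` output. It assumes the amdgpu hwmon directory exposes `power1_average` and `power1_cap` in microwatts, which may not hold on every kernel or ASIC. The function names and the hwmon path in the usage note are hypothetical.

```shell
# Convert a hwmon microwatt reading (file given as $1) to watts.
uw_to_w() {
    LC_ALL=C awk '{ printf "%.1f\n", $1 / 1000000 }' "$1"
}

# Print one timestamped sample: average power draw and the configured cap.
# $1 is the hwmon directory of the GPU (it varies per system).
log_power_sample() {
    hwmon="$1"
    printf '%s draw=%sW cap=%sW\n' "$(date +%T)" \
        "$(uw_to_w "$hwmon/power1_average")" \
        "$(uw_to_w "$hwmon/power1_cap")"
}
```

A loop such as `while sleep 1; do log_power_sample /sys/class/hwmon/hwmon0; done >> power.log` then produces a log that can be compared against the timestamps of the kernel messages. This only reads the values and does not change the power limit.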