Bug 107536 - gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-09 21:13 UTC by dwagner
Modified: 2019-06-03 19:54 UTC
0 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg, ending at crash (87.49 KB, text/plain)
2018-08-09 21:14 UTC, dwagner
X11 log (37.60 KB, text/plain)
2018-08-09 21:15 UTC, dwagner

Description dwagner 2018-08-09 21:13:16 UTC
This bug just occurred spontaneously (while I was only using a text editor):

Aug 09 22:23:34 ryzen kernel: [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Aug 09 22:23:34 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 22:23:38 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=193874, emitted seq=193874
Aug 09 22:23:38 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 22:23:44 ryzen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:44:crtc-0] hw_done or flip_done timed out
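
To pull just these lines out of a longer kernel log, a small filter like the following might help. It is only a sketch: the function name `amdgpu_errors` is made up for illustration, and the two fixed strings are taken from the messages quoted above, so other amdgpu messages will not be matched.

```shell
# Sketch: keep only amdgpu "*ERROR*" lines and GPU reset notices
# from kernel log text read on stdin.
# The function name is hypothetical; the patterns come from the log above.
amdgpu_errors() {
    # -F treats the patterns as fixed strings, so the brackets and
    # asterisks need no regex escaping.
    grep -F -e '[amdgpu]] *ERROR*' -e 'GPU reset begin!'
}
```

It could be fed from `dmesg | amdgpu_errors` or `journalctl -k | amdgpu_errors`.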


Kernel was compiled from amd-staging-drm-next as of commit bf1fd52b0632cd17ac875432a36d3e92be96d8cb.

A day earlier, the RX 460 GPU had been manually set to its lowest mclk/sclk with:
cd /sys/class/drm/card0/device ; echo manual >power_dpm_force_performance_level ; 
echo 0 >pp_dpm_mclk ; echo 0 >pp_dpm_sclk
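
For anyone wanting to repeat that setup, the same three writes can be wrapped in a function that checks the sysfs files exist first. This is a sketch only: the function name `set_lowest_clocks` is hypothetical, the default path is the `card0` path used above (other systems may use a different card index), and the writes need root.

```shell
# Sketch: force manual DPM and select the lowest mclk/sclk state,
# mirroring the commands above. Function name is hypothetical.
# $1 = device directory (defaults to the card0 path from this report).
set_lowest_clocks() {
    dev="${1:-/sys/class/drm/card0/device}"
    # Fail early if any of the expected files is missing.
    for f in power_dpm_force_performance_level pp_dpm_mclk pp_dpm_sclk; do
        [ -e "$dev/$f" ] || { echo "missing: $dev/$f" >&2; return 1; }
    done
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo 0 > "$dev/pp_dpm_mclk" || return 1
    echo 0 > "$dev/pp_dpm_sclk" || return 1
}
```

Calling `set_lowest_clocks` with no argument targets card0; passing a directory targets another card.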
Comment 1 dwagner 2018-08-09 21:14:56 UTC
Created attachment 141028 [details]
dmesg, ending at crash
Comment 2 dwagner 2018-08-09 21:15:32 UTC
Created attachment 141029 [details]
X11 log
Comment 3 Alex Deucher 2018-10-26 02:46:03 UTC
Is this reproducible or was it a one time event?
Comment 4 dwagner 2018-10-27 12:44:49 UTC
So far it has been a one-time event.

It was probably unrelated to the "echo manual >power_dpm_force_performance_level" setting I mentioned above: I still need that setting to keep the kernel from crashing every few minutes (that problem is the subject of https://bugs.freedesktop.org/show_bug.cgi?id=102322 ).
Comment 5 Matt Coffin 2019-06-03 19:48:59 UTC
I can reproduce this in a very specific way (discovered while reproducing bug 102322).

Setup: the amdgpu driver with the RADV Vulkan implementation and DXVK 1.2.1, running "House Flipper" from Steam (wine-staging 4.8) on a 2560x1440 144Hz display (DisplayPort). It crashes with the AMDVLK implementation as well, but with a different message.

It usually happens within 2 minutes of starting the game. Notably, it *does not* occur if I render the game at 1080p and upscale it to the screen.

* 5.1.3-arch2-1-ARCH
* LLVM 8.0.0
* vulkan-radeon/mesa 19.0.4

The register whose access is rejected flips between TC1 and TC2, seemingly nondeterministically.

I'm sorry for the poor information, but I'm not used to developing/debugging software at the kernel level. Let me know what information I can provide to be helpful, and I'd be happy to fish it out for you. Thanks in advance for your work and the help.
Comment 6 Matt Coffin 2019-06-03 19:54:05 UTC
I also tried to reproduce with amdgpu.vm_update_mode=3, but I can't get Xorg to launch with that setting. The KERNEL (not the GPU) fails on a page request with it enabled, but that might be due to my lower amount of RAM combined with running an RX 590 with 8GB of GDDR5, so it might just be trying to allocate too much memory.

The failures do NOT occur if I disable dynamic power management with amdgpu.dpm=0, but performance obviously suffers at those low clock speeds: the game gets about 14fps.
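
To confirm that a boot-time option such as amdgpu.dpm=0 or amdgpu.vm_update_mode=3 actually took effect, the loaded module's parameters can be read back from sysfs. A minimal sketch, assuming the usual /sys/module/amdgpu/parameters location; the helper name `amdgpu_param` is made up for illustration.

```shell
# Sketch: print the current value of an amdgpu module parameter.
# $1 = parameter name (e.g. dpm), $2 = parameters directory
# (defaults to the usual sysfs location). Helper name is hypothetical.
amdgpu_param() {
    dir="${2:-/sys/module/amdgpu/parameters}"
    [ -r "$dir/$1" ] || { echo "no such parameter: $1" >&2; return 1; }
    cat "$dir/$1"
}
```

For example, `amdgpu_param dpm` should print 0 after booting with amdgpu.dpm=0.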

Manual power management fared no better, but some quick debugging showed that it might be getting overridden by DXVK's DXGI implementation.

I also logged `sensors` output, which showed that the failures often occur shortly after the card reaches its maximum power draw at a little over 190W. I thought about increasing that limit, but I didn't want to fry my hardware since I don't have much experience mucking around with overclocking/overvolting GPUs.
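
As an alternative to parsing `sensors` output, the power draw can be read directly from the card's hwmon directory, which makes it easier to line up readings with the crash timestamps. This is a rough sketch: the function name `gpu_power_watts` is hypothetical, it assumes the driver exposes `power1_average` (reported in microwatts), and the hwmon directory has to be passed in because its index differs between systems.

```shell
# Sketch: print the GPU's average power draw in whole watts.
# $1 = the card's hwmon directory (its index varies per system).
# Assumes power1_average exists there and holds microwatts.
gpu_power_watts() {
    hwmon="$1"
    [ -r "$hwmon/power1_average" ] || { echo "no power1_average in $hwmon" >&2; return 1; }
    uw=$(cat "$hwmon/power1_average") || return 1
    # Integer division: microwatts to watts, fraction dropped.
    echo $(( uw / 1000000 ))
}
```

Calling it once per second from a loop, with `date` printed alongside, gives a log that can be compared against the kernel messages.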

