AMD Vega 56 fails to reset:

[ 188.771043] Evicting PASID 32782 queues
[ 188.782094] Restoring PASID 32782 queues
[ 214.563362] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=19285, emitted seq=19287
[ 214.563432] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACOdyssey.exe pid 3761 thread ACOdyssey.exe pid 3761
[ 214.563439] amdgpu 0000:43:00.0: GPU reset begin!
[ 214.563445] Evicting PASID 32782 queues
[ 224.793032] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out

How do I go about diagnosing this problem?
Created attachment 144084 [details]
./umr -O bits -r *.*.mmGRBM_STATUS

Output while GPU failed to reset.
Created attachment 144085 [details]
/usr/src/umr/build/src/app/umr -wa

Output of the wave.
Created attachment 144086 [details]
dmesg

dmesg during reset.
Created attachment 144087 [details]
dmesg

dmesg
This is a result of trying to play games in wine and dxvk. It used to work, but the latest mesa git fails.

Games that fail are:
Assassin's Creed Odyssey
Devil May Cry 5

Both these games get through the title sequences, but fail when you reach the actual game play. The GPU hangs and tries to reset, but fails to reset.

So, there are two problems:
1) Why does it hang in the first place
2) Why does it fail to recover and reset itself

I can ssh into the PC.
poweroff  <- Attempts to power off but never actually reaches the off state.
echo b > /proc/sysrq-trigger  <- reboots the box, and everything is then ok again, so long as one does not try to play a game.
I think I have found the problem.

[ 657.526313] amdgpu 0000:43:00.0: GPU reset begin!
[ 657.526318] Evicting PASID 32782 queues
[ 667.756000] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out

The intention is to do a GPU reset, but the implementation in the code just tries to do a suspend. Part of the suspend does this:

Apr 29 14:29:19 thread kernel: [ 363.445607] INFO: task kworker/u258:0:55 blocked for more than 120 seconds.
Apr 29 14:29:19 thread kernel: [ 363.445612] Not tainted 5.0.10-dirty #26
Apr 29 14:29:19 thread kernel: [ 363.445613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 14:29:19 thread kernel: [ 363.445615] kworker/u258:0 D 0 55 2 0x80000000
Apr 29 14:29:19 thread kernel: [ 363.445628] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445629] Call Trace:
Apr 29 14:29:19 thread kernel: [ 363.445635] __schedule+0x2c0/0x880
Apr 29 14:29:19 thread kernel: [ 363.445637] schedule+0x2c/0x70
Apr 29 14:29:19 thread kernel: [ 363.445639] schedule_timeout+0x1db/0x360
Apr 29 14:29:19 thread kernel: [ 363.445641] ? update_load_avg+0x8b/0x590
Apr 29 14:29:19 thread kernel: [ 363.445645] dma_fence_default_wait+0x1eb/0x270
Apr 29 14:29:19 thread kernel: [ 363.445647] ? dma_fence_release+0xa0/0xa0
Apr 29 14:29:19 thread kernel: [ 363.445649] dma_fence_wait_timeout+0xfd/0x110
Apr 29 14:29:19 thread kernel: [ 363.445651] reservation_object_wait_timeout_rcu+0x17d/0x370
Apr 29 14:29:19 thread kernel: [ 363.445710] amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445767] amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445820] ? amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445828] commit_tail+0x42/0x70 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445835] commit_work+0x12/0x20 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445838] process_one_work+0x1fd/0x400
Apr 29 14:29:19 thread kernel: [ 363.445840] worker_thread+0x34/0x410
Apr 29 14:29:19 thread kernel: [ 363.445841] kthread+0x121/0x140
Apr 29 14:29:19 thread kernel: [ 363.445843] ? process_one_work+0x400/0x400
Apr 29 14:29:19 thread kernel: [ 363.445844] ? kthread_park+0x90/0x90
Apr 29 14:29:19 thread kernel: [ 363.445847] ret_from_fork+0x22/0x40

So, amdgpu_dm_do_flip() is the bit that hangs. If the GPU needs to be reset because some of it has hung, trying a "flip" is unlikely to work. It is failing/hanging when doing "suspend of IP block <dm>" in amdgpu_device_ip_suspend_phase1().

I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while the GPU is hung.
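For reference, the phase-1 suspend path being blamed here walks the device's IP blocks and calls each block's suspend callback. A simplified sketch of that shape, paraphrased from the amdgpu IP-block scheme rather than copied from the upstream function, looks like:

    /* Simplified sketch of the phase-1 IP suspend loop named above.
     * Field and callback names follow the amdgpu IP-block scheme, but this
     * is a paraphrase, not the literal upstream amdgpu_device_ip_suspend_phase1(). */
    static int ip_suspend_phase1_sketch(struct amdgpu_device *adev)
    {
    	int i, r;

    	/* Walk the IP blocks in reverse init order; phase 1 only handles
    	 * the display (DCE/DM) block. */
    	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
    		if (!adev->ip_blocks[i].status.valid)
    			continue;
    		if (adev->ip_blocks[i].version->type != AMD_IP_BLOCK_TYPE_DCE)
    			continue;

    		/* The DM suspend path flushes pending commits/flips, and with
    		 * the GPU hung the fence wait in amdgpu_dm_do_flip() never
    		 * completes -- hence the blocked kworker in the trace above. */
    		r = adev->ip_blocks[i].version->funcs->suspend(adev);
    		if (r)
    			return r;
    	}
    	return 0;
    }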
(In reply to James.Dutton from comment #6)
> I would suggest creating code that actually tries to reset the GPU, instead
> of trying to suspend it while the GPU is hung.

That is part of the GPU reset sequence. We need to attempt to stop the engines before resetting the GPU. That is what the suspend code does. Not all of the engines are necessarily hung, so you need to stop and drain them properly.
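To make the ordering concrete: the recovery flow described here is roughly "park the schedulers, quiesce the IP blocks, do the full ASIC reset, resume, restart". The helper names in the sketch below are placeholders for illustration only; the real flow lives in amdgpu_device_gpu_recover().

    /* Ordering sketch of the recovery sequence described above: stop and
     * drain the engines first, then reset, then bring everything back.
     * Helper names are placeholders, not the upstream functions. */
    static int gpu_recover_sketch(struct amdgpu_device *adev)
    {
    	int r;

    	park_ring_schedulers(adev);	/* stop feeding new jobs to the rings */

    	/* Quiesce the engines that still respond; a failure here is
    	 * tolerable, the ASIC reset below clobbers the rest. */
    	suspend_ip_blocks(adev);

    	r = full_asic_reset(adev);	/* SOC-level reset() callback */
    	if (r)
    		return r;

    	resume_ip_blocks(adev);		/* re-init hw, restore ring state */
    	unpark_ring_schedulers(adev);	/* resubmit or cancel pending jobs */
    	return 0;
    }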
Thank you for the feedback.
Is there a data sheet somewhere that might help me work out a fix for this?

What I would like is:
1) A way to scan all the engines and detect which ones have hung.
2) A way to intentionally halt an engine and tidy up, so that the modprobe, rmmod, modprobe scenario works.
3) Data sheet details regarding how to un-hang each engine. Specifically, in this case, the IP block <dm>.

Maybe that is not possible, and (I think you are hinting at it) one cannot reset an individual IP block. So the approach is to suspend the card, then do a full reset of the entire card, then resume.

I think a different suspend process would be better. We have a for_each within the suspend code. The output of that code should not be a single error code, but instead an array indicating the current state of each engine (running/hung), the intended state, and whether the transition worked or failed. In the loop through the for_each, it could compare the current state and intended state, attempt to reach the intended state, and report an error code for each engine. Then the code to achieve the transition can be different depending on the current -> intended transition, i.e. code for running -> suspended can be different from code for hung -> suspended. The code already needs to know which engines are enabled/disabled (Vega 56 vs Vega 64). A rough sketch of the idea is below.

I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.
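As an illustration of the per-engine state tracking proposed above: these types and names are entirely hypothetical and do not exist in the driver, they only show the shape of the idea.

    /* Hypothetical sketch of the per-engine state bookkeeping proposed above. */
    enum engine_run_state {
    	ENGINE_RUNNING,
    	ENGINE_HUNG,
    	ENGINE_SUSPENDED,
    	ENGINE_DISABLED,	/* e.g. fused-off units, Vega 56 vs Vega 64 */
    };

    struct engine_transition {
    	enum engine_run_state current;	/* state observed before suspend */
    	enum engine_run_state intended;	/* state we want to reach */
    	int status;			/* 0 on success, -errno if the transition failed */
    };

    #define MAX_ENGINES 16	/* arbitrary for this sketch */

    /* One entry per IP block/engine, filled in by the suspend loop instead of
     * collapsing everything into a single return code. The caller could then
     * pick a different path for running -> suspended than for hung -> suspended. */
    struct engine_transition transitions[MAX_ENGINES];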
(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this?
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the gpu scheduler for a queue on a particular engine times out, you can be pretty sure the engine has hung. At that point you can check the current busy status for the block (IP is_idle() callback).

> 2) A way to intentionally halt an engine and tidy up, so that the modprobe,
> rmmod, modprobe scenario works.

hw_fini() IP callback.

> 3) Data sheet details regarding how to un-hang each engine.
> Specifically, in this case, the IP block <dm>.

Each IP has a soft reset (implemented via the IP soft_reset() callback), but depending on the hang, in some cases you may have to do a full GPU reset to recover. This is not a hw hang, it's a sw deadlock.

> Maybe that is not possible, and (I think you are hinting at it) one cannot
> reset an individual IP block. So the approach is to suspend the card, then
> do a full reset of the entire card, then resume.

All asics support full GPU reset, which is implemented via the SOC-level amdgpu_asic_funcs reset() callback.

> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state, and whether the
> transition worked or failed. In the loop through the for_each, it could
> compare the current state and intended state, attempt to reach the intended
> state, and report an error code for each engine. Then the code to achieve
> the transition can be different depending on the current -> intended
> transition, i.e. code for running -> suspended can be different from code
> for hung -> suspended. The code already needs to know which engines are
> enabled/disabled (Vega 56 vs Vega 64).

We don't really care if the suspend fails or not. See amdgpu_device_gpu_recover() for the full sequence.

> I can hang this IP block <dm> at will. I have 2 games that hang it within
> seconds of starting.

There was a deadlock in the dm code which has been fixed. Please try a newer code base, e.g.:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
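The callbacks referred to above hang off each IP block's amd_ip_funcs table. An abridged, paraphrased view of the relevant hooks (the real struct has more members and the exact signatures may differ between kernel versions) looks roughly like:

    /* Abridged view of the per-IP callback table referred to above.
     * Only the hooks mentioned in this thread are shown; this is not the
     * full struct amd_ip_funcs from the amdgpu driver. */
    struct amd_ip_funcs_abridged {
    	const char *name;
    	int (*hw_fini)(void *handle);		/* tear down the block's hw state */
    	bool (*is_idle)(void *handle);		/* current busy/idle status of the block */
    	int (*wait_for_idle)(void *handle);	/* poll until the block goes idle */
    	int (*soft_reset)(void *handle);	/* per-IP soft reset, when the hang allows it */
    	int (*suspend)(void *handle);		/* quiesce the block before a full reset */
    	int (*resume)(void *handle);		/* bring the block back after the reset */
    };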
Created attachment 144118 [details]
dmesg with drm-next-5.2-wip
I tried with drm-next-5.2-wip. It does not hang any more, but I have a new error now. It is better, in the sense that I can now reboot the system normally and not resort to echo b > /proc/sysrq-trigger.

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

After the GPU reset, the screen is corrupted. I can do, via ssh, service gdm stop; service gdm start, and I then get a working login screen. (Mouse moves, I can type in the password.) I cannot actually login because X fails. The desktop fails to appear and it returns to the login greeter screen.

I will try to get more details when I have time later.
The error is from this bit of code in amdgpu_cs.c, around line 232, in function amdgpu_cs_parser_init():

	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
		ret = -ECANCELED;
		goto free_all_kdata;
	}

So, I guess, somewhere in the GPU reset those values need to be fixed up.
(In reply to James.Dutton from comment #12)
> In function amdgpu_cs_parser_init():
> 	if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
> 		ret = -ECANCELED;
> 		goto free_all_kdata;
> 	}
>
> So, I guess, somewhere in the GPU reset those values need to be fixed up.

It means the VRAM contents were lost during the GPU reset, so any existing userspace contexts are invalid and need to be re-created (which at this point boils down to restarting any processes using the GPU for rendering).
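For completeness, a userspace client can detect this situation and rebuild its context. A minimal sketch using libdrm's amdgpu context API, assuming amdgpu_cs_query_reset_state() and the AMDGPU_CTX_*_RESET values from amdgpu_drm.h, with error handling trimmed:

    /* Minimal userspace sketch: when a submission fails with -ECANCELED
     * after a GPU reset, query the context's reset status and re-create it.
     * Uses libdrm's amdgpu API; error handling is trimmed for brevity. */
    #include <errno.h>
    #include <stdint.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    static int recreate_context_if_lost(amdgpu_device_handle dev,
    				    amdgpu_context_handle *ctx,
    				    int submit_err)
    {
    	uint32_t state = 0, hangs = 0;

    	if (submit_err != -ECANCELED)
    		return submit_err;

    	/* The old context is tied to the pre-reset VRAM contents. */
    	amdgpu_cs_query_reset_state(*ctx, &state, &hangs);
    	if (state != AMDGPU_CTX_NO_RESET) {
    		amdgpu_cs_ctx_free(*ctx);
    		/* Re-upload all buffers/shaders before resubmitting. */
    		return amdgpu_cs_ctx_create(dev, ctx);
    	}
    	return submit_err;
    }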
I stop gdm and kill any remaining X processes. When I start gdm and login, it works and displays the desktop. Previously, I was leaving one of the X processes running.

So, I think this (drm-next-5.2-wip) has fixed this bug.
Created attachment 145050 [details]
dmesg drm amdgpu linux 5.3-rc4 from ubuntu ppa

I'm facing the same issue with 5.2.x and 5.3-rc4 kernels and a Radeon RX 580.
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1389.