Bug 110509

Summary:	[drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout
Product:	Mesa	Reporter:	James.Dutton
Component:	Drivers/Gallium/radeonsi	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:	Default DRI bug account <dri-devel>
Severity:	normal
Priority:	medium	CC:	lifeisfoo
Version:	git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	./umr -O bits -r ..mmGRBM_STATUS /usr/src/umr/build/src/app/umr -wa dmesg dmesg dmesg with drm-next-5.2-wip dmsg drm amdgpu linux 5.3-rc4 from ubuntu ppa

Description James.Dutton 2019-04-24 17:26:39 UTC

AMD Vega 56 fails to reset:
[  188.771043] Evicting PASID 32782 queues
[  188.782094] Restoring PASID 32782 queues
[  214.563362] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=19285, emitted seq=19287
[  214.563432] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACOdyssey.exe pid 3761 thread ACOdyssey.exe pid 3761
[  214.563439] amdgpu 0000:43:00.0: GPU reset begin!
[  214.563445] Evicting PASID 32782 queues
[  224.793032] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out


How do I go about diagnosing this problem?

Comment 1 James.Dutton 2019-04-24 17:31:49 UTC

Created attachment 144084 [details]
./umr -O bits -r *.*.mmGRBM_STATUS

Output while GPU failed to reset.

Comment 2 James.Dutton 2019-04-24 17:32:48 UTC

Created attachment 144085 [details]
/usr/src/umr/build/src/app/umr -wa

Output of the wave.

Comment 3 James.Dutton 2019-04-24 17:33:15 UTC

Created attachment 144086 [details]
dmesg

dmesg during reset.

Comment 4 James.Dutton 2019-04-24 17:35:15 UTC

Created attachment 144087 [details]
dmesg

dmesg

Comment 5 James.Dutton 2019-04-28 15:42:07 UTC

This is a result of trying to play games in wine and dxvk.
It used to work, but the latest mesa git fails.
Games that fails are:
Assassin's creed odyssey
Devil May Cry 5

Both these games get through the title sequences, but fail when you reach the actual game play. The GPU hangs and tries to reset, but fails to reset.

So, there are two problems:
1) Why does it hang in the first place
2) Why does it fail to recover and reset itself.

I can ssh into the PC.
poweroff   <-  Attempts to power off but never actually reaches off state.
echo b > /proc/sysrq-trigger    <-  reboots the box, and everything is then ok again, so long as one does not try to play a game.

Comment 6 James.Dutton 2019-04-29 13:41:31 UTC

I think I have found the problem.
[  657.526313] amdgpu 0000:43:00.0: GPU reset begin!
[  657.526318] Evicting PASID 32782 queues
[  667.756000] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out


The intention is to do a GPU reset, but the implementation in the code is just to try and do a suspend.
Part of the suspend does this:

Apr 29 14:29:19 thread kernel: [  363.445607] INFO: task kworker/u258:0:55 blocked for more than 120 seconds.
Apr 29 14:29:19 thread kernel: [  363.445612]       Not tainted 5.0.10-dirty #26
Apr 29 14:29:19 thread kernel: [  363.445613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 14:29:19 thread kernel: [  363.445615] kworker/u258:0  D    0    55      2 0x80000000
Apr 29 14:29:19 thread kernel: [  363.445628] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [  363.445629] Call Trace:
Apr 29 14:29:19 thread kernel: [  363.445635]  __schedule+0x2c0/0x880
Apr 29 14:29:19 thread kernel: [  363.445637]  schedule+0x2c/0x70
Apr 29 14:29:19 thread kernel: [  363.445639]  schedule_timeout+0x1db/0x360
Apr 29 14:29:19 thread kernel: [  363.445641]  ? update_load_avg+0x8b/0x590
Apr 29 14:29:19 thread kernel: [  363.445645]  dma_fence_default_wait+0x1eb/0x270
Apr 29 14:29:19 thread kernel: [  363.445647]  ? dma_fence_release+0xa0/0xa0
Apr 29 14:29:19 thread kernel: [  363.445649]  dma_fence_wait_timeout+0xfd/0x110
Apr 29 14:29:19 thread kernel: [  363.445651]  reservation_object_wait_timeout_rcu+0x17d/0x370
Apr 29 14:29:19 thread kernel: [  363.445710]  amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
Apr 29 14:29:19 thread kernel: [  363.445767]  amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [  363.445820]  ? amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [  363.445828]  commit_tail+0x42/0x70 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [  363.445835]  commit_work+0x12/0x20 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [  363.445838]  process_one_work+0x1fd/0x400
Apr 29 14:29:19 thread kernel: [  363.445840]  worker_thread+0x34/0x410
Apr 29 14:29:19 thread kernel: [  363.445841]  kthread+0x121/0x140
Apr 29 14:29:19 thread kernel: [  363.445843]  ? process_one_work+0x400/0x400
Apr 29 14:29:19 thread kernel: [  363.445844]  ? kthread_park+0x90/0x90
Apr 29 14:29:19 thread kernel: [  363.445847]  ret_from_fork+0x22/0x40


So, amggpu_dm_do_flip()  is the bit that hangs.
If the GPU needs to be reset because some of it has hung, trying a "flip" is unlikely to work.
It is failing/hanging when doing "suspend of IP block <dm>" in amdgpu_device_ip_suspend_phase1().

I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while GPU is hung.

Comment 7 Alex Deucher 2019-04-29 18:30:01 UTC

(In reply to James.Dutton from comment #6)
> 
> I would suggest creating code that actually tries to reset the GPU, instead
> of trying to suspend it while GPU is hung.

That is part of the GPU reset sequence.  We need to attempt to stop the engines before resetting the GPU.  That is what the suspend code does.  Not all of the engines are necessarily hung so you need to stop and drain them properly.

Comment 8 James.Dutton 2019-04-29 22:41:26 UTC

Thank you for the feedback.
Is there a data sheet somewhere that might help me work out a fix for this.
What I would like is:
1) A way to scan all the engines and detect which ones have hung.
2) A way to intentionally halt an engine and tidy up. So that the modprobe, rmmod, modprobe scenario works. 
3) data sheet details regarding how to un-hang each engine.
Specifically, in this case the IP block <dm>.

Maybe that is not possible, and (I think you are hinting at it), one cannot reset an individual IP block. So the approach is to suspend the card, and then do a full reset of the entire card, then resume.

I think a different suspend process would be better.
We have a for_each within the suspend code. The output of that code should not be a single error code, but instead an array indicating the current state of each engine (running/hung), the intended state and status of whether the intention worked or failed. If the loop through the for_each, it could compare the current state and intended state, and attempt to reach the intended state, and report an error code for each engine. Then the code to achieve the transition can been different depending on the current -> intended transition.
i.e. code for running -> suspended, can be different than code for hung -> suspended. The code already needs to know which engines are enabled/disabled  (Vega 56 vs Vega 64)

I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.

Comment 9 Alex Deucher 2019-04-30 01:26:49 UTC

(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this.
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the gpu scheduler for a queue on a particular engine times out, you can be pretty sure the engine has hung.  At that point you can check the current busy status for the block (IP is_idle() callback).

> 2) A way to intentionally halt an engine and tidy up. So that the modprobe,
> rmmod, modprobe scenario works. 

hw_fini() IP callback.

> 3) data sheet details regarding how to un-hang each engine.
> Specifically, in this case the IP block <dm>.

Each IP has a soft reset (implemented via the IP soft_reset() callback), but depending on the hang, in some cases, you may have to do a full GPU reset to recover.  This is not a hw hang, it's a sw deadlock.  

> 
> Maybe that is not possible, and (I think you are hinting at it), one cannot
> reset an individual IP block. So the approach is to suspend the card, and
> then do a full reset of the entire card, then resume.

All asics support full GPU reset which is implemented via the SOC level amdgpu_asic_funcs reset() callback.

> 
> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state and status of
> whether the intention worked or failed. If the loop through the for_each, it
> could compare the current state and intended state, and attempt to reach the
> intended state, and report an error code for each engine. Then the code to
> achieve the transition can been different depending on the current ->
> intended transition.
> i.e. code for running -> suspended, can be different than code for hung ->
> suspended. The code already needs to know which engines are enabled/disabled
> (Vega 56 vs Vega 64)

We don't really care of the suspend fails or not.  See amdgpu_device_gpu_recover() for the full sequence.

> 
> I can hang this IP block <dm> at will. I have 2 games that hang it within
> seconds of starting.

There was a deadlock in the dm code which has been fixed.  Please try a new code base.  e.g.,
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip

Comment 10 James.Dutton 2019-04-30 10:40:24 UTC

Created attachment 144118 [details]
dmesg with drm-next-5.2-wip

Comment 11 James.Dutton 2019-04-30 10:44:25 UTC

I tried with drm-next-5.2-wip.

It does not hang any more, but I have a new error now.

It is better, in the sense that I can now reboot the system normally, and not resort to echo b >/proc/sysrq-trigger

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

After the GPU reset, the screen is corrupted.
I can do, via ssh,  service gdm stop.  service gdm start   and I then get a working login screen. (Mouse moves, I can type in password)
I cannot actually login because X fails. The desktop fails to appear and it returns to the login greeter screen.

I will try to get more details when I have time later.

Comment 12 James.Dutton 2019-04-30 14:22:19 UTC

The error is from this bit of code in:
amdgpu_cs.c:  Line about 232
In function: amdgpu_cs_parser_init:
        if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
                ret = -ECANCELED;
                goto free_all_kdata;
        }

So, I guess, somewhere is the gpu reset, those values need to be fixed up.

Comment 13 Michel Dänzer 2019-04-30 14:26:47 UTC

(In reply to James.Dutton from comment #12)
> In function: amdgpu_cs_parser_init:
>         if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
>                 ret = -ECANCELED;
>                 goto free_all_kdata;
>         }
> 
> So, I guess, somewhere is the gpu reset, those values need to be fixed up.

It means the VRAM contents were lost during the GPU reset, so any existing userspace contexts are invalid and need to be re-created (which at this point boils down to restarting any processes using the GPU for rendering).

Comment 14 James.Dutton 2019-04-30 14:43:17 UTC

I stop gdm and kill any remaining X processes.
When I start gdm and login, it works, and displays the desktop.

Previously, I was leaving on of the X processes running.

So, I think this (drm-next-5.2-wip) has fixed this bug.

Comment 15 Alessandro 2019-08-13 20:56:45 UTC

Created attachment 145050 [details]
dmsg drm amdgpu linux 5.3-rc4 from ubuntu ppa

I'm facing the same issue with 5.2.x and 5.3-rc4 kernel and a Radeon RX 580.

Comment 16 GitLab Migration User 2019-09-25 18:49:26 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1389.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.