Bug 106500 - GPU Recovery + DC deadlock
Summary: GPU Recovery + DC deadlock
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: Other (OS: All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-13 11:38 UTC by Bas Nieuwenhuizen
Modified: 2019-11-19 08:38 UTC
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
Quick try to avoid deadlock (1.23 KB, text/plain)
2018-05-14 18:31 UTC, Andrey Grodzovsky
no flags Details
dmesg after trying 139562 (131.15 KB, text/plain)
2018-05-14 21:59 UTC, Bas Nieuwenhuizen
no flags Details

Description Bas Nieuwenhuizen 2018-05-13 11:38:43 UTC
If you try to reset a GPU using 

cat /sys/kernel/debug/dri/2/amdgpu_gpu_recovery

while the GPU is hung, the kernel deadlocks if the GPU is also being used to drive a display.

I found two causes. If I hang the GPU with the libdrm tests, I get a deadlock in

https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c?h=amd-staging-drm-next&id=da603c1d0aac505485490f5e0ba495d4e292e7b9#n1876

It looks like we disable DC during the reset, but as part of the disabling we change the clocks, and for that we wait until the GPU is idle. Of course, a hung GPU is never going to go idle without intervention.
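
To make the call chain explicit, here is a stripped-down sketch of what the trace below shows (an illustration only, not the actual amd-staging-drm-next code; the function body is simplified):

/* Illustration only: the DC disable path ends up in
 * amdgpu_pm_compute_clocks(), which drains every ring before
 * reprogramming clocks. On a hung ring the last emitted fence never
 * signals, so the recovery thread sleeps forever in dma_fence_wait(). */
static void compute_clocks_sketch(struct amdgpu_device *adev)
{
        int i;

        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (!ring || !ring->ready)
                        continue;

                /* Waits on the ring's last emitted fence -- never
                 * signalled while the GPU is hung. */
                amdgpu_fence_wait_empty(ring);
        }

        /* ... clock/power-state reprogramming would follow here ... */
}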

Supporting trace:

[ 1842.823262] INFO: task cat:3635 blocked for more than 120 seconds.
[ 1842.823268]       Tainted: G        W        4.16.0-rc7-g36031c0dfb2d #6
[ 1842.823270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1842.823271] cat             D    0  3635   3630 0x00000000
[ 1842.823275] Call Trace:
[ 1842.823285]  ? __schedule+0x23c/0x870
[ 1842.823288]  schedule+0x2f/0x90
[ 1842.823291]  schedule_timeout+0x1fc/0x460
[ 1842.823296]  ? __alloc_pages_nodemask+0x10f/0xfd0
[ 1842.823300]  dma_fence_default_wait+0x1eb/0x280
[ 1842.823303]  ? dma_fence_default_wait+0x280/0x280
[ 1842.823306]  dma_fence_wait_timeout+0x38/0x110
[ 1842.823331]  amdgpu_fence_wait_empty+0x98/0xd0 [amdgpu]
[ 1842.823356]  ? dc_remove_plane_from_context+0x202/0x240 [amdgpu]
[ 1842.823378]  amdgpu_pm_compute_clocks.part.8+0x70/0x590 [amdgpu]
[ 1842.823409]  dm_pp_apply_display_requirements+0x159/0x160 [amdgpu]
[ 1842.823433]  pplib_apply_display_requirements+0x197/0x1c0 [amdgpu]
[ 1842.823457]  dc_commit_state+0x23b/0x560 [amdgpu]
[ 1842.823481]  ? dce112_validate_bandwidth+0x1bd/0x230 [amdgpu]
[ 1842.823506]  ? dce112_validate_bandwidth+0x1c9/0x230 [amdgpu]
[ 1842.823535]  amdgpu_dm_atomic_commit_tail+0x27a/0xc70 [amdgpu]
[ 1842.823540]  ? __wake_up_common_lock+0x89/0xc0
[ 1842.823542]  ? wait_for_common+0x151/0x180
[ 1842.823545]  ? wait_for_common+0x151/0x180
[ 1842.823551]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 1842.823557]  drm_atomic_helper_commit+0xfc/0x110 [drm_kms_helper]
[ 1842.823562]  drm_atomic_helper_disable_all+0x158/0x1b0 [drm_kms_helper]
[ 1842.823567]  drm_atomic_helper_suspend+0xd6/0x130 [drm_kms_helper]
[ 1842.823587]  amdgpu_device_gpu_recover+0x60f/0x8b0 [amdgpu]
[ 1842.823591]  ? __kmalloc_node+0x204/0x2b0
[ 1842.823611]  amdgpu_debugfs_gpu_recover+0x30/0x40 [amdgpu]
[ 1842.823615]  seq_read+0xee/0x480
[ 1842.823619]  full_proxy_read+0x53/0x80
[ 1842.823624]  __vfs_read+0x36/0x150
[ 1842.823627]  vfs_read+0x91/0x130
[ 1842.823630]  SyS_read+0x52/0xc0
[ 1842.823634]  do_syscall_64+0x67/0x120
[ 1842.823637]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1842.823640] RIP: 0033:0x7f987c073701
[ 1842.823642] RSP: 002b:00007ffc0deeac08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1842.823644] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f987c073701
[ 1842.823646] RDX: 0000000000020000 RSI: 00007f987c549000 RDI: 0000000000000003
[ 1842.823647] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
[ 1842.823649] R10: 0000000000000022 R11: 0000000000000246 R12: 00007f987c549000
[ 1842.823650] R13: 0000000000000003 R14: 00007f987c54900f R15: 0000000000020000

I managed to "fix" this by commenting out that code. Now a hang caused by a libdrm test recovers successfully, though the display (even on the non-X VTs) is garbled.

However, I recently had a game hang the GPU, and trying to recover from that still deadlocked:

[127426.165215] INFO: task kworker/u256:0:77605 blocked for more than 120 seconds.
[127426.165221]       Tainted: G        W        4.16.0-rc7-gffd4abe7dbf9 #7
[127426.165222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[127426.165224] kworker/u256:0  D    0 77605      2 0x80000000
[127426.165236] Workqueue: events_unbound commit_work [drm_kms_helper]
[127426.165239] Call Trace:
[127426.165248]  ? __schedule+0x23c/0x870
[127426.165251]  schedule+0x2f/0x90
[127426.165254]  schedule_timeout+0x1fc/0x460
[127426.165283]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[127426.165308]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[127426.165312]  dma_fence_default_wait+0x1eb/0x280
[127426.165315]  ? dma_fence_default_wait+0x280/0x280
[127426.165317]  dma_fence_wait_timeout+0x38/0x110
[127426.165320]  reservation_object_wait_timeout_rcu+0x187/0x360
[127426.165350]  amdgpu_dm_do_flip+0x109/0x350 [amdgpu]
[127426.165382]  amdgpu_dm_atomic_commit_tail+0xa7c/0xc70 [amdgpu]
[127426.165386]  ? wait_for_common+0x151/0x180
[127426.165390]  ? pick_next_task_fair+0x48c/0x5a0
[127426.165393]  ? __switch_to+0x199/0x460
[127426.165399]  commit_tail+0x3d/0x70 [drm_kms_helper]
[127426.165403]  process_one_work+0x1ce/0x3f0
[127426.165405]  worker_thread+0x2b/0x3d0
[127426.165408]  ? process_one_work+0x3f0/0x3f0
[127426.165410]  kthread+0x113/0x130
[127426.165413]  ? kthread_create_on_node+0x70/0x70
[127426.165416]  ret_from_fork+0x22/0x40

which seems to be here:

https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c?h=amd-staging-drm-next&id=da603c1d0aac505485490f5e0ba495d4e292e7b9#n3973

This is before the GPU reset itself happens, so either we somehow use a BO in the disabled state, or this is an earlier flip.

Either way, that wait is never going to finish with a hung GPU.
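
Schematically, the pattern in this second trace looks like the following (again only an illustration, not the real amdgpu_dm_do_flip; the timeout handling and error paths are omitted):

/* Illustration only: before programming the flip, the driver waits for
 * the fences attached to the framebuffer BO's reservation object. If
 * any of them was emitted to the now-hung GPU it never signals, so the
 * commit_work worker blocks and the hung-task watchdog fires. */
static void do_flip_sketch(struct amdgpu_bo *abo, unsigned long timeout)
{
        /* wait_all = true, intr = false */
        reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
                                            timeout);

        /* ... write the new surface address and arm the flip here ... */
}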
Comment 1 Andrey Grodzovsky 2018-05-14 18:31:35 UTC
Created attachment 139562 [details]
Quick try to avoid deadlock

Can you give this a quick try and see if it helps?
Comment 2 Bas Nieuwenhuizen 2018-05-14 21:59:15 UTC
Created attachment 139568 [details]
dmesg after trying 139562

I tried the patch and as expected we do not deadlock at the original places since we don't call those anymore. But I get garbage on my display (possibly expected due to loss of VRAM), can't switch VT and stopping X hangs X.

Furthermore I eventually still get stuck fence waits in dmesg (attached).

Furthermore, it seems the UVD ring test fails.
Comment 3 Andrey Grodzovsky 2018-05-14 22:30:37 UTC
(In reply to Bas Nieuwenhuizen from comment #2)
> Created attachment 139568 [details]
> dmesg after trying 139562
> 
> I tried the patch and as expected we do not deadlock at the original places
> since we don't call those anymore. But I get garbage on my display (possibly
> expected due to loss of VRAM), can't switch VT and stopping X hangs X.
> 
> Furthermore I eventually still get stuck fence waits in dmesg (attached).
> 
> Furthermore, it seems the UVD ring test fails.

I think the garbage is indeed due to the VRAM loss; maybe we don't create a shadow BO for the display's BO. The GPU reset fails due to UVD failing to resume and an SMU failure, so I believe that's why any further fence submission hangs. The pipe never recovers.

Harry, check the patch I attached; there is no reason to call drm_atomic_helper_suspend/resume explicitly from amdgpu_device_gpu_recover. First of all, it is already called from the display code via the amd_ip_funcs.suspend/resume hooks.
Second, the place where it is called in amdgpu_device_gpu_recover is wrong for GPU stalls, since it runs BEFORE we cancel and force completion of all in-flight jobs that are stuck on the GPU. So, as Bas explained, it will try to wait for a fence in amdgpu_pm_compute_clocks, but the pipe is hung, so we end up in a deadlock. If we do the mode set AFTER the forced completion (as the patch makes happen), no deadlock will happen.
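
In other words, roughly this ordering (a sketch of the idea only, not attachment 139562; the *_placeholder() helpers stand in for the existing recovery code):

/* Sketch of the proposed ordering, not the actual patch. */
static int gpu_recover_ordering_sketch(struct amdgpu_device *adev)
{
        struct drm_atomic_state *state;

        /* Deadlock-prone order (current code): the display is suspended
         * while the rings are still hung, so the atomic commit waits on
         * fences that can never signal:
         *
         *     state = drm_atomic_helper_suspend(adev->ddev);
         *     cancel_and_force_complete_jobs_placeholder(adev);
         */

        /* Safe order: force-complete the stuck jobs first so every
         * pending fence signals, then touch the display and reset. */
        cancel_and_force_complete_jobs_placeholder(adev);

        state = drm_atomic_helper_suspend(adev->ddev);
        asic_reset_placeholder(adev);
        drm_atomic_helper_resume(adev->ddev, state);

        return 0;
}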

The UVD/SMU failures require further debugging, but I am on a different task at the moment, so maybe someone can pick this up...
Do you remember why that code is there? I think it's a remnant of old code.
If you're OK with this patch, I will send it for review.

Comment 4 Martin Peres 2019-11-19 08:38:22 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/385.

