Bug 111803

Summary: Annoying GPU hangs continue on Vega 20 with kernel 5.4 + mesa 9.3.0 + llvm 9.0.0 [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Product: DRI
Reporter: mikhail.v.gavrilov
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: not set
Priority: not set
CC: vicluo96
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:

Description mikhail.v.gavrilov 2019-09-24 17:54:40 UTC
Created attachment 145490 [details]
dmesg

Annoying GPU hangs continue on Vega 20 with kernel 5.4 + mesa 9.3.0 + llvm 9.0.0.

To reproduce, it is enough to launch the game Supraland from the Steam store on a machine that is under memory pressure.

[48662.086736] INFO: task OnlineA-nstance:153979 blocked for more than 122 seconds.
[48662.086740]       Not tainted 5.4.0-0.rc0.git4.1a.fc32.x86_64 #1
[48662.086743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[48662.086746] OnlineA-nstance D12600 153979 153907 0x80004002
[48662.086753] Call Trace:
[48662.086760]  ? __schedule+0x307/0x950
[48662.086770]  schedule+0x40/0xc0
[48662.086775]  schedule_timeout+0x289/0x3c0
[48662.086782]  ? mark_held_locks+0x50/0x80
[48662.086787]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[48662.086792]  ? lockdep_hardirqs_on+0xf0/0x180
[48662.086803]  dma_fence_wait_any_timeout+0x208/0x275
[48662.086881]  amdgpu_sa_bo_new+0x44b/0x510 [amdgpu]
[48662.086982]  amdgpu_ib_get+0x31/0x80 [amdgpu]
[48662.087075]  amdgpu_job_alloc_with_ib+0x46/0x70 [amdgpu]
[48662.087081]  ? find_held_lock+0x32/0x90
[48662.087154]  amdgpu_vm_sdma_prepare+0x30/0x90 [amdgpu]
[48662.087243]  amdgpu_vm_bo_update_mapping+0x7b/0xe0 [amdgpu]
[48662.087318]  amdgpu_vm_clear_freed+0xd5/0x1d0 [amdgpu]
[48662.087395]  amdgpu_gem_object_close+0x159/0x1b0 [amdgpu]
[48662.087407]  ? lockdep_hardirqs_on+0xf0/0x180
[48662.087432]  drm_gem_object_release_handle+0x30/0x90 [drm]
[48662.087447]  ? drm_gem_object_handle_put_unlocked+0xa0/0xa0 [drm]
[48662.087453]  idr_for_each+0x5e/0xd0
[48662.087459]  ? mark_held_locks+0x50/0x80
[48662.087477]  drm_gem_release+0x1c/0x30 [drm]
[48662.087492]  drm_file_free.part.0+0x22e/0x270 [drm]
[48662.087509]  drm_release+0xab/0xe0 [drm]
[48662.087517]  __fput+0xdd/0x270
[48662.087525]  task_work_run+0x93/0xd0
[48662.087533]  do_exit+0x349/0xcd0
[48662.087539]  ? find_held_lock+0x32/0x90
[48662.087548]  do_group_exit+0x47/0xb0
[48662.087554]  get_signal+0x17e/0xcb0
[48662.087565]  do_signal+0x36/0x680
[48662.087580]  exit_to_usermode_loop+0x8d/0x120
[48662.087588]  syscall_return_slowpath+0x205/0x330
[48662.087594]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[48662.087599] RIP: 0033:0x7f0b10b4ffaa
[48662.087606] Code: Bad RIP value.
[48662.087610] RSP: 002b:00007f0ae77fdc40 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[48662.087615] RAX: fffffffffffffdfc RBX: 00000000000051ac RCX: 00007f0b10b4ffaa
[48662.087619] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007f0b0ebf1170
[48662.087622] RBP: 00007f0b0ebf1148 R08: 0000000000000000 R09: 00000000ffffffff
[48662.087626] R10: 00007f0ae77fdd48 R11: 0000000000000246 R12: 0000000000000000
[48662.087629] R13: 00007f0b0ebf1120 R14: 00007f0b0ebf1170 R15: 00007f0ae77fdc80
[48662.087646] 
               Showing all locks held in the system:
[48662.087662] 1 lock held by khungtaskd/96:
[48662.087665]  #0: ffffffff8d693760 (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x174
[48662.087738] 1 lock held by CPU 0/KVM/3098:
[48662.087833] 2 locks held by dnf/104312:
[48662.087836]  #0: ffff8d88dacc80a0 (&tty->ldisc_sem){++++}, at: tty_ldisc_ref_wait+0x24/0x50
[48662.087844]  #1: ffffa1088052a2f0 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xe3/0x980
[48662.088002] 3 locks held by kworker/15:0/152888:
[48662.088005]  #0: ffff8d8936c21548 ((wq_completion)events){+.+.}, at: process_one_work+0x1e9/0x5a0
[48662.088012]  #1: ffffa1088d61fe50 ((work_completion)(&(&bdev->wq)->work)){+.+.}, at: process_one_work+0x1e9/0x5a0
[48662.088018]  #2: ffff8d892bf5c9f8 (reservation_ww_class_mutex){+.+.}, at: ttm_bo_delayed_delete+0x8d/0x200 [ttm]
[48662.088032] 3 locks held by OnlineA-nstance/153979:
[48662.088035]  #0: ffffffffc0303070 (drm_global_mutex){+.+.}, at: drm_release+0x2c/0xe0 [drm]
[48662.088054]  #1: ffffa1088d457b30 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_object_close+0xce/0x1b0 [amdgpu]
[48662.088126]  #2: ffff8d892bf5c9f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x349/0x620 [ttm]

[48662.088146] =============================================
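
The "Showing all locks held in the system" dump above can be scanned mechanically for lock addresses that appear under more than one task. The following is a minimal Python sketch, not part of the original report: the function name `shared_locks`, the regular expressions, and the `SAMPLE` excerpt (copied from the log above) are illustrative assumptions. A shared address is only a hint about possible contention, because lockdep may also list a lock that a task is still waiting to acquire.

```python
import re
from collections import defaultdict

# Header line: "N lock(s) held by <task>/<pid>:" (task names may contain "/").
HOLDER_RE = re.compile(r"\d+ locks? held by (?P<task>.+)/(?P<pid>\d+):\s*$")
# Entry line: " #0: <16-hex address> (<lock name>){<flags>}, at: <call site>"
LOCK_RE = re.compile(
    r"#\d+: (?P<addr>[0-9a-f]{16}) \((?P<name>.+)\)\{[^}]*\}, at: (?P<site>\S+)"
)

def shared_locks(dmesg_text):
    """Return {address: [(task, lock name, call site), ...]} for every
    lock address that is listed under two or more different tasks."""
    by_addr = defaultdict(list)
    task = None
    for line in dmesg_text.splitlines():
        m = HOLDER_RE.search(line)
        if m:
            task = f"{m['task']}/{m['pid']}"
            continue
        m = LOCK_RE.search(line)
        if m and task is not None:
            by_addr[m["addr"]].append((task, m["name"], m["site"]))
    return {
        addr: entries
        for addr, entries in by_addr.items()
        if len({t for t, _, _ in entries}) > 1
    }

# Excerpt of the lock dump from this report, used as sample input.
SAMPLE = """\
[48662.088002] 3 locks held by kworker/15:0/152888:
[48662.088005]  #0: ffff8d8936c21548 ((wq_completion)events){+.+.}, at: process_one_work+0x1e9/0x5a0
[48662.088012]  #1: ffffa1088d61fe50 ((work_completion)(&(&bdev->wq)->work)){+.+.}, at: process_one_work+0x1e9/0x5a0
[48662.088018]  #2: ffff8d892bf5c9f8 (reservation_ww_class_mutex){+.+.}, at: ttm_bo_delayed_delete+0x8d/0x200 [ttm]
[48662.088032] 3 locks held by OnlineA-nstance/153979:
[48662.088035]  #0: ffffffffc0303070 (drm_global_mutex){+.+.}, at: drm_release+0x2c/0xe0 [drm]
[48662.088054]  #1: ffffa1088d457b30 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_object_close+0xce/0x1b0 [amdgpu]
[48662.088126]  #2: ffff8d892bf5c9f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x349/0x620 [ttm]
"""

for addr, entries in shared_locks(SAMPLE).items():
    for task, name, site in entries:
        print(addr, name, task, site)
```

On this excerpt the only address reported is ffff8d892bf5c9f8 (reservation_ww_class_mutex), listed under both kworker/15:0/152888 and the hung OnlineA-nstance/153979 task.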
Comment 1 mikhail.v.gavrilov 2019-09-24 17:56:34 UTC
Created attachment 145491 [details]
./umr -O halt_waves -wa
Comment 2 mikhail.v.gavrilov 2019-09-24 17:56:53 UTC
Created attachment 145492 [details]
./umr -R gfx[.]
Comment 3 mikhail.v.gavrilov 2019-09-24 17:57:13 UTC
Created attachment 145493 [details]
./umr -O many,bits -r *.*.mmGRBM_STATUS*
Comment 4 mikhail.v.gavrilov 2019-09-24 19:04:26 UTC
Oops, while I was uploading the previous file, yet another hang happened on the machine where I am filing this bug report. This machine also has a Vega 20 GPU.
Comment 5 mikhail.v.gavrilov 2019-09-24 19:05:33 UTC
Created attachment 145501 [details]
./umr -O many,bits -r *.*.mmCP_EOP_*
Comment 6 mikhail.v.gavrilov 2019-09-24 19:05:50 UTC
Created attachment 145502 [details]
./umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP
Comment 7 mikhail.v.gavrilov 2019-09-24 19:06:07 UTC
Created attachment 145503 [details]
./umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP
Comment 8 mikhail.v.gavrilov 2019-09-26 18:48:21 UTC
Created attachment 145528 [details]
dmesg
Comment 9 mikhail.v.gavrilov 2019-09-26 18:48:44 UTC
Created attachment 145529 [details]
./umr -O halt_waves -wa
Comment 10 mikhail.v.gavrilov 2019-09-26 18:49:02 UTC
Created attachment 145530 [details]
./umr -R gfx[.]
Comment 11 mikhail.v.gavrilov 2019-09-26 18:49:34 UTC
Created attachment 145531 [details]
./umr -O many,bits -r *.*.mmGRBM_STATUS*
Comment 12 mikhail.v.gavrilov 2019-09-26 18:50:19 UTC
Created attachment 145532 [details]
./umr -O many,bits -r *.*.mmCP_EOP_*
Comment 13 mikhail.v.gavrilov 2019-09-26 18:50:43 UTC
Created attachment 145533 [details]
./umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP
Comment 14 mikhail.v.gavrilov 2019-09-26 18:51:05 UTC
Created attachment 145534 [details]
./umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP
Comment 15 mikhail.v.gavrilov 2019-09-27 16:54:01 UTC
Created attachment 145550 [details]
trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
Comment 16 mikhail.v.gavrilov 2019-09-27 16:54:22 UTC
Created attachment 145551 [details]
dmesg
Comment 17 mikhail.v.gavrilov 2019-09-27 16:54:58 UTC
Created attachment 145552 [details]
./umr -O halt_waves -wa
Comment 18 mikhail.v.gavrilov 2019-09-27 16:55:15 UTC
Created attachment 145553 [details]
./umr -R gfx[.]
Comment 19 mikhail.v.gavrilov 2019-09-27 16:55:29 UTC
Created attachment 145554 [details]
./umr -O many,bits -r *.*.mmGRBM_STATUS*
Comment 20 mikhail.v.gavrilov 2019-09-27 16:55:44 UTC
Created attachment 145555 [details]
./umr -O many,bits -r *.*.mmCP_EOP_*
Comment 21 mikhail.v.gavrilov 2019-09-27 16:56:00 UTC
Created attachment 145556 [details]
./umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP
Comment 22 mikhail.v.gavrilov 2019-09-27 16:56:17 UTC
Created attachment 145557 [details]
./umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP
Comment 23 mikhail.v.gavrilov 2019-09-30 04:18:28 UTC
Created attachment 145588 [details]
dmesg
Comment 24 Zheng Luo 2019-09-30 06:43:06 UTC
Also happens on a Lenovo E585 with the latest firmware (R0UET74W (1.54)), AMD 2500U w/ Vega 8, kernel 5.3.1-arch1-1-ARCH, mesa 19.1.7-1, llvm 8.0.1. It happened after I launched LibreOffice Sheet.

Sep 29 23:29:36 lzThinkpad gnome-shell[1676]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed
Sep 29 23:29:41 lzThinkpad kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 29 23:29:45 lzThinkpad tracker-store[1810]: OK
Sep 29 23:29:45 lzThinkpad systemd[1613]: tracker-store.service: Succeeded.
Sep 29 23:29:46 lzThinkpad kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 29 23:29:46 lzThinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=757, emitted seq=759
Sep 29 23:29:46 lzThinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1676 thread gnome-shel:cs0 pid 1683
Sep 29 23:29:46 lzThinkpad kernel: [drm] GPU recovery disabled.
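
The "ring gfx timeout" line in the log above carries two sequence numbers whose difference suggests how many submissions were still outstanding when the job timed out. The following is a minimal Python sketch for pulling those values out of a journal or dmesg dump; the function name `ring_timeouts`, the `pending` field, and the reading of `emitted - signaled` as the number of unsignaled fences are illustrative assumptions, not something stated in the report.

```python
import re

# Matches the amdgpu_job_timedout message format seen in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def ring_timeouts(log_text):
    """Return one dict per ring timeout message found in log_text."""
    results = []
    for m in TIMEOUT_RE.finditer(log_text):
        signaled = int(m["signaled"])
        emitted = int(m["emitted"])
        results.append({
            "ring": m["ring"],
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: fences emitted but not yet signaled.
            "pending": emitted - signaled,
        })
    return results

# Line copied from the journal excerpt in this comment.
JOURNAL_SAMPLE = (
    "Sep 29 23:29:46 lzThinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx timeout, signaled seq=757, emitted seq=759\n"
)

for entry in ring_timeouts(JOURNAL_SAMPLE):
    print(entry)
```

For the line above this yields ring gfx with signaled 757, emitted 759, and a difference of 2.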
Comment 25 Zheng Luo 2019-09-30 06:43:46 UTC
Created attachment 145589 [details]
dmesg of AMD 2500U w/ Vega 8
Comment 26 mikhail.v.gavrilov 2019-10-05 09:17:52 UTC
Created attachment 145655 [details]
./umr -O halt_waves -wa
Comment 27 mikhail.v.gavrilov 2019-10-05 09:18:21 UTC
Created attachment 145656 [details]
./umr -R gfx[.]
Comment 28 mikhail.v.gavrilov 2019-10-05 09:18:53 UTC
Created attachment 145657 [details]
./umr -O many,bits -r *.*.mmGRBM_STATUS*
Comment 29 mikhail.v.gavrilov 2019-10-05 09:19:35 UTC
Created attachment 145658 [details]
./umr -O many,bits -r *.*.mmCP_EOP_*
Comment 30 mikhail.v.gavrilov 2019-10-05 09:20:11 UTC
Created attachment 145659 [details]
./umr -O many,bits -r *.*.mmCP_PFP_HEADER_DUMP
Comment 31 mikhail.v.gavrilov 2019-10-05 09:20:40 UTC
Created attachment 145660 [details]
./umr -O many,bits -r *.*.mmCP_ME_HEADER_DUMP
Comment 32 mikhail.v.gavrilov 2019-10-05 09:21:04 UTC
Created attachment 145661 [details]
dmesg
Comment 33 Martin Peres 2019-11-19 09:53:43 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/916.
