Bug 104001

Summary: GPU driver hangs when starting the Steam client during video playback on YouTube (occurs on the latest staging kernel)
Product: DRI
Reporter: mikhail.v.gavrilov
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: devurandom, mboquien
Version: XOrg git
Hardware: Other
OS: All
Attachments:
dmesg (flags: none)
dmesg (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next (flags: none)
dmesg with 4.15.0-rc2 amd-staging-drm-next with SysRq : Show State (flags: none)
dmesg with 4.15.0-rc4 amd-staging-drm-next with SysRq : Show State (flags: none)
dmesg with 4.15.0-rc4 amd-staging-drm-next e6555e61902c with SysRq : Show State (flags: none)
dmesg with 4.16.0-rc1 amd-staging-drm-next (flags: none)
photo of test when computer is hang (flags: none)
my si_pipe.c file (flags: none)
Here GPU VEGA always hungs (flags: none)

Description mikhail.v.gavrilov 2017-11-30 20:53:16 UTC
Created attachment 135839 [details]
dmesg

* Fedora 27 - https://download.fedoraproject.org/pub/fedora/linux/releases/27/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso
* staging kernel - git://people.freedesktop.org/~agd5f/linux branch amd-staging-drm-next
* mesa 17.4 and llvm 6.0 - https://copr.fedorainfracloud.org/coprs/che/mesa/

Steps to reproduce:
1) Start playing a video on YouTube in a browser (Firefox or Opera, it doesn't matter).
2) Launch the Steam client.
After a few seconds the GPU driver hangs.

Demonstration: https://youtu.be/2LuWI47oCFg

If we then wait more than two minutes, we get the following backtrace:

[  492.840627] INFO: task kworker/u16:5:147 blocked for more than 120 seconds.
[  492.840641]       Not tainted 4.14.0-rc3-amd-vega+ #5
[  492.840644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.840648] kworker/u16:5   D11392   147      2 0x80000000
[  492.840662] Workqueue: events_unbound commit_work [drm_kms_helper]
[  492.840666] Call Trace:
[  492.840674]  __schedule+0x2dc/0xbb0
[  492.840681]  schedule+0x33/0x90
[  492.840694]  schedule_timeout+0x288/0x5c0
[  492.840701]  ? mark_held_locks+0x57/0x80
[  492.840704]  ? _raw_spin_unlock_irqrestore+0x36/0x60
[  492.840713]  dma_fence_default_wait+0x22a/0x380
[  492.840716]  ? dma_fence_default_wait+0x22a/0x380
[  492.840720]  ? dma_fence_release+0x170/0x170
[  492.840725]  dma_fence_wait_timeout+0x4f/0x270
[  492.840729]  reservation_object_wait_timeout_rcu+0x18d/0x510
[  492.840768]  amdgpu_dm_do_flip+0x12b/0x390 [amdgpu]
[  492.840801]  amdgpu_dm_atomic_commit_tail+0xbe1/0xe80 [amdgpu]
[  492.840815]  commit_tail+0x3f/0x70 [drm_kms_helper]
[  492.840820]  commit_work+0x12/0x20 [drm_kms_helper]
[  492.840824]  process_one_work+0x26b/0x6c0
[  492.840832]  worker_thread+0x35/0x3b0
[  492.840837]  kthread+0x171/0x190
[  492.840840]  ? process_one_work+0x6c0/0x6c0
[  492.840843]  ? kthread_create_on_node+0x70/0x70
[  492.840847]  ? kthread_create_on_node+0x70/0x70
[  492.840850]  ret_from_fork+0x2a/0x40
[  492.840906] INFO: task amdgpu_cs:0:2013 blocked for more than 120 seconds.
[  492.840909]       Not tainted 4.14.0-rc3-amd-vega+ #5
[  492.840912] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.840915] amdgpu_cs:0     D13312  2013   1981 0x00000000
[  492.840921] Call Trace:
[  492.840926]  __schedule+0x2dc/0xbb0
[  492.840931]  ? save_stack_trace+0x1b/0x20
[  492.840936]  schedule+0x33/0x90
[  492.840939]  schedule_timeout+0x288/0x5c0
[  492.840944]  ? mark_held_locks+0x57/0x80
[  492.840947]  ? _raw_spin_unlock_irqrestore+0x36/0x60
[  492.840951]  ? trace_hardirqs_on_caller+0xf4/0x190
[  492.840957]  dma_fence_default_wait+0x22a/0x380
[  492.840960]  ? dma_fence_default_wait+0x22a/0x380
[  492.840964]  ? dma_fence_release+0x170/0x170
[  492.840969]  dma_fence_wait_timeout+0x4f/0x270
[  492.840987]  amdgpu_ctx_wait_prev_fence+0x4a/0x80 [amdgpu]
[  492.841003]  amdgpu_cs_ioctl+0xaf/0x1eb0 [amdgpu]
[  492.841038]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[  492.841051]  drm_ioctl_kernel+0x5d/0xb0 [drm]
[  492.841060]  drm_ioctl+0x31b/0x3d0 [drm]
[  492.841074]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[  492.841081]  ? trace_hardirqs_on_caller+0xf4/0x190
[  492.841085]  ? trace_hardirqs_on+0xd/0x10
[  492.841101]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[  492.841107]  do_vfs_ioctl+0xa6/0x6c0
[  492.841115]  SyS_ioctl+0x79/0x90
[  492.841120]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[  492.841123] RIP: 0033:0x7f7eb9078dc7
[  492.841125] RSP: 002b:00007f7eb0c479b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  492.841130] RAX: ffffffffffffffda RBX: 00000000025f5ab8 RCX: 00007f7eb9078dc7
[  492.841132] RDX: 00007f7eb0c47a20 RSI: 00000000c0186444 RDI: 000000000000000b
[  492.841134] RBP: 000000000266f860 R08: 00007f7eb0c47ad0 R09: 00007f7eb0c47a00
[  492.841136] R10: 00007f7eb0c47ad0 R11: 0000000000000246 R12: 0000000000000007
[  492.841138] R13: 0000000000000001 R14: 00000000025f5ab8 R15: 0000000000000000
[  492.841171] INFO: task kworker/3:3:2263 blocked for more than 120 seconds.
[  492.841174]       Not tainted 4.14.0-rc3-amd-vega+ #5
[  492.841177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.841180] kworker/3:3     D13448  2263      2 0x80000000
[  492.841190] Workqueue: events ttm_bo_delayed_workqueue [ttm]
[  492.841194] Call Trace:
[  492.841199]  __schedule+0x2dc/0xbb0
[  492.841206]  schedule+0x33/0x90
[  492.841209]  schedule_preempt_disabled+0x15/0x20
[  492.841212]  __ww_mutex_lock.constprop.9+0xa6f/0x10a0
[  492.841216]  ? __lock_is_held+0x59/0xa0
[  492.841220]  ? ttm_bo_delayed_delete+0x108/0x1b0 [ttm]
[  492.841228]  ww_mutex_lock+0x5e/0x70
[  492.841231]  ? ww_mutex_lock+0x5e/0x70
[  492.841235]  ttm_bo_delayed_delete+0x108/0x1b0 [ttm]
[  492.841243]  ttm_bo_delayed_workqueue+0x1b/0x40 [ttm]
[  492.841246]  process_one_work+0x26b/0x6c0
[  492.841253]  worker_thread+0x35/0x3b0
[  492.841259]  kthread+0x171/0x190
[  492.841262]  ? process_one_work+0x6c0/0x6c0
[  492.841264]  ? kthread_create_on_node+0x70/0x70
[  492.841269]  ret_from_fork+0x2a/0x40
[  492.841323] 
               Showing all locks held in the system:
[  492.841332] 1 lock held by khungtaskd/62:
[  492.841336]  #0:  (tasklist_lock){.+.+}, at: [<ffffffffba1116ed>] debug_show_all_locks+0x3d/0x1a0
[  492.841353] 3 locks held by kworker/u16:5/147:
[  492.841355]  #0:  ("events_unbound"){+.+.}, at: [<ffffffffba0ceb41>] process_one_work+0x1e1/0x6c0
[  492.841365]  #1:  ((&state->commit_work)){+.+.}, at: [<ffffffffba0ceb41>] process_one_work+0x1e1/0x6c0
[  492.841376]  #2:  (reservation_ww_class_mutex){+.+.}, at: [<ffffffffc03bbbea>] amdgpu_dm_do_flip+0xea/0x390 [amdgpu]
[  492.841451] 1 lock held by gnome-shell/1981:
[  492.841453]  #0:  (reservation_ww_class_mutex){+.+.}, at: [<ffffffffc0273626>] ttm_bo_vm_fault+0x66/0x5d0 [ttm]
[  492.841468] 1 lock held by amdgpu_cs:0/2013:
[  492.841470]  #0:  (&ctx->lock){+.+.}, at: [<ffffffffc02e4d4e>] amdgpu_cs_ioctl+0x59e/0x1eb0 [amdgpu]
[  492.841513] 3 locks held by kworker/3:3/2263:
[  492.841515]  #0:  ("events"){+.+.}, at: [<ffffffffba0ceb41>] process_one_work+0x1e1/0x6c0
[  492.841526]  #1:  ((&(&bdev->wq)->work)){+.+.}, at: [<ffffffffba0ceb41>] process_one_work+0x1e1/0x6c0
[  492.841536]  #2:  (reservation_ww_class_mutex){+.+.}, at: [<ffffffffc026f948>] ttm_bo_delayed_delete+0x108/0x1b0 [ttm]
[  492.841623] 1 lock held by steamerrorrepor/4598:
[  492.841625]  #0:  (drm_global_mutex){+.+.}, at: [<ffffffffc01f86ab>] drm_release+0x3b/0x3b0 [drm]

[  492.841644] =============================================
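
Editorial note: the "Showing all locks held in the system" dump above can be summarized mechanically. The following is a minimal sketch in Python, assuming the lockdep output format shown in this report; the helper name `holders_by_lock` is hypothetical and not part of any kernel tooling. It groups tasks by lock class name. Two caveats: several tasks listed under `reservation_ww_class_mutex` may be holding different buffer objects of the same class, and lockdep also lists a lock that a task is still waiting to acquire (here kworker/3:3 is blocked in `ww_mutex_lock` at the same `ttm_bo_delayed_delete+0x108` address that the dump reports).

```python
import re
from collections import defaultdict

# Header line, e.g. "3 locks held by kworker/u16:5/147:"
HOLDER_RE = re.compile(r"\d+ locks? held by (?P<task>.+)/(?P<pid>\d+):")
# Lock line, e.g. " #2:  (reservation_ww_class_mutex){+.+.}, at: ..."
LOCK_RE = re.compile(r"#\d+:\s+\((?P<lock>.+)\)\{")


def holders_by_lock(dmesg_text):
    """Map each lock class name to the 'task/pid' strings listed under it."""
    result = defaultdict(list)
    current = None  # task whose lock list we are currently reading
    for line in dmesg_text.splitlines():
        header = HOLDER_RE.search(line)
        if header:
            current = f"{header['task']}/{header['pid']}"
            continue
        lock = LOCK_RE.search(line)
        if lock and current:
            result[lock["lock"]].append(current)
    return dict(result)
```

Feeding the dump above to `holders_by_lock` and looking up `reservation_ww_class_mutex` lists the flip worker, gnome-shell, and the TTM delayed-delete worker together, which is a quick way to see which tasks are involved in the buffer reservation path.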
Comment 1 Michel Dänzer 2017-12-01 09:06:22 UTC
Does "it occurs on latest staging kernel" mean it doesn't happen with an earlier staging kernel or with another kernel version? If so, can you provide more details about what kernels it doesn't happen with?
Comment 2 mikhail.v.gavrilov 2017-12-01 09:18:58 UTC
Earlier kernels don't support the Vega GPU, so I can't retest with an earlier kernel, which works fine with the iGPU on the same machine.
Comment 3 mikhail.v.gavrilov 2017-12-05 16:56:28 UTC
Created attachment 135984 [details]
dmesg
Comment 4 mikhail.v.gavrilov 2017-12-06 20:13:26 UTC
Created attachment 136012 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Comment 5 mikhail.v.gavrilov 2017-12-07 17:08:28 UTC
Created attachment 136036 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Comment 6 mikhail.v.gavrilov 2017-12-15 17:43:50 UTC
Created attachment 136200 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Comment 7 mikhail.v.gavrilov 2017-12-21 18:27:38 UTC
With the latest build, this message appears in dmesg when the hang occurs again:
[  341.475043] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=110200, last emitted seq=110202
[  341.475059] [drm] No hardware hang detected. Did some blocks stall?
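
Editorial note: the timeout message reports two fence sequence numbers for the ring. Assuming that "last emitted" counts submissions pushed to the ring and "last signaled" counts those the GPU has reported as finished, their difference is the number of submissions still outstanding when the timeout fired. A minimal Python sketch of that arithmetic follows; the function name `pending_fences` is hypothetical.

```python
import re

TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for an amdgpu timeout line, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    outstanding = int(match["emitted"]) - int(match["signaled"])
    return match["ring"], outstanding
```

For the line above this gives the `gfx` ring with 110202 - 110200 = 2 outstanding submissions.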
Comment 8 mikhail.v.gavrilov 2017-12-21 18:28:03 UTC
Created attachment 136346 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Comment 9 mikhail.v.gavrilov 2018-01-03 09:58:48 UTC
Created attachment 136517 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Comment 10 Christian König 2018-01-03 10:07:04 UTC
Yeah, I enabled more error messages on amd-staging-drm-next.

But please don't change the bug subject to something less descriptive.
Comment 11 mikhail.v.gavrilov 2018-01-05 12:14:45 UTC
(In reply to Christian König from comment #10)
> Yeah, I enabled more error messages on amd-staging-drm-next.
But is it still not enough to understand the root cause?

Can you also rebase amd-staging-drm-next to RC5 with the KPTI patch enabled? I do not want to sit on a vulnerable kernel. The default kernel shipped in Fedora is already patched but does not have AMD Vega support.
Comment 12 mikhail.v.gavrilov 2018-01-06 01:18:46 UTC
https://youtu.be/MZ2O4XqxBZM
Comment 13 mikhail.v.gavrilov 2018-01-06 01:20:50 UTC
Created attachment 136579 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Comment 14 Dieter Nützel 2018-01-06 03:44:38 UTC
(In reply to mikhail.v.gavrilov from comment #11)
> (In reply to Christian König from comment #10)
> > Yeah, I enabled more error messages on amd-staging-drm-next.
> But is it still not enough to understand the root cause?
> 
> Can you also rebase amd-staging-drm-next to RC5 with the KPTI patch enabled? I
> do not want to sit on a vulnerable kernel. The default kernel shipped in
> Fedora is already patched but does not have AMD Vega support.

Christian and Alex,

When you do this (rebase to RCx with the KPTI patch enabled), please rebase to at least RC7 (which does NOT penalize AMD chips), even though I'm on an Intel Xeon currently... ;-)
Comment 15 mikhail.v.gavrilov 2018-01-07 17:00:36 UTC
Created attachment 136599 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next with SysRq : Show State
Comment 16 mikhail.v.gavrilov 2018-01-17 17:51:49 UTC
Created attachment 136809 [details]
dmesg with 4.15.0-rc4 amd-staging-drm-next with SysRq : Show State
Comment 17 mikhail.v.gavrilov 2018-01-18 21:05:38 UTC
Created attachment 136836 [details]
dmesg with 4.15.0-rc4 amd-staging-drm-next e6555e61902c with SysRq : Show State
Comment 18 Christian König 2018-01-19 08:47:14 UTC
Please stop attaching more and more dmesg with unrelated information to the bug report. The initial one is perfectly sufficient.
Comment 19 mikhail.v.gavrilov 2018-01-19 18:00:56 UTC
I am sorry for the misunderstanding.
Every time I see new commits in the branch, I hope that this issue has been fixed.
So every time I rebuild the kernel for testing.
And every time I reproduce this annoying bug afterwards.
I still hope that somebody is working on it and improving the logging to help find the root cause of this hang.
That is why I attach a new dmesg log every time.
Comment 20 mikhail.v.gavrilov 2018-02-04 15:34:35 UTC
Is anybody investigating this bug?
Watching a video is not necessary for the hang to occur.
The computer hangs right after starting the Steam client, or during a game if the computer has already been running for some time.
I'm already tired of pressing the reset button, because "init 6" is not able to restart the computer after such a hang.
As of today I have pressed the reset button more than 30 times.
But no one cares about it :(
Comment 21 mikhail.v.gavrilov 2018-02-28 03:43:16 UTC
Created attachment 137680 [details]
dmesg with 4.16.0-rc1 amd-staging-drm-next
Comment 22 mikhail.v.gavrilov 2018-02-28 03:44:06 UTC
Sadly, this is still present in 4.16-rc1.
Comment 23 mikhail.v.gavrilov 2018-03-01 04:19:29 UTC
Found you another crash case:

The @GraphicsFuzz demo found 1 issue (14/15 tests passed) on my desktop device, affecting my @AMD GPU driver Give it a try: www.graphicsfuzz.com/#demo #GraphicsFuzz

Computer always hangs on shader15
Comment 24 mikhail.v.gavrilov 2018-03-01 04:21:30 UTC
Created attachment 137710 [details]
photo of test when computer is hang
Comment 25 Michel Dänzer 2018-03-01 09:06:38 UTC
(In reply to mikhail.v.gavrilov from comment #23)
> Found you another crash case:

That's unlikely to be the exact same cause as that of the Steam hang this report is about, so it needs to be tracked separately.
Comment 26 mikhail.v.gavrilov 2018-03-01 18:42:23 UTC
(In reply to Michel Dänzer from comment #25)
> That's unlikely to be the exact same cause as that of the Steam hang this
> report is about, so it needs to be tracked separately.

Ok, https://bugs.freedesktop.org/show_bug.cgi?id=105317
Comment 27 Dennis Schridde 2018-03-02 20:21:07 UTC
I run into this issue regularly with an AMD Ryzen 5 2400G (primary display, connected via DP to the monitor) and an AMD Radeon RX 560 (not connected to a monitor, secondary display according to mainboard firmware configuration).

After using my computer for some time, the graphics suddenly freezes and I see lines like the following in dmesg (after logging in via SSH):

[Fri Mar  2 21:05:33 2018] amdgpu: [powerplay] pp_dpm_get_temperature was not implemented.
[Fri Mar  2 21:06:03 2018] INFO: task X:898 blocked for more than 120 seconds.
[Fri Mar  2 21:06:03 2018]       Tainted: G        W        4.15.7-gentoo-r1 #2
[Fri Mar  2 21:06:03 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar  2 21:06:03 2018] X               D    0   898    881 0x00000004
[Fri Mar  2 21:06:03 2018] Call Trace:
[Fri Mar  2 21:06:03 2018]  ? __schedule+0x2a7/0x8b0
[Fri Mar  2 21:06:03 2018]  schedule+0x28/0x80
[Fri Mar  2 21:06:03 2018]  schedule_preempt_disabled+0xa/0x10
[Fri Mar  2 21:06:03 2018]  __ww_mutex_lock.isra.3+0x224/0x690
[Fri Mar  2 21:06:03 2018]  ? drm_modeset_backoff+0x3e/0xb0 [drm]
[Fri Mar  2 21:06:03 2018]  drm_modeset_backoff+0x3e/0xb0 [drm]
[Fri Mar  2 21:06:03 2018]  drm_mode_gamma_set_ioctl+0xb4/0x200 [drm]
[Fri Mar  2 21:06:03 2018]  ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
[Fri Mar  2 21:06:03 2018]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[Fri Mar  2 21:06:03 2018]  drm_ioctl+0x2d5/0x370 [drm]
[Fri Mar  2 21:06:03 2018]  ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
[Fri Mar  2 21:06:03 2018]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[Fri Mar  2 21:06:03 2018]  do_vfs_ioctl+0xa4/0x670
[Fri Mar  2 21:06:03 2018]  ? __sys_recvmsg+0x64/0xa0
[Fri Mar  2 21:06:03 2018]  ? __sys_recvmsg+0x95/0xa0
[Fri Mar  2 21:06:03 2018]  SyS_ioctl+0x74/0x80
[Fri Mar  2 21:06:03 2018]  do_syscall_64+0x6e/0x120
[Fri Mar  2 21:06:03 2018]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Fri Mar  2 21:06:03 2018] RIP: 0033:0x7fd8924c0467
[Fri Mar  2 21:06:03 2018] RSP: 002b:00007ffcb17d7b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Fri Mar  2 21:06:03 2018] RAX: ffffffffffffffda RBX: 0000560ddb4480e0 RCX: 00007fd8924c0467
[Fri Mar  2 21:06:03 2018] RDX: 00007ffcb17d7b40 RSI: 00000000c02064a5 RDI: 0000000000000016
[Fri Mar  2 21:06:03 2018] RBP: 00007ffcb17d7b40 R08: 0000560ddb4487a0 R09: 0000560ddb4489a0
[Fri Mar  2 21:06:03 2018] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000c02064a5
[Fri Mar  2 21:06:03 2018] R13: 0000000000000016 R14: 0000560ddb448bb0 R15: 0000560ddb4485a0
[Fri Mar  2 21:06:03 2018] INFO: task kworker/u32:2:32344 blocked for more than 120 seconds.
[Fri Mar  2 21:06:03 2018]       Tainted: G        W        4.15.7-gentoo-r1 #2
[Fri Mar  2 21:06:03 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar  2 21:06:03 2018] kworker/u32:2   D    0 32344      2 0x80000000
[Fri Mar  2 21:06:03 2018] Workqueue: events_unbound commit_work [drm_kms_helper]
[Fri Mar  2 21:06:03 2018] Call Trace:
[Fri Mar  2 21:06:03 2018]  ? __schedule+0x2a7/0x8b0
[Fri Mar  2 21:06:03 2018]  schedule+0x28/0x80
[Fri Mar  2 21:06:03 2018]  schedule_timeout+0x1e7/0x370
[Fri Mar  2 21:06:03 2018]  ? generic_reg_get+0x21/0x30 [amdgpu]
[Fri Mar  2 21:06:03 2018]  dma_fence_default_wait+0x1f0/0x280
[Fri Mar  2 21:06:03 2018]  ? dma_fence_release+0x90/0x90
[Fri Mar  2 21:06:03 2018]  dma_fence_wait_timeout+0x39/0xf0
[Fri Mar  2 21:06:03 2018]  reservation_object_wait_timeout_rcu+0x17b/0x370
[Fri Mar  2 21:06:03 2018]  amdgpu_dm_do_flip+0x11f/0x360 [amdgpu]
[Fri Mar  2 21:06:03 2018]  amdgpu_dm_atomic_commit_tail+0x8a1/0x9a0 [amdgpu]
[Fri Mar  2 21:06:03 2018]  ? _cond_resched+0x15/0x40
[Fri Mar  2 21:06:03 2018]  ? wait_for_completion_timeout+0x35/0x180
[Fri Mar  2 21:06:03 2018]  commit_tail+0x3d/0x70 [drm_kms_helper]
[Fri Mar  2 21:06:03 2018]  process_one_work+0x1da/0x3d0
[Fri Mar  2 21:06:03 2018]  worker_thread+0x2b/0x3f0
[Fri Mar  2 21:06:03 2018]  ? process_one_work+0x3d0/0x3d0
[Fri Mar  2 21:06:03 2018]  kthread+0x113/0x130
[Fri Mar  2 21:06:03 2018]  ? kthread_create_worker_on_cpu+0x70/0x70
[Fri Mar  2 21:06:03 2018]  ? SyS_exit_group+0x10/0x10
[Fri Mar  2 21:06:03 2018]  ret_from_fork+0x22/0x40
[Fri Mar  2 21:06:33 2018] i2c /dev entries driver

Everything apart from the graphics appears to continue to run fine, except any application (e.g. started on the command line) that tries to talk to the X server: They will hang.  Most applications that hang can be killed with SIGKILL, except the X server and a few others, which will be zombies forever.
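
Editorial note: processes that cannot be killed even with SIGKILL are typically in uninterruptible sleep (state `D`), the same state the hung-task messages above show for `X`. When the machine is still reachable over SSH, such processes can be listed from `/proc`. The following is a minimal Python sketch, assuming a Linux `/proc` filesystem; the function names are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Split one /proc/<pid>/stat record into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so use the first '(' and the last ')' as delimiters.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as handle:
                pid, comm, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling `uninterruptible_tasks()` a few seconds apart and comparing the results separates tasks that are only briefly waiting on I/O from tasks that stay blocked.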
Comment 28 Dennis Schridde 2018-03-02 20:22:40 UTC
Linux kernel is at 4.15.7-gentoo-r1, LLVM at 5.0.1, Mesa at 18.0.0_rc4.
Comment 29 mikhail.v.gavrilov 2018-03-24 14:28:22 UTC
[   69.089101] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=897, last emitted seq=899
[   69.089176] [drm] No hardware hang detected. Did some blocks stall?
[   85.813890] sysrq: SysRq : Show Blocked State
[   85.813982]   task                        PC stack   pid father
[   85.814019] kworker/u16:4   D14104   146      2 0x80000000
[   85.814055] Workqueue: events_unbound commit_work [drm_kms_helper]
[   85.814058] Call Trace:
[   85.814064]  ? __schedule+0x2ed/0xba0
[   85.814070]  ? dma_fence_default_wait+0x14f/0x370
[   85.814073]  schedule+0x2f/0x90
[   85.814076]  schedule_timeout+0x23d/0x540
[   85.814079]  ? find_held_lock+0x34/0xa0
[   85.814084]  ? mark_held_locks+0x56/0x80
[   85.814087]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[   85.814091]  ? dma_fence_default_wait+0x14f/0x370
[   85.814094]  dma_fence_default_wait+0x23b/0x370
[   85.814097]  ? dma_fence_release+0x170/0x170
[   85.814101]  dma_fence_wait_timeout+0x4f/0x270
[   85.814105]  reservation_object_wait_timeout_rcu+0x193/0x4d0
[   85.814148]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[   85.814188]  amdgpu_dm_atomic_commit_tail+0xb66/0xdc0 [amdgpu]
[   85.814194]  ? wait_for_completion_timeout+0x76/0x1b0
[   85.814206]  commit_tail+0x3d/0x70 [drm_kms_helper]
[   85.814211]  process_one_work+0x266/0x6b0
[   85.814218]  worker_thread+0x3a/0x390
[   85.814222]  ? process_one_work+0x6b0/0x6b0
[   85.814225]  kthread+0x121/0x140
[   85.814228]  ? kthread_create_worker_on_cpu+0x70/0x70
[   85.814231]  ret_from_fork+0x3a/0x50
[   85.814391] tracker-store   D12184  2786   2167 0x00000000
[   85.814395] Call Trace:
[   85.814400]  ? __schedule+0x2ed/0xba0
[   85.814406]  schedule+0x2f/0x90
[   85.814409]  io_schedule+0x12/0x40
[   85.814413]  generic_file_read_iter+0x39e/0xdb0
[   85.814420]  ? page_cache_tree_insert+0x130/0x130
[   85.814474]  xfs_file_buffered_aio_read+0x65/0x1a0 [xfs]
[   85.814498]  xfs_file_read_iter+0x64/0xc0 [xfs]
[   85.814504]  __vfs_read+0x102/0x170
[   85.814511]  vfs_read+0x9e/0x150
[   85.814515]  SyS_pread64+0x93/0xb0
[   85.814518]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   85.814523]  do_syscall_64+0x79/0x220
[   85.814526]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[   85.814528] RIP: 0033:0x7ff185a6a873
[   85.814530] RSP: 002b:00007ffe0e646780 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[   85.814533] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007ff185a6a873
[   85.814535] RDX: 0000000000001000 RSI: 00005613cc6fbe48 RDI: 0000000000000008
[   85.814536] RBP: 0000000000001000 R08: 00005613cc6fbe48 R09: 000000000ff80fff
[   85.814538] R10: 000000001982a000 R11: 0000000000000293 R12: 0000000000000000
[   85.814539] R13: 00005613cc6fbe48 R14: 000000001982a000 R15: 00005613cc446580
[   85.814601] gldriverquery   D12856  4120   4072 0xa0020002
[   85.814606] Call Trace:
[   85.814611]  ? __schedule+0x2ed/0xba0
[   85.814617]  schedule+0x2f/0x90
[   85.814621]  drm_sched_entity_fini+0xbe/0x2b0 [gpu_sched]
[   85.814626]  ? finish_wait+0x80/0x80
[   85.814649]  amdgpu_ctx_fini+0xbf/0x100 [amdgpu]
[   85.814672]  amdgpu_ctx_mgr_fini+0x7c/0xc0 [amdgpu]
[   85.814692]  amdgpu_driver_postclose_kms+0x57/0x220 [amdgpu]
[   85.814708]  drm_release+0x2a0/0x3c0 [drm]
[   85.814714]  __fput+0xe9/0x200
[   85.814719]  task_work_run+0x87/0xb0
[   85.814723]  do_exit+0x345/0xd70
[   85.814727]  ? up_read+0x1c/0x40
[   85.814730]  ? __do_page_fault+0x2af/0x530
[   85.814735]  do_group_exit+0x47/0xc0
[   85.814738]  SyS_exit_group+0x10/0x10
[   85.814740]  do_fast_syscall_32+0xbf/0x376
[   85.814744]  entry_SYSENTER_compat+0x84/0x96
Comment 31 mikhail.v.gavrilov 2018-04-21 16:31:37 UTC
(In reply to Marek Olšák from comment #30)
> This might fix it:
> https://cgit.freedesktop.org/mesa/mesa/commit/?id=d15fb766aa3c98ffbe16d050b2af4804e4b12c57

Which Mesa version is this patch for?

My si_pipe.c (Mesa 18.0.1) looks different:
https://imgur.com/a/dc3RoHi
Comment 32 Marek Olšák 2018-04-27 21:57:42 UTC
The patch is already backported in the 18.0 branch:
https://cgit.freedesktop.org/mesa/mesa/log/?h=18.0
Comment 33 mikhail.v.gavrilov 2018-05-05 08:56:02 UTC
(In reply to Marek Olšák from comment #32)
> The patch is already backported in the 18.0 branch:
> https://cgit.freedesktop.org/mesa/mesa/log/?h=18.0

How can I be sure that the patch is already applied in my Mesa?
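
Editorial note: one way to answer this from a Mesa git checkout is to search the stable branch's log for the upstream commit. A backported commit has a different hash from the original, so the sketch below also looks for the "(cherry picked from commit ...)" trailer that `git cherry-pick -x` appends. That the 18.0 branch carries this trailer for the patch in question is an assumption, and the function name `is_backported` is hypothetical. The function works on plain `git log` output passed in as text.

```python
import re

# Trailer appended to the commit message by "git cherry-pick -x".
PICK_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")


def is_backported(git_log_text, upstream_sha):
    """True if the log contains the commit itself or a cherry-pick of it."""
    # Direct match: unindented "commit <sha>" header lines of plain `git log`.
    for line in git_log_text.splitlines():
        parts = line.split()
        if line.startswith("commit ") and len(parts) >= 2:
            if parts[1].startswith(upstream_sha):
                return True
    # Backport match: a trailer naming the upstream commit (full or short hash).
    return any(
        picked.startswith(upstream_sha) or upstream_sha.startswith(picked)
        for picked in PICK_RE.findall(git_log_text)
    )
```

The input could come from something like `git log origin/18.0 -- src/gallium/drivers/radeonsi/si_pipe.c` (the remote name `origin` is an assumption). For a distribution package, the same check only applies if the package was built from a tree that contains the commit.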
Comment 34 mikhail.v.gavrilov 2018-05-05 08:58:25 UTC
Created attachment 139363 [details]
my si_pipe.c file
Comment 35 mikhail.v.gavrilov 2018-05-05 09:14:41 UTC
It looks like my si_pipe.c is already patched.
But my GPU still hangs every time I try to get past one particular place in the game Rise of the Tomb Raider.
Comment 36 mikhail.v.gavrilov 2018-05-05 09:17:46 UTC
Created attachment 139364 [details]
Here GPU VEGA always hungs
Comment 37 Marek Olšák 2018-05-06 04:19:37 UTC
Rise of Tomb Raider is a Vulkan game. You can file a RADV bug for it. I'm closing this since you are not reporting any issues with Youtube.
Comment 38 mikhail.v.gavrilov 2018-05-06 17:50:41 UTC
(In reply to Marek Olšák from comment #37)
> Rise of Tomb Raider is a Vulkan game. You can file a RADV bug for it. I'm
> closing this since you are not reporting any issues with Youtube.

My bug report about Rise of the Tomb Raider was unfortunately closed without an explanation of which patches are needed to fix the hang:
https://bugs.freedesktop.org/show_bug.cgi?id=106196
And the symptoms are the same as in this bug report:
- The system stops responding.
- All the LEDs on the video card that indicate power consumption light up.
- The fan on the video card starts to make a lot of noise.

A long time ago the Intel driver used to hang too:
https://bugs.freedesktop.org/show_bug.cgi?id=54226
but the Intel developers added a GPU reset for such situations. Why not add a GPU reset for AMD GPUs as well?
Comment 39 Marek Olšák 2018-05-06 20:23:59 UTC
We are working on the GPU reset, we just don't have any ETA. The GPU reset is something you can't rely on to save you. In most cases, a successful GPU reset needs a complete restart of X or Wayland, so you'll lose the whole desktop and all running desktop applications. You are likely to get into an infinite GPU-hang+GPU-reset loop if the driver doesn't kill all apps. With the current Linux desktop architecture that isn't aware of GPU resets, a GPU reset is mostly unusable.

The implementation of the GPU reset is secondary to making sure that GPU hangs don't occur. Thus, bugs about GPU hangs are only about fixing GPU hangs. Rise Of The Tomb Raider is a Vulkan game, so any hangs within the game are RADV bugs. Filing a bug against DRM/AMDgpu for a GPU hang within that game is less effective than filing a bug against RADV.
Comment 40 mikhail.v.gavrilov 2018-05-07 18:04:34 UTC
OK, now video playback results in a hang even without the Steam client if VA-API is used:
https://bugs.freedesktop.org/show_bug.cgi?id=106430
