Bug 106430 - GPU hang when playing video with acceleration (vaapi)
Summary: GPU hang when playing video with acceleration (vaapi)
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-05-07 18:00 UTC by mikhail.v.gavrilov
Modified: 2019-11-19 08:37 UTC (History)

Attachments
dmesg (125.22 KB, text/plain)
2018-05-07 18:00 UTC, mikhail.v.gavrilov
dmesg (147.70 KB, text/plain)
2018-05-15 20:40 UTC, mikhail.v.gavrilov
dmesg (77.64 KB, text/plain)
2018-05-25 16:41 UTC, mikhail.v.gavrilov
dmesg (127.96 KB, text/plain)
2018-05-27 19:48 UTC, mikhail.v.gavrilov
fresh dmesg from kernel 4.18.0-0.rc0.git2.1 (122.31 KB, text/plain)
2018-06-08 04:01 UTC, mikhail.v.gavrilov

Description mikhail.v.gavrilov 2018-05-07 18:00:37 UTC
Created attachment 139407 [details]
dmesg

* Fedora 29 (Rawhide)
* Latest system updates:
 - kernel 4.17.0-0.rc3.git4.1
 - drm 3.25.0
 - mesa 18.1.0-rc2
 - llvm 6.0.0


Steps to reproduce:
1) # dnf install gstreamer1-vaapi
2) Play an H.264-encoded video in the Totem player

Symptoms:
1. The system stops responding.

Kernel output after the GPU hang:
[   89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=2638, last emitted seq=2640
[   89.056926] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=80, last emitted seq=82
[   89.056932] [drm] No hardware hang detected. Did some blocks stall?
[   89.056948] [drm] No hardware hang detected. Did some blocks stall?
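
The ring timeout lines above can be compared mechanically. The following is a minimal sketch, not part of the driver or any existing tool: the function name `parse_ring_timeouts` and the "pending" figure (emitted sequence number minus signaled sequence number) are illustrative assumptions about how one might read these messages. It only extracts the ring name and the two sequence numbers from text in the format shown in this report.

```python
import re

# Pattern for amdgpu "ring <name> timeout" messages in the format quoted above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return a (ring, signaled, emitted, pending) tuple for each timeout line.

    'pending' is emitted - signaled, i.e. how far the last emitted sequence
    number is ahead of the last signaled one (an illustrative metric).
    """
    results = []
    for line in log_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            signaled = int(match["signaled"])
            emitted = int(match["emitted"])
            results.append((match["ring"], signaled, emitted, emitted - signaled))
    return results


# Sample input copied from the kernel output in this report.
sample = """\
[   89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=2638, last emitted seq=2640
[   89.056926] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=80, last emitted seq=82
"""

for ring, signaled, emitted, pending in parse_ring_timeouts(sample):
    print(f"{ring}: signaled={signaled} emitted={emitted} pending={pending}")
```

In this sample both the gfx and uvd rings are two sequence numbers behind, which matches the observation that both rings time out at almost the same moment.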
Comment 1 mikhail.v.gavrilov 2018-05-13 14:50:13 UTC
If I do not restart the computer and leave it in the hung state, after a while the GPU fan starts spinning at full speed and all the LEDs on the video card go out.

It even frightened me. Rebooting with the reset button did not help: the fan kept running loudly and the LEDs on the video card did not light up.

Only after powering the computer off completely did the video card start working again.


Here is what was logged in dmesg at that time:
[247125.285043] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=20977028, last emitted seq=20977030
[247125.285083] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=30, last emitted seq=31
[247125.285085] [drm] No hardware hang detected. Did some blocks stall?
[247125.285087] [drm] No hardware hang detected. Did some blocks stall?
[247359.270184] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
[247359.270188]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247359.270190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247359.270191] amdgpu_cs:0     D12728 21382  21309 0x00000000
[247359.270196] Call Trace:
[247359.270203]  ? __schedule+0x2ba/0xaf0
[247359.270220]  ? dma_fence_default_wait+0x231/0x370
[247359.270222]  schedule+0x2f/0x90
[247359.270235]  schedule_timeout+0x35c/0x520
[247359.270238]  ? dma_fence_default_wait+0x72/0x370
[247359.270242]  ? dma_fence_default_wait+0x231/0x370
[247359.270245]  dma_fence_default_wait+0x25d/0x370
[247359.270247]  ? dma_fence_release+0x160/0x160
[247359.270251]  dma_fence_wait_timeout+0x4f/0x270
[247359.270300]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247359.270325]  amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247359.270356]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247359.270368]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[247359.270375]  drm_ioctl+0x1b3/0x370 [drm]
[247359.270397]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247359.270420]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247359.270424]  do_vfs_ioctl+0xa5/0x6d0
[247359.270428]  ksys_ioctl+0x60/0x90
[247359.270431]  __x64_sys_ioctl+0x16/0x20
[247359.270434]  do_syscall_64+0x60/0x1f0
[247359.270438]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247359.270545] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds.
[247359.270546]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247359.270548] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247359.270550] amdgpu_cs:0     D13400 12186  12133 0x00000000
[247359.270554] Call Trace:
[247359.270557]  ? __schedule+0x2ba/0xaf0
[247359.270561]  ? dma_fence_default_wait+0x231/0x370
[247359.270564]  schedule+0x2f/0x90
[247359.270566]  schedule_timeout+0x35c/0x520
[247359.270569]  ? dma_fence_default_wait+0x72/0x370
[247359.270573]  ? dma_fence_default_wait+0x231/0x370
[247359.270575]  dma_fence_default_wait+0x25d/0x370
[247359.270577]  ? dma_fence_release+0x160/0x160
[247359.270580]  dma_fence_wait_timeout+0x4f/0x270
[247359.270604]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247359.270626]  amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247359.270656]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247359.270665]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[247359.270672]  drm_ioctl+0x1b3/0x370 [drm]
[247359.270692]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247359.270713]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247359.270717]  do_vfs_ioctl+0xa5/0x6d0
[247359.270721]  ksys_ioctl+0x60/0x90
[247359.270724]  __x64_sys_ioctl+0x16/0x20
[247359.270727]  do_syscall_64+0x60/0x1f0
[247359.270730]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247359.270886] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds.
[247359.270887]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247359.270889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247359.270890] kworker/u16:1   D10936 16581      2 0x80000000
[247359.270905] Workqueue: events_unbound commit_work [drm_kms_helper]
[247359.270907] Call Trace:
[247359.270910]  ? __schedule+0x2ba/0xaf0
[247359.270914]  ? dma_fence_default_wait+0x231/0x370
[247359.270916]  schedule+0x2f/0x90
[247359.270919]  schedule_timeout+0x35c/0x520
[247359.270922]  ? dma_fence_default_wait+0x72/0x370
[247359.270925]  ? dma_fence_default_wait+0x231/0x370
[247359.270927]  dma_fence_default_wait+0x25d/0x370
[247359.270929]  ? dma_fence_release+0x160/0x160
[247359.270932]  dma_fence_wait_timeout+0x4f/0x270
[247359.270935]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[247359.270967]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[247359.271003]  amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu]
[247359.271008]  ? wait_for_completion_timeout+0x73/0x1a0
[247359.271019]  commit_tail+0x3d/0x70 [drm_kms_helper]
[247359.271025]  process_one_work+0x261/0x630
[247359.271030]  worker_thread+0x3a/0x390
[247359.271033]  ? process_one_work+0x630/0x630
[247359.271036]  kthread+0x120/0x140
[247359.271039]  ? kthread_create_worker_on_cpu+0x70/0x70
[247359.271041]  ret_from_fork+0x3a/0x50
[247359.271056] INFO: lockdep is turned off.
[247482.151777] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
[247482.151781]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247482.151782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247482.151784] amdgpu_cs:0     D12728 21382  21309 0x00000000
[247482.151788] Call Trace:
[247482.151796]  ? __schedule+0x2ba/0xaf0
[247482.151799]  ? dma_fence_default_wait+0x231/0x370
[247482.151802]  schedule+0x2f/0x90
[247482.151804]  schedule_timeout+0x35c/0x520
[247482.151807]  ? dma_fence_default_wait+0x72/0x370
[247482.151810]  ? dma_fence_default_wait+0x231/0x370
[247482.151812]  dma_fence_default_wait+0x25d/0x370
[247482.151814]  ? dma_fence_release+0x160/0x160
[247482.151817]  dma_fence_wait_timeout+0x4f/0x270
[247482.151863]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247482.151884]  amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247482.151912]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247482.151924]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[247482.151932]  drm_ioctl+0x1b3/0x370 [drm]
[247482.151952]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247482.151973]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247482.151977]  do_vfs_ioctl+0xa5/0x6d0
[247482.151982]  ksys_ioctl+0x60/0x90
[247482.151986]  __x64_sys_ioctl+0x16/0x20
[247482.151989]  do_syscall_64+0x60/0x1f0
[247482.151993]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247482.152121] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds.
[247482.152123]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247482.152124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247482.152126] amdgpu_cs:0     D13400 12186  12133 0x00000000
[247482.152130] Call Trace:
[247482.152143]  ? __schedule+0x2ba/0xaf0
[247482.152146]  ? dma_fence_default_wait+0x231/0x370
[247482.152148]  schedule+0x2f/0x90
[247482.152150]  schedule_timeout+0x35c/0x520
[247482.152153]  ? dma_fence_default_wait+0x72/0x370
[247482.152156]  ? dma_fence_default_wait+0x231/0x370
[247482.152169]  dma_fence_default_wait+0x25d/0x370
[247482.152171]  ? dma_fence_release+0x160/0x160
[247482.152174]  dma_fence_wait_timeout+0x4f/0x270
[247482.152203]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247482.152233]  amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247482.152281]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247482.152299]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[247482.152316]  drm_ioctl+0x1b3/0x370 [drm]
[247482.152335]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247482.152375]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247482.152379]  do_vfs_ioctl+0xa5/0x6d0
[247482.152382]  ksys_ioctl+0x60/0x90
[247482.152385]  __x64_sys_ioctl+0x16/0x20
[247482.152387]  do_syscall_64+0x60/0x1f0
[247482.152390]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247482.152554] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds.
[247482.152556]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247482.152558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247482.152560] kworker/u16:1   D10936 16581      2 0x80000000
[247482.152571] Workqueue: events_unbound commit_work [drm_kms_helper]
[247482.152574] Call Trace:
[247482.152579]  ? __schedule+0x2ba/0xaf0
[247482.152584]  ? dma_fence_default_wait+0x231/0x370
[247482.152587]  schedule+0x2f/0x90
[247482.152590]  schedule_timeout+0x35c/0x520
[247482.152594]  ? dma_fence_default_wait+0x72/0x370
[247482.152599]  ? dma_fence_default_wait+0x231/0x370
[247482.152603]  dma_fence_default_wait+0x25d/0x370
[247482.152606]  ? dma_fence_release+0x160/0x160
[247482.152610]  dma_fence_wait_timeout+0x4f/0x270
[247482.152615]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[247482.152651]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[247482.152691]  amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu]
[247482.152713]  ? wait_for_completion_timeout+0x73/0x1a0
[247482.152721]  commit_tail+0x3d/0x70 [drm_kms_helper]
[247482.152725]  process_one_work+0x261/0x630
[247482.152732]  worker_thread+0x3a/0x390
[247482.152735]  ? process_one_work+0x630/0x630
[247482.152737]  kthread+0x120/0x140
[247482.152740]  ? kthread_create_worker_on_cpu+0x70/0x70
[247482.152742]  ret_from_fork+0x3a/0x50
[247482.152751] INFO: lockdep is turned off.
[247605.031356] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
[247605.031360]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247605.031362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247605.031364] amdgpu_cs:0     D12728 21382  21309 0x00000000
[247605.031369] Call Trace:
[247605.031376]  ? __schedule+0x2ba/0xaf0
[247605.031381]  ? dma_fence_default_wait+0x231/0x370
[247605.031383]  schedule+0x2f/0x90
[247605.031386]  schedule_timeout+0x35c/0x520
[247605.031389]  ? dma_fence_default_wait+0x72/0x370
[247605.031393]  ? dma_fence_default_wait+0x231/0x370
[247605.031396]  dma_fence_default_wait+0x25d/0x370
[247605.031398]  ? dma_fence_release+0x160/0x160
[247605.031401]  dma_fence_wait_timeout+0x4f/0x270
[247605.031439]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247605.031467]  amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247605.031512]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247605.031525]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[247605.031543]  drm_ioctl+0x1b3/0x370 [drm]
[247605.031566]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247605.031590]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247605.031596]  do_vfs_ioctl+0xa5/0x6d0
[247605.031600]  ksys_ioctl+0x60/0x90
[247605.031603]  __x64_sys_ioctl+0x16/0x20
[247605.031606]  do_syscall_64+0x60/0x1f0
[247605.031611]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247605.031715] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds.
[247605.031717]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247605.031718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247605.031720] amdgpu_cs:0     D13400 12186  12133 0x00000000
[247605.031725] Call Trace:
[247605.031729]  ? __schedule+0x2ba/0xaf0
[247605.031733]  ? dma_fence_default_wait+0x231/0x370
[247605.031735]  schedule+0x2f/0x90
[247605.031738]  schedule_timeout+0x35c/0x520
[247605.031741]  ? dma_fence_default_wait+0x72/0x370
[247605.031744]  ? dma_fence_default_wait+0x231/0x370
[247605.031746]  dma_fence_default_wait+0x25d/0x370
[247605.031749]  ? dma_fence_release+0x160/0x160
[247605.031752]  dma_fence_wait_timeout+0x4f/0x270
[247605.031775]  amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247605.031798]  amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247605.031828]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247605.031838]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[247605.031846]  drm_ioctl+0x1b3/0x370 [drm]
[247605.031866]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247605.031887]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247605.031892]  do_vfs_ioctl+0xa5/0x6d0
[247605.031896]  ksys_ioctl+0x60/0x90
[247605.031899]  __x64_sys_ioctl+0x16/0x20
[247605.031902]  do_syscall_64+0x60/0x1f0
[247605.031906]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247605.032047] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds.
[247605.032049]       Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247605.032050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247605.032052] kworker/u16:1   D10936 16581      2 0x80000000
[247605.032063] Workqueue: events_unbound commit_work [drm_kms_helper]
[247605.032065] Call Trace:
[247605.032069]  ? __schedule+0x2ba/0xaf0
[247605.032073]  ? dma_fence_default_wait+0x231/0x370
[247605.032075]  schedule+0x2f/0x90
[247605.032078]  schedule_timeout+0x35c/0x520
[247605.032081]  ? dma_fence_default_wait+0x72/0x370
[247605.032085]  ? dma_fence_default_wait+0x231/0x370
[247605.032087]  dma_fence_default_wait+0x25d/0x370
[247605.032089]  ? dma_fence_release+0x160/0x160
[247605.032092]  dma_fence_wait_timeout+0x4f/0x270
[247605.032095]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[247605.032127]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[247605.032162]  amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu]
[247605.032166]  ? wait_for_completion_timeout+0x73/0x1a0
[247605.032175]  commit_tail+0x3d/0x70 [drm_kms_helper]
[247605.032180]  process_one_work+0x261/0x630
[247605.032185]  worker_thread+0x3a/0x390
[247605.032188]  ? process_one_work+0x630/0x630
[247605.032191]  kthread+0x120/0x140
[247605.032194]  ? kthread_create_worker_on_cpu+0x70/0x70
[247605.032197]  ret_from_fork+0x3a/0x50
[247605.032208] INFO: lockdep is turned off.
[247640.263559] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247640.663689] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247641.416206] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247641.512251] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247641.773087] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.121791] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.220684] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.481411] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.612305] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.900084] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.935635] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247642.999194] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247643.552447] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247643.668968] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247643.690139] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.099977] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.232435] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.292521] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.358833] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.376341] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.390073] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.514553] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.529169] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.581504] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.688219] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.787111] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.812531] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.873729] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.928613] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.939548] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247644.961052] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.056869] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.198003] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.280336] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.360668] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.434358] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.441931] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.565895] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.639253] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.711531] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.729971] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.744137] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247645.952694] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.140934] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.259925] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.319308] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.363976] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.389526] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.457577] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.513275] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.544150] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.637789] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.651337] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.710404] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.785978] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.928178] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247646.955859] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.016425] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.134880] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.159276] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.249781] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.315185] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.325523] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.361488] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.383235] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.439095] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.460806] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.485170] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.502436] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.548979] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.594343] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.621786] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.649303] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.670292] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.701090] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.735796] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.774236] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.816521] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.840603] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.869076] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.948394] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247647.977194] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.008216] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.041878] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.102950] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.123688] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.161477] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.210530] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.248898] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.273809] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.308455] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.357214] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.393870] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.418454] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.429277] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.508805] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.529862] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.581775] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.595466] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.679402] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.714558] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.767368] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.784370] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.805855] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.872980] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.933891] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.944161] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247648.979727] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.036203] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.094332] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.138191] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.175616] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.279457] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.313344] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.483680] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.519062] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.554865] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.601461] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.655004] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.760903] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.784816] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.870742] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247649.923269] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.003330] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.129582] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.206246] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.330698] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.481865] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.513212] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.564055] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.773681] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.780123] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.821904] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.841934] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.877117] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.901374] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247650.985498] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.026897] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.068131] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.109751] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.126539] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.355831] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.791237] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.829065] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247651.928932] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.077168] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.083449] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.211548] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.288786] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.302159] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.496320] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.614161] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.655070] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.745940] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247652.808084] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.117247] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.141879] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.166410] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.193642] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.338192] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.560506] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247653.898569] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247654.135093] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247654.283233] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247654.445210] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247654.465085] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247654.865339] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247654.987101] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247655.933191] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247655.993198] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247656.465146] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!
[247669.543630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=6004078, last emitted seq=6004080
[247669.543635] [drm] No hardware hang detected. Did some blocks stall?
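
Because the hung-task reports above repeat at roughly two-minute intervals, it can help to condense them. This is a small sketch under stated assumptions: the helper name `count_hung_tasks` is hypothetical, and the regular expression is written only against the "INFO: task ... blocked for more than N seconds" format visible in this log.

```python
import re
from collections import Counter

# The task name may itself contain colons (e.g. "amdgpu_cs:0"), so the
# greedy group takes everything up to the last ":<pid>" before " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(log_text):
    """Count how often each (task name, pid) pair is reported as blocked."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            counts[(match["name"], int(match["pid"]))] += 1
    return counts


# A few header lines copied from the log above.
sample = """\
[247359.270184] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
[247359.270545] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds.
[247359.270886] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds.
[247482.151777] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
"""

for (name, pid), n in count_hung_tasks(sample).most_common():
    print(f"{name} (pid {pid}): reported {n} time(s)")
```

Run against the full log in this comment, the same three tasks (two amdgpu_cs threads and one kworker running commit_work) would each be counted three times, all waiting in dma_fence_default_wait.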
Comment 2 mikhail.v.gavrilov 2018-05-15 20:40:19 UTC
Created attachment 139579 [details]
dmesg
Comment 3 mikhail.v.gavrilov 2018-05-24 21:41:02 UTC
A very strange coincidence: every time I reproduce this bug (a GPU hang while playing video with VAAPI acceleration), the following messages appear in the kernel log after the reboot:

[    0.059000] mce: [Hardware Error]: Machine check events logged
[    0.059000] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400
[    0.059000] mce: [Hardware Error]: TSC 0 ADDR ffffffffc07f31d5 MISC ffffffffc07f31d5 
[    0.059000] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 0 microcode 24
[    0.059000] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.
[    0.059000] ... version:                3
[    0.059000] ... bit width:              48
[    0.059000] ... generic registers:      4
[    0.059000] ... value mask:             0000ffffffffffff
[    0.059000] ... max period:             00007fffffffffff
[    0.059000] ... fixed-purpose events:   3
[    0.059000] ... event mask:             000000070000000f
[    0.059000] Hierarchical SRCU implementation.
[    0.059692] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.059740] smp: Bringing up secondary CPUs ...
[    0.060031] x86: Booting SMP configuration:
[    0.060035] .... node  #0, CPUs:      #1
[    0.061563] mce: [Hardware Error]: Machine check events logged
[    0.061567] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
[    0.061583] mce: [Hardware Error]: TSC 0 ADDR ffffffff957932bb MISC ffffffff957932bb 
[    0.061602] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 2 microcode 24
[    0.061684]  #2
[    0.063341] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400
[    0.063341] mce: [Hardware Error]: TSC 0 ADDR ffffffffc02bd4e1 MISC ffffffffc02bd4e1 
[    0.063341] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 4 microcode 24
[    0.063471]  #3
[    0.065119] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 3: be00000000800400
[    0.065125] mce: [Hardware Error]: TSC 0 ADDR ffffffffc07f31d5 MISC ffffffffc07f31d5 
[    0.065144] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 6 microcode 24
[    0.065255]  #4 #5 #6 #7
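The 64-bit values after "Bank N:" are the raw machine check status registers, so they can be split into flag bits and error codes. The sketch below assumes the IA32_MCi_STATUS layout from the Intel SDM (flags in bits 63..57, model-specific error code in bits 31..16, MCA error code in bits 15..0). The helper name `decode_mci_status` is hypothetical, and the decoded meaning should be checked against the manual for this CPU model.

```python
# Flag bits in the upper part of IA32_MCi_STATUS.
# Layout assumed from the Intel SDM; verify against the manual for your CPU.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status_hex):
    """Split a raw status value (hex string) into flags and error codes."""
    status = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


for raw in ("fe00000000800400", "be00000000800400"):
    decoded = decode_mci_status(raw)
    print(
        raw,
        ",".join(decoded["flags"]),
        f"mscod=0x{decoded['model_specific_code']:04x}",
        f"mcacod=0x{decoded['mca_error_code']:04x}",
    )
```

Both values from the log decode to an uncorrected error with MCA error code 0x0400 and model-specific code 0x0080. They differ only in the OVER bit, which is set in the Bank 4 value. In the SDM's table of simple error codes, 0x0400 is, as far as I can tell, listed as an internal timer error.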
Comment 4 mikhail.v.gavrilov 2018-05-25 16:41:14 UTC
Created attachment 139764 [details]
dmesg
Comment 5 mikhail.v.gavrilov 2018-05-25 16:47:12 UTC
After updating the kernel to 4.17.0-0.rc6.git1.1, the strange mce error messages after reboot disappeared, but the GPU still hangs.
Comment 6 mikhail.v.gavrilov 2018-05-27 19:48:03 UTC
The strange mce messages returned with kernel 4.17.0-0.rc6.git3.1.fc29.x86_64:

$ dmesg | grep mce
[    0.027300] mce: CPU supports 9 MCE banks
[    0.058829] mce: [Hardware Error]: Machine check events logged
[    0.058834] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: b200000000800400
[    0.058856] mce: [Hardware Error]: TSC 0 
[    0.058867] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 0 microcode 24
[    0.058883] mce: [Hardware Error]: Machine check events logged
[    0.058885] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400
[    0.058898] mce: [Hardware Error]: TSC 0 ADDR ffffffffc055a2f6 MISC ffffffffc055a2f6 
[    0.058916] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 0 microcode 24
[    0.061682] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
[    0.061700] mce: [Hardware Error]: TSC 0 ADDR ffffffffc0bd9215 MISC ffffffffc0bd9215 
[    0.061719] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 2 microcode 24
[    0.063495] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400
[    0.063495] mce: [Hardware Error]: TSC 0 ADDR ffffffffc04404e1 MISC ffffffffc04404e1 
[    0.063495] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 4 microcode 24
[    0.065271] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 3: be00000000800400
[    0.065271] mce: [Hardware Error]: TSC 0 ADDR ffffffffc055a2f6 MISC ffffffffc055a2f6 
[    0.065271] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 6 microcode 24
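The TIME field in the PROCESSOR lines looks like a wall-clock timestamp in seconds since the Unix epoch. Assuming that is what it is, converting it shows when each batch of machine check records was logged, which can then be compared with the time of the preceding hang. The helper name `mce_time_to_utc` is made up for this sketch.

```python
from datetime import datetime, timezone


def mce_time_to_utc(seconds):
    """Convert an mce TIME value to a UTC datetime.

    Assumes TIME is seconds since the Unix epoch.
    """
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# TIME values copied from the mce lines in comments 3 and 6.
for raw in (1527195880, 1527435635):
    print(raw, mce_time_to_utc(raw).isoformat())
# 1527195880 2018-05-24T21:04:40+00:00
# 1527435635 2018-05-27T15:40:35+00:00
```

Both timestamps fall a few hours or less before the comments that quote them, which fits the records being logged at the reboot that followed each hang.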
Comment 7 mikhail.v.gavrilov 2018-05-27 19:48:28 UTC
Created attachment 139798 [details]
dmesg
Comment 8 mikhail.v.gavrilov 2018-06-08 04:01:58 UTC
Created attachment 140078 [details]
fresh dmesg from kernel 4.18.0-0.rc0.git2.1
Comment 9 Benjamin Xiao 2018-06-27 16:41:02 UTC
I am seeing the same thing with VLC when switching hardware-accelerated decoding from Automatic to VA-API.

Fedora 28
RX Vega 64
Kernel 4.17.2
mesa 18.0.5
llvm 6
Comment 10 mikhail.v.gavrilov 2018-06-27 16:48:26 UTC
Benjamin, I see that in Fedora 29 (Rawhide) the problem is gone with kernel 4.18.0-0.rc0.git9.1.fc29.

But kernel 4.18.0-0.rc0.git10.1.fc29.x86_64 brought yet another problem: video output (at least over DP) stopped working.
Comment 11 Martin Peres 2019-11-19 08:37:43 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/374.

