Summary: | GPU hang when played video with acceleration (vaapi) | |
---|---|---|---
Product: | DRI | Reporter: | mikhail.v.gavrilov
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel>
Status: | RESOLVED MOVED | QA Contact: |
Severity: | normal | |
Priority: | medium | CC: | ben.r.xiao, devurandom, julien.isorce
Version: | XOrg git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
If I do not restart the computer and leave it in the hung state, then after a while the fan starts spinning at full speed and all the LEDs on the video card go out. It even frightened me. Rebooting with the reset button did not help: the fan kept making noise and the LEDs on the video card did not light up. Only after powering the computer off completely did the video card work again. Here is what was logged in dmesg at that time:
[247125.285043] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=20977028, last emitted seq=20977030 [247125.285083] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=30, last emitted seq=31 [247125.285085] [drm] No hardware hang detected. Did some blocks stall? [247125.285087] [drm] No hardware hang detected. Did some blocks stall? [247359.270184] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. [247359.270188] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247359.270190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247359.270191] amdgpu_cs:0 D12728 21382 21309 0x00000000 [247359.270196] Call Trace: [247359.270203] ? __schedule+0x2ba/0xaf0 [247359.270220] ? dma_fence_default_wait+0x231/0x370 [247359.270222] schedule+0x2f/0x90 [247359.270235] schedule_timeout+0x35c/0x520 [247359.270238] ? dma_fence_default_wait+0x72/0x370 [247359.270242] ? dma_fence_default_wait+0x231/0x370 [247359.270245] dma_fence_default_wait+0x25d/0x370 [247359.270247] ? dma_fence_release+0x160/0x160 [247359.270251] dma_fence_wait_timeout+0x4f/0x270 [247359.270300] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247359.270325] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247359.270356] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247359.270368] drm_ioctl_kernel+0x5b/0xb0 [drm] [247359.270375] drm_ioctl+0x1b3/0x370 [drm] [247359.270397] ?
amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247359.270420] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247359.270424] do_vfs_ioctl+0xa5/0x6d0 [247359.270428] ksys_ioctl+0x60/0x90 [247359.270431] __x64_sys_ioctl+0x16/0x20 [247359.270434] do_syscall_64+0x60/0x1f0 [247359.270438] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247359.270545] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds. [247359.270546] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247359.270548] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247359.270550] amdgpu_cs:0 D13400 12186 12133 0x00000000 [247359.270554] Call Trace: [247359.270557] ? __schedule+0x2ba/0xaf0 [247359.270561] ? dma_fence_default_wait+0x231/0x370 [247359.270564] schedule+0x2f/0x90 [247359.270566] schedule_timeout+0x35c/0x520 [247359.270569] ? dma_fence_default_wait+0x72/0x370 [247359.270573] ? dma_fence_default_wait+0x231/0x370 [247359.270575] dma_fence_default_wait+0x25d/0x370 [247359.270577] ? dma_fence_release+0x160/0x160 [247359.270580] dma_fence_wait_timeout+0x4f/0x270 [247359.270604] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247359.270626] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247359.270656] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247359.270665] drm_ioctl_kernel+0x5b/0xb0 [drm] [247359.270672] drm_ioctl+0x1b3/0x370 [drm] [247359.270692] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247359.270713] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247359.270717] do_vfs_ioctl+0xa5/0x6d0 [247359.270721] ksys_ioctl+0x60/0x90 [247359.270724] __x64_sys_ioctl+0x16/0x20 [247359.270727] do_syscall_64+0x60/0x1f0 [247359.270730] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247359.270886] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. [247359.270887] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247359.270889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[247359.270890] kworker/u16:1 D10936 16581 2 0x80000000 [247359.270905] Workqueue: events_unbound commit_work [drm_kms_helper] [247359.270907] Call Trace: [247359.270910] ? __schedule+0x2ba/0xaf0 [247359.270914] ? dma_fence_default_wait+0x231/0x370 [247359.270916] schedule+0x2f/0x90 [247359.270919] schedule_timeout+0x35c/0x520 [247359.270922] ? dma_fence_default_wait+0x72/0x370 [247359.270925] ? dma_fence_default_wait+0x231/0x370 [247359.270927] dma_fence_default_wait+0x25d/0x370 [247359.270929] ? dma_fence_release+0x160/0x160 [247359.270932] dma_fence_wait_timeout+0x4f/0x270 [247359.270935] reservation_object_wait_timeout_rcu+0x236/0x4e0 [247359.270967] amdgpu_dm_do_flip+0x112/0x350 [amdgpu] [247359.271003] amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu] [247359.271008] ? wait_for_completion_timeout+0x73/0x1a0 [247359.271019] commit_tail+0x3d/0x70 [drm_kms_helper] [247359.271025] process_one_work+0x261/0x630 [247359.271030] worker_thread+0x3a/0x390 [247359.271033] ? process_one_work+0x630/0x630 [247359.271036] kthread+0x120/0x140 [247359.271039] ? kthread_create_worker_on_cpu+0x70/0x70 [247359.271041] ret_from_fork+0x3a/0x50 [247359.271056] INFO: lockdep is turned off. [247482.151777] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. [247482.151781] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247482.151782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247482.151784] amdgpu_cs:0 D12728 21382 21309 0x00000000 [247482.151788] Call Trace: [247482.151796] ? __schedule+0x2ba/0xaf0 [247482.151799] ? dma_fence_default_wait+0x231/0x370 [247482.151802] schedule+0x2f/0x90 [247482.151804] schedule_timeout+0x35c/0x520 [247482.151807] ? dma_fence_default_wait+0x72/0x370 [247482.151810] ? dma_fence_default_wait+0x231/0x370 [247482.151812] dma_fence_default_wait+0x25d/0x370 [247482.151814] ? 
dma_fence_release+0x160/0x160 [247482.151817] dma_fence_wait_timeout+0x4f/0x270 [247482.151863] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247482.151884] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247482.151912] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.151924] drm_ioctl_kernel+0x5b/0xb0 [drm] [247482.151932] drm_ioctl+0x1b3/0x370 [drm] [247482.151952] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.151973] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247482.151977] do_vfs_ioctl+0xa5/0x6d0 [247482.151982] ksys_ioctl+0x60/0x90 [247482.151986] __x64_sys_ioctl+0x16/0x20 [247482.151989] do_syscall_64+0x60/0x1f0 [247482.151993] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247482.152121] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds. [247482.152123] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247482.152124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247482.152126] amdgpu_cs:0 D13400 12186 12133 0x00000000 [247482.152130] Call Trace: [247482.152143] ? __schedule+0x2ba/0xaf0 [247482.152146] ? dma_fence_default_wait+0x231/0x370 [247482.152148] schedule+0x2f/0x90 [247482.152150] schedule_timeout+0x35c/0x520 [247482.152153] ? dma_fence_default_wait+0x72/0x370 [247482.152156] ? dma_fence_default_wait+0x231/0x370 [247482.152169] dma_fence_default_wait+0x25d/0x370 [247482.152171] ? dma_fence_release+0x160/0x160 [247482.152174] dma_fence_wait_timeout+0x4f/0x270 [247482.152203] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247482.152233] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247482.152281] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.152299] drm_ioctl_kernel+0x5b/0xb0 [drm] [247482.152316] drm_ioctl+0x1b3/0x370 [drm] [247482.152335] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.152375] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247482.152379] do_vfs_ioctl+0xa5/0x6d0 [247482.152382] ksys_ioctl+0x60/0x90 [247482.152385] __x64_sys_ioctl+0x16/0x20 [247482.152387] do_syscall_64+0x60/0x1f0 [247482.152390] ? 
entry_SYSCALL_64_after_hwframe+0x49/0xbe [247482.152554] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. [247482.152556] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247482.152558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247482.152560] kworker/u16:1 D10936 16581 2 0x80000000 [247482.152571] Workqueue: events_unbound commit_work [drm_kms_helper] [247482.152574] Call Trace: [247482.152579] ? __schedule+0x2ba/0xaf0 [247482.152584] ? dma_fence_default_wait+0x231/0x370 [247482.152587] schedule+0x2f/0x90 [247482.152590] schedule_timeout+0x35c/0x520 [247482.152594] ? dma_fence_default_wait+0x72/0x370 [247482.152599] ? dma_fence_default_wait+0x231/0x370 [247482.152603] dma_fence_default_wait+0x25d/0x370 [247482.152606] ? dma_fence_release+0x160/0x160 [247482.152610] dma_fence_wait_timeout+0x4f/0x270 [247482.152615] reservation_object_wait_timeout_rcu+0x236/0x4e0 [247482.152651] amdgpu_dm_do_flip+0x112/0x350 [amdgpu] [247482.152691] amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu] [247482.152713] ? wait_for_completion_timeout+0x73/0x1a0 [247482.152721] commit_tail+0x3d/0x70 [drm_kms_helper] [247482.152725] process_one_work+0x261/0x630 [247482.152732] worker_thread+0x3a/0x390 [247482.152735] ? process_one_work+0x630/0x630 [247482.152737] kthread+0x120/0x140 [247482.152740] ? kthread_create_worker_on_cpu+0x70/0x70 [247482.152742] ret_from_fork+0x3a/0x50 [247482.152751] INFO: lockdep is turned off. [247605.031356] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. [247605.031360] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247605.031362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247605.031364] amdgpu_cs:0 D12728 21382 21309 0x00000000 [247605.031369] Call Trace: [247605.031376] ? __schedule+0x2ba/0xaf0 [247605.031381] ? dma_fence_default_wait+0x231/0x370 [247605.031383] schedule+0x2f/0x90 [247605.031386] schedule_timeout+0x35c/0x520 [247605.031389] ? 
dma_fence_default_wait+0x72/0x370 [247605.031393] ? dma_fence_default_wait+0x231/0x370 [247605.031396] dma_fence_default_wait+0x25d/0x370 [247605.031398] ? dma_fence_release+0x160/0x160 [247605.031401] dma_fence_wait_timeout+0x4f/0x270 [247605.031439] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247605.031467] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247605.031512] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031525] drm_ioctl_kernel+0x5b/0xb0 [drm] [247605.031543] drm_ioctl+0x1b3/0x370 [drm] [247605.031566] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031590] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247605.031596] do_vfs_ioctl+0xa5/0x6d0 [247605.031600] ksys_ioctl+0x60/0x90 [247605.031603] __x64_sys_ioctl+0x16/0x20 [247605.031606] do_syscall_64+0x60/0x1f0 [247605.031611] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247605.031715] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds. [247605.031717] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247605.031718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247605.031720] amdgpu_cs:0 D13400 12186 12133 0x00000000 [247605.031725] Call Trace: [247605.031729] ? __schedule+0x2ba/0xaf0 [247605.031733] ? dma_fence_default_wait+0x231/0x370 [247605.031735] schedule+0x2f/0x90 [247605.031738] schedule_timeout+0x35c/0x520 [247605.031741] ? dma_fence_default_wait+0x72/0x370 [247605.031744] ? dma_fence_default_wait+0x231/0x370 [247605.031746] dma_fence_default_wait+0x25d/0x370 [247605.031749] ? dma_fence_release+0x160/0x160 [247605.031752] dma_fence_wait_timeout+0x4f/0x270 [247605.031775] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247605.031798] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247605.031828] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031838] drm_ioctl_kernel+0x5b/0xb0 [drm] [247605.031846] drm_ioctl+0x1b3/0x370 [drm] [247605.031866] ? 
amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031887] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247605.031892] do_vfs_ioctl+0xa5/0x6d0 [247605.031896] ksys_ioctl+0x60/0x90 [247605.031899] __x64_sys_ioctl+0x16/0x20 [247605.031902] do_syscall_64+0x60/0x1f0 [247605.031906] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247605.032047] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. [247605.032049] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247605.032050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247605.032052] kworker/u16:1 D10936 16581 2 0x80000000 [247605.032063] Workqueue: events_unbound commit_work [drm_kms_helper] [247605.032065] Call Trace: [247605.032069] ? __schedule+0x2ba/0xaf0 [247605.032073] ? dma_fence_default_wait+0x231/0x370 [247605.032075] schedule+0x2f/0x90 [247605.032078] schedule_timeout+0x35c/0x520 [247605.032081] ? dma_fence_default_wait+0x72/0x370 [247605.032085] ? dma_fence_default_wait+0x231/0x370 [247605.032087] dma_fence_default_wait+0x25d/0x370 [247605.032089] ? dma_fence_release+0x160/0x160 [247605.032092] dma_fence_wait_timeout+0x4f/0x270 [247605.032095] reservation_object_wait_timeout_rcu+0x236/0x4e0 [247605.032127] amdgpu_dm_do_flip+0x112/0x350 [amdgpu] [247605.032162] amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu] [247605.032166] ? wait_for_completion_timeout+0x73/0x1a0 [247605.032175] commit_tail+0x3d/0x70 [drm_kms_helper] [247605.032180] process_one_work+0x261/0x630 [247605.032185] worker_thread+0x3a/0x390 [247605.032188] ? process_one_work+0x630/0x630 [247605.032191] kthread+0x120/0x140 [247605.032194] ? kthread_create_worker_on_cpu+0x70/0x70 [247605.032197] ret_from_fork+0x3a/0x50 [247605.032208] INFO: lockdep is turned off. [247640.263559] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247640.663689] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247641.416206] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247641.512251] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247641.773087] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.121791] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.220684] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.481411] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.612305] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.900084] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.935635] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.999194] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247643.552447] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247643.668968] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247643.690139] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.099977] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.232435] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.292521] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.358833] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.376341] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.390073] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.514553] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.529169] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.581504] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247644.688219] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.787111] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.812531] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.873729] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.928613] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.939548] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.961052] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.056869] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.198003] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.280336] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.360668] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.434358] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.441931] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.565895] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.639253] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.711531] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.729971] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.744137] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.952694] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.140934] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.259925] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.319308] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247646.363976] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.389526] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.457577] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.513275] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.544150] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.637789] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.651337] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.710404] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.785978] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.928178] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.955859] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.016425] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.134880] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.159276] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.249781] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.315185] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.325523] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.361488] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.383235] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.439095] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.460806] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.485170] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247647.502436] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.548979] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.594343] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.621786] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.649303] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.670292] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.701090] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.735796] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.774236] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.816521] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.840603] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.869076] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.948394] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.977194] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.008216] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.041878] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.102950] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.123688] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.161477] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.210530] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.248898] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.273809] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247648.308455] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.357214] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.393870] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.418454] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.429277] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.508805] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.529862] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.581775] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.595466] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.679402] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.714558] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.767368] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.784370] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.805855] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.872980] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.933891] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.944161] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.979727] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.036203] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.094332] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.138191] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.175616] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247649.279457] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.313344] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.483680] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.519062] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.554865] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.601461] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.655004] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.760903] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.784816] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.870742] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.923269] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.003330] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.129582] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.206246] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.330698] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.481865] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.513212] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.564055] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.773681] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.780123] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.821904] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.841934] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247650.877117] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.901374] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.985498] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.026897] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.068131] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.109751] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.126539] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.355831] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.791237] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.829065] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.928932] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.077168] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.083449] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.211548] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.288786] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.302159] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.496320] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.614161] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.655070] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.745940] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.808084] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.117247] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247653.141879] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.166410] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.193642] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.338192] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.560506] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.898569] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.135093] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.283233] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.445210] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.465085] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.865339] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.987101] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247655.933191] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247655.993198] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247656.465146] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247669.543630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=6004078, last emitted seq=6004080 [247669.543635] [drm] No hardware hang detected. Did some blocks stall? Created attachment 139579 [details]
dmesg
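For triage, the long kernel log above can be condensed into a short summary. The following is only a rough sketch, not part of the original report: the function name `summarize_dmesg` and the regular expressions are my own assumptions, written against the message formats visible in this log (`INFO: task ... blocked for more than ... seconds` and the powerplay over-temperature line). It scans the whole text, so it works whether the log entries are one per line or run together.

```python
import re
from collections import Counter

# Hung-task reports look like:
#   INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
# The task name may itself contain colons, so the PID is the last
# colon-separated number before " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds"
)

# Over-temperature messages as printed by this kernel's powerplay code.
OVER_TEMP = re.compile(
    r"\[\s*(\d+\.\d+)\] amdgpu: \[powerplay\] GPU over temperature range detected"
)


def summarize_dmesg(log_text):
    """Hypothetical helper: tally hung tasks and over-temperature events."""
    hung = Counter(
        (name, int(pid)) for name, pid, _secs in HUNG_TASK.findall(log_text)
    )
    stamps = [float(t) for t in OVER_TEMP.findall(log_text)]
    return {
        "hung_tasks": dict(hung),
        "over_temp_count": len(stamps),
        # First and last kernel timestamps of the over-temperature burst.
        "over_temp_span": (stamps[0], stamps[-1]) if stamps else None,
    }


# A few entries copied from the log above, joined on one line.
sample = (
    "[247359.270184] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. "
    "[247359.270886] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. "
    "[247482.151777] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. "
    "[247640.263559] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! "
    "[247640.663689] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!"
)
print(summarize_dmesg(sample))
```

Feeding the full attachment through the same function would show which tasks stay blocked across the repeated 120-second reports and how long the over-temperature burst lasted.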
A very strange coincidence: every time I reproduce the described GPU hang while playing a video with VAAPI acceleration, the following messages appear in the kernel log after reboot:
[ 0.059000] mce: [Hardware Error]: Machine check events logged [ 0.059000] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400 [ 0.059000] mce: [Hardware Error]: TSC 0 ADDR ffffffffc07f31d5 MISC ffffffffc07f31d5 [ 0.059000] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 0 microcode 24 [ 0.059000] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver. [ 0.059000] ... version: 3 [ 0.059000] ... bit width: 48 [ 0.059000] ... generic registers: 4 [ 0.059000] ... value mask: 0000ffffffffffff [ 0.059000] ... max period: 00007fffffffffff [ 0.059000] ... fixed-purpose events: 3 [ 0.059000] ... event mask: 000000070000000f [ 0.059000] Hierarchical SRCU implementation. [ 0.059692] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter. [ 0.059740] smp: Bringing up secondary CPUs ... [ 0.060031] x86: Booting SMP configuration: [ 0.060035] ....
node #0, CPUs: #1 [ 0.061563] mce: [Hardware Error]: Machine check events logged [ 0.061567] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400 [ 0.061583] mce: [Hardware Error]: TSC 0 ADDR ffffffff957932bb MISC ffffffff957932bb [ 0.061602] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 2 microcode 24 [ 0.061684] #2 [ 0.063341] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400 [ 0.063341] mce: [Hardware Error]: TSC 0 ADDR ffffffffc02bd4e1 MISC ffffffffc02bd4e1 [ 0.063341] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 4 microcode 24 [ 0.063471] #3 [ 0.065119] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 3: be00000000800400 [ 0.065125] mce: [Hardware Error]: TSC 0 ADDR ffffffffc07f31d5 MISC ffffffffc07f31d5 [ 0.065144] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 6 microcode 24 [ 0.065255] #4 #5 #6 #7 Created attachment 139764 [details]
dmesg
After updating the kernel to 4.17.0-0.rc6.git1.1, the strange MCE error messages after reboot disappeared, but the GPU still hangs.

The strange MCE messages returned with kernel 4.17.0-0.rc6.git3.1.fc29.x86_64:

$ dmesg | grep mce
[ 0.027300] mce: CPU supports 9 MCE banks
[ 0.058829] mce: [Hardware Error]: Machine check events logged
[ 0.058834] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: b200000000800400
[ 0.058856] mce: [Hardware Error]: TSC 0
[ 0.058867] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 0 microcode 24
[ 0.058883] mce: [Hardware Error]: Machine check events logged
[ 0.058885] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400
[ 0.058898] mce: [Hardware Error]: TSC 0 ADDR ffffffffc055a2f6 MISC ffffffffc055a2f6
[ 0.058916] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 0 microcode 24
[ 0.061682] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
[ 0.061700] mce: [Hardware Error]: TSC 0 ADDR ffffffffc0bd9215 MISC ffffffffc0bd9215
[ 0.061719] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 2 microcode 24
[ 0.063495] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400
[ 0.063495] mce: [Hardware Error]: TSC 0 ADDR ffffffffc04404e1 MISC ffffffffc04404e1
[ 0.063495] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 4 microcode 24
[ 0.065271] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 3: be00000000800400
[ 0.065271] mce: [Hardware Error]: TSC 0 ADDR ffffffffc055a2f6 MISC ffffffffc055a2f6
[ 0.065271] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 6 microcode 24

Created attachment 139798 [details]
dmesg
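The 64-bit status words in the MCE lines above (for example `fe00000000800400`, `be00000000800400`, `b200000000800400`) can be split into their architectural fields. The sketch below is my own illustration, not something posted in the thread; it assumes the standard Intel IA32_MCi_STATUS layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific error code, bits 15:0 MCA error code), and the function name `decode_mci_status` is hypothetical.

```python
# Flag bits of IA32_MCi_STATUS, listed from the most significant bit down.
STATUS_FLAGS = {
    63: "VAL",    # register contains valid information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Hypothetical helper: split a raw status word into named fields."""
    return {
        "flags": [
            name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1
        ],
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# Status words copied from the dmesg output in this thread.
for raw in ("fe00000000800400", "be00000000800400", "b200000000800400"):
    fields = decode_mci_status(int(raw, 16))
    print(raw, fields["flags"],
          hex(fields["model_specific_code"]), hex(fields["mca_error_code"]))
```

One thing this makes visible: the `b2...` word has neither MISCV nor ADDRV set, which is consistent with that log entry printing only `TSC 0` without `ADDR`/`MISC` values. What the error codes mean for this particular CPU would still need to be checked against the processor documentation.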
Created attachment 140078 [details]
fresh dmesg from kernel 4.18.0-0.rc0.git2.1
I am seeing the same thing with VLC when switching Hardware-accelerated decoding from Automatic to VA-API.

Fedora 28
RX Vega 64
Kernel 4.17.2
mesa 18.0.5
llvm 6

Benjamin, I see that in Fedora 29 (Rawhide) with kernel 4.18.0-0.rc0.git9.1.fc29 the problem is gone. But kernel 4.18.0-0.rc0.git10.1.fc29.x86_64 brought yet another problem: video output stopped working (at least over DP).

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/374.
Created attachment 139407 [details]
dmesg

* Fedora 29 (Rawhide)
* Latest system updates:
  - kernel 4.17.0-0.rc3.git4.1
  - drm 3.25.0
  - mesa 18.1.0-rc2
  - llvm 6.0.0

Steps to reproduce:
1) # dnf install gstreamer1-vaapi
2) Play an H.264-encoded video in the Totem player

Symptoms:
1. The system stops responding.

Kernel output after the GPU hang:
[ 89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=2638, last emitted seq=2640
[ 89.056926] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=80, last emitted seq=82
[ 89.056932] [drm] No hardware hang detected. Did some blocks stall?
[ 89.056948] [drm] No hardware hang detected. Did some blocks stall?
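To compare hangs across kernels, the ring timeout lines can be pulled out of a dmesg dump mechanically. This is a minimal sketch under my own assumptions: the regular expression matches the `amdgpu_job_timedout` message exactly as it is printed in this report (the wording may differ in other kernel versions), and `parse_ring_timeouts` is a hypothetical name. The `gap` value is simply emitted minus signaled sequence number, which I read as the number of submissions that had not completed yet.

```python
import re

# Matches e.g.:
#   [ 89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
#   last signaled seq=2638, last emitted seq=2640
RING_TIMEOUT = re.compile(
    r"\[\s*(\d+\.\d+)\] \[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (\S+) timeout, last signaled seq=(\d+), last emitted seq=(\d+)"
)


def parse_ring_timeouts(log_text):
    """Hypothetical helper: list ring timeouts found in a dmesg dump."""
    events = []
    for stamp, ring, signaled, emitted in RING_TIMEOUT.findall(log_text):
        events.append({
            "time": float(stamp),
            "ring": ring,
            "signaled": int(signaled),
            "emitted": int(emitted),
            # Sequence numbers emitted but not yet signaled.
            "gap": int(emitted) - int(signaled),
        })
    return events


# The two timeout lines from the kernel output above.
sample = (
    "[ 89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=2638, last emitted seq=2640 "
    "[ 89.056926] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, "
    "last signaled seq=80, last emitted seq=82"
)
for event in parse_ring_timeouts(sample):
    print(event["ring"], event["gap"])
```

In this report the gfx and uvd rings time out within the same fraction of a millisecond, so listing the events by ring and timestamp is a quick way to check whether a later kernel shows the same pairing.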