Bug 106151

Summary: [amdgpu][vulkan] GPU hang (Vega 56) while running game (Rise of the Tomb Raider)
Product: Mesa
Reporter: Martin F <martin.fretigne>
Component: Drivers/Vulkan/radeon
Assignee: mesa-dev
Status: RESOLVED FIXED
QA Contact: mesa-dev
Severity: normal
Priority: medium
CC: jaapbuurman, pritzl3452, rafalcieslak256
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Bug Blocks: 77449
Attachments: patch, Savegame ROTTR, possible fix

Description Martin F 2018-04-20 14:44:54 UTC
The hang happens very quickly, after playing for about 30 seconds.

Distribution: Debian GNU/Linux testing (buster)
Linux kernel: tag v4.16 (0adb32858b0bddf4ada5f364a84ed60b196dbcda)
Mesa: 18.1.0-devel (24fb3e6aa166b3afe906eb2845077766075189ed)
clang/LLVM: 7.0.0 (c33e1469f018cf71327e06df054a0ffc87f4d9f8 / 2f0f603361d156e655216b46f95d1136934daba2)

Syslog:
Apr 20 16:10:33 galaxie kernel: [  725.999915] INFO: task kworker/u32:5:165 blocked for more than 120 seconds.
Apr 20 16:10:33 galaxie kernel: [  725.999924]       Not tainted 4.16.0 #4
Apr 20 16:10:33 galaxie kernel: [  725.999927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 20 16:10:33 galaxie kernel: [  725.999931] kworker/u32:5   D    0   165      2 0x80000000
Apr 20 16:10:33 galaxie kernel: [  725.999952] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 20 16:10:33 galaxie kernel: [  725.999955] Call Trace:
Apr 20 16:10:33 galaxie kernel: [  725.999965]  ? __schedule+0x297/0x870
Apr 20 16:10:33 galaxie kernel: [  725.999969]  schedule+0x28/0x80
Apr 20 16:10:33 galaxie kernel: [  725.999973]  schedule_timeout+0x1ee/0x380
Apr 20 16:10:33 galaxie kernel: [  726.000061]  ? dce120_timing_generator_get_crtc_position+0x5d/0x70 [amdgpu]
Apr 20 16:10:33 galaxie kernel: [  726.000139]  ? dce120_timing_generator_get_crtc_scanoutpos+0x71/0xb0 [amdgpu]
Apr 20 16:10:33 galaxie kernel: [  726.000144]  dma_fence_default_wait+0x1fd/0x280
Apr 20 16:10:33 galaxie kernel: [  726.000148]  ? dma_fence_release+0x90/0x90
Apr 20 16:10:33 galaxie kernel: [  726.000151]  dma_fence_wait_timeout+0x39/0xf0
Apr 20 16:10:33 galaxie kernel: [  726.000155]  reservation_object_wait_timeout_rcu+0x17b/0x370
Apr 20 16:10:33 galaxie kernel: [  726.000233]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
Apr 20 16:10:33 galaxie kernel: [  726.000312]  amdgpu_dm_atomic_commit_tail+0xb54/0xdb0 [amdgpu]
Apr 20 16:10:33 galaxie kernel: [  726.000318]  ? wait_for_completion_timeout+0x3b/0x1a0
Apr 20 16:10:33 galaxie kernel: [  726.000323]  ? __switch_to+0x3ff/0x450
Apr 20 16:10:33 galaxie kernel: [  726.000326]  ? __switch_to+0x3ff/0x450
Apr 20 16:10:33 galaxie kernel: [  726.000339]  commit_tail+0x3d/0x70 [drm_kms_helper]
Apr 20 16:10:33 galaxie kernel: [  726.000344]  process_one_work+0x17b/0x360
Apr 20 16:10:33 galaxie kernel: [  726.000348]  worker_thread+0x2e/0x390
Apr 20 16:10:33 galaxie kernel: [  726.000352]  ? process_one_work+0x360/0x360
Apr 20 16:10:33 galaxie kernel: [  726.000356]  kthread+0x113/0x130
Apr 20 16:10:33 galaxie kernel: [  726.000359]  ? kthread_create_worker_on_cpu+0x70/0x70
Apr 20 16:10:33 galaxie kernel: [  726.000362]  ret_from_fork+0x35/0x40
Comment 1 Samuel Pitoiset 2018-04-20 16:04:15 UTC
Hi,

Yeah, 24fb3e6aa166b3afe906eb2845077766075189ed is a broken commit that can hang your GPU. You're in luck, because I have just written a fix for that issue. :)

Can you try https://cgit.freedesktop.org/~hakzsam/mesa/commit/?h=radv_image_fix&id=22082cf2c1b2613ee4080347472f6c82086121b0 and let me know if it's fixed?
Comment 2 Samuel Pitoiset 2018-04-20 16:45:33 UTC
FYI, I have pushed the fix https://cgit.freedesktop.org/mesa/mesa/commit/?id=8f13975713a7a7b8d625e3561a7fc9ce202ac64b
Comment 3 Martin F 2018-04-20 17:43:00 UTC
Hi Samuel. I just tried with your commit but still got a hang, with the same syslog messages as before.
Comment 4 Martin F 2018-04-20 18:36:21 UTC
I tried again after updating my kernel to 4.17.0-rc1+ (87ef12027b9b1dd0e0b12cf311fbcb19f9d92539) and after clearing the shader cache (~/.cache/mesa_shader_cache/), but the game hung once more.
Comment 5 Samuel Pitoiset 2018-04-21 12:40:03 UTC
Okay, can you give the steps to reproduce (or bisect)? Thanks!
Comment 6 pritzl3452 2018-04-21 18:20:21 UTC
Hi,

I also get hangs while playing the game, though I can usually play for a pretty long time, around 30-60 minutes, before it hangs. I can SSH into the system, and the only errors I see in dmesg are:

apr 21 20:09:57 serenity kernel: [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, last signaled seq=4236234, last emitted seq=4236236
apr 21 20:09:57 serenity kernel: [drm] No hardware hang detected. Did some blocks stall?

Device: Radeon RX Vega (VEGA10 / DRM 3.23.0 / 4.16.3-gentoo, LLVM 6.0.0) (0x687f)
Version: 18.0.1

This is on a Vega 64 LC
Comment 7 Martin F 2018-04-21 21:51:26 UTC
The problem happened multiple times at the very beginning of the game, but not very often later on (maybe one hang every 3 hours). I just started a new game to see if I would get the hangs I got yesterday, but did not. Since the bug is not systematic, it's not practical to bisect.
Comment 8 Samuel Pitoiset 2018-04-23 13:02:04 UTC
Well, I have a Vega 56, so I can try to reproduce the hang, but I would need a bit more info. What preset are you using? Do you have anything in dmesg when it hangs?
Comment 9 Martin F 2018-04-23 20:23:08 UTC
The preset I'm using is 'Medium', at 1920x1080. The hang seems to happen very often (maybe every time when the game has never been launched before) at the very beginning of the game (when Jonas catches Lara as she jumps, right after the very first fall).

dmesg:
[ 3442.737830] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=272116, last emitted seq=272118
[ 3442.737835] [drm] No hardware hang detected. Did some blocks stall?
[ 3626.038022] INFO: task kworker/u32:3:163 blocked for more than 120 seconds.
[ 3626.038029]       Not tainted 4.17.0-rc1+ #5
[ 3626.038031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3626.038035] kworker/u32:3   D    0   163      2 0x80000000
[ 3626.038055] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 3626.038058] Call Trace:
[ 3626.038068]  ? __schedule+0x291/0x870
[ 3626.038072]  schedule+0x28/0x80
[ 3626.038076]  schedule_timeout+0x1ee/0x380
[ 3626.038167]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[ 3626.038271]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[ 3626.038279]  dma_fence_default_wait+0x1fd/0x280
[ 3626.038286]  ? dma_fence_release+0x90/0x90
[ 3626.038290]  dma_fence_wait_timeout+0x39/0xf0
[ 3626.038294]  reservation_object_wait_timeout_rcu+0x17b/0x370
[ 3626.038375]  amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[ 3626.038457]  amdgpu_dm_atomic_commit_tail+0xb00/0xd00 [amdgpu]
[ 3626.038463]  ? wait_for_completion_timeout+0x3b/0x1a0
[ 3626.038467]  ? pick_next_task_fair+0x35b/0x660
[ 3626.038473]  ? __switch_to+0xa2/0x450
[ 3626.038486]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 3626.038491]  process_one_work+0x17b/0x360
[ 3626.038495]  worker_thread+0x2e/0x390
[ 3626.038498]  ? process_one_work+0x360/0x360
[ 3626.038502]  kthread+0x113/0x130
[ 3626.038506]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3626.038509]  ret_from_fork+0x35/0x40

FYI, I tried to start bisecting from Mesa 17.3.5 but could not get very far due to compilation issues (maybe related to LLVM, not sure; I'm trying again).
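(For reference, a bisect between the 17.3.5 release tag and the broken revision from the description would look roughly like the sketch below; build-and-install.sh is a placeholder for whatever local script rebuilds and installs the driver at each step.)

# Rough bisect sketch between a known-good release and the revision from the
# description; build-and-install.sh is a hypothetical local build script.
cd mesa
git bisect start
git bisect bad 24fb3e6aa166b3afe906eb2845077766075189ed   # revision from the report
git bisect good mesa-17.3.5                               # release tag used as the good endpoint
# At each step: rebuild, play until the game hangs (bad) or survives long enough (good),
# then report the result back to git:
./build-and-install.sh
git bisect bad    # or: git bisect good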
Comment 10 Samuel Pitoiset 2018-04-26 09:11:00 UTC
I have been able to reproduce the hang once or twice. Unfortunately it's really hard to reproduce, and therefore really hard to fix, but we are working on it!
Comment 11 Samuel Pitoiset 2018-04-30 18:09:20 UTC
Created attachment 139232 [details] [review]
patch

Guys, can you apply the attached patch and let me know if it improves the situation?
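(For anyone unsure how to test it: a rough sketch of applying the patch to a local Mesa checkout and pointing the Vulkan loader at the resulting RADV build is shown below; the meson options, install prefix, and patch path are illustrative, adjust them to your setup.)

# Illustrative sketch only: apply the attachment to a Mesa checkout, build RADV,
# and run the game against the freshly built driver.
cd mesa
git apply /path/to/attachment-139232.patch              # path to the downloaded patch is hypothetical
meson build/ --prefix="$HOME/mesa-test" -Dvulkan-drivers=amd -Dgallium-drivers=
ninja -C build/ install
# Point the Vulkan loader at the new RADV ICD, then launch the game from this shell:
export VK_ICD_FILENAMES="$HOME/mesa-test/share/vulkan/icd.d/radeon_icd.x86_64.json"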
Comment 12 Martin F 2018-04-30 19:45:16 UTC
Hi Samuel. It looks good, no hang here so far. Well done!
Comment 13 pritzl3452 2018-05-01 14:32:38 UTC
Thank you for the patch! Unfortunately it still hangs for me. I applied the patch to Mesa 18.1.0-rc2.
Comment 14 Alex Smith 2018-06-04 14:58:39 UTC
If you're still seeing a hang, could you try the latest game update (released today)?
Comment 15 pritzl3452 2018-06-04 19:56:49 UTC
Hi Alex,

I gave the new ROTTR update a try, and the game still hangs in the same way.

I am now on kernel 4.17.0, with Mesa and LLVM from git, updated about 2 hours ago.

OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.25.0, 4.17.0-gentoo, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.2.0-devel (git-b3ba47c592)
Comment 16 Samuel Pitoiset 2018-06-19 16:07:56 UTC
(In reply to pritzl3452 from comment #15)
> Hi Alex,
> 
> I gave it a try now with the new update to ROTTR and the game still hangs in
> the same way.

Can you explain how to reproduce? Maybe you can also upload your savefile (i.e. lastauto.ldat)?
Comment 17 pritzl3452 2018-06-19 19:12:47 UTC
(In reply to Samuel Pitoiset from comment #16)
> (In reply to pritzl3452 from comment #15)
> > Hi Alex,
> > 
> > I gave it a try now with the new update to ROTTR and the game still hangs in
> > the same way.
> 
> Can you explain how to reproduce? Maybe you can also upload your savefile
> (ie. lastauto.ldat)?

Unfortunately, I have not found anything specific that makes it hang. I just play the game and sometimes it hangs; it has happened within 30 seconds of getting into the game, and it has also taken more than an hour before hanging.

I am in a different place every time the game hangs.
Comment 18 pritzl3452 2018-06-19 19:18:35 UTC
Created attachment 140234 [details]
Savegame ROTTR
Comment 19 Samuel Pitoiset 2018-06-20 07:56:36 UTC
Thanks, what preset do you use?
Comment 20 Samuel Pitoiset 2018-06-20 12:28:10 UTC
Created attachment 140246 [details] [review]
possible fix

Does this patch help?
Comment 21 Samuel Pitoiset 2018-06-20 13:45:13 UTC
If my patch doesn't help, can you try master with "export RADV_DEBUG=nocompute"?
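(A minimal sketch of what that means in practice; the launch step is illustrative, and for a Steam install the variable can equally go into the game's launch options as "RADV_DEBUG=nocompute %command%".)

# Disable RADV's compute queue for this session, then launch the game from the
# same shell so it inherits the variable:
export RADV_DEBUG=nocompute
# ...start Rise of the Tomb Raider from this shell (or via Steam launch options).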
Comment 22 pritzl3452 2018-06-20 14:32:36 UTC
I am away travelling and won't be able to try the patch until late next week. I will try it when I'm back and report the results.
Comment 23 Samuel Pitoiset 2018-07-04 10:17:38 UTC
Can you also try with RADV_PERFTEST=nobatchchain please?
Comment 24 pritzl3452 2018-07-06 17:31:58 UTC
(In reply to Samuel Pitoiset from comment #19)
> Thanks, what preset do you use?

I'm using the high preset and I have disabled AA. Resolution is 3840x2160.
Comment 25 pritzl3452 2018-07-06 17:32:55 UTC
(In reply to Samuel Pitoiset from comment #20)
> Created attachment 140246 [details] [review] [review]
> possible fix
> 
> Does this patch help?

The game still hangs with this patch using Mesa 18.1.3.
Comment 26 pritzl3452 2018-07-07 19:03:58 UTC
(In reply to Samuel Pitoiset from comment #21)
> If my patch doesn't help, can you try master with "export
> RADV_DEBUG=nocompute"?

The game still hangs in the same way using LLVM 7 and Mesa master with this.
Comment 27 Samuel Pitoiset 2018-07-09 15:55:07 UTC
I guess it also hangs with RADV_PERFTEST=nobatchchain ?
Comment 28 pritzl3452 2018-07-09 18:24:28 UTC
(In reply to Samuel Pitoiset from comment #27)
> I guess it also hangs with RADV_PERFTEST=nobatchchain ?

Yes, I just gave it a try and it hangs in the same way.

I am now on kernel 4.16.13
Mesa 18.2.0-devel (git-f8e54d02f7)
Comment 29 Jaap Buurman 2018-07-30 12:57:11 UTC
I am seeing the same issue in multiple games on my Vega 64:

- Assassin's Creed 2, played through Wine with the Gallium Nine patches
- Assassin's Creed Brotherhood, played through Wine with the Gallium Nine patches
- GTA V, played through Wine with the latest DXVK (Vulkan)

When I SSH into the machine I can see the following messages in dmesg:

[ 3442.737830] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=272116, last emitted seq=272118
[ 3442.737835] [drm] No hardware hang detected. Did some blocks stall?


Software versions:

Kernel: 4.17.10
Mesa: 18.1.4
LLVM: 6.0.1
Comment 30 Jaap Buurman 2019-01-19 14:53:34 UTC
Still running into this issue, now while running Mario Party 9 through Dolphin. This is a particularly good test case, because I can reliably get it to crash in the main menu within seconds to minutes. This ONLY happens with the Vulkan renderer.

Versions: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.3-arch1-1-ARCH, LLVM 7.0.0)
Mesa: 18.3.1

I have also managed to get a stack trace this time, which is hopefully useful for debugging:

[  858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
[  858.970205] [drm] GPU recovery disabled.
[  982.906053] INFO: task kworker/u32:6:398 blocked for more than 120 seconds.
[  982.906055]       Not tainted 4.20.3-arch1-1-ARCH #1
[  982.906056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.906057] kworker/u32:6   D    0   398      2 0x80000000
[  982.906068] Workqueue: events_unbound commit_work [drm_kms_helper]
[  982.906069] Call Trace:
[  982.906075]  ? __schedule+0x29b/0x8b0
[  982.906077]  ? __switch_to_asm+0x40/0x70
[  982.906079]  schedule+0x32/0x90
[  982.906080]  schedule_timeout+0x311/0x4a0
[  982.906126]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[  982.906167]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[  982.906170]  dma_fence_default_wait+0x204/0x280
[  982.906172]  ? dma_fence_wait_timeout+0x120/0x120
[  982.906173]  dma_fence_wait_timeout+0x105/0x120
[  982.906175]  reservation_object_wait_timeout_rcu+0x1f2/0x370
[  982.906178]  ? preempt_count_add+0x79/0xb0
[  982.906221]  amdgpu_dm_do_flip+0x10d/0x370 [amdgpu]
[  982.906265]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[  982.906267]  ? _raw_spin_lock_irq+0x1a/0x40
[  982.906268]  ? wait_for_common+0x113/0x190
[  982.906269]  ? __switch_to_asm+0x34/0x70
[  982.906275]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  982.906278]  process_one_work+0x1eb/0x410
[  982.906280]  worker_thread+0x2d/0x3d0
[  982.906282]  ? process_one_work+0x410/0x410
[  982.906283]  kthread+0x112/0x130
[  982.906284]  ? kthread_park+0x80/0x80
[  982.906286]  ret_from_fork+0x22/0x40
[  982.906290] INFO: task kworker/u32:8:404 blocked for more than 120 seconds.
[  982.906290]       Not tainted 4.20.3-arch1-1-ARCH #1
[  982.906291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.906291] kworker/u32:8   D    0   404      2 0x80000000
[  982.906297] Workqueue: events_unbound commit_work [drm_kms_helper]
[  982.906298] Call Trace:
[  982.906300]  ? __schedule+0x29b/0x8b0
[  982.906301]  schedule+0x32/0x90
[  982.906302]  schedule_preempt_disabled+0x14/0x20
[  982.906303]  __ww_mutex_lock.isra.2+0x413/0x7f0
[  982.906329]  ? amdgpu_get_vblank_counter_kms+0x110/0x160 [amdgpu]
[  982.906370]  amdgpu_dm_do_flip+0xd2/0x370 [amdgpu]
[  982.906412]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[  982.906414]  ? _raw_spin_lock_irq+0x1a/0x40
[  982.906415]  ? wait_for_common+0x113/0x190
[  982.906416]  ? __switch_to_asm+0x34/0x70
[  982.906422]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  982.906424]  process_one_work+0x1eb/0x410
[  982.906425]  worker_thread+0x2d/0x3d0
[  982.906427]  ? process_one_work+0x410/0x410
[  982.906428]  kthread+0x112/0x130
[  982.906429]  ? kthread_park+0x80/0x80
[  982.906431]  ret_from_fork+0x22/0x40




Please let me know if I can help with debugging. The fact that I can get it to crash reliably and easily should help immensely.
Comment 31 Bas Nieuwenhuizen 2019-01-19 15:12:35 UTC
Moved the Mario Party issue to https://bugs.freedesktop.org/show_bug.cgi?id=109393. A priori, I'd prefer not to combine bugs just because they are hangs on the same GPU; these often have different causes.
Comment 32 Samuel Pitoiset 2019-03-21 11:19:07 UTC
Does this still happen with Mesa 19.0?
Comment 33 pritzl3452 2019-03-23 11:37:11 UTC
(In reply to Samuel Pitoiset from comment #32)
> Does this still happen with mesa 19.0 ?

I played a few hours now without hangs, so I think it's fixed for me.

Using Mesa 19.0 and LLVM 8.
Comment 34 Samuel Pitoiset 2019-03-25 08:27:09 UTC
Very nice, thanks for checking!
Feel free to re-open if the problem happens again.
