Summary: | [amdgpu][vulkan] GPU hang (Vega 56) while running game (Rise of the Tomb Raider) | ||
---|---|---|---|
Product: | Mesa | Reporter: | Martin F <martin.fretigne> |
Component: | Drivers/Vulkan/radeon | Assignee: | mesa-dev |
Status: | RESOLVED FIXED | QA Contact: | mesa-dev |
Severity: | normal | ||
Priority: | medium | CC: | jaapbuurman, pritzl3452, rafalcieslak256 |
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Bug Depends on: | |||
Bug Blocks: | 77449 | ||
Attachments: | patch; Savegame ROTTR; possible fix |
Description
Martin F
2018-04-20 14:44:54 UTC
Hi, Yeah, 24fb3e6aa166b3afe906eb2845077766075189ed is a broken commit which might hang your GPU. You are lucky because I have just written a fix for that issue. :) Can you try https://cgit.freedesktop.org/~hakzsam/mesa/commit/?h=radv_image_fix&id=22082cf2c1b2613ee4080347472f6c82086121b0 and let me know if it's fixed?

FYI, I have pushed the fix https://cgit.freedesktop.org/mesa/mesa/commit/?id=8f13975713a7a7b8d625e3561a7fc9ce202ac64b

Hi Samuel. I just tried with your commit but still got a hang, with the same syslog messages as before. I tried again after updating my kernel to 4.17.0-rc1+ (87ef12027b9b1dd0e0b12cf311fbcb19f9d92539) and after cleaning the cache (~/.cache/mesa_shader_cache/), but the game hung once more.

Okay, can you give the steps to reproduce (or bisect)? Thanks!

Hi, I also get hangs while playing the game. I can usually play for a pretty long time though, around 30-60 mins before it hangs. I can ssh into the system and the only errors I see in dmesg are:
apr 21 20:09:57 serenity kernel: [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, last signaled seq=4236234, last emitted seq=4236236
apr 21 20:09:57 serenity kernel: [drm] No hardware hang detected. Did some blocks stall?
Device: Radeon RX Vega (VEGA10 / DRM 3.23.0 / 4.16.3-gentoo, LLVM 6.0.0) (0x687f)
Version: 18.0.1
This is on a Vega 64 LC.

The problem happened multiple times at the very beginning of the game, but not very often later (1 hang every 3 hours maybe). I just started a new game to see if I would get the hangs I got yesterday, but did not. Since the bug is not systematic it's not practicable to bisect it.

Well, I have a Vega 56 so I can reproduce the hang but I would need a bit more info. What preset are you using? Do you have something in dmesg when it hangs?

The preset I'm using is 'Medium', in 1920x1080.
The hang seems to happen very often (or maybe every time when the game was never launched before) at the very beginning of the game (when Jonas catches Lara when she jumps after the very first fall).
dmesg:
[ 3442.737830] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=272116, last emitted seq=272118
[ 3442.737835] [drm] No hardware hang detected. Did some blocks stall?
[ 3626.038022] INFO: task kworker/u32:3:163 blocked for more than 120 seconds.
[ 3626.038029] Not tainted 4.17.0-rc1+ #5
[ 3626.038031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3626.038035] kworker/u32:3 D 0 163 2 0x80000000
[ 3626.038055] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 3626.038058] Call Trace:
[ 3626.038068] ? __schedule+0x291/0x870
[ 3626.038072] schedule+0x28/0x80
[ 3626.038076] schedule_timeout+0x1ee/0x380
[ 3626.038167] ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[ 3626.038271] ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[ 3626.038279] dma_fence_default_wait+0x1fd/0x280
[ 3626.038286] ? dma_fence_release+0x90/0x90
[ 3626.038290] dma_fence_wait_timeout+0x39/0xf0
[ 3626.038294] reservation_object_wait_timeout_rcu+0x17b/0x370
[ 3626.038375] amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[ 3626.038457] amdgpu_dm_atomic_commit_tail+0xb00/0xd00 [amdgpu]
[ 3626.038463] ? wait_for_completion_timeout+0x3b/0x1a0
[ 3626.038467] ? pick_next_task_fair+0x35b/0x660
[ 3626.038473] ? __switch_to+0xa2/0x450
[ 3626.038486] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 3626.038491] process_one_work+0x17b/0x360
[ 3626.038495] worker_thread+0x2e/0x390
[ 3626.038498] ? process_one_work+0x360/0x360
[ 3626.038502] kthread+0x113/0x130
[ 3626.038506] ? kthread_create_worker_on_cpu+0x70/0x70
[ 3626.038509] ret_from_fork+0x35/0x40

FYI I tried to start bisecting from mesa 17.3.5 but could not get very far due to compilation issues (maybe related to llvm, not sure, I'm trying again).
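
For anyone comparing logs across reports, here is a minimal Python sketch that pulls the `ring ... timeout` lines out of saved dmesg or syslog text and reports how many submissions were still pending. It is only an illustration: the names `RingTimeout` and `find_ring_timeouts` are made up for this note, and the pattern is based solely on the message variants pasted in this thread (with and without the word "last"), so other kernel versions may word the message differently.

```python
import re
from typing import List, NamedTuple


class RingTimeout(NamedTuple):
    """One 'ring <name> timeout' message (hypothetical helper type)."""
    ring: str
    signaled: int
    emitted: int


# Matches both wordings seen in this thread:
#   "ring gfx timeout, last signaled seq=N, last emitted seq=M"
#   "ring gfx timeout, signaled seq=N, emitted seq=M"
_TIMEOUT_RE = re.compile(
    r"ring (\w+) timeout, (?:last )?signaled seq=(\d+), (?:last )?emitted seq=(\d+)"
)


def find_ring_timeouts(log_text: str) -> List[RingTimeout]:
    """Return every ring timeout message found in the given log text."""
    return [
        RingTimeout(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in _TIMEOUT_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "[ 3442.737830] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx "
        "timeout, last signaled seq=272116, last emitted seq=272118\n"
        "[ 3442.737835] [drm] No hardware hang detected. Did some blocks stall?\n"
    )
    for t in find_ring_timeouts(sample):
        # emitted - signaled = submissions that never completed
        print(f"ring={t.ring} pending={t.emitted - t.signaled}")
```

The gap between the emitted and signaled sequence numbers is small in every log posted here, which only says that the ring stopped shortly after the last completed job; it does not identify which draw or dispatch stalled.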
I have been able to reproduce the hang one or two times. Unfortunately it's really hard to reproduce, so really hard to fix, but we are working on it!

Created attachment 139232 [details] [review]
patch
Guys, can you apply the attached patch and let me know if it improves the situation?

Hi Samuel. It looks good, no hang here so far. Well done!

Thank you for the patch! Unfortunately it still hangs for me. I applied the patch to Mesa 18.1.0-rc2.

If you're still seeing a hang, could you try the latest game update (released today)?

Hi Alex, I gave it a try now with the new update to ROTTR and the game still hangs in the same way. I am now on kernel 4.17.0, with Mesa and llvm from git updated about 2 hours ago.
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.25.0, 4.17.0-gentoo, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.2.0-devel (git-b3ba47c592)

(In reply to pritzl3452 from comment #15)
> Hi Alex,
>
> I gave it a try now with the new update to ROTTR and the game still hangs in
> the same way.
Can you explain how to reproduce? Maybe you can also upload your savefile (i.e. lastauto.ldat)?

(In reply to Samuel Pitoiset from comment #16)
> (In reply to pritzl3452 from comment #15)
> > Hi Alex,
> >
> > I gave it a try now with the new update to ROTTR and the game still hangs in
> > the same way.
>
> Can you explain how to reproduce? Maybe you can also upload your savefile
> (ie. lastauto.ldat)?
I have not found anything specific that makes it hang, unfortunately. I just play the game and sometimes it hangs; it has happened within 30 seconds of getting into the game and it has taken more than an hour before hanging. I am in a different place every time the game hangs.

Created attachment 140234 [details]
Savegame ROTTR
Thanks, what preset do you use?

Created attachment 140246 [details] [review]
possible fix
Does this patch help?

If my patch doesn't help, can you try master with "export RADV_DEBUG=nocompute"?

I am away travelling and I won't be able to try the patch until late next week. I will try the patch when I'm back and get back with the results.

Can you also try with RADV_PERFTEST=nobatchchain please?

(In reply to Samuel Pitoiset from comment #19)
> Thanks, what preset do you use?
I'm using the high preset and I have disabled AA. Resolution is 3840x2160.

(In reply to Samuel Pitoiset from comment #20)
> Created attachment 140246 [details] [review] [review]
> possible fix
>
> Does this patch help?
The game still hangs with this patch using Mesa 18.1.3.

(In reply to Samuel Pitoiset from comment #21)
> If my patch doesn't help, can you try master with "export
> RADV_DEBUG=nocompute"?
The game still hangs in the same way using LLVM 7 and Mesa master with this.

I guess it also hangs with RADV_PERFTEST=nobatchchain?

(In reply to Samuel Pitoiset from comment #27)
> I guess it also hangs with RADV_PERFTEST=nobatchchain ?
Yes, I just gave it a try and it hangs in the same way. I am now on kernel 4.16.13, Mesa 18.2.0-devel (git-f8e54d02f7).

I am seeing the same issue in multiple games on my Vega 64:
- Assassin's Creed 2 played through Wine with the Gallium Nine patches
- Assassin's Creed Brotherhood played through Wine with the Gallium Nine patches
- GTA V played through Wine with the latest DXVK (Vulkan)
When I SSH into the machine I can see the following messages in dmesg:
[ 3442.737830] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=272116, last emitted seq=272118
[ 3442.737835] [drm] No hardware hang detected. Did some blocks stall?
Software versions:
Kernel: 4.17.10
Mesa: 18.1.4
LLVM: 6.0.1

Still running into this issue, now while running Mario Party 9 through Dolphin.
This is a particularly good test case, because I can reliably get it to crash in the main menu after seconds/minutes. This ONLY happens with the Vulkan renderer.
Versions:
Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.3-arch1-1-ARCH, LLVM 7.0.0)
Mesa: 18.3.1
I have also managed to get a stack trace this time, which is hopefully useful for debugging:
[ 858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
[ 858.970205] [drm] GPU recovery disabled.
[ 982.906053] INFO: task kworker/u32:6:398 blocked for more than 120 seconds.
[ 982.906055] Not tainted 4.20.3-arch1-1-ARCH #1
[ 982.906056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 982.906057] kworker/u32:6 D 0 398 2 0x80000000
[ 982.906068] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 982.906069] Call Trace:
[ 982.906075] ? __schedule+0x29b/0x8b0
[ 982.906077] ? __switch_to_asm+0x40/0x70
[ 982.906079] schedule+0x32/0x90
[ 982.906080] schedule_timeout+0x311/0x4a0
[ 982.906126] ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[ 982.906167] ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[ 982.906170] dma_fence_default_wait+0x204/0x280
[ 982.906172] ? dma_fence_wait_timeout+0x120/0x120
[ 982.906173] dma_fence_wait_timeout+0x105/0x120
[ 982.906175] reservation_object_wait_timeout_rcu+0x1f2/0x370
[ 982.906178] ? preempt_count_add+0x79/0xb0
[ 982.906221] amdgpu_dm_do_flip+0x10d/0x370 [amdgpu]
[ 982.906265] amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[ 982.906267] ? _raw_spin_lock_irq+0x1a/0x40
[ 982.906268] ? wait_for_common+0x113/0x190
[ 982.906269] ? __switch_to_asm+0x34/0x70
[ 982.906275] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 982.906278] process_one_work+0x1eb/0x410
[ 982.906280] worker_thread+0x2d/0x3d0
[ 982.906282] ? process_one_work+0x410/0x410
[ 982.906283] kthread+0x112/0x130
[ 982.906284] ? kthread_park+0x80/0x80
[ 982.906286] ret_from_fork+0x22/0x40
[ 982.906290] INFO: task kworker/u32:8:404 blocked for more than 120 seconds.
[ 982.906290] Not tainted 4.20.3-arch1-1-ARCH #1
[ 982.906291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 982.906291] kworker/u32:8 D 0 404 2 0x80000000
[ 982.906297] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 982.906298] Call Trace:
[ 982.906300] ? __schedule+0x29b/0x8b0
[ 982.906301] schedule+0x32/0x90
[ 982.906302] schedule_preempt_disabled+0x14/0x20
[ 982.906303] __ww_mutex_lock.isra.2+0x413/0x7f0
[ 982.906329] ? amdgpu_get_vblank_counter_kms+0x110/0x160 [amdgpu]
[ 982.906370] amdgpu_dm_do_flip+0xd2/0x370 [amdgpu]
[ 982.906412] amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[ 982.906414] ? _raw_spin_lock_irq+0x1a/0x40
[ 982.906415] ? wait_for_common+0x113/0x190
[ 982.906416] ? __switch_to_asm+0x34/0x70
[ 982.906422] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 982.906424] process_one_work+0x1eb/0x410
[ 982.906425] worker_thread+0x2d/0x3d0
[ 982.906427] ? process_one_work+0x410/0x410
[ 982.906428] kthread+0x112/0x130
[ 982.906429] ? kthread_park+0x80/0x80
[ 982.906431] ret_from_fork+0x22/0x40
Please let me know if I can help with debugging. The fact that I can get it to crash reliably and easily should help immensely.

Moved the Mario Party issue to https://bugs.freedesktop.org/show_bug.cgi?id=109393. A priori I'd like to not combine bugs based on the fact that it is a hang with the same GPU. These often have different causes.

Does this still happen with mesa 19.0?

(In reply to Samuel Pitoiset from comment #32)
> Does this still happen with mesa 19.0 ?
I played a few hours now without hangs so I think it's fixed for me. Using mesa 19.0 and llvm 8.

Very nice, thanks for checking! Feel free to re-open if the problem happens again. |
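
For testers who want to retry the two toggles requested in this thread (RADV_DEBUG=nocompute and RADV_PERFTEST=nobatchchain) without exporting them globally, the following Python sketch builds a per-launch environment. It is a rough illustration only: `radv_debug_env` is a hypothetical helper name, the sketch assumes both variables take comma-separated flag lists, and the launch command in the trailing comment is a placeholder, not the game's real command line.

```python
import os
from typing import Dict, Mapping, Optional


def _with_flag(current: Optional[str], flag: str) -> str:
    """Append flag to a comma-separated list unless it is already present."""
    flags = [f for f in (current or "").split(",") if f]
    if flag not in flags:
        flags.append(flag)
    return ",".join(flags)


def radv_debug_env(
    base: Optional[Mapping[str, str]] = None,
    *,
    nocompute: bool = False,
    nobatchchain: bool = False,
) -> Dict[str, str]:
    """Return a copy of the environment with the requested RADV toggles set."""
    env = dict(os.environ if base is None else base)
    if nocompute:
        # Toggle suggested in this thread: RADV_DEBUG=nocompute
        env["RADV_DEBUG"] = _with_flag(env.get("RADV_DEBUG"), "nocompute")
    if nobatchchain:
        # Toggle suggested in this thread: RADV_PERFTEST=nobatchchain
        env["RADV_PERFTEST"] = _with_flag(env.get("RADV_PERFTEST"), "nobatchchain")
    return env


if __name__ == "__main__":
    env = radv_debug_env(nocompute=True, nobatchchain=True)
    print("RADV_DEBUG =", env["RADV_DEBUG"])
    print("RADV_PERFTEST =", env["RADV_PERFTEST"])
    # To launch with these settings (placeholder command, adjust locally):
    # import subprocess
    # subprocess.run(["/path/to/game-launcher"], env=env)
```

Setting the variables per process keeps other Vulkan applications on the default code paths, which makes it easier to tell whether a given toggle changed the behaviour of the game under test.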