From https://bugs.freedesktop.org/show_bug.cgi?id=106151#c30 :
Still running into this issue, now while running Mario Party 9 through Dolphin. This is a particularly good test case, because I can reliably get it to crash in the main menu after seconds/minutes. This ONLY happens with the Vulkan renderer.
Versions: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.3-arch1-1-ARCH, LLVM 7.0.0)
I have also managed to get a stack trace this time, which is hopefully useful for debugging:
[ 858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
[ 858.970205] [drm] GPU recovery disabled.
[ 982.906053] INFO: task kworker/u32:6:398 blocked for more than 120 seconds.
[ 982.906055] Not tainted 4.20.3-arch1-1-ARCH #1
[ 982.906056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 982.906057] kworker/u32:6 D 0 398 2 0x80000000
[ 982.906068] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 982.906069] Call Trace:
[ 982.906075] ? __schedule+0x29b/0x8b0
[ 982.906077] ? __switch_to_asm+0x40/0x70
[ 982.906079] schedule+0x32/0x90
[ 982.906080] schedule_timeout+0x311/0x4a0
[ 982.906126] ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[ 982.906167] ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[ 982.906170] dma_fence_default_wait+0x204/0x280
[ 982.906172] ? dma_fence_wait_timeout+0x120/0x120
[ 982.906173] dma_fence_wait_timeout+0x105/0x120
[ 982.906175] reservation_object_wait_timeout_rcu+0x1f2/0x370
[ 982.906178] ? preempt_count_add+0x79/0xb0
[ 982.906221] amdgpu_dm_do_flip+0x10d/0x370 [amdgpu]
[ 982.906265] amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[ 982.906267] ? _raw_spin_lock_irq+0x1a/0x40
[ 982.906268] ? wait_for_common+0x113/0x190
[ 982.906269] ? __switch_to_asm+0x34/0x70
[ 982.906275] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 982.906278] process_one_work+0x1eb/0x410
[ 982.906280] worker_thread+0x2d/0x3d0
[ 982.906282] ? process_one_work+0x410/0x410
[ 982.906283] kthread+0x112/0x130
[ 982.906284] ? kthread_park+0x80/0x80
[ 982.906286] ret_from_fork+0x22/0x40
[ 982.906290] INFO: task kworker/u32:8:404 blocked for more than 120 seconds.
[ 982.906290] Not tainted 4.20.3-arch1-1-ARCH #1
[ 982.906291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 982.906291] kworker/u32:8 D 0 404 2 0x80000000
[ 982.906297] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 982.906298] Call Trace:
[ 982.906300] ? __schedule+0x29b/0x8b0
[ 982.906301] schedule+0x32/0x90
[ 982.906302] schedule_preempt_disabled+0x14/0x20
[ 982.906303] __ww_mutex_lock.isra.2+0x413/0x7f0
[ 982.906329] ? amdgpu_get_vblank_counter_kms+0x110/0x160 [amdgpu]
[ 982.906370] amdgpu_dm_do_flip+0xd2/0x370 [amdgpu]
[ 982.906412] amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[ 982.906414] ? _raw_spin_lock_irq+0x1a/0x40
[ 982.906415] ? wait_for_common+0x113/0x190
[ 982.906416] ? __switch_to_asm+0x34/0x70
[ 982.906422] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 982.906424] process_one_work+0x1eb/0x410
[ 982.906425] worker_thread+0x2d/0x3d0
[ 982.906427] ? process_one_work+0x410/0x410
[ 982.906428] kthread+0x112/0x130
[ 982.906429] ? kthread_park+0x80/0x80
[ 982.906431] ret_from_fork+0x22/0x40
Please let me know if I can help with debugging. The fact that I can get it to crash reliably and easily should help immensely.
Still happening with Mesa 18.3.2, LLVM 7.0.1 and the 4.20.5 kernel.
Just tried to reproduce with these settings:
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.5-arch1-1-ARCH, LLVM 7.0.1)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.2
No hangs so far with the Vulkan backend.
Maybe you are using some "Enhancements" settings?
I am using a stock install of Dolphin. The only thing I changed was switching from the OpenGL backend to the Vulkan backend. Are you using Vega 64 or Vega 56?
I cannot edit my previous post: Did you test with Mario Party as well? I am in Europe, and am therefore using a PAL version. Not sure if that matters.
Yes, I tried Mario Party 9 on my Vega 56.
Is there anything I can do, test or research to find out why this is happening? Could this be a Vega 64 bug that doesn't affect Vega 56?
What you can do is to capture a hang report as follows:
dolphin-emu &> hang.report
And attach the files here.
I've set the environment variables and started Dolphin as instructed. However, I can no longer get it to crash. The FPS is around 4-6 FPS, instead of the regular 60 FPS, probably because a trace is being generated.
Is there anything different between having debugging enabled vs disabled? I've let it run for 30-60 minutes without a crash now.
Created attachment 143257
Hang.report file, no crash. Contains an error though.
I have just closed the emulator, since it still hadn't crashed. The hang.report file does contain an error, but it looks innocent. I have attached it just in case.
Do you have anything in dmesg when it freezes?
Can you reproduce the problem with "export RADV_DEBUG=zerovram" ?
I can still reproduce the hang with "export RADV_DEBUG=zerovram". The following message is present in dmesg:
[ 3638.209907] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=167782, emitted seq=167784
[ 3638.209910] [drm] GPU recovery disabled.
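For anyone comparing several of these timeout messages, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of dmesg text. The regular expression is written against the exact message format quoted in this thread and may not match other kernel versions. The function name parse_ring_timeouts and the "pending" field are made up for illustration. Reading "emitted minus signaled" as the number of fences still outstanding is an assumption, not something confirmed in this report.

```python
import re

# Pattern based only on the two messages quoted in this thread (assumption:
# other kernels may word or format this line differently).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return one dict per ring timeout line found in dmesg_text."""
    events = []
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a ring timeout line
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        events.append({
            "time": float(match.group("time")),
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Hypothetical helper field: fences emitted but not yet signaled.
            "pending": emitted - signaled,
        })
    return events


# The two timeout messages reported in this thread.
SAMPLE = """\
[ 858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
[ 858.970205] [drm] GPU recovery disabled.
[ 3638.209907] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=167782, emitted seq=167784
"""

if __name__ == "__main__":
    for event in parse_ring_timeouts(SAMPLE):
        print(event)
```

In both reported hangs the gfx ring is the one timing out, and the emitted sequence number is two ahead of the signaled one.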
I definitely can't reproduce the problem on my Vega 56.
Can you attach the output of "R600_DEBUG=info glxgears"?
[jaap@Antergos-Desktop ~]$ R600_DEBUG=info glxgears
pci (domain:bus:dev.func): 0000:09:00.0
pci_id = 0x687f
family = 68
chip_class = 11
num_compute_rings = 4
num_sdma_rings = 2
clock_crystal_freq = 27000
tcc_cache_line_size = 64
pte_fragment_size = 2097152
gart_page_size = 4096
gart_size = 8176 MB
vram_size = 8176 MB
vram_vis_size = 256 MB
gds_size = 64 kB
gds_gfx_partition_size = 4 kB
max_alloc_size = 6541 MB
min_alloc_size = 512
address32_hi = 4294934528
has_dedicated_vram = 1
gfx_ib_pad_with_type2 = 0
ib_start_alignment = 256
me_fw_version = 156
me_fw_feature = 42
pfp_fw_version = 177
pfp_fw_feature = 42
ce_fw_version = 77
ce_fw_feature = 42
has_hw_decode = 1
uvd_enc_supported = 1
uvd_fw_version = 22483200
vce_fw_version = 922944512
vce_harvest_config = 0
Kernel & winsys capabilities:
drm = 3.27.0
has_userptr = 1
has_syncobj = 1
has_syncobj_wait_for_submit = 1
has_fence_to_handle = 1
has_ctx_priority = 1
has_local_buffers = 0
kernel_flushes_hdp_before_ib = 1
htile_cmask_support_1d_tiling = 1
si_TA_CS_BC_BASE_ADDR_allowed = 1
has_bo_metadata = 1
has_gpu_reset_status_query = 1
has_gpu_reset_counter_query = 0
has_eqaa_surface_allocator = 1
has_format_bc1_through_bc7 = 1
kernel_flushes_tc_l2_after_ib = 1
has_indirect_compute_dispatch = 1
has_unaligned_shader_loads = 1
has_sparse_vm_mappings = 0
has_2d_tiling = 1
has_read_registers_query = 1
Shader core info:
max_shader_clock = 1630
num_good_compute_units = 64
num_good_cu_per_sh = 16
num_tcc_blocks = 16
max_se = 4
max_sh_per_se = 1
Render backend info:
num_render_backends = 16
num_tile_pipes = 4
pipe_interleave_bytes = 256
enabled_rb_mask = 0xffff
max_alignment = 65536
num_pipes = 4
pipe_interleave_size = 256
max_compressed_frags = 2
bank_interleave_size = 1
num_banks = 16
shader_engine_tile_size = 32
num_shader_engines = 4
num_gpus = 0 (raw)
multi_gpu_tile_size = 2 (raw)
num_rb_per_se = 4
row_size = 4096
num_lower_pipes = 0 (raw)
se_enable = 0 (raw)
471 frames in 5.0 seconds = 94.092 FPS
300 frames in 5.0 seconds = 59.908 FPS
300 frames in 5.0 seconds = 59.906 FPS
300 frames in 5.0 seconds = 59.911 FPS
300 frames in 5.0 seconds = 59.907 FPS
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
after 1882 requests (65 known processed) with 0 events remaining.
The error was me closing the window (or should I have left it running?).
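Since the problem shows up on one card and not the other, it may help to diff two such dumps rather than read them by eye. The following is a rough Python sketch, assuming the "key = value" layout shown in the output above. The names parse_info and diff_info are hypothetical helpers, and the second dump in the example is a made-up placeholder, not real output from a Vega 56.

```python
import re

# Matches lines such as "pci_id = 0x687f". Lines whose left-hand side contains
# spaces (for example the glxgears FPS lines) are deliberately not matched.
KEY_VALUE_RE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_info(text):
    """Collect 'key = value' lines into a dict of strings; skip everything else."""
    info = {}
    for line in text.splitlines():
        match = KEY_VALUE_RE.match(line)
        if match:
            info[match.group(1)] = match.group(2)
    return info


def diff_info(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys that differ or are missing."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Excerpt of the output posted above.
VEGA64_EXCERPT = """\
pci (domain:bus:dev.func): 0000:09:00.0
pci_id = 0x687f
num_good_compute_units = 64
num_render_backends = 16
300 frames in 5.0 seconds = 59.908 FPS
"""

# PLACEHOLDER values for illustration only, not taken from a real dump.
OTHER_EXCERPT = """\
pci_id = 0x687f
num_good_compute_units = 56
num_render_backends = 16
"""

if __name__ == "__main__":
    differences = diff_info(parse_info(VEGA64_EXCERPT), parse_info(OTHER_EXCERPT))
    for key, (left, right) in differences.items():
        print(f"{key}: {left} != {right}")
```

Values are kept as strings so that entries like "0 (raw)" survive unchanged; only differing keys are reported.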
Does the attached patch help https://patchwork.freedesktop.org/patch/290846/?series=57689&rev=1 ?
I am only familiar with building Mesa through the AUR (mesa-git). Can I easily apply that patch somehow in the PKGBUILD, or would it be easier for me to wait for the patch to be accepted into the repository and build from that?
Just got the Mesa 19.0 update in Arch's official repository, and the issue seems to be fixed :) I have been running Mario Party 9 in the main menu for many minutes now, where it would usually crash within 20-30 seconds. I will report back if it happens again, but if you don't hear back from me, consider this issue solved. Thank you very much for the amazing work you guys are doing on Mesa and RADV!
Thanks for confirming. Feel free to re-open if needed, hopefully not. :)