Bug 109393

Summary: [vega10] hang with Mario Party 9 through Dolphin
Product: Mesa Reporter: Bas Nieuwenhuizen <bas>
Component: Drivers/Vulkan/radeonAssignee: mesa-dev
Status: RESOLVED FIXED QA Contact: mesa-dev
Severity: normal    
Priority: medium CC: jaapbuurman
Version: 18.3   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Hang.report file, no crash. Contains an error though.

Description Bas Nieuwenhuizen 2019-01-19 15:11:41 UTC
From https://bugs.freedesktop.org/show_bug.cgi?id=106151#c30 :

Still running into this issue, now while running Mario Party 9 through Dolphin. This is a particularly good test case, because I can reliably get it to crash in the main menu after seconds/minutes. This ONLY happens with the Vulkan renderer.

Versions: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.3-arch1-1-ARCH, LLVM 7.0.0)
Mesa: 18.3.1

I have also managed to get a stack trace this time, which is hopefully useful for debugging:

[  858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
[  858.970205] [drm] GPU recovery disabled.
[  982.906053] INFO: task kworker/u32:6:398 blocked for more than 120 seconds.
[  982.906055]       Not tainted 4.20.3-arch1-1-ARCH #1
[  982.906056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.906057] kworker/u32:6   D    0   398      2 0x80000000
[  982.906068] Workqueue: events_unbound commit_work [drm_kms_helper]
[  982.906069] Call Trace:
[  982.906075]  ? __schedule+0x29b/0x8b0
[  982.906077]  ? __switch_to_asm+0x40/0x70
[  982.906079]  schedule+0x32/0x90
[  982.906080]  schedule_timeout+0x311/0x4a0
[  982.906126]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[  982.906167]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[  982.906170]  dma_fence_default_wait+0x204/0x280
[  982.906172]  ? dma_fence_wait_timeout+0x120/0x120
[  982.906173]  dma_fence_wait_timeout+0x105/0x120
[  982.906175]  reservation_object_wait_timeout_rcu+0x1f2/0x370
[  982.906178]  ? preempt_count_add+0x79/0xb0
[  982.906221]  amdgpu_dm_do_flip+0x10d/0x370 [amdgpu]
[  982.906265]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[  982.906267]  ? _raw_spin_lock_irq+0x1a/0x40
[  982.906268]  ? wait_for_common+0x113/0x190
[  982.906269]  ? __switch_to_asm+0x34/0x70
[  982.906275]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  982.906278]  process_one_work+0x1eb/0x410
[  982.906280]  worker_thread+0x2d/0x3d0
[  982.906282]  ? process_one_work+0x410/0x410
[  982.906283]  kthread+0x112/0x130
[  982.906284]  ? kthread_park+0x80/0x80
[  982.906286]  ret_from_fork+0x22/0x40
[  982.906290] INFO: task kworker/u32:8:404 blocked for more than 120 seconds.
[  982.906290]       Not tainted 4.20.3-arch1-1-ARCH #1
[  982.906291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.906291] kworker/u32:8   D    0   404      2 0x80000000
[  982.906297] Workqueue: events_unbound commit_work [drm_kms_helper]
[  982.906298] Call Trace:
[  982.906300]  ? __schedule+0x29b/0x8b0
[  982.906301]  schedule+0x32/0x90
[  982.906302]  schedule_preempt_disabled+0x14/0x20
[  982.906303]  __ww_mutex_lock.isra.2+0x413/0x7f0
[  982.906329]  ? amdgpu_get_vblank_counter_kms+0x110/0x160 [amdgpu]
[  982.906370]  amdgpu_dm_do_flip+0xd2/0x370 [amdgpu]
[  982.906412]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[  982.906414]  ? _raw_spin_lock_irq+0x1a/0x40
[  982.906415]  ? wait_for_common+0x113/0x190
[  982.906416]  ? __switch_to_asm+0x34/0x70
[  982.906422]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  982.906424]  process_one_work+0x1eb/0x410
[  982.906425]  worker_thread+0x2d/0x3d0
[  982.906427]  ? process_one_work+0x410/0x410
[  982.906428]  kthread+0x112/0x130
[  982.906429]  ? kthread_park+0x80/0x80
[  982.906431]  ret_from_fork+0x22/0x40




Please let me know if I can help debugging. The fact I can get it to crash reliably and easily should help immensely.
Comment 1 Jaap Buurman 2019-01-29 17:51:52 UTC
Still happening with Mesa 18.3.2, LVM 7.0.1 and 4.20.5 kernel.
Comment 2 Samuel Pitoiset 2019-01-30 09:41:03 UTC
Just tried to reproduce with these settings

OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.5-arch1-1-ARCH, LLVM 7.0.1)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.2

No hangs so far with the Vulkan backend.

Maybe you use some "Enhancements" settings?
Comment 3 Jaap Buurman 2019-01-30 16:29:00 UTC
I am using a stock install of Dolphin. The only thing I changed was switching from the OpenGL backend to the Vulkan backend. Are you using Vega 64 or Vega 56?
Comment 4 Jaap Buurman 2019-01-30 16:30:18 UTC
I cannot edit my previous post: Did you test with Mario party as well? I am in Europe, and am therefor using a PAL version. Not sure if that matters.
Comment 5 Samuel Pitoiset 2019-01-30 16:33:25 UTC
Yes, I tried Mario Party 9 on my vega 56.
Comment 6 Jaap Buurman 2019-01-30 16:36:05 UTC
Is there anything I can do, test or research to find out why this is happening? Could this be a Vega 64 bug that doesn't affect Vega 56?
Comment 7 Samuel Pitoiset 2019-01-30 16:40:15 UTC
Unlikely.

What you can do is to capture a hang report as follows:

export RADV_TRACE_FILE=$HOME/hang.trace
export RADV_DEBUG=syncshaders,vmfaults,allbos
dolphin-emu &> hang.report

And attach the files here.
Comment 8 Jaap Buurman 2019-01-30 18:23:45 UTC
I've set the environment variables and started Dolphin as instructed. However, I can no longer get it to crash. The FPS is around 4-6 FPS, instead of the regular 60 FPS, probably because a trace is being generated.

Is there anything different between having debugging enabled vs disabled? I've let it run for 30-60 minutes without a crash now.
Comment 9 Jaap Buurman 2019-01-30 18:37:14 UTC
Created attachment 143257 [details]
Hang.report file, no crash. Contains an error though.

I have just closed the emulator, since it still hadn't crashed. The hang.report file does contain an error, but it looks innocent. I have attached it just in case.
Comment 10 Samuel Pitoiset 2019-02-11 14:38:37 UTC
Do you have anything in dmesg when it freezes?
Comment 11 Samuel Pitoiset 2019-02-11 14:41:29 UTC
Can you reproduce the problem with "export RADV_DEBUG=zerovram" ?
Comment 12 Jaap Buurman 2019-02-11 16:00:41 UTC
I can still reproduce the hang with "export RADV_DEBUG=zerovram". The following message is present in dmesg:

[ 3638.209907] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=167782, emitted seq=167784
[ 3638.209910] [drm] GPU recovery disabled.
Comment 13 Samuel Pitoiset 2019-02-11 16:19:30 UTC
I definitely can't reproduce the problem on my Vega 56.
Can you attach the output of "R600_DEBUG=info glxgears"?
Comment 14 Jaap Buurman 2019-02-11 16:21:43 UTC
[jaap@Antergos-Desktop ~]$ R600_DEBUG=info glxgears
Device info:
    pci (domain:bus:dev.func): 0000:09:00.0
    pci_id = 0x687f
    family = 68
    chip_class = 11
    num_compute_rings = 4
    num_sdma_rings = 2
    clock_crystal_freq = 27000
    tcc_cache_line_size = 64
Memory info:
    pte_fragment_size = 2097152
    gart_page_size = 4096
    gart_size = 8176 MB
    vram_size = 8176 MB
    vram_vis_size = 256 MB
    gds_size = 64 kB
    gds_gfx_partition_size = 4 kB
    max_alloc_size = 6541 MB
    min_alloc_size = 512
    address32_hi = 4294934528
    has_dedicated_vram = 1
CP info:
    gfx_ib_pad_with_type2 = 0
    ib_start_alignment = 256
    me_fw_version = 156
    me_fw_feature = 42
    pfp_fw_version = 177
    pfp_fw_feature = 42
    ce_fw_version = 77
    ce_fw_feature = 42
Multimedia info:
    has_hw_decode = 1
    uvd_enc_supported = 1
    uvd_fw_version = 22483200
    vce_fw_version = 922944512
    vce_harvest_config = 0
Kernel & winsys capabilities:
    drm = 3.27.0
    has_userptr = 1
    has_syncobj = 1
    has_syncobj_wait_for_submit = 1
    has_fence_to_handle = 1
    has_ctx_priority = 1
    has_local_buffers = 0
    kernel_flushes_hdp_before_ib = 1
    htile_cmask_support_1d_tiling = 1
    si_TA_CS_BC_BASE_ADDR_allowed = 1
    has_bo_metadata = 1
    has_gpu_reset_status_query = 1
    has_gpu_reset_counter_query = 0
    has_eqaa_surface_allocator = 1
    has_format_bc1_through_bc7 = 1
    kernel_flushes_tc_l2_after_ib = 1
    has_indirect_compute_dispatch = 1
    has_unaligned_shader_loads = 1
    has_sparse_vm_mappings = 0
    has_2d_tiling = 1
    has_read_registers_query = 1
Shader core info:
    max_shader_clock = 1630
    num_good_compute_units = 64
    num_good_cu_per_sh = 16
    num_tcc_blocks = 16
    max_se = 4
    max_sh_per_se = 1
Render backend info:
    num_render_backends = 16
    num_tile_pipes = 4
    pipe_interleave_bytes = 256
    enabled_rb_mask = 0xffff
    max_alignment = 65536
GB_ADDR_CONFIG:
    num_pipes = 4
    pipe_interleave_size = 256
    max_compressed_frags = 2
    bank_interleave_size = 1
    num_banks = 16
    shader_engine_tile_size = 32
    num_shader_engines = 4
    num_gpus = 0 (raw)
    multi_gpu_tile_size = 2 (raw)
    num_rb_per_se = 4
    row_size = 4096
    num_lower_pipes = 0 (raw)
    se_enable = 0 (raw)
471 frames in 5.0 seconds = 94.092 FPS
300 frames in 5.0 seconds = 59.908 FPS
300 frames in 5.0 seconds = 59.906 FPS
300 frames in 5.0 seconds = 59.911 FPS
300 frames in 5.0 seconds = 59.907 FPS
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 1882 requests (65 known processed) with 0 events remaining.


The error was me closing the Window (or should I have left it running?).
Comment 15 Samuel Pitoiset 2019-03-07 10:47:16 UTC
Does the attached patch help https://patchwork.freedesktop.org/patch/290846/?series=57689&rev=1 ?
Comment 16 Jaap Buurman 2019-03-10 11:10:32 UTC
I am only familiar with Building mesa through the AUR (Mesa-Git). Can I easily apply that patch somehow in the PKGBUILD or would it be easier for me to wait for the patch to be accepted into the repository and build from that?
Comment 17 Jaap Buurman 2019-03-19 17:01:06 UTC
Just got the Mesa 19.0 update in Arch's official repository, and the issue seems to be fixed :) I have been running Mario Party 9 in the main menu for many minutes now, where it would usually crash within 20-30 seconds. I will report back if it happens again, but if you don't hear back from me, consider this issue solved. Thank you very much for the amazing work you guys are doing on Mesa and RADV!
Comment 18 Samuel Pitoiset 2019-03-19 18:37:03 UTC
Thanks for confirming. Feel free to re-open if needed, hopefully not. :)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.