Bug 109393 - [vega10] hang with Mario Party 9 through Dolphin
Summary: [vega10] hang with Mario Party 9 through Dolphin
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon (show other bugs)
Version: 18.3
Hardware: Other All
: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-19 15:11 UTC by Bas Nieuwenhuizen
Modified: 2019-02-11 16:21 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Hang.report file, no crash. Contains an error though. (567 bytes, text/plain)
2019-01-30 18:37 UTC, Jaap Buurman
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Bas Nieuwenhuizen 2019-01-19 15:11:41 UTC
From https://bugs.freedesktop.org/show_bug.cgi?id=106151#c30 :

Still running into this issue, now while running Mario Party 9 through Dolphin. This is a particularly good test case, because I can reliably get it to crash in the main menu after seconds/minutes. This ONLY happens with the Vulkan renderer.

Versions: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.3-arch1-1-ARCH, LLVM 7.0.0)
Mesa: 18.3.1

I have also managed to get a stack trace this time, which is hopefully useful for debugging:

[  858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
[  858.970205] [drm] GPU recovery disabled.
[  982.906053] INFO: task kworker/u32:6:398 blocked for more than 120 seconds.
[  982.906055]       Not tainted 4.20.3-arch1-1-ARCH #1
[  982.906056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.906057] kworker/u32:6   D    0   398      2 0x80000000
[  982.906068] Workqueue: events_unbound commit_work [drm_kms_helper]
[  982.906069] Call Trace:
[  982.906075]  ? __schedule+0x29b/0x8b0
[  982.906077]  ? __switch_to_asm+0x40/0x70
[  982.906079]  schedule+0x32/0x90
[  982.906080]  schedule_timeout+0x311/0x4a0
[  982.906126]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
[  982.906167]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
[  982.906170]  dma_fence_default_wait+0x204/0x280
[  982.906172]  ? dma_fence_wait_timeout+0x120/0x120
[  982.906173]  dma_fence_wait_timeout+0x105/0x120
[  982.906175]  reservation_object_wait_timeout_rcu+0x1f2/0x370
[  982.906178]  ? preempt_count_add+0x79/0xb0
[  982.906221]  amdgpu_dm_do_flip+0x10d/0x370 [amdgpu]
[  982.906265]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[  982.906267]  ? _raw_spin_lock_irq+0x1a/0x40
[  982.906268]  ? wait_for_common+0x113/0x190
[  982.906269]  ? __switch_to_asm+0x34/0x70
[  982.906275]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  982.906278]  process_one_work+0x1eb/0x410
[  982.906280]  worker_thread+0x2d/0x3d0
[  982.906282]  ? process_one_work+0x410/0x410
[  982.906283]  kthread+0x112/0x130
[  982.906284]  ? kthread_park+0x80/0x80
[  982.906286]  ret_from_fork+0x22/0x40
[  982.906290] INFO: task kworker/u32:8:404 blocked for more than 120 seconds.
[  982.906290]       Not tainted 4.20.3-arch1-1-ARCH #1
[  982.906291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  982.906291] kworker/u32:8   D    0   404      2 0x80000000
[  982.906297] Workqueue: events_unbound commit_work [drm_kms_helper]
[  982.906298] Call Trace:
[  982.906300]  ? __schedule+0x29b/0x8b0
[  982.906301]  schedule+0x32/0x90
[  982.906302]  schedule_preempt_disabled+0x14/0x20
[  982.906303]  __ww_mutex_lock.isra.2+0x413/0x7f0
[  982.906329]  ? amdgpu_get_vblank_counter_kms+0x110/0x160 [amdgpu]
[  982.906370]  amdgpu_dm_do_flip+0xd2/0x370 [amdgpu]
[  982.906412]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
[  982.906414]  ? _raw_spin_lock_irq+0x1a/0x40
[  982.906415]  ? wait_for_common+0x113/0x190
[  982.906416]  ? __switch_to_asm+0x34/0x70
[  982.906422]  commit_tail+0x3d/0x70 [drm_kms_helper]
[  982.906424]  process_one_work+0x1eb/0x410
[  982.906425]  worker_thread+0x2d/0x3d0
[  982.906427]  ? process_one_work+0x410/0x410
[  982.906428]  kthread+0x112/0x130
[  982.906429]  ? kthread_park+0x80/0x80
[  982.906431]  ret_from_fork+0x22/0x40




Please let me know if I can help debugging. The fact I can get it to crash reliably and easily should help immensely.
Comment 1 Jaap Buurman 2019-01-29 17:51:52 UTC
Still happening with Mesa 18.3.2, LVM 7.0.1 and 4.20.5 kernel.
Comment 2 Samuel Pitoiset 2019-01-30 09:41:03 UTC
Just tried to reproduce with these settings

OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.27.0, 4.20.5-arch1-1-ARCH, LLVM 7.0.1)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.3.2

No hangs so far with the Vulkan backend.

Maybe you use some "Enhancements" settings?
Comment 3 Jaap Buurman 2019-01-30 16:29:00 UTC
I am using a stock install of Dolphin. The only thing I changed was switching from the OpenGL backend to the Vulkan backend. Are you using Vega 64 or Vega 56?
Comment 4 Jaap Buurman 2019-01-30 16:30:18 UTC
I cannot edit my previous post: Did you test with Mario party as well? I am in Europe, and am therefor using a PAL version. Not sure if that matters.
Comment 5 Samuel Pitoiset 2019-01-30 16:33:25 UTC
Yes, I tried Mario Party 9 on my vega 56.
Comment 6 Jaap Buurman 2019-01-30 16:36:05 UTC
Is there anything I can do, test or research to find out why this is happening? Could this be a Vega 64 bug that doesn't affect Vega 56?
Comment 7 Samuel Pitoiset 2019-01-30 16:40:15 UTC
Unlikely.

What you can do is to capture a hang report as follows:

export RADV_TRACE_FILE=$HOME/hang.trace
export RADV_DEBUG=syncshaders,vmfaults,allbos
dolphin-emu &> hang.report

And attach the files here.
Comment 8 Jaap Buurman 2019-01-30 18:23:45 UTC
I've set the environment variables and started Dolphin as instructed. However, I can no longer get it to crash. The FPS is around 4-6 FPS, instead of the regular 60 FPS, probably because a trace is being generated.

Is there anything different between having debugging enabled vs disabled? I've let it run for 30-60 minutes without a crash now.
Comment 9 Jaap Buurman 2019-01-30 18:37:14 UTC
Created attachment 143257 [details]
Hang.report file, no crash. Contains an error though.

I have just closed the emulator, since it still hadn't crashed. The hang.report file does contain an error, but it looks innocent. I have attached it just in case.
Comment 10 Samuel Pitoiset 2019-02-11 14:38:37 UTC
Do you have anything in dmesg when it freezes?
Comment 11 Samuel Pitoiset 2019-02-11 14:41:29 UTC
Can you reproduce the problem with "export RADV_DEBUG=zerovram" ?
Comment 12 Jaap Buurman 2019-02-11 16:00:41 UTC
I can still reproduce the hang with "export RADV_DEBUG=zerovram". The following message is present in dmesg:

[ 3638.209907] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=167782, emitted seq=167784
[ 3638.209910] [drm] GPU recovery disabled.
Comment 13 Samuel Pitoiset 2019-02-11 16:19:30 UTC
I definitely can't reproduce the problem on my Vega 56.
Can you attach the output of "R600_DEBUG=info glxgears"?
Comment 14 Jaap Buurman 2019-02-11 16:21:43 UTC
[jaap@Antergos-Desktop ~]$ R600_DEBUG=info glxgears
Device info:
    pci (domain:bus:dev.func): 0000:09:00.0
    pci_id = 0x687f
    family = 68
    chip_class = 11
    num_compute_rings = 4
    num_sdma_rings = 2
    clock_crystal_freq = 27000
    tcc_cache_line_size = 64
Memory info:
    pte_fragment_size = 2097152
    gart_page_size = 4096
    gart_size = 8176 MB
    vram_size = 8176 MB
    vram_vis_size = 256 MB
    gds_size = 64 kB
    gds_gfx_partition_size = 4 kB
    max_alloc_size = 6541 MB
    min_alloc_size = 512
    address32_hi = 4294934528
    has_dedicated_vram = 1
CP info:
    gfx_ib_pad_with_type2 = 0
    ib_start_alignment = 256
    me_fw_version = 156
    me_fw_feature = 42
    pfp_fw_version = 177
    pfp_fw_feature = 42
    ce_fw_version = 77
    ce_fw_feature = 42
Multimedia info:
    has_hw_decode = 1
    uvd_enc_supported = 1
    uvd_fw_version = 22483200
    vce_fw_version = 922944512
    vce_harvest_config = 0
Kernel & winsys capabilities:
    drm = 3.27.0
    has_userptr = 1
    has_syncobj = 1
    has_syncobj_wait_for_submit = 1
    has_fence_to_handle = 1
    has_ctx_priority = 1
    has_local_buffers = 0
    kernel_flushes_hdp_before_ib = 1
    htile_cmask_support_1d_tiling = 1
    si_TA_CS_BC_BASE_ADDR_allowed = 1
    has_bo_metadata = 1
    has_gpu_reset_status_query = 1
    has_gpu_reset_counter_query = 0
    has_eqaa_surface_allocator = 1
    has_format_bc1_through_bc7 = 1
    kernel_flushes_tc_l2_after_ib = 1
    has_indirect_compute_dispatch = 1
    has_unaligned_shader_loads = 1
    has_sparse_vm_mappings = 0
    has_2d_tiling = 1
    has_read_registers_query = 1
Shader core info:
    max_shader_clock = 1630
    num_good_compute_units = 64
    num_good_cu_per_sh = 16
    num_tcc_blocks = 16
    max_se = 4
    max_sh_per_se = 1
Render backend info:
    num_render_backends = 16
    num_tile_pipes = 4
    pipe_interleave_bytes = 256
    enabled_rb_mask = 0xffff
    max_alignment = 65536
GB_ADDR_CONFIG:
    num_pipes = 4
    pipe_interleave_size = 256
    max_compressed_frags = 2
    bank_interleave_size = 1
    num_banks = 16
    shader_engine_tile_size = 32
    num_shader_engines = 4
    num_gpus = 0 (raw)
    multi_gpu_tile_size = 2 (raw)
    num_rb_per_se = 4
    row_size = 4096
    num_lower_pipes = 0 (raw)
    se_enable = 0 (raw)
471 frames in 5.0 seconds = 94.092 FPS
300 frames in 5.0 seconds = 59.908 FPS
300 frames in 5.0 seconds = 59.906 FPS
300 frames in 5.0 seconds = 59.911 FPS
300 frames in 5.0 seconds = 59.907 FPS
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 1882 requests (65 known processed) with 0 events remaining.


The error was me closing the Window (or should I have left it running?).


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.