Summary: | [KBL-G][Vulkan] Non-recoverable GPU hangs with GfxBench v5 Aztec Ruins Vulkan test | ||
---|---|---|---|
Product: | Mesa | Reporter: | Eero Tamminen <eero.t.tamminen> |
Component: | Drivers/Vulkan/radeon | Assignee: | mesa-dev |
Status: | VERIFIED WONTFIX | QA Contact: | mesa-dev |
Severity: | critical | ||
Priority: | high | CC: | cstout, keramidasceid |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
See Also: | https://bugs.freedesktop.org/show_bug.cgi?id=108898 | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Bug Depends on: | 109920 | ||
Bug Blocks: | |||
Attachments: |
Hang trace
70MB output from the RADV debug options (compressed) |
Description
Eero Tamminen
2018-11-29 13:11:35 UTC
Yes, this also messes up other things, not just 3D (after this issue, a script using pycurl to upload test results just sits in poll() instead of working, so I think something on the kernel side gets corrupted).

The link is dead. If you have the demo, can you upload it somewhere?

(In reply to Samuel Pitoiset from comment #2)
> The link is dead, if you have the demo can you upload it somewhere?

It still worked when I filed this (and had worked for years before). You can still get the page from Google cache:
http://webcache.googleusercontent.com/search?q=cache:https://gfxbench.com/

As you can see, there isn't yet a public Linux version of GfxBench v5, only Android, iOS, MacOS and Windows versions. And I naturally can't provide the proprietary version.

Doesn't Valve have licenses to industry-standard 3D benchmarks (of which GfxBench is the main one on mobile and, as a result, nowadays important on desktop too)?

If not, you could try the Windows version with Wine once the site works again. If the Windows version supports Vulkan and Wine doesn't mangle its API calls for Linux, you might be able to trigger the issue (going through DX -> DXVK probably isn't good enough).

Or, if there's some Linux Android container that passes Vulkan calls through, you could try the Android version:
https://play.google.com/store/apps/details?id=com.glbenchmark.glbenchmark27

Here's some extra info on the Aztec Ruins benchmark:
https://www.anandtech.com/show/13271/kishonti-releases-vulkan-gfxbench-5

FYI: the https://gfxbench.com/ site works again.

*** Bug 109058 has been marked as a duplicate of this bug.
***

Ubuntu 18.04 got LLVM 7 so I checked this again with latest Mesa (4f0a3c9f9eda65), and the hangs are still happening:
-----------------------------------------------
[61150.480135] Iteration 1/1: testfw_app --gfx vulkan --gl_api vulkan --width 1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal
[61214.327444] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327446] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001001F4
[61214.327447] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A08800C
[61214.327449] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[61214.327456] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0840c for process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327457] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001001FE
[61214.327458] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A10800C
[61214.327459] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at page 1049086, read from 'TC2' (0x54433200) (264)
[61214.327534] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fb0840c for process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327535] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[61214.327535] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A18802C
[61214.327537] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 5, pasid 32772) at page 0, read from 'TC0' (0x54433000) (392)
[61224.391478] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=11267, emitted seq=11269
[61224.391506] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process testfw_app pid 26882 thread testfw_app pid 26883
[61224.391509] amdgpu 0000:01:00.0: GPU reset begin!
[61233.409935] WARNING: CPU: 6 PID: 26883 at kernel/kthread.c:501 kthread_park+0x76/0x90
[61233.409937] Modules linked in: fuse snd_hda_codec_realtek snd_hda_codec_generic amdgpu i915 x86_pkg_temp_thermal coretemp snd_hda_codec_hdmi snd_hda_intel snd_hda_codec e1000e crct10dif_pclmul snd_hwdep snd_hda_core crc32_pclmul chash gpu_sched snd_pcm ttm igb mei_me mei pinctrl_sunrisepoint pinctrl_intel
[61233.409954] CPU: 6 PID: 26883 Comm: testfw_app Not tainted 5.0.0-rc5-CI-Nightly_1639+ #1
[61233.409956] Hardware name: Intel Corporation NUC8i7HVK/NUC8i7HVB, BIOS HNKBLi70.86A.0053.2018.1217.1739 12/17/2018
[61233.409959] RIP: 0010:kthread_park+0x76/0x90
[61233.409961] Code: 00 48 89 df e8 eb b6 00 00 48 85 c0 74 25 31 c0 5b 5d c3 0f 0b a8 04 48 8b af f0 04 00 00 74 b0 0f 0b b8 da ff ff ff 5b 5d c3 <0f> 0b b8 f0 ff ff ff eb dd 0f 0b eb d9 0f 1f 00 66 2e 0f 1f 84 00
[61233.409963] RSP: 0018:ffffc900016f7b80 EFLAGS: 00010202
[61233.409965] RAX: 0000000000000004 RBX: ffff888466df68c0 RCX: 0000000000000000
[61233.409966] RDX: ffff888459f46ef0 RSI: ffff888466df68c0 RDI: ffff88846b7e0dc0
[61233.409968] RBP: ffff88845f8694e0 R08: 000037b103a4e400 R09: 0000000000000139
[61233.409969] R10: ffffc900000f7dd0 R11: 00000000000029ec R12: ffff888450031c00
[61233.409970] R13: ffff888459f427c0 R14: 0000000000000202 R15: ffff888450031cc8
[61233.409972] FS: 00007f7c03838700(0000) GS:ffff88846eb80000(0000) knlGS:0000000000000000
[61233.409974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61233.409975] CR2: 00007f895193cca0 CR3: 000000000320e005 CR4: 00000000003606e0
[61233.409976] Call Trace:
[61233.409985] drm_sched_entity_fini+0x35/0x180 [gpu_sched]
[61233.410050] amdgpu_vm_fini+0xba/0x520 [amdgpu]
[61233.410056] ? idr_destroy+0x7a/0xc0
[61233.410098] amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
[61233.410104] drm_file_free.part.0+0x21b/0x300
[61233.410108] drm_release+0x9d/0x110
[61233.410113] __fput+0xa2/0x1d0
[61233.410117] task_work_run+0x84/0xa0
[61233.410121] do_exit+0x308/0xba0
[61233.410126] do_group_exit+0x33/0xa0
[61233.410128] get_signal+0x203/0x5f0
[61233.410133] do_signal+0x30/0x6c0
[61233.410137] ? do_vfs_ioctl+0xa4/0x630
[61233.410140] ? __do_munmap+0x308/0x400
[61233.410144] exit_to_usermode_loop+0x96/0xb0
[61233.410147] do_syscall_64+0xe7/0x100
[61233.410150] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[61233.410153] RIP: 0033:0x7f7c05c3a5d7
[61233.410159] Code: Bad RIP value.
[61233.410160] RSP: 002b:00007f7c03836da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[61233.410162] RAX: fffffffffffffe00 RBX: 00007f7c03836e34 RCX: 00007f7c05c3a5d7
[61233.410164] RDX: 00007f7c03836de0 RSI: 00000000c0206449 RDI: 0000000000000006
[61233.410165] RBP: 00007f7c03836de0 R08: 0000000000000000 R09: 0000000000000000
[61233.410167] R10: 00000000005544cc R11: 0000000000000246 R12: 00000000c0206449
[61233.410168] R13: 0000000000000006 R14: 00007f7c03836e60 R15: 00007f7bfc424ad8
[61233.410174] WARNING: CPU: 6 PID: 26883 at kernel/kthread.c:501 kthread_park+0x76/0x90
[61233.410175] ---[ end trace 1a47caa3a1dbe0cf ]---
-----------------------------------------------

PS. Outputting the test commands to dmesg before they start is IMHO a nice way to find out which of the automated tests triggers hangs.
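The PS above describes writing each test command into the kernel log before the test starts, which is how the "Iteration 1/1: testfw_app ..." entry ends up in dmesg. A minimal Python sketch of that idea follows. The function names `format_marker` and `announce` are hypothetical, and the marker layout is only inferred from the log entry above; it is not the reporter's actual script. Writing to /dev/kmsg normally requires root.

```python
import shlex
from typing import Sequence


def format_marker(iteration: int, total: int, argv: Sequence[str]) -> str:
    """Build a one-line marker shaped like the 'Iteration 1/1: ...' log entry.

    The layout is an assumption inferred from the dmesg excerpt in this report.
    """
    return f"Iteration {iteration}/{total}: {shlex.join(argv)}"


def announce(marker: str, kmsg_path: str = "/dev/kmsg") -> None:
    """Write the marker as a single record to the kernel log.

    kmsg_path is a parameter so the function can be pointed at a plain file
    when trying it out without root privileges.
    """
    with open(kmsg_path, "w") as kmsg:
        kmsg.write(marker + "\n")
```

A test runner would call `announce(format_marker(...))` immediately before launching each benchmark, so that any later GPU fault in dmesg appears right after the command that triggered it.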
Still hangs with latest drm-tip kernel v5.0 and Mesa git from yesterday:
----------------------------------------------------
[28648.228909] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for process testfw_app pid 17622 thread testfw_app pid 17623
[28648.228911] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001001F4
[28648.228912] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A08800C
[28648.228914] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[28648.228920] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0840c for process testfw_app pid 17622 thread testfw_app pid 17623
[28648.228921] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001001FE
[28648.228922] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A08800C
[28648.228923] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at page 1049086, read from 'TC4' (0x54433400) (136)
[28648.229002] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fb1880c for process testfw_app pid 17622 thread testfw_app pid 17623
[28648.229003] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001001F5
[28648.229004] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A18802C
[28648.229005] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 5, pasid 32772) at page 1049077, read from 'TC0' (0x54433000) (392)
[28657.458604] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[28658.492490] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=164402, emitted seq=164404
[28658.492519] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process testfw_app pid 17622 thread testfw_app pid 17623
[28658.492521] amdgpu 0000:01:00.0: GPU reset begin!
[28662.578695] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[28663.014035] cp is busy, skip halt cp
[28663.202474] rlc is busy, skip halt rlc
[28663.203486] amdgpu 0000:01:00.0: GPU pci config reset
[28663.215357] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[28663.215407] [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
[28663.215444] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[28663.292813] [drm] UVD and UVD ENC initialized successfully.
[28663.393739] [drm] VCE initialized successfully.
[28663.403269] [drm] recover vram bo from shadow start
[28663.408650] [drm] recover vram bo from shadow done
[28663.408651] [drm] Skip scheduling IBs!
[28663.408652] [drm] Skip scheduling IBs!
[28663.408669] [drm] Skip scheduling IBs!
[28663.408709] amdgpu 0000:01:00.0: GPU reset(2) succeeded!
[28663.408849] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[28663.452225] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
...
----------------------------------------------------

Again, without the demo it is hard to fix.

Can you try 'export RADV_DEBUG=nodcc,nohiz,zerovram,nofastclears'?

If it still hangs, generating a hang report might help:
export RADV_TRACE_FILE=$HOME/hang.trace
export RADV_DEBUG=allbos,vmfaults,zerovram,syncshaders

Created attachment 143572 [details]
Hang trace

(In reply to Samuel Pitoiset from comment #8)
> Again, without the demo is hard to fix.

While GfxBench v5 / Aztec Ruins still seems to be proprietary on desktop Linux (available for free only on Windows & Android), the (recoverable) Manhattan hangs in bug 108898 can be tested with the public GfxBench v4 version.

> Can you try 'export RADV_DEBUG=nodcc,nohiz,zerovram,nofastclears' ?
>
> If it still hangs

Yes, it still hangs, just less verbosely.
dmesg:
[ 546.116535] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for process testfw_app pid 1859 thread testfw_app pid 1860
[ 546.116538] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001001F4
[ 546.116539] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0808800C
[ 546.116541] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[ 556.201073] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=11253, emitted seq=11254
[ 556.201101] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process testfw_app pid 1859 thread testfw_app pid 1860
[ 556.201104] amdgpu 0000:01:00.0: GPU reset begin!
[ 556.616910] cp is busy, skip halt cp
[ 556.805398] rlc is busy, skip halt rlc
[ 556.806410] amdgpu 0000:01:00.0: GPU pci config reset
[ 556.818925] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[ 556.818962] [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
[ 556.818991] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[ 556.896623] [drm] UVD and UVD ENC initialized successfully.
[ 556.997551] [drm] VCE initialized successfully.
[ 557.007168] [drm] recover vram bo from shadow start
[ 557.012867] [drm] recover vram bo from shadow done
[ 557.012869] [drm] Skip scheduling IBs!
[ 557.012956] amdgpu 0000:01:00.0: GPU reset(2) succeeded!
[ 557.013063] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
...

Application:
--------------
Warm up
Generate SH shader...
Workgroup size: 8
compile deferred_irradiance_volumes/m_envprobe_generate_sh_compute.shader... done
amdgpu: radv_amdgpu_cs_query_fence_status failed.
glVkError: 2 line: 4329 func: Finish
amdgpu: radv_amdgpu_cs_query_fence_status failed.
glVkError: 2 line: 4219 func: BeginCommandBuffer
amdgpu: The CS has been rejected, see dmesg for more information.
vk: error: failed to submit CS 0
--------------

> generating a hang report might help
> export RADV_TRACE_FILE=$HOME/hang.trace
> export RADV_DEBUG=allbos,vmfaults,zerovram,syncshaders

Hang trace attached.

Created attachment 143573 [details]
70MB output from the RADV debug options (compressed)
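The "VM fault" lines in the dmesg excerpts above carry two details that are easy to cross-check by hand: in these logs the decimal page number matches the hex value on the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR line, and the hex word after the client name is the client name as ASCII bytes. The sketch below parses one such line and performs both conversions. The function name, regex and returned field names are hypothetical and are based only on the line format seen in this report, not on the amdgpu source.

```python
import re

# Matches the "VM fault (...)" lines exactly as they appear in this report.
FAULT_RE = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+), "
    r"pasid (?P<pasid>\d+)\) at page (?P<page>\d+), "
    r"(?P<access>read|write) from '(?P<client>[^']*)' "
    r"\((?P<client_hex>0x[0-9a-fA-F]+)\)"
)


def parse_vm_fault(line):
    """Return a dict describing one VM fault line, or None if it does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    page = int(match["page"])
    client_word = int(match["client_hex"], 16)
    # Interpret the 32-bit client word as big-endian ASCII, dropping NUL padding.
    decoded = client_word.to_bytes(4, "big").rstrip(b"\0").decode("ascii", "replace")
    return {
        "vmid": int(match["vmid"]),
        "pasid": int(match["pasid"]),
        "access": match["access"],
        "page": page,
        # Same formatting as the VM_CONTEXT1_PROTECTION_FAULT_ADDR line.
        "page_hex": f"0x{page:08X}",
        "client": match["client"],
        "client_decoded": decoded,
    }
```

Applied to the faults in this report, page 1049076 formats as 0x001001F4 and 0x54433400 decodes to "TC4", which makes it quick to confirm that the runs with different RADV_DEBUG settings fault on the same page.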
First time I see a shader like that...

Can you install spirv-dis and generate a new hang report, please? The SPIR-V is probably useful too.

(In reply to Samuel Pitoiset from comment #11)
> First time I see a shader like that...
>
> Can you install spirv-dis and generate a new hang report, please? The SPIR-V
> is probably useful too.

Sorry, shortly after filing this bug, I stopped having extra time for 3D bugs. I'll still file bugs on larger issues I notice, and can verify the fixed ones, but won't spend time investigating them (it would have helped if spirv-tools had been available in Ubuntu :-/).

I asked for a copy of this benchmark (the Linux version) and they told me that it will no longer be supported; don't expect further releases of Aztec. Closing because we just can't debug it.

(In reply to Samuel Pitoiset from comment #13)
> I asked for a copy of this benchmark (the Linux version) and they said me
> that it will no longer be supported, don't expect further releases of Aztec.
> Closing because we just can't debug it.
After looking at kernel firmware repo, I wonder whether the problem is firmware:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

VegaM hasn't been updated since it has been added, almost a year ago:
----------------------------------------------------
$ git log --format=fuller vegam*
commit 153a51e438cafe07610b28db0304b1721b91d847
Author:     Alex Deucher <alexander.deucher@amd.com>
AuthorDate: Tue Jul 10 15:53:14 2018 -0500
Commit:     Josh Boyer <jwboyer@kernel.org>
CommitDate: Tue Jul 17 07:54:55 2018 -0400

    amdgpu: add initial VegaM firmware
----------------------------------------------------

Whereas other Vega firmware was updated this month:
----------------------------------------------------
$ git log -1 --format=fuller vega*
commit 92e17d0dd2437140fab044ae62baf69b35d7d1fa (HEAD -> master, origin/master, origin/HEAD)
Author:     Alex Deucher <alexander.deucher@amd.com>
AuthorDate: Mon Apr 29 08:50:27 2019 -0500
Commit:     Josh Boyer <jwboyer@kernel.org>
CommitDate: Thu May 2 06:24:19 2019 -0400

    amdgpu: update vega20 to the latest 19.10 firmware
----------------------------------------------------

As was previous generation:
----------------------------------------------------
$ git log -1 --format=fuller polaris*
commit 4ea5c73b96ed4a508f90047e22ccbaa477481310
Author:     Alex Deucher <alexander.deucher@amd.com>
AuthorDate: Mon Apr 29 08:47:55 2019 -0500
Commit:     Josh Boyer <jwboyer@kernel.org>
CommitDate: Thu May 2 06:23:54 2019 -0400

    amdgpu: update polaris11 to the latest 19.10 firmware
----------------------------------------------------

Even two generations older cards have newer update:
----------------------------------------------------
$ git log -1 --format=fuller tonga_* fiji_* carrizo_* stoney_*
commit fcd5a5f14abf1c0202abb8dc6b98ddb2ff23c359
Author:     Alex Deucher <alexander.deucher@amd.com>
AuthorDate: Tue Oct 23 16:35:58 2018 -0500
Commit:     Josh Boyer <jwboyer@kernel.org>
CommitDate: Fri Oct 26 08:08:40 2018 -0400

    amdgpu: update fiji firmware to 18.40
----------------------------------------------------

(In reply to Eero Tamminen from comment #14)
>
> After looking at kernel firmware repo, I wonder whether the problem is
> firmware:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
>
> VegaM hasn't been updated since it has been added, almost a year ago:

It hasn't been updated because there have not been any changes internally. I always update all asic firmware when updates are available. |
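The firmware comparison in comment #14 was done by reading `git log --format=fuller` output by eye. As a rough sketch, the same comparison can be made by extracting the CommitDate fields and subtracting them. The helper names below are hypothetical, and the date format string assumes git's default date layout as shown in the output above, parsed under an English (C) locale.

```python
import re
from datetime import datetime

# "CommitDate:" lines as printed by `git log --format=fuller`.
COMMIT_DATE_RE = re.compile(r"^CommitDate:\s*(.+)$", re.MULTILINE)
# Assumed layout of git's default date output, e.g. "Tue Jul 17 07:54:55 2018 -0400".
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def newest_commit_date(git_log_output):
    """Return the most recent CommitDate found in the log text, or None."""
    dates = [
        datetime.strptime(raw.strip(), GIT_DATE_FORMAT)
        for raw in COMMIT_DATE_RE.findall(git_log_output)
    ]
    return max(dates) if dates else None


def firmware_lag(older_log, newer_log):
    """Return how much later the newest commit in newer_log is, as a timedelta."""
    older = newest_commit_date(older_log)
    newer = newest_commit_date(newer_log)
    if older is None or newer is None:
        return None
    return newer - older
```

As the reply above points out, a gap between commit dates only shows when files were last touched; it does not by itself indicate that newer firmware exists for that ASIC.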