Bug 108900 - Non-recoverable GPU hangs with GfxBench v5 Aztec Ruins Vulkan test
Summary: Non-recoverable GPU hangs with GfxBench v5 Aztec Ruins Vulkan test
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon
Version: git
Hardware: Other All
Importance: high critical
Assignee: mesa-dev
QA Contact: mesa-dev
Duplicates: 109058

Reported: 2018-11-29 13:11 UTC by Eero Tamminen
Modified: 2019-02-05 17:25 UTC (History)

Description Eero Tamminen 2018-11-29 13:11:35 UTC
Setup:
- FullHD monitor (through HDMI KVM)
- HadesCanyon KBL i7-8809G ([AMD/ATI] Vega [Radeon RX Vega M] (rev c0))
- Ubuntu 18.04
- drm-tip git kernel v4.20-rc4 (i.e. kernel.org v4.20-rc4 kernel + latest drm code from yesterday)
- Mesa git (c120dbfe4d)
- X server git version
- Proprietary GfxBench v5-GOLD2:  http://gfxbench.com

Test-case:
* bin/testfw_app --gfx vulkan --gl_api vulkan --width 1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal

Expected outcome:
* Works fine, like the Aztec Ruins GL version and Sascha Willems' Vulkan tests, with no GPU hangs

Actual outcome:
* Right after the test starts, the following appears in dmesg:
-----
[ 3057.480868] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for process testfw_app pid 2995 thread testfw_app pid 2997
[ 3057.480870] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001F4
[ 3057.480871] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C08800C
[ 3057.480873] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[ 3057.480879] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0840c for process testfw_app pid 2995 thread testfw_app pid 2997
[ 3057.480880] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001FD
[ 3057.480881] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C08400C
[ 3057.480883] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1049085, read from 'TC5' (0x54433500) (132)
[ 3057.480944] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa9080c for process testfw_app pid 2995 thread testfw_app pid 2997
[ 3057.480945] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3057.480946] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C18802C
[ 3057.480947] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 6, pasid 32772) at page 0, read from 'TC0' (0x54433000) (392)
[ 3067.564630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=53811, emitted seq=53814
[ 3067.564633] [drm] GPU recovery disabled.
-----
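
As a reading aid for the log above, here is a minimal Python sketch that decodes one "VM fault" line. It relies on two things visible in the log itself: the reported page number is the decimal form of VM_CONTEXT1_PROTECTION_FAULT_ADDR, and the hex client ID spells the client name in ASCII. The 4 KiB GPU page size used for the byte address is an assumption, and decode_vm_fault is a hypothetical helper name, not part of any amdgpu tooling.

```python
import re

# Matches the amdgpu "VM fault" summary line as it appears in the dmesg output above.
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+), pasid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)

GPU_PAGE_SHIFT = 12  # assumption: 4 KiB GPU pages


def decode_vm_fault(line):
    """Return a dict describing one VM fault line, or None if it does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group(4))
    client_id = int(m.group(7), 16)
    # The 32-bit client ID holds the client name as big-endian ASCII, NUL padded.
    client_ascii = (
        client_id.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", errors="replace")
    )
    return {
        "vmid": int(m.group(2)),
        "pasid": int(m.group(3)),
        "page": page,
        "page_hex": f"0x{page:08X}",             # compare with ..._FAULT_ADDR
        "byte_address": page << GPU_PAGE_SHIFT,  # depends on the page-size assumption
        "access": m.group(5),
        "client": m.group(6),
        "client_from_id": client_ascii,
    }


line = ("[ 3057.480873] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) "
        "at page 1049076, read from 'TC4' (0x54433400) (136)")
info = decode_vm_fault(line)
print(info["page_hex"], info["client_from_id"])  # prints: 0x001001F4 TC4
```

For the first fault this reproduces the 0x001001F4 address from the preceding register dump, which is a quick way to confirm that the summary line and the register lines describe the same access.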

After this, no other GPU operations seem to work properly. Other things also stop working properly in automated testing at this point, but I'm not sure whether they're related.

I don't know whether this is a regression, as I only checked it now. This particular test also has some issues on Intel (see e.g. bug 104634, bug 105276), so the problem could be in common code. I also don't know whether this is related to GL bug 108898 on the same device.
Comment 1 Eero Tamminen 2018-11-30 13:22:22 UTC
Yes, this also breaks other things, not just 3D. After this issue, a script that uses pycurl to upload test results just sits in poll() instead of working, so I think something on the kernel side gets corrupted.
Comment 2 Samuel Pitoiset 2018-12-04 13:30:18 UTC
The link is dead, if you have the demo can you upload it somewhere?
Comment 3 Eero Tamminen 2018-12-04 14:25:41 UTC
(In reply to Samuel Pitoiset from comment #2)
> The link is dead, if you have the demo can you upload it somewhere?

It still worked when I filed this (and has worked for years before).  You can still get the page from Google cache:
http://webcache.googleusercontent.com/search?q=cache:https://gfxbench.com/

As you can see, there isn't a public Linux version of GfxBench v5 yet, only Android, iOS, macOS and Windows versions.

And I naturally can't provide the proprietary version.

Doesn't Valve have licenses to industry-standard 3D benchmarks (of which GfxBench is the main one on mobile and, as a result, nowadays also important on desktop)?

If not, you could try running the Windows version under Wine once the site works again. If the Windows version supports Vulkan and Wine doesn't mangle its API calls on Linux, you might be able to trigger the issue (going through DX -> DXVK probably isn't good enough).

Or if there's some Linux Android container that passes Vulkan calls through, you could try the Android version:
https://play.google.com/store/apps/details?id=com.glbenchmark.glbenchmark27

Here's some extra info on the Aztec Ruins benchmark:
https://www.anandtech.com/show/13271/kishonti-releases-vulkan-gfxbench-5
Comment 4 Eero Tamminen 2018-12-13 12:07:18 UTC
FYI: the https://gfxbench.com/ site works again.
Comment 5 Bas Nieuwenhuizen 2019-01-02 17:11:17 UTC
*** Bug 109058 has been marked as a duplicate of this bug. ***
Comment 6 Eero Tamminen 2019-02-05 17:25:58 UTC
Ubuntu 18.04 got LLVM 7, so I checked this again with the latest Mesa (4f0a3c9f9eda65), and the hangs are still happening:
-----------------------------------------------
[61150.480135] Iteration 1/1: testfw_app --gfx vulkan --gl_api vulkan --width 1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal
[61214.327444] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327446] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001F4
[61214.327447] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A08800C
[61214.327449] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[61214.327456] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0840c for process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327457] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001FE
[61214.327458] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A10800C
[61214.327459] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at page 1049086, read from 'TC2' (0x54433200) (264)
[61214.327534] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fb0840c for process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327535] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[61214.327535] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A18802C
[61214.327537] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 5, pasid 32772) at page 0, read from 'TC0' (0x54433000) (392)
[61224.391478] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=11267, emitted seq=11269
[61224.391506] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process testfw_app pid 26882 thread testfw_app pid 26883
[61224.391509] amdgpu 0000:01:00.0: GPU reset begin!
[61233.409935] WARNING: CPU: 6 PID: 26883 at kernel/kthread.c:501 kthread_park+0x76/0x90
[61233.409937] Modules linked in: fuse snd_hda_codec_realtek snd_hda_codec_generic amdgpu i915 x86_pkg_temp_thermal coretemp snd_hda_codec_hdmi snd_hda_intel snd_hda_codec e1000e crct10dif_pclmul snd_hwdep snd_hda_core crc32_pclmul chash gpu_sched snd_pcm ttm igb mei_me mei pinctrl_sunrisepoint pinctrl_intel
[61233.409954] CPU: 6 PID: 26883 Comm: testfw_app Not tainted 5.0.0-rc5-CI-Nightly_1639+ #1
[61233.409956] Hardware name: Intel Corporation NUC8i7HVK/NUC8i7HVB, BIOS HNKBLi70.86A.0053.2018.1217.1739 12/17/2018
[61233.409959] RIP: 0010:kthread_park+0x76/0x90
[61233.409961] Code: 00 48 89 df e8 eb b6 00 00 48 85 c0 74 25 31 c0 5b 5d c3 0f 0b a8 04 48 8b af f0 04 00 00 74 b0 0f 0b b8 da ff ff ff 5b 5d c3 <0f> 0b b8 f0 ff ff ff eb dd 0f 0b eb d9 0f 1f 00 66 2e 0f 1f 84 00
[61233.409963] RSP: 0018:ffffc900016f7b80 EFLAGS: 00010202
[61233.409965] RAX: 0000000000000004 RBX: ffff888466df68c0 RCX: 0000000000000000
[61233.409966] RDX: ffff888459f46ef0 RSI: ffff888466df68c0 RDI: ffff88846b7e0dc0
[61233.409968] RBP: ffff88845f8694e0 R08: 000037b103a4e400 R09: 0000000000000139
[61233.409969] R10: ffffc900000f7dd0 R11: 00000000000029ec R12: ffff888450031c00
[61233.409970] R13: ffff888459f427c0 R14: 0000000000000202 R15: ffff888450031cc8
[61233.409972] FS:  00007f7c03838700(0000) GS:ffff88846eb80000(0000) knlGS:0000000000000000
[61233.409974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61233.409975] CR2: 00007f895193cca0 CR3: 000000000320e005 CR4: 00000000003606e0
[61233.409976] Call Trace:
[61233.409985]  drm_sched_entity_fini+0x35/0x180 [gpu_sched]
[61233.410050]  amdgpu_vm_fini+0xba/0x520 [amdgpu]
[61233.410056]  ? idr_destroy+0x7a/0xc0
[61233.410098]  amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
[61233.410104]  drm_file_free.part.0+0x21b/0x300
[61233.410108]  drm_release+0x9d/0x110
[61233.410113]  __fput+0xa2/0x1d0
[61233.410117]  task_work_run+0x84/0xa0
[61233.410121]  do_exit+0x308/0xba0
[61233.410126]  do_group_exit+0x33/0xa0
[61233.410128]  get_signal+0x203/0x5f0
[61233.410133]  do_signal+0x30/0x6c0
[61233.410137]  ? do_vfs_ioctl+0xa4/0x630
[61233.410140]  ? __do_munmap+0x308/0x400
[61233.410144]  exit_to_usermode_loop+0x96/0xb0
[61233.410147]  do_syscall_64+0xe7/0x100
[61233.410150]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[61233.410153] RIP: 0033:0x7f7c05c3a5d7
[61233.410159] Code: Bad RIP value.
[61233.410160] RSP: 002b:00007f7c03836da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[61233.410162] RAX: fffffffffffffe00 RBX: 00007f7c03836e34 RCX: 00007f7c05c3a5d7
[61233.410164] RDX: 00007f7c03836de0 RSI: 00000000c0206449 RDI: 0000000000000006
[61233.410165] RBP: 00007f7c03836de0 R08: 0000000000000000 R09: 0000000000000000
[61233.410167] R10: 00000000005544cc R11: 0000000000000246 R12: 00000000c0206449
[61233.410168] R13: 0000000000000006 R14: 00007f7c03836e60 R15: 00007f7bfc424ad8
[61233.410174] WARNING: CPU: 6 PID: 26883 at kernel/kthread.c:501 kthread_park+0x76/0x90
[61233.410175] ---[ end trace 1a47caa3a1dbe0cf ]---
-----------------------------------------------
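
The log above starts with an "Iteration 1/1: ..." marker followed by the ring timeout, so the two can be paired mechanically. Below is a rough Python sketch of that idea, assuming markers follow the "Iteration N/M: command" shape seen here. attribute_timeouts is a hypothetical helper, and the gap it reports is simply the emitted sequence number minus the signaled one from the timeout line.

```python
import re

# Marker shape taken from the log above; other test harnesses may differ.
ITERATION_RE = re.compile(r"Iteration \d+/\d+: (.*)$")
TIMEOUT_RE = re.compile(r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")


def attribute_timeouts(dmesg_lines):
    """Pair each ring timeout with the most recent test marker seen before it."""
    current_test = None
    hits = []
    for line in dmesg_lines:
        marker = ITERATION_RE.search(line)
        if marker:
            current_test = marker.group(1).strip()
            continue
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            ring = timeout.group(1)
            signaled = int(timeout.group(2))
            emitted = int(timeout.group(3))
            # Gap between the last emitted and last signaled sequence numbers.
            hits.append((current_test, ring, emitted - signaled))
    return hits


sample = [
    "[61150.480135] Iteration 1/1: testfw_app --gfx vulkan --gl_api vulkan "
    "--width 1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal",
    "[61224.391478] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=11267, emitted seq=11269",
]
for test, ring, gap in attribute_timeouts(sample):
    print(f"{ring} ring timeout (seq gap {gap}) during: {test}")
```

A timeout that appears before any marker is reported with None as the test, which keeps hangs from outside the test run visible.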

PS. Outputting the test commands to dmesg before they start is IMHO a nice way to find out which of the automated tests triggers hangs.
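
One way to do this, sketched in Python: write a marker line to /dev/kmsg just before launching each test. This is only an illustration, not the actual test harness used here. run_with_kmsg_marker is a hypothetical name, the marker text mimics the "Iteration 1/1: ..." line in the log above, and writing to /dev/kmsg normally requires root.

```python
import shlex
import subprocess


def run_with_kmsg_marker(cmd, index=1, total=1, kmsg_path="/dev/kmsg"):
    """Log the test command to the kernel ring buffer, then run it.

    kmsg_path is a parameter so the function can be tried against a regular
    file; on a real system it would stay at /dev/kmsg.
    """
    marker = f"Iteration {index}/{total}: {shlex.join(cmd)}\n"
    # Each write to /dev/kmsg becomes one kernel log record.
    with open(kmsg_path, "w") as kmsg:
        kmsg.write(marker)
    return subprocess.run(cmd).returncode
```

Called as run_with_kmsg_marker(["bin/testfw_app", "--gfx", "vulkan", "--test_id", "vulkan_5_normal"]), this would leave the command line in dmesg right before any fault messages the test provokes.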

