Bug 108900 - Non-recoverable GPU hangs with GfxBench v5 Aztec Ruins Vulkan test
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon
Version: git
Hardware: Other All
Importance: high critical
Assignee: mesa-dev
QA Contact: mesa-dev
Reported: 2018-11-29 13:11 UTC by Eero Tamminen
Modified: 2018-12-04 14:25 UTC

Description Eero Tamminen 2018-11-29 13:11:35 UTC
Setup:
- FullHD monitor (through HDMI KVM)
- HadesCanyon KBL i7-8809G ([AMD/ATI] Vega [Radeon RX Vega M] (rev c0))
- Ubuntu 18.04
- drm-tip git kernel v4.20-rc4 (i.e. kernel.org v4.20-rc4 kernel + latest drm code from yesterday)
- Mesa git (c120dbfe4d)
- X server git version
- Proprietary GfxBench v5-GOLD2:  http://gfxbench.com

Test-case:
* bin/testfw_app --gfx vulkan --gl_api vulkan --width 1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal

Expected outcome:
* Works fine, like the Aztec Ruins GL version and Sascha Willems' Vulkan tests, with no GPU hangs

Actual outcome:
* Right after the test starts, the following appears in dmesg:
-----
[ 3057.480868] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for process testfw_app pid 2995 thread testfw_app pid 2997
[ 3057.480870] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001F4
[ 3057.480871] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C08800C
[ 3057.480873] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[ 3057.480879] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0840c for process testfw_app pid 2995 thread testfw_app pid 2997
[ 3057.480880] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001FD
[ 3057.480881] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C08400C
[ 3057.480883] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1049085, read from 'TC5' (0x54433500) (132)
[ 3057.480944] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa9080c for process testfw_app pid 2995 thread testfw_app pid 2997
[ 3057.480945] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3057.480946] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C18802C
[ 3057.480947] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 6, pasid 32772) at page 0, read from 'TC0' (0x54433000) (392)
[ 3067.564630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=53811, emitted seq=53814
[ 3067.564633] [drm] GPU recovery disabled.
-----
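
For anyone comparing several such logs, here is a minimal Python sketch that pulls the fault details out of the lines above. It only relies on what the log itself shows: the "page" number is the decimal form of VM_CONTEXT1_PROTECTION_FAULT_ADDR, and the quoted client name ('TC4' etc.) is the ASCII form of the 32-bit value printed next to it. The bit positions used in decode_status() are an assumption, inferred from matching the STATUS values against the decoded fields in this log rather than taken from register documentation. All function names are hypothetical helpers.

```python
import re

# Lines copied from the dmesg excerpt above (register dumps, decoded faults, timeout).
LOG = """\
[ 3057.480870] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001F4
[ 3057.480871] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C08800C
[ 3057.480873] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1049076, read from 'TC4' (0x54433400) (136)
[ 3057.480880] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001001FD
[ 3057.480881] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C08400C
[ 3057.480883] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1049085, read from 'TC5' (0x54433500) (132)
[ 3057.480945] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3057.480946] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C18802C
[ 3057.480947] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 6, pasid 32772) at page 0, read from 'TC0' (0x54433000) (392)
[ 3067.564630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=53811, emitted seq=53814
"""

ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
STATUS_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_STATUS\s+0x([0-9A-Fa-f]+)")
FAULT_RE = re.compile(
    r"VM fault \(0x([0-9A-Fa-f]+), vmid (\d+), pasid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \(0x([0-9A-Fa-f]+)\) \((\d+)\)"
)
TIMEOUT_RE = re.compile(r"ring (\w+) timeout, signaled seq=(\d+), emitted seq=(\d+)")


def decode_status(status):
    """Split a STATUS value into fields.

    ASSUMPTION: bit positions are inferred from this log only
    (low 8 bits, bits 12-20, bits 25-28), not from hardware docs.
    """
    return {
        "protections": status & 0xFF,
        "client_id": (status >> 12) & 0x1FF,
        "vmid": (status >> 25) & 0xF,
    }


def client_tag(mc_client):
    """Turn the 32-bit client value (e.g. 0x54433400) into its ASCII tag ('TC4')."""
    return mc_client.to_bytes(4, "big").rstrip(b"\0").decode("ascii")


def parse_faults(log):
    """Return one dict per 'VM fault' line, paired with the preceding ADDR/STATUS dumps."""
    faults, addr, status = [], None, None
    for line in log.splitlines():
        m = ADDR_RE.search(line)
        if m:
            addr = int(m.group(1), 16)
            continue
        m = STATUS_RE.search(line)
        if m:
            status = int(m.group(1), 16)
            continue
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "addr": addr,
                "status": status,
                "protections": int(m.group(1), 16),
                "vmid": int(m.group(2)),
                "page": int(m.group(4)),
                "access": m.group(5),
                "tag": client_tag(int(m.group(7), 16)),
                "client_id": int(m.group(8)),
            })
    return faults


def pending_jobs(log):
    """Difference between emitted and signaled sequence numbers in the timeout line."""
    m = TIMEOUT_RE.search(log)
    return None if m is None else int(m.group(3)) - int(m.group(2))


if __name__ == "__main__":
    for fault in parse_faults(LOG):
        print(fault["tag"], fault["access"], hex(fault["addr"]), decode_status(fault["status"]))
    print("emitted - signaled:", pending_jobs(LOG))
```

With the log above, the first two faults are reads at nearby pages (0x001001F4 and 0x001001FD) and the third is a read at page 0. The emitted and signaled sequence numbers in the timeout line differ by three, which I read as three submissions still outstanding when the gfx ring timed out.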

After this, no other GPU operations seem to work properly.  There are also other things that don't work properly in automated testing at this point, but I'm not sure whether they're related.

No idea whether this is a regression, as I only checked it now.  There are also some issues with this particular test on Intel (see e.g. bug 104634, bug 105276), so the problem could be in common code.  No idea whether this is related to GL bug 108898 on the same device.
Comment 1 Eero Tamminen 2018-11-30 13:22:22 UTC
Yes, this also breaks other things, not just 3D.  After this issue, a script that uses pycurl to upload test results just sits in poll() instead of completing, so I think something on the kernel side gets corrupted.
Comment 2 Samuel Pitoiset 2018-12-04 13:30:18 UTC
The link is dead, if you have the demo can you upload it somewhere?
Comment 3 Eero Tamminen 2018-12-04 14:25:41 UTC
(In reply to Samuel Pitoiset from comment #2)
> The link is dead, if you have the demo can you upload it somewhere?

It still worked when I filed this (and has worked for years before).  You can still get the page from Google cache:
http://webcache.googleusercontent.com/search?q=cache:https://gfxbench.com/

As you can see, there isn't yet a public Linux version of GfxBench v5, only Android, iOS, MacOS and Windows versions.

And I naturally can't provide the proprietary version.

Doesn't Valve have licenses to industry-standard 3D benchmarks (of which GfxBench is the main one on mobile and, as a result, nowadays important on desktop too)?

If not, you could try using the Windows version with Wine once the site works again.  If the Windows version supports Vulkan and Wine doesn't mangle its API calls for Linux, you might be able to trigger the issue (going through DX -> DXVK probably isn't good enough).

Or if there's some Linux Android container that passes Vulkan calls through, you could try the Android version:
https://play.google.com/store/apps/details?id=com.glbenchmark.glbenchmark27

Here's some extra info on the Aztec Ruins benchmark:
https://www.anandtech.com/show/13271/kishonti-releases-vulkan-gfxbench-5

