Bug 104636

Summary:	[BSW/HD400] Aztec Ruins GL version GPU hangs
Product:	Mesa	Reporter:	Eero Tamminen <eero.t.tamminen>
Component:	Drivers/DRI/i965	Assignee:	Jordan Justen <jljusten>
Status:	VERIFIED FIXED	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	clayton.a.craft
Version:	git
Hardware:	Other
OS:	All
See Also:	https://bugs.freedesktop.org/show_bug.cgi?id=105290
Whiteboard:
i915 platform:		i915 features:
Attachments:	error state for GPU hang with aztec ruins on GLK; BSW GPU error state for GL Aztec Ruins

Description Eero Tamminen 2018-01-15 12:03:54 UTC

The *GL* version of the Aztec Ruins test in the proprietary GfxBench v5.0-GOLD2 benchmark suite causes a GPU hang every time on BSW (N3050).

Setup:
- Mesa, drm-tip kernel & X server from git
- Ubuntu 16.04

(Hangs also happen with versions of Mesa, kernel and X server that are at least a few months older, but Mesa has earlier had other issues with this use-case, so much older data isn't very useful.)

Use-case example:
bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_5_normal

The hang seems to happen at the start of the benchmark run.

This test works fine on all the other devices, and its Vulkan version works fine on BSW.  Both the GL & Vulkan versions use DXT5 textures.

This isn't a regression.

(Filed this to Bugzilla on request from Jason.  As it's not a public test-case, the binary, GPU error states & other info are only available internally.)

Comment 1 Sagar Kamble 2018-02-02 08:40:57 UTC

We are seeing a GL hang for Aztec Ruins on GLK Linux. Can you share the error state for the BSW hang? Attaching the GLK one for reference.

Comment 2 Sagar Kamble 2018-02-02 08:41:32 UTC

Created attachment 137126 [details]
error state for GPU hang with aztec ruins on GLK

Comment 3 Eero Tamminen 2018-02-02 12:24:22 UTC

Created attachment 137127 [details]
BSW GPU error state for GL Aztec Ruins

On BSW, there's a GPU hang in GL Aztec Ruins in approximately 4 runs out of 5.  Error state attached.  The Vulkan version still works fine.

Comment 4 Kenneth Graunke 2018-03-06 03:09:02 UTC

(In reply to Sagar Kamble from comment #2)
> Created attachment 137126 [details]
> error state for GPU hang with aztec ruins on GLK

Hi Sagar,

Based on your error state, it looks like your Mesa does not have commit 55a97db52347f62111a24715078c6035380d3e19, which ought to fix that hang.  We shipped this fix in 18.0-rc1, but it looks like I forgot to get it included in the 17.3.x releases.  I just nominated it [ https://lists.freedesktop.org/archives/mesa-stable/2018-March/007941.html ] so hopefully it'll hit 17.3.7.

Your bug is different than Eero's.  His is hanging after a BLORP operation.  Yours was hanging on a compute shader.

I can't reproduce Eero's BSW hang.

Comment 5 Kenneth Graunke 2018-03-06 07:26:28 UTC

(In reply to Eero Tamminen from comment #3)
> Created attachment 137127 [details]
> BSW GPU error state for GL Aztec Ruins
> 
> On BSW, there's GPU hang in GL Aztec Ruins approximately on 4 runs out of 5. 
> Error state attached.  Vulkan version works still fine.

Hey Eero, are you using a HD 400 (12 EU) or HD 405 (16 EU)?  Unfortunately, Braswell is not identifiable by PCI ID alone :(

Comment 6 Kenneth Graunke 2018-03-06 07:27:57 UTC

*** Bug 105210 has been marked as a duplicate of this bug. ***

Comment 7 Kenneth Graunke 2018-03-06 07:34:15 UTC

Both Eero's and Clayton's error states look identical, and the hanging BLORP operation has a totally bogus pixel shader.  The one and only instruction is:

illegal(1)                                                      { align1 1N };

So that's clearly not going to work out.  Now the question is...how did that happen?

Comment 8 Kenneth Graunke 2018-03-06 07:50:38 UTC

It looks like somebody has scribbled over the program cache.

It appears to contain:
- Zeroes
- <0x14, 0x14, 0x14, 0x14> (offset 0x280)
- Zeroes
- <14.5f, 14.5f, 14.5f, 14.5f> (offset 0x680)
- <15.5f, 15.5f, 15.5f, 15.5f> (offset 0x690)
- Zeroes
- <1.5f, 5.5f, 9.5f, 13.5f>, repeated twice (offset 0x720)
- Zeroes
- <14.5f, 14.5f, 14.5f, 14.5f> (offset 0x840)
- <15.5f, 15.5f, 15.5f, 15.5f> (offset 0x850)
- Zeroes
- some more floating point numbers and zeroes
- wildly different looking data starting at 0xc80 - probably the real data
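
A breakdown like the one above can be produced by walking a raw dump of the buffer in 16-byte steps and labelling each step. The following is only a rough sketch of that idea, not the tooling actually used here: the function names, the 16-byte granularity and the "multiple of 0.5" heuristic for spotting float data are all assumptions chosen to match the values listed in this comment.

```python
import math
import struct

CHUNK = 16  # bytes; assumed granularity: one vec4 of 32-bit values


def classify(chunk: bytes) -> str:
    """Label one 16-byte chunk as zeroes, small ints, plausible floats or other."""
    if chunk == bytes(CHUNK):
        return "zeroes"
    ints = struct.unpack("<4I", chunk)
    if all(v < 0x10000 for v in ints):
        return "small ints " + str([hex(v) for v in ints])
    floats = struct.unpack("<4f", chunk)
    # Heuristic (assumption): scratch-like data such as 14.5f or 1.5f is finite,
    # modest in size, and a multiple of 0.5; shader code rarely looks like that.
    if all(math.isfinite(f) and abs(f) < 1e6 and (2 * f).is_integer() for f in floats):
        return "floats " + str(list(floats))
    return "other"


def summarize(data: bytes):
    """Return (offset, label) pairs, merging neighbouring chunks with the same label."""
    runs = []
    for off in range(0, len(data) - CHUNK + 1, CHUNK):
        label = classify(data[off:off + CHUNK])
        if not runs or runs[-1][1] != label:
            runs.append((off, label))
    return runs
```

Calling `summarize()` on the bytes of the program cache buffer extracted from an error state would give one entry per run of similar data, which makes a region of zeroes and float vectors in front of real shader code easy to spot.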

Comment 9 Eero Tamminen 2018-03-06 08:41:27 UTC

(In reply to Kenneth Graunke from comment #5)
> Hey Eero, are you using a HD 400 (12 EU) or HD 405 (16 EU)?  Unfortunately,
> Braswell is not identifiable by PCI ID alone :(

It's same as what Clayton has:
"Device: Mesa DRI Intel(R) HD Graphics 400 (Braswell)  (0x22b1)"

Comment 10 Kenneth Graunke 2018-03-07 08:39:34 UTC

(In reply to Eero Tamminen from comment #9)
> It's same as what Clayton has:
> "Device: Mesa DRI Intel(R) HD Graphics 400 (Braswell)  (0x22b1)"

Ah, terrific, thanks!  I believe compute shader scratch is scribbling over the program cache, but only for the HD 400 (12 EU) model.  The HD 405 (16 EU) model that I've been testing with doesn't suffer from this bug, which is why I couldn't reproduce it.

Jordan sent a patch to the mailing list which should fix this bug:
https://patchwork.freedesktop.org/patch/208502/

Comment 11 Eero Tamminen 2018-03-12 16:43:18 UTC

*** Bug 105290 has been marked as a duplicate of this bug. ***

Comment 12 Jordan Justen 2018-03-12 17:42:00 UTC

Should be fixed by:

commit 06e3bd02c01e499332a9c02b40f506df9695bced
i965: Hard code CS scratch_ids_per_subslice for Cherryview
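
For readers who want the arithmetic behind an under-sized scratch buffer, here is a rough sketch. Everything in it is an assumption for illustration rather than a description of the patch: 7 hardware threads per EU, 2 subslices, a linear "thread ID times per-thread size" scratch layout, and the idea that the 12 EU part (6 EUs per subslice) still generates scratch IDs as if each subslice had 8 EUs. The function names are hypothetical.

```python
THREADS_PER_EU = 7  # assumption: hardware threads per EU on this generation
SUBSLICES = 2       # assumption: subslice count on Braswell/Cherryview


def scratch_bytes(per_thread_bytes: int, eus_per_subslice: int) -> int:
    """Scratch size if every possible scratch ID gets its own per-thread slot."""
    ids_per_subslice = eus_per_subslice * THREADS_PER_EU
    return per_thread_bytes * ids_per_subslice * SUBSLICES


def overrun_bytes(per_thread_bytes: int, eus_assumed: int, eus_addressed: int) -> int:
    """Bytes the hardware could touch beyond what the driver allocated."""
    allocated = scratch_bytes(per_thread_bytes, eus_assumed)
    touched = scratch_bytes(per_thread_bytes, eus_addressed)
    return max(0, touched - allocated)
```

Under these assumptions, sizing the buffer for 6 EUs per subslice while the hardware addresses it as if there were 8 leaves a range past the end of the allocation that compute threads can write to, which would be consistent with scratch data landing in a neighbouring buffer such as the program cache. Sizing with a fixed per-subslice ID count makes `overrun_bytes()` return 0.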

Comment 13 Eero Tamminen 2018-03-26 11:44:34 UTC

Verified.

I've once seen a GPU hang in the Aztec Ruins Vulkan version after this, but not in the GL/GLES versions.
