Bug 98159 - SKL GPU hang when execute the dEQP GLES3 case (GT4e)
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Kenneth Graunke
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: i965-deqp
Reported: 2016-10-08 08:32 UTC by Randy
Modified: 2019-03-14 08:40 UTC (History)

See Also:
i915 platform:
i915 features:


Attachments
i915 error state of GPU hang (611.00 KB, text/plain)
2016-10-08 08:32 UTC, Randy

Description Randy 2016-10-08 08:32:51 UTC
Created attachment 127134
i915 error state of GPU hang

Executing the dEQP case below may cause a GPU hang on SKL:

$ ./deqp-gles3 --deqp-case=dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.std140_instance_array_fragment
ERROR: <Text>Image comparison failed, got 16384 non-white pixels</Text>

[782161.763223] [drm] stuck on render ring
[782161.763416] [drm] GPU HANG: ecode 9:0:0x85dffffb, in deqp-gles3 [21566], reason: Engine(s) hung, action: reset
[782161.765222] drm/i915: Resetting chip after gpu hang
[782163.763513] [drm] RC6 on

See attached log for i915 hang state
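
For triage, the hang line above can be picked apart with a small script. This is only a sketch: the pattern is written against the single log line quoted in this report, other kernel versions may format the message differently, and the function and field names are made up for illustration.

```python
import re

# Pattern derived from the one "GPU HANG" line quoted in this report.
# It is an assumption that other kernels print the same layout.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\d+:\d+:0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a 'GPU HANG' dmesg line as a dict, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


line = ("[782161.763416] [drm] GPU HANG: ecode 9:0:0x85dffffb, in deqp-gles3 "
        "[21566], reason: Engine(s) hung, action: reset")
print(parse_gpu_hang(line))
```

Lines that do not contain a hang message, such as the "RC6 on" line, return None.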
Comment 1 Randy 2016-10-08 08:34:32 UTC
The same test case passes on HSW and BDW with the same driver, so the GPU hang appears to be SKL-specific.
Comment 2 Kenneth Graunke 2016-10-09 00:53:20 UTC
Does it hang all the time or just occasionally?
Comment 3 Randy 2016-10-09 01:33:44 UTC
(In reply to Kenneth Graunke from comment #2)
> Does it hang all the time or just occasionally?

It reproduces consistently; it is not an occasional issue.
I suspect it is due to memory resource misalignment.
Comment 4 Mark Janes 2016-10-10 16:39:47 UTC
This test passes reliably in the Mesa CI on sklgt2
Comment 5 Mark Janes 2016-10-10 20:24:29 UTC
I have been enabling sklgt4e in the Mesa CI and see similar gpu hangs on that platform.  Randy, please specify which sku of skl you are testing.
Comment 6 Randy 2016-10-11 00:53:10 UTC
(In reply to Mark Janes from comment #5)
> I have been enabling sklgt4e in the Mesa CI and see similar gpu hangs on
> that platform.  Randy, please specify which sku of skl you are testing.

Hi, Mark

Yes, I am using GT4e; it's an Intel NUC6i7KYK. It can also be reproduced on kernel 4.7.

More info:

   - mesa git top commit 1d466b9b04662d41a403ea8fd617a5365750b1de
Author: Steven Toth <stoth@kernellabs.com>
Date:   Thu Sep 29 08:11:00 2016 -0600
    gallium/hud: Add power sensor support

   - libdrm git top commit b382b22fd4aa6faa954396c94330f2c7d8428aba
Author: Sean Paul <seanpaul@chromium.org>
Date:   Tue Jul 14 15:43:20 2015 -0400
    libdrm: Add rotation property fields

   - latest publicly released kernel (4.7) and i915 top commit is
commit ad778f8967ea2f0bfda02701f918bcfcd495b721
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 4 16:32:42 2016 +0100
    drm/i915: Export our request as a dma-buf fence on the reservation object

   - version of the test suite
	dEQP git: https://android.googlesource.com/platform/external/deqp
	top commit is ca988480be945772473f9256b6ae91fa6aa62bd1 

Thanks,
Randy
Comment 7 Tapani Pälli 2016-10-11 08:39:57 UTC
I did some experimental debugging for this. The fragment shader makes a huge number of comparisons (114). If I comment out all the comparisons after the 58th, which looks like this:

result *= compare_ivec2(block[1].t[0].b[2].b[3], ivec2(1, -7));

then the hang disappears and the test passes. I am not sure if this helps, but it seems to me that the hang is related to the code generated by these comparisons.
Comment 8 Tapani Pälli 2016-10-11 08:47:31 UTC
(In reply to Tapani Pälli from comment #7)
> I did some experimental debugging for this. The fragment shader makes a huge
> number of comparisons (114). If I comment out all the comparisons after the
> 58th, which looks like this:
> 
> result *= compare_ivec2(block[1].t[0].b[2].b[3], ivec2(1, -7));
> 
> then the hang disappears and the test passes. I am not sure if this helps,
> but it seems to me that the hang is related to the code generated by these
> comparisons.

Just to speculate a bit more: this test generates a huge number of UBO loads (there are 473 ubo_load_tmp variables in total), so the hang is possibly related to that.
Comment 9 Mark Janes 2016-10-11 16:50:49 UTC
Ben Widawsky asked me to provide card error state from a SKL GT4e to Mika Kuoppala to investigate this failure.

In addition to the state attached by Randy, I captured another crash at:

http://otc-mesa-ci.jf.intel.com/userContent/gt4_error/*view*/
Comment 10 Randy 2016-10-14 03:16:22 UTC
Two more dEQP cases reproduce the GPU hang on GT4e. The signature is similar, i.e. HEAD 0x440, TAIL 0x460.

render command stream:
  START: 0x017a2000
  HEAD:  0x00000440
  TAIL:  0x00000460
  CTL:   0x00003001
  HWS:   0x007ec000
  ACTHD: 0x00000000 00000440



#deqp-gles3 --deqp-case=dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.shared_instance_array_fragment

#deqp-gles3 --deqp-case=dEQP-GLES3.functional.ubo.random.all_per_block_buffers.33
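
To compare the HEAD/TAIL signature across several error states, something like the following could be used. It is a rough sketch that only understands the "NAME: 0x..." lines pasted above; the helper names are hypothetical, and treating the two ACTHD words as the upper and lower halves of one value is an assumption.

```python
def parse_ring_state(text):
    """Parse 'NAME: 0x...' lines from an error-state excerpt into integers."""
    state = {}
    for raw in text.splitlines():
        name, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        try:
            # ACTHD is printed as two hex words; join them into one number
            # (assumed to be the upper and lower halves of a single value).
            state[name.strip()] = int("".join(value.split()), 16)
        except ValueError:
            # Header lines such as "render command stream:" carry no value.
            continue
    return state


def hang_signature(state):
    """Return the (HEAD, TAIL) pair used above to call two hangs similar."""
    return (state["HEAD"], state["TAIL"])


DUMP = """render command stream:
  START: 0x017a2000
  HEAD:  0x00000440
  TAIL:  0x00000460
  CTL:   0x00003001
  HWS:   0x007ec000
  ACTHD: 0x00000000 00000440
"""

state = parse_ring_state(DUMP)
print([hex(v) for v in hang_signature(state)])
```

Two error states with the same pair can then be grouped as likely instances of the same hang, which is all the comparison above relies on.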
Comment 11 Randy 2016-10-20 07:25:49 UTC
27 cases failed on SKL GT4e due to GPU reset:

dEQP-GLES3.functional.ubo.multi_nested_struct.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.multi_nested_struct.single_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.random.all_per_block_buffers.33
dEQP-GLES3.functional.ubo.random.all_shared_buffer.23
dEQP-GLES3.functional.ubo.random.nested_structs_arrays_instance_arrays.24
dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct_array.single_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_nested_struct.single_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.per_block_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.per_block_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.per_block_buffer.std140_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.single_buffer.packed_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.single_buffer.shared_instance_array_fragment
dEQP-GLES3.functional.ubo.single_struct_array.single_buffer.std140_instance_array_fragment
Comment 12 Kenneth Graunke 2016-11-05 23:00:57 UTC
I believe this is a scratch space allocation problem.  Increasing max_wm_threads from 64 * 9 to 72 * 9 in src/intel/common/gen_device_info.c seems to fix the problem.
Comment 13 Kenneth Graunke 2016-11-06 00:06:55 UTC
(In reply to Kenneth Graunke from comment #12)
> I believe this is a scratch space allocation problem.  Increasing
> max_wm_threads from 64 * 9 to 72 * 9 in src/intel/common/gen_device_info.c
> seems to fix the problem.

I probably spoke too soon - increasing the size of the buffer can also just move things around in the GTT so it happens to work.  Ben and I think the old calculation is correct, but I'll look at this more carefully.
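
To make the numbers above concrete, here is a small sketch. It assumes the scratch buffer is sized as per-thread scratch size times thread count; that formula, the 1 KiB per-thread figure, and the helper name are illustrative assumptions, not values taken from the driver.

```python
def scratch_buffer_size(per_thread_bytes, max_threads):
    """Total scratch allocation, assuming one fixed-size slot per thread."""
    return per_thread_bytes * max_threads


# Hypothetical per-thread scratch size, chosen only for illustration.
PER_THREAD_BYTES = 1024

# The two max_wm_threads values discussed in this comment thread.
old_threads = 64 * 9
new_threads = 72 * 9

old_size = scratch_buffer_size(PER_THREAD_BYTES, old_threads)
new_size = scratch_buffer_size(PER_THREAD_BYTES, new_threads)

print("old:", old_threads, "threads,", old_size, "bytes")
print("new:", new_threads, "threads,", new_size, "bytes")
print("extra bytes:", new_size - old_size)
```

Under these assumptions the larger value grows the buffer by one eighth. As noted above, a bigger buffer can also just move things around in the GTT, so a passing run with the larger value does not by itself show that the thread count was wrong.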
Comment 14 Mark Janes 2016-11-08 00:44:11 UTC
Mika, can you reproduce this gpu hang?
Comment 15 Mark Janes 2016-11-08 04:54:39 UTC
Mika, don't bother reproducing this.  Ken Graunke has found a bug and has a patch to address SKLGT4e instabilities.
Comment 16 Kenneth Graunke 2016-11-08 19:25:33 UTC
It turns out this was our fault:

https://lists.freedesktop.org/archives/mesa-dev/2016-November/134606.html

Once again...documented...but in an obscure place.  Nobody thinks to read the description of "scratch space base pointer", as that pointer has meant the same thing for 10 years...

