Bug 92589

Summary: [BDW BSW SKL CTS] ES31-CTS.texture_gather.* GPU_HANG
Product: Mesa
Component: Drivers/DRI/i965
Version: git
Hardware: Other
OS: All
Status: RESOLVED FIXED
Severity: major
Priority: high
Reporter: Marta Löfstedt <marta.lofstedt>
Assignee: Jordan Justen <jljusten>
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
CC: ben, idr, jljusten, kenneth, marta.lofstedt
Bug Blocks: 92778, 93312
Attachments:
    dmesg with GPU_HANG
    implement WaSendDummyVFEafterPipelineSelect
    i965-upload_compute_sampler_state.patch

Description Marta Löfstedt 2015-10-22 09:03:59 UTC
Created attachment 119061 [details]
dmesg with GPU_HANG

Software versions:
    4.3.0-rc3+
    OpenGL version string: 3.0 Mesa 11.1.0-devel (git-13a5805)

GPU hardware:
    OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 5300 (Broadwell GT2)
    00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:161e] (rev 06)

CPU hardware:
    x86_64
    Genuine Intel(R) CPU 0000 @ 0.60GHz

CTS version:
git@67ae88f31295

command: 
./glcts --deqp-case=ES31-CTS.texture_gather*

Environment:
Mesa built with: --enable-debug
export MESA_GLES_VERSION_OVERRIDE=3.1
export MESA_EXTENSION_OVERRIDE=GL_ARB_compute_shader

------------------------------------------------------
When running all the texture_gather tests of the OpenGL ES 3.1 CTS, execution stops after 3 or 4 tests with:
"intel_do_flush_locked failed: Input/output error"

dmesg shows: [drm] GPU HANG: ecode 8:0:0x85ddfffb, in glcts [26007], reason: Ring hung, action: reset

The issue is not reproducible when the tests are run alone.
The issue is not reproducible on HSW.
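
For anyone triaging several of these hangs, the dmesg line above can be split into fields with a small script. This is only a sketch: the regular expression is written against the single line quoted in this report, and the names given to the three ecode components (gen, ring, hash) are my assumption about their meaning, not something stated in this bug.

```python
import re

# Pattern written against the one GPU HANG line quoted in this report.
# The group names gen/ring/hash are an assumed reading of "ecode a:b:0x...".
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):0x(?P<hash>[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], reason: (?P<reason>[^,]+), "
    r"action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return a dict of fields from a GPU HANG line, or None if no match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("gen", "ring", "pid"):
        fields[key] = int(fields[key])
    fields["hash"] = int(fields["hash"], 16)
    return fields


line = ("[drm] GPU HANG: ecode 8:0:0x85ddfffb, in glcts [26007], "
        "reason: Ring hung, action: reset")
info = parse_hang(line)
print(info)
```

Comparing the parsed hash value across runs is one quick way to check whether two hangs carry the same error code.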
Comment 1 Ian Romanick 2015-10-27 18:27:55 UTC
This sounds a little bit similar to bug #92623.  In that bug, Ken was able to make the hang go away by:

    FWIW, enabling the "Always re-emit all state" block in
    brw_state_upload.c:708 seems to fix this problem.  always_flush_batch=true
    and always_flush_cache=true (way more flushing) have no impact.  This makes
    me suspect missing dirty flags...

Marta: Can you try this to see if it helps?

Ken: Is there any other information that might be helpful?
Comment 2 Kenneth Graunke 2015-10-27 19:15:26 UTC
I don't think this has anything to do with that.  That bug is about PMA and early depth testing.  None of the workarounds work here.

This error state is completely different.  CS (command streamer) and TSG are busy, and the offending batch appears to be a compute shader invocation.  Guessing it's some kind of compute bug.

CC'ing Jordan.
Comment 3 Kenneth Graunke 2015-10-27 19:33:47 UTC
Ben suggested taking a look at WaSendDummyVFEafterPipelineSelect.

"TSG unit writes null entries into context which can hang restore CS and TSG.
 WA: Send dummy VFE immediately after GPGPU pipeline select."
Comment 4 Ben Widawsky 2015-10-27 21:17:33 UTC
Created attachment 119232 [details] [review]
implement WaSendDummyVFEafterPipelineSelect

I was waiting for a build of something so I thought I'd give the patch a shot. This is just compile tested.
Comment 5 Marta Löfstedt 2015-10-28 08:04:22 UTC
Unfortunately the "implement WaSendDummyVFEafterPipelineSelect" patch does not help.

Some combinations of the test are not causing GPU_HANG, for example:
./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-int*
is OK,

but:
./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-uint*
./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-float*
cause HANG.
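
Narrowing down which filters hang can be scripted. The sketch below builds the same command lines as above and treats the "intel_do_flush_locked failed" message from the description as a rough hang indicator. The helper names are hypothetical, and the assumption that the message ends up in the captured output of glcts is mine.

```python
import subprocess

# Error string quoted in the bug description; used here as a rough
# indicator that the run ended in a GPU hang.
HANG_MARKER = "intel_do_flush_locked failed"

# Filters mentioned in this comment.
FILTERS = [
    "ES31-CTS.texture_gather.plain-gather-int*",
    "ES31-CTS.texture_gather.plain-gather-uint*",
    "ES31-CTS.texture_gather.plain-gather-float*",
]


def build_command(case_filter, binary="./glcts"):
    """Build the glcts command line for one --deqp-case filter."""
    return [binary, "--deqp-case=" + case_filter]


def looks_hung(output):
    """True if the captured output contains the flush failure message."""
    return HANG_MARKER in output


def run_filter(case_filter, binary="./glcts"):
    """Run one filter and report whether the run appears to have hung."""
    proc = subprocess.run(
        build_command(case_filter, binary),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return looks_hung(proc.stdout)


# Only print the commands here; call run_filter() on a machine with glcts.
for case_filter in FILTERS:
    print(" ".join(build_command(case_filter)))
```

Each filter should be run in a fresh process, since the hang was only seen when several tests ran in the same invocation.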
Comment 6 Ben Widawsky 2015-10-28 14:44:59 UTC
(In reply to Marta Löfstedt from comment #5)
> Unfortunately the "implement WaSendDummyVFEafterPipelineSelect" patch does
> not help.
> 
> Some combinations of the test are not causing GPU_HANG, for example:
> ./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-int*
> is OK,
> 
> but:
> ./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-uint*
> ./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-float*
> cause HANG.

Can you please post the 4 INSTDONE values from the error state, both with and without the patch? Note that you must manually clear the error state by writing to the sysfs file.
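
A minimal sketch of that save-then-clear step, assuming the error state is exposed at /sys/class/drm/card0/error (the card index may differ on other systems, and root access is usually needed). The function names are hypothetical.

```python
# Assumed location of the i915 error state; adjust the card index as needed.
DEFAULT_ERROR_PATH = "/sys/class/drm/card0/error"


def read_error_state(path=DEFAULT_ERROR_PATH):
    """Return the current error state dump as text."""
    with open(path, "r") as node:
        return node.read()


def clear_error_state(path=DEFAULT_ERROR_PATH):
    """Clear the stored error state by writing to the sysfs node."""
    with open(path, "w") as node:
        node.write("1\n")


def save_and_clear(dest, path=DEFAULT_ERROR_PATH):
    """Copy the error state to dest, then clear it for the next run."""
    dump = read_error_state(path)
    with open(dest, "w") as out:
        out.write(dump)
    clear_error_state(path)
    return dump
```

Calling save_and_clear("with-patch.txt") after one hang and save_and_clear("without-patch.txt") after the other keeps the two dumps from being confused with each other.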
Comment 7 Marta Löfstedt 2015-10-28 16:42:21 UTC
The values are the same both with and without the patch:

INSTDONE_0: 0xffddffff
INSTDONE_1: 0xffffffff
INSTDONE_2: 0xffffffff
INSTDONE_3: 0xfffeffff
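
To make the values above easier to read, the sketch below lists the bits that are clear in each register. It assumes the usual INSTDONE convention that a set bit means the unit has reported done, so clear bits mark units that were still busy. It deliberately does not map bit numbers to unit names, since that mapping comes from the hardware documentation and is not given in this thread.

```python
# Values posted in this comment.
INSTDONE = {
    "INSTDONE_0": 0xffddffff,
    "INSTDONE_1": 0xffffffff,
    "INSTDONE_2": 0xffffffff,
    "INSTDONE_3": 0xfffeffff,
}


def clear_bits(value, width=32):
    """Return the positions of all zero bits in a width-bit register value."""
    return [bit for bit in range(width) if not (value >> bit) & 1]


for name, value in INSTDONE.items():
    print(name, hex(value), "clear bits:", clear_bits(value))
```

For these values, only bits 17 and 21 of INSTDONE_0 and bit 16 of INSTDONE_3 are clear.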
Comment 8 Ben Widawsky 2015-10-28 16:58:01 UTC
I don't know how to run deqp... could you dump the shader? INTEL_DEBUG=cs (I am guessing)
Comment 9 Marta Löfstedt 2015-10-28 17:01:39 UTC
It's not deqp; it is the Khronos CTS. It's available in teamforge.
Comment 10 Marta Löfstedt 2016-01-14 16:00:37 UTC
When running the tests individually, these tests fail on both BSW and SKL:

ES31-CTS.texture_gather.plain-gather-depth-2d
ES31-CTS.texture_gather.plain-gather-depth-2darray
ES31-CTS.texture_gather.plain-gather-depth-cube
ES31-CTS.texture_gather.offset-gather-depth-2d
ES31-CTS.texture_gather.offset-gather-depth-2darray

These are the only tests that set the internal format to GL_DEPTH_COMPONENT32F for glTexImage2D|3D.

I don't think this is related to the GPU HANG, since running:
./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-uint*
does not run any "depth" cases.

However, only running the "depth" cases:
./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather*depth*
will cause a GPU HANG, like many of the other texture gather test combinations discussed earlier.
Comment 11 Francisco Jerez 2016-01-16 23:59:41 UTC
*** Bug 93725 has been marked as a duplicate of this bug. ***
Comment 12 Francisco Jerez 2016-01-17 00:04:10 UTC
Created attachment 121077 [details] [review]
i965-upload_compute_sampler_state.patch

This fixes the problem for me.
Comment 13 Marta Löfstedt 2016-01-18 11:34:21 UTC
Your patch is amazing, Curro!

With your patch on top of Mesa: git@ad20be1f30ef5

I can no longer reproduce the HANG on BDW and SKL. And all texture gather tests pass.

Also, it fixes BUG 93312.

Please send it up to mesa-dev!
Comment 14 Marta Löfstedt 2016-01-18 11:57:13 UTC
*** Bug 93312 has been marked as a duplicate of this bug. ***
Comment 15 Marta Löfstedt 2016-01-18 12:29:31 UTC
*** Bug 93325 has been marked as a duplicate of this bug. ***
Comment 16 Marta Löfstedt 2016-01-18 12:32:43 UTC
*** Bug 93407 has been marked as a duplicate of this bug. ***
Comment 17 Francisco Jerez 2016-01-18 20:45:23 UTC
Cool, thanks for testing it out :)
Comment 18 Tapani Pälli 2016-01-20 07:44:12 UTC
commit f8ac314cc2353f439e6a917db4e3aeaf47e2093e
Author: Francisco Jerez <currojerez@riseup.net>
Date:   Sat Jan 16 15:11:03 2016 -0800

    i965: Implement compute sampler state atom.
    
    Fixes a number of GLES31 CTS failures and hangs on various hardware:
    
     ES31-CTS.texture_gather.plain-gather-depth-2d
     ES31-CTS.texture_gather.plain-gather-depth-2darray
     ES31-CTS.texture_gather.plain-gather-depth-cube
     ES31-CTS.texture_gather.offset-gather-depth-2d
     ES31-CTS.texture_gather.offset-gather-depth-2darray
     ES31-CTS.layout_binding.sampler2D_layout_binding_texture_ComputeShader
     ES31-CTS.layout_binding.sampler2DArray_layout_binding_texture_ComputeShader
     ES31-CTS.explicit_uniform_location.uniform-loc-types-samplers
     ES31-CTS.compute_shader.resources-texture
    
    Some of them were actually passing by luck on some generations even
    though we weren't uploading sampler state tables explicitly for the
    compute stage, most likely because they relied on the cached sampler
    state left from previous rendering to be close enough.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92589
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93312
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93325
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93407
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93725
    Reported-by: Marta Lofstedt <marta.lofstedt@intel.com>
    Reviewed-by: Marta Lofstedt <marta.lofstedt@intel.com>
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
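
The "passing by luck" behaviour described in the commit message can be illustrated with a toy model. This is not Mesa's code: every name below is hypothetical, and the model reduces the driver to a list of upload functions ("atoms") per pipeline and a single slot for the sampler state the hardware last saw.

```python
class ToyGPU:
    """Toy stand-in for the hardware: remembers the last sampler upload."""

    def __init__(self):
        self.sampler_state = None


def upload_render_samplers(gpu, ctx):
    """Hypothetical atom: upload sampler state for the render stages."""
    gpu.sampler_state = ("render", ctx["render_samplers"])


def upload_compute_samplers(gpu, ctx):
    """Hypothetical atom: upload sampler state for the compute stage."""
    gpu.sampler_state = ("compute", ctx["compute_samplers"])


RENDER_ATOMS = [upload_render_samplers]
COMPUTE_ATOMS_WITHOUT_FIX = []  # compute never uploads sampler state
COMPUTE_ATOMS_WITH_FIX = [upload_compute_samplers]


def emit(gpu, ctx, atoms):
    """Run every atom for a pipeline and return what the GPU now holds."""
    for atom in atoms:
        atom(gpu, ctx)
    return gpu.sampler_state


ctx = {"render_samplers": "linear-2d", "compute_samplers": "nearest-depth"}

gpu = ToyGPU()
emit(gpu, ctx, RENDER_ATOMS)
stale = emit(gpu, ctx, COMPUTE_ATOMS_WITHOUT_FIX)

gpu = ToyGPU()
emit(gpu, ctx, RENDER_ATOMS)
fresh = emit(gpu, ctx, COMPUTE_ATOMS_WITH_FIX)

print("without compute atom:", stale)
print("with compute atom:   ", fresh)
```

In this model the compute dispatch without its own atom still sees whatever the previous draw left behind, which works only when that leftover state happens to be close enough to what the compute shader needs.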
