Created attachment 119061 [details]
dmesg with GPU_HANG
OpenGL version string: 3.0 Mesa 11.1.0-devel (git-13a5805)
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 5300 (Broadwell GT2)
00:02.0 VGA compatible controller : Intel Corporation Broadwell-U Integrated Graphics [8086:161e] (rev 06)
Genuine Intel(R) CPU 0000 @ 0.60GHz
Mesa built with: --enable-debug
When running all the texture_gather tests of the OpenGL ES 3.1 CTS, execution stops after 3 or 4 tests with:
"intel_do_flush_locked failed: Input/output error"
dmesg shows: [drm] GPU HANG: ecode 8:0:0x85ddfffb, in glcts , reason: Ring hung, action: reset
The issue is not reproducible when the tests are run alone.
The issue is not reproducible on HSW.
This sounds a little bit similar to bug #92623. In that bug, Ken was able to make the hang go away by:
FWIW, enabling the "Always re-emit all state" block in
brw_state_upload.c:708 seems to fix this problem. always_flush_batch=true
and always_flush_cache=true (way more flushing) have no impact. This makes
me suspect missing dirty flags...
Marta: Can you try this to see if it helps?
Ken: Is there any other information that might be helpful?
I don't think this has anything to do with that. That bug is about PMA and early depth testing. None of the workarounds work here.
This error state is completely different. CS (command streamer) and TSG are busy, and the offending batch appears to be a compute shader invocation. Guessing it's some kind of compute bug.
Ben suggested taking a look at WaSendDummyVFEafterPipelineSelect.
"TSG unit writes null entries into context which can hang restore CS and TSG.
WA: Send dummy VFE immediately after GPGPU pipeline select."
Created attachment 119232 [details] [review]
I was waiting for a build of something so I thought I'd give the patch a shot. This is just compile tested.
Unfortunately the "implement WaSendDummyVFEafterPipelineSelect" patch does not help.
Some combinations of the test are not causing GPU_HANG, for example:
(In reply to Marta Löfstedt from comment #5)
> Unfortunately the "implement WaSendDummyVFEafterPipelineSelect" patch does
> not help.
> Some combinations of the test are not causing GPU_HANG, for example:
> ./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-int*
> is OK,
> ./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-uint*
> ./glcts --deqp-case=ES31-CTS.texture_gather.plain-gather-float*
> cause HANG.
Can you please post the 4 INSTDONE values from the error state, both with and without the patch? Note that you must manually clear the error state by writing to the sysfs file.
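For reference, a minimal sketch of that capture-and-clear cycle. It assumes the i915 error state is exposed at /sys/class/drm/card0/error (the card index may differ on a given machine) and that the dump contains lines with the string INSTDONE. The two function names are hypothetical helpers, not part of any existing tool.

```shell
#!/bin/sh
# Assumption: i915 exposes the last GPU error state here; the card index may differ.
ERROR_STATE="${ERROR_STATE:-/sys/class/drm/card0/error}"

# Print every line mentioning INSTDONE in the given error state dump.
dump_instdone() {
    grep 'INSTDONE' "$1"
}

# Clear the recorded error state by writing to the file, so that the
# next hang records a fresh state (needs root on the real sysfs file).
clear_error_state() {
    echo > "$1"
}
```

Typical use would be to run `dump_instdone "$ERROR_STATE"` after a hang without the patch, then `clear_error_state "$ERROR_STATE"`, reproduce the hang with the patch applied, and dump again.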
The values are the same both with and without the patch:
I don't know how to run deqp... could you dump the shader? INTEL_DEBUG=cs (I am guessing)
It's not deqp; it is the Khronos CTS, which is available in teamforge.
When running the tests individually, these tests fail on both BSW and SKL:
These are the only tests that set the internal format to GL_DEPTH_COMPONENT32F for glTexImage2D|3D.
I don't think this is related to the GPU HANG, since running:
does not run any "depth" cases.
However, only running the "depth" cases:
will cause a GPU HANG, as do many of the other texture gather test combinations discussed earlier.
*** Bug 93725 has been marked as a duplicate of this bug. ***
Created attachment 121077 [details] [review]
This fixes the problem for me.
Your patch is amazing, Curro!
With your patch on top of Mesa: git@ad20be1f30ef5
I can no longer reproduce the HANG on BDW and SKL. And all texture gather tests pass.
Also, it fixes BUG 93312.
Please send it up to mesa-dev!
*** Bug 93312 has been marked as a duplicate of this bug. ***
*** Bug 93325 has been marked as a duplicate of this bug. ***
*** Bug 93407 has been marked as a duplicate of this bug. ***
Cool, thanks for testing it out :)
Author: Francisco Jerez <firstname.lastname@example.org>
Date: Sat Jan 16 15:11:03 2016 -0800
i965: Implement compute sampler state atom.
Fixes a number of GLES31 CTS failures and hangs on various hardware:
Some of them were actually passing by luck on some generations even
though we weren't uploading sampler state tables explicitly for the
compute stage, most likely because they relied on the cached sampler
state left from previous rendering to be close enough.
Reported-by: Marta Lofstedt <email@example.com>
Reviewed-by: Marta Lofstedt <firstname.lastname@example.org>
Reviewed-by: Jordan Justen <email@example.com>