Bug 91718

Summary: piglit.spec.arb_shader_image_load_store.invalid causes intermittent GPU HANG
Product: Mesa
Reporter: Mark Janes <mark.a.janes>
Component: Drivers/DRI/i965
Assignee: Francisco Jerez <currojerez>
Status: RESOLVED FIXED
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: major
Priority: medium
CC: currojerez, mark.a.janes
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
  reproduces gpu hang with image_load_store tests
  gen7_prevent_untyped_format_mismatch.patch

Description Mark Janes 2015-08-21 17:36:39 UTC
Created attachment 117842 [details]
reproduces gpu hang with image_load_store tests

Mesa's CI system has been encountering intermittent GPU hangs since the introduction of arb_shader_image_load_store.

I was able to reproduce this easily on IVB by repeatedly running image_load_store tests and watching syslog.  It may also occur on SNB, BSW, and BDW, because those systems have encountered a similar increase in GPU hangs.

I've attached the trivial shell script I used to reproduce it.
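
The attached script itself is not reproduced in this report. As a rough illustration of the approach described above (run a test over and over and watch the kernel log for a hang), a minimal Python sketch might look like the following. The function names, the default "GPU HANG" log marker, and the use of dmesg in place of syslog are assumptions made for illustration; the test command line is supplied by the caller and is not taken from the attachment.

```python
import subprocess


def count_hang_lines(log_text, marker="GPU HANG"):
    """Count kernel log lines containing the marker (case-insensitive).

    The default marker is an assumption about how the kernel reports a hang.
    """
    needle = marker.lower()
    return sum(1 for line in log_text.splitlines() if needle in line.lower())


def read_kernel_log():
    """Return the kernel ring buffer as text via dmesg (may need privileges)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout


def run_until_hang(test_cmd, max_runs=100, run_test=None, read_log=read_kernel_log):
    """Run test_cmd repeatedly until a new hang line appears in the log.

    Returns the 1-based iteration at which the hang was first seen,
    or None if no new hang line appeared within max_runs.
    """
    if run_test is None:
        # Default runner: execute the command and ignore its exit status,
        # since a test can report success and still hang the GPU.
        def run_test(cmd):
            subprocess.run(cmd, check=False)

    # Hang lines already in the log before we start must not count.
    baseline = count_hang_lines(read_log())
    for iteration in range(1, max_runs + 1):
        run_test(test_cmd)
        if count_hang_lines(read_log()) > baseline:
            return iteration
    return None
```

The runner and the log reader are parameters so the loop can be exercised without a GPU or a piglit build. Comparing against a baseline count avoids reporting a hang that was already in the log before the first run.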
Comment 1 Mark Janes 2015-08-21 17:37:57 UTC
I will disable image_load_store tests on the CI, because the hangs interfere with regression analysis.
Comment 2 Francisco Jerez 2015-08-21 18:23:38 UTC
It sounds like this is mixing a number of unrelated issues.  If you say that arb_shader_image_load_store.invalid is causing a gpu hang, what's the rationale behind disabling all arb_shader_image_load_store tests in the CI?

Regarding BSW and BDW, the issue with the hangs is known; a fix has been on the mailing list waiting for review since last week [1].

Regarding IVB, it would be useful to know what test is causing the hang.  Are there any piglit failures occurring at the same time as the hang?

Regarding SNB, it doesn't even expose ARB_shader_image_load_store so the piglit tests you say are at fault aren't being run in the first place.  Again it would help to know what test is causing the hang, but it cannot be caused by any of the ARB_shader_image_load_store ones.

[1] http://lists.freedesktop.org/archives/mesa-dev/2015-August/091705.html
Comment 3 Mark Janes 2015-08-21 18:45:38 UTC
Rationale for disabling the test:  any gpu hang results in spurious errors being reported by other tests.  Any intermittent behavior generates spurious results that have to be analyzed.

If there is a smaller set of tests that can be disabled, let me know and I will do it.  

The test that caused the hang was:

piglit.spec.arb_shader_image_load_store.invalid

I verified this by running with reset disabled, and checking the tests that were running.

I can re-enable bsw/bdw when the UAV patch is merged.  Please make a comment in this bug or contact me when it gets in.

Based on your advice, I'll follow up and figure out what test is hanging on snb.
Comment 4 Francisco Jerez 2015-08-21 18:53:25 UTC
(In reply to Mark Janes from comment #3)
> Rationale for disabling the test:  any gpu hang results in spurious errors
> being reported by other tests.  Any intermittent behavior generates spurious
> results that have to be analyzed.
> 
> If there is a smaller set of tests that can be disabled, let me know and I
> will do it.  
> 
> The test that caused the hang was:
> 
> piglit.spec.arb_shader_image_load_store.invalid
> 

You seem to be answering your own question.

> I verified this by running with reset disabled, and checking the tests that
> were running.
> 

On what platform?

> I can re-enable bsw/bdw when the UAV patch is merged.  Please make a comment
> in this bug or contact me when it gets in.
> 
I suggest you disable piglit.spec.arb_shader_image_load_store.host-mem-barrier on BSW and BDW until the patch is merged.  Or are you able to reproduce a hang on those platforms with that one test disabled?

> Based on your advice, I'll follow up and figure out what test is hanging on
> snb.
Comment 5 Mark Janes 2015-08-21 20:08:52 UTC
I see.  You want me to do the legwork to determine which subset of your tests causes intermittent hangs on which platforms, then disable the minimal set.

That will take some time.
Comment 6 Francisco Jerez 2015-08-22 01:48:58 UTC
(In reply to Mark Janes from comment #5)
> I see.  You want me to do the legwork to determine which subset of your
> tests causes intermittent hangs on which platforms, then disable the minimal
> set.
> 
> That will take some time.

No, not at all; that would have been too much to ask.  I just asked you two simple questions.  I won't be able to reproduce this bug unless I know:

 - What platform(s) you saw the piglit.spec.arb_shader_image_load_store.invalid hang on.  Originally you said IVB, SNB, BSW and BDW, but I was skeptical about this bit of information because invalid isn't run on SNB and because BSW and BDW suffer from another known hang bug unlikely to affect the "invalid" test, so I wanted you to confirm where you had actually seen it.

 - Whether you have seen a test other than piglit.spec.arb_shader_image_load_store.host-mem-barrier cause a gpu hang on BSW and BDW.  If you haven't, there is no reason for me to try to reproduce it, and no reason for you to black-list them, which would be counter-productive and possibly conceal an unrelated gpu hang if there actually is one.

Thank you.
Comment 7 Mark Janes 2015-08-24 19:53:15 UTC
OK, thanks for clarifying.  I only saw the piglit.spec.arb_shader_image_load_store.invalid test hang on IVB.  I don't have a way to audit historical results to understand which tests cause gpu hangs.

To find out which test is hanging, I have to take a machine out of the test pool, boot it with i915.reset=0, and repeatedly run piglit until I get a hang.  Then I can look in the test directory to see which tests were running at the time.

If reset is enabled, then there is no information that I can use to figure out which tests were running when the hang occurred.  For example, arb_shader_image_load_store.invalid triggers the hang while reporting success.

IVB had the most common hangs, so I investigated that platform first.  BSW is encountering hangs even with host-mem-barrier disabled, but I have to do some investigation before I can tell you what test is the culprit.  I have no information indicating it is from image_load_store.

I'll try to catch a hang on BSW tonight.
Comment 8 Mark Janes 2015-08-25 03:08:14 UTC
I enabled arb_shader_image_load_store.invalid on all systems other than IVB.  It appears to have caused a hang on BYT:

http://otc-mesa-ci.jf.intel.com/job/Leeroy/163544/

Let me know if you think "invalid" should be reliable on BYT.
Comment 9 Francisco Jerez 2015-08-25 17:06:41 UTC
(In reply to Mark Janes from comment #8)
> I enabled arb_shader_image_load_store.invalid on all systems other than IVB.
> It looks to have created a hang on BYT
> 
> http://otc-mesa-ci.jf.intel.com/job/Leeroy/163544/
> 
> Let me know if you think "invalid" should be reliable on BYT.

Thank you.  I've been able to reproduce the hang locally today.  We seem to be hitting another hardware bug in the "invalid/format mismatch" subtests causing a GPU hang when a shader uses an untyped format but the bound surface is not of type RAW.  It's likely to affect both IVB and BYT equally, but other gens should be okay.  The attached workaround seems to fix it reliably on at least IVB.
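
The hang condition described in this comment can be modeled as a simple predicate. The Python sketch below is only a conceptual illustration and is not the attached patch, whose contents are not shown in this report. All names are hypothetical, and the set of affected platforms reflects the comment's "likely to affect both IVB and BYT" rather than a confirmed list.

```python
# Platforms this comment says are likely affected (an assumption, not a confirmed list).
AFFECTED_PLATFORMS = frozenset({"IVB", "BYT"})


def untyped_access_may_hang(platform, shader_uses_untyped_format, surface_type):
    """Return True when the described format mismatch could hang the GPU.

    The condition is: an affected platform, a shader using an untyped
    format, and a bound surface whose type is not RAW.
    """
    return (
        platform in AFFECTED_PLATFORMS
        and shader_uses_untyped_format
        and surface_type != "RAW"
    )


def should_issue_access(platform, shader_uses_untyped_format, surface_type):
    """Hypothetical guard: skip the access whenever it could hang."""
    return not untyped_access_may_hang(
        platform, shader_uses_untyped_format, surface_type
    )
```

Under this model, an untyped access to a RAW surface is always issued, and a mismatched access is skipped only on the platforms listed as affected.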
Comment 10 Francisco Jerez 2015-08-25 17:08:01 UTC
Created attachment 117911 [details] [review]
gen7_prevent_untyped_format_mismatch.patch
Comment 11 Mark Janes 2015-09-08 22:53:12 UTC
I tested https://patchwork.freedesktop.org/patch/58594/ and found that it resolved the gpu hangs on IVB.
Comment 12 Francisco Jerez 2015-09-28 15:37:09 UTC
Fix pushed as b61292296bd7e1876fdb64725a783a7e96f6c4c1.
Comment 13 Mark Janes 2016-01-21 20:04:53 UTC
In the past week or so, ivbgt1 has been intermittently hanging on this test.  I'll try to get a reliable way to reproduce it.
Comment 14 Mark Janes 2016-01-21 20:47:27 UTC
This hang can easily be reproduced on ivbgt1 by repeatedly running arb_shader_image_load_store.invalid.  In my test, it took about 5 iterations to trigger the hang.

Curro, does your previous resolution to this bug provide any insight into what might be going wrong?
Comment 15 Francisco Jerez 2016-01-21 22:28:53 UTC
The intermittent hang we had seen in this test on IVB was fixed by b61292296bd7e1876fdb64725a783a7e96f6c4c1.  This is likely unrelated; it might be a regression.
Comment 16 Mark Janes 2016-03-02 00:27:21 UTC
I can't reproduce this with Linux 4.4.
