Bug 93029

Summary: Hangcheck timer too aggressive to pass dEQP for SNBGT1
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: normal
Priority: medium
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: SNB
i915 features: GPU hang
Reporter: Mark Janes <mark.a.janes>
Assignee: Mika Kuoppala <mika.kuoppala>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: eero.t.tamminen, intel-gfx-bugs, mattst88, valtteri.rantala
Attachments:
Description                                                Flags
drm/i915: Teach hangcheck about long operations on rings   none
drm/i915: Inspect subunit states on hangcheck              none
drm/i915: Inspect subunit states on hangcheck              none

Description Mark Janes 2015-11-19 23:02:10 UTC
Mesa began running Google's dEQP validation suite on older platforms, and found that spurious, random GPU hangs occurred on SNB, especially SNBGT1.

Turning off GPU reset did not prevent the GPU hangs reported in dmesg.  Despite the detected hangs, the test suite completed without error.  If the GPU were *really* hung, the suite would have stalled and we would be able to see which dEQP test was stuck.

Disabling hang check on SNB systems has eliminated these spurious dEQP failures entirely.

Google's suite includes some tests which are intended to generate GPU Hangs (we skip those) and others which have long-running shaders that may not appear to make progress.

Based on this behavior, it seems reasonable to tailor the hangcheck timer to the platform.

Passing dEQP is important because Google uses the suite to validate platforms for ChromeOS.  Sandy Bridge is an important ChromeOS platform.  We don't want a stable Mesa release to be blocked because of spurious GPU hangs on SNB.
Comment 1 Mark Janes 2015-11-19 23:02:32 UTC
I'll attempt to find a single dEQP test which can trigger this behavior.
Comment 2 Chris Wilson 2015-11-19 23:13:43 UTC
If it was a timer issue it would not be random, and you could quickly identify the batch that was running for over 6s. (Which is a nasty DoS issue). Perhaps attaching a few of the hangs would help identify the common theme.
Comment 3 Mark Janes 2015-11-25 21:43:02 UTC
This hangcheck behavior can be easily reproduced with:

dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_fragment

If hangcheck is disabled, the test passes.  By default, the test triggers a GPU hang.

This test is not random, and there are more like it.  For Mesa's CI, we disabled the dEQP tests that consistently produced a GPU hang.  An unknown number of the dEQP tests that remain enabled still trigger a GPU hang intermittently, but I don't have a good way to produce that list.  Running suspicious candidates individually failed to reproduce the hangcheck.

The set of intermittent hangs is beside the point, since several tests produce hangs reliably:

dEQP-GLES3.stress.long_running_shaders.long_static_while_vertex
dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_uniform_for_vertex
dEQP-GLES3.stress.long_running_shaders.long_uniform_do_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_vertex
dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_while_vertex
dEQP-GLES3.stress.long_running_shaders.long_dynamic_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_vertex

dmesg output:

[  789.583298] [drm] stuck on render ring
[  789.583856] [drm] GPU HANG: ecode 6:0:0x8588cff8, in deqp-gles3 [1116], reason: Ring hung, action: reset
[  789.586392] drm/i915: Resetting chip after gpu hang
[  799.592116] [drm] stuck on render ring
[  799.592646] [drm] GPU HANG: ecode 6:0:0x4080ffff, in deqp-gles3 [1116], reason: Ring hung, action: reset
[  799.594715] drm/i915: Resetting chip after gpu hang
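
When comparing several hang reports like the ones above, it can help to pull the fields out of dmesg mechanically. The following is a minimal Python sketch, assuming the "GPU HANG" line layout is exactly as shown in this comment; the function name `parse_hangs` and the dictionary keys are made up for illustration and are not part of any kernel or Mesa tooling.

```python
import re

# Pattern for the "[drm] GPU HANG" lines quoted in this bug.
# Assumption: the layout matches the 4.3-era output shown above.
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>\S+), "
    r"in (?P<proc>\S+) \[(?P<pid>\d+)\], reason: (?P<reason>[^,]+), "
    r"action: (?P<action>\w+)"
)


def parse_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in dmesg_text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs.append({
                "timestamp": float(match.group("ts")),
                "ecode": match.group("ecode"),
                "process": match.group("proc"),
                "pid": int(match.group("pid")),
                "reason": match.group("reason"),
                "action": match.group("action"),
            })
    return hangs


# The dmesg excerpt from this comment, used as sample input.
SAMPLE = """\
[  789.583298] [drm] stuck on render ring
[  789.583856] [drm] GPU HANG: ecode 6:0:0x8588cff8, in deqp-gles3 [1116], reason: Ring hung, action: reset
[  789.586392] drm/i915: Resetting chip after gpu hang
[  799.592116] [drm] stuck on render ring
[  799.592646] [drm] GPU HANG: ecode 6:0:0x4080ffff, in deqp-gles3 [1116], reason: Ring hung, action: reset
[  799.594715] drm/i915: Resetting chip after gpu hang
"""

if __name__ == "__main__":
    hangs = parse_hangs(SAMPLE)
    for hang in hangs:
        print(hang["timestamp"], hang["process"], hang["pid"], hang["ecode"])
    # Time between consecutive hang reports, in seconds.
    print(hangs[1]["timestamp"] - hangs[0]["timestamp"])
```

On the excerpt above, both hangs come from the same process (deqp-gles3, pid 1116) and are reported roughly ten seconds apart.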
Comment 4 Mark Janes 2015-11-25 21:45:27 UTC
The system I used to reproduce these failures was Debian Testing, with the stock 4.3 kernel, on a dual core SNBGT1 laptop.
Comment 5 Mika Kuoppala 2015-11-30 16:26:19 UTC
Created attachment 120213
drm/i915: Teach hangcheck about long operations on rings
Comment 6 Chris Wilson 2015-11-30 16:38:22 UTC
(In reply to Mika Kuoppala from comment #5)
> Created attachment 120213
> drm/i915: Teach hangcheck about long operations on rings

I don't believe that it is flushing the tiny WRITE_CACHE that is causing the huge delays, so I guess it must be the top-of-pipe synchronisation (i.e. the wait for the shader to complete). Still, this smells like a DoS issue inside the batch.

So what is different if Mesa adds the PIPE_CONTROL as its last instruction in the batch?
Comment 7 Mika Kuoppala 2015-11-30 16:48:22 UTC
I tested this with bdw, ppgtt=1 and execlists=0, as I didn't have
snb available. I reduced the hangcheck tick to 500ms, in order to bring this condition to light. So yes, the setup is far from snb.

But with those changes, I added MI_NOOPs after the flush, and on multiple
occasions the actual head stayed in PIPE_CONTROL for > 500ms.

I will post patches to the list so we can continue the discussion there.
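
The observation above turns on how a tick-based detector treats a ring whose head sits on one instruction for longer than a tick. As a rough illustration only, here is a toy Python model, assuming a detector that samples a sequence number and a head pointer on each tick and counts consecutive ticks with no change. The class names, fields, and thresholds are hypothetical, and the real i915 heuristics are more involved than this.

```python
from dataclasses import dataclass


@dataclass
class RingSample:
    seqno: int  # last completed request (hypothetical field)
    head: int   # position of the ring head (hypothetical field)


class Hangcheck:
    """Toy progress-based hang detector; NOT the i915 implementation."""

    def __init__(self, hung_after_ticks):
        self.hung_after_ticks = hung_after_ticks
        self.last = None
        self.stalled_ticks = 0

    def tick(self, sample):
        """Record one sample; return True if the ring is declared hung."""
        if self.last is not None and sample == self.last:
            self.stalled_ticks += 1   # no visible progress since last tick
        else:
            self.stalled_ticks = 0    # something moved, start over
        self.last = sample
        return self.stalled_ticks >= self.hung_after_ticks


def run(detector, samples):
    """Feed samples in order; True if a hang is declared at any tick."""
    return any(detector.tick(sample) for sample in samples)


# A long but finite operation: the head stays put for three ticks,
# then the request completes and the head moves on.
stuck = RingSample(seqno=10, head=0x40)
samples = [stuck, stuck, stuck, RingSample(seqno=11, head=0x80)]

if __name__ == "__main__":
    print(run(Hangcheck(hung_after_ticks=3), samples))  # patient detector
    print(run(Hangcheck(hung_after_ticks=2), samples))  # impatient detector
```

In this model the same workload is judged healthy or hung purely depending on how many stalled ticks the detector tolerates, which is the kind of false positive a shorter tick makes easier to provoke.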
Comment 8 Mika Kuoppala 2015-12-01 12:15:51 UTC
Created attachment 120223
drm/i915: Inspect subunit states on hangcheck
Comment 9 Mika Kuoppala 2015-12-01 15:43:00 UTC
Created attachment 120224
drm/i915: Inspect subunit states on hangcheck
Comment 10 Mark Janes 2015-12-07 19:55:13 UTC
With Mika's patch, spurious hangchecks can still be generated by invoking the tests listed in the comments (long_running_shaders).  These tests are blacklisted by Chrome's test suite.

The good news is that all of the *intermittent* gpu hangs that were generated by the rest of the suite have been fixed by this patch.

In my opinion, this patch should be applied to the kernel, as it improves the stability of many tests as seen by Google for ChromeOS validation.
Comment 11 Mark Janes 2015-12-10 18:14:50 UTC
Valtteri Rantala has an automated system that can be used to reproduce/verify these gpu hangs.
Comment 12 Eero Tamminen 2015-12-11 09:34:04 UTC
(In reply to Mark Janes from comment #10)
> With Mika's patch, spurious hangchecks can still be generated by invoking
> the tests listed in the comments (long_running_shaders).  These tests
> are blacklisted by Chrome's test suite.
> 
> The good news is that all of the *intermittent* gpu hangs that were
> generated by the rest of the suite have been fixed by this patch.

Are the tests that still hang too heavy for a slow GPU like SNB GT1?

What if the hang period were simply increased while running them?
Comment 13 Mark Janes 2015-12-11 18:12:04 UTC
Eero: increasing the timer was my initial suggestion.  SNBGT1 will definitely pass if the timer is increased, although I haven't investigated the threshold that is required.

I'm not sure the timer should accommodate the "long_running_shaders" tests unless they are clearly targeted by Google's dEQP whitelist for Android conformance.
Comment 14 Eero Tamminen 2015-12-14 08:41:48 UTC
I'm not familiar with these tests.

Why are these particular tests named "long running", and how long are they _supposed_ to run?

Since other gfx chip drivers also have watchdogs, I'm just wondering whether they are actually supposed to trigger hangcheck...
Comment 15 Mika Kuoppala 2015-12-14 09:04:24 UTC
(In reply to Eero Tamminen from comment #12)
> (In reply to Mark Janes from comment #10)
> > With Mika's patch, spurious hangchecks can still be generated by invoking
> > the tests listed in the comments (long_running_shaders).  These tests
> > are blacklisted by Chrome's test suite.
> > 
> > The good news is that all of the *intermittent* gpu hangs that were
> > generated by the rest of the suite have been fixed by this patch.
> 
> Are the tests that still hang too heavy for a slow GPU like SNB GT1?
> 
> What if the hang period were simply increased while running them?

I would like to understand the tests a little better and try to improve the heuristics, before adding a tunable.

Is this still blocking a release?
Comment 16 Mark Janes 2015-12-14 17:03:58 UTC
Mika: this bug is not a regression, and does not block a release.  There may be a point (soon) at which Google will require stricter dEQP conformance, but based on the list here, the long_running_shaders are not required.

https://android.googlesource.com/platform/external/deqp/+/4423ddaeef6c6a252152f774b76950f62c412f94/android/cts

However, the hang detection's intermittent firing on *other* non-hanging dEQP tests creates instability in the test suite.  The patch you attached to this bug improves stability, and I haven't seen any spurious dEQP failures using it on SNB (the long_running_shaders tests are disabled in CI).
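
For reference, disabling a test group in a CI run amounts to filtering the case list before it is handed to the runner. The snippet below is a minimal Python sketch of that idea, assuming a plain list of case names; `SKIP_PATTERNS` and `filter_cases` are hypothetical names and do not reflect how Mesa's CI is actually configured.

```python
from fnmatch import fnmatchcase

# Hypothetical skip list; the pattern mirrors the test group named in this bug.
SKIP_PATTERNS = ["dEQP-GLES3.stress.long_running_shaders.*"]


def filter_cases(cases, skip_patterns=SKIP_PATTERNS):
    """Return the cases that match none of the skip patterns."""
    return [
        case for case in cases
        if not any(fnmatchcase(case, pattern) for pattern in skip_patterns)
    ]


# Example input: two cases from this bug plus one unrelated case
# (the unrelated name is only an illustration).
ALL_CASES = [
    "dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment",
    "dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_vertex",
    "dEQP-GLES3.info.version",
]

if __name__ == "__main__":
    for case in filter_cases(ALL_CASES):
        print(case)
```

Matching on the group prefix rather than listing individual cases keeps the skip list stable if more long_running_shaders variants are added later.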

Comment 17 Mark Janes 2015-12-14 17:07:28 UTC
(In reply to Eero Tamminen from comment #14)
> I'm not familiar with these tests.
> 
> Why are these particular tests named "long running", and how long are they
> _supposed_ to run?
> 
> Since other gfx chip drivers also have watchdogs, I'm just
> wondering whether they are actually supposed to trigger hangcheck...

Eero, you may be right about the intention of the tests; it's hard to know.  However, it seems like they could use an infinite loop if they really wanted to trigger hangcheck.
Comment 18 Mika Kuoppala 2016-03-21 12:43:06 UTC
commit 24a65e624bcdc726c7711ae90efeffaf0a8e9f32
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Wed Mar 2 16:48:29 2016 +0200

    drm/i915/hangcheck: Prevent long walks across full-ppgtt

makes hangcheck less aggressive against long-running consecutive batches.

Please retest with latest drm-intel-nightly.
Comment 19 Mark Janes 2016-04-12 17:16:31 UTC
Retesting with drm-nightly shows great improvement.  All of the tests which formerly could be used to trigger spurious hang detection are fixed, with the exception of:

dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment

This test is not a requirement for Intel hardware to pass dEQP.

From my perspective, this bug is resolved in drm-nightly.
Comment 20 valtteri.rantala 2016-04-12 17:17:15 UTC
Hi!

I am currently out of office 6.7-3.8 and have very limited access to mail during that time.
Please contact Tomi Sarvela with Performance Benchmark System related questions, Martin Peres and Eero Tamminen with performance related questions, or my manager Jani Saarinen.
Comment 21 Mark Janes 2016-04-12 19:00:53 UTC
With this kernel deployed in Mesa's CI, the snbgt1 machines produced a hard GPU hang within a few iterations of piglit.

I don't have any information as to whether this is due to the hangcheck mechanism or other issues with drm-nightly.
Comment 22 yann 2016-09-20 15:01:31 UTC
Hi, can you update the status of this issue?
Comment 23 Mark Janes 2017-04-03 13:39:19 UTC
dEQP is now reliable on Linux 4.8.
