| Field | Value |
| --- | --- |
| Summary | Hangcheck timer too aggressive to pass dEQP for SNB GT1 |
| Product | DRI |
| Component | DRM/Intel |
| Status | CLOSED FIXED |
| Severity | normal |
| Priority | medium |
| Version | XOrg git |
| Hardware | x86-64 (AMD64) |
| OS | Linux (All) |
| Whiteboard | |
| i915 platform | SNB |
| i915 features | GPU hang |
| Reporter | Mark Janes <mark.a.janes> |
| Assignee | Mika Kuoppala <mika.kuoppala> |
| QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| CC | eero.t.tamminen, intel-gfx-bugs, mattst88, valtteri.rantala |
| Attachments | drm/i915: Teach hangcheck about long operations on rings (120213); drm/i915: Inspect subunit states on hangcheck (120223, 120224) |
Description (Mark Janes, 2015-11-19 23:02:10 UTC)
I'll attempt to find a single dEQP test which can trigger this behavior.

If it was a timer issue it would not be random, and you could quickly identify the batch that was running for over 6s (which is a nasty DoS issue). Perhaps attaching a few of the hangs would help identify the common theme.

This hangcheck behavior can be easily reproduced with:

dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_fragment

If hangcheck is disabled, the test passes. By default, the test triggers a GPU hang. This test is not random, and there are more like it. For Mesa's CI, we disabled the dEQP tests that consistently produced a GPU hang. There are still an unknown number of enabled dEQP tests which trigger GPU hangs intermittently, but I don't have a good way to produce that list.

Running suspicious candidates individually failed to reproduce the hangcheck. The set of intermittent hangs is beside the point, since several tests produce hangs reliably:

dEQP-GLES3.stress.long_running_shaders.long_static_while_vertex
dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_uniform_for_vertex
dEQP-GLES3.stress.long_running_shaders.long_uniform_do_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_vertex
dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_while_vertex
dEQP-GLES3.stress.long_running_shaders.long_dynamic_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_vertex

dmesg output:

[ 789.583298] [drm] stuck on render ring
[ 789.583856] [drm] GPU HANG: ecode 6:0:0x8588cff8, in deqp-gles3 [1116], reason: Ring hung, action: reset
[ 789.586392] drm/i915: Resetting chip after gpu hang
[ 799.592116] [drm] stuck on render ring
[ 799.592646] [drm] GPU HANG: ecode 6:0:0x4080ffff, in deqp-gles3 [1116], reason: Ring hung, action: reset
[ 799.594715] drm/i915: Resetting chip after gpu hang

The system I used to reproduce these failures was Debian Testing, with the stock 4.3 kernel, on a dual-core SNB GT1 laptop.
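The reproduction above can be scripted. The following is a minimal sketch, not taken from the report itself: the dEQP binary path and the `--deqp-case` option are assumptions about a typical upstream dEQP build, and toggling hangcheck uses the i915 `enable_hangcheck` module parameter, which requires root.

```sh
#!/bin/sh
# Sketch: run one long_running_shaders case with hangcheck off and on,
# then look for i915 hang messages. Paths and option names are assumptions
# about a typical dEQP build; adjust for the local environment.

CASE=dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_fragment
DEQP=./deqp-gles3   # assumed location of the GLES3 test binary

# Temporarily disable the i915 hangcheck (requires root); the test should pass.
echo 0 | sudo tee /sys/module/i915/parameters/enable_hangcheck
"$DEQP" --deqp-case="$CASE"

# Re-enable hangcheck; by default the same test triggers a GPU hang/reset.
echo 1 | sudo tee /sys/module/i915/parameters/enable_hangcheck
"$DEQP" --deqp-case="$CASE"

# Check the kernel log for the hang signature quoted in this report.
dmesg | grep -E "GPU HANG|stuck on render ring"
```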
Created attachment 120213 [details] [review]
drm/i915: Teach hangcheck about long operations on rings

(In reply to Mika Kuoppala from comment #5)
> Created attachment 120213 [details] [review] [review]
> drm/i915: Teach hangcheck about long operations on rings

I don't believe that it is flushing the tiny WRITE_CACHE that is causing the huge delays, so I guess it must be the top-of-pipe synchronisation (i.e. the wait for the shader to complete). Still, this smells like a DoS issue inside the batch. So what is different if Mesa adds the PIPE_CONTROL as its last instruction in the batch?

I tested this with BDW, ppgtt=1 and execlists=0, as I didn't have SNB available. I reduced the hangcheck tick to 500ms in order to lure this condition into the light. So yes, the setup is far from SNB. But with those changes, and with MI_NOOPs added after the flush, on multiple occasions the actual head stayed in the PIPE_CONTROL for > 500ms. I will post the patches on the list so we can continue the discussion there.
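For reference, the BDW configuration described above maps onto i915 module parameters that existed in kernels of that era; the sketch below shows how such a setup is typically selected. The parameter names and semantics are recalled for roughly 4.3-4.5 era kernels and are stated here as assumptions; the 500ms hangcheck tick was a source-level change in the test tree rather than a runtime knob, so it does not appear as a parameter.

```sh
# Sketch of the BDW test configuration described above. Assumes a ~4.3-4.5 era
# kernel where these i915 module parameters were still present; they have since
# been removed or repurposed in newer kernels.
#
# Kernel command line (e.g. appended via the GRUB config):
#   i915.enable_ppgtt=1      "ppgtt=1": select aliasing PPGTT
#   i915.enable_execlists=0  "execlists=0": force legacy ring-buffer submission
#
# The values in effect can be read back at runtime:
cat /sys/module/i915/parameters/enable_ppgtt
cat /sys/module/i915/parameters/enable_execlists
```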
Created attachment 120223 [details] [review]
drm/i915: Inspect subunit states on hangcheck

Created attachment 120224 [details] [review]
drm/i915: Inspect subunit states on hangcheck

With Mika's patch, spurious hangchecks can still be generated by invoking the tests listed in the comments (long_running_shaders). These tests are blacklisted by Chrome's test suite.

The good news is that all of the *intermittent* gpu hangs that were generated by the rest of the suite have been fixed by this patch. In my opinion, this patch should be applied to the kernel, as it improves the stability of many tests as seen by Google for ChromeOS validation.

Valtteri Rantala has an automated system that can be used to reproduce/verify these gpu hangs.

(In reply to Mark Janes from comment #10)
> With Mika's patch, spurious hangchecks can still be generated by invoking
> the tests listed in the comments (long_running_shaders). These tests
> are blacklisted by Chrome's test suite.
>
> The good news is that all of the *intermittent* gpu hangs that were
> generated by the rest of the suite have been fixed by this patch.

Are the still-hanging tests too heavy for a slow GPU like SNB GT1?

What if the hang period were simply increased while running them?

Eero: increasing the timer was my initial suggestion. SNB GT1 will definitely pass if the timer is increased, although I haven't investigated the threshold that is required. I'm not sure the timer should accommodate the "long_running_shaders" tests unless they are clearly targeted by Google's dEQP whitelist for Android conformance.

I'm not familiar with these tests.

Why are these particular tests named "long running", and how long are they _supposed_ to run?

Because I know that other gfx chip drivers also have watchdogs, I'm just wondering whether they are actually supposed to trigger hangcheck...

(In reply to Eero Tamminen from comment #12)
> (In reply to Mark Janes from comment #10)
> > With Mika's patch, spurious hangchecks can still be generated by invoking
> > the tests listed in the comments (long_running_shaders). These tests
> > are blacklisted by Chrome's test suite.
> >
> > The good news is that all of the *intermittent* gpu hangs that were
> > generated by the rest of the suite have been fixed by this patch.
>
> Are the still-hanging tests too heavy for a slow GPU like SNB GT1?
>
> What if the hang period were simply increased while running them?

I would like to understand the tests a little better and try to improve the heuristics before adding a tunable.

Is this still blocking a release?

Mika: this bug is not a regression and does not block a release. There may be a point (soon) at which Google will require stricter dEQP conformance, but based on the list here, the long_running_shaders tests are not required:

https://android.googlesource.com/platform/external/deqp/+/4423ddaeef6c6a252152f774b76950f62c412f94/android/cts

However, the hang detection's intermittent firing on *other*, non-hanging dEQP tests creates instability in the test suite. The patch you attached to this bug improves stability, and I haven't seen any spurious dEQP failures using it on SNB (the long-running tests are disabled in CI).

(In reply to Mika Kuoppala from comment #15)
> (In reply to Eero Tamminen from comment #12)
> > (In reply to Mark Janes from comment #10)
> > > With Mika's patch, spurious hangchecks can still be generated by invoking
> > > the tests listed in the comments (long_running_shaders). These tests
> > > are blacklisted by Chrome's test suite.
> > >
> > > The good news is that all of the *intermittent* gpu hangs that were
> > > generated by the rest of the suite have been fixed by this patch.
> >
> > Are the still-hanging tests too heavy for a slow GPU like SNB GT1?
> >
> > What if the hang period were simply increased while running them?
>
> I would like to understand the tests a little better and try to improve the
> heuristics before adding a tunable.
>
> Is this still blocking a release?

(In reply to Eero Tamminen from comment #14)
> I'm not familiar with these tests.
>
> Why are these particular tests named "long running", and how long are they
> _supposed_ to run?
>
> Because I know that other gfx chip drivers also have watchdogs, I'm just
> wondering whether they are actually supposed to trigger hangcheck...

Eero, you may be right about the intention of the tests; it's hard to know. However, it seems like they could use an infinite loop if they really wanted to trigger hangcheck.

commit 24a65e624bcdc726c7711ae90efeffaf0a8e9f32
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Wed Mar 2 16:48:29 2016 +0200

    drm/i915/hangcheck: Prevent long walks across full-ppgtt

This makes hangcheck less aggressive against long-running consecutive batches. Please retest with the latest drm-intel-nightly.

Retesting with drm-intel-nightly shows great improvement. All of the tests which could formerly be used to trigger spurious hang detection are fixed, with the exception of:

dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment

This test is not a requirement for Intel hardware to pass dEQP. From my perspective, this bug is resolved in drm-nightly.

Hi! I am currently out of office 6.7-3.8 and have very limited access to mail during that time. Please contact Tomi Sarvela for Performance Benchmark System related questions, Martin Peres and Eero Tamminen for performance related questions, or my manager Jani Saarinen.

With this kernel deployed in Mesa's CI, the snbgt1 machines produced a hard GPU hang within a few iterations of piglit. I don't have any information as to whether this is due to the hangcheck mechanism or to other issues with drm-nightly.

Hi, can you update the status of this issue? dEQP is now reliable on Linux 4.8.
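As a closing note on verification: the thread identifies the fix only by commit id, so whether a given kernel contains it can be checked directly. A minimal sketch, assuming a local clone of an upstream kernel (or drm-intel) tree with release tags fetched:

```sh
# Sketch: check whether a kernel tree contains the hangcheck fix referenced
# above. Run inside a clone of the kernel git repository.
FIX=24a65e624bcdc726c7711ae90efeffaf0a8e9f32

# List release tags that already include the commit.
git tag --contains "$FIX"

# Or test one specific release, e.g. the v4.8 kernel mentioned in the last comment.
if git merge-base --is-ancestor "$FIX" v4.8; then
    echo "v4.8 contains the hangcheck fix"
else
    echo "v4.8 does not contain the hangcheck fix"
fi
```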