Mesa began running Google's dEQP validation suite on older platforms, and found that spurious/random GPU hangs occurred on SNB, especially SNB GT1. Turning off GPU reset did not prevent the GPU hangs reported in dmesg. In spite of the detected hangs, the test suite completed without error; if the GPU were *really* hung, we would be able to see which dEQP test was stuck. Disabling hangcheck on SNB systems has eliminated these spurious dEQP failures entirely. Google's suite includes some tests which are intended to generate GPU hangs (we skip those) and others which have long-running shaders that may not appear to make progress. Based on this behavior, it seems reasonable to tailor the hangcheck timer to the platform. Passing dEQP is important because Google uses the suite to validate platforms for ChromeOS, and Sandy Bridge is an important ChromeOS platform. We don't want a stable Mesa release to be blocked because of spurious GPU hangs on SNB.
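For reference, hangcheck was toggled for these runs roughly as sketched below. This is a minimal sketch, assuming the kernel exposes i915.enable_hangcheck as a writable module parameter in sysfs and that the script runs as root; adjust for your setup.

# Minimal sketch: toggle i915 hangcheck at runtime via sysfs.
# Assumes /sys/module/i915/parameters/enable_hangcheck exists and is
# writable (needs root).
from pathlib import Path

PARAM = Path("/sys/module/i915/parameters/enable_hangcheck")

def set_hangcheck(enabled: bool) -> None:
    # bool module parameters accept Y/N (or 1/0)
    PARAM.write_text("Y" if enabled else "N")

def hangcheck_enabled() -> bool:
    return PARAM.read_text().strip() in ("Y", "1")

if __name__ == "__main__":
    print("hangcheck enabled:", hangcheck_enabled())
    # set_hangcheck(False)  # disable before re-running the dEQP suite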
I'll attempt to find a single dEQP test which can trigger this behavior.
If it were a timer issue, it would not be random, and you could quickly identify the batch that was running for over 6s (which is a nasty DoS issue). Perhaps attaching a few of the hangs would help identify the common theme.
This hangcheck behavior can easily be reproduced with dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_fragment: if hangcheck is disabled, the test passes; by default, the test triggers a GPU hang. This test is not random, and there are more like it. For Mesa's CI, we disabled the dEQP tests that consistently produced GPU hangs. There are still an unknown number of enabled dEQP tests which trigger GPU hangs intermittently, but I don't have a good way to produce that list; running suspicious candidates individually failed to reproduce the hang. The set of intermittent hangs is beside the point, since several tests produce hangs reliably:

dEQP-GLES3.stress.long_running_shaders.long_static_while_vertex
dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_uniform_for_vertex
dEQP-GLES3.stress.long_running_shaders.long_uniform_do_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_vertex
dEQP-GLES3.stress.long_running_shaders.long_dynamic_for_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_while_vertex
dEQP-GLES3.stress.long_running_shaders.long_dynamic_while_fragment
dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_vertex

dmesg output:

[ 789.583298] [drm] stuck on render ring
[ 789.583856] [drm] GPU HANG: ecode 6:0:0x8588cff8, in deqp-gles3 [1116], reason: Ring hung, action: reset
[ 789.586392] drm/i915: Resetting chip after gpu hang
[ 799.592116] [drm] stuck on render ring
[ 799.592646] [drm] GPU HANG: ecode 6:0:0x4080ffff, in deqp-gles3 [1116], reason: Ring hung, action: reset
[ 799.594715] drm/i915: Resetting chip after gpu hang
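For anyone else reproducing this, a rough sketch of what I run is below. The deqp-gles3 binary path is specific to my build and --deqp-case is the stock dEQP case selector; treat both as assumptions about your local setup.

# Rough reproduction sketch: run one long_running_shaders case, then scan
# dmesg for hang reports. The binary path is an assumption about the local
# dEQP build; --deqp-case selects a single test case (or pattern).
import subprocess

DEQP = "./modules/gles3/deqp-gles3"  # adjust to your build tree
CASE = "dEQP-GLES3.stress.long_running_shaders.long_dynamic_do_while_fragment"

def run_case(case):
    return subprocess.call([DEQP, "--deqp-case=" + case])

def gpu_hangs_in_dmesg():
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if "GPU HANG" in line]

if __name__ == "__main__":
    run_case(CASE)
    for line in gpu_hangs_in_dmesg():
        print(line)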
The system I used to reproduce these failures was Debian Testing with the stock 4.3 kernel, on a dual-core SNB GT1 laptop.
Created attachment 120213 [details] [review] drm/i915: Teach hangcheck about long operations on rings
(In reply to Mika Kuoppala from comment #5)
> Created attachment 120213 [details] [review]
> drm/i915: Teach hangcheck about long operations on rings

I don't believe that flushing the tiny WRITE_CACHE is causing the huge delays, so I guess it must be the top-of-pipe synchronisation (i.e. the wait for the shader to complete). Still, this smells like a DoS issue inside the batch. So what is different if Mesa adds the PIPE_CONTROL as its last instruction in the batch?
I tested this with BDW, ppgtt=1 and execlists=0, as I didn't have an SNB available, and I reduced the hangcheck tick to 500ms in order to lure this condition into the light. So yes, the setup is far from SNB. But with those changes, I added MI_NOOPs after the flush, and multiple times the actual head stayed in the PIPE_CONTROL for > 500ms. I will post the patches to the list so we can continue the discussion there.
Created attachment 120223 [details] [review] drm/i915: Inspect subunit states on hangcheck
Created attachment 120224 [details] [review] drm/i915: Inspect subunit states on hangcheck
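To sketch the idea behind the patch (this is a conceptual illustration with made-up names, not the i915 implementation): rather than escalating toward a reset as soon as the ring's seqno stops advancing, hangcheck also compares per-subunit "instruction done" state between ticks, and treats any change there as progress on a long-running batch.

# Conceptual sketch only, with made-up names -- not the kernel code.
# subunit_sample stands in for INSTDONE-style register reads.
from dataclasses import dataclass

@dataclass
class HangState:
    last_sample: tuple = ()
    score: int = 0                              # escalates toward a reset

def hangcheck_tick(state, seqno_advanced, subunit_sample):
    if seqno_advanced:
        state.score = 0
        return "active"
    if subunit_sample != state.last_sample:     # subunits still moving
        state.last_sample = subunit_sample
        state.score = max(state.score - 1, 0)   # busy, just slow
        return "busy"
    state.score += 1                            # nothing changed at all
    return "hung" if state.score > 3 else "wait"

# A long shader whose subunit state keeps changing never reaches "hung":
s = HangState()
print(hangcheck_tick(s, False, (1, 2)))   # busy
print(hangcheck_tick(s, False, (3, 4)))   # busy
print(hangcheck_tick(s, False, (3, 4)))   # wait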
With Mika's patch, spurious hangchecks can still be generated by invoking the tests listed in the comments (long_running_shaders). These tests are blacklisted by Chrome's test suite. The good news is that all of the *intermittent* GPU hangs that were generated by the rest of the suite have been fixed by this patch. In my opinion, this patch should be applied to the kernel, as it improves the stability of many tests that Google uses for ChromeOS validation.
Valtteri Rantala has an automated system that can be used to reproduce/verify these gpu hangs.
(In reply to Mark Janes from comment #10)
> With Mika's patch, spurious hangchecks can still be generated by invoking
> the tests listed in the comments (long_running_shaders). These tests are
> blacklisted by Chrome's test suite.
>
> The good news is that all of the *intermittent* GPU hangs that were
> generated by the rest of the suite have been fixed by this patch.

Are the still-hanging tests too heavy for a slow GPU like SNB GT1?

What if the hang period were just increased while running them?
Eero: increasing the timer was my initial suggestion. SNB GT1 will definitely pass if the timer is increased, although I haven't investigated the threshold that is required. I'm not sure the timer should accommodate the "long_running_shaders" tests unless they are clearly targeted by Google's dEQP whitelist for Android conformance.
I'm not familiar with these tests.

Why are these particular tests named "long running", and how long are they _supposed_ to run?

Because I know that other gfx chip drivers also have watchdogs, I'm just wondering whether they are actually supposed to trigger hangcheck...
(In reply to Eero Tamminen from comment #12)
> (In reply to Mark Janes from comment #10)
> > With Mika's patch, spurious hangchecks can still be generated by invoking
> > the tests listed in the comments (long_running_shaders). These tests are
> > blacklisted by Chrome's test suite.
> >
> > The good news is that all of the *intermittent* GPU hangs that were
> > generated by the rest of the suite have been fixed by this patch.
>
> Are the still-hanging tests too heavy for a slow GPU like SNB GT1?
>
> What if the hang period were just increased while running them?

I would like to understand the tests a little better and try to improve the heuristics before adding a tunable.

Is this still blocking a release?
Mika: this bug is not a regression, and does not block a release. There may be a point (soon) at which Google will require stricter dEQP conformance, but based on the list here, the long_running_shaders tests are not required:

https://android.googlesource.com/platform/external/deqp/+/4423ddaeef6c6a252152f774b76950f62c412f94/android/cts

However, the hang detection's intermittent firing on *other* non-hanging dEQP tests creates instability in the test suite. The patch you attached to this bug improves stability, and I haven't seen any spurious dEQP failures using it on SNB (the long-running tests are disabled in CI).

(In reply to Mika Kuoppala from comment #15)
> (In reply to Eero Tamminen from comment #12)
> > (In reply to Mark Janes from comment #10)
> > > With Mika's patch, spurious hangchecks can still be generated by invoking
> > > the tests listed in the comments (long_running_shaders). These tests are
> > > blacklisted by Chrome's test suite.
> > >
> > > The good news is that all of the *intermittent* gpu hangs that were
> > > generated by the rest of the suite have been fixed by this patch.
> >
> > Are the still-hanging tests too heavy for a slow GPU like SNB GT1?
> >
> > What if the hang period were just increased while running them?
>
> I would like to understand the tests a little better and try to improve the
> heuristics before adding a tunable.
>
> Is this still blocking a release?
(In reply to Eero Tamminen from comment #14)
> I'm not familiar with these tests.
>
> Why are these particular tests named "long running", and how long are they
> _supposed_ to run?
>
> Because I know that other gfx chip drivers also have watchdogs, I'm just
> wondering whether they are actually supposed to trigger hangcheck...

Eero, you may be right about the intention of the tests; it's hard to know. However, it seems like they could use an infinite loop if they really wanted to trigger hangcheck.
commit 24a65e624bcdc726c7711ae90efeffaf0a8e9f32
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Wed Mar 2 16:48:29 2016 +0200

    drm/i915/hangcheck: Prevent long walks across full-ppgtt

makes hangcheck less aggressive against long-running consecutive batches. Please retest with the latest drm-intel-nightly.
Retesting with drm-intel-nightly shows great improvement. All of the tests which formerly could be used to trigger spurious hang detection are fixed, with the exception of:

dEQP-GLES3.stress.long_running_shaders.long_static_do_while_fragment

This test is not a requirement for Intel hardware to pass dEQP. From my perspective, this bug is resolved in drm-intel-nightly.
With this kernel deployed in Mesa's CI, the SNB GT1 machines produced a hard GPU hang within a few iterations of piglit. I don't have any information as to whether this is due to the hangcheck mechanism or to other issues with drm-intel-nightly.
Hi, can you update the status of this issue?
dEQP is now reliable on Linux 4.8.