Sometimes after a legitimate GPU hang, the recovery mechanism fails and we get an endless stream of GPU resets at ~10 s intervals, with no forward progress, and lose the GPU.
Fixed by the multiple timeline series. Alternatives: guilty = score >= HANGCHECK - RING_HUNG. Don't reset engine->hangcheck (score, last seqno, last state).
Submitted Mika's patch: https://patchwork.freedesktop.org/series/13367/
Hack: https://patchwork.freedesktop.org/patch/113827/
Correct fix: https://patchwork.freedesktop.org/patch/111483/
So far, I've seen this issue only on GEN9+, with Mesa GPU hangs in GfxBench 4.0 CarChase offscreen (gl_4_off) GL 4.3 test-case.
(In reply to Eero Tamminen from comment #4)
> So far, I've seen this issue only on GEN9+, with Mesa GPU hangs in GfxBench
> 4.0 CarChase offscreen (gl_4_off) GL 4.3 test-case.

Yes, but the failure is not specific to that. It just depends on having one engine wait on a fence (that is either another engine without semaphores enabled, or an external fence from another driver) for long enough that it accrues hangcheck's anger.

In your case, we have the blitter waiting on the render engine that is hung and hangcheck considers them both hung. As it is likely that the render engine made some forward progress (instdone) whilst the blitter was idle waiting, the blitter gets blamed first for the render hang.
commit 8687b3ec852e89630bac650f15136811c7b4c1dc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 7 07:53:24 2016 +0100

    drm/i915: Distinguish last emitted request from last submitted request

    In order not to trigger hangcheck on a idle-but-waiting engine, we need
    to distinguish between the pending request queue and the actual
    execution queue. This is done later in "drm/i915: Enable multiple
    timelines" but for now we need a temporary fix to prevent blaming the
    wrong engine for a GPU hang.

    (Note that this causes a temporary subtle change in how we decide when
    to allow a waitboost to be re-awarded back to the waiter, the temporary
    effect is that if the wait is upon the most current execution the wait
    is given for free, instead of checking to see if the client stalled
    itself. This will be repaired in "drm/i915: Enable multiple timelines".)

    Fixes: 0a046a0e93d2 ("drm/i915: Nonblocking request submission")

And now back to the full series...
(In reply to Chris Wilson from comment #6)
> commit 8687b3ec852e89630bac650f15136811c7b4c1dc
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Fri Oct 7 07:53:24 2016 +0100
>
> drm/i915: Distinguish last emitted request from last submitted request

FYI: The only platform (SKL GT2) where the GPU hangs *and* the resulting recovery failures were triggering has been working (recovering) fine since 7th Oct, i.e. the fix is effective.

Mika, from my point of view, you can mark this as verified.