Bug 98104

Summary: [!semaphore] Gpu hang recovery fails, causing reset loop
Product: DRI
Component: DRM/Intel
Reporter: Mika Kuoppala <mika.kuoppala>
Assignee: Mika Kuoppala <mika.kuoppala>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: medium
CC: eero.t.tamminen, intel-gfx-bugs
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=96743
Whiteboard:
i915 platform: ALL
i915 features: GPU hang

Description Mika Kuoppala 2016-10-06 06:06:25 UTC
Sometimes after a legitimate GPU hang, the recovery mechanism fails and
we get an endless stream of GPU resets at ~10 sec intervals, without progress,
and lose the GPU.
Comment 1 Chris Wilson 2016-10-06 06:49:16 UTC
Fixed by the multiple timeline series.

Alternatives: guilty = score >= HANGCHECK - RING_HUNG.
Don't reset engine->hangcheck (score, last seqno, last state).
Comment 2 yann 2016-10-06 07:09:05 UTC
Submitted Mika's patch: https://patchwork.freedesktop.org/series/13367/
Comment 4 Eero Tamminen 2016-10-06 11:11:58 UTC
So far, I've seen this issue only on GEN9+, with Mesa GPU hangs in GfxBench 4.0 CarChase offscreen (gl_4_off) GL 4.3 test-case.
Comment 5 Chris Wilson 2016-10-06 11:18:24 UTC
(In reply to Eero Tamminen from comment #4)
> So far, I've seen this issue only on GEN9+, with Mesa GPU hangs in GfxBench
> 4.0 CarChase offscreen (gl_4_off) GL 4.3 test-case.

Yes, but the failure is not specific to that. It just depends on having one engine wait on a fence (that is either another engine without semaphores enabled, or an external fence from another driver) for long enough that it accrues hangcheck's anger. In your case, we have the blitter waiting on the render engine that is hung and hangcheck considers them both hung. As it is likely that the render engine made some forward progress (instdone) whilst the blitter was idle waiting, the blitter gets blamed first for the render hang.
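
The blame order described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the i915 hangcheck code: the `Engine` class, the `hangcheck_tick` and `first_blamed` functions, and the penalty and threshold values are all hypothetical and chosen just to show the effect. The render engine is hung but still shows some activity for its first few samples, while the blitter sits idle behind a fence with a request queued.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; not taken from the i915 source.
BUSY_PENALTY = 1      # seqno did not advance, but the engine showed activity
STUCK_PENALTY = 20    # seqno did not advance and no activity was seen
HUNG_THRESHOLD = 31   # score at which an engine is declared hung


@dataclass
class Engine:
    name: str
    completed: int     # last seqno the hardware finished
    last_queued: int   # last seqno queued, even if still blocked on a fence
    score: int = 0


def hangcheck_tick(engine: Engine, saw_activity: bool) -> None:
    """Take one hangcheck sample and update the engine's score."""
    if engine.completed == engine.last_queued:
        engine.score = 0  # nothing outstanding, so nothing to blame
        return
    engine.score += BUSY_PENALTY if saw_activity else STUCK_PENALTY


def first_blamed(ticks: int):
    """Return the name of the first engine to cross the threshold, or None."""
    render = Engine("render", completed=10, last_queued=11)
    blitter = Engine("blitter", completed=5, last_queued=6)
    for tick in range(ticks):
        # The hung render engine still shows activity for its first 3 samples.
        hangcheck_tick(render, saw_activity=(tick < 3))
        # The blitter is idle, waiting on the render engine's fence.
        hangcheck_tick(blitter, saw_activity=False)
        for engine in (render, blitter):
            if engine.score >= HUNG_THRESHOLD:
                return engine.name
    return None
```

In this model the idle blitter accumulates the larger penalty on every sample, so it crosses the threshold before the render engine that is actually hung.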
Comment 6 Chris Wilson 2016-10-07 07:28:41 UTC
commit 8687b3ec852e89630bac650f15136811c7b4c1dc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 7 07:53:24 2016 +0100

    drm/i915: Distinguish last emitted request from last submitted request
    
    In order not to trigger hangcheck on an idle-but-waiting engine, we need
    to distinguish between the pending request queue and the actual
    execution queue. This is done later in "drm/i915: Enable multiple
    timelines" but for now we need a temporary fix to prevent blaming the
    wrong engine for a GPU hang.
    
    (Note that this causes a temporary subtle change in how we decide when
    to allow a waitboost to be re-awarded back to the waiter, the temporary
    effect is that if the wait is upon the most current execution the wait
    is given for free, instead of checking to see if the client stalled
    itself. This will be repaired in "drm/i915: Enable multiple timelines".)
    
    Fixes: 0a046a0e93d2 ("drm/i915: Nonblocking request submission")

And now back to the full series...
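
The idea in the commit message, separating requests that are merely pending from requests actually handed to the hardware, can be sketched as follows. This is a simplified model under stated assumptions, not the kernel implementation: the `Engine` fields and the two `looks_busy_*` helpers are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    completed: int        # last seqno the hardware finished
    last_submitted: int   # last seqno actually handed to the hardware
    last_emitted: int     # last seqno emitted, possibly still waiting on a fence


def looks_busy_before_fix(engine: Engine) -> bool:
    """Compare against the pending queue: a waiting engine looks busy."""
    return engine.completed != engine.last_emitted


def looks_busy_after_fix(engine: Engine) -> bool:
    """Compare against the execution queue: only submitted work counts."""
    return engine.completed != engine.last_submitted


# Blitter: one request emitted but still blocked on a fence, nothing executing.
blitter = Engine(completed=5, last_submitted=5, last_emitted=6)
# Render: one request submitted to the hardware and never completed.
render = Engine(completed=10, last_submitted=11, last_emitted=11)
```

With these two checks, the waiting blitter is considered busy only by the first check, whereas the render engine with unfinished submitted work is considered busy by both, so only the render engine remains a candidate for hangcheck.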
Comment 7 Eero Tamminen 2016-10-14 15:06:15 UTC
(In reply to Chris Wilson from comment #6)
> commit 8687b3ec852e89630bac650f15136811c7b4c1dc
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Oct 7 07:53:24 2016 +0100
> 
>     drm/i915: Distinguish last emitted request from last submitted request

FYI: The only platform (SKL GT2) where the GPU hangs *and* the resulting recovery failures were triggering has been working (recovering) fine since 7th Oct, i.e. the fix is effective.

Mika, from my point of view, you can mark this as verified.
