Bug 106560 - [CI] igt@drv_selftest@live_hangcheck - igt_reset_engines failed with error
Summary: [CI] igt@drv_selftest@live_hangcheck - igt_reset_engines failed with error
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Whiteboard: ReadyForDev
 
Reported: 2018-05-18 07:21 UTC by Tomi Sarvela
Modified: 2018-09-07 16:16 UTC
i915 platform: BSW/CHT, BXT, CFL, GLK, KBL
i915 features: GPU hang



Description Tomi Sarvela 2018-05-18 07:21:59 UTC
IGT test drv_selftest@live_hangcheck started failing yesterday. Issue is visible on APL, KBL and GLK:

https://intel-gfx-ci.01.org/tree/drm-tip/igt@drv_selftest@live_hangcheck.html
Comment 1 Chris Wilson 2018-05-18 07:34:48 UTC
https://patchwork.freedesktop.org/series/43344/ seems to do the trick
Comment 2 Chris Wilson 2018-05-25 12:43:58 UTC
Hopefully,

commit 9a4dc80399b1630cea0f1ad8ef0417436cbb95d0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 18 11:09:33 2018 +0100

    drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
    
    Inside the live_hangcheck (reset) selftests, we occasionally see
    failures like
    
    <7>[  239.094840] i915_gem_set_wedged rcs0
    <7>[  239.094843] i915_gem_set_wedged   current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
    <7>[  239.094846] i915_gem_set_wedged   Reset count: 6239 (global 1)
    <7>[  239.094848] i915_gem_set_wedged   Requests:
    <7>[  239.095052] i915_gem_set_wedged           first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
    <7>[  239.095056] i915_gem_set_wedged           last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
    <7>[  239.095059] i915_gem_set_wedged           active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
    <7>[  239.095062] i915_gem_set_wedged           [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
    <7>[  239.100050] i915_gem_set_wedged           ring->start:  0x00283000
    <7>[  239.100053] i915_gem_set_wedged           ring->head:   0x000001f8
    <7>[  239.100055] i915_gem_set_wedged           ring->tail:   0x000002a8
    <7>[  239.100057] i915_gem_set_wedged           ring->emit:   0x000002a8
    <7>[  239.100059] i915_gem_set_wedged           ring->space:  0x00000f10
    <7>[  239.100085] i915_gem_set_wedged   RING_START: 0x00283000
    <7>[  239.100088] i915_gem_set_wedged   RING_HEAD:  0x00000260
    <7>[  239.100091] i915_gem_set_wedged   RING_TAIL:  0x000002a8
    <7>[  239.100094] i915_gem_set_wedged   RING_CTL:   0x00000001
    <7>[  239.100097] i915_gem_set_wedged   RING_MODE:  0x00000300 [idle]
    <7>[  239.100100] i915_gem_set_wedged   RING_IMR: fffffefe
    <7>[  239.100104] i915_gem_set_wedged   ACTHD:  0x00000000_0000609c
    <7>[  239.100108] i915_gem_set_wedged   BBADDR: 0x00000000_0000609d
    <7>[  239.100111] i915_gem_set_wedged   DMA_FADDR: 0x00000000_00283260
    <7>[  239.100114] i915_gem_set_wedged   IPEIR: 0x00000000
    <7>[  239.100117] i915_gem_set_wedged   IPEHR: 0x02800000
    <7>[  239.100120] i915_gem_set_wedged   Execlist status: 0x00044052 00000002
    <7>[  239.100124] i915_gem_set_wedged   Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
    <7>[  239.100128] i915_gem_set_wedged           ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
    <7>[  239.100132] i915_gem_set_wedged           ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
    <7>[  239.100135] i915_gem_set_wedged           HW active? 0x5
    <7>[  239.100250] i915_gem_set_wedged           E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
    <7>[  239.100338] i915_gem_set_wedged           E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
    <7>[  239.100340] i915_gem_set_wedged           Queue priority: 139
    <7>[  239.100343] i915_gem_set_wedged           Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
    <7>[  239.100346] i915_gem_set_wedged           Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
    <7>[  239.100349] i915_gem_set_wedged           Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
    <7>[  239.100352] i915_gem_set_wedged           Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
    <7>[  239.100356] i915_gem_set_wedged           Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
    <7>[  239.100362] i915_gem_set_wedged   drv_selftest [5894] waiting for 19a99
    
    where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
    The RING_MODE indicates that is idle and has the STOP_RING bit set, so
    try clearing it.
    
    v2: Only clear the bit on restarting the ring, as we want to be sure the
    STOP_RING bit is kept if reset fails on wedging.
    v3: Spot when the ring state doesn't make sense when re-initialising the
    engine and dump it to the logs so that we don't have to wait for an
    error later and try to guess what happened earlier.
    v4: Prepare to print all the unexpected state, not just the first.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180518100933.2239-1-chris@chris-wilson.co.uk
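
For anyone decoding the dump above: RING_MODE: 0x00000300 carries both flags the commit message refers to. The following is a minimal sketch rather than driver code. It assumes the bit positions STOP_RING = bit 8 and MODE_IDLE = bit 9 (believed to match the i915 register header, but check your tree) and the masked-register convention in which the upper 16 bits of a write select which of the lower 16 bits are updated. The helper names are hypothetical.

```python
# Assumed bit layout of RING_MI_MODE; verify against the kernel tree in use.
STOP_RING = 1 << 8
MODE_IDLE = 1 << 9


def decode_mi_mode(value):
    """Split a RING_MI_MODE readback into the two flags discussed above."""
    return {
        "stop_ring": bool(value & STOP_RING),
        "idle": bool(value & MODE_IDLE),
    }


def masked_bit_disable(bit):
    """Value to write to a masked register in order to clear `bit`.

    The upper 16 bits are the write mask; the lower 16 bits carry the new
    value (zero here), so only `bit` is touched.
    """
    return bit << 16


def apply_masked_write(current, write):
    """Model how a masked write updates the low 16 bits of the register."""
    mask = (write >> 16) & 0xFFFF
    value = write & 0xFFFF
    return (current & ~mask & 0xFFFF) | (value & mask)


state = decode_mi_mode(0x00000300)  # RING_MODE value from the log above
print(state)
print(hex(apply_masked_write(0x0300, masked_bit_disable(STOP_RING))))
```

The model covers only the register write. On real hardware the idle bit would be expected to change once the engine resumes.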
Comment 4 Chris Wilson 2018-07-12 13:27:03 UTC
commit 5db1d4ea91b6ee447c4ae01f7f56803e32e690b1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 4 08:34:40 2018 +0100

    drm/i915/execlists: Push the tasklet kick after reset to reset_finish
    
    In the unlikely case where we have failed to keep submitting to the GPU,
    we end up with the ELSP queue empty but a pending queue of requests.
    Here, we skip the per-engine reset as there is no guilty request, but in
    doing so we also skip the engine restart leaving ourselves with a
    permanently hung engine. A quick way to recover is by moving the tasklet
    kick to execlists_reset_finish() (from init_hw). We still emit the error
    on hanging, so the error is not lost but we should be able to recover.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180604073441.6737-2-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
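
The control-flow problem described in the commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not i915 code: the class, its methods and the two-port limit are hypothetical stand-ins, chosen only to show why a kick that lives in the restart path is skipped when there is no guilty request.

```python
from collections import deque


class ToyEngine:
    """Toy model only; all names here are hypothetical."""

    def __init__(self, kick_in_finish):
        self.kick_in_finish = kick_in_finish
        self.elsp = []          # requests currently submitted to "hardware"
        self.pending = deque()  # requests queued but not yet submitted

    def tasklet(self):
        """Submit pending requests while a port is free (two ports assumed)."""
        while self.pending and len(self.elsp) < 2:
            self.elsp.append(self.pending.popleft())

    def init_hw(self):
        # Old placement: the kick happens only when the engine is restarted.
        if not self.kick_in_finish:
            self.tasklet()

    def reset_finish(self):
        # New placement: the kick happens on every reset exit.
        if self.kick_in_finish:
            self.tasklet()

    def reset(self):
        guilty = self.elsp[0] if self.elsp else None
        if guilty is not None:
            self.elsp.clear()  # drop the hung submission
            self.init_hw()     # restart path, taken only for a real reset
        self.reset_finish()    # always runs


def stuck_after_reset(kick_in_finish):
    """True if requests remain queued with nothing submitted after a reset."""
    engine = ToyEngine(kick_in_finish)
    engine.pending.extend(["rq1", "rq2"])  # ELSP empty, queue not empty
    engine.reset()
    return not engine.elsp and bool(engine.pending)


print(stuck_after_reset(kick_in_finish=False))
print(stuck_after_reset(kick_in_finish=True))
```

With the kick tied to the restart path the model stays stuck. With the kick in the finish step the queued requests are submitted.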

We need to be careful to not mistake the STOP_RING hangs, bug 106947.
Comment 5 Chris Wilson 2018-07-12 13:33:53 UTC
(In reply to Chris Wilson from comment #4)
> We need to be careful to not mistake the STOP_RING hangs, bug 106947.

Wrong way around.
Comment 6 Martin Peres 2018-07-16 07:48:03 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4451/fi-bsw-cyan/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4483/fi-cfl-8109u/igt@drv_selftest@live_hangcheck.html

(drv_selftest:8778) igt_kmod-WARNING: probe of 0000:00:02.0 failed with error -5
(drv_selftest:8778) igt_kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file ../lib/igt_kmod.c:513:
(drv_selftest:8778) igt_kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:8778) igt_kmod-CRITICAL: kselftest "i915 igt__23__live_hangcheck=1 live_selftests=-1 disable_display=1" failed: Input/output error [5]
(drv_selftest:8778) igt_core-INFO: Stack trace:
(drv_selftest:8778) igt_core-INFO:   #0 [__igt_fail_assert+0x180]
(drv_selftest:8778) igt_core-INFO:   #1 [igt_kselftest_execute+0x1d9]
(drv_selftest:8778) igt_core-INFO:   #2 [igt_kselftests+0x18c]
(drv_selftest:8778) igt_core-INFO:   #3 [__real_main29+0x44]
(drv_selftest:8778) igt_core-INFO:   #4 [main+0x44]
(drv_selftest:8778) igt_core-INFO:   #5 [__libc_start_main+0xe7]
(drv_selftest:8778) igt_core-INFO:   #6 [_start+0x2a]
****  END  ****

[  818.263283] kthread for other engine bcs0 failed, err=-5
[  818.263347] kthread for other engine vcs0 failed, err=-5
[  818.263476] kthread for other engine vecs0 failed, err=-5
[  818.269800] Failed to switch back to kernel context; declaring wedged
[  818.287883] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[  818.298287] Failed to switch back to kernel context; declaring wedged
[  818.454389] i915: probe of 0000:00:02.0 failed with error -5
Comment 7 Chris Wilson 2018-08-15 09:19:35 UTC
commit a99b32a6fff7e482a267c72e565c8c410ce793d7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 14 18:18:57 2018 +0100

    drm/i915: Clear stop-engine for a pardoned reset
    
    If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
    in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
    it on the per-engine exit path as we know that either we skipped the
    reset and so need the cancellation, or the reset was successful and the
    cancellation is a no-op, or there was an error and we will follow up
    with a full-reset or wedging (both of which will stop the engines again
    as required).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106560
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180814171857.24673-1-chris@chris-wilson.co.uk
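
The three cases listed in the commit message can be walked through with a small model. This is a sketch only: the function name and outcome labels are hypothetical, the STOP_RING bit position is an assumption, and the model assumes the engine is stopped before the per-engine reset is attempted.

```python
STOP_RING = 1 << 8  # assumed bit position in RING_MI_MODE; check your kernel tree


def per_engine_reset(outcome):
    """Toy model of the per-engine reset exit path.

    outcome is one of "pardoned", "success" or "error".
    Returns (mi_mode after the exit path, whether escalation is needed).
    """
    if outcome not in ("pardoned", "success", "error"):
        raise ValueError(outcome)

    mi_mode = STOP_RING  # assumption: engine was stopped before the reset
    if outcome == "success":
        mi_mode &= ~STOP_RING  # a completed reset leaves the bit clear
    # "pardoned": the reset was skipped, so the bit is still set here.
    # "error": the reset failed, so the bit is still set here.

    mi_mode &= ~STOP_RING  # unconditional clear on the exit path

    # On error the caller follows up with a full reset or wedging, which
    # stops the engines again as required.
    needs_escalation = outcome == "error"
    return mi_mode, needs_escalation


for outcome in ("pardoned", "success", "error"):
    mi_mode, escalate = per_engine_reset(outcome)
    print(outcome, bool(mi_mode & STOP_RING), escalate)
```

In every case the bit is clear when the per-engine path exits. The clear is a no-op after a successful reset, and only the error case hands over to the heavier recovery.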


Please note that the CI run for this patch indicated we have yet another cause for hangs here. When that is detected please do file a fresh bug so we don't have the debug logs confused.
Comment 8 Martin Peres 2018-09-05 08:13:59 UTC
(In reply to Chris Wilson from comment #7)

That seems to have done the trick for this particular issue. Closing now :)
Comment 9 Martin Peres 2018-09-05 10:38:54 UTC
(In reply to Martin Peres from comment #8)
> That seems to have done the trick for this particular issue. Closing now :)

Actually, https://bugs.freedesktop.org/show_bug.cgi?id=106947 was saying that the following failures are for this issue, and they still happen:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html

[...]

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4714/fi-kbl-7567u/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html
Comment 10 Chris Wilson 2018-09-05 10:42:40 UTC
Honestly, it's all fixed now. Well, except for the GuC. Please do treat any fresh indication of failure as a separate bug.
Comment 11 Martin Peres 2018-09-07 16:16:58 UTC
(In reply to Chris Wilson from comment #10)
> Honestly, it's all fixed now. Well, except for the GuC. Please do treat any
> fresh indication of failure as a separate bug.

OK, moved here: https://bugs.freedesktop.org/show_bug.cgi?id=107860

