Bug 107860

Summary: [CI][BAT] igt@drv_selftest@live_hangcheck - igt_reset_engines failed with error
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: high
CC: intel-gfx-bugs
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: BXT, KBL
i915 features: GEM/Other

Description Martin Peres 2018-09-07 16:12:29 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html

[...]

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4714/fi-kbl-7567u/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html

[  185.989633] igt/vecs-5811    1.... 185981040us : active_request_put.part.9: vecs0 timed out waiting for completion of fence fe9:993, seqno 0.
[  185.989656] ---------------------------------
[  186.009793] kthread for other engine vecs0 failed, err=-5
[  186.009943] Failed to switch back to kernel context; declaring wedged
[  186.010389] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[  186.011113] Failed to switch back to kernel context; declaring wedged
[  186.092120] i915: probe of 0000:00:02.0 failed with error -5
Comment 1 Martin Peres 2018-09-07 16:15:24 UTC
This bug is a continuation of the bug https://bugs.freedesktop.org/show_bug.cgi?id=106560 which had been closed as some of the issues were fixed.

This bug is mostly visible on KBL, but APL also has one hit since the fix from 106560: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4702/shard-apl6/igt@drv_selftest@live_hangcheck.html
Comment 2 Martin Peres 2018-09-07 16:16:19 UTC
Bumping the priority as it is quite problematic to wedge a GPU.
Comment 4 Martin Peres 2018-09-07 17:13:18 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html

(In reply to Chris Wilson from comment #3)
> (In reply to Martin Peres from comment #0)
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html
> 
> Before the fix.
> 
> 
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4714/fi-kbl-7567u/igt@drv_selftest@live_hangcheck.html
> 
> Before the fix.
> 
> 
>  
> > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html
> 
> == DRM_4715 which I guess is before the fix as well.

CI_DRM_4715 was posted on Aug. 28, 2018, 1:11 p.m. This was way after you pushed your commit (2018-08-15 10:15:28 +0100): https://cgit.freedesktop.org/drm-tip/commit/?id=a99b32a6fff7e482a267c72e565c8c410ce793d7

So, I'm re-opening. But please tell me where the error in my logic is if you still think this is fixed :)
Comment 5 Chris Wilson 2018-09-07 17:16:04 UTC
The last fix for live_hangcheck was

commit 9e4fa01221b3230320135072ad31ea809ca31147
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 28 16:27:02 2018 +0100

    drm/i915/execlists: Flush tasklet directly from reset-finish
    
    On finishing the reset, the intention is to restart the GPU before we
    relinquish the forcewake taken to handle the reset - the goal being the
    GPU reloads a context before it is allowed to sleep. For this purpose,
    we used tasklet_flush() which although it accomplished the goal of
    restarting the GPU, carried with it a sting in its tail: it cleared the
    TASKLET_STATE_SCHED bit. This meant that if another CPU queued a new
    request to this engine, we would clear the flag and later attempt to
    requeue the tasklet on the local CPU, breaking the per-cpu softirq
    lists.
    
    Remove the dangerous tasklet_kill() and just run the tasklet func
    directly as we know it is safe to do so (the tasklets are internally
    locked to allow mixed usage from direct submission).
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180828152702.27536-1-chris@chris-wilson.co.uk
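
The hazard described in the commit message above can be illustrated with a small user-space model. This is only a sketch, not the i915 or kernel tasklet code: every name in it (`model_tasklet`, `model_schedule`, `model_clear_sched`, and so on) is hypothetical. A plain boolean stands in for TASKLET_STATE_SCHED, and per-CPU counters stand in for the per-cpu softirq lists. It assumes, as the commit message states, that the old flush path could clear the scheduled flag while another CPU's queue entry was still pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; these names are NOT kernel API. */
#define MODEL_NCPUS 2

struct model_tasklet {
    bool sched;                 /* stands in for TASKLET_STATE_SCHED */
    int queued_on[MODEL_NCPUS]; /* pending entries on each CPU's list */
    int runs;                   /* times the tasklet body was executed */
};

/* Queue the tasklet on 'cpu' unless it is already marked as scheduled. */
bool model_schedule(struct model_tasklet *t, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NCPUS);
    if (t->sched)
        return false;           /* already pending somewhere: do nothing */
    t->sched = true;
    t->queued_on[cpu]++;
    return true;
}

/* The hazard: drop the flag but leave the pending list entry in place. */
void model_clear_sched(struct model_tasklet *t)
{
    t->sched = false;
}

/* The fix in spirit: run the body directly, leaving the flag untouched. */
void model_run_direct(struct model_tasklet *t)
{
    t->runs++;
}

int model_total_queued(const struct model_tasklet *t)
{
    int total = 0;
    for (int cpu = 0; cpu < MODEL_NCPUS; cpu++)
        total += t->queued_on[cpu];
    return total;
}

/* CPU 1 queues a request; the reset path on CPU 0 clears the flag and
 * later requeues locally, so the tasklet ends up on two lists. */
int model_flush_by_clearing(void)
{
    struct model_tasklet t = {0};
    model_schedule(&t, 1);
    model_clear_sched(&t);
    model_schedule(&t, 0);
    return model_total_queued(&t);
}

/* Same sequence, but the reset path calls the body directly; the later
 * local requeue is refused because the flag is still set. */
int model_flush_by_direct_call(void)
{
    struct model_tasklet t = {0};
    model_schedule(&t, 1);
    model_run_direct(&t);
    model_schedule(&t, 0);
    return model_total_queued(&t);
}
```

In this model, `model_flush_by_clearing()` leaves the tasklet queued on both CPUs at once, while `model_flush_by_direct_call()` leaves the single entry made by CPU 1. Real tasklets involve atomics, linked lists, and softirq context that the model deliberately leaves out.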
Comment 6 Martin Peres 2018-09-07 17:22:44 UTC
(In reply to Chris Wilson from comment #5)
> The last fix for live_hangcheck was
> 
> commit 9e4fa01221b3230320135072ad31ea809ca31147
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Aug 28 16:27:02 2018 +0100
> 
>     drm/i915/execlists: Flush tasklet directly from reset-finish
>     
>     On finishing the reset, the intention is to restart the GPU before we
>     relinquish the forcewake taken to handle the reset - the goal being the
>     GPU reloads a context before it is allowed to sleep. For this purpose,
>     we used tasklet_flush() which although it accomplished the goal of
>     restarting the GPU, carried with it a sting in its tail: it cleared the
>     TASKLET_STATE_SCHED bit. This meant that if another CPU queued a new
>     request to this engine, we would clear the flag and later attempt to
>     requeue the tasklet on the local CPU, breaking the per-cpu softirq
>     lists.
>     
>     Remove the dangerous tasklet_kill() and just run the tasklet func
>     directly as we know it is safe to do so (the tasklets are internally
>     locked to allow mixed usage from direct submission).
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>     Cc: Michel Thierry <michel.thierry@intel.com>
>     Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180828152702.27536-1-chris@chris-wilson.co.uk

ACK! Thanks for documenting it :)
