Bug 108735

Summary:	[CI][BAT] igt@drv_selftest@live_hangcheck - _igt_reset_evict_vma timed out
Product:	DRI
Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel
Assignee:	Chris Wilson <chris>
Status:	CLOSED FIXED
QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	highest
CC:	intel-gfx-bugs
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	I965G
i915 features:

Description Martin Peres 2018-11-13 16:50:16 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5133/fi-bwr-2160/igt@drv_selftest@live_hangcheck.html

(drv_selftest:2162) igt_kmod-WARNING: probe of 0000:00:02.0 failed with error -5
(drv_selftest:2162) igt_kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file ../lib/igt_kmod.c:531:
(drv_selftest:2162) igt_kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:2162) igt_kmod-CRITICAL: kselftest "i915 igt__24__live_hangcheck=1 live_selftests=-1 disable_display=1" failed: Input/output error [5]
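
For readers unfamiliar with the error code: the `-5` in the probe failure and the `[5]` in the kselftest message are the same errno, which on Linux is EIO. A minimal sketch for decoding it, assuming a Linux host (the helper name `describe_errno` is hypothetical and not part of IGT):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a kernel-style (possibly negative) errno."""
    # The kernel returns -errno; userspace tools report the positive value.
    num = abs(code)
    name = errno.errorcode.get(num, "UNKNOWN")
    return f"{name}: {os.strerror(num)}"


print(describe_errno(-5))
```

The message text comes from the host's C library, so it may differ slightly between systems; the symbolic name is the stable part.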

Comment 1 Chris Wilson 2018-11-21 11:46:15 UTC

I hoped

commit e32c8d3caefbb8ec734a0a79c8d4245f38c99d2a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 20 12:06:01 2018 +0000

    drm/i915/selftests: Hold task reference to reset worker
    
    As the worker may exit by itself, we need to hold a task reference to it
    in the parent.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108735
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181120120601.24083-1-chris@chris-wilson.co.uk

was relevant, alas not.

Comment 2 Chris Wilson 2018-11-26 12:10:21 UTC

Seems similar to:

commit d6fee0dee09317d5e83e9b855316cb779dd679cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 14 11:40:56 2018 +0100

    drm/i915: Kick waiters on resetting legacy rings
    
    This reapplies commit 39f3be162c46 ("drm/i915: Kick waiters on resetting
    legacy rings") after the improved gem_eio was run across all machines we
    found that gen3 and early gen4 still lost the immediate interrupt
    following reset, and the HWSTAM w/a applied to gen6+ is inadequate.
    
    Unlike the later gen, on gen3/4 the principal (and only tests to fail so
    far) are the wait vs reset test cases, whereas the reset stress case
    works fine (which was the predominantly failing case for gen6+). That is
    enough to suggest the underlying issue is sufficiently different to
    support the difference in HWSTAM efficacy.
    
    Testcase: igt/gem_eio/wait-10ms
    References: 39f3be162c46 ("drm/i915: Kick waiters on resetting legacy rings")
    References: a69ab52b0358 ("drm/i915: Remove extra waiter kick on legacy resets")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Matthew Auld <matthew.auld@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180814104056.27001-1-chris@chris-wilson.co.uk

Comment 3 Chris Wilson 2018-11-26 14:36:02 UTC

Stab, stab, stab:

commit b7f21899276a3e06ea3c98d0b3771f09eefc6e3d (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 26 12:28:21 2018 +0000

    drm/i915/ringbuffer: 2-step restart
    
    We may be simply restarting too fast for the curmudgeonly gen3/gen4 as
    we still see missing interrupts following a reset. So let's try
    restarting a little slower, first wake up the ring empty and then tell
    it about the work it has to perform.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108735
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181126122821.4537-1-chris@chris-wilson.co.uk
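
The ordering described in the commit message above (wake the ring up empty, then tell it about the pending work) can be illustrated with a toy model. This is only a sketch of the idea, not the driver code: `ToyRing`, its fields and `restart_two_step` are hypothetical names, and real hardware access goes through MMIO registers rather than Python attributes.

```python
class ToyRing:
    """Hypothetical stand-in for a legacy ring's HEAD/TAIL/enable state."""

    def __init__(self):
        self.head = 0
        self.tail = 0
        self.enabled = False
        self.log = []  # order of register writes, kept for inspection

    def write(self, reg, value):
        setattr(self, reg, value)
        self.log.append((reg, value))

    @property
    def is_empty(self):
        return self.head == self.tail


def restart_two_step(ring, head, tail):
    # Step 1: bring the ring up empty (HEAD == TAIL), so it has nothing to run.
    ring.write("head", head)
    ring.write("tail", head)
    ring.write("enabled", True)
    assert ring.is_empty
    # Step 2: only now publish the pending work by advancing TAIL.
    if tail != head:
        ring.write("tail", tail)


ring = ToyRing()
restart_two_step(ring, head=0x40, tail=0x80)
```

The point of the model is the write order recorded in `ring.log`: the ring is enabled while still empty, and TAIL is advanced as a separate, later step.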

Comment 4 Chris Wilson 2018-11-28 15:06:10 UTC

It has only been a couple of days, which is too early to tell for sure (the failure was hidden for quite some time and only seemed to occur rarely). But let us assume that this paper was thick enough.

Comment 5 Francesco Balestrieri 2018-12-11 09:22:10 UTC

Another occurrence three days ago:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5281/fi-bwr-2160/igt@i915_selftest@live_hangcheck.html

Reopen?

Comment 6 Martin Peres 2018-12-11 11:00:25 UTC

(In reply to Francesco Balestrieri from comment #5)
> Another occurrence three days ago:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5281/fi-bwr-2160/igt@i915_selftest@live_hangcheck.html
> 
> Reopen?

I usually don't wait. If we get another failure, we re-open.

If this is another bug, then we should move the failure to a new one and close this bug.

Comment 7 Chris Wilson 2018-12-18 15:28:29 UTC

commit 060f23225d8203b8cd9e412d984e5237e63c83dc (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 18 10:27:12 2018 +0000

    drm/i915: Apply missed interrupt after reset w/a to all ringbuffer gen
    
    Having completed a test run of gem_eio across all machines in CI we also
    observe the phenomenon (of lost interrupts after resetting the GPU) on
    gen3 machines as well as the previously sighted gen6/gen7. Let's apply
    the same HWSTAM workaround that was effective for gen6+ for all, as
    although we haven't seen the same failure on gen4/5 it seems prudent to
    keep the code the same.
    
    As a consequence we can remove the extra setting of HWSTAM and apply the
    register from a single site.
    
    v2: Delazy and move the HWSTAM into its own function
    v3: Mask off all HWSP writes on driver unload and engine cleanup.
    v4: And what about the physical hwsp?
    v5: No, engine->init_hw() is not called from driver_init_hw(), don't be
    daft. Really scrub HWSTAM as early as we can in driver_init_mmio()
    v6: Rename set_hwsp as it was setting the mask not the hwsp register.
    v7: Ville pointed out that although vcs(bsd) was introduced for g4x/ilk,
    per-engine HWSTAM was not introduced until gen6!
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108735
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181218102712.11058-1-chris@chris-wilson.co.uk

and hope again.

Comment 8 Chris Wilson 2018-12-19 16:01:36 UTC

Neither of those then. Back to square 0.

Comment 9 Chris Wilson 2019-02-02 22:14:02 UTC

Fwiw, new delayed signaling of fences after reset. Time to wait for a fresh report.

Comment 10 Martin Peres 2019-03-06 18:30:48 UTC

(In reply to Chris Wilson from comment #9)
> Fwiw, new delayed signaling of fences after reset. Time to wait for a fresh
> report.

Seems to have done the trick! Used to happen pretty much every run, and now nothing for over 300. Closing!
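
As a rough sanity check on "nothing for over 300" runs, the chance of such a clean streak can be estimated by treating each run as an independent trial with a fixed failure probability. That independence, the function name and the example rates below are assumptions made for illustration, not measurements from CI:

```python
def p_clean_streak(p_fail: float, runs: int) -> float:
    """Probability of `runs` consecutive passes if each run fails with p_fail."""
    return (1.0 - p_fail) ** runs


for p_fail in (0.5, 0.05, 0.01):  # assumed per-run failure rates
    print(f"p_fail={p_fail}: P(300 clean runs) = {p_clean_streak(p_fail, 300):.3g}")
```

Even at an assumed 1% failure rate, 300 clean runs would happen by chance only about 5% of the time, so for a failure that used to show up in most runs the streak is strong evidence that the fix took.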

Comment 11 CI Bug Log 2019-03-06 18:30:57 UTC

The CI Bug Log issue associated with this bug has been archived.

New failures matching the above filters will no longer be associated with this bug.
