https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5133/fi-bwr-2160/igt@drv_selftest@live_hangcheck.html

(drv_selftest:2162) igt_kmod-WARNING: probe of 0000:00:02.0 failed with error -5
(drv_selftest:2162) igt_kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file ../lib/igt_kmod.c:531:
(drv_selftest:2162) igt_kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:2162) igt_kmod-CRITICAL: kselftest "i915 igt__24__live_hangcheck=1 live_selftests=-1 disable_display=1" failed: Input/output error [5]
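For readers unfamiliar with the numbers in the log above: the "-5" from the probe and the "[5]" next to "Input/output error" are the same errno value, EIO, with the kernel-style negative sign in the first case. A minimal sketch for decoding such values follows; the helper name `describe_kernel_error` is made up for illustration and is not part of IGT or the kernel.

```python
import errno
import os


def describe_kernel_error(ret: int) -> str:
    """Decode a kernel-style return value such as -5 into its errno name.

    Hypothetical helper for illustration only. Kernel code returns
    negative errno values, so the sign is stripped before the lookup.
    """
    code = -ret if ret < 0 else ret
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # The probe failure in the log reported -5.
    print(describe_kernel_error(-5))
```

The message text comes from the platform's C library, so the exact wording after the colon may differ between systems.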
I hoped

commit e32c8d3caefbb8ec734a0a79c8d4245f38c99d2a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 20 12:06:01 2018 +0000

    drm/i915/selftests: Hold task reference to reset worker

    As the worker may exit by itself, we need to hold a task reference to it
    in the parent.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=108735
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181120120601.24083-1-chris@chris-wilson.co.uk

was relevant, alas not.
Seems similar to:

commit d6fee0dee09317d5e83e9b855316cb779dd679cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 14 11:40:56 2018 +0100

    drm/i915: Kick waiters on resetting legacy rings

    This reapplies commit 39f3be162c46 ("drm/i915: Kick waiters on resetting
    legacy rings") after the improved gem_eio was run across all machines we
    found that gen3 and early gen4 still lost the immediate interrupt
    following reset, and the HWSTAM w/a applied to gen6+ is inadequate.

    Unlike the later gen, on gen3/4 the principle (and only tests to fail so
    far) are the wait vs reset test cases, whereas the reset stress case
    works fine (which was the predominantly failing case for gen6+). That is
    enough to suggest the underlying issue is sufficiently different to
    support the difference in HWSTAM efficacy.

    Testcase: igt/gem_eio/wait-10ms
    References: 39f3be162c46 ("drm/i915: Kick waiters on resetting legacy rings")
    References: a69ab52b0358 ("drm/i915: Remove extra waiter kick on legacy resets")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Matthew Auld <matthew.auld@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180814104056.27001-1-chris@chris-wilson.co.uk
Stab, stab, stab:

commit b7f21899276a3e06ea3c98d0b3771f09eefc6e3d (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 26 12:28:21 2018 +0000

    drm/i915/ringbuffer: 2-step restart

    We may be simply restarting too fast for the culmudgeonly gen3/gen4 as
    we still see missing interrupts following a reset. So let's try
    restarting a little slower, first wake up the ring empty and then tell
    it about the work it has to perform.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=108735
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181126122821.4537-1-chris@chris-wilson.co.uk
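To make the "2-step restart" idea concrete, here is a toy model of the ordering the commit message describes: first start the ring with nothing queued, then hand it the pending work. This is only a sketch under stated assumptions. `ToyRing`, its methods, and `two_step_restart` are hypothetical names invented for illustration; they are not the i915 code and they do not touch real registers.

```python
class ToyRing:
    """Toy stand-in for a ring's HEAD/TAIL/enable state (not real hardware)."""

    def __init__(self):
        self.head = 0
        self.tail = 0
        self.valid = False
        self.log = []  # ordered record of the simulated register writes

    def write_head(self, value):
        self.head = value
        self.log.append(("HEAD", value))

    def write_tail(self, value):
        self.tail = value
        self.log.append(("TAIL", value))

    def enable(self):
        self.valid = True
        self.log.append(("CTL", "VALID"))

    def is_idle(self):
        # An enabled ring with HEAD == TAIL has no work queued.
        return self.valid and self.head == self.tail


def two_step_restart(ring, head, tail):
    """Restart the toy ring in two steps (illustrative assumption only)."""
    # Step 1: wake the ring up empty by pointing TAIL at HEAD.
    ring.write_head(head)
    ring.write_tail(head)
    ring.enable()
    if not ring.is_idle():
        raise RuntimeError("ring did not come up idle")
    # Step 2: only now tell it about the work it has to perform.
    if tail != head:
        ring.write_tail(tail)


if __name__ == "__main__":
    ring = ToyRing()
    two_step_restart(ring, head=0x40, tail=0x80)
    for entry in ring.log:
        print(entry)
```

The point of the model is the order of the writes: the real TAIL value is written only after the ring has been enabled in an empty state, rather than in the same step as the enable.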
It has been a couple of days, which is too early to tell for sure (the failure was hidden for quite some time and only seemed to occur rarely). But let us assume that this paper was thick enough.
Another occurrence three days ago:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5281/fi-bwr-2160/igt@i915_selftest@live_hangcheck.html

Reopen?
(In reply to Francesco Balestrieri from comment #5)
> Another occurrence three days ago:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5281/fi-bwr-2160/igt@i915_selftest@live_hangcheck.html
>
> Reopen?

I usually don't wait. If we get another failure, we re-open. If this is another bug, then we should move the failure to a new one and close this bug.
commit 060f23225d8203b8cd9e412d984e5237e63c83dc (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 18 10:27:12 2018 +0000

    drm/i915: Apply missed interrupt after reset w/a to all ringbuffer gen

    Having completed a test run of gem_eio across all machines in CI we also
    observe the phenomenon (of lost interrupts after resetting the GPU) on
    gen3 machines as well as the previously sighted gen6/gen7. Let's apply
    the same HWSTAM workaround that was effective for gen6+ for all, as
    although we haven't seen the same failure on gen4/5 it seems prudent to
    keep the code the same.

    As a consequence we can remove the extra setting of HWSTAM and apply the
    register from a single site.

    v2: Delazy and move the HWSTAM into its own function
    v3: Mask off all HWSP writes on driver unload and engine cleanup.
    v4: And what about the physical hwsp?
    v5: No, engine->init_hw() is not called from driver_init_hw(), don't be
        daft. Really scrub HWSTAM as early as we can in driver_init_mmio()
    v6: Rename set_hwsp as it was setting the mask not the hwsp register.
    v7: Ville pointed out that although vcs(bsd) was introduced for g4x/ilk,
        per-engine HWSTAM was not introduced until gen6!

    References: https://bugs.freedesktop.org/show_bug.cgi?id=108735
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181218102712.11058-1-chris@chris-wilson.co.uk

and hope again.
Neither of those then. Back to square 0.
Fwiw, new delayed signaling of fences after reset. Time to wait for a fresh report.
(In reply to Chris Wilson from comment #9)
> Fwiw, new delayed signaling of fences after reset. Time to wait for a fresh
> report.

Seems to have done the trick! Used to happen pretty much every run, and now nothing for over 300. Closing!
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.