Summary: | [CI] igt@drv_selftest@live_hangcheck - igt_reset_engines failed with error | ||
---|---|---|---|
Product: | DRI | Reporter: | Tomi Sarvela <tomi.p.sarvela> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BSW/CHT, BXT, CFL, GLK, KBL | i915 features: | GPU hang |
Description
Tomi Sarvela
2018-05-18 07:21:59 UTC
https://patchwork.freedesktop.org/series/43344/ seems to do the trick

Hopefully,

commit 9a4dc80399b1630cea0f1ad8ef0417436cbb95d0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri May 18 11:09:33 2018 +0100

    drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset

    Inside the live_hangcheck (reset) selftests, we occasionally see
    failures like

    <7>[ 239.094840] i915_gem_set_wedged rcs0
    <7>[ 239.094843] i915_gem_set_wedged current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
    <7>[ 239.094846] i915_gem_set_wedged Reset count: 6239 (global 1)
    <7>[ 239.094848] i915_gem_set_wedged Requests:
    <7>[ 239.095052] i915_gem_set_wedged first 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
    <7>[ 239.095056] i915_gem_set_wedged last 19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
    <7>[ 239.095059] i915_gem_set_wedged active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
    <7>[ 239.095062] i915_gem_set_wedged [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
    <7>[ 239.100050] i915_gem_set_wedged ring->start: 0x00283000
    <7>[ 239.100053] i915_gem_set_wedged ring->head: 0x000001f8
    <7>[ 239.100055] i915_gem_set_wedged ring->tail: 0x000002a8
    <7>[ 239.100057] i915_gem_set_wedged ring->emit: 0x000002a8
    <7>[ 239.100059] i915_gem_set_wedged ring->space: 0x00000f10
    <7>[ 239.100085] i915_gem_set_wedged RING_START: 0x00283000
    <7>[ 239.100088] i915_gem_set_wedged RING_HEAD: 0x00000260
    <7>[ 239.100091] i915_gem_set_wedged RING_TAIL: 0x000002a8
    <7>[ 239.100094] i915_gem_set_wedged RING_CTL: 0x00000001
    <7>[ 239.100097] i915_gem_set_wedged RING_MODE: 0x00000300 [idle]
    <7>[ 239.100100] i915_gem_set_wedged RING_IMR: fffffefe
    <7>[ 239.100104] i915_gem_set_wedged ACTHD: 0x00000000_0000609c
    <7>[ 239.100108] i915_gem_set_wedged BBADDR: 0x00000000_0000609d
    <7>[ 239.100111] i915_gem_set_wedged DMA_FADDR: 0x00000000_00283260
    <7>[ 239.100114] i915_gem_set_wedged IPEIR: 0x00000000
    <7>[ 239.100117] i915_gem_set_wedged IPEHR: 0x02800000
    <7>[ 239.100120] i915_gem_set_wedged Execlist status: 0x00044052 00000002
    <7>[ 239.100124] i915_gem_set_wedged Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
    <7>[ 239.100128] i915_gem_set_wedged ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
    <7>[ 239.100132] i915_gem_set_wedged ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
    <7>[ 239.100135] i915_gem_set_wedged HW active? 0x5
    <7>[ 239.100250] i915_gem_set_wedged E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
    <7>[ 239.100338] i915_gem_set_wedged E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
    <7>[ 239.100340] i915_gem_set_wedged Queue priority: 139
    <7>[ 239.100343] i915_gem_set_wedged Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
    <7>[ 239.100346] i915_gem_set_wedged Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
    <7>[ 239.100349] i915_gem_set_wedged Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
    <7>[ 239.100352] i915_gem_set_wedged Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
    <7>[ 239.100356] i915_gem_set_wedged Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
    <7>[ 239.100362] i915_gem_set_wedged drv_selftest [5894] waiting for 19a99

    where the GPU saw an arbitration point and idles; AND HAS NOT BEEN
    RESET! The RING_MODE indicates that is idle and has the STOP_RING bit
    set, so try clearing it.

    v2: Only clear the bit on restarting the ring, as we want to be sure
    the STOP_RING bit is kept if reset fails on wedging.
    v3: Spot when the ring state doesn't make sense when re-initialising
    the engine and dump it to the logs so that we don't have to wait for
    an error later and try to guess what happened earlier.
    v4: Prepare to print all the unexpected state, not just the first.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180518100933.2239-1-chris@chris-wilson.co.uk

commit 5db1d4ea91b6ee447c4ae01f7f56803e32e690b1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jun 4 08:34:40 2018 +0100

    drm/i915/execlists: Push the tasklet kick after reset to reset_finish

    In the unlikely case where we have failed to keep submitting to the
    GPU, we end up with the ELSP queue empty but a pending queue of
    requests. Here, we skip the per-engine reset as there is no guilty
    request, but in doing so we also skip the engine restart leaving
    ourselves with a permanently hung engine. A quick way to recover is
    by moving the tasklet kick to execlists_reset_finish() (from init_hw).
    We still emit the error on hanging, so the error is not lost but we
    should be able to recover.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180604073441.6737-2-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

We need to be careful to not mistake the STOP_RING hangs, bug 106947.

(In reply to Chris Wilson from comment #4)
> We need to be careful to not mistake the STOP_RING hangs, bug 106947.

Wrong way around.
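As a reading aid for the register dump quoted in the first commit message above, here is a minimal Python sketch that decodes the `RING_MODE: 0x00000300 [idle]` value and models what "clearing STOP_RING" does to it. The bit positions (STOP_RING as bit 8, MODE_IDLE as bit 9) and the masked-write convention (high 16 bits select which low bits are updated) are assumptions taken from memory of the i915 register definitions; neither is stated in this thread. The helper names are hypothetical and this is a toy model, not driver code.

```python
# Assumed bit layout of RING_MI_MODE (not stated in this thread).
STOP_RING = 1 << 8
MODE_IDLE = 1 << 9


def decode_ring_mode(value):
    """Return the names of the known flags set in a RING_MI_MODE value."""
    flags = []
    if value & STOP_RING:
        flags.append("STOP_RING")
    if value & MODE_IDLE:
        flags.append("MODE_IDLE")
    return flags


def masked_bit_enable(bit):
    """Masked write that sets `bit`: mask in the high half, value in the low half."""
    return (bit << 16) | bit


def masked_bit_disable(bit):
    """Masked write that clears `bit`: mask in the high half, zero in the low half."""
    return bit << 16


def apply_masked_write(reg, write):
    """Model a masked register: only bits selected by the high 16 bits change."""
    mask = (write >> 16) & 0xFFFF
    data = write & 0xFFFF
    return (reg & ~mask) | (data & mask)


if __name__ == "__main__":
    ring_mode = 0x00000300  # value from the dump quoted above
    print(decode_ring_mode(ring_mode))
    cleared = apply_masked_write(ring_mode, masked_bit_disable(STOP_RING))
    print(hex(cleared), decode_ring_mode(cleared))
```

Under these assumptions, 0x300 reads as both STOP_RING and MODE_IDLE being set, which matches the commit's description of an engine that is idle with the stop bit still asserted. The model only tracks the written bit; on real hardware the idle flag is reported by the engine and is not affected by this sketch.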
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4451/fi-bsw-cyan/igt@drv_selftest@live_hangcheck.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4483/fi-cfl-8109u/igt@drv_selftest@live_hangcheck.html

(drv_selftest:8778) igt_kmod-WARNING: probe of 0000:00:02.0 failed with error -5
(drv_selftest:8778) igt_kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file ../lib/igt_kmod.c:513:
(drv_selftest:8778) igt_kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:8778) igt_kmod-CRITICAL: kselftest "i915 igt__23__live_hangcheck=1 live_selftests=-1 disable_display=1" failed: Input/output error [5]
(drv_selftest:8778) igt_core-INFO: Stack trace:
(drv_selftest:8778) igt_core-INFO: #0 [__igt_fail_assert+0x180]
(drv_selftest:8778) igt_core-INFO: #1 [igt_kselftest_execute+0x1d9]
(drv_selftest:8778) igt_core-INFO: #2 [igt_kselftests+0x18c]
(drv_selftest:8778) igt_core-INFO: #3 [__real_main29+0x44]
(drv_selftest:8778) igt_core-INFO: #4 [main+0x44]
(drv_selftest:8778) igt_core-INFO: #5 [__libc_start_main+0xe7]
(drv_selftest:8778) igt_core-INFO: #6 [_start+0x2a]
**** END ****
[ 818.263283] kthread for other engine bcs0 failed, err=-5
[ 818.263347] kthread for other engine vcs0 failed, err=-5
[ 818.263476] kthread for other engine vecs0 failed, err=-5
[ 818.269800] Failed to switch back to kernel context; declaring wedged
[ 818.287883] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[ 818.298287] Failed to switch back to kernel context; declaring wedged
[ 818.454389] i915: probe of 0000:00:02.0 failed with error -5

commit a99b32a6fff7e482a267c72e565c8c410ce793d7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Aug 14 18:18:57 2018 +0100

    drm/i915: Clear stop-engine for a pardoned reset

    If we pardon a per-engine reset, we may leave the STOP_RING bit
    asserted in RING_MI_MODE resulting in the engine hanging.
    Unconditionally clear it on the per-engine exit path as we know that
    either we skipped the reset and so need the cancellation, or the
    reset was successful and the cancellation is a no-op, or there was an
    error and we will follow up with a full-reset or wedging (both of
    which will stop the engines again as required).

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106560
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180814171857.24673-1-chris@chris-wilson.co.uk

Please note that the CI run for this patch indicated we have yet another cause for hangs here. When that is detected please do file a fresh bug so we don't have the debug logs confused.

(In reply to Chris Wilson from comment #7)
> commit a99b32a6fff7e482a267c72e565c8c410ce793d7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue Aug 14 18:18:57 2018 +0100
>
> drm/i915: Clear stop-engine for a pardoned reset
>
> If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
> in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
> it on the per-engine exit path as we know that either we skipped the
> reset and so need the cancellation, or the reset was successful and the
> cancellation is a no-op, or there was an error and we will follow up
> with a full-reset or wedging (both of which will stop the engines again
> as required).
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106560
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20180814171857.24673-1-chris@chris-wilson.co.uk
>
> Please note that the CI run for this patch indicated we have yet another
> cause for hangs here. When that is detected please do file a fresh bug so we
> don't have the debug logs confused.

That seems to have done the trick for this particular issue. Closing now :)

(In reply to Martin Peres from comment #8)
> (In reply to Chris Wilson from comment #7)
> > commit a99b32a6fff7e482a267c72e565c8c410ce793d7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date: Tue Aug 14 18:18:57 2018 +0100
> >
> > drm/i915: Clear stop-engine for a pardoned reset
> >
> > If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
> > in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
> > it on the per-engine exit path as we know that either we skipped the
> > reset and so need the cancellation, or the reset was successful and the
> > cancellation is a no-op, or there was an error and we will follow up
> > with a full-reset or wedging (both of which will stop the engines again
> > as required).
> >
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106560
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Link: https://patchwork.freedesktop.org/patch/msgid/20180814171857.24673-1-chris@chris-wilson.co.uk
> >
> > Please note that the CI run for this patch indicated we have yet another
> > cause for hangs here. When that is detected please do file a fresh bug so we
> > don't have the debug logs confused.
>
> That seems to have done the trick for this particular issue. Closing now :)

Actually, https://bugs.freedesktop.org/show_bug.cgi?id=106947 was saying that the following failures are for this issue, and they still happen:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html
[...]
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4714/fi-kbl-7567u/igt@drv_selftest@live_hangcheck.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4611/shard-kbl7/igt@drv_selftest@live_hangcheck.html

Honestly, it's all fixed now. Well, except for the GuC. Please do treat any fresh indication of failure as a separate bug.

(In reply to Chris Wilson from comment #10)
> Honestly, it's all fixed now. Well, except for the GuC. Please do treat any
> fresh indication of failure as a separate bug.

OK, moved here: https://bugs.freedesktop.org/show_bug.cgi?id=107860
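Since the thread ends by asking that fresh failures be filed as separate bugs, it can help to pull the failing step and error code out of a kernel log before comparing it with the excerpts quoted earlier in this report. The following is a rough Python sketch of that triage step, not an existing CI tool: the function name `find_failures` and the regular expression are hypothetical, and the sample lines are copied from the log excerpt above. It maps the numeric code to its symbolic errno name, so `-5` reads as `EIO`, matching the "Input/output error [5]" reported by IGT.

```python
import errno
import re

# Matches dmesg-style lines such as
# "[ 818.454389] i915: probe of 0000:00:02.0 failed with error -5".
FAILURE_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s*(?P<what>.*?) failed with error (?P<err>-?\d+)"
)


def find_failures(log_text):
    """Return (failing step, errno name) pairs found in a kernel log excerpt."""
    results = []
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            code = abs(int(match.group("err")))
            results.append((match.group("what"), errno.errorcode.get(code, "unknown")))
    return results


SAMPLE = """\
[ 818.269800] Failed to switch back to kernel context; declaring wedged
[ 818.287883] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[ 818.298287] Failed to switch back to kernel context; declaring wedged
[ 818.454389] i915: probe of 0000:00:02.0 failed with error -5
"""

if __name__ == "__main__":
    for what, name in find_failures(SAMPLE):
        print(f"{name}: {what}")
```

The pattern deliberately ignores lines of the form `failed, err=-5` (the per-engine kthread messages); extending it to cover those is left as a choice for whoever adapts the sketch.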