Summary: | [CI] igt@drv_selftest@live_hangcheck - dmesg-fail - Failed to switch back to kernel context; declaring wedged | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BXT, GLK, KBL, PNV | i915 features: | GEM/Other |
Description
Martin Peres
2018-06-18 07:15:07 UTC
Hmm, this is a different symptom than bug 106560. Instead of the RING_MODE being disabled (with RING_STOP), here it looks like we missed kicking the tasklet:

<7>[ 374.261828] i915_gem_set_wedged vcs0
<7>[ 374.261831] i915_gem_set_wedged current seqno 13814, last 13814, hangcheck 0 [5263 ms]
<7>[ 374.261834] i915_gem_set_wedged Reset count: 5598 (global 1)
<7>[ 374.261838] i915_gem_set_wedged Requests:
<7>[ 374.261869] i915_gem_set_wedged RING_START: 0x00109000
<7>[ 374.261873] i915_gem_set_wedged RING_HEAD: 0x00001cf8
<7>[ 374.261877] i915_gem_set_wedged RING_TAIL: 0x00001cf8
<7>[ 374.261883] i915_gem_set_wedged RING_CTL: 0x00003000
<7>[ 374.261889] i915_gem_set_wedged RING_MODE: 0x00000200 [idle]
<7>[ 374.261893] i915_gem_set_wedged RING_IMR: fffffeff
<7>[ 374.261901] i915_gem_set_wedged ACTHD: 0x00000000_00201cf8
<7>[ 374.261908] i915_gem_set_wedged BBADDR: 0x00000000_00000000
<7>[ 374.261916] i915_gem_set_wedged DMA_FADDR: 0x00000000_00000000
<7>[ 374.261920] i915_gem_set_wedged IPEIR: 0x00000000
<7>[ 374.261924] i915_gem_set_wedged IPEHR: 0x00000000
<7>[ 374.261930] i915_gem_set_wedged Execlist status: 0x00000301 00000000
<7>[ 374.261936] i915_gem_set_wedged Execlist CSB read 2 [2 cached], write 5 [5 from hws], interrupt posted? yes, tasklet queued? no (enabled)
<7>[ 374.261942] i915_gem_set_wedged Execlist CSB[3]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[ 374.261949] i915_gem_set_wedged Execlist CSB[4]: 0x00000014 [0x00000014 in hwsp], context: 81 [81 in hwsp]
<7>[ 374.261956] i915_gem_set_wedged Execlist CSB[5]: 0x00000018 [0x00000018 in hwsp], context: 84 [84 in hwsp]
<7>[ 374.261962] i915_gem_set_wedged ELSP[0] count=1, ring->start=000d8000, rq: 13813! [aef:175] prio=395 @ 5261ms: signaled
<7>[ 374.261967] i915_gem_set_wedged ELSP[1] count=1, ring->start=00109000, rq: 13814! [af2:174] prio=292 @ 5261ms: signaled
<7>[ 374.261988] i915_gem_set_wedged HW active? 0x1
<7>[ 374.261991] i915_gem_set_wedged Queue priority: 292
<7>[ 374.261996] i915_gem_set_wedged Q 0 [af0:174] prio=248 @ 5265ms: igt/vcs0[5775]/5
<7>[ 374.262000] i915_gem_set_wedged Q 0 [af0:175] prio=236 @ 5261ms: igt/vcs0[5775]/5
<7>[ 374.262004] i915_gem_set_wedged Q 0 [aed:175] prio=195 @ 5262ms: igt/vcs0[5775]/2
<7>[ 374.262009] i915_gem_set_wedged Q 0 [af3:174] prio=136 @ 5262ms: igt/vcs0[5775]/8
<7>[ 374.262014] i915_gem_set_wedged Q 0 [aec:175] prio=134 @ 5262ms: igt/vcs0[5775]/1
<7>[ 374.262019] i915_gem_set_wedged Q 0 [aee:175] prio=13 @ 5262ms: igt/vcs0[5775]/3
<7>[ 374.262022] i915_gem_set_wedged IRQ? 0x3 (breadcrumbs? yes) (execlists? yes)
<7>[ 374.262024] i915_gem_set_wedged HWSP:
<7>[ 374.262029] i915_gem_set_wedged [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 374.262031] i915_gem_set_wedged *
<7>[ 374.262037] i915_gem_set_wedged [0040] 00000001 00000000 00000014 00000003 00000018 00000053 00000001 00000000
<7>[ 374.262041] i915_gem_set_wedged [0060] 00000014 00000051 00000018 00000054 00000000 00000000 00000000 00000005
<7>[ 374.262044] i915_gem_set_wedged [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 374.262047] i915_gem_set_wedged *
<7>[ 374.262051] i915_gem_set_wedged [00c0] 00013814 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 374.262055] i915_gem_set_wedged [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 374.262057] i915_gem_set_wedged *
<7>[ 374.262103] i915_gem_set_wedged Idle? no

commit 5db1d4ea91b6ee447c4ae01f7f56803e32e690b1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jun 4 08:34:40 2018 +0100

    drm/i915/execlists: Push the tasklet kick after reset to reset_finish

    In the unlikely case where we have failed to keep submitting to the GPU,
    we end up with the ELSP queue empty but a pending queue of requests.
    Here, we skip the per-engine reset as there is no guilty request, but in
    doing so we also skip the engine restart leaving ourselves with a
    permanently hung engine. A quick way to recover is by moving the tasklet
    kick to execlists_reset_finish() (from init_hw). We still emit the error
    on hanging, so the error is not lost but we should be able to recover.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180604073441.6737-2-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

We need to be careful to not mistake the STOP_RING hangs, bug 106560.

Martin, Time to close this one?

This is still happening:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-guc/igt@drv_selftest@live_hangcheck.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html

[ 612.399569] kthread for other engine bcs0 failed, err=-5
[ 612.399593] kthread for other engine vcs0 failed, err=-5
[ 612.399641] kthread for other engine vecs0 failed, err=-5
[ 612.399954] Failed to switch back to kernel context; declaring wedged
[ 612.416530] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[ 612.416562] Failed to switch back to kernel context; declaring wedged
[ 612.519742] i915: probe of 0000:00:02.0 failed with error -5

(In reply to Martin Peres from comment #4)
> This is still happening:
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-guc/
> igt@drv_selftest@live_hangcheck.html

is a guc fw bug.

> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/
> igt@drv_selftest@live_hangcheck.html

is bug 106530, not the missed tasklet which is what this bug was.

(In reply to Chris Wilson from comment #5)
> (In reply to Martin Peres from comment #4)
> > This is still happening:
> >
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-guc/
> > igt@drv_selftest@live_hangcheck.html
>
> is a guc fw bug.

OK, I'll file another one. Thanks!

> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/
> > igt@drv_selftest@live_hangcheck.html
>
> is bug 106530, not the missed tasklet which is what this bug was.

Pretty sure this is not the bug you wanted to link to :s

bug 106560 then :-p
The key symptom is the presence of STOP_RING in RING_MI_MODE following the reset. That is unexpected as it means the context doesn't execute afterwards, and the GPU just sits there laughing at us.

Martin, OK to close?

Seems to continue occurring according to CI buglog, but I'm not sure it really is the same issue:
http://gfx-ci.fi.intel.com/cibuglog-ng/issuefilter/1454/history

The errors that were still occurring have been moved to:
- GUC: https://bugs.freedesktop.org/show_bug.cgi?id=107837
- KBL: https://bugs.freedesktop.org/show_bug.cgi?id=106560

My bad, didn't realise you wanted to keep guc fw separately, I just write guc as off s.e.p.

(In reply to Chris Wilson from comment #11)
> My bad, didn't realise you wanted to keep guc fw separately, I just write
> guc as off s.e.p.

No worries, it's good to document GUC issues anyway.
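
For anyone triaging similar dumps, the two symptoms discussed in this thread can be told apart mechanically. The sketch below is only an illustration, not part of the driver or of IGT: the function names are made up, the regular expression is written against the "Execlist CSB read ..." line format seen in the engine dump in this report, and the STOP_RING and MODE_IDLE bit positions (bits 8 and 9) are an assumption taken from the i915 register definitions rather than from anything stated here.

```python
import re

# Matches the "Execlist CSB read ..." line as printed in the engine dump
# in this report; other kernel versions may format it differently.
CSB_LINE = re.compile(
    r"Execlist CSB read (?P<read>\d+).*?write (?P<write>\d+).*?"
    r"interrupt posted\? (?P<posted>yes|no), tasklet queued\? (?P<queued>yes|no)"
)

# Assumed bit positions in RING_MI_MODE (not stated in this report).
STOP_RING = 1 << 8
MODE_IDLE = 1 << 9


def missed_tasklet(line):
    """Return True if a CSB line shows the missed-tasklet symptom.

    The symptom is: unprocessed CSB events (read != write) with the
    interrupt posted but no tasklet queued to handle them.
    """
    m = CSB_LINE.search(line)
    if m is None:
        return False
    unprocessed = int(m["read"]) != int(m["write"])
    return unprocessed and m["posted"] == "yes" and m["queued"] == "no"


def ring_stopped(ring_mode):
    """Return True if STOP_RING is set (the bug 106560 symptom)."""
    return bool(ring_mode & STOP_RING)


if __name__ == "__main__":
    dump_line = (
        "i915_gem_set_wedged Execlist CSB read 2 [2 cached], write 5 "
        "[5 from hws], interrupt posted? yes, tasklet queued? no (enabled)"
    )
    print("missed tasklet:", missed_tasklet(dump_line))
    print("ring stopped:", ring_stopped(0x00000200))
```

Applied to the dump in this report, the CSB line has read 2 against write 5 with the interrupt posted and no tasklet queued, while RING_MODE 0x00000200 has only the idle bit set, which fits the missed tasklet rather than the STOP_RING hang.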