Bug 106947

Summary:	[CI] igt@drv_selftest@live_hangcheck - dmesg-fail - Failed to switch back to kernel context; declaring wedged
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	BXT, GLK, KBL, PNV	i915 features:	GEM/Other

Description Martin Peres 2018-06-18 07:15:07 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4325/shard-glk1/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4326/shard-kbl2/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4325/shard-apl2/igt@drv_selftest@live_hangcheck.html

[  437.351731] Failed to switch back to kernel context; declaring wedged
[  437.397965] WARNING: CPU: 2 PID: 24 at kernel/workqueue.c:1384 __queue_work+0x478/0x610
[  437.397983] Modules linked in: i915(+) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel lpc_ich snd_hda_codec snd_hwdep snd_hda_core r8169 pinctrl_broxton pinctrl_intel snd_pcm mei_me mii prime_numbers mei [last unloaded: i915]
[  437.398058] CPU: 2 PID: 24 Comm: ksoftirqd/2 Tainted: G     U            4.17.0-rc7-CI-CI_DRM_4325+ #1
[  437.398062] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0049.2018.0508.1356 05/08/2018
[  437.398067] RIP: 0010:__queue_work+0x478/0x610
[  437.398071] RSP: 0018:ffffc900000f7d90 EFLAGS: 00010006
[  437.398077] RAX: 0000000080000100 RBX: ffff880267ec90f0 RCX: 0000000000000101
[  437.398080] RDX: 0000000080000101 RSI: 0000000000000000 RDI: 00000000ffffffff
[  437.398084] RBP: 0000000000000202 R08: 0000000000000000 R09: 0000000000000001
[  437.398087] R10: ffffc900000f7d90 R11: 0000000000000000 R12: ffff8802723853e8
[  437.398090] R13: 0000000000000010 R14: 0000000000000010 R15: ffff8802609b4650
[  437.398094] FS:  0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[  437.398098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  437.398101] CR2: 00007efd5e8aace0 CR3: 000000026ce8e000 CR4: 00000000003406e0
[  437.398104] Call Trace:
[  437.398116]  queue_work_on+0x60/0x70
[  437.398222]  ? i915_gem_free_object+0x110/0x110 [i915]
[  437.398228]  rcu_process_callbacks+0x211/0x710
[  437.398239]  ? smpboot_thread_fn+0x6b/0x280
[  437.398246]  __do_softirq+0xc1/0x4e1
[  437.398255]  ? smpboot_thread_fn+0x23/0x280
[  437.398260]  ? smpboot_thread_fn+0x6b/0x280
[  437.398266]  run_ksoftirqd+0x1d/0x40
[  437.398270]  smpboot_thread_fn+0x1d3/0x280
[  437.398278]  ? sort_range+0x20/0x20
[  437.398282]  kthread+0x119/0x130
[  437.398287]  ? kthread_flush_work_fn+0x10/0x10
[  437.398295]  ret_from_fork+0x3a/0x50
[  437.398310] Code: e8 a0 e2 f5 ff e9 da fc ff ff 65 8b 05 22 17 f7 7e a9 00 01 1f 00 75 13 65 48 8b 3c 25 c0 4e 01 00 f6 47 24 20 0f 85 9f 00 00 00 <0f> 0b 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 0b e9 cd 
[  437.398495] irq event stamp: 2638727
[  437.398500] hardirqs last  enabled at (2638726): [<ffffffff81930bec>] _raw_spin_unlock_irqrestore+0x4c/0x60
[  437.398505] hardirqs last disabled at (2638727): [<ffffffff810a38f9>] queue_work_on+0x19/0x70
[  437.398509] softirqs last  enabled at (2638710): [<ffffffff81c0032b>] __do_softirq+0x32b/0x4e1
[  437.398514] softirqs last disabled at (2638715): [<ffffffff8108c20d>] run_ksoftirqd+0x1d/0x40
[  437.398519] WARNING: CPU: 2 PID: 24 at kernel/workqueue.c:1384 __queue_work+0x478/0x610
[  437.398522] ---[ end trace 0cb69487f9c90dbd ]---

Comment 1 Chris Wilson 2018-06-18 07:34:22 UTC

Hmm, this is a different symptom than bug 106560. Instead of the RING_MODE being disabled (with RING_STOP), here it looks like we missed kicking the tasklet:

<7>[  374.261828] i915_gem_set_wedged vcs0
<7>[  374.261831] i915_gem_set_wedged 	current seqno 13814, last 13814, hangcheck 0 [5263 ms]
<7>[  374.261834] i915_gem_set_wedged 	Reset count: 5598 (global 1)
<7>[  374.261838] i915_gem_set_wedged 	Requests:
<7>[  374.261869] i915_gem_set_wedged 	RING_START: 0x00109000
<7>[  374.261873] i915_gem_set_wedged 	RING_HEAD:  0x00001cf8
<7>[  374.261877] i915_gem_set_wedged 	RING_TAIL:  0x00001cf8
<7>[  374.261883] i915_gem_set_wedged 	RING_CTL:   0x00003000
<7>[  374.261889] i915_gem_set_wedged 	RING_MODE:  0x00000200 [idle]
<7>[  374.261893] i915_gem_set_wedged 	RING_IMR: fffffeff
<7>[  374.261901] i915_gem_set_wedged 	ACTHD:  0x00000000_00201cf8
<7>[  374.261908] i915_gem_set_wedged 	BBADDR: 0x00000000_00000000
<7>[  374.261916] i915_gem_set_wedged 	DMA_FADDR: 0x00000000_00000000
<7>[  374.261920] i915_gem_set_wedged 	IPEIR: 0x00000000
<7>[  374.261924] i915_gem_set_wedged 	IPEHR: 0x00000000
<7>[  374.261930] i915_gem_set_wedged 	Execlist status: 0x00000301 00000000
<7>[  374.261936] i915_gem_set_wedged 	Execlist CSB read 2 [2 cached], write 5 [5 from hws], interrupt posted? yes, tasklet queued? no (enabled)
<7>[  374.261942] i915_gem_set_wedged 	Execlist CSB[3]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  374.261949] i915_gem_set_wedged 	Execlist CSB[4]: 0x00000014 [0x00000014 in hwsp], context: 81 [81 in hwsp]
<7>[  374.261956] i915_gem_set_wedged 	Execlist CSB[5]: 0x00000018 [0x00000018 in hwsp], context: 84 [84 in hwsp]
<7>[  374.261962] i915_gem_set_wedged 		ELSP[0] count=1, ring->start=000d8000, rq: 13813! [aef:175] prio=395 @ 5261ms: signaled
<7>[  374.261967] i915_gem_set_wedged 		ELSP[1] count=1, ring->start=00109000, rq: 13814! [af2:174] prio=292 @ 5261ms: signaled
<7>[  374.261988] i915_gem_set_wedged 		HW active? 0x1
<7>[  374.261991] i915_gem_set_wedged 		Queue priority: 292
<7>[  374.261996] i915_gem_set_wedged 		Q 0 [af0:174] prio=248 @ 5265ms: igt/vcs0[5775]/5
<7>[  374.262000] i915_gem_set_wedged 		Q 0 [af0:175] prio=236 @ 5261ms: igt/vcs0[5775]/5
<7>[  374.262004] i915_gem_set_wedged 		Q 0 [aed:175] prio=195 @ 5262ms: igt/vcs0[5775]/2
<7>[  374.262009] i915_gem_set_wedged 		Q 0 [af3:174] prio=136 @ 5262ms: igt/vcs0[5775]/8
<7>[  374.262014] i915_gem_set_wedged 		Q 0 [aec:175] prio=134 @ 5262ms: igt/vcs0[5775]/1
<7>[  374.262019] i915_gem_set_wedged 		Q 0 [aee:175] prio=13 @ 5262ms: igt/vcs0[5775]/3
<7>[  374.262022] i915_gem_set_wedged IRQ? 0x3 (breadcrumbs? yes) (execlists? yes)
<7>[  374.262024] i915_gem_set_wedged HWSP:
<7>[  374.262029] i915_gem_set_wedged [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  374.262031] i915_gem_set_wedged *
<7>[  374.262037] i915_gem_set_wedged [0040] 00000001 00000000 00000014 00000003 00000018 00000053 00000001 00000000
<7>[  374.262041] i915_gem_set_wedged [0060] 00000014 00000051 00000018 00000054 00000000 00000000 00000000 00000005
<7>[  374.262044] i915_gem_set_wedged [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  374.262047] i915_gem_set_wedged *
<7>[  374.262051] i915_gem_set_wedged [00c0] 00013814 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  374.262055] i915_gem_set_wedged [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  374.262057] i915_gem_set_wedged *
<7>[  374.262103] i915_gem_set_wedged Idle? no

Comment 2 Chris Wilson 2018-07-12 13:33:37 UTC

commit 5db1d4ea91b6ee447c4ae01f7f56803e32e690b1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 4 08:34:40 2018 +0100

    drm/i915/execlists: Push the tasklet kick after reset to reset_finish
    
    In the unlikely case where we have failed to keep submitting to the GPU,
    we end up with the ELSP queue empty but a pending queue of requests.
    Here, we skip the per-engine reset as there is no guilty request, but in
    doing so we also skip the engine restart leaving ourselves with a
    permanently hung engine. A quick way to recover is by moving the tasklet
    kick to execlists_reset_finish() (from init_hw). We still emit the error
    on hanging, so the error is not lost but we should be able to recover.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180604073441.6737-2-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

We need to be careful to not mistake the STOP_RING hangs, bug 106560.

Comment 3 Dhinakaran Pandiyan 2018-07-19 20:46:09 UTC

Martin,

Time to close this one?

Comment 4 Martin Peres 2018-07-20 10:27:50 UTC

This is still happening:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-guc/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/igt@drv_selftest@live_hangcheck.html

[  612.399569] kthread for other engine bcs0 failed, err=-5
[  612.399593] kthread for other engine vcs0 failed, err=-5
[  612.399641] kthread for other engine vecs0 failed, err=-5
[  612.399954] Failed to switch back to kernel context; declaring wedged
[  612.416530] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[  612.416562] Failed to switch back to kernel context; declaring wedged
[  612.519742] i915: probe of 0000:00:02.0 failed with error -5

Comment 5 Chris Wilson 2018-07-20 10:33:39 UTC

(In reply to Martin Peres from comment #4)
> This is still happening:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-guc/
> igt@drv_selftest@live_hangcheck.html

is a guc fw bug.

> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/
> igt@drv_selftest@live_hangcheck.html

is bug 106530, not the missed tasklet which is what this bug was.

Comment 6 Martin Peres 2018-07-20 10:45:49 UTC

(In reply to Chris Wilson from comment #5)
> (In reply to Martin Peres from comment #4)
> > This is still happening:
> > 
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-guc/
> > igt@drv_selftest@live_hangcheck.html
> 
> is a guc fw bug.

OK, I'll file another one. Thanks!

> 
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4512/fi-kbl-7500u/
> > igt@drv_selftest@live_hangcheck.html
> 
> is bug 106530, not the missed tasklet which is what this bug was.

Pretty sure this is not the bug you wanted to link to :s

Comment 7 Chris Wilson 2018-07-20 10:49:05 UTC

bug 106560 then :-p

The key symptom is the presence of STOP_RING in RING_MI_MODE following the reset. That is unexpected as it means the context doesn't execute afterwards, and the GPU just sits there laughing at us.

Comment 8 Francesco Balestrieri 2018-08-04 09:24:42 UTC

Martin, OK to close?

Comment 9 Francesco Balestrieri 2018-08-07 08:08:32 UTC

Seems to continue occurring according to CI buglog, but I'm not sure it really is the same issue:

http://gfx-ci.fi.intel.com/cibuglog-ng/issuefilter/1454/history

Comment 10 Martin Peres 2018-09-05 10:48:06 UTC

The errors that were still occuring have been moved to:

 - GUC: https://bugs.freedesktop.org/show_bug.cgi?id=107837
 - KBL: https://bugs.freedesktop.org/show_bug.cgi?id=106560

Comment 11 Chris Wilson 2018-09-05 10:55:34 UTC

My bad, didn't realise you wanted to keep guc fw separately, I just write guc as off s.e.p.

Comment 12 Martin Peres 2018-09-05 11:01:54 UTC

(In reply to Chris Wilson from comment #11)
> My bad, didn't realise you wanted to keep guc fw separately, I just write
> guc as off s.e.p.

No worries, it's good to document GUC issues anyway.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.