Bug 107916

Summary: [CI][BAT] igt@drv_selftest@live_hangcheck - dmesg-fail - failed to idle after reset
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: high CC: intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: ICL i915 features: GEM/Other

Description Martin Peres 2018-09-13 07:48:49 UTC
As per Chris' comment in https://bugs.freedesktop.org/show_bug.cgi?id=106560 specifying that any new failure in igt@drv_selftest@live_hangcheck should be treated as a separate bug, I am opening a new bug:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4811/fi-cfl-s3/igt@drv_selftest@live_hangcheck.html

[  503.668876] Setting dangerous option live_selftests - tainting kernel
[  526.201555] i915_reset_engine(vcs0:others-active): failed to idle after reset
[  526.206678] i915/intel_hangcheck_live_selftests: igt_reset_engines failed with error -5
[  526.314861] i915: probe of 0000:00:02.0 failed with error -5
Comment 1 Chris Wilson 2018-09-13 07:53:22 UTC
What happened here is the tasklet never runs. It's scheduled, enabled and ready...

<3>[  526.201555] i915_reset_engine(vcs0:others-active): failed to idle after reset
<6>[  526.201561] i915 0000:00:02.0: [drm] vcs0
<6>[  526.201563] i915 0000:00:02.0: [drm] 	current seqno 15874, last 15874, hangcheck 0 [10481 ms]
<6>[  526.201564] i915 0000:00:02.0: [drm] 	Reset count: 16464 (global 1)
<6>[  526.201566] i915 0000:00:02.0: [drm] 	Requests:
<6>[  526.201574] i915 0000:00:02.0: [drm] 		first  15874! [71d7:6fa] prio=0 @ 52ms: (null)
<6>[  526.201576] i915 0000:00:02.0: [drm] 		last   15874! [71d7:6fa] prio=0 @ 52ms: (null)
<6>[  526.201612] i915 0000:00:02.0: [drm] 	RING_START: 0x002f0000
<6>[  526.201623] i915 0000:00:02.0: [drm] 	RING_HEAD:  0x00000680
<6>[  526.201625] i915 0000:00:02.0: [drm] 	RING_TAIL:  0x00000680
<6>[  526.201627] i915 0000:00:02.0: [drm] 	RING_CTL:   0x00000000
<6>[  526.201629] i915 0000:00:02.0: [drm] 	RING_MODE:  0x00000200 [idle]
<6>[  526.201631] i915 0000:00:02.0: [drm] 	RING_IMR: fffffeff
<6>[  526.201633] i915 0000:00:02.0: [drm] 	ACTHD:  0x00000000_00000680
<6>[  526.201636] i915 0000:00:02.0: [drm] 	BBADDR: 0x00000000_00000000
<6>[  526.201638] i915 0000:00:02.0: [drm] 	DMA_FADDR: 0x00000000_00000000
<6>[  526.201640] i915 0000:00:02.0: [drm] 	IPEIR: 0x00000000
<6>[  526.201641] i915 0000:00:02.0: [drm] 	IPEHR: 0x00000000
<6>[  526.201644] i915 0000:00:02.0: [drm] 	Execlist status: 0x00000301 00000000
<6>[  526.201646] i915 0000:00:02.0: [drm] 	Execlist CSB read 5, write 1 [mmio:1], tasklet queued? yes (enabled)
<6>[  526.201648] i915 0000:00:02.0: [drm] 	Execlist CSB[0]: 0x00000001 [mmio:0x00000001], context: 0 [mmio:0]
<6>[  526.201650] i915 0000:00:02.0: [drm] 	Execlist CSB[1]: 0x00000018 [mmio:0x00000018], context: 2 [mmio:2]
<6>[  526.201653] i915 0000:00:02.0: [drm] 		ELSP[0] count=1, ring->start=002f0000, rq: 15874! [71d7:6fa] prio=0 @ 52ms: (null)
<6>[  526.201654] i915 0000:00:02.0: [drm] 		ELSP[1] idle
<6>[  526.201655] i915 0000:00:02.0: [drm] 		HW active? 0x1
<6>[  526.201682] i915 0000:00:02.0: [drm] 		E 15874! [71d7:6fa] prio=0 @ 52ms: (null)
<6>[  526.201700] i915 0000:00:02.0: [drm] 		Queue priority: -2147483648
<6>[  526.201702] i915 0000:00:02.0: [drm] IRQ? 0x0 (breadcrumbs? no)
<6>[  526.201703] i915 0000:00:02.0: [drm] HWSP:
<6>[  526.201706] i915 0000:00:02.0: [drm] [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<6>[  526.201707] i915 0000:00:02.0: [drm] *
<6>[  526.201709] i915 0000:00:02.0: [drm] [0040] 00000001 00000000 00000018 00000002 00000001 00000000 00000001 00000000
<6>[  526.201711] i915 0000:00:02.0: [drm] [0060] 00000018 00000008 00000001 00000000 00000000 00000000 00000000 00000001
<6>[  526.201713] i915 0000:00:02.0: [drm] [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<6>[  526.201714] i915 0000:00:02.0: [drm] *
<6>[  526.201717] i915 0000:00:02.0: [drm] [00c0] 00015874 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<6>[  526.201718] i915 0000:00:02.0: [drm] [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<6>[  526.201720] i915 0000:00:02.0: [drm] *
<6>[  526.201721] i915 0000:00:02.0: [drm] Idle? no

Loss for words. To cap it all we even try and kick the tasklet when checking idle. To skip that kick means the tasklet was locked?
Comment 2 Chris Wilson 2018-09-14 11:04:12 UTC
commit 22495b68f954c9018fd52bc620292cc80df2e418 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 14 09:00:16 2018 +0100

    drm/i915: Flush the tasklet when checking for idle
    
    In order to reduce latency when checking for idle we kick the tasklet
    directly. Sometimes this is not enough as it is queued on another cpu
    and so to improve the accuracy of this idle-check (and so to reduce
    latency overall by avoiding another pass, or worse declaring a timeout!)
    wait for the tasklet to complete.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107916
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-2-chris@chris-wilson.co.uk

Hopefully.
Comment 3 Lakshmi 2018-10-15 08:00:46 UTC
This issue occurred only once with CI_DRM_4811 (1 month / 415 runs ago).
Comment 4 Martin Peres 2018-11-13 16:03:22 UTC
(In reply to Lakshmi from comment #3)
> This issue occurred only once with CI_DRM_4811 (1 month / 415 runs ago).

Closing and archiving. Thanks!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.