105012 – [CI] igt@drv_selftest@live_hangcheck - Incomplete

Summary: [CI] igt@drv_selftest@live_hangcheck - Incomplete

Status:	CLOSED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-02-08 07:13 UTC by Marta Löfstedt
Modified:	2018-03-23 07:16 UTC
CC List:	1 user

See Also:
i915 platform:	KBL
i915 features:	GEM/Other

Description Marta Löfstedt 2018-02-08 07:13:39 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4228/shard-kbl4/igt@drv_selftest@live_hangcheck.html

Runtime is shorter than the owatch timeout.

<7>[  181.243477] [IGT] drv_selftest: starting subtest live_hangcheck
...
<7>[  191.339222] [drm:i915_gem_reset_engine [i915]] resetting vecs0 to restart from tail of request 0x19c4
then stray.

pstore:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4228/shard-kbl4/pstore35-1518035023_Oops_1.log

ftrace from:
<0>[  205.496565] drv_self-5966    3.... 204341950us : reset_common_ring: vcs0 seqno=23d2e
...
<0>[  205.505672] ksoftirq-17      1..s. 204390040us : execlists_submission_tasklet: rcs0 cs-irq head=0 [0], tail=0 [0]
...
backtrace of soft-irq

Comment 1 Chris Wilson 2018-02-08 07:24:28 UTC

Could do with execlists_submission_tasklet+0x525 translating to a line number.
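
As a rough illustration of that translation step: a `symbol+0xoffset` string first has to be split into the symbol name and the offset, after which the offset can be added to the symbol's base address and handed to a debug-info lookup. The sketch below only does the splitting. `parse_sym_offset` is a hypothetical helper written for this example, not a kernel or IGT function. In practice the kernel tree's `scripts/faddr2line` takes the `func+offset` form directly, given an object file built with debug info.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split "symbol+0xoff" (optionally followed by
 * "/0xsize", as printed in kernel backtraces) into name and offset.
 * Returns 0 on success, -1 on malformed input or if name_len is too small.
 */
int parse_sym_offset(const char *s, char *name, size_t name_len,
                     unsigned long *offset)
{
    const char *plus;
    char *end;
    size_t n;

    assert(s && name && offset);

    plus = strchr(s, '+');
    if (!plus || plus == s)
        return -1;              /* no '+' or empty symbol name */

    n = (size_t)(plus - s);
    if (n >= name_len)
        return -1;              /* name buffer too small */
    memcpy(name, s, n);
    name[n] = '\0';

    errno = 0;
    /* Base 16 accepts an optional "0x" prefix. */
    *offset = strtoul(plus + 1, &end, 16);
    if (errno || end == plus + 1)
        return -1;              /* no hex digits after '+' */
    if (*end != '\0' && *end != '/')
        return -1;              /* unexpected trailing characters */
    return 0;
}
```

Calling it on `"execlists_submission_tasklet+0x525"` yields the name `execlists_submission_tasklet` and the offset `0x525`. Resolving that to a source line still needs the matching i915 build with debug info.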

Comment 2 Chris Wilson 2018-03-22 23:09:38 UTC

Hopefully,

commit 0f36a85c3bd5e0dfcbb49af203a96a933dae86cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 22 07:35:33 2018 +0000

    drm/i915: Flush pending interrupt following a GPU reset
    
    After resetting the GPU (or subset of engines), call synchronize_irq()
    to flush any pending irq before proceeding with the cleanup. For a
    device level reset, we disable the interrupts around the reset, but when
    resetting just one engine, we have to avoid such global disabling. This
    leaves us open to an interrupt arriving for the engine as we try to
    reset it. We already do try to flush the IIR following the reset, but we
    have to ensure that the in-flight interrupt does not land after we start
    cleaning up after the reset; enter synchronize_irq().
    
    As it currently stands, we very rarely, but fatally, see sequences such as:
    
        2.... 57964564us : execlists_reset_prepare: rcs0
        2.... 57964613us : execlists_reset: rcs0 seqno=424
        0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
        2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- global_seqno 1060
        2.... 57964703us : execlists_reset_finish: rcs0
        0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, irq-posted?=1
    
    v2: Move the sync into the execlists reset handler so that we coordinate
    the flush with disabling the interrupt handling and canceling the
    pending interrupt.
    v3: Just use synchronize_hardirq() to avoid the might_sleep(), we do not
    yet have threaded-irq to worry about.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Jeff McGee <jeff.mcgee@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180322073533.5313-4-chris@chris-wilson.co.uk
    Reviewed-by: Jeff McGee <jeff.mcgee@intel.com>
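
The ordering the commit message describes (stop handling, wait for any in-flight interrupt handler, cancel the pending interrupt, then clean up) can be modelled in userspace. The following is only a sketch under stated assumptions: it uses pthreads rather than kernel primitives, every name in it (`engine_model`, `engine_irq_handler`, `engine_reset`) is hypothetical, and it is not the i915 implementation. The condition-variable wait stands in for what `synchronize_hardirq()` provides in the kernel. Dropping events during a reset is a simplification of the real driver's behaviour.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of one engine; not i915 code. */
struct engine_model {
    pthread_mutex_t lock;
    pthread_cond_t  quiesced;    /* signalled when the last handler exits */
    int  handlers_in_flight;     /* handlers currently executing */
    bool accept_irq;             /* false while a reset is in progress */
    bool irq_posted;             /* "work pending" flag set by the handler */
};

void engine_init(struct engine_model *e)
{
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->quiesced, NULL);
    e->handlers_in_flight = 0;
    e->accept_irq = true;
    e->irq_posted = false;
}

/*
 * Stand-in for the interrupt handler. Returns false if the event was
 * dropped because a reset is in progress (a simplification).
 */
bool engine_irq_handler(struct engine_model *e)
{
    pthread_mutex_lock(&e->lock);
    if (!e->accept_irq) {
        pthread_mutex_unlock(&e->lock);
        return false;
    }
    e->handlers_in_flight++;
    pthread_mutex_unlock(&e->lock);

    /* ... the handler body would run here, outside the lock ... */

    pthread_mutex_lock(&e->lock);
    e->irq_posted = true;
    if (--e->handlers_in_flight == 0)
        pthread_cond_broadcast(&e->quiesced);
    pthread_mutex_unlock(&e->lock);
    return true;
}

/* Stand-in for the per-engine reset path. */
void engine_reset(struct engine_model *e)
{
    pthread_mutex_lock(&e->lock);
    e->accept_irq = false;                /* stop accepting new events */
    while (e->handlers_in_flight > 0)     /* the "synchronize" step */
        pthread_cond_wait(&e->quiesced, &e->lock);
    assert(e->handlers_in_flight == 0);   /* no handler can land from here on */
    e->irq_posted = false;                /* cancel the stale pending work */
    /* ... cleanup after the reset would go here ... */
    e->accept_irq = true;
    pthread_mutex_unlock(&e->lock);
}
```

If `engine_irq_handler()` is running on another thread when `engine_reset()` is called, the reset blocks until that handler has finished, so the handler's `irq_posted = true` cannot land after the cleanup has started. Without the wait loop, the model would allow the same interleaving as the trace in the commit message, where the submission tasklet runs with `irq-posted?=1` after `execlists_reset_finish`.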
