Bug 107500

Summary:	[CI][SHARDS] igt@gem_eio@reset-stress - fail - Failed assertion: elapsed < 250e6
Product:	DRI
Component:	DRM/Intel
Status:	CLOSED FIXED
Severity:	normal
Priority:	medium
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	HSW
i915 features:	GEM/Other
Reporter:	Martin Peres <martin.peres>
Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
CC:	chris, intel-gfx-bugs

Description Martin Peres 2018-08-06 14:46:10 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4587/shard-hsw2/igt@gem_eio@reset-stress.html

(gem_eio:2635) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:2635) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:2635) CRITICAL: Wake up following reset+wedge took 3723.972ms
Subtest reset-stress failed.

Introduced by: 

igt/gem_eio: Measure reset delay from thread

We assert that we complete a wedge within 250ms. However, when we use a
thread to delay the wedging until after we start waiting, that thread
itself is delayed longer than our wait timeout. This results in a false
positive error where we fail the test before we even trigger the reset.

Reorder the test so that we only ever measure the delay from triggering
the reset until we wakeup, and assert that is in a timely fashion
(less than 250ms).

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105954
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
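
A minimal sketch of the measurement the commit message describes, assuming the start timestamp is sampled at the moment the reset is triggered rather than when the wait begins. The helper names (elapsed_ns, wakeup_in_time) and the RESET_WAKEUP_LIMIT_NS macro are hypothetical and are not taken from gem_eio.c; only the 250ms bound comes from the failing assertion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* 250ms in nanoseconds, matching the "elapsed < 250e6" assertion in the log. */
#define RESET_WAKEUP_LIMIT_NS 250000000LL

/* Nanoseconds between two CLOCK_MONOTONIC samples (hypothetical helper). */
static int64_t elapsed_ns(const struct timespec *start,
			  const struct timespec *end)
{
	int64_t ns = (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL +
		     (end->tv_nsec - start->tv_nsec);

	/* A monotonic clock must not run backwards between the samples. */
	assert(ns >= 0);
	return ns;
}

/*
 * True if the waiter woke up within the limit. The first sample is assumed
 * to be taken when the reset is triggered, so any delay in starting the
 * helper thread is excluded from the measurement.
 */
static bool wakeup_in_time(const struct timespec *reset_triggered,
			   const struct timespec *woke)
{
	return elapsed_ns(reset_triggered, woke) < RESET_WAKEUP_LIMIT_NS;
}
```

With both timestamps taken via clock_gettime(CLOCK_MONOTONIC, ...), the 3723.972ms wakeup reported above would make wakeup_in_time() return false, while a 200ms wakeup would pass.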

Comment 1 Chris Wilson 2018-08-06 14:51:17 UTC

It is an interesting bug. To all appearances the HW doesn't generate an interrupt if we execute a MI_USER_INTERRUPT shortly after the GPU reset. A partially successful workaround was:

commit 39f3be162c46bc2349ad7a5bd89536eb83561c81
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 30 08:53:50 2018 +0100

    drm/i915: Kick waiters on resetting legacy rings

but there is still a window after the kick and before interrupts are received again.

Comment 2 Chris Wilson 2018-08-06 19:47:10 UTC

A long shot (one that I've tried earlier with no success) was

commit a6476ebd4350d51146ef0492b4b06bc0d31e8827
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Aug 6 15:56:47 2018 +0100

    drm/i915: Stop dropping irq around resets
    
    A long time ago, we were afraid of handling interrupts and signaling
    waiters during a reset, worrying that the confusion in request handling
    would interfere with our attempts to process the reset in an orderly
    fashion. Since then, we have isolated our irq-driven request handling by
    virtue of the engine->timeline.lock and control of kthreads where
    required, eliminating the danger of concurrently processing interrupts.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180806145647.13131-1-chris@chris-wilson.co.uk

but at least that confirms that it's not a shadow caused by the disabling of irq across reset. Fresh ideas required.

Comment 3 Chris Wilson 2018-08-08 16:19:41 UTC

commit a4a717010f4e8cacaa3f0cae8a22f25c39ae1d41
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 8 11:51:00 2018 +0100

    drm/i915: Unmask user interrupts writes into HWSP on snb/ivb/vlv/hsw
    
    An oddity occurs on Sandybridge, Ivybridge and Haswell (and presumably
    Valleyview) in that for the period following the GPU restart after a
    reset, there are no GT interrupts received. From Ville's notes, bit 0 in
    the HWSTAM corresponds to the render interrupt, and if we unmask it we
    do see immediate resumption of GT interrupt delivery (via the master irq
    handler) after the reset.
    
    v2: Limit the w/a to the render interrupt from rcs
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107500
    Fixes: c5498089463b ("drm/i915: Mask everything in ring HWSTAM on gen6+ in ringbuffer mode")
    References: d420a50c21ef ("drm/i915: Clean up the HWSTAM mess")
    Testcase: igt/gem_eio/reset-stress
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
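
For illustration only, a sketch of how the mask value described in the commit message might be computed. It assumes that a set bit in HWSTAM means "masked" and that the previous behaviour was to mask every bit; the names RENDER_USER_INTERRUPT_BIT, hwstam_value and hwstam_selfcheck are hypothetical and are not the identifiers used in the i915 driver. Only "bit 0 is the render interrupt" and "limit the w/a to rcs" come from the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 corresponds to the render interrupt, per the commit message. */
#define RENDER_USER_INTERRUPT_BIT (UINT32_C(1) << 0)

/*
 * Value to write into an engine's HWSTAM (hypothetical helper).
 * Assumption: set bits are masked, so ~0 masks everything.
 */
static uint32_t hwstam_value(bool is_render_engine)
{
	uint32_t mask = ~UINT32_C(0);

	/* v2 of the patch limits the workaround to the render engine (rcs). */
	if (is_render_engine)
		mask &= ~RENDER_USER_INTERRUPT_BIT;

	return mask;
}

/* Usage example: only the render engine leaves bit 0 unmasked. */
static void hwstam_selfcheck(void)
{
	assert((hwstam_value(true) & RENDER_USER_INTERRUPT_BIT) == 0);
	assert(hwstam_value(false) == ~UINT32_C(0));
}
```

The actual register write and the platform check (snb/ivb/vlv/hsw) are left out here, since they depend on driver internals not shown in this report.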

Comment 4 Lakshmi 2018-10-04 16:51:45 UTC

This issue is resolved/fixed. It was last seen 1 month 3 weeks ago; until then, the failure appeared in every round.
