https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4587/shard-hsw2/igt@gem_eio@reset-stress.html

(gem_eio:2635) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:2635) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:2635) CRITICAL: Wake up following reset+wedge took 3723.972ms
Subtest reset-stress failed.

Introduced by:

    igt/gem_eio: Measure reset delay from thread

    We assert that we complete a wedge within 250ms. However, when we use a
    thread to delay the wedging until after we start waiting, that thread
    itself is delayed longer than our wait timeout. This results in a false
    positive error where we fail the test before we even trigger the reset.

    Reorder the test so that we only ever measure the delay from triggering
    the reset until we wakeup, and assert that is in a timely fashion (less
    than 250ms).

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105954
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
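The reordering described in that commit message can be sketched roughly as below. This is a standalone illustration, not the IGT code: `elapsed_ns` and `woke_in_time` are hypothetical helpers, and only the 250ms bound (`250e6` ns) is taken from the failed assertion in the log. The idea is that the start timestamp is sampled at the moment the reset is triggered rather than when the waiter begins waiting, so a late-running trigger thread does not eat into the budget.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Upper bound from the failed assertion: elapsed < 250e6 (nanoseconds). */
#define RESET_WAKE_LIMIT_NS 250000000LL

/* Hypothetical helper: nanoseconds between two monotonic clock samples. */
static int64_t elapsed_ns(const struct timespec *start,
			  const struct timespec *end)
{
	return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL +
	       (int64_t)(end->tv_nsec - start->tv_nsec);
}

/*
 * Hypothetical check: 'reset_triggered' is sampled immediately before the
 * reset is triggered, 'woken' immediately after the wait returns. Any delay
 * before the reset is triggered is deliberately excluded.
 */
static bool woke_in_time(const struct timespec *reset_triggered,
			 const struct timespec *woken)
{
	return elapsed_ns(reset_triggered, woken) < RESET_WAKE_LIMIT_NS;
}
```

In a real test both timestamps would presumably come from `clock_gettime(CLOCK_MONOTONIC, &ts)`, with the first sample taken in the thread that triggers the reset.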
It is an interesting bug. To all appearances the HW doesn't generate an interrupt if we execute an MI_USER_INTERRUPT shortly after the GPU reset. A partially successful workaround was:

commit 39f3be162c46bc2349ad7a5bd89536eb83561c81
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jul 30 08:53:50 2018 +0100

    drm/i915: Kick waiters on resetting legacy rings

but there is still the window after the kick and before interrupts are being received.
A long shot (one that I've tried earlier with no success) was

commit a6476ebd4350d51146ef0492b4b06bc0d31e8827
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Aug 6 15:56:47 2018 +0100

    drm/i915: Stop dropping irq around resets

    A long time ago, we were afraid of handling interrupts and signaling
    waiters during a reset, worrying that the confusion in request handling
    would interfere with our attempts to process the reset in an orderly
    fashion. Since then, we have isolated our irq-driven request handling by
    virtue of the engine->timeline.lock and control of kthreads where
    required, eliminating the danger of concurrently processing interrupts.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180806145647.13131-1-chris@chris-wilson.co.uk

but at least that confirms that it's not a shadow caused by the disabling of irq across reset. Fresh ideas required.
commit a4a717010f4e8cacaa3f0cae8a22f25c39ae1d41
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Aug 8 11:51:00 2018 +0100

    drm/i915: Unmask user interrupts writes into HWSP on snb/ivb/vlv/hsw

    An oddity occurs on Sandybridge, Ivybridge and Haswell (and presumably
    Valleyview) in that for the period following the GPU restart after a
    reset, there are no GT interrupts received. From Ville's notes, bit 0 in
    the HWSTAM corresponds to the render interrupt, and if we unmask it we
    do see immediate resumption of GT interrupt delivery (via the master irq
    handler) after the reset.

    v2: Limit the w/a to the render interrupt from rcs

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107500
    Fixes: c5498089463b ("drm/i915: Mask everything in ring HWSTAM on gen6+ in ringbuffer mode")
    References: d420a50c21ef ("drm/i915: Clean up the HWSTAM mess")
    Testcase: igt/gem_eio/reset-stress
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
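The workaround in that commit message amounts to clearing a single bit in an otherwise fully masked register value. A minimal sketch, assuming a 32-bit HWSTAM value in which a set bit means "masked"; `hwstam_value`, its parameters and the `HWSTAM_RENDER_IRQ_BIT` name are hypothetical and are not taken from the i915 source. Only three details come from the message itself: bit 0 corresponds to the render interrupt, the w/a is limited to rcs, and it applies to snb/ivb/vlv/hsw.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per the commit message, bit 0 of HWSTAM is the render interrupt. */
#define HWSTAM_RENDER_IRQ_BIT (UINT32_C(1) << 0)

/*
 * Hypothetical helper: compute the value to program into HWSTAM.
 * Assumption: a set bit masks the corresponding status write.
 * 'needs_workaround' stands in for a platform check (snb/ivb/vlv/hsw).
 */
static uint32_t hwstam_value(bool is_render_engine, bool needs_workaround)
{
	uint32_t mask = ~UINT32_C(0);	/* default: mask everything */

	/* v2 of the patch limits the w/a to the render interrupt from rcs. */
	if (is_render_engine && needs_workaround)
		mask &= ~HWSTAM_RENDER_IRQ_BIT;

	return mask;
}
```

With these assumptions, only the render engine on an affected platform ends up with bit 0 cleared; every other combination keeps the fully masked default.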
This issue is resolved/fixed. It was last seen 1 month 3 weeks ago; until then, the failure appeared in every round.