Continuing with the series of "Initial findings" with Intel-GFX-CI and i915 selftests.

CFL-8109u (pre-production NUC) occasionally hangs in drv_selftest@live_workarounds.

Example panic: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4465/fi-cfl-8109u/pstore0-1531239199_Panic_1.log
History: https://intel-gfx-ci.01.org/tree/drm-tip/fi-cfl-8109u.html
My impression is that this is the same bug that affects live_hangcheck on execlists, in that it looks to be the restart from reset that freezes. Unlike live_hangcheck we don't have a timer in the background to kick live_workarounds in case of reset failure. I should fix that.
This should turn the incompletes into fails:

commit cb4dc8daf4cb72d7833148a6087b425b5c20e903
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jul 11 13:29:52 2018 +0100

    drm/i915/selftests: Add a safety net to live_workarounds

    Since live_workarounds poke around the w/a registers and checks to see if
    they survive across a reset, we are prone to fouling the machine and
    leaving it in a non-recoverable state. Wrap the probe inside a timeout
    to abort the test if the reset fails.

    v2: Include GEM_TRACE on declaring wedged.
    v3: Add a few includes to make the header look standalone.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180711122952.18448-1-chris@chris-wilson.co.uk
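For readers unfamiliar with the pattern, the "safety net" idea in the commit message above can be illustrated with a small userspace model. This is only a sketch, not the actual i915 code: `struct fake_device`, `reset_complete()` and `probe_with_timeout()` are hypothetical names, and a bounded poll count stands in for the real timeout. The point is that a reset which never completes ends in a reported failure (device marked wedged) rather than an endless wait, which is what would turn a CI "incomplete" into a "fail".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device under test; not an i915 structure. */
struct fake_device {
	int polls_until_reset_done; /* negative: the reset never completes */
	bool wedged;                /* set once we give up on the device */
};

/* Simulated hardware poll: reports true once the reset has finished. */
static bool reset_complete(struct fake_device *dev)
{
	if (dev->polls_until_reset_done < 0)
		return false;
	if (dev->polls_until_reset_done == 0)
		return true;
	dev->polls_until_reset_done--;
	return false;
}

/*
 * Wait for the reset, but only for max_polls attempts. On expiry, mark
 * the device wedged and return an error instead of waiting forever.
 */
static int probe_with_timeout(struct fake_device *dev, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (reset_complete(dev))
			return 0;
	}

	dev->wedged = true;
	return -EIO;
}
```

In this model, a device whose reset completes within the budget returns 0 and is left untouched, while one that never completes returns `-EIO` with `wedged` set.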
*** Bug 107220 has been marked as a duplicate of this bug. ***
Found a subsequent BUG_ON (following the act of wedging the driver) that makes this worse than just the reset (live_hangcheck) failure.
*** Bug 107292 has been marked as a duplicate of this bug. ***
commit a99b32a6fff7e482a267c72e565c8c410ce793d7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 14 18:18:57 2018 +0100

    drm/i915: Clear stop-engine for a pardoned reset

    If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
    in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
    it on the per-engine exit path as we know that either we skipped the
    reset and so need the cancellation, or the reset was successful and the
    cancellation is a no-op, or there was an error and we will follow up
    with a full-reset or wedging (both of which will stop the engines again
    as required).

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106560
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180814171857.24673-1-chris@chris-wilson.co.uk
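The three cases listed in the commit message above (pardoned, successful, failed) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the driver code: `struct fake_engine`, `prepare_reset()`, `engine_reset()` and `enum reset_outcome` are hypothetical, the register is a plain variable, and the `STOP_RING` bit position is chosen for illustration only. It shows why clearing the bit unconditionally on the exit path is safe in all three cases.

```c
#include <stdint.h>

/* Illustrative bit position; the model treats mi_mode as a plain variable. */
#define STOP_RING (1u << 8)

/* Hypothetical outcomes of a per-engine reset attempt. */
enum reset_outcome { RESET_PARDONED, RESET_OK, RESET_ERROR };

/* Hypothetical stand-in for an engine; mi_mode models RING_MI_MODE. */
struct fake_engine {
	uint32_t mi_mode;
};

/* Stopping the engine before a reset asserts STOP_RING. */
static void prepare_reset(struct fake_engine *e)
{
	e->mi_mode |= STOP_RING;
}

static int engine_reset(struct fake_engine *e, enum reset_outcome outcome)
{
	int err = 0;

	prepare_reset(e);

	switch (outcome) {
	case RESET_PARDONED:
		/* Reset skipped: STOP_RING is still asserted here. */
		break;
	case RESET_OK:
		/* Model assumption: a successful reset clears the register. */
		e->mi_mode = 0;
		break;
	case RESET_ERROR:
		/* The caller is expected to escalate (full reset or wedge). */
		err = -1;
		break;
	}

	/*
	 * Unconditional clear on the exit path: required after a pardon,
	 * a no-op after a successful reset, harmless after an error.
	 */
	e->mi_mode &= ~STOP_RING;

	return err;
}
```

Without the final clear, the `RESET_PARDONED` path in this model would return with `STOP_RING` still set, which corresponds to the hang described in the commit message.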
Last seen 1 month ago. Closing the bug.
This bug used to appear once every 1-20 rounds; it has now not appeared in the last 217 rounds. Closing the bug.