107188 – [CI][CFL] Occasional hang in drv_selftest@live_workarounds

Summary: [CI][CFL] Occasional hang in drv_selftest@live_workarounds

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (2):	107220 107292
Depends on:
Blocks:

Reported:	2018-07-11 08:47 UTC by Tomi Sarvela
Modified:	2018-08-27 12:45 UTC
CC List:	2 users

See Also:
i915 platform:	CFL
i915 features:

Description Tomi Sarvela 2018-07-11 08:47:47 UTC

Continuing with the series of "Initial findings" with Intel-GFX-CI and i915 selftests.

CFL-8109u (pre-production NUC) occasionally hangs in drv_selftest@live_workarounds.

Example panic:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4465/fi-cfl-8109u/pstore0-1531239199_Panic_1.log

History:
https://intel-gfx-ci.01.org/tree/drm-tip/fi-cfl-8109u.html

Comment 1 Chris Wilson 2018-07-11 08:54:32 UTC

My impression is that this is the same bug that affects live_hangcheck on execlists: it looks to be the restart after reset that freezes. Unlike live_hangcheck, live_workarounds has no background timer to kick it if the reset fails. I should fix that.

Comment 2 Chris Wilson 2018-07-11 13:42:18 UTC

This should turn the incompletes into fails:

commit cb4dc8daf4cb72d7833148a6087b425b5c20e903
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jul 11 13:29:52 2018 +0100

    drm/i915/selftests: Add a safety net to live_workarounds
    
    Since live_workarounds poke around the w/a registers and checks to see
    if they survive across a reset, we are prone to fouling the machine and
    leaving it in a non-recoverable state. Wrap the probe inside a timeout
    to abort the test if the reset fails.
    
    v2: Include GEM_TRACE on declaring wedged.
    v3: Add a few includes to make the header look standalone.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180711122952.18448-1-chris@chris-wilson.co.uk
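
The idea in the commit message above, bounding the wait so that a failed reset becomes a test failure instead of a hang, can be illustrated with a minimal userspace sketch. This is not the i915 code: struct fake_gpu, fake_reset_done() and probe_with_timeout() are hypothetical names, and the "clock" is a simple loop counter rather than a kernel timer or delayed work.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; not an i915 structure. */
struct fake_gpu {
    int ticks_until_reset_done; /* negative: the reset never completes */
    bool wedged;                /* set when we give up on the device */
};

/* Poll once: has the simulated reset finished yet? */
static bool fake_reset_done(struct fake_gpu *gpu)
{
    if (gpu->ticks_until_reset_done < 0)
        return false;
    if (gpu->ticks_until_reset_done == 0)
        return true;
    gpu->ticks_until_reset_done--;
    return false;
}

/*
 * Wait for the reset, but only for a bounded number of ticks.
 * On timeout, declare the device wedged and report an error so the
 * caller can fail the test rather than block forever.
 */
static int probe_with_timeout(struct fake_gpu *gpu, int timeout_ticks)
{
    assert(gpu && timeout_ticks >= 0);

    for (int tick = 0; tick <= timeout_ticks; tick++) {
        if (fake_reset_done(gpu))
            return 0;
    }

    gpu->wedged = true;
    return -ETIMEDOUT;
}
```

With this shape, a reset that never completes yields -ETIMEDOUT and a wedged flag, which a test harness would presumably record as a fail rather than an incomplete run.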

Comment 3 Chris Wilson 2018-07-13 12:00:45 UTC

*** Bug 107220 has been marked as a duplicate of this bug. ***

Comment 4 Chris Wilson 2018-07-14 13:28:14 UTC

Found a subsequent BUG_ON, hit after wedging the driver, that makes this worse than just the reset failure seen in live_hangcheck.

Comment 5 Chris Wilson 2018-07-19 14:12:07 UTC

*** Bug 107292 has been marked as a duplicate of this bug. ***

Comment 6 Chris Wilson 2018-08-15 09:19:53 UTC

commit a99b32a6fff7e482a267c72e565c8c410ce793d7 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 14 18:18:57 2018 +0100

    drm/i915: Clear stop-engine for a pardoned reset
    
    If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
    in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
    it on the per-engine exit path as we know that either we skipped the
    reset and so need the cancellation, or the reset was successful and the
    cancellation is a no-op, or there was an error and we will follow up
    with a full-reset or wedging (both of which will stop the engines again
    as required).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107188
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106560
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180814171857.24673-1-chris@chris-wilson.co.uk
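
The reasoning in the commit message above (clear STOP_RING unconditionally on the per-engine exit path) can be sketched against a simulated register. This is a userspace model under stated assumptions, not the driver code: struct fake_engine, fake_write_masked(), engine_reset_exit() and fake_engine_reset() are hypothetical, the STOP_RING bit position is assumed for illustration, and the register is modelled as a masked register whose upper 16 bits select which of the lower 16 bits a write updates.

```c
#include <assert.h>
#include <stdint.h>

#define STOP_RING (1u << 8) /* assumed bit position, for illustration only */

/* Masked-write encodings: the high half enables the update of each low bit. */
static inline uint32_t masked_bit_enable(uint32_t bit)  { return (bit << 16) | bit; }
static inline uint32_t masked_bit_disable(uint32_t bit) { return bit << 16; }

/* Hypothetical engine with a single simulated MI_MODE register. */
struct fake_engine {
    uint32_t mi_mode;
};

/* Apply a masked write: only bits named in the upper 16 bits change. */
static void fake_write_masked(struct fake_engine *e, uint32_t value)
{
    uint32_t mask = value >> 16;
    uint32_t bits = value & 0xffffu;

    assert(e);
    e->mi_mode = (e->mi_mode & ~mask) | (bits & mask);
}

enum reset_outcome { RESET_PARDONED, RESET_OK, RESET_FAILED };

/* Exit path: always clear STOP_RING, whatever happened before. */
static void engine_reset_exit(struct fake_engine *e)
{
    fake_write_masked(e, masked_bit_disable(STOP_RING));
}

static int fake_engine_reset(struct fake_engine *e, enum reset_outcome outcome)
{
    /* The engine is stopped before the reset is attempted. */
    fake_write_masked(e, masked_bit_enable(STOP_RING));

    /* A successful reset is modelled as clearing the register. */
    if (outcome == RESET_OK)
        e->mi_mode = 0;

    /*
     * Pardoned: the reset was skipped, so the clear is required.
     * Success: the bit is already clear, so this is a no-op.
     * Failure: the caller escalates and stops the engine again.
     */
    engine_reset_exit(e);

    return outcome == RESET_FAILED ? -1 : 0;
}
```

In this model, skipping the call to engine_reset_exit() in the pardoned case would leave STOP_RING set, which corresponds to the hang described in the commit message.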

Comment 7 Lakshmi 2018-08-24 06:43:30 UTC

Last seen 1 month ago. Closing the bug.

Comment 8 Lakshmi 2018-08-27 12:45:01 UTC

This bug used to appear about once every 1-20 rounds; it has not appeared in the last 217 rounds. Closing the bug.
