Bug 104261

Summary: [CI] igt@gem_exec_schedule@*preempt-* - fail - Failed assertion: !"GPU hung"
Product: DRI
Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/Intel
Assignee: Kimmo Nikkanen <knikkane>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: BXT, GLK, KBL
i915 features: GPU hang

Comment 2 Chris Wilson 2017-12-14 12:24:08 UTC
The frequency is certainly troubling. It looks like it fails on the first execution on each machine? So I'm wondering if we have some leftover state that we are now not flushing. The previous failures have been more sporadic, where the tests were running much slower and timed out. The ones I've looked at here barely make it to the second iteration.
Comment 3 Marta Löfstedt 2017-12-14 12:45:59 UTC
I believe I was wrong to call out CI_DRM_3514 as the first occurrence; it was actually already present on IGT_4063, which is based on CI_DRM_3513.

Here is one example:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4063/shard-apl8/igt@gem_exec_schedule@preempt-contexts-bsd.html
Comment 4 Chris Wilson 2017-12-14 20:30:58 UTC
So far, every failure I've looked at has the hallmarks of the spin batch not ending. However, it is refusing to fail on my machines, suggesting that it is something to do with the pre-existing state. Grr.
Comment 5 Marta Löfstedt 2017-12-15 07:39:25 UTC
I am not sure this one is actually related to this issue: it happened only once, on the GLK shards, whereas the other failures reproduce on every run from 3514 through 3521 so far.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3517/shard-glkb2/igt@gem_exec_schedule@preemptive-hang-bsd.html

(gem_exec_schedule:2682) CRITICAL: Test assertion failure function preemptive_hang, file gem_exec_schedule.c:545:
(gem_exec_schedule:2682) CRITICAL: Failed assertion: gem_bo_busy(fd, spin[n]->handle)
Subtest preemptive-hang-bsd failed.
Comment 6 Chris Wilson 2017-12-16 09:48:01 UTC
With a bit of luck,

commit 7b6da818d86fddfc88ddb523d6539c1bf7fc6302
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 16 00:03:34 2017 +0000

    drm/i915: Restore the kernel context after a GPU reset on an idle engine
    
    As part of the system requirement for powersaving is that we always have
    a context loaded. Upon boot and resume, we load the kernel_context to
    ensure that some valid state is set before powersaving kicks in, we
    should do so after a full GPU reset as well. We only need to do so for
    an idle engine, as any active engines will restart by executing the
    stuck request, loading its context. For the idle engine, we create a
    new request to load the kernel_context instead.
    
    For whatever reason, performing a dummy execute on the idle engine after
    reset papers over a subsequent GPU hang in rare circumstances, even on
    machines not using contexts (e.g. Pineview).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104259
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104261
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171216000334.8197-1-chris@chris-wilson.co.uk
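
The rule described in the commit message above can be sketched as a small standalone model. This is only an illustration of the stated logic, not the actual i915 code: `struct engine`, `enum ctx_id`, `full_gpu_reset()` and `restart_after_reset()` are hypothetical names invented for this sketch, and a full reset is assumed to leave every engine with no context loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the i915 types. */
enum ctx_id { CTX_NONE = 0, CTX_KERNEL = 1, CTX_USER = 2 };

struct engine {
    const char *name;
    bool has_stuck_request;  /* engine was active when the reset happened */
    enum ctx_id stuck_ctx;   /* context owning the stuck request, if any */
    enum ctx_id loaded_ctx;  /* context currently loaded on the engine */
};

/* Assumption: a full GPU reset leaves no context loaded on any engine. */
static void full_gpu_reset(struct engine *engines, size_t count)
{
    for (size_t i = 0; i < count; i++)
        engines[i].loaded_ctx = CTX_NONE;
}

/*
 * After the reset, an active engine restarts by re-executing its stuck
 * request, which reloads that request's context. An idle engine has
 * nothing to replay, so a new request loads the kernel context instead.
 */
static void restart_after_reset(struct engine *engines, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        struct engine *e = &engines[i];

        if (e->has_stuck_request)
            e->loaded_ctx = e->stuck_ctx;
        else
            e->loaded_ctx = CTX_KERNEL;
    }
}
```

Calling `full_gpu_reset()` followed by `restart_after_reset()` on an array of engines leaves every engine with some context loaded, which is the precondition for powersaving that the commit message describes.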
Comment 7 Marta Löfstedt 2017-12-18 07:39:30 UTC
The patch was integrated in CI_DRM_3526 and these tests are now green, so I will close this.
Comment 8 Chris Wilson 2017-12-18 08:59:58 UTC
*** Bug 104315 has been marked as a duplicate of this bug. ***
Comment 9 Chris Wilson 2017-12-18 09:00:00 UTC
*** Bug 104314 has been marked as a duplicate of this bug. ***
