Bug 104261

Summary:	[CI] igt@gem_exec_schedule@preempt- - fail - Failed assertion: !"GPU hung"
Product:	DRI	Reporter:	Marta Löfstedt <marta.lofstedt>
Component:	DRM/Intel	Assignee:	Kimmo Nikkanen <knikkane>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	BXT, GLK, KBL	i915 features:	GPU hang

Description Marta Löfstedt 2017-12-14 11:49:16 UTC

A lot of igt@gem_exec_schedule subtests started to fail with a GPU hang on CI_DRM_3514:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl1/igt@gem_exec_schedule@preempt-contexts-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb3/igt@gem_exec_schedule@preempt-contexts-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-contexts-bsd.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl4/igt@gem_exec_schedule@preempt-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb2/igt@gem_exec_schedule@preempt-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl4/igt@gem_exec_schedule@preempt-vebox.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl2/igt@gem_exec_schedule@preempt-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb2/igt@gem_exec_schedule@preempt-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl6/igt@gem_exec_schedule@preempt-render.html

(gem_exec_schedule:4286) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:4286) igt-aux-CRITICAL: Failed assertion: !"GPU hung"

This is possibly related to bug 104259, which also started on CI_DRM_3514 on some BAT machines.

Comment 1 Marta Löfstedt 2017-12-14 12:00:42 UTC

There are more of these failures, which are already covered by bug 102848:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-bsd2.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl2/igt@gem_exec_schedule@preempt-other-blt.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-contexts-bsd.html
GLK is green on the above.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-self-bsd2.html

The tests below do not fail due to a GPU hang; instead they fail with:

(gem_exec_schedule:2729) CRITICAL: Test assertion failure function preemptive_hang, file gem_exec_schedule.c:545:
(gem_exec_schedule:2729) CRITICAL: Failed assertion: gem_bo_busy(fd, spin[n]->handle)
Subtest preemptive-hang-vebox failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-apl5/igt@gem_exec_schedule@preemptive-hang-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-kbl7/igt@gem_exec_schedule@preempt-self-bsd2.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb3/igt@gem_exec_schedule@preemptive-hang-vebox.html

Comment 2 Chris Wilson 2017-12-14 12:24:08 UTC

The frequency is certainly troubling. It looks like it fails on the first execution on each machine? So I'm wondering if we have some leftover state that we are now not flushing. The previous failures were more sporadic: the tests ran much slower and timed out. The ones I've looked at here barely make it to the second iteration.

Comment 3 Marta Löfstedt 2017-12-14 12:45:59 UTC

I believe I was wrong to call out CI_DRM_3514 as the first occurrence. It had actually already happened on IGT_4063, which is based on CI_DRM_3513.

Here is one example:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4063/shard-apl8/igt@gem_exec_schedule@preempt-contexts-bsd.html

Comment 4 Chris Wilson 2017-12-14 20:30:58 UTC

So far, every failure I've looked at has the hallmarks of the spin batch not ending. However, it is refusing to fail on my machines, suggesting that it is something to do with the pre-existing state. Grr.

Comment 5 Marta Löfstedt 2017-12-15 07:39:25 UTC

I am not sure the failure below is actually related to this issue. It has happened only once on the GLK shards, whereas the other failures reproduce on every run from CI_DRM_3514 through CI_DRM_3521 so far.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3517/shard-glkb2/igt@gem_exec_schedule@preemptive-hang-bsd.html

(gem_exec_schedule:2682) CRITICAL: Test assertion failure function preemptive_hang, file gem_exec_schedule.c:545:
(gem_exec_schedule:2682) CRITICAL: Failed assertion: gem_bo_busy(fd, spin[n]->handle)
Subtest preemptive-hang-bsd failed.

Comment 6 Chris Wilson 2017-12-16 09:48:01 UTC

With a bit of luck,

commit 7b6da818d86fddfc88ddb523d6539c1bf7fc6302
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 16 00:03:34 2017 +0000

    drm/i915: Restore the kernel context after a GPU reset on an idle engine
    
    As part of the system requirement for powersaving is that we always have
    a context loaded. Upon boot and resume, we load the kernel_context to
    ensure that some valid state is set before powersaving kicks in, we
    should do so after a full GPU reset as well. We only need to do so for
    an idle engine, as any active engines will restart by executing the
    stuck request, loading its context. For the idle engine, we create a
    new request to load the kernel_context instead.
    
    For whatever reason, perfoming a dummy execute on the idle engine after
    reset papers over a subsequent GPU hang in rare circumstances, even on
    machines not using contexts (e.g. Pineview).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104259
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104261
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171216000334.8197-1-chris@chris-wilson.co.uk
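
The behaviour described in the commit message above can be illustrated with a toy model. This is only a sketch in Python, not the i915 code: the `Engine` class, the `full_gpu_reset` function and the `KERNEL_CONTEXT` label are hypothetical names made up for illustration, and the model assumes that a full reset discards whatever context each engine had loaded.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label standing in for the driver's kernel_context.
KERNEL_CONTEXT = "kernel_context"


@dataclass
class Engine:
    """Toy stand-in for one GPU engine (not the real i915 structure)."""
    name: str
    # Context of the request that was in flight at reset time, None if idle.
    stuck_request_ctx: Optional[str] = None
    # Context the (modelled) hardware currently has loaded.
    loaded_ctx: Optional[str] = None
    # Contexts of the requests submitted after the reset, in order.
    submitted: List[str] = field(default_factory=list)


def full_gpu_reset(engines: List[Engine]) -> None:
    """Model the post-reset policy: every engine ends up with a context loaded."""
    for engine in engines:
        # Assumption: the reset wipes the previously loaded context.
        engine.loaded_ctx = None
        if engine.stuck_request_ctx is not None:
            # Active engine: it restarts by executing the stuck request,
            # which loads that request's context.
            ctx = engine.stuck_request_ctx
        else:
            # Idle engine: create a new request that loads the kernel context.
            ctx = KERNEL_CONTEXT
        engine.submitted.append(ctx)
        engine.loaded_ctx = ctx


if __name__ == "__main__":
    engines = [Engine("rcs0", stuck_request_ctx="user_ctx"), Engine("vecs0")]
    full_gpu_reset(engines)
    for e in engines:
        print(e.name, e.loaded_ctx)
```

In this model the active engine reloads the context of its stuck request and the idle engine is given the kernel context, so no engine is left without a loaded context after the reset.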

Comment 7 Marta Löfstedt 2017-12-18 07:39:30 UTC

The patch was integrated in CI_DRM_3526 and these tests have been green since then, so I will close this.

Comment 8 Chris Wilson 2017-12-18 08:59:58 UTC

*** Bug 104315 has been marked as a duplicate of this bug. ***

Comment 9 Chris Wilson 2017-12-18 09:00:00 UTC

*** Bug 104314 has been marked as a duplicate of this bug. ***
