104259 – [BAT] igt@gem_sync@basic-all - fail - Failed assertion: !"GPU hung"

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-12-14 09:05 UTC by Marta Löfstedt
Modified:	2017-12-18 09:01 UTC
CC List:	1 user

See Also:
i915 platform:	G33, PNV
i915 features:	GEM/Other

Description Marta Löfstedt 2017-12-14 09:05:48 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/fi-blb-e6850/igt@gem_sync@basic-all.html

(gem_sync:3296) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_sync:3296) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-all failed.

and
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/fi-pnv-d510/igt@gem_sync@basic-all.html
(gem_sync:3075) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_sync:3075) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-all failed.

The same test failed in the same run on two machines. This looks suspicious, so I am creating a new bug instead of piling it onto bug 102848.

Comment 1 Chris Wilson 2017-12-14 09:09:23 UTC

It was anticipated.

<7>[  228.586379] [IGT] gem_sync: executing
<4>[  228.606909] Setting dangerous option reset - tainting kernel
<7>[  228.609394] [IGT] gem_sync: starting subtest basic-all
<7>[  230.752959] missed_breadcrumb rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  230.752988] missed_breadcrumb 	current seqno 2854cd, last 2854ce, hangcheck 2854cd [0 ms], inflight 1
<7>[  230.752995] missed_breadcrumb 	Reset count: 0 (global 4)
<7>[  230.753004] missed_breadcrumb 	Requests:
<7>[  230.753013] missed_breadcrumb 		first  2854ce [2:2854d8] prio=-2147483648 @ 2137ms: [global]
<7>[  230.753020] missed_breadcrumb 		last   2854ce [2:2854d8] prio=-2147483648 @ 2137ms: [global]
<7>[  230.753028] missed_breadcrumb 		active 2854ce [2:2854d8] prio=-2147483648 @ 2137ms: [global]
<7>[  230.753039] missed_breadcrumb 		[head 9238, postfix 9250, tail 9260, batch 0x00000000_00325000]
<7>[  230.753044] missed_breadcrumb 	RING_START: 0x00004000 [0x00004000]
<7>[  230.753048] missed_breadcrumb 	RING_HEAD:  0x00009220 [0x00009210]
<7>[  230.753053] missed_breadcrumb 	RING_TAIL:  0x00009260 [0x00009260]
<7>[  230.753058] missed_breadcrumb 	RING_CTL:   0x0001f001
<7>[  230.753063] missed_breadcrumb 	RING_MODE:  0x00000000
<7>[  230.753068] missed_breadcrumb 	ACTHD:  0x00000000_155f7f94
<7>[  230.753073] missed_breadcrumb 	BBADDR: 0x00000000_00000000
<7>[  230.753078] missed_breadcrumb 		E 2854ce [2:2854d8] prio=-2147483648 @ 2137ms: [global]
<7>[  230.753083] missed_breadcrumb 	gem_sync [3298] waiting for 2854ce
<7>[  230.753088] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  230.753092] missed_breadcrumb Idle? no
<7>[  230.753097] missed_breadcrumb 
<6>[  236.783524] [drm] GPU HANG: ecode 3:0:0x7dffffc1, in gem_sync [3296], reason: Hang on rcs0, action: reset

The log shows a TLB miss, with the GPU walking off into the empty GTT.
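For anyone reading the dump above, here is a minimal sketch for pulling the key numbers out of the missed_breadcrumb lines. It is not part of IGT or i915; the `summarize` helper and its field names are hypothetical, and the regular expressions only assume the line layout seen in this particular log. Treating a large gap between ACTHD and the batch address as a sign of the engine "walking off" is an assumption made for illustration, not something the comment states.

```python
import re

# Three lines copied from the dump in comment 1 (whitespace simplified).
LOG = """\
missed_breadcrumb  current seqno 2854cd, last 2854ce, hangcheck 2854cd [0 ms], inflight 1
missed_breadcrumb  [head 9238, postfix 9250, tail 9260, batch 0x00000000_00325000]
missed_breadcrumb  ACTHD:  0x00000000_155f7f94
"""


def parse_hex(text):
    """Parse a hex value as printed in the dump, e.g. '0x00000000_00325000'."""
    return int(text.replace("_", ""), 16)


def summarize(log):
    """Hypothetical helper: extract seqnos and addresses from the dump."""
    seq = re.search(r"current seqno ([0-9a-f]+), last ([0-9a-f]+)", log)
    batch = re.search(r"batch (0x[0-9a-f_]+)", log)
    acthd = re.search(r"ACTHD:\s+(0x[0-9a-f_]+)", log)
    return {
        # Difference between the last seqno in the dump and the current one.
        "pending": parse_hex(seq.group(2)) - parse_hex(seq.group(1)),
        "batch": parse_hex(batch.group(1)),
        "acthd": parse_hex(acthd.group(1)),
    }


info = summarize(LOG)
print("requests pending:", info["pending"])
print("ACTHD - batch: 0x%x" % (info["acthd"] - info["batch"]))
```

In this log one request (2854ce) is still outstanding, and ACTHD sits far away from the batch address, which fits the description of the engine executing from somewhere it should not be.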

Comment 2 Chris Wilson 2017-12-16 09:48:10 UTC

commit 7b6da818d86fddfc88ddb523d6539c1bf7fc6302
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 16 00:03:34 2017 +0000

    drm/i915: Restore the kernel context after a GPU reset on an idle engine
    
    Part of the system requirement for powersaving is that we always have
    a context loaded. Upon boot and resume, we load the kernel_context to
    ensure that some valid state is set before powersaving kicks in, we
    should do so after a full GPU reset as well. We only need to do so for
    an idle engine, as any active engines will restart by executing the
    stuck request, loading its context. For the idle engine, we create a
    new request to load the kernel_context instead.
    
    For whatever reason, performing a dummy execute on the idle engine after
    reset papers over a subsequent GPU hang in rare circumstances, even on
    machines not using contexts (e.g. Pineview).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104259
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104261
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171216000334.8197-1-chris@chris-wilson.co.uk
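The policy described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the i915 implementation: the `Engine` class, the `reset_engine` function, and the string context names are all hypothetical, and a real reset involves hardware state that is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Engine:
    """Hypothetical stand-in for an engine; not an i915 structure."""
    name: str
    # Contexts of requests that were still queued when the reset happened.
    pending: List[str] = field(default_factory=list)
    loaded_context: Optional[str] = None


def reset_engine(engine, kernel_context="kernel_context"):
    """Model which context ends up loaded after a full GPU reset."""
    # Assumption of this model: the reset leaves no context loaded.
    engine.loaded_context = None
    if engine.pending:
        # An active engine restarts with the stuck request,
        # which loads that request's context.
        engine.loaded_context = engine.pending[0]
    else:
        # An idle engine gets a new request that loads the kernel context.
        engine.loaded_context = kernel_context
    return engine.loaded_context


busy = Engine("busy", pending=["user_ctx"])
idle = Engine("idle")
print(reset_engine(busy), reset_engine(idle))
```

The point of the model is that no engine is left without a context after the reset: the busy one picks it up from the replayed request, the idle one from the extra kernel-context request.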

Comment 3 Marta Löfstedt 2017-12-18 09:01:46 UTC

The fix was integrated in CI_DRM_3526; results have been green since then.
