Bug 105900

Summary:	[CI] igt@gem_exec_* - fail - Failed assertion: !"GPU hung"
Product:	DRI	Reporter:	Marta Löfstedt <marta.lofstedt>
Component:	DRM/Intel	Assignee:	Francesco Balestrieri <francesco.balestrieri>
Status:	CLOSED INVALID	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	high	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	BYT, CNL, GLK, HSW, SKL	i915 features:	GPU hang

Description Marta Löfstedt 2018-04-05 05:55:28 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4019/shard-glkb1/igt@gem_exec_whisper@normal.html

(gem_exec_whisper:1264) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_exec_whisper:1264) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest normal failed.

Comment 1 Marta Löfstedt 2018-04-18 06:40:34 UTC

Also,
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4432/shard-glkb6/igt@gem_exec_whisper@normal.html

Comment 2 Martin Peres 2018-04-19 07:39:58 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4064/shard-glk1/igt@gem_exec_flush@basic-wb-set-default.html

(gem_exec_flush:3819) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_exec_flush:3819) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-wb-set-default failed.

Comment 3 Martin Peres 2018-04-25 12:58:15 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4092/shard-glk3/igt@gem_exec_store@cachelines-bsd.html

(gem_exec_store:1466) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_exec_store:1466) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest cachelines-bsd failed.

Comment 4 Martin Peres 2018-05-03 14:18:22 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-cnl-y3/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2291) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_exec_await:2291) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

Comment 5 Chris Wilson 2018-05-03 14:24:05 UTC

(In reply to Martin Peres from comment #4)
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-cnl-y3/
> igt@gem_exec_await@wide-contexts.html
> 
> (gem_exec_await:2291) igt_aux-CRITICAL: Test assertion failure function
> sig_abort, file ../lib/igt_aux.c:481:
> (gem_exec_await:2291) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
> Subtest wide-contexts failed.

This is a different issue. The GPU hang here is a result of hitting a blocking ioctl in the test. The earlier hangs look more like those fixed by

commit ea491b23b2ffba069537a8216060d4d3400931a7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 2 23:03:12 2018 +0100

    drm/i915: Reset the hangcheck timestamp before repeating a seqno

Comment 6 Chris Wilson 2018-05-03 14:30:01 UTC

(In reply to Chris Wilson from comment #5)
> (In reply to Martin Peres from comment #4)
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-cnl-y3/
> > igt@gem_exec_await@wide-contexts.html
> > 
> > (gem_exec_await:2291) igt_aux-CRITICAL: Test assertion failure function
> > sig_abort, file ../lib/igt_aux.c:481:
> > (gem_exec_await:2291) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
> > Subtest wide-contexts failed.
> 
> This is a different issue. The GPU hang here is a result of hitting a
> blocking ioctl in the test.

Using Execlists submission
Ring size: 143 batches

If we can only fit 143 batches in a ring, why did we submit 144?...

<7>[  134.752155] hangcheck bcs0
<7>[  134.752174] hangcheck 	current seqno c707, last c797, hangcheck c707 [4031 ms]
<7>[  134.752178] hangcheck 	Reset count: 0 (global 0)
<7>[  134.752181] hangcheck 	Requests:
<7>[  134.752192] hangcheck 		first  c708 [4e0a:1] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752196] hangcheck 		last   c797 [4e0a:90] prio=0 @ 4037ms: gem_exec_await[2291]/3
<7>[  134.752207] hangcheck 		active c708 [4e0a:1] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752212] hangcheck 		[head 0000, postfix 0030, tail 0050, batch 0x00000000_00040000]
<7>[  134.752215] hangcheck 		ring->start:  0x035a8000
<7>[  134.752218] hangcheck 		ring->head:   0x00000000
<7>[  134.752221] hangcheck 		ring->tail:   0x00002cf8
<7>[  134.752224] hangcheck 		ring->emit:   0x00002d00
<7>[  134.752227] hangcheck 		ring->space:  0x000012c0
<7>[  134.752232] hangcheck 	RING_START: 0x035a8000
<7>[  134.752237] hangcheck 	RING_HEAD:  0x00000020
<7>[  134.752241] hangcheck 	RING_TAIL:  0x00002cf8
<7>[  134.752246] hangcheck 	RING_CTL:   0x00003001
<7>[  134.752252] hangcheck 	RING_MODE:  0x00000000
<7>[  134.752256] hangcheck 	RING_IMR: feffffff
<7>[  134.752263] hangcheck 	ACTHD:  0x00000000_00040000
<7>[  134.752271] hangcheck 	BBADDR: 0x00000000_00040001
<7>[  134.752277] hangcheck 	DMA_FADDR: 0x00000000_00040200
<7>[  134.752282] hangcheck 	IPEIR: 0x00000000
<7>[  134.752286] hangcheck 	IPEHR: 0x18800101
<7>[  134.752292] hangcheck 	Execlist status: 0x00024049 0000057e
<7>[  134.752297] hangcheck 	Execlist CSB read 1 [1 cached], write 1 [1 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  134.752302] hangcheck 		ELSP[0] count=1, rq: c797 [4e0a:90] prio=0 @ 4037ms: gem_exec_await[2291]/3
<7>[  134.752305] hangcheck 		ELSP[1] idle
<7>[  134.752308] hangcheck 		HW active? 0x5
<7>[  134.752355] hangcheck 		E c708 [4e0a:1] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752392] hangcheck 		E c709 [4e0a:2] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752396] hangcheck 		E c70a [4e0a:3] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752399] hangcheck 		E c70b [4e0a:4] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752403] hangcheck 		E c70c [4e0a:5] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752407] hangcheck 		E c70d [4e0a:6] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752411] hangcheck 		E c70e [4e0a:7] prio=0 @ 4047ms: gem_exec_await[2291]/3
<7>[  134.752427] hangcheck 		...skipping 136 executing requests...
<7>[  134.752431] hangcheck 		E c797 [4e0a:90] prio=0 @ 4037ms: gem_exec_await[2291]/3
<7>[  134.752434] hangcheck 		Queue priority: -2147483648

Comment 7 Chris Wilson 2018-05-03 14:35:50 UTC

(In reply to Chris Wilson from comment #6)
> (In reply to Chris Wilson from comment #5)
> > (In reply to Martin Peres from comment #4)
> > > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-cnl-y3/
> > > igt@gem_exec_await@wide-contexts.html
> > > 
> > > (gem_exec_await:2291) igt_aux-CRITICAL: Test assertion failure function
> > > sig_abort, file ../lib/igt_aux.c:481:
> > > (gem_exec_await:2291) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
> > > Subtest wide-contexts failed.
> > 
> > This is a different issue. The GPU hang here is a result of hitting a
> > blocking ioctl in the test.
> 
> Using Execlists submission
> Ring size: 143 batches
> 
> If we can only fit 143 batches in a ring, why did we submit 144?...

Nah, last batch has seqno 144. Just an off-by-one (or at least misleading) comment about the number of skipped batches. The ring that stuck was:

<7>[  134.752727] hangcheck vecs0
<7>[  134.752730] hangcheck 	current seqno c707, last c740, hangcheck c707 [4031 ms]
<7>[  134.752733] hangcheck 	Reset count: 0 (global 0)
<7>[  134.752736] hangcheck 	Requests:
<7>[  134.752740] hangcheck 		first  c708 [4e17:1] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752744] hangcheck 		last   c740 [4e17:39] prio=0 @ 4033ms: gem_exec_await[2291]/4
<7>[  134.752748] hangcheck 		active c708 [4e17:1] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752752] hangcheck 		[head 0000, postfix 0030, tail 0050, batch 0x00000000_00040000]
<7>[  134.752755] hangcheck 		ring->start:  0x035ac000
<7>[  134.752758] hangcheck 		ring->head:   0x00000000
<7>[  134.752761] hangcheck 		ring->tail:   0x000011c8
<7>[  134.752764] hangcheck 		ring->emit:   0x000011d0
<7>[  134.752767] hangcheck 		ring->space:  0x00002df0
<7>[  134.752772] hangcheck 	RING_START: 0x035ac000
<7>[  134.752776] hangcheck 	RING_HEAD:  0x00000020
<7>[  134.752780] hangcheck 	RING_TAIL:  0x000011c8
<7>[  134.752786] hangcheck 	RING_CTL:   0x00003001
<7>[  134.752791] hangcheck 	RING_MODE:  0x00000000
<7>[  134.752795] hangcheck 	RING_IMR: fffffeff
<7>[  134.752802] hangcheck 	ACTHD:  0x00000000_00040000
<7>[  134.752809] hangcheck 	BBADDR: 0x00000000_00040001
<7>[  134.752816] hangcheck 	DMA_FADDR: 0x00000000_00040200
<7>[  134.752821] hangcheck 	IPEIR: 0x00000000
<7>[  134.752825] hangcheck 	IPEHR: 0x18800101
<7>[  134.752830] hangcheck 	Execlist status: 0x00044052 0000057f
<7>[  134.752835] hangcheck 	Execlist CSB read 1 [1 cached], write 1 [1 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  134.752840] hangcheck 		ELSP[0] count=1, rq: c740 [4e17:39] prio=0 @ 4033ms: gem_exec_await[2291]/4
<7>[  134.752843] hangcheck 		ELSP[1] idle
<7>[  134.752846] hangcheck 		HW active? 0x5
<7>[  134.752850] hangcheck 		E c708 [4e17:1] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752853] hangcheck 		E c709 [4e17:2] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752857] hangcheck 		E c70a [4e17:3] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752861] hangcheck 		E c70b [4e17:4] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752865] hangcheck 		E c70c [4e17:5] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752868] hangcheck 		E c70d [4e17:6] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752872] hangcheck 		E c70e [4e17:7] prio=0 @ 4036ms: gem_exec_await[2291]/4
<7>[  134.752880] hangcheck 		...skipping 49 executing requests...
<7>[  134.752884] hangcheck 		E c740 [4e17:39] prio=0 @ 4033ms: gem_exec_await[2291]/4
<7>[  134.752887] hangcheck 		Queue priority: -2147483648
<7>[  134.752890] hangcheck IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  134.752893] hangcheck HWSP:
<7>[  134.752897] hangcheck 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  134.752900] hangcheck *
<7>[  134.752904] hangcheck 00000040 00008002 0000057f 00008002 0000057f 00008002 0000057f 00008002 0000057f
<7>[  134.752909] hangcheck 00000060 00008002 0000057f 00008002 0000057f 00000000 00000000 00000000 00000000
<7>[  134.752913] hangcheck 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  134.752917] hangcheck 000000a0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001
<7>[  134.752922] hangcheck 000000c0 0000c707 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  134.752926] hangcheck 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  134.752929] hangcheck *
<7>[  134.752932] hangcheck Idle? no

which isn't out of ring space... Oh, maybe it is just the premature hangcheck, but how? Hmm.

Comment 8 Chris Wilson 2018-05-03 14:39:26 UTC

Reminder to self: dump the kernel stack from the GPU hang signal handler.

Comment 9 Martin Peres 2018-05-03 15:59:02 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_30/fi-glk-j4005/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2505) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_exec_await:2505) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

Comment 10 Martin Peres 2018-05-03 16:23:53 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_30/fi-skl-guc/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2704) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:481:
(gem_exec_await:2704) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.

Comment 11 Chris Wilson 2018-05-11 18:34:57 UTC

Waiting for a debug dump after

commit 91b5a3ef5516b29584ea4567b0f5ffa18219b29f (upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 3 21:53:41 2018 +0100

    lib: Show the kernel stack when reporting a GPU hang
    
    In a few tests (like gem_exec_await, gem_exec_schedule) we use the GPU
    hang to break a deadlock hit during test setup. In these case we would
    like to see where in kernel the process is stuck (if at all).
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105900
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Antonio Argenziano <antonio.argenziano@intel.com>

Comment 12 Martin Peres 2018-05-28 14:24:51 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4477/fi-byt-n2820/igt@gem_exec_fence@basic-await-default.html

(gem_exec_fence:1381) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_fence:1381) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-await-default failed.

Comment 13 Martin Peres 2018-06-07 08:43:50 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4286/fi-hsw-peppy/igt@gem_exec_suspend@basic-s4-devices.html

(gem_exec_suspend:2755) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_suspend:2755) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-S4-devices failed.

Comment 14 Chris Wilson 2018-06-13 12:13:06 UTC

Too many conflating *valid* fails end up being claimed by this bug. Please refine.

Comment 15 Jani Saarinen 2018-06-19 14:25:49 UTC

Closing, thanks.

Comment 16 Martin Peres 2018-09-07 17:31:42 UTC

(In reply to Martin Peres from comment #13)
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4286/fi-hsw-peppy/
> igt@gem_exec_suspend@basic-s4-devices.html
> 
> (gem_exec_suspend:2755) igt_aux-CRITICAL: Test assertion failure function
> sig_abort, file ../lib/igt_aux.c:500:
> (gem_exec_suspend:2755) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
> Subtest basic-S4-devices failed.

Moved the igt@gem_exec_await@wide-contexts failures to https://bugs.freedesktop.org/show_bug.cgi?id=106680

Moved the BYT failures to https://bugs.freedesktop.org/show_bug.cgi?id=107861

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.