Bug 111597

Summary: [CI][RESUME] igt@* - fail - Failed assertion: !"GPU hung"
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: intel-gfx-bugs, stanislav.lisovskiy
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform: TGL
i915 features: GEM/Other

Description Martin Peres 2019-09-09 08:14:24 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@legacy-blt-queue.html

Starting subtest: legacy-blt-queue
(gem_ctx_switch:1199) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:502:
(gem_ctx_switch:1199) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
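
A minimal sketch of how the two CRITICAL lines above could be picked out of IGT output by a script. The regular expressions are derived only from the lines quoted in this report; the function name `parse_igt_failure` and the dictionary keys are hypothetical, and this is not a description of how CI Bug Log matches failures.

```python
import re

# Patterns derived from the two CRITICAL lines quoted in this report.
# Assumption: other IGT versions may format these lines differently.
LOCATION_RE = re.compile(
    r"\((?P<binary>[\w-]+):(?P<pid>\d+)\) (?P<domain>[\w-]+)-CRITICAL: "
    r"Test assertion failure function (?P<function>\w+), "
    r"file (?P<file>[^:]+):(?P<line>\d+):"
)
ASSERT_RE = re.compile(r"-CRITICAL: Failed assertion: (?P<expr>.+)$")


def parse_igt_failure(output):
    """Return a dict describing the first failed assertion, or None."""
    result = {}
    for raw in output.splitlines():
        line = raw.strip()
        m = LOCATION_RE.search(line)
        if m and "function" not in result:
            result.update(m.groupdict())
            continue
        m = ASSERT_RE.search(line)
        if m and "expr" not in result:
            result["expr"] = m.group("expr")
    return result or None


sample = (
    "(gem_ctx_switch:1199) igt_aux-CRITICAL: Test assertion failure "
    "function sig_abort, file ../lib/igt_aux.c:502:\n"
    '(gem_ctx_switch:1199) igt_aux-CRITICAL: Failed assertion: !"GPU hung"\n'
)
info = parse_igt_failure(sample)
print(info["binary"], info["function"], info["file"], info["line"], info["expr"])
```

The same patterns also fit the `gen7_render_flush` failure quoted in comment 5, since it uses the same line layout.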
Comment 1 CI Bug Log 2019-09-09 08:15:00 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* TGL: all tests - fail - Failed assertion: !"GPU hung"
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-self-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@legacy-blt-queue.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_parallel@bcs0.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_await@wide-all.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-queue-contexts-chain-render.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-other-chain-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_shared@q-out-order-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@bcs0-heavy.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_balancer@full-pulse.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_store@pages-bcs0.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-queue-chain-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@smoketest-all.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@smoketest-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-other-bsd1.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@queue-light.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_nop@basic-parallel.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-contexts-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_parallel@fds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@semaphore-user.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_await@wide-contexts.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@promotion-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-other-chain-bsd1.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-queue-contexts-chain-vebox.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_basic@readonly-all.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_shared@q-in-order-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_basic@gtt-bcs0.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-queue-chain-vebox.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_shared@q-smoketest-bsd1.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_busy@extended-semaphore-vcs1.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_parallel@vcs1-contexts.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_reuse@single.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_parallel@vcs1-fds.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-queue-contexts-bsd2.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@legacy-blt-heavy.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@prime_busy@wait-before-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@bcs0-heavy-queue.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-self-bsd2.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@independent-bsd2.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_balancer@indices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_ctx_switch@all-light.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-queue-contexts-bsd1.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_reuse@baggage.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@out-order-blt.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@preempt-self-render.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_exec_schedule@pi-ringfull-blt.html
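
To get an overview of which test binaries the filter caught, the result links above can be split into their parts. This is a rough sketch: the URL layout (`<run>/<machine>/igt@<binary>@<subtest>.html`) is inferred from the links in this report, and the helper names `split_result_url` and `count_by_binary` are hypothetical.

```python
import re
from collections import Counter

# URL layout inferred from the links in this report (an assumption):
#   .../tree/drm-tip/<run>/<machine>/igt@<binary>@<subtest>.html
URL_RE = re.compile(
    r"/tree/drm-tip/(?P<run>[^/]+)/(?P<machine>[^/]+)/"
    r"igt@(?P<binary>[^@]+)@(?P<subtest>.+)\.html$"
)


def split_result_url(url):
    """Split one CI result URL into run, machine, binary and subtest."""
    m = URL_RE.search(url.strip())
    return m.groupdict() if m else None


def count_by_binary(urls):
    """Count how many of the given result URLs belong to each test binary."""
    parsed = (split_result_url(u) for u in urls)
    return Counter(p["binary"] for p in parsed if p)


base = "https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/"
urls = [
    base + "igt@gem_exec_schedule@preempt-self-blt.html",
    base + "igt@gem_ctx_switch@legacy-blt-queue.html",
    base + "igt@gem_exec_schedule@smoketest-all.html",
]
first = split_result_url(urls[0])
counts = count_by_binary(urls)
print(first, dict(counts))
```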
Comment 2 Chris Wilson 2019-09-09 08:17:03 UTC
<7> [294.989903] hangcheck bcs0
<7> [294.989915] hangcheck 	Awake? 2
<7> [294.989920] hangcheck 	Hangcheck: 6016 ms ago
<7> [294.989924] hangcheck 	Reset count: 56 (global 39)
<7> [294.989928] hangcheck 	Requests:
<7> [294.989935] hangcheck 	MMIO base:  0x00022000
<7> [294.989947] hangcheck 	RING_START: 0x0160f000
<7> [294.989954] hangcheck 	RING_HEAD:  0x00000000
<7> [294.989962] hangcheck 	RING_TAIL:  0x00000068
<7> [294.989972] hangcheck 	RING_CTL:   0x00003001
<7> [294.989983] hangcheck 	RING_MODE:  0x00000000
<7> [294.989990] hangcheck 	RING_IMR: 00000000
<7> [294.990003] hangcheck 	ACTHD:  0x00000000_0a000140
<7> [294.990017] hangcheck 	BBADDR: 0x00000000_00000000
<7> [294.990031] hangcheck 	DMA_FADDR: 0x00000000_00000000
<7> [294.990038] hangcheck 	IPEIR: 0x00000000
<7> [294.990045] hangcheck 	IPEHR: 0x11081003

Another failed context restore. Getting closer with https://patchwork.freedesktop.org/patch/329718/?series=66415&rev=2
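
For comparing dumps like the one above across runs, the register lines can be pulled into a dictionary. This is a sketch under assumptions: the pattern is derived only from the `hangcheck` lines quoted in this report, it deliberately skips fields with mixed-case names such as `MMIO base`, and `parse_registers` is a hypothetical helper, not part of i915 or IGT.

```python
import re

# Matches register lines such as "hangcheck \tRING_HEAD:  0x00000000" or
# "hangcheck \tRING_IMR: 00000000". 64-bit values are printed as "hi_lo".
REG_RE = re.compile(
    r"hangcheck\s+(?P<name>[A-Z_]+):\s+(?:0x)?(?P<value>[0-9a-fA-F_]+)\s*$"
)


def parse_registers(dump):
    """Return {register name: integer value} for the upper-case register lines."""
    regs = {}
    for line in dump.splitlines():
        m = REG_RE.search(line)
        if m:
            regs[m.group("name")] = int(m.group("value").replace("_", ""), 16)
    return regs


dump = "\n".join([
    "<7> [294.989935] hangcheck \tMMIO base:  0x00022000",
    "<7> [294.989954] hangcheck \tRING_HEAD:  0x00000000",
    "<7> [294.989962] hangcheck \tRING_TAIL:  0x00000068",
    "<7> [294.989990] hangcheck \tRING_IMR: 00000000",
    "<7> [294.990003] hangcheck \tACTHD:  0x00000000_0a000140",
    "<7> [294.990045] hangcheck \tIPEHR: 0x11081003",
])
regs = parse_registers(dump)
# Bytes written to the ring that the head pointer has not yet passed.
outstanding = regs["RING_TAIL"] - regs["RING_HEAD"]
print(regs, outstanding)
```

In this dump RING_HEAD is 0 while RING_TAIL is 0x68, so the head pointer had not advanced past any of the 0x68 bytes written to the ring.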
Comment 3 Chris Wilson 2019-09-09 10:24:16 UTC
*** Bug 111604 has been marked as a duplicate of this bug. ***
Comment 4 Chris Wilson 2019-09-09 13:28:39 UTC
(In reply to Chris Wilson from comment #2)
> Another failed context restore. Getting closer with
> https://patchwork.freedesktop.org/patch/329718/?series=66415&rev=2

Unfortunately, that was a fluke. Normal service resumed on the next run.
Comment 5 Stanislav Lisovskiy 2019-09-11 10:31:49 UTC
While testing on TGL, I consistently get the following failure whenever I submit multiple gpgpu_fill commands:

(kms_plane_stress:3092) gpu_cmds-CRITICAL: Test assertion failure function gen7_render_flush, file ../lib/gpu_cmds.c:36:
(kms_plane_stress:3092) gpu_cmds-CRITICAL: Failed assertion: ret == 0
(kms_plane_stress:3092) gpu_cmds-CRITICAL: Last errno: 5, Input/output error
Pausing GPU thread 0 
Stack trace:
  #0 ../lib/igt_core.c:1694 __igt_fail_assert()
  #1 ../lib/gpu_cmds.c:36 gen7_render_flush()
  #2 ../lib/gpgpu_fill.c:356 gen12p1_gpgpu_fillfunc()
  #3 ../tests/kms_plane_stress.c:318 gpu_load()
  #4 /build/glibc-OTsEL5/glibc-2.27/nptl/pthread_create.c:463 start_thread()
  #5 ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 __clone()

The same test works fine on ICL and other platforms. In dmesg I see this:

[ 3108.643351] hangcheck rcs0
[ 3108.643420] hangcheck 	Awake? 2
[ 3108.643428] hangcheck 	Hangcheck: 6016 ms ago
[ 3108.643434] hangcheck 	Reset count: 0 (global 0)
[ 3108.643440] hangcheck 	Requests:
[ 3108.643628] hangcheck 		active  1a:4*  prio=2 @ 7900ms: kms_plane_stres[1347]
[ 3108.643689] hangcheck 		ring->start:  0x00008000
[ 3108.643708] hangcheck 		ring->head:   0x00000048
[ 3108.643724] hangcheck 		ring->tail:   0x00003078
[ 3108.643733] hangcheck 		ring->emit:   0x00003080
[ 3108.643738] hangcheck 		ring->space:  0x00000f88
[ 3108.643745] hangcheck 		ring->hwsp:   0xffff81c0
[ 3108.643753] hangcheck [head 0080, postfix 00c8, tail 0100, batch 0x00000000_007ea000]:
[ 3108.643820] hangcheck [0000] 7a000004 21144c1c fffff080 00000000 00000000 00000000 02800000 00000000
[ 3108.643832] hangcheck [0020] 10400002 ffff81c0 00000000 00000003 04000001 18800101 007ea000 00000000
[ 3108.643841] hangcheck [0040] 04000000 00000000 7a000004 111050a1 ffff81c0 00000000 00000004 00000000
[ 3108.643849] hangcheck [0060] 01000000 04000001 0e40c002 00000000 ffffe0c8 00000000 02800000 00000000
[ 3108.644037] hangcheck 	MMIO base:  0x00002000
[ 3108.644085] hangcheck 	RING_START: 0x00008000
[ 3108.644098] hangcheck 	RING_HEAD:  0x000000c0
[ 3108.644110] hangcheck 	RING_TAIL:  0x00003078
[ 3108.644139] hangcheck 	RING_CTL:   0x00003001
[ 3108.644158] hangcheck 	RING_MODE:  0x00000000
[ 3108.644173] hangcheck 	RING_IMR: 00000000
[ 3108.644198] hangcheck 	ACTHD:  0x00000000_007ea884
[ 3108.644223] hangcheck 	BBADDR: 0x00000000_007ea885
[ 3108.644246] hangcheck 	DMA_FADDR: 0x00000000_007eaa80
[ 3108.644257] hangcheck 	IPEIR: 0x00000000
[ 3108.644267] hangcheck 	IPEHR: 0x25014100
[ 3108.644286] hangcheck 	Execlist status: 0x00002098 00000040, entries 12
[ 3108.644295] hangcheck 	Execlist CSB read 8, write 8, tasklet queued? no (enabled)
[ 3108.644318] hangcheck 		Active[0: ring:{start:00008000, hwsp:ffff81c0, seqno:00000003}, rq:  1a:c2  prio=2 @ 7748ms: kms_plane_stres[1347]
[ 3108.644343] hangcheck 		E  1a:4*  prio=2 @ 7901ms: kms_plane_stres[1347]
[ 3108.644352] hangcheck 		E  1a:6  prio=2 @ 7900ms: kms_plane_stres[1347]
[ 3108.644360] hangcheck 		E  1a:8  prio=2 @ 7899ms: kms_plane_stres[1347]
[ 3108.644368] hangcheck 		E  1a:a  prio=2 @ 7898ms: kms_plane_stres[1347]
[ 3108.644377] hangcheck 		E  1a:c  prio=2 @ 7898ms: kms_plane_stres[1347]
[ 3108.644384] hangcheck 		E  1a:e  prio=2 @ 7897ms: kms_plane_stres[1347]
[ 3108.644392] hangcheck 		E  1a:10  prio=2 @ 7896ms: kms_plane_stres[1347]
[ 3108.644442] hangcheck 		...skipping 88 executing requests...
[ 3108.644450] hangcheck 		E  1a:c2  prio=2 @ 7748ms: kms_plane_stres[1347]
[ 3108.644457] hangcheck HWSP:
[ 3108.644470] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 3108.644475] hangcheck *
[ 3108.644486] hangcheck [0040] 00010001 00010005 00010001 00010005 00010001 00010005 00010001 00010005
[ 3108.644491] hangcheck *
[ 3108.644499] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000008
[ 3108.644508] hangcheck [00c0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 3108.644513] hangcheck *
[ 3108.644563] hangcheck Idle? no
[ 3108.644578] hangcheck Signals:
[ 3108.644676] hangcheck 	[1a:44] @ 7846ms
[ 3108.651414] i915 0000:00:02.0: GPU HANG: ecode 12:1:0xdadebeff, in kms_plane_stres [1347], hang on rcs0
[ 3108.651930] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3108.651945] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 3108.651953] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3108.651958] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 3108.651965] GPU crash dump saved to /sys/class/drm/card0/error

kms_plane_stress is not yet in IGT. However, I think there is definitely a bug here, although I have no clue what the GPU hang might indicate.
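
As a consistency check on the rcs0 dump above, the `ring->space` value can be recomputed from `ring->head`, `ring->emit` and the ring size. This is a sketch under two assumptions that are not stated in this report: that bits 20:12 of RING_CTL encode the ring length in pages minus one, and that free space is computed as `(head - emit - CACHELINE_BYTES) & (size - 1)` with a 64-byte cacheline. The function names are illustrative only.

```python
CACHELINE_BYTES = 64  # assumed cacheline size used as slack
PAGE_SIZE = 4096


def ring_size_from_ctl(ring_ctl):
    # Assumption: bits 20:12 of RING_CTL hold the ring length in pages, minus one.
    return (ring_ctl & 0x001FF000) + PAGE_SIZE


def ring_space(head, emit, size):
    # Assumed formula: bytes free between the emit and head offsets,
    # keeping one cacheline of slack. The size must be a power of two.
    assert size & (size - 1) == 0
    return (head - emit - CACHELINE_BYTES) & (size - 1)


# Values taken from the dump: RING_CTL, ring->head and ring->emit.
size = ring_size_from_ctl(0x00003001)
space = ring_space(0x00000048, 0x00003080, size)
print(hex(size), hex(space))  # prints "0x4000 0xf88"
```

Under these assumptions the result matches the `ring->space: 0x00000f88` line in the dump, so the software ring bookkeeping looks self-consistent: with 0xf88 bytes free out of 0x4000, the ring was roughly three-quarters full of queued requests when the hang was detected.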
Comment 6 Chris Wilson 2019-09-11 10:50:27 UTC
Most likely a dup of bug 111593. Once that critical and wide-reaching bug is resolved, we will have a better indication of what else is broken.

*** This bug has been marked as a duplicate of bug 111593 ***
Comment 7 Martin Peres 2019-09-17 13:10:52 UTC
Re-opening since 111513 has been fixed but the problem still persists.
Comment 9 Chris Wilson 2019-09-17 13:15:01 UTC
(In reply to CI Bug Log from comment #8)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- TGL: all tests - fail - Failed assertion: !"GPU hung" -}
> {+ TGL: all tests - fail / warn - Failed assertion: !"GPU hung" +}
> 
> New failures caught by the filter:
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5188/fi-tgl-u/igt@gem_exec_fence@nb-await-default.html
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6908/fi-tgl-u/igt@gem_exec_fence@nb-await-default.html
>   * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5184/fi-tgl-u/igt@gem_exec_fence@nb-await-default.html
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6901/fi-tgl-u/igt@gem_exec_fence@nb-await-default.html
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6902/fi-tgl-u/igt@gem_exec_fence@nb-await-default.html
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/fi-tgl-u/igt@gem_exec_fence@basic-await-default.html

That's a very particular hang. Not related to the earlier report.
Comment 10 Chris Wilson 2019-09-17 13:22:16 UTC
(In reply to Chris Wilson from comment #9)
> (In reply to CI Bug Log from comment #8)
> > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5188/fi-tgl-u/igt@gem_exec_fence@nb-await-default.html
> 
> That's a very particular hang. Not related to the earlier report.

https://patchwork.freedesktop.org/series/66703/
https://patchwork.freedesktop.org/series/66718/
Comment 11 Francesco Balestrieri 2019-09-24 05:34:24 UTC
Chris, can I mark this fixed given the above patches? I see they are reviewed/acked-by but I'm not sure if they went in.
Comment 12 Chris Wilson 2019-09-24 11:41:51 UTC
Not the original bug, but

commit c45e788d95b470e9f68fabe1f3cb44beb5dd7840
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 19 16:18:11 2019 +0100

    drm/i915/tgl: Suspend pre-parser across GTT invalidations

nevertheless.
