Bug 102848

Summary:

[CI] igt@gem_exec_schedule@* - fail - !"GPU hung"

Product:

DRI

Reporter:

Marta Löfstedt <marta.lofstedt>

Component:

DRM/Intel

Assignee:

Kimmo Nikkanen <knikkane>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

high

CC:

intel-gfx-bugs

Version:

DRI git

Hardware:

Other

OS:

All

Whiteboard:

ReadyForDev

i915 platform:

BXT, GLK, KBL

i915 features:

Attachments:

Description	Flags
kernl_log_sig_abort	none
error_state	none

Description Marta Löfstedt 2017-09-19 06:15:37 UTC

On CI_DRM_3099

(drv_module_reload:1435) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:444:
(drv_module_reload:1435) igt-aux-CRITICAL: Failed assertion: !"GPU hung"

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3099/shard-snb2/igt@drv_module_reload@basic-reload.html

Comment 1 Marta Löfstedt 2017-10-10 08:02:42 UTC

Also, on APL-shards CI_DRM_3200
igt@gem_exec_schedule@reorder-wide-bsd

(gem_exec_schedule:1482) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:446:
(gem_exec_schedule:1482) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest reorder-wide-bsd failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3200/shard-apl4/igt@gem_exec_schedule@reorder-wide-bsd.html

Comment 2 Marta Löfstedt 2017-10-10 14:16:10 UTC

Also,

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3202/shard-kbl5/igt@gem_exec_schedule@reorder-wide-vebox.html

Comment 3 Marta Löfstedt 2017-10-12 13:26:50 UTC

Something weird happened here:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3210/shard-apl1/igt@gem_exec_schedule@deep-vebox.html

The igt@gem_exec_schedule@deep-vebox subtest has been skipped on all runs for APL-shards except this one.

(gem_exec_schedule:1476) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:446:
(gem_exec_schedule:1476) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-vebox failed.

So, is there a context leak from previous tests?
Here are the previous tests in this shard:
pass: igt/gem_exec_reloc/basic-gtt-wc-active
skip: igt/kms_concurrent/pipe-E
skip: igt/gem_exec_parallel/bsd1-contexts
skip: igt/kms_frontbuffer_tracking/fbcpsr-2p-primscrn-spr-indfb-fullscreen
pass: igt/kms_chv_cursor_fail/pipe-A-128x128-right-edge
skip: igt/gem_mmap_gtt/basic-write-cpu-read-gtt
skip: igt/chamelium/vga-edid-read
skip: igt/kms_flip/2x-absolute-wf_vblank
pass: igt/perf/i915-ref-count
pass: igt/kms_cursor_crc/cursor-256x85-offscreen
skip: igt/kms_psr_sink_crc/cursor_mmap_gtt
pass: igt/gem_persistent_relocs/forked-faulting-reloc-thrashing
skip: igt/kms_cursor_legacy/2x-long-cursor-vs-flip-legacy
pass: igt/gem_mmap_gtt/basic-write-read
skip: igt/kms_plane_multiple/atomic-pipe-D-tiling-yf
pass: igt/syncobj_wait/multi-wait-for-submit-submitted
pass: igt/kms_draw_crc/draw-method-xrgb2101010-mmap-wc-xtiled
skip: igt/kms_frontbuffer_tracking/psr-1p-primscrn-pri-indfb-draw-mmap-wc
pass: igt/drv_hangman/error-state-sysfs-entry
pass: igt/kms_draw_crc/draw-method-rgb565-mmap-gtt-xtiled
pass: igt/kms_legacy_colorkey

Comment 4 Chris Wilson 2017-10-12 13:39:29 UTC

The hang means that it took too long to copy i915_engines_info, longer than give or take 6s (depending on how the hangcheck aligns). The only change there is that we now use the drm_printer indirection.

Comment 5 Marta Löfstedt 2017-10-12 13:49:25 UTC

(In reply to Chris Wilson from comment #4)
> The hang means that it took too long to copy i915_engines_info, longer than
> give or take 6s (depending on how the hangcheck aligns). The only change
> there is that we now use the drm_printer indirection.

I don't like the behavior of igt, where we assert out before deciding if the test should be skipped or not. It causes noise.

Comment 6 Chris Wilson 2017-10-12 14:16:17 UTC

We don't know how long the kernel will take to do the operations, but the *kernel* is also imposing the time constraint for the entire sequence of operations. For CI the problem is compounded by lockdep making it even slower; predicting the limits is impossible, and they are subject to change. The best way around it is to disable the limitations the kernel imposes upon userspace, but those patches fell on an unreceptive audience.

Comment 7 Marta Löfstedt 2017-10-13 12:55:54 UTC

Also, on another subtest:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3228/shard-apl2/igt@gem_exec_schedule@reorder-wide-blt.html

Comment 8 Marta Löfstedt 2017-10-16 12:59:05 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3239/shard-kbl1/igt@gem_exec_schedule@deep-bsd2.html

Comment 9 Marta Löfstedt 2017-10-25 09:13:58 UTC

CI_DRM_3277 KBL-shards igt@gem_exec_schedule@preempt-other-vebox fail:
	
(gem_exec_schedule:1541) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:1541) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-other-vebox failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3277/shard-kbl3/igt@gem_exec_schedule@preempt-other-vebox.html

Comment 10 Marta Löfstedt 2017-10-27 07:41:34 UTC

new subtest on:
CI_DRM_3288 shard-kbl2 igt@gem_exec_schedule@preempt-other-bsd

(gem_exec_schedule:2576) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:2576) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-other-bsd failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3288/shard-kbl2/igt@gem_exec_schedule@preempt-other-bsd.html

Comment 11 Marta Löfstedt 2017-10-27 07:43:40 UTC

new subtest on:
CI_DRM_3288 shard-apl6 igt@gem_exec_schedule@wide-blt

(gem_exec_schedule:1842) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:1842) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-blt failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3288/shard-apl6/igt@gem_exec_schedule@wide-blt.html

Comment 12 Chris Wilson 2017-10-27 11:49:04 UTC

(In reply to Marta Löfstedt from comment #0)
> On CI_DRM_3099
> 
> (drv_module_reload:1435) igt-aux-CRITICAL: Test assertion failure function
> sig_abort, file igt_aux.c:444:
> (drv_module_reload:1435) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3099/shard-snb2/
> igt@drv_module_reload@basic-reload.html

Just noticed that the original bug is for a completely different problem than gem_exec_schedule. The original bug is for ring initialisation failure on module load, which may be the same one that's been plaguing snb since 2010.

Comment 13 Marta Löfstedt 2017-10-30 11:16:28 UTC

Also see bug 103514, which is about the BAT machine fi-glk-dsi, where I filed the wedged GPU and the issues that happen afterwards.

Comment 14 Marta Löfstedt 2017-11-02 06:34:23 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3305/shard-apl8/igt@gem_exec_schedule@preempt-self-vebox.html

(gem_exec_schedule:18610) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:18610) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-self-vebox failed.


<7>[ 2792.812533] [IGT] gem_exec_schedule: starting subtest preempt-self-vebox
<7>[ 2796.776113] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x68/0x90 [i915], irq posted? yes, current seqno=eb6, last=eba
<6>[ 2803.831581] [drm] GPU HANG: ecode 9:1:0xe77ffef2, in gem_exec_schedu [18610], reason: Hang on bcs0, action: reset
<6>[ 2803.831614] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[ 2803.831635] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[ 2803.831656] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[ 2803.831677] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[ 2803.831698] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<7>[ 2803.831922] [drm:i915_reset_device [i915]] resetting chip
<5>[ 2803.832018] i915 0000:00:02.0: Resetting chip after gpu hang
<7>[ 2803.835942] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[18610]/1 marked guilty (score 10) banned? no
<7>[ 2803.836020] [drm:i915_gem_reset_engine [i915]] resetting bcs0 to restart from tail of request 0x2c9
<6>[ 2803.836123] [drm] RC6 on
<7>[ 2803.836261] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
<7>[ 2803.836354] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 12
<7>[ 2803.836480] [drm:gen8_init_common_ring [i915]] Execlists enabled for bcs0
<7>[ 2803.836606] [drm:gen8_init_common_ring [i915]] Execlists enabled for vcs0
<7>[ 2803.836728] [drm:gen8_init_common_ring [i915]] Execlists enabled for vecs0
<7>[ 2804.017517] [IGT] gem_exec_schedule: exiting, ret=99
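For triage, the time between the start of the subtest and the reported hang can be read straight off the dmesg timestamps. The following is a minimal sketch, not part of IGT or the kernel: the helper name `seconds_until_hang` is hypothetical, and it assumes the `<level>[ seconds] message` line layout seen in the excerpt above.

```python
import re

# Matches the "<level>[ seconds.micros] message" prefix used in the dmesg excerpt.
DMESG_LINE = re.compile(r"^<\d+>\[\s*(\d+\.\d+)\]\s+(.*)$")


def seconds_until_hang(dmesg_text):
    """Return seconds from the last 'starting subtest' line to the next 'GPU HANG' line.

    Returns None if either marker is missing. (Hypothetical helper for illustration.)
    """
    start = None
    for line in dmesg_text.splitlines():
        match = DMESG_LINE.match(line.strip())
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if "starting subtest" in message:
            start = stamp
        elif "GPU HANG" in message and start is not None:
            return stamp - start
    return None


# Two lines copied from the log in this comment.
sample = """\
<7>[ 2792.812533] [IGT] gem_exec_schedule: starting subtest preempt-self-vebox
<6>[ 2803.831581] [drm] GPU HANG: ecode 9:1:0xe77ffef2, in gem_exec_schedu [18610], reason: Hang on bcs0, action: reset
"""

print(round(seconds_until_hang(sample), 3))
```

For this log the gap is roughly 11 seconds, which fits the earlier remark that the test simply ran longer than the hangcheck allows.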

Comment 15 Marta Löfstedt 2017-11-06 12:08:36 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3315/shard-kbl1/igt@gem_exec_schedule@deep-blt.html

Comment 16 Marta Löfstedt 2017-11-08 07:02:08 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3319/shard-kbl7/igt@gem_exec_schedule@preempt-self-blt.html

(gem_exec_schedule:1382) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:1382) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-self-blt failed.

Comment 17 Marta Löfstedt 2017-11-09 07:20:43 UTC

APL-shards new subtest:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3321/shard-apl4/igt@gem_exec_schedule@preempt-other-blt.html

(gem_exec_schedule:8484) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:8484) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-other-blt failed.

<7>[  842.217263] [IGT] gem_exec_schedule: executing
<4>[  842.240089] Setting dangerous option reset - tainting kernel
<7>[  842.241912] [IGT] gem_exec_schedule: starting subtest preempt-other-blt
<7>[  846.210788] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x68/0x90 [i915], irq posted? yes, current seqno=25143, last=25147
<7>[  857.224472] [drm:i915_reset_device [i915]] resetting chip
<5>[  857.224663] i915 0000:00:02.0: Resetting chip after gpu hang
<7>[  857.225994] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 10) banned? no
<7>[  857.226214] [drm:i915_gem_reset_engine [i915]] resetting rcs0 to restart from tail of request 0x25144
<7>[  857.226409] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 20) banned? no
<7>[  857.226572] [drm:i915_gem_reset_engine [i915]] resetting bcs0 to restart from tail of request 0x6a8c
<7>[  857.226754] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 30) banned? no
<7>[  857.226832] [drm:i915_gem_reset_engine [i915]] resetting vcs0 to restart from tail of request 0x2018
<7>[  857.226931] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 40) banned? yes
<7>[  857.227003] [drm:i915_gem_reset_engine [i915]] client gem_exec_schedu[8484]/2 has had 1 context banned
<7>[  857.227074] [drm:i915_gem_reset_engine [i915]] resetting vecs0 to restart from tail of request 0xa051
<6>[  857.227204] [drm] RC6 on
<7>[  857.227373] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
<7>[  857.227469] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 12
<7>[  857.227599] [drm:gen8_init_common_ring [i915]] Execlists enabled for bcs0
<7>[  857.227725] [drm:gen8_init_common_ring [i915]] Execlists enabled for vcs0
<7>[  857.227851] [drm:gen8_init_common_ring [i915]] Execlists enabled for vecs0
<7>[  857.237878] [IGT] gem_exec_schedule: exiting, ret=99
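The log above shows the same context being marked guilty once per engine reset, with its score rising until the context is banned. A small sketch for pulling that progression out of a dmesg dump follows; the function names are hypothetical, and the pattern is only assumed to match the message wording seen in this particular log.

```python
import re

# Pattern for the "marked guilty" debug lines as they appear in the log above.
GUILTY = re.compile(
    r"context (?P<ctx>\S+) marked guilty "
    r"\(score (?P<score>\d+)\) banned\? (?P<banned>yes|no)"
)


def guilty_events(dmesg_text):
    """Return a list of (context, score, banned) tuples in log order."""
    return [
        (m.group("ctx"), int(m.group("score")), m.group("banned") == "yes")
        for m in GUILTY.finditer(dmesg_text)
    ]


def first_ban(events):
    """Return (context, score) for the first banned entry, or None."""
    for ctx, score, banned in events:
        if banned:
            return ctx, score
    return None


# The four "marked guilty" lines copied from the log in this comment.
sample = """\
<7>[  857.225994] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 10) banned? no
<7>[  857.226409] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 20) banned? no
<7>[  857.226754] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 30) banned? no
<7>[  857.226931] [drm:i915_gem_reset_engine [i915]] context gem_exec_schedu[8484]/2 marked guilty (score 40) banned? yes
"""

events = guilty_events(sample)
print([score for _, score, _ in events])
print(first_ban(events))
```

The sketch only reports what the log says; it makes no claim about how the kernel chooses the ban threshold.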

Comment 18 Elizabeth 2017-11-09 22:05:21 UTC

We're hitting the same issue randomly in test igt@gem_sync@basic-store-each:

$ : sudo -E ./gem_sync --r basic-store-each
IGT-Version: 1.20-ge6c4968 (x86_64) (Linux: 4.14.0-rc8-drm-intel-qa-ww45-commit-8eba051+ x86_64)
Using GuC submission
Has kernel scheduler
 - With priority sorting
 - With preemption enabled
blt completed 14336 cycles: 354.177 us
bsd1 completed 14336 cycles: 355.300 us
bsd2 completed 14336 cycles: 358.892 us
render completed 12288 cycles: 420.673 us
(gem_sync:1602) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_sync:1602) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [sig_abort+0x3a]
  #2 [killpg+0x40]
  #3 [__wait+0x1e]
  #4 [igt_waitchildren+0x68]
  #5 [igt_waitchildren_timeout+0xe]
  #6 [store_ring+0x291]
  #7 [__real_main785+0x5ae]
  #8 [main+0x23]
  #9 [__libc_start_main+0xf1]
  #10 [_start+0x29]
  #11 [<unknown>+0x29]
Subtest basic-store-each failed.
**** DEBUG ****
(gem_sync:1602) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_sync:1602) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_sync:1602) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_sync:1602) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_sync:1602) igt-core-INFO: Stack trace:
(gem_sync:1602) igt-core-INFO:   #0 [__igt_fail_assert+0x101]
(gem_sync:1602) igt-core-INFO:   #1 [sig_abort+0x3a]
(gem_sync:1602) igt-core-INFO:   #2 [killpg+0x40]
(gem_sync:1602) igt-core-INFO:   #3 [__wait+0x1e]
(gem_sync:1602) igt-core-INFO:   #4 [igt_waitchildren+0x68]
(gem_sync:1602) igt-core-INFO:   #5 [igt_waitchildren_timeout+0xe]
(gem_sync:1602) igt-core-INFO:   #6 [store_ring+0x291]
(gem_sync:1602) igt-core-INFO:   #7 [__real_main785+0x5ae]
(gem_sync:1602) igt-core-INFO:   #8 [main+0x23]
(gem_sync:1602) igt-core-INFO:   #9 [__libc_start_main+0xf1]
(gem_sync:1602) igt-core-INFO:   #10 [_start+0x29]
(gem_sync:1602) igt-core-INFO:   #11 [<unknown>+0x29]
****  END  ****
Subtest basic-store-each: FAIL (9.969s)

Comment 19 Elizabeth 2017-11-09 22:10:21 UTC

Created attachment 135362 [details]
kernl_log_sig_abort

In this case it failed in 1 of 5 runs.

Comment 20 Elizabeth 2017-11-09 22:11:46 UTC

Created attachment 135363 [details]
error_state

Comment 21 Marta Löfstedt 2017-11-10 08:18:01 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3326/shard-apl4/igt@gem_exec_schedule@wide-vebox.html

(gem_exec_schedule:7972) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:7972) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-vebox failed.

Comment 22 Marta Löfstedt 2017-11-13 14:47:30 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3339/shard-apl5/igt@gem_exec_whisper@normal.html

Comment 23 Marta Löfstedt 2017-11-17 07:51:26 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3354/shard-kbl7/igt@gem_exec_schedule@preempt-self-render.html

(gem_exec_schedule:1592) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:1592) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-self-render failed.

no *ERROR* in dmesg. However,
<7>[   74.590165] [IGT] gem_exec_schedule: starting subtest preempt-self-render
<3>[   74.697875] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
...
<7>[   78.729290] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5f/0x80 [i915], irq posted? yes, current seqno=1e4b8, last=1e4bc
<3>[   80.710935] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:

Comment 24 Marta Löfstedt 2017-11-17 07:52:56 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3355/shard-kbl5/igt@gem_exec_schedule@wide-bsd1.html

Comment 25 Marta Löfstedt 2017-11-21 07:48:02 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3364/shard-kbl3/igt@gem_exec_schedule@preempt-other-render.html

(gem_exec_schedule:2134) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:2134) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-other-render failed.

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_3994/shard-kbl7/igt@gem_exec_schedule@preempt-self-bsd.html

(gem_exec_schedule:1950) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:1950) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-self-bsd failed.

Comment 26 Marta Löfstedt 2017-11-27 09:50:26 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3390/shard-kbl6/igt@gem_exec_schedule@deep-bsd.html

(gem_exec_schedule:1734) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:484:
(gem_exec_schedule:1734) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest deep-bsd failed.

Comment 27 Marta Löfstedt 2017-11-30 13:30:40 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3416/shard-kbl3/igt@gem_exec_schedule@wide-bsd2.html

(gem_exec_schedule:1585) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:1585) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-bsd2 failed.

Comment 28 Elizabeth 2017-12-01 16:56:11 UTC

We have it on today's KBL results with test igt@gem_exec_suspend@basic-s4-devices:

(gem_exec_suspend:9573) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_suspend:9573) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-S4-devices failed.

IGT-Version: 1.20-g476c4b4 (x86_64) (Linux: 4.15.0-rc1-drm-intel-qa-ww48-commit-807db75+ x86_64)
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [sig_abort+0x3a]
  #2 [killpg+0x40]
  #3 [__write_nocancel+0x7]
  #4 [igt_sysfs_write+0x43]
  #5 [igt_sysfs_set+0x2e]
  #6 [igt_system_suspend_autoresume+0x420]
  #7 [run_test+0x486]
  #8 [__real_main243+0x13c]
  #9 [main+0x23]
  #10 [__libc_start_main+0xf1]
  #11 [_start+0x29]
  #12 [<unknown>+0x29]
Subtest basic-S4-devices: FAIL (16.261s)

Comment 29 Chris Wilson 2017-12-01 17:00:21 UTC

(In reply to Elizabeth from comment #28)
> We have it on today's KBL results with test
> igt@gem_exec_suspend@basic-s4-devices:
> 
> (gem_exec_suspend:9573) igt-aux-CRITICAL: Test assertion failure function
> sig_abort, file igt_aux.c:482:
> (gem_exec_suspend:9573) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
> Subtest basic-S4-devices failed.
> 
> IGT-Version: 1.20-g476c4b4 (x86_64) (Linux:
> 4.15.0-rc1-drm-intel-qa-ww48-commit-807db75+ x86_64)
> Stack trace:
>   #0 [__igt_fail_assert+0x101]
>   #1 [sig_abort+0x3a]
>   #2 [killpg+0x40]
>   #3 [__write_nocancel+0x7]
>   #4 [igt_sysfs_write+0x43]
>   #5 [igt_sysfs_set+0x2e]
>   #6 [igt_system_suspend_autoresume+0x420]
>   #7 [run_test+0x486]
>   #8 [__real_main243+0x13c]
>   #9 [main+0x23]
>   #10 [__libc_start_main+0xf1]
>   #11 [_start+0x29]
>   #12 [<unknown>+0x29]
> Subtest basic-S4-devices: FAIL (16.261s)

which is nothing to do with this bug.

This bug is either the sporadic SNB hang on module load, or that the timeout for gem_exec_schedule results in a reported hung GPU. (Strange dup.)

Comment 30 Elizabeth 2017-12-04 22:54:40 UTC

(In reply to Chris Wilson from comment #29)
> (In reply to Elizabeth from comment #28)
> > ...
> which is nothing to do with this bug.
> 
> This bug is either the sporadic SNB hang on module load, or that the timeout
> for gem_exec_schedule results in a reported hung GPU. (Strange dup.)
Understood. Should this then go to bug 104020? It seems to be the same behavior as with the gem_exec_suspend@basic-s* tests.

Comment 31 Marta Löfstedt 2017-12-05 08:04:08 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3450/shard-kbl3/igt@gem_exec_schedule@preempt-other-bsd2.html

(gem_exec_schedule:4029) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:4029) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-other-bsd2 failed.

Comment 32 Marta Löfstedt 2017-12-18 08:24:45 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4068/shard-kbl4/igt@gem_exec_schedule@preempt-other-bsd1.html

(gem_exec_schedule:1619) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:1619) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-other-bsd1 failed.

Comment 33 Marta Löfstedt 2018-01-16 07:10:11 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3633/shard-apl8/igt@gem_exec_schedule@wide-bsd.html

(gem_exec_schedule:1720) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:1720) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-bsd failed.

There is hangcheck data in dmesg:
<7>[  244.339804] [IGT] gem_exec_schedule: starting subtest wide-bsd
...
<7>[  253.900331] hangcheck Idle? no
<6>[  253.919540] [drm] GPU HANG: ecode 9:2:0x3d7fffff, in gem_exec_schedu [1720], reason: Hang on vcs0, action: reset

Comment 34 Marta Löfstedt 2018-01-19 09:42:40 UTC

I cleaned this bug up a bit from the cibuglog perspective:

The GPU hung failure has not been reproduced on:
igt@gem_exec_schedule@preempt-self-bsd
igt@drv_module_reload@basic-reload
igt@gem_exec_schedule@preempt-self-bsd1

for a very long time, and there was another issue happening on the last hit. So I removed them from the impact in cibuglog.

Comment 35 Marta Löfstedt 2018-01-26 08:57:31 UTC

These appear to be the "real" occurrences of this issue from the past 50 runs up until CI_DRM_3684:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3658/shard-glkb4/igt@gem_exec_schedule@reorder-wide-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3656/shard-apl6/igt@gem_exec_schedule@reorder-wide-vebox.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3658/shard-glkb4/igt@gem_exec_schedule@reorder-wide-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4050/shard-apl5/igt@gem_exec_schedule@deep-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3673/shard-kbl5/igt@gem_exec_schedule@deep-bsd.html
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4174/shard-kbl1/igt@gem_exec_schedule@deep-bsd2.html

So, this is still very much an issue on GLK-, APL- and KBL-shards.
On SNB-shards it has not been seen since:
CI_DRM_3393: 2017-11-27 / 404 runs ago. So I will remove that from the cibuglog impact.
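A per-platform tally like the one above can be derived from the result URLs themselves. The sketch below is an illustration only and is not how cibuglog works; the helper name `hits_per_shard_family` is hypothetical, and the URL layout is assumed to be the one used by the links in this thread. Repeated links are counted once.

```python
import re
from collections import Counter

# Assumed layout: .../drm-tip/<build>/shard-<family><index>/igt@<test>@<subtest>.html
CI_URL = re.compile(
    r"/drm-tip/(?P<build>[^/]+)/shard-(?P<family>[a-z]+)(?P<index>\d+)/"
    r"igt@(?P<test>[^@/]+)@(?P<subtest>[^@/]+)\.html$"
)


def hits_per_shard_family(urls):
    """Count unique (build, shard, test, subtest) hits per shard family."""
    unique = set()
    for url in urls:
        m = CI_URL.search(url.strip())
        if m:
            unique.add(
                (m.group("build"), m.group("family"), m.group("index"),
                 m.group("test"), m.group("subtest"))
            )
    return Counter(family for _, family, _, _, _ in unique)


# The six links listed in this comment (one of them appears twice).
base = "https://intel-gfx-ci.01.org/tree/drm-tip/"
urls = [
    base + "CI_DRM_3658/shard-glkb4/igt@gem_exec_schedule@reorder-wide-bsd.html",
    base + "CI_DRM_3656/shard-apl6/igt@gem_exec_schedule@reorder-wide-vebox.html",
    base + "CI_DRM_3658/shard-glkb4/igt@gem_exec_schedule@reorder-wide-bsd.html",
    base + "IGT_4050/shard-apl5/igt@gem_exec_schedule@deep-bsd.html",
    base + "CI_DRM_3673/shard-kbl5/igt@gem_exec_schedule@deep-bsd.html",
    base + "IGT_4174/shard-kbl1/igt@gem_exec_schedule@deep-bsd2.html",
]

print(sorted(hits_per_shard_family(urls).items()))
```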

Comment 36 Chris Wilson 2018-02-03 11:14:26 UTC

I notice that gem_exec_whisper has been lumped into this one from cibuglog; if that hangs that is a different issue (not the spurious slowdown that is affecting gem_exec_schedule from time to time) -- please could you track it separately.

Comment 37 Marta Löfstedt 2018-02-05 07:44:53 UTC

(In reply to Chris Wilson from comment #36)
> I notice that gem_exec_whisper has been lumped into this one from cibuglog;
> if that hangs that is a different issue (not the spurious slowdown that is
> affecting gem_exec_schedule from time to time) -- please could you track it
> separately.

OK done.

I.e. please disregard Comment 22

Comment 38 Marta Löfstedt 2018-02-06 07:37:59 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3726/shard-glkb3/igt@gem_exec_schedule@preempt-self-bsd.html

(gem_exec_schedule:1494) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_schedule:1494) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest preempt-self-bsd failed.

Comment 39 Marta Löfstedt 2018-02-12 08:23:28 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4240/shard-apl6/igt@gem_exec_whisper@normal.html

(gem_exec_whisper:2800) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:482:
(gem_exec_whisper:2800) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest normal failed.

Comment 40 Chris Wilson 2018-02-16 11:22:57 UTC

My understanding is that the original bug is resolved with

commit 6db24416fdcdf5571125f9005089241cc6ba2652
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 3 18:09:09 2018 +0000

    lib/gem: Reset the global seqno at the start of each test
    
    When we require GEM, reset the global seqno. This gives each test a
    clean slate to work with, and avoids left-over state from previous tests
    impacting on the next. In particular, somes tests may be setting up long
    sequence of stalling batches not expecting to hit a seqno wraparound
    (leftover from, for example, gem_exec_whisper), causing long GPU hangs
    and incompletes in CI if they do.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>

However, we may have a few false dupes left which need their own tracking.
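The idea in that commit message, giving each test a clean slate by resetting the global seqno before it starts, can be sketched as below. This is not the code from the commit (IGT is written in C). It assumes the kernel exposes a writable debugfs control file for the next seqno; the file name `i915_next_seqno` and the helper name `reset_next_seqno` are assumptions that do not come from this thread.

```python
import os
import tempfile


def reset_next_seqno(debugfs_dir, value=1, filename="i915_next_seqno"):
    """Write `value` to an existing seqno control file; return True on success.

    `filename` is an assumed debugfs entry name, not confirmed by this thread.
    """
    path = os.path.join(debugfs_dir, filename)
    try:
        # No O_CREAT: only write if the control file already exists.
        fd = os.open(path, os.O_WRONLY)
    except OSError:
        return False
    try:
        os.write(fd, f"{value}\n".encode())
    except OSError:
        return False
    finally:
        os.close(fd)
    return True


# Demonstration against a stand-in directory instead of real debugfs.
with tempfile.TemporaryDirectory() as fake_debugfs:
    control = os.path.join(fake_debugfs, "i915_next_seqno")
    open(control, "w").close()  # pretend the kernel created the entry
    print(reset_next_seqno(fake_debugfs))
    with open(control) as handle:
        print(repr(handle.read()))
```

On a real machine the directory would be something like the `/sys/kernel/debug/dri/0` path that appears in the debug output earlier in this thread, and the write would need root.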

Comment 41 Marta Löfstedt 2018-02-16 11:31:24 UTC

(In reply to Chris Wilson from comment #40)
> My understanding is that the original bug is resolved with
> 
> commit 6db24416fdcdf5571125f9005089241cc6ba2652
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Jan 3 18:09:09 2018 +0000
> 
>     lib/gem: Reset the global seqno at the start of each test
>     
>     When we require GEM, reset the global seqno. This gives each test a
>     clean slate to work with, and avoids left-over state from previous tests
>     impacting on the next. In particular, somes tests may be setting up long
>     sequence of stalling batches not expecting to hit a seqno wraparound
>     (leftover from, for example, gem_exec_whisper), causing long GPU hangs
>     and incompletes in CI if they do.
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> 
> However, we may have a few false dupes left which need their own tracking.

OK Chris I will close and archive the bug and we'll see what pops up.
