Bug 104262 - [CI] igt@drv_selftest@live_hangcheck - incomplete - Softdog | fail Returncode -11
Summary: [CI] igt@drv_selftest@live_hangcheck - incomplete - Softdog | fail Returncode...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-14 12:14 UTC by Marta Löfstedt
Modified: 2018-02-07 07:48 UTC (History)
1 user (show)

See Also:
i915 platform: KBL
i915 features:


Attachments

Description Marta Löfstedt 2017-12-14 12:14:41 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb4/igt@drv_selftest@live_hangcheck.html

pstore indicate softdog triggered, before that just ftrace in pstores.

However no OWATCH in run.log

but at least last test in run.log match results.

last dmesgs:
<6>[  289.098135] [drm] Initialized i915 1.6.0 20171201 for 0000:00:02.0 on minor 0
<6>[  289.099969] [drm] DRM_I915_DEBUG enabled
<6>[  289.099973] [drm] DRM_I915_DEBUG_GEM enabled
<6>[  289.099979] i915: Performing live selftests with st_random_seed=0xd859da5e st_timeout=1000
<6>[  290.126255] Submitted 50 contexts (across 4 engines), filling 200 dwords
<4>[  290.531598] i915: probe of 0000:00:02.0 failed with error -25
<7>[  290.558222] [IGT] drv_selftest: exiting, ret=0
<7>[  290.643776] [IGT] drv_selftest: executing
Comment 1 Marta Löfstedt 2017-12-15 08:55:49 UTC
Here is anotherone with legit pstore backtracing softdog.


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3520/shard-kbl6/igt@drv_selftest@live_hangcheck.html
Comment 2 Chris Wilson 2017-12-17 00:04:24 UTC
Actually:

<0>[  183.930051]   <idle>-0       0..s1 183945742us : execlists_submission_tasklet: rcs0 cs-irq head=0 [0], tail=7 [7]
<0>[  183.930075]   <idle>-0       0..s1 183945742us : execlists_submission_tasklet: rcs0 csb[1d]: status=0x00000018:0x00000000

That tail=7 is bogus.
Comment 3 Chris Wilson 2017-12-19 12:48:43 UTC
Optimistically:

commit fcb1de54e2d9c84a2d3275d1febde00e92d45aa0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 19 09:01:10 2017 +0000

    drm/i915: Add a strong mb to resetting the has-CS-interrupt bit
    
    After a reset, the state of the CSB registers are scrubbed and not valid
    until a powercontext is reloaded. We only know when a powercontext has
    been reloaded once we see a CS-interrupt, before then we must ignore the
    CSB registers within the execlists_submission_tasklet. However, glk is
    sporadically dying with an illegal CSB pointer value (both in the HSWP
    and mmio) suggesting that it is running with the CS-interrupt bit set
    before the powercontext has been reloaded. Make sure the clearing of
    that bit is serialised on reset with the re-enabling of the tasklet.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=104262
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171219090110.11153-1-chris@chris-wilson.co.uk
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Comment 4 Chris Wilson 2017-12-19 17:13:17 UTC
That alone was not it.
Comment 5 Martin Peres 2018-01-15 17:47:19 UTC
The following looks interesting:

[ 1607.268733] BUG: unable to handle kernel paging request at fffffffefa9905fa
[ 1607.268736] IP: 0xfffffffefa9905fa
[ 1607.268738] Oops: 0010 [#2] PREEMPT SMP PTI
[ 1607.268739] Dumping ftrace buffer:
[ 1607.268739]    (ftrace buffer empty)
[ 1607.268740] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: i915]
[ 1607.268751] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G     UD W        4.15.0-rc7-CI-CI_DRM_3621+ #1
[ 1607.268752] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
[ 1607.268752] Workqueue: events efivar_update_sysfs_entries
[ 1607.268753] RIP: 0010:0xfffffffefa9905fa
[ 1607.268754] RSP: 0018:ffffc90000163b88 EFLAGS: 00010202
[ 1607.268755] RAX: 0000000000002033 RBX: ffffc90000163c10 RCX: 0000000000000440
[ 1607.268756] RDX: 0000000000000440 RSI: 0000000000000000 RDI: 0000000000000000
[ 1607.268756] RBP: fffffffefa891040 R08: 0000000000000000 R09: 00000000000000ff
[ 1607.268757] R10: 0000000076e91018 R11: ffffffff810d281e R12: fffffffefa89106c
[ 1607.268758] R13: ffffc90000163db8 R14: 0000000000000000 R15: ffffc90000163dc0
[ 1607.268758] FS:  0000000000000000(0000) GS:ffff88017fd80000(0000) knlGS:0000000000000000
[ 1607.268759] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.268760] CR2: fffffffefa9905fa CR3: 0000000005210000 CR4: 0000000000340ee0
[ 1607.268760] Call Trace:
[ 1607.268761]  ? ret_from_fork+0x24/0x30
[ 1607.268761]  ? init_object+0x6e/0x80
[ 1607.268762]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[ 1607.268762]  ? efi_call+0x58/0x90
[ 1607.268763]  ? down_interruptible+0xe/0x50
[ 1607.268764]  ? virt_efi_get_next_variable+0x10b/0x260
[ 1607.268764]  ? efivar_update_sysfs_entry+0x2d/0x80
[ 1607.268765]  ? efivar_init+0xfa/0x370
[ 1607.268765]  ? efivar_init+0xfa/0x370
[ 1607.268766]  ? efivar_release+0x10/0x10
[ 1607.268766]  ? efivar_update_sysfs_entries+0x1f/0x60
[ 1607.268767]  ? efivar_update_sysfs_entries+0x1f/0x60
[ 1607.268767]  ? process_one_work+0x215/0x640
[ 1607.268768]  ? worker_thread+0x48/0x3a0
[ 1607.268768]  ? kthread+0xfb/0x130
[ 1607.268769]  ? process_one_work+0x640/0x640
[ 1607.268769]  ? _kthread_create_on_node+0x30/0x30
[ 1607.268770]  ? ret_from_fork+0x24/0x30
[ 1607.268770] Code:  Bad RIP value.
[ 1607.268772] RIP: 0xfffffffefa9905fa RSP: ffffc90000163b88
[ 1607.268772] CR2: fffffffefa9905fa
[ 1607.268773] ---[ end trace 353cc06db5100877 ]---
[ 1607.269236] BUG: scheduling while atomic: kworker/3:0/30/0x00000003
[ 1607.269251] INFO: lockdep is turned off.
[ 1607.269261] Modules linked in:
[ 1607.269272]  i915(+)
[ 1607.269280]  vgem
[ 1607.269287]  snd_hda_codec_hdmi
[ 1607.269297]  snd_hda_codec_realtek
[ 1607.269307]  snd_hda_codec_generic
[ 1607.269317]  x86_pkg_temp_thermal
[ 1607.269327]  intel_powerclamp
[ 1607.269337]  coretemp
[ 1607.269345]  crct10dif_pclmul
[ 1607.269355]  crc32_pclmul
[ 1607.269363]  ghash_clmulni_intel
[ 1607.269373]  snd_hda_codec
[ 1607.269382]  snd_hwdep
[ 1607.269391]  snd_hda_core
[ 1607.269399]  r8169
[ 1607.269407]  snd_pcm
[ 1607.269415]  mii
[ 1607.269423]  mei_me
[ 1607.269430]  mei
[ 1607.269438]  prime_numbers
[ 1607.269446]  i2c_hid
[ 1607.269454]  pinctrl_geminilake
[ 1607.269464]  pinctrl_intel
[ 1607.269472]  [last unloaded: i915]
[ 1607.269485] Preemption disabled at:
[ 1607.269500] [<00000000c3925e4b>] virt_efi_get_next_variable+0x6a/0x260
[ 1607.269686] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G     UD W        4.15.0-rc7-CI-CI_DRM_3621+ #1
[ 1607.269687] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
[ 1607.269687] Workqueue: events efivar_update_sysfs_entries
[ 1607.269688] Call Trace:
[ 1607.269689]  dump_stack+0x5f/0x86
[ 1607.269689]  ? virt_efi_get_next_variable+0x6a/0x260
[ 1607.269690]  __schedule_bug+0x7a/0xd0
[ 1607.269690]  __schedule+0x849/0xaf0
[ 1607.269691]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1607.269691]  ? run_timer_softirq+0xc5/0x140
[ 1607.269692]  ? __down_interruptible+0x7e/0x110
[ 1607.269692]  schedule+0x37/0x90
[ 1607.269693]  schedule_timeout+0x21f/0x520
[ 1607.269693]  ? down_interruptible+0xe/0x50
[ 1607.269694]  ? __down_interruptible+0x7e/0x110
[ 1607.269694]  __down_interruptible+0x9d/0x110
[ 1607.269695]  ? down_interruptible+0x3d/0x50
[ 1607.269695]  down_interruptible+0x3d/0x50
[ 1607.269696]  efivar_entry_set_safe+0x5f/0x1b0
[ 1607.269697]  ? efi_pstore_write+0x100/0x140
[ 1607.269697]  efi_pstore_write+0x100/0x140
[ 1607.269698]  ? build_tree+0x1f3/0x460
[ 1607.269698]  pstore_dump+0x196/0x330
[ 1607.269699]  ? lock_acquire+0xaf/0x200
[ 1607.269699]  kmsg_dump+0xf5/0x1c0
[ 1607.269700]  oops_end+0x3e/0x80
[ 1607.269700]  no_context+0x165/0x430
[ 1607.269701]  __do_page_fault+0x196/0x560
[ 1607.269701]  ? down_interruptible+0xe/0x50
[ 1607.269702]  page_fault+0x2c/0x60
[ 1607.269702] RIP: 0010:0xfffffffefa9905fa
[ 1607.269703] RSP: 0018:ffffc90000163b88 EFLAGS: 00010202
[ 1607.269705] RAX: 0000000000002033 RBX: ffffc90000163c10 RCX: 0000000000000440
[ 1607.269706] RDX: 0000000000000440 RSI: 0000000000000000 RDI: 0000000000000000
[ 1607.269706] RBP: fffffffefa891040 R08: 0000000000000000 R09: 00000000000000ff
[ 1607.269707] R10: 0000000076e91018 R11: ffffffff810d281e R12: fffffffefa89106c
[ 1607.269708] R13: ffffc90000163db8 R14: 0000000000000000 R15: ffffc90000163dc0
[ 1607.269708]  ? down_interruptible+0xe/0x50
[ 1607.269709]  ? ret_from_fork+0x24/0x30
[ 1607.269709]  ? init_object+0x6e/0x80
[ 1607.269710]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[ 1607.269710]  ? efi_call+0x58/0x90
[ 1607.269711]  ? down_interruptible+0xe/0x50
[ 1607.269711]  ? virt_efi_get_next_variable+0x10b/0x260
[ 1607.269712]  ? efivar_update_sysfs_entry+0x2d/0x80
[ 1607.269712]  ? efivar_init+0xfa/0x370
[ 1607.269713]  ? efivar_init+0xfa/0x370
[ 1607.269713]  ? efivar_release+0x10/0x10
[ 1607.269714]  ? efivar_update_sysfs_entries+0x1f/0x60
[ 1607.269714]  ? efivar_update_sysfs_entries+0x1f/0x60
[ 1607.269715]  ? process_one_work+0x215/0x640
[ 1607.269715]  ? worker_thread+0x48/0x3a0
[ 1607.269716]  ? kthread+0xfb/0x130
[ 1607.269716]  ? process_one_work+0x640/0x640
[ 1607.269717]  ? _kthread_create_on_node+0x30/0x30
[ 1607.269717]  ? ret_from_fork+0x24/0x30
[ 1607.270364] WARNING: CPU: 3 PID: 30 at kernel/rcu/tree_plugin.h:331 rcu_note_context_switch+0x69/0x6c0
[ 1607.270365] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: i915]
[ 1607.270377] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G     UD W        4.15.0-rc7-CI-CI_DRM_3621+ #1
[ 1607.270378] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
[ 1607.270378] Workqueue: events efivar_update_sysfs_entries
[ 1607.270379] RIP: 0010:rcu_note_context_switch+0x69/0x6c0
[ 1607.270380] RSP: 0018:ffffc900001635b0 EFLAGS: 00010002
[ 1607.270381] RAX: 0000000000000001 RBX: ffff88017aa8a740 RCX: 0000000000000003
[ 1607.270382] RDX: 0000000000000002 RSI: ffffffff82113c67 RDI: 0000000000000000
[ 1607.270382] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1607.270383] R10: 0000000000000000 R11: ffffffff81137ee0 R12: ffff88017aa8a740
[ 1607.270384] R13: ffff88017fda1458 R14: ffffc90000163710 R15: 0000000000000000
[ 1607.270384] FS:  0000000000000000(0000) GS:ffff88017fd80000(0000) knlGS:0000000000000000
[ 1607.270385] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.270385] CR2: fffffffefa9905fa CR3: 0000000005210000 CR4: 0000000000340ee0
[ 1607.270386] Call Trace:
[ 1607.270386]  __schedule+0x98/0xaf0
[ 1607.270387]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1607.270388]  ? run_timer_softirq+0xc5/0x140
[ 1607.270388]  ? __down_interruptible+0x7e/0x110
[ 1607.270389]  schedule+0x37/0x90
[ 1607.270389]  schedule_timeout+0x21f/0x520
[ 1607.270390]  ? down_interruptible+0xe/0x50
[ 1607.270390]  ? __down_interruptible+0x7e/0x110
[ 1607.270391]  __down_interruptible+0x9d/0x110
[ 1607.270391]  ? down_interruptible+0x3d/0x50
[ 1607.270392]  down_interruptible+0x3d/0x50
[ 1607.270392]  efivar_entry_set_safe+0x5f/0x1b0
[ 1607.270393]  ? efi_pstore_write+0x100/0x140
[ 1607.270394]  efi_pstore_write+0x100/0x140
[ 1607.270394]  ? build_tree+0x1f3/0x460
[ 1607.270395]  pstore_dump+0x196/0x330
[ 1607.270395]  ? lock_acquire+0xaf/0x200
[ 1607.270396]  kmsg_dump+0xf5/0x1c0
[ 1607.270396]  oops_end+0x3e/0x80
[ 1607.270397]  no_context+0x165/0x430
[ 1607.270397]  __do_page_fault+0x196/0x560
[ 1607.270398]  ? down_interruptible+0xe/0x50
[ 1607.270398]  page_fault+0x2c/0x60
[ 1607.270399] RIP: 0010:0xfffffffefa9905fa
[ 1607.270400] RSP: 0018:ffffc90000163b88 EFLAGS: 00010202
[ 1607.270401] RAX: 0000000000002033 RBX: ffffc90000163c10 RCX: 0000000000000440
[ 1607.270402] RDX: 0000000000000440 RSI: 0000000000000000 RDI: 0000000000000000
[ 1607.270403] RBP: fffffffefa891040 R08: 0000000000000000 R09: 00000000000000ff
[ 1607.270403] R10: 0000000076e91018 R11: ffffffff810d281e R12: fffffffefa89106c
[ 1607.270404] R13: ffffc90000163db8 R14: 0000000000000000 R15: ffffc90000163dc0
[ 1607.270404]  ? down_interruptible+0xe/0x50
[ 1607.270405]  ? ret_from_fork+0x24/0x30
[ 1607.270405]  ? init_object+0x6e/0x80
[ 1607.270406]  ? ___slab_alloc.constprop.30+0x152/0x3d0
[ 1607.270406]  ? efi_call+0x58/0x90
[ 1607.270407]  ? down_interruptible+0xe/0x50
[ 1607.270408]  ? virt_efi_get_next_variable+0x10b/0x260
[ 1607.270408]  ? efivar_update_sysfs_entry+0x2d/0x80
[ 1607.270409]  ? efivar_init+0xfa/0x370
[ 1607.270409]  ? efivar_init+0xfa/0x370
[ 1607.270410]  ? efivar_release+0x10/0x10
[ 1607.270410]  ? efivar_update_sysfs_entries+0x1f/0x60
[ 1607.270411]  ? efivar_update_sysfs_entries+0x1f/0x60
[ 1607.270411]  ? process_one_work+0x215/0x640
[ 1607.270412]  ? worker_thread+0x48/0x3a0
[ 1607.270412]  ? kthread+0xfb/0x130
[ 1607.270413]  ? process_one_work+0x640/0x640
[ 1607.270413]  ? _kthread_create_on_node+0x30/0x30
[ 1607.270414]  ? ret_from_fork+0x24/0x30
[ 1607.270414] Code: 01 00 85 ff 74 0e 8b b3 ac 08 00 00 85 f6 0f 84 ed 01 00 00 40 84 ed 8b 83 b8 03 00 00 0f 85 02 01 00 00 85 c0 0f 8e 02 01 00 00 <0f> ff 80 bb bc 03 00 00 00 0f 84 20 02 00 00 e8 83 c0 ff ff e8 
[ 1607.270437] ---[ end trace 353cc06db5100878 ]---

Source: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3621/shard-glkb4/igt@drv_selftest@live_hangcheck.html
Comment 6 Marta Löfstedt 2018-01-24 10:49:51 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3673/shard-kbl1/igt@drv_selftest@live_hangcheck.html

when the test fail:
Returncode	-11

on KBL-shards we hit bug 104594 on drm_mm tests.

[  232.521092] BUG: using smp_processor_id() in preemptible [00000000] code: drm_mm/6173
[  232.521102] caller is vprintk_func+0x47/0x80
[  232.521151] CPU: 3 PID: 6173 Comm: drm_mm Tainted: G     UD W        4.15.0-rc9-CI-CI_DRM_3673+ #1
[  232.521151] Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017
[  232.521152] Call Trace:
[  232.521152]  dump_stack+0x5f/0x86
[  232.521152]  check_preemption_disabled+0xd9/0xe0
[  232.521153]  vprintk_func+0x47/0x80
[  232.521153]  ? 0xffffffffa0041000
[  232.521153]  printk+0x3e/0x46
[  232.521154]  test_drm_mm_init+0x33/0x1000 [test_drm_mm]
[  232.521154]  ? 0xffffffffa0041000
[  232.521155]  do_one_initcall+0x39/0x150
[  232.521155]  ? kmem_cache_alloc_trace+0x1e7/0x2b0
[  232.521155]  do_init_module+0x56/0x1ef
[  232.521156]  load_module+0x231c/0x2d70
[  232.521156]  ? show_coresize+0x20/0x20
[  232.521156]  ? vfs_read+0x126/0x150
[  232.521156]  ? SyS_finit_module+0xa5/0xe0
[  232.521157]  SyS_finit_module+0xa5/0xe0
[  232.521157]  entry_SYSCALL_64_fastpath+0x22/0x8f
[  232.521157] RIP: 0033:0x7ff4181b89f9
[  232.521158] RSP: 002b:00007ffc69ffc8b8 EFLAGS: 00000206
Comment 7 Marta Löfstedt 2018-01-24 10:50:31 UTC
*** Bug 104594 has been marked as a duplicate of this bug. ***
Comment 8 Chris Wilson 2018-01-25 11:23:21 UTC
With a kasan run to investigate the stack page overflow:

https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_1702/shard-kbl6/igt@drv_selftest@live_hangcheck.html

it passed. Note that one thing that kasan does is disable CONFIG_STACK_VMAP, so changing the stack allocatin/layout. Still not the positive lead I was hoping for.
Comment 9 Marta Löfstedt 2018-01-26 08:33:40 UTC
Note bug 104786 is filed on the kernel stack overflow, the piglit sometime add to the results and other time don't add to the results.
Comment 10 Marta Löfstedt 2018-01-30 08:32:22 UTC
There is still coming in more affected drm_mm subtests:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3694/shard-kbl2/igt@drm_mm@sanitycheck.html
Comment 11 Marta Löfstedt 2018-01-31 08:26:13 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3698/shard-apl6/igt@drv_selftest@live_hangcheck.html

Softdog now also on APL-shards
Comment 12 Chris Wilson 2018-02-02 20:36:54 UTC
This helped (not fixed) somewhat my bxt:

commit 274de876065a34ca1ff093bfacb62e440a4491be
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 2 14:54:55 2018 +0000

    drm/i915/execlists: Flush GTIIR on clearing CS interrupts during reset
    
    Be paranoid and flush the GTIIR after clearing the CS interrupt to be
    sure it has taken before we re-enable the interrupt handler. We still
    see early interrupts following reset, the tasklet handling the mmio read
    before it has been written by the CS. This hopefully reduces the
    frequency to 0...
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=104262
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180202145455.29876-1-chris@chris-wilson.co.uk
Comment 13 Marta Löfstedt 2018-02-05 07:29:26 UTC
This is now only affecting KBL. I will remove the impact of the other machines.
Comment 14 Chris Wilson 2018-02-05 16:18:20 UTC
My bxt stabilised since

commit 8ec21a7c4b52165211b47932c665ab2e9ca488ee
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 5 15:24:29 2018 +0000

    drm/i915/selftests: Use a sacrificial context for hang testing
    
    Avoid injecting hangs in to the i915->kernel_context in case the GPU
    reset leaves corruption in the context image in its wake (leading to
    continual failures and system hangs after the selftests are ostensibly
    complete). Use a sacrificial kernel_context instead.
    
    v2: Closing a context is tricky; export a function (for selftests) from
    i915_gem_context.c to get it right.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-2-chris@chris-wilson.co.uk

commit a8b66f2c2f3e03b70a5e72cb5034f8aff669bf34
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 5 15:24:28 2018 +0000

    drm/i915/selftests: Flush old resets between engines
    
    When injecting rapid resets, we must be careful to at least wait for the
    previous reset to have taken effect and the engine restarted. If we
    perform a second reset before that has happened, we will notice that the
    engine hasn't recovered and declare it lost, wedging the device and
    failing. In practice, since we wait for each hanging batch to start
    before injecting the reset, this too-fast-reset condition can only be
    triggered when moving onto the next engine in the test, so we need only
    wait for the existing reset to complete before switching engines.
    
    v2: Wrap up the wait inside a safety net to bail out in case of angry hw.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-1-chris@chris-wilson.co.uk

Hopefully the same is true of CI...
Comment 15 Marta Löfstedt 2018-02-07 07:48:14 UTC
Looking good on CI now as well I will close and archive, thanks Chris


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.