https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3514/shard-glkb4/igt@drv_selftest@live_hangcheck.html pstore indicate softdog triggered, before that just ftrace in pstores. However no OWATCH in run.log but at least last test in run.log match results. last dmesgs: <6>[ 289.098135] [drm] Initialized i915 1.6.0 20171201 for 0000:00:02.0 on minor 0 <6>[ 289.099969] [drm] DRM_I915_DEBUG enabled <6>[ 289.099973] [drm] DRM_I915_DEBUG_GEM enabled <6>[ 289.099979] i915: Performing live selftests with st_random_seed=0xd859da5e st_timeout=1000 <6>[ 290.126255] Submitted 50 contexts (across 4 engines), filling 200 dwords <4>[ 290.531598] i915: probe of 0000:00:02.0 failed with error -25 <7>[ 290.558222] [IGT] drv_selftest: exiting, ret=0 <7>[ 290.643776] [IGT] drv_selftest: executing
Here is anotherone with legit pstore backtracing softdog. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3520/shard-kbl6/igt@drv_selftest@live_hangcheck.html
Actually: <0>[ 183.930051] <idle>-0 0..s1 183945742us : execlists_submission_tasklet: rcs0 cs-irq head=0 [0], tail=7 [7] <0>[ 183.930075] <idle>-0 0..s1 183945742us : execlists_submission_tasklet: rcs0 csb[1d]: status=0x00000018:0x00000000 That tail=7 is bogus.
Optimistically: commit fcb1de54e2d9c84a2d3275d1febde00e92d45aa0 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Dec 19 09:01:10 2017 +0000 drm/i915: Add a strong mb to resetting the has-CS-interrupt bit After a reset, the state of the CSB registers are scrubbed and not valid until a powercontext is reloaded. We only know when a powercontext has been reloaded once we see a CS-interrupt, before then we must ignore the CSB registers within the execlists_submission_tasklet. However, glk is sporadically dying with an illegal CSB pointer value (both in the HSWP and mmio) suggesting that it is running with the CS-interrupt bit set before the powercontext has been reloaded. Make sure the clearing of that bit is serialised on reset with the re-enabling of the tasklet. References: https://bugs.freedesktop.org/show_bug.cgi?id=104262 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171219090110.11153-1-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
That alone was not it.
The following looks interesting: [ 1607.268733] BUG: unable to handle kernel paging request at fffffffefa9905fa [ 1607.268736] IP: 0xfffffffefa9905fa [ 1607.268738] Oops: 0010 [#2] PREEMPT SMP PTI [ 1607.268739] Dumping ftrace buffer: [ 1607.268739] (ftrace buffer empty) [ 1607.268740] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: i915] [ 1607.268751] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G UD W 4.15.0-rc7-CI-CI_DRM_3621+ #1 [ 1607.268752] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017 [ 1607.268752] Workqueue: events efivar_update_sysfs_entries [ 1607.268753] RIP: 0010:0xfffffffefa9905fa [ 1607.268754] RSP: 0018:ffffc90000163b88 EFLAGS: 00010202 [ 1607.268755] RAX: 0000000000002033 RBX: ffffc90000163c10 RCX: 0000000000000440 [ 1607.268756] RDX: 0000000000000440 RSI: 0000000000000000 RDI: 0000000000000000 [ 1607.268756] RBP: fffffffefa891040 R08: 0000000000000000 R09: 00000000000000ff [ 1607.268757] R10: 0000000076e91018 R11: ffffffff810d281e R12: fffffffefa89106c [ 1607.268758] R13: ffffc90000163db8 R14: 0000000000000000 R15: ffffc90000163dc0 [ 1607.268758] FS: 0000000000000000(0000) GS:ffff88017fd80000(0000) knlGS:0000000000000000 [ 1607.268759] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1607.268760] CR2: fffffffefa9905fa CR3: 0000000005210000 CR4: 0000000000340ee0 [ 1607.268760] Call Trace: [ 1607.268761] ? ret_from_fork+0x24/0x30 [ 1607.268761] ? init_object+0x6e/0x80 [ 1607.268762] ? ___slab_alloc.constprop.30+0x152/0x3d0 [ 1607.268762] ? efi_call+0x58/0x90 [ 1607.268763] ? down_interruptible+0xe/0x50 [ 1607.268764] ? virt_efi_get_next_variable+0x10b/0x260 [ 1607.268764] ? efivar_update_sysfs_entry+0x2d/0x80 [ 1607.268765] ? efivar_init+0xfa/0x370 [ 1607.268765] ? efivar_init+0xfa/0x370 [ 1607.268766] ? efivar_release+0x10/0x10 [ 1607.268766] ? efivar_update_sysfs_entries+0x1f/0x60 [ 1607.268767] ? efivar_update_sysfs_entries+0x1f/0x60 [ 1607.268767] ? process_one_work+0x215/0x640 [ 1607.268768] ? worker_thread+0x48/0x3a0 [ 1607.268768] ? kthread+0xfb/0x130 [ 1607.268769] ? process_one_work+0x640/0x640 [ 1607.268769] ? _kthread_create_on_node+0x30/0x30 [ 1607.268770] ? ret_from_fork+0x24/0x30 [ 1607.268770] Code: Bad RIP value. [ 1607.268772] RIP: 0xfffffffefa9905fa RSP: ffffc90000163b88 [ 1607.268772] CR2: fffffffefa9905fa [ 1607.268773] ---[ end trace 353cc06db5100877 ]--- [ 1607.269236] BUG: scheduling while atomic: kworker/3:0/30/0x00000003 [ 1607.269251] INFO: lockdep is turned off. [ 1607.269261] Modules linked in: [ 1607.269272] i915(+) [ 1607.269280] vgem [ 1607.269287] snd_hda_codec_hdmi [ 1607.269297] snd_hda_codec_realtek [ 1607.269307] snd_hda_codec_generic [ 1607.269317] x86_pkg_temp_thermal [ 1607.269327] intel_powerclamp [ 1607.269337] coretemp [ 1607.269345] crct10dif_pclmul [ 1607.269355] crc32_pclmul [ 1607.269363] ghash_clmulni_intel [ 1607.269373] snd_hda_codec [ 1607.269382] snd_hwdep [ 1607.269391] snd_hda_core [ 1607.269399] r8169 [ 1607.269407] snd_pcm [ 1607.269415] mii [ 1607.269423] mei_me [ 1607.269430] mei [ 1607.269438] prime_numbers [ 1607.269446] i2c_hid [ 1607.269454] pinctrl_geminilake [ 1607.269464] pinctrl_intel [ 1607.269472] [last unloaded: i915] [ 1607.269485] Preemption disabled at: [ 1607.269500] [<00000000c3925e4b>] virt_efi_get_next_variable+0x6a/0x260 [ 1607.269686] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G UD W 4.15.0-rc7-CI-CI_DRM_3621+ #1 [ 1607.269687] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017 [ 1607.269687] Workqueue: events efivar_update_sysfs_entries [ 1607.269688] Call Trace: [ 1607.269689] dump_stack+0x5f/0x86 [ 1607.269689] ? virt_efi_get_next_variable+0x6a/0x260 [ 1607.269690] __schedule_bug+0x7a/0xd0 [ 1607.269690] __schedule+0x849/0xaf0 [ 1607.269691] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 1607.269691] ? run_timer_softirq+0xc5/0x140 [ 1607.269692] ? __down_interruptible+0x7e/0x110 [ 1607.269692] schedule+0x37/0x90 [ 1607.269693] schedule_timeout+0x21f/0x520 [ 1607.269693] ? down_interruptible+0xe/0x50 [ 1607.269694] ? __down_interruptible+0x7e/0x110 [ 1607.269694] __down_interruptible+0x9d/0x110 [ 1607.269695] ? down_interruptible+0x3d/0x50 [ 1607.269695] down_interruptible+0x3d/0x50 [ 1607.269696] efivar_entry_set_safe+0x5f/0x1b0 [ 1607.269697] ? efi_pstore_write+0x100/0x140 [ 1607.269697] efi_pstore_write+0x100/0x140 [ 1607.269698] ? build_tree+0x1f3/0x460 [ 1607.269698] pstore_dump+0x196/0x330 [ 1607.269699] ? lock_acquire+0xaf/0x200 [ 1607.269699] kmsg_dump+0xf5/0x1c0 [ 1607.269700] oops_end+0x3e/0x80 [ 1607.269700] no_context+0x165/0x430 [ 1607.269701] __do_page_fault+0x196/0x560 [ 1607.269701] ? down_interruptible+0xe/0x50 [ 1607.269702] page_fault+0x2c/0x60 [ 1607.269702] RIP: 0010:0xfffffffefa9905fa [ 1607.269703] RSP: 0018:ffffc90000163b88 EFLAGS: 00010202 [ 1607.269705] RAX: 0000000000002033 RBX: ffffc90000163c10 RCX: 0000000000000440 [ 1607.269706] RDX: 0000000000000440 RSI: 0000000000000000 RDI: 0000000000000000 [ 1607.269706] RBP: fffffffefa891040 R08: 0000000000000000 R09: 00000000000000ff [ 1607.269707] R10: 0000000076e91018 R11: ffffffff810d281e R12: fffffffefa89106c [ 1607.269708] R13: ffffc90000163db8 R14: 0000000000000000 R15: ffffc90000163dc0 [ 1607.269708] ? down_interruptible+0xe/0x50 [ 1607.269709] ? ret_from_fork+0x24/0x30 [ 1607.269709] ? init_object+0x6e/0x80 [ 1607.269710] ? ___slab_alloc.constprop.30+0x152/0x3d0 [ 1607.269710] ? efi_call+0x58/0x90 [ 1607.269711] ? down_interruptible+0xe/0x50 [ 1607.269711] ? virt_efi_get_next_variable+0x10b/0x260 [ 1607.269712] ? efivar_update_sysfs_entry+0x2d/0x80 [ 1607.269712] ? efivar_init+0xfa/0x370 [ 1607.269713] ? efivar_init+0xfa/0x370 [ 1607.269713] ? efivar_release+0x10/0x10 [ 1607.269714] ? efivar_update_sysfs_entries+0x1f/0x60 [ 1607.269714] ? efivar_update_sysfs_entries+0x1f/0x60 [ 1607.269715] ? process_one_work+0x215/0x640 [ 1607.269715] ? worker_thread+0x48/0x3a0 [ 1607.269716] ? kthread+0xfb/0x130 [ 1607.269716] ? process_one_work+0x640/0x640 [ 1607.269717] ? _kthread_create_on_node+0x30/0x30 [ 1607.269717] ? ret_from_fork+0x24/0x30 [ 1607.270364] WARNING: CPU: 3 PID: 30 at kernel/rcu/tree_plugin.h:331 rcu_note_context_switch+0x69/0x6c0 [ 1607.270365] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: i915] [ 1607.270377] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G UD W 4.15.0-rc7-CI-CI_DRM_3621+ #1 [ 1607.270378] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017 [ 1607.270378] Workqueue: events efivar_update_sysfs_entries [ 1607.270379] RIP: 0010:rcu_note_context_switch+0x69/0x6c0 [ 1607.270380] RSP: 0018:ffffc900001635b0 EFLAGS: 00010002 [ 1607.270381] RAX: 0000000000000001 RBX: ffff88017aa8a740 RCX: 0000000000000003 [ 1607.270382] RDX: 0000000000000002 RSI: ffffffff82113c67 RDI: 0000000000000000 [ 1607.270382] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 1607.270383] R10: 0000000000000000 R11: ffffffff81137ee0 R12: ffff88017aa8a740 [ 1607.270384] R13: ffff88017fda1458 R14: ffffc90000163710 R15: 0000000000000000 [ 1607.270384] FS: 0000000000000000(0000) GS:ffff88017fd80000(0000) knlGS:0000000000000000 [ 1607.270385] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1607.270385] CR2: fffffffefa9905fa CR3: 0000000005210000 CR4: 0000000000340ee0 [ 1607.270386] Call Trace: [ 1607.270386] __schedule+0x98/0xaf0 [ 1607.270387] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 1607.270388] ? run_timer_softirq+0xc5/0x140 [ 1607.270388] ? __down_interruptible+0x7e/0x110 [ 1607.270389] schedule+0x37/0x90 [ 1607.270389] schedule_timeout+0x21f/0x520 [ 1607.270390] ? down_interruptible+0xe/0x50 [ 1607.270390] ? __down_interruptible+0x7e/0x110 [ 1607.270391] __down_interruptible+0x9d/0x110 [ 1607.270391] ? down_interruptible+0x3d/0x50 [ 1607.270392] down_interruptible+0x3d/0x50 [ 1607.270392] efivar_entry_set_safe+0x5f/0x1b0 [ 1607.270393] ? efi_pstore_write+0x100/0x140 [ 1607.270394] efi_pstore_write+0x100/0x140 [ 1607.270394] ? build_tree+0x1f3/0x460 [ 1607.270395] pstore_dump+0x196/0x330 [ 1607.270395] ? lock_acquire+0xaf/0x200 [ 1607.270396] kmsg_dump+0xf5/0x1c0 [ 1607.270396] oops_end+0x3e/0x80 [ 1607.270397] no_context+0x165/0x430 [ 1607.270397] __do_page_fault+0x196/0x560 [ 1607.270398] ? down_interruptible+0xe/0x50 [ 1607.270398] page_fault+0x2c/0x60 [ 1607.270399] RIP: 0010:0xfffffffefa9905fa [ 1607.270400] RSP: 0018:ffffc90000163b88 EFLAGS: 00010202 [ 1607.270401] RAX: 0000000000002033 RBX: ffffc90000163c10 RCX: 0000000000000440 [ 1607.270402] RDX: 0000000000000440 RSI: 0000000000000000 RDI: 0000000000000000 [ 1607.270403] RBP: fffffffefa891040 R08: 0000000000000000 R09: 00000000000000ff [ 1607.270403] R10: 0000000076e91018 R11: ffffffff810d281e R12: fffffffefa89106c [ 1607.270404] R13: ffffc90000163db8 R14: 0000000000000000 R15: ffffc90000163dc0 [ 1607.270404] ? down_interruptible+0xe/0x50 [ 1607.270405] ? ret_from_fork+0x24/0x30 [ 1607.270405] ? init_object+0x6e/0x80 [ 1607.270406] ? ___slab_alloc.constprop.30+0x152/0x3d0 [ 1607.270406] ? efi_call+0x58/0x90 [ 1607.270407] ? down_interruptible+0xe/0x50 [ 1607.270408] ? virt_efi_get_next_variable+0x10b/0x260 [ 1607.270408] ? efivar_update_sysfs_entry+0x2d/0x80 [ 1607.270409] ? efivar_init+0xfa/0x370 [ 1607.270409] ? efivar_init+0xfa/0x370 [ 1607.270410] ? efivar_release+0x10/0x10 [ 1607.270410] ? efivar_update_sysfs_entries+0x1f/0x60 [ 1607.270411] ? efivar_update_sysfs_entries+0x1f/0x60 [ 1607.270411] ? process_one_work+0x215/0x640 [ 1607.270412] ? worker_thread+0x48/0x3a0 [ 1607.270412] ? kthread+0xfb/0x130 [ 1607.270413] ? process_one_work+0x640/0x640 [ 1607.270413] ? _kthread_create_on_node+0x30/0x30 [ 1607.270414] ? ret_from_fork+0x24/0x30 [ 1607.270414] Code: 01 00 85 ff 74 0e 8b b3 ac 08 00 00 85 f6 0f 84 ed 01 00 00 40 84 ed 8b 83 b8 03 00 00 0f 85 02 01 00 00 85 c0 0f 8e 02 01 00 00 <0f> ff 80 bb bc 03 00 00 00 0f 84 20 02 00 00 e8 83 c0 ff ff e8 [ 1607.270437] ---[ end trace 353cc06db5100878 ]--- Source: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3621/shard-glkb4/igt@drv_selftest@live_hangcheck.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3673/shard-kbl1/igt@drv_selftest@live_hangcheck.html when the test fail: Returncode -11 on KBL-shards we hit bug 104594 on drm_mm tests. [ 232.521092] BUG: using smp_processor_id() in preemptible [00000000] code: drm_mm/6173 [ 232.521102] caller is vprintk_func+0x47/0x80 [ 232.521151] CPU: 3 PID: 6173 Comm: drm_mm Tainted: G UD W 4.15.0-rc9-CI-CI_DRM_3673+ #1 [ 232.521151] Hardware name: /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017 [ 232.521152] Call Trace: [ 232.521152] dump_stack+0x5f/0x86 [ 232.521152] check_preemption_disabled+0xd9/0xe0 [ 232.521153] vprintk_func+0x47/0x80 [ 232.521153] ? 0xffffffffa0041000 [ 232.521153] printk+0x3e/0x46 [ 232.521154] test_drm_mm_init+0x33/0x1000 [test_drm_mm] [ 232.521154] ? 0xffffffffa0041000 [ 232.521155] do_one_initcall+0x39/0x150 [ 232.521155] ? kmem_cache_alloc_trace+0x1e7/0x2b0 [ 232.521155] do_init_module+0x56/0x1ef [ 232.521156] load_module+0x231c/0x2d70 [ 232.521156] ? show_coresize+0x20/0x20 [ 232.521156] ? vfs_read+0x126/0x150 [ 232.521156] ? SyS_finit_module+0xa5/0xe0 [ 232.521157] SyS_finit_module+0xa5/0xe0 [ 232.521157] entry_SYSCALL_64_fastpath+0x22/0x8f [ 232.521157] RIP: 0033:0x7ff4181b89f9 [ 232.521158] RSP: 002b:00007ffc69ffc8b8 EFLAGS: 00000206
*** Bug 104594 has been marked as a duplicate of this bug. ***
With a kasan run to investigate the stack page overflow: https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_1702/shard-kbl6/igt@drv_selftest@live_hangcheck.html it passed. Note that one thing that kasan does is disable CONFIG_STACK_VMAP, so changing the stack allocatin/layout. Still not the positive lead I was hoping for.
Note bug 104786 is filed on the kernel stack overflow, the piglit sometime add to the results and other time don't add to the results.
There is still coming in more affected drm_mm subtests: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3694/shard-kbl2/igt@drm_mm@sanitycheck.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3698/shard-apl6/igt@drv_selftest@live_hangcheck.html Softdog now also on APL-shards
This helped (not fixed) somewhat my bxt: commit 274de876065a34ca1ff093bfacb62e440a4491be Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Feb 2 14:54:55 2018 +0000 drm/i915/execlists: Flush GTIIR on clearing CS interrupts during reset Be paranoid and flush the GTIIR after clearing the CS interrupt to be sure it has taken before we re-enable the interrupt handler. We still see early interrupts following reset, the tasklet handling the mmio read before it has been written by the CS. This hopefully reduces the frequency to 0... References: https://bugs.freedesktop.org/show_bug.cgi?id=104262 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Acked-by: Michel Thierry <michel.thierry@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180202145455.29876-1-chris@chris-wilson.co.uk
This is now only affecting KBL. I will remove the impact of the other machines.
My bxt stabilised since commit 8ec21a7c4b52165211b47932c665ab2e9ca488ee Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Feb 5 15:24:29 2018 +0000 drm/i915/selftests: Use a sacrificial context for hang testing Avoid injecting hangs in to the i915->kernel_context in case the GPU reset leaves corruption in the context image in its wake (leading to continual failures and system hangs after the selftests are ostensibly complete). Use a sacrificial kernel_context instead. v2: Closing a context is tricky; export a function (for selftests) from i915_gem_context.c to get it right. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-2-chris@chris-wilson.co.uk commit a8b66f2c2f3e03b70a5e72cb5034f8aff669bf34 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Feb 5 15:24:28 2018 +0000 drm/i915/selftests: Flush old resets between engines When injecting rapid resets, we must be careful to at least wait for the previous reset to have taken effect and the engine restarted. If we perform a second reset before that has happened, we will notice that the engine hasn't recovered and declare it lost, wedging the device and failing. In practice, since we wait for each hanging batch to start before injecting the reset, this too-fast-reset condition can only be triggered when moving onto the next engine in the test, so we need only wait for the existing reset to complete before switching engines. v2: Wrap up the wait inside a safety net to bail out in case of angry hw. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180205152431.12163-1-chris@chris-wilson.co.uk Hopefully the same is true of CI...
Looking good on CI now as well I will close and archive, thanks Chris
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.