107343 – [CI][BAT] igt@drv_selftest@live_hangcheck - dmesg-fail - GEM_BUG_ON(!execlists_is_active(execlists, 0))

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-07-23 14:36 UTC by Martin Peres
Modified:	2018-08-07 07:57 UTC
CC List:	1 user

See Also:
i915 platform:	CFL
i915 features:	GEM/execlists

Description Martin Peres 2018-07-23 14:36:24 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4526/fi-cfl-8700k/igt@drv_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4515/fi-cfl-8109u/igt@drv_selftest@live_hangcheck.html

[  578.082475] kthread for other engine vcs1 failed, err=-5
[  578.082496] kthread for other engine vecs0 failed, err=-5
[  578.083706] Failed to switch back to kernel context; declaring wedged
[  578.084236] process_csb:993 GEM_BUG_ON(!execlists_is_active(execlists, 0))
[  578.084313] ------------[ cut here ]------------
[  578.084315] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:993!
[  578.084335] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  578.084341] CPU: 0 PID: 9336 Comm: drv_selftest Tainted: G     U            4.18.0-rc5-CI-CI_DRM_4515+ #1
[  578.084349] Hardware name: Intel Corporation NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0037.2018.0614.2204 06/14/2018
[  578.084411] RIP: 0010:process_csb+0x4a8/0x780 [i915]
[  578.084416] Code: 07 2e e1 e0 48 8b 35 8f ba 19 00 49 c7 c0 b0 a5 3e a0 b9 e1 03 00 00 48 c7 c2 10 31 3b a0 48 c7 c7 93 75 2e a0 e8 a8 be e7 e0 <0f> 0b 48 8b 75 d0 4c 8d a6 88 16 00 00 4c 89 e7 e8 73 7c 65 e1 48 
[  578.084470] RSP: 0000:ffffc9000021f7f8 EFLAGS: 00010082
[  578.084475] RAX: 000000000000000d RBX: ffff88020e4e8008 RCX: 0000000000000000
[  578.084482] RDX: 0000000000000000 RSI: 000000000000004c RDI: 0000000000000000
[  578.084488] RBP: ffffc9000021f860 R08: ffffffffa03ea5b0 R09: 0000000000000001
[  578.084494] R10: ffffc9000021f7e8 R11: 0000000000000000 R12: ffff8802a670504c
[  578.084500] R13: 0000000000000001 R14: ffff8802a6705048 R15: ffff8802a6705040
[  578.084507] FS:  00007f25ea237980(0000) GS:ffff8802bdc00000(0000) knlGS:0000000000000000
[  578.084515] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  578.084520] CR2: 00007fc4b68e2000 CR3: 0000000212bca002 CR4: 00000000003606f0
[  578.084526] Call Trace:
[  578.084568]  ? nop_complete_submit_request+0x80/0x80 [i915]
[  578.084609]  execlists_reset_prepare+0x4e/0x150 [i915]
[  578.084648]  i915_gem_reset_prepare_engine+0x20/0x40 [i915]
[  578.084684]  i915_gem_set_wedged+0x7b/0x1f0 [i915]
[  578.084692]  ? __drm_printfn_info+0x20/0x20
[  578.084734]  igt_flush_test+0x65/0xb0 [i915]
[  578.084775]  __igt_reset_engines+0x31b/0x7a0 [i915]
[  578.084818]  igt_reset_engines+0x30/0x50 [i915]
[  578.084861]  __i915_subtests+0x5e/0xf0 [i915]
[  578.084900]  intel_hangcheck_live_selftests+0x5b/0xa0 [i915]
[  578.084943]  __run_selftests+0x10b/0x190 [i915]
[  578.084982]  i915_live_selftests+0x2c/0x60 [i915]
[  578.085018]  i915_pci_probe+0x50/0xa0 [i915]
[  578.085025]  pci_device_probe+0xa1/0x130
[  578.085031]  driver_probe_device+0x306/0x480
[  578.085037]  __driver_attach+0xdb/0x100
[  578.085042]  ? driver_probe_device+0x480/0x480
[  578.085047]  ? driver_probe_device+0x480/0x480
[  578.085053]  bus_for_each_dev+0x74/0xc0
[  578.085058]  bus_add_driver+0x15f/0x250
[  578.085063]  ? 0xffffffffa0783000
[  578.085068]  driver_register+0x56/0xe0
[  578.085072]  ? 0xffffffffa0783000
[  578.085077]  do_one_initcall+0x58/0x370
[  578.085083]  ? do_init_module+0x1d/0x1ea
[  578.085089]  ? rcu_read_lock_sched_held+0x6f/0x80
[  578.085095]  ? kmem_cache_alloc_trace+0x282/0x2e0
[  578.085101]  do_init_module+0x56/0x1ea
[  578.085107]  load_module+0x2435/0x2b20
[  578.085116]  ? __se_sys_finit_module+0xd3/0xf0
[  578.085123]  __se_sys_finit_module+0xd3/0xf0
[  578.085130]  do_syscall_64+0x55/0x190
[  578.085136]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  578.085142] RIP: 0033:0x7f25e9b01839
[  578.085145] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48 
[  578.085199] RSP: 002b:00007ffef327b558 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  578.085207] RAX: ffffffffffffffda RBX: 000055a5fab97d40 RCX: 00007f25e9b01839
[  578.085214] RDX: 0000000000000000 RSI: 000055a5fab98a40 RDI: 0000000000000004
[  578.085220] RBP: 000055a5fab98a40 R08: 0000000000000004 R09: 0000000000000000
[  578.085226] R10: 00007ffef327b6d0 R11: 0000000000000246 R12: 0000000000000000
[  578.085233] R13: 000055a5fab91b20 R14: 0000000000000020 R15: 000000000000003d
[  578.085241] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec x86_pkg_temp_thermal coretemp btusb crct10dif_pclmul btrtl snd_hwdep crc32_pclmul btbcm snd_hda_core btintel ghash_clmulni_intel bluetooth snd_pcm e1000e ecdh_generic mei_me mei prime_numbers [last unloaded: i915]
[  578.085282] Dumping ftrace buffer:
[  578.085287]    (ftrace buffer empty)
[  578.085292] ---[ end trace 29c7ee53f006cee8 ]---

Comment 1 Chris Wilson 2018-07-23 14:46:19 UTC

The cause is the same problem as bug 106560, but even so, we must not trip over like this after declaring wedged.

Comment 2 Chris Wilson 2018-07-23 14:49:06 UTC

In this case we have declared the GPU wedged and cleared the active marker, but left the HWSP alone (because of races with the GPU, we can't touch it until we actually reset). Then, on a second wedging(!!), we trip over residual events in the CSB.

Comment 3 Chris Wilson 2018-07-25 07:10:26 UTC

Should be papered over by

commit 3970c65c2b47c450f917bc8a29c5849563a95dfe (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 23 15:53:35 2018 +0100

    drm/i915: Skip repeated calls to i915_gem_set_wedged()
    
    If we already wedged, i915_gem_set_wedged() becomes a complicated no-op.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107343
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180723145335.24579-1-chris@chris-wilson.co.uk

(Not the cause of the wedging, just the nasty side-effect here of wedging twice.)
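
For readers unfamiliar with the sequence described in comment 2, here is a minimal, self-contained sketch of why skipping a repeated wedge avoids the assertion. It is not the actual patch and does not use the real i915 structures: `toy_device`, `toy_engine`, `toy_process_csb()` and `toy_set_wedged()` are hypothetical names invented for illustration, and the `assert()` merely stands in for the GEM_BUG_ON seen in the log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; NOT the real i915 data structures. */
struct toy_engine {
    bool port_active;  /* models the execlists "active" marker */
    int pending_csb;   /* models unread CSB events sitting in the HWSP */
};

struct toy_device {
    bool wedged;              /* models the "already wedged" state */
    int reset_prepare_calls;  /* how often the reset-prepare path really ran */
    struct toy_engine engine;
};

/* Models the CSB processing step: consuming an event requires an active port. */
static void toy_process_csb(struct toy_engine *e)
{
    while (e->pending_csb > 0) {
        /* Stand-in for GEM_BUG_ON(!execlists_is_active(execlists, 0)). */
        assert(e->port_active);
        e->pending_csb--;
    }
}

/* Models declaring the device wedged, with the early return for repeats. */
static void toy_set_wedged(struct toy_device *dev)
{
    if (dev->wedged)
        return;  /* already wedged: skip the whole reset-prepare path */

    dev->reset_prepare_calls++;
    toy_process_csb(&dev->engine);    /* drain events while the port is active */
    dev->engine.port_active = false;  /* clear the active marker */
    dev->wedged = true;
    /* The event source is deliberately left alone here, so new events may
     * still show up in pending_csb after this point. */
}
```

In this toy model, if an event arrives after the first `toy_set_wedged()` call, a second call returns immediately and never reaches the assertion in `toy_process_csb()`. Without the early return, the second call would process that residual event with the active marker already cleared, which is the failure shown in the log above.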

Comment 4 Francesco Balestrieri 2018-08-04 09:18:31 UTC

Martin, OK to close?

Comment 5 Francesco Balestrieri 2018-08-07 07:54:53 UTC

Only occurred twice, and not for 2+ weeks. Closing.

Comment 6 Martin Peres 2018-08-07 07:57:35 UTC

(In reply to Francesco Balestrieri from comment #5)
> Only occurred twice, and not for 2+ weeks. Closing.

Please don't close this sort of bug super quickly. Only close the obviously fixed ones please.
