https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-guc/igt@i915_selftest@live_evict.html <3> [404.303049] i915_ggtt_offset:218 GEM_BUG_ON(!vma->node.allocated) <4> [404.303070] ------------[ cut here ]------------ <2> [404.303072] kernel BUG at drivers/gpu/drm/i915/i915_vma.h:218! <4> [404.303080] invalid opcode: 0000 [#1] PREEMPT SMP PTI <4> [404.303084] CPU: 7 PID: 4328 Comm: i915_selftest Tainted: G U 5.2.0-rc5-CI-CI_DRM_6283+ #1 <4> [404.303087] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018 <4> [404.303186] RIP: 0010:intel_engine_print_registers+0x7ea/0xa30 [i915] <4> [404.303190] Code: 65 47 bb e0 48 8b 35 ed 55 23 00 49 c7 c0 c0 5d 71 a0 b9 da 00 00 00 48 c7 c2 d0 e2 6b a0 48 c7 c7 a5 46 57 a0 e8 b6 0e c2 e0 <0f> 0b 48 c7 c1 08 14 6e a0 ba db 00 00 00 48 c7 c6 d0 e2 6b a0 48 <4> [404.303193] RSP: 0018:ffffc900004cb910 EFLAGS: 00010282 <4> [404.303197] RAX: 000000000000000c RBX: ffff8881e1fe0008 RCX: 0000000000000000 <4> [404.303200] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88822db3aa88 <4> [404.303202] RBP: ffffc900004cb9a8 R08: 00000000001ac57c R09: ffff88822cc40000 <4> [404.303204] R10: ffffc900004cb8a0 R11: ffff88822db3aa88 R12: 0000000000000000 <4> [404.303207] R13: ffff888178150501 R14: ffff888178150500 R15: 0000000000000002 <4> [404.303210] FS: 00007f76f7293e40(0000) GS:ffff88822eb80000(0000) knlGS:0000000000000000 <4> [404.303213] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [404.303215] CR2: 000055920df88e28 CR3: 00000001c6420001 CR4: 00000000003606e0 <4> [404.303218] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [404.303220] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 <4> [404.303222] Call Trace: <4> [404.303233] ? _raw_spin_unlock_irqrestore+0x39/0x60 <4> [404.303317] intel_engine_dump+0x40f/0x610 [i915] <4> [404.303326] ? __local_bh_enable_ip+0x74/0x100 <4> [404.303408] __i915_gem_set_wedged.part.7+0x163/0x190 [i915] <4> [404.303414] ? __drm_printfn_info+0x20/0x20 <4> [404.303493] i915_gem_set_wedged+0x61/0x70 [i915] <4> [404.303568] i915_driver_unload+0x85/0x110 [i915] <4> [404.303642] i915_pci_remove+0x19/0x30 [i915] <4> [404.303713] i915_pci_probe+0x93/0x1a0 [i915] <4> [404.303719] pci_device_probe+0x9e/0x120 <4> [404.303725] really_probe+0xea/0x3c0 <4> [404.303730] driver_probe_device+0x10b/0x120 <4> [404.303735] device_driver_attach+0x4a/0x50 <4> [404.303739] __driver_attach+0x97/0x130 <4> [404.303743] ? device_driver_attach+0x50/0x50 <4> [404.303747] bus_for_each_dev+0x74/0xc0 <4> [404.303751] bus_add_driver+0x13f/0x210 <4> [404.303755] ? 0xffffffffa0824000 <4> [404.303759] driver_register+0x56/0xe0 <4> [404.303762] ? 0xffffffffa0824000 <4> [404.303766] do_one_initcall+0x58/0x300 <4> [404.303772] ? kmem_cache_alloc_trace+0x1e8/0x290 <4> [404.303778] do_init_module+0x56/0x1f6 <4> [404.303782] load_module+0x24d1/0x2990 <4> [404.303795] ? __se_sys_finit_module+0xd3/0xf0 <4> [404.303799] __se_sys_finit_module+0xd3/0xf0 <4> [404.303807] do_syscall_64+0x55/0x1c0 <4> [404.303812] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [404.303815] RIP: 0033:0x7f76f694e839 <4> [404.303819] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48 <4> [404.303822] RSP: 002b:00007ffd4ea212a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 <4> [404.303826] RAX: ffffffffffffffda RBX: 0000555efba87da0 RCX: 00007f76f694e839 <4> [404.303828] RDX: 0000000000000000 RSI: 0000555efba811f0 RDI: 0000000000000006 <4> [404.303831] RBP: 0000555efba811f0 R08: 0000000000000004 R09: 0000555efa8f6c1b <4> [404.303833] R10: 00007ffd4ea214f0 R11: 0000000000000246 R12: 0000000000000000 <4> [404.303836] R13: 0000555efba7ebe0 R14: 0000000000000020 R15: 0000000000000044 <4> [404.303843] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp snd_hda_codec snd_hwdep mei_hdcp crct10dif_pclmul snd_hda_core crc32_pclmul e1000e snd_pcm ghash_clmulni_intel ptp pps_core mei_me mei prime_numbers [last unloaded: i915]
The CI Bug Log issue associated to this bug has been updated. ### New filters associated * All machines: igt@i915_selftest@live_evict - incomplete - GEM_BUG_ON(!vma->node.allocated) - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-guc/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-r/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-x1275/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-skl-guc/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-skl-iommu/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-whl-u/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6284/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3162/fi-cfl-guc/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-x1275/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13305/fi-kbl-x1275/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-skl-guc/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-skl-iommu/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-whl-u/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-skl-iommu/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-skl-guc/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-skl-iommu/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-whl-u/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3163/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-8809g/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-x1275/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-skl-iommu/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-whl-u/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-skl-guc/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-skl-iommu/igt@i915_selftest@live_evict.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-whl-u/igt@i915_selftest@live_evict.html * All machines: igt@runner@aborted - fail - Previous test: i915_selftest (live_evict) - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-cml-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-cml-u2/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-guc/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-r/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-x1275/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-whl-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6284/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13301/shard-kbl1/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3162/fi-cfl-guc/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-x1275/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13305/fi-kbl-x1275/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-cml-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-cml-u2/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-whl-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6284/shard-kbl4/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-cml-u2/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5059/shard-kbl3/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/shard-kbl3/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13306/shard-kbl4/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-cml-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-cml-u2/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-whl-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3163/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13307/shard-kbl3/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13309/shard-kbl4/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-cml-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-cml-u2/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-8809g/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-x1275/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-whl-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-cml-u/igt@runner@aborted.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-whl-u/igt@runner@aborted.html
I really don't know what's going on there, and would dearly love a bisect for some clue.
<4> [404.303826] RAX: ffffffffffffffda RBX: 0000555efba87da0 RCX: 00007f76f694e839 RAX: -38 => ENOSYS. Could be a coincidence, as I don't think we ever use that as an error pointer.
i915_pci_probe calls i915_gem_set_wedged() when i915_live_selftests() reports an error as follow. i915_pci_probe( i915_live_selftests(pdev); i915_pci_remove(struct pci_dev *pdev) i915_driver_unload(struct drm_device *dev) i915_gem_set_wedged(struct drm_i915_private *i915) So there are 2 questions, 1) why i915_live_selftests() fails, 2) why recovery causes kernel crash by spinlock.
If I've guessed correctly, commit 7009db1475e9e92c823b60c2a9c1e5a3222699f0 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jun 18 08:41:28 2019 +0100 drm/i915: Keep engine alive as we retire the context Though we pin the context first before taking the pm wakeref, during retire we need to unpin before dropping the pm wakeref (breaking the "natural" onion). During the unpin, we may need to attach a cleanup operation on to the engine wakeref, ergo we want to keep the engine awake until after the unpin. v2: Push the engine wakeref into the barrier so we keep the onion unwind ordering in the request itself Fixes: ce476c80b8bf ("drm/i915: Keep contexts pinned until after the next kernel context switch") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190618074153.16055-1-chris@chris-wilson.co.uk should be the right fix, and commit eca153603f2f020e15d071918e0daf1d56c17d29 (HEAD -> drm-intel-next-queued) Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jun 18 17:19:51 2019 +0100 drm/i915: Don't dereference request if it may have been retired when printing This has caught me out on countless occasions, when we retrieve a pointer from the submission/execlists backend, it does not carry a reference to the context or ring. Those are only pinned while the request is active, so if we see the request is already completed, it may be in the process of being retired and those pointers defunct. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110938 Fixes: 3a068721a973 ("drm/i915: Show ring->start for the ELSP context/request queue") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190618161951.28820-2-chris@chris-wilson.co.uk commit 1422768fa2600442bfc4be4b1fe112bf9d6bb5cf Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jun 18 17:19:50 2019 +0100 drm/i915/selftests: Flush live_evict Be sure to cleanup after live_evict by flushing any residual state off the GPU using igt_flush_test. Tvrtko mentioned that it is probably wise to stop repeating this ad hoc around the tests and implement a live test runner. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190618161951.28820-1-chris@chris-wilson.co.uk should prevent it from re-ocurring. We will see in the morning when Tomi reboots the affected machines...
Good news, bad news. The bug moved. At least this is fairly convincing that the cause is that we are idling the engine too early and so unpinning still active contexts.
(In reply to Chris Wilson from comment #6) > Good news, bad news. The bug moved. At least this is fairly convincing that > the cause is that we are idling the engine too early and so unpinning still > active contexts. Looks very much fixed, thanks!
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.