Bug 110938

Summary: [CI][BAT] igt@i915_selftest@live_evict - incomplete - GEM_BUG_ON(!vma->node.allocated)
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:

Description Martin Peres 2019-06-18 08:37:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-guc/igt@i915_selftest@live_evict.html

<3> [404.303049] i915_ggtt_offset:218 GEM_BUG_ON(!vma->node.allocated)
<4> [404.303070] ------------[ cut here ]------------
<2> [404.303072] kernel BUG at drivers/gpu/drm/i915/i915_vma.h:218!
<4> [404.303080] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [404.303084] CPU: 7 PID: 4328 Comm: i915_selftest Tainted: G     U            5.2.0-rc5-CI-CI_DRM_6283+ #1
<4> [404.303087] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4> [404.303186] RIP: 0010:intel_engine_print_registers+0x7ea/0xa30 [i915]
<4> [404.303190] Code: 65 47 bb e0 48 8b 35 ed 55 23 00 49 c7 c0 c0 5d 71 a0 b9 da 00 00 00 48 c7 c2 d0 e2 6b a0 48 c7 c7 a5 46 57 a0 e8 b6 0e c2 e0 <0f> 0b 48 c7 c1 08 14 6e a0 ba db 00 00 00 48 c7 c6 d0 e2 6b a0 48
<4> [404.303193] RSP: 0018:ffffc900004cb910 EFLAGS: 00010282
<4> [404.303197] RAX: 000000000000000c RBX: ffff8881e1fe0008 RCX: 0000000000000000
<4> [404.303200] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88822db3aa88
<4> [404.303202] RBP: ffffc900004cb9a8 R08: 00000000001ac57c R09: ffff88822cc40000
<4> [404.303204] R10: ffffc900004cb8a0 R11: ffff88822db3aa88 R12: 0000000000000000
<4> [404.303207] R13: ffff888178150501 R14: ffff888178150500 R15: 0000000000000002
<4> [404.303210] FS:  00007f76f7293e40(0000) GS:ffff88822eb80000(0000) knlGS:0000000000000000
<4> [404.303213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [404.303215] CR2: 000055920df88e28 CR3: 00000001c6420001 CR4: 00000000003606e0
<4> [404.303218] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [404.303220] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [404.303222] Call Trace:
<4> [404.303233]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [404.303317]  intel_engine_dump+0x40f/0x610 [i915]
<4> [404.303326]  ? __local_bh_enable_ip+0x74/0x100
<4> [404.303408]  __i915_gem_set_wedged.part.7+0x163/0x190 [i915]
<4> [404.303414]  ? __drm_printfn_info+0x20/0x20
<4> [404.303493]  i915_gem_set_wedged+0x61/0x70 [i915]
<4> [404.303568]  i915_driver_unload+0x85/0x110 [i915]
<4> [404.303642]  i915_pci_remove+0x19/0x30 [i915]
<4> [404.303713]  i915_pci_probe+0x93/0x1a0 [i915]
<4> [404.303719]  pci_device_probe+0x9e/0x120
<4> [404.303725]  really_probe+0xea/0x3c0
<4> [404.303730]  driver_probe_device+0x10b/0x120
<4> [404.303735]  device_driver_attach+0x4a/0x50
<4> [404.303739]  __driver_attach+0x97/0x130
<4> [404.303743]  ? device_driver_attach+0x50/0x50
<4> [404.303747]  bus_for_each_dev+0x74/0xc0
<4> [404.303751]  bus_add_driver+0x13f/0x210
<4> [404.303755]  ? 0xffffffffa0824000
<4> [404.303759]  driver_register+0x56/0xe0
<4> [404.303762]  ? 0xffffffffa0824000
<4> [404.303766]  do_one_initcall+0x58/0x300
<4> [404.303772]  ? kmem_cache_alloc_trace+0x1e8/0x290
<4> [404.303778]  do_init_module+0x56/0x1f6
<4> [404.303782]  load_module+0x24d1/0x2990
<4> [404.303795]  ? __se_sys_finit_module+0xd3/0xf0
<4> [404.303799]  __se_sys_finit_module+0xd3/0xf0
<4> [404.303807]  do_syscall_64+0x55/0x1c0
<4> [404.303812]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [404.303815] RIP: 0033:0x7f76f694e839
<4> [404.303819] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [404.303822] RSP: 002b:00007ffd4ea212a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [404.303826] RAX: ffffffffffffffda RBX: 0000555efba87da0 RCX: 00007f76f694e839
<4> [404.303828] RDX: 0000000000000000 RSI: 0000555efba811f0 RDI: 0000000000000006
<4> [404.303831] RBP: 0000555efba811f0 R08: 0000000000000004 R09: 0000555efa8f6c1b
<4> [404.303833] R10: 00007ffd4ea214f0 R11: 0000000000000246 R12: 0000000000000000
<4> [404.303836] R13: 0000555efba7ebe0 R14: 0000000000000020 R15: 0000000000000044
<4> [404.303843] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp snd_hda_codec snd_hwdep mei_hdcp crct10dif_pclmul snd_hda_core crc32_pclmul e1000e snd_pcm ghash_clmulni_intel ptp pps_core mei_me mei prime_numbers [last unloaded: i915]
Comment 1 CI Bug Log 2019-06-18 08:38:16 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* All machines: igt@i915_selftest@live_evict - incomplete - GEM_BUG_ON(!vma-&gt;node.allocated)
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-guc/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-r/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-x1275/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-skl-guc/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-skl-iommu/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-whl-u/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6284/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3162/fi-cfl-guc/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-x1275/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13305/fi-kbl-x1275/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-skl-guc/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-skl-iommu/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-whl-u/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-skl-iommu/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-skl-guc/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-skl-iommu/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-whl-u/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3163/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-8809g/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-x1275/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-skl-iommu/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-whl-u/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-skl-guc/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-skl-iommu/igt@i915_selftest@live_evict.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-whl-u/igt@i915_selftest@live_evict.html

* All machines: igt@runner@aborted - fail - Previous test: i915_selftest (live_evict)
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-cml-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-cml-u2/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-guc/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-r/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-kbl-x1275/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6283/fi-whl-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6284/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13301/shard-kbl1/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3162/fi-cfl-guc/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6286/fi-kbl-x1275/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13305/fi-kbl-x1275/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-cml-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-cml-u2/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3164/fi-whl-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6284/shard-kbl4/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-cml-u2/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6287/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5059/shard-kbl3/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/shard-kbl3/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13306/shard-kbl4/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-cml-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-cml-u2/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6285/fi-whl-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3163/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13307/shard-kbl3/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13309/shard-kbl4/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-cml-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-cml-u2/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-8809g/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-kbl-x1275/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6290/fi-whl-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-cml-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3166/fi-whl-u/igt@runner@aborted.html
Comment 2 Chris Wilson 2019-06-18 08:38:46 UTC
I really don't know what's going on there, and would dearly love a bisect for some clue.
Comment 3 Chris Wilson 2019-06-18 08:40:47 UTC
<4> [404.303826] RAX: ffffffffffffffda RBX: 0000555efba87da0 RCX: 00007f76f694e839

RAX: -38 => ENOSYS. Could be a coincidence, as I don't think we ever use that as an error pointer.
Comment 4 Caz.Yokoyama 2019-06-18 15:48:15 UTC
i915_pci_probe calls i915_gem_set_wedged() when i915_live_selftests() reports an error as follow.

i915_pci_probe(
  i915_live_selftests(pdev);
  i915_pci_remove(struct pci_dev *pdev)
    i915_driver_unload(struct drm_device *dev)
      i915_gem_set_wedged(struct drm_i915_private *i915)

So there are 2 questions, 1) why i915_live_selftests() fails, 2) why recovery causes kernel crash by spinlock.
Comment 5 Chris Wilson 2019-06-18 17:31:54 UTC
If I've guessed correctly, 

commit 7009db1475e9e92c823b60c2a9c1e5a3222699f0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jun 18 08:41:28 2019 +0100

    drm/i915: Keep engine alive as we retire the context
    
    Though we pin the context first before taking the pm wakeref, during
    retire we need to unpin before dropping the pm wakeref (breaking the
    "natural" onion). During the unpin, we may need to attach a cleanup
    operation on to the engine wakeref, ergo we want to keep the engine
    awake until after the unpin.
    
    v2: Push the engine wakeref into the barrier so we keep the onion unwind
    ordering in the request itself
    
    Fixes: ce476c80b8bf ("drm/i915: Keep contexts pinned until after the next kernel context switch")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190618074153.16055-1-chris@chris-wilson.co.uk

should be the right fix, and 

commit eca153603f2f020e15d071918e0daf1d56c17d29 (HEAD -> drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jun 18 17:19:51 2019 +0100

    drm/i915: Don't dereference request if it may have been retired when printing
    
    This has caught me out on countless occasions, when we retrieve a pointer
    from the submission/execlists backend, it does not carry a reference to
    the context or ring. Those are only pinned while the request is active,
    so if we see the request is already completed, it may be in the process
    of being retired and those pointers defunct.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110938
    Fixes: 3a068721a973 ("drm/i915: Show ring->start for the ELSP context/request queue")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190618161951.28820-2-chris@chris-wilson.co.uk

commit 1422768fa2600442bfc4be4b1fe112bf9d6bb5cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jun 18 17:19:50 2019 +0100

    drm/i915/selftests: Flush live_evict
    
    Be sure to cleanup after live_evict by flushing any residual state off
    the GPU using igt_flush_test.
    
    Tvrtko mentioned that it is probably wise to stop repeating this ad hoc
    around the tests and implement a live test runner.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190618161951.28820-1-chris@chris-wilson.co.uk

should prevent it from re-ocurring. We will see in the morning when Tomi reboots the affected machines...
Comment 6 Chris Wilson 2019-06-19 09:38:03 UTC
Good news, bad news. The bug moved. At least this is fairly convincing that the cause is that we are idling the engine too early and so unpinning still active contexts.
Comment 7 Martin Peres 2019-07-02 10:29:42 UTC
(In reply to Chris Wilson from comment #6)
> Good news, bad news. The bug moved. At least this is fairly convincing that
> the cause is that we are idling the engine too early and so unpinning still
> active contexts.

Looks very much fixed, thanks!
Comment 8 CI Bug Log 2019-07-02 10:29:51 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.