Bug 109813 - [CI][SHARDS]igt@gem_exec_await@wide-context - incomplete - context list corruption
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-02 10:27 UTC by Lakshmi
Modified: 2019-05-15 21:11 UTC
CC List: 1 user

See Also:
i915 platform: ICL
i915 features: GEM/Other


Attachments

Description Lakshmi 2019-03-02 10:27:43 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_233/fi-icl-y/igt@gem_exec_await@wide-contexts.html

[424.328624] BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
<3> [424.328642] in_atomic(): 0, irqs_disabled(): 0, pid: 1260, name: gem_exec_await
<4> [424.328652] INFO: lockdep is turned off.
<3> [424.328662] Preemption disabled at:
<4> [424.328664] [<0000000000000000>]           (null)
<4> [424.328691] CPU: 0 PID: 1260 Comm: gem_exec_await Tainted: G     UD W         5.0.0-rc8-g45d954a25f73-drmtip_233+ #1
<4> [424.328707] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake Y LPDDR4x T4 RVP TLC, BIOS ICLSFWR1.R00.3087.A00.1902250334 02/25/2019
<4> [424.328724] Call Trace:
<4> [424.328734]  dump_stack+0x67/0x9b
<4> [424.328743]  ___might_sleep+0x167/0x250
<4> [424.328753]  wait_for_common+0x40/0x1f0
<4> [424.328765]  virt_efi_query_variable_info+0x161/0x1b0
<4> [424.328774]  efi_query_variable_store+0xb3/0x1a0
<4> [424.328786]  ? efivar_entry_set_safe+0x19c/0x220
<4> [424.328794]  ? efi_delete_dummy_variable+0x90/0x90
<4> [424.328803]  efivar_entry_set_safe+0x19c/0x220
<4> [424.328813]  ? efi_pstore_write+0x10b/0x150
<4> [424.328818]  ? efi_pstore_write+0xa0/0x150
<4> [424.328830]  efi_pstore_write+0x10b/0x150
<4> [424.328853]  pstore_dump+0x127/0x340
<4> [424.328870]  kmsg_dump+0x87/0x1c0
<4> [424.328882]  oops_end+0x3e/0x90
<4> [424.328985]  general_protection+0x1e/0x30
<4> [424.328995] RIP: 0010:__list_del_entry_valid+0x2d/0x90
<4> [424.329002] Code: 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 39 c8 74 26 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2e 48 8b 32 48 39 fe 75 3a <48> 8b 50 08 48 39 f2 75 48 b8 01 00 00 00 c3 48 89 fe 48 89 c2 48
<4> [424.329019] RSP: 0018:ffffa953406e3c98 EFLAGS: 00010246
<4> [424.329026] RAX: fffda3ee05d4f988 RBX: ffffa3ee04c5a008 RCX: dead000000000200
<4> [424.329033] RDX: ffffa3ee1de3d990 RSI: ffffa3ee04c5a008 RDI: ffffa3ee04c5a008
<4> [424.329041] RBP: ffffa953406e3cf8 R08: 0000000000000000 R09: 0000000000000001
<4> [424.329048] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3ee1bf8d3d8
<4> [424.329056] R13: ffffa3ee1bf8d910 R14: dead000000000200 R15: dead000000000100
<4> [424.329106]  context_close+0x79/0x2b0 [i915]
<4> [424.329141]  ? i915_gem_context_create_ioctl+0x160/0x160 [i915]
<4> [424.329173]  ? i915_gem_context_create_ioctl+0x160/0x160 [i915]
<4> [424.329203]  i915_gem_context_destroy_ioctl+0x1a9/0x200 [i915]
<4> [424.329212]  drm_ioctl_kernel+0x83/0xf0
<4> [424.329219]  drm_ioctl+0x2f3/0x3b0
<4> [424.329247]  ? i915_gem_context_create_ioctl+0x160/0x160 [i915]
<4> [424.329257]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4> [424.329264]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [424.329270]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4> [424.329279]  do_vfs_ioctl+0xa0/0x6e0
<4> [424.329285]  ? retint_kernel+0x2d/0x2d
<4> [424.329293]  ksys_ioctl+0x35/0x60
<4> [424.329299]  __x64_sys_ioctl+0x11/0x20
<4> [424.329304]  do_syscall_64+0x55/0x190
<4> [424.329311]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [424.329317] RIP: 0033:0x7f5b839e75d7
<4> [424.329323] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4> [424.329340] RSP: 002b:00007ffdc3c527e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4> [424.329348] RAX: ffffffffffffffda RBX: 000055b373ac1c20 RCX: 00007f5b839e75d7
<4> [424.329356] RDX: 00007ffdc3c52840 RSI: 000000004008646e RDI: 0000000000000005
<4> [424.329364] RBP: 00007ffdc3c52840 R08: 00007ffdc3ce30b0 R09: 00007ffdc3ce3080
<4> [424.329371] R10: 00000000000ce1fc R11: 0000000000000246 R12: 000000004008646e
<4> [424.329379] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
<4> [424.440129] RIP: 0010:__list_del_entry_valid+0x2d/0x90
<4> [424.440142] Code: 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 39 c8 74 26 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2e 48 8b 32 48 39 fe 75 3a <48> 8b 50 08 48 39 f2 75 48 b8 01 00 00 00 c3 48 89 fe 48 89 c2 48
<4> [424.440165] RSP: 0018:ffffa953406e3c98 EFLAGS: 00010246
<4> [424.440174] RAX: fffda3ee05d4f988 RBX: ffffa3ee04c5a008 RCX: dead000000000200
<4> [424.440184] RDX: ffffa3ee1de3d990 RSI: ffffa3ee04c5a008 RDI: ffffa3ee04c5a008
<4> [424.440194] RBP: ffffa953406e3cf8 R08: 0000000000000000 R09: 0000000000000001
<4> [424.440205] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3ee1bf8d3d8
<4> [424.440214] R13: ffffa3ee1bf8d910 R14: dead000000000200 R15: dead000000000100
<4> [424.440225] FS:  00007f5b84551980(0000) GS:ffffa3ee5bf00000(0000) knlGS:0000000000000000
<4> [424.440236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [424.440244] CR2: 00005576056d9628 CR3: 000000028de16004 CR4: 0000000000760ee0
<4> [424.440254] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [424.440264] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [424.440274] PKRU: 55555554
Comment 1 CI Bug Log 2019-03-02 10:29:18 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@gem_exec_await@wide-context - incomplete - BUG: sleeping function called from invalid context
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_233/fi-icl-y/igt@gem_exec_await@wide-contexts.html
Comment 2 CI Bug Log 2019-03-02 10:31:32 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@runner@aborted - fail - Previous test: gem_exec_await (wide-contexts)
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_233/fi-icl-y/igt@runner@aborted.html
Comment 3 Chris Wilson 2019-03-02 10:42:49 UTC
<4>[  423.304434] RIP: 0010:__list_del_entry_valid+0x2d/0x90
<4>[  423.304441] Code: 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 39 c8 74 26 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2e 48 8b 32 48 39 fe 75 3a <48> 8b 50 08 48 39 f2 75 48 b8 01 00 00 00 c3 48 89 fe 48 89 c2 48
<4>[  423.304458] RSP: 0018:ffffa953406e3c98 EFLAGS: 00010246
<4>[  423.304465] RAX: fffda3ee05d4f988 RBX: ffffa3ee04c5a008 RCX: dead000000000200
<4>[  423.304473] RDX: ffffa3ee1de3d990 RSI: ffffa3ee04c5a008 RDI: ffffa3ee04c5a008
<4>[  423.304480] RBP: ffffa953406e3cf8 R08: 0000000000000000 R09: 0000000000000001
<4>[  423.304488] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3ee1bf8d3d8
<4>[  423.304495] R13: ffffa3ee1bf8d910 R14: dead000000000200 R15: dead000000000100
<4>[  423.304503] FS:  00007f5b84551980(0000) GS:ffffa3ee5be00000(0000) knlGS:0000000000000000
<4>[  423.304512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  423.304518] CR2: 00007fd945eb1788 CR3: 000000028de16003 CR4: 0000000000760ef0
<4>[  423.304526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  423.304533] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[  423.304540] PKRU: 55555554
<4>[  423.304544] Call Trace:
<4>[  423.304588]  context_close+0x79/0x2b0 [i915]
<4>[  423.304624]  ? i915_gem_context_create_ioctl+0x160/0x160 [i915]
<4>[  423.304656]  ? i915_gem_context_create_ioctl+0x160/0x160 [i915]
<4>[  423.304687]  i915_gem_context_destroy_ioctl+0x1a9/0x200 [i915]
<4>[  423.304696]  drm_ioctl_kernel+0x83/0xf0
<4>[  423.304703]  drm_ioctl+0x2f3/0x3b0
<4>[  423.304730]  ? i915_gem_context_create_ioctl+0x160/0x160 [i915]
<4>[  423.304741]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4>[  423.304749]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4>[  423.304755]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4>[  423.304764]  do_vfs_ioctl+0xa0/0x6e0
<4>[  423.304771]  ? retint_kernel+0x2d/0x2d
<4>[  423.304778]  ksys_ioctl+0x35/0x60
<4>[  423.304785]  __x64_sys_ioctl+0x11/0x20
<4>[  423.304790]  do_syscall_64+0x55/0x190
<4>[  423.304796]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  423.304803] RIP: 0033:0x7f5b839e75d7
<4>[  423.304809] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4>[  423.304826] RSP: 002b:00007ffdc3c527e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[  423.304835] RAX: ffffffffffffffda RBX: 000055b373ac1c20 RCX: 00007f5b839e75d7
<4>[  423.304844] RDX: 00007ffdc3c52840 RSI: 000000004008646e RDI: 0000000000000005
<4>[  423.304855] RBP: 00007ffdc3c52840 R08: 00007ffdc3ce30b0 R09: 00007ffdc3ce3080
<4>[  423.304865] R10: 00000000000ce1fc R11: 0000000000000246 R12: 000000004008646e
<4>[  423.304875] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
<4>[  423.304886] Modules linked in: vgem snd_hda_codec_hdmi mei_hdcp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm x86_pkg_temp_thermal coretemp ax88179_178a usbnet mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel e1000e ptp pps_core i915 i2c_i801 mei_me mei prime_numbers


Very surprising.
Comment 4 Chris Wilson 2019-03-02 10:44:42 UTC
That would be release_hw_id(), and so ctx->hw_id_link, which should be guarded by i915->contexts.mutex.
Comment 5 Chris Wilson 2019-03-02 10:52:01 UTC
We always use list_del_init() on ctx->hw_id_link, so there is no opportunity for list poisoning; either it isn't complaining about hw_id_link, or this is a very weird stray write.
Comment 6 Chris Wilson 2019-03-02 10:53:02 UTC
lut_close() also has a list_del. Let's look there next.
Comment 7 Chris Wilson 2019-03-02 11:02:21 UTC
And lut->obj_link looks pretty solidly guarded by struct_mutex.

The question du jour then is where exactly is context_close + 0x79.
Comment 8 Chris Wilson 2019-03-02 16:17:01 UTC
Also odd: despite the pointers holding LIST_POISON values and list debugging being enabled, we hit

<4>[  423.304395] general protection fault: 0000 [#1] PREEMPT SMP PTI

and not the list debug check itself. Thinking a stray write is more likely in that case?
Comment 9 Martin Peres 2019-04-23 12:41:39 UTC
Bumping the priority since it is also seen on shards!
Comment 10 Lakshmi 2019-04-26 13:22:02 UTC
According to Chris, this is random memory corruption that was waiting on the gem_create/clear bisect.
Comment 11 Chris Wilson 2019-05-15 21:11:53 UTC
Could be

commit 0152b3b3f49b36b0f1a1bf9f0353dc636f41d8f0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 8 12:24:52 2019 +0100

    drm/i915: Seal races between async GPU cancellation, retirement and signaling
    
    Currently there is an underlying assumption that i915_request_unsubmit()
    is synchronous wrt the GPU -- that is the request is no longer in flight
    as we remove it. In the near future that may change, and this may upset
    our signaling as we can process an interrupt for that request while it
    is no longer in flight.
    
    CPU0                                    CPU1
    intel_engine_breadcrumbs_irq
    (queue request completion)
                                            i915_request_cancel_signaling
    ...                                     ...
                                            i915_request_enable_signaling
    dma_fence_signal
    
    Hence in the time it took us to drop the lock to signal the request, a
    preemption event may have occurred and re-queued the request. In the
    process, that request would have seen I915_FENCE_FLAG_SIGNAL clear and
    so reused the rq->signal_link that was in use on CPU0, leading to bad
    pointer chasing in intel_engine_breadcrumbs_irq.
    
    A related issue was that if someone started listening for a signal on a
    completed but no longer in-flight request, we missed the opportunity to
    immediately signal that request.
    
    Furthermore, as intel_contexts may be immediately released during
    request retirement, in order to be entirely sure that
    intel_engine_breadcrumbs_irq may no longer dereference the intel_context
    (ce->signals and ce->signal_link), we must wait for irq spinlock.
    
    In order to prevent the race, we use a bit in the fence.flags to signal
    the transfer onto the signal list inside intel_engine_breadcrumbs_irq.
    For simplicity, we use the DMA_FENCE_FLAG_SIGNALED_BIT as it then
    quickly signals to any outside observer that the fence is indeed signaled.
    
    v2: Sketch out potential dma-fence API for manual signaling
    v3: And the test_and_set_bit()
    
    Fixes: 52c0fdb25c7c ("drm/i915: Replace global breadcrumbs with per-context interrupt tracking")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190508112452.18942-1-chris@chris-wilson.co.uk

There's a very small potential use-after-free inside the context there.

