Bug 111862 - [CI][DRMTIP] igt@gem_exec_await@wide-contexts - dmesg-warn - WARNING: possible circular locking dependency detected
Summary: [CI][DRMTIP] igt@gem_exec_await@wide-contexts - dmesg-warn - WARNING: possible circular locking dependency detected
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-30 09:47 UTC by Lakshmi
Modified: 2019-10-07 21:18 UTC
CC List: 1 user

See Also:
i915 platform: ICL
i915 features: GEM/Other



Description Lakshmi 2019-09-30 09:47:20 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_379/fi-icl-dsi/igt@gem_exec_await@wide-contexts.html

<6> [723.734165] Console: switching to colour dummy device 80x25
<6> [723.734219] [IGT] gem_exec_await: executing
<4> [723.763274] 
<4> [723.763278] ======================================================
<4> [723.763281] WARNING: possible circular locking dependency detected
<4> [723.763285] 5.3.0-g80fa0e042cdb-drmtip_379+ #1 Tainted: G     U           
<4> [723.763288] ------------------------------------------------------
<4> [723.763291] gem_exec_await/1388 is trying to acquire lock:
<4> [723.763294] ffff93a7b53221d8 (&engine->active.lock){..-.}, at: execlists_submit_request+0x2b/0x1e0 [i915]
<4> [723.763378] 
but task is already holding lock:
<4> [723.763381] ffff93a7c25f6d20 (&i915_request_get(rq)->submit/1){-.-.}, at: __i915_sw_fence_complete+0x1b2/0x250 [i915]
<4> [723.763420] 
which lock already depends on the new lock.

<4> [723.763423] 
the existing dependency chain (in reverse order) is:
<4> [723.763427] 
-> #2 (&i915_request_get(rq)->submit/1){-.-.}:
<4> [723.763434]        _raw_spin_lock_irqsave_nested+0x39/0x50
<4> [723.763478]        __i915_sw_fence_complete+0x1b2/0x250 [i915]
<4> [723.763513]        intel_engine_breadcrumbs_irq+0x3aa/0x5e0 [i915]
<4> [723.763600]        cs_irq_handler+0x49/0x50 [i915]
<4> [723.763659]        gen11_gt_irq_handler+0x17b/0x280 [i915]
<4> [723.763690]        gen11_irq_handler+0x54/0xf0 [i915]
<4> [723.763695]        __handle_irq_event_percpu+0x41/0x2d0
<4> [723.763699]        handle_irq_event_percpu+0x2b/0x70
<4> [723.763702]        handle_irq_event+0x2f/0x50
<4> [723.763706]        handle_edge_irq+0xee/0x1a0
<4> [723.763709]        do_IRQ+0x7e/0x160
<4> [723.763712]        ret_from_intr+0x0/0x1d
<4> [723.763717]        __slab_alloc.isra.28.constprop.33+0x4f/0x70
<4> [723.763720]        kmem_cache_alloc+0x28d/0x2f0
<4> [723.763724]        vm_area_dup+0x15/0x40
<4> [723.763727]        dup_mm+0x2dd/0x550
<4> [723.763730]        copy_process+0xf21/0x1ef0
<4> [723.763734]        _do_fork+0x71/0x670
<4> [723.763737]        __se_sys_clone+0x6e/0xa0
<4> [723.763741]        do_syscall_64+0x4f/0x210
<4> [723.763744]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [723.763747] 
-> #1 (&(&rq->lock)->rlock#2){-.-.}:
<4> [723.763752]        _raw_spin_lock+0x2a/0x40
<4> [723.763789]        __unwind_incomplete_requests+0x3eb/0x450 [i915]
<4> [723.763825]        __execlists_submission_tasklet+0x9ec/0x1d60 [i915]
<4> [723.763864]        execlists_submission_tasklet+0x34/0x50 [i915]
<4> [723.763874]        tasklet_action_common.isra.5+0x47/0xb0
<4> [723.763878]        __do_softirq+0xd8/0x4ae
<4> [723.763881]        irq_exit+0xa9/0xc0
<4> [723.763883]        smp_apic_timer_interrupt+0xb7/0x280
<4> [723.763887]        apic_timer_interrupt+0xf/0x20
<4> [723.763892]        cpuidle_enter_state+0xae/0x450
<4> [723.763895]        cpuidle_enter+0x24/0x40
<4> [723.763899]        do_idle+0x1e7/0x250
<4> [723.763902]        cpu_startup_entry+0x14/0x20
<4> [723.763905]        start_secondary+0x15f/0x1b0
<4> [723.763908]        secondary_startup_64+0xa4/0xb0
<4> [723.763911] 
-> #0 (&engine->active.lock){..-.}:
<4> [723.763916]        __lock_acquire+0x15d8/0x1ea0
<4> [723.763919]        lock_acquire+0xa6/0x1c0
<4> [723.763922]        _raw_spin_lock_irqsave+0x33/0x50
<4> [723.763956]        execlists_submit_request+0x2b/0x1e0 [i915]
<4> [723.764002]        submit_notify+0xa8/0x13c [i915]
<4> [723.764035]        __i915_sw_fence_complete+0x81/0x250 [i915]
<4> [723.764054]        i915_sw_fence_wake+0x51/0x64 [i915]
<4> [723.764054]        __i915_sw_fence_complete+0x1ee/0x250 [i915]
<4> [723.764054]        dma_i915_sw_fence_wake_timer+0x14/0x20 [i915]
<4> [723.764054]        dma_fence_signal_locked+0x9e/0x1c0
<4> [723.764054]        dma_fence_signal+0x1f/0x40
<4> [723.764054]        vgem_fence_signal_ioctl+0x67/0xc0 [vgem]
<4> [723.764054]        drm_ioctl_kernel+0x83/0xf0
<4> [723.764054]        drm_ioctl+0x2f3/0x3b0
<4> [723.764054]        do_vfs_ioctl+0xa0/0x6f0
<4> [723.764054]        ksys_ioctl+0x35/0x60
<4> [723.764054]        __x64_sys_ioctl+0x11/0x20
<4> [723.764054]        do_syscall_64+0x4f/0x210
<4> [723.764054]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [723.764054] 
other info that might help us debug this:

<4> [723.764054] Chain exists of:
  &engine->active.lock --> &(&rq->lock)->rlock#2 --> &i915_request_get(rq)->submit/1

<4> [723.764054]  Possible unsafe locking scenario:

<4> [723.764054]        CPU0                    CPU1
<4> [723.764054]        ----                    ----
<4> [723.764054]   lock(&i915_request_get(rq)->submit/1);
<4> [723.764054]                                lock(&(&rq->lock)->rlock#2);
<4> [723.764054]                                lock(&i915_request_get(rq)->submit/1);
<4> [723.764054]   lock(&engine->active.lock);
<4> [723.764054] 
 *** DEADLOCK ***
Comment 1 CI Bug Log 2019-09-30 09:49:18 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@gem_exec_await@wide-contexts - dmesg-warn - WARNING: possible circular locking dependency detected
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_379/fi-icl-dsi/igt@gem_exec_await@wide-contexts.html
Comment 2 Chris Wilson 2019-09-30 10:12:51 UTC
That doesn't make much sense; nothing has changed here for a long time. Recursion onto the same engine from within a submit is not possible -- we use an irq-work to break out of the engine lock to signal any completion events spotted while submitting. The dependency chain looks mighty odd as well.
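
For illustration, a minimal sketch of the irq-work pattern described above -- deferring completion signaling so it runs outside engine->active.lock. All names here are hypothetical stand-ins, not the actual i915 code:

    #include <linux/irq_work.h>
    #include <linux/spinlock.h>

    struct engine_sketch {
            spinlock_t active_lock;       /* stands in for engine->active.lock */
            struct irq_work signal_work;  /* deferred completion signaling */
    };

    static void signal_completions(struct irq_work *work)
    {
            /*
             * Runs later from a self-IPI with no engine lock held,
             * so it is free to take whatever fence/request locks
             * signaling needs.
             */
    }

    static void submit_locked(struct engine_sketch *e)
    {
            spin_lock(&e->active_lock);
            /*
             * Queue the request. If a completed request is spotted
             * here, do not signal it directly (that would recurse
             * into the lock chain); kick the irq_work instead.
             */
            irq_work_queue(&e->signal_work);
            spin_unlock(&e->active_lock);
    }

    /* setup, e.g. at engine init:
     * init_irq_work(&e->signal_work, signal_completions); */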
Comment 3 Francesco Balestrieri 2019-10-04 11:08:09 UTC
A fluke then? The fact that it happened once would support that.
Comment 4 Chris Wilson 2019-10-04 19:36:03 UTC
So I think the suspect is:

__unwind_incomplete_requests:

                        /*
                         * Decouple the virtual breadcrumb before moving it
                         * back to the virtual engine -- we don't want the
                         * request to complete in the background and try
                         * and cancel the breadcrumb on the virtual engine
                         * (instead of the old engine where it is linked)!
                         */
                        if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
                                     &rq->fence.flags)) {
                                spin_lock(&rq->lock);
                                i915_request_cancel_breadcrumb(rq);
                                spin_unlock(&rq->lock);
                        }

That would be an extremely rare path to hit, so it would rarely be seen by lockdep.
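
To spell out what lockdep objects to: the path above takes rq->lock while engine->active.lock is already held, whereas the signaling path in the trace acquires them in the opposite order (with the submit fence lock in between; condensed here to the two locks that invert). A simplified sketch of the two orderings, not the actual driver code:

    /*
     * Path A -- __unwind_incomplete_requests (snippet above):
     *     engine->active.lock  ->  rq->lock
     */
    spin_lock(&engine->active.lock);
    spin_lock(&rq->lock);                /* cancel the breadcrumb */
    spin_unlock(&rq->lock);
    spin_unlock(&engine->active.lock);

    /*
     * Path B -- the signaling path from the trace:
     *     rq->lock (via the submit fence)  ->  engine->active.lock
     */
    spin_lock(&rq->lock);                /* dma_fence_signal */
    spin_lock(&engine->active.lock);     /* execlists_submit_request */
    spin_unlock(&engine->active.lock);
    spin_unlock(&rq->lock);

Lockdep records both orderings and reports the cycle, even though, as the next comment explains, the two paths can never meet on the same engine.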
Comment 5 Chris Wilson 2019-10-04 19:49:28 UTC
After a bit of thought, the rule is that such a spin_lock(rq->lock) inside engine->active.lock needs a nesting annotation, as on the signaling path it is reversed (but we guarantee that it's a different set of engines).

https://patchwork.freedesktop.org/series/67621/
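
The nesting annotation is lockdep's standard subclass mechanism. Applied to the snippet in comment 4, the fix would look roughly like this (a sketch only; the patch series above carries the real change):

    if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
                 &rq->fence.flags)) {
            /*
             * Annotate this acquisition of rq->lock as nested so
             * lockdep keeps it distinct from the signaler's
             * rq->lock -- the signaling fence is guaranteed to
             * belong to a different engine.
             */
            spin_lock_nested(&rq->lock, SINGLE_DEPTH_NESTING);
            i915_request_cancel_breadcrumb(rq);
            spin_unlock(&rq->lock);
    }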
Comment 6 Chris Wilson 2019-10-07 21:18:39 UTC
commit 08ad9a3846fc72b047b110b36d162ffbcf298fa2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 20:47:58 2019 +0100

    drm/i915/execlists: Fix annotation for decoupling virtual request
    
    As we may signal a request and take the engine->active.lock within the
    signaler, the engine submission paths have to use a nested annotation on
    their requests -- but we guarantee that we can never submit on the same
    engine as the signaling fence.

