Bug 111626

Summary: [CI][SHARDS] igt@perf_pmu@busy-idle-no-semaphores-bcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected
Product: DRI Reporter: Lakshmi <lakshminarayana.vudum>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: not set    
Priority: not set CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: BXT, GLK, HSW, ICL, KBL, SKL i915 features: GEM/Other, Perf/PMU

Description Lakshmi 2019-09-10 06:45:39 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6847/shard-skl5/igt@perf_pmu@busy-idle-no-semaphores-bcs0.html

[IGT] perf_pmu: executing
<6> [1794.462364] [IGT] perf_pmu: starting subtest busy-idle-no-semaphores-bcs0
<4> [1795.080225] 
<4> [1795.080255] ========================================================
<4> [1795.080277] WARNING: possible irq lock inversion dependency detected
<4> [1795.080299] 5.3.0-rc7-CI-CI_DRM_6847+ #1 Tainted: G     U           
<4> [1795.080316] --------------------------------------------------------
<4> [1795.080334] kworker/0:0H/3056 just changed the state of lock:
<4> [1795.080352] 0000000062a190ff (&timeline->mutex/2){-...}, at: __engine_park+0x3e/0x320 [i915]
<4> [1795.080638] but this lock took another, HARDIRQ-unsafe lock in the past:
<4> [1795.080655]  (&(&lock->wait_lock)->rlock){+.+.}
<4> [1795.080660] 

and interrupts could create inverse lock ordering between them.

<4> [1795.080698] 
other info that might help us debug this:
<4> [1795.080716] Chain exists of:
  &timeline->mutex/2 --> &(&timelines->lock)->rlock --> &(&lock->wait_lock)->rlock

<4> [1795.080755]  Possible interrupt unsafe locking scenario:

<4> [1795.080774]        CPU0                    CPU1
<4> [1795.080788]        ----                    ----
<4> [1795.080802]   lock(&(&lock->wait_lock)->rlock);
<4> [1795.080819]                                local_irq_disable();
<4> [1795.080835]                                lock(&timeline->mutex/2);
<4> [1795.080855]                                lock(&(&timelines->lock)->rlock);
<4> [1795.080876]   <Interrupt>
<4> [1795.080886]     lock(&timeline->mutex/2);
<4> [1795.080904] 
 *** DEADLOCK ***
Comment 1 CI Bug Log 2019-09-10 06:46:50 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL:  igt@perf_pmu@busy-idle-no-semaphores-bcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6847/shard-skl5/igt@perf_pmu@busy-idle-no-semaphores-bcs0.html
Comment 2 Chris Wilson 2019-09-10 08:29:42 UTC
That looks to be a usb lockchain, if I am reading that correctly.
Comment 3 CI Bug Log 2019-09-13 08:43:58 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL:  igt@perf_pmu@busy-idle-no-semaphores-bcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ SKL:  igt@perf_pmu@busy-idle-no-semaphores-bcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6878/shard-skl4/igt@perf_pmu@busy-idle-no-semaphores-vecs0.html
Comment 4 Tvrtko Ursulin 2019-09-13 13:53:48 UTC
What about intel_engine_pm_put from i915_pmu.c/engines_sample (hrtimer)? Do we need to hold an engine pm ref for the duration of the event being enabled? However those hooks are also potentially from irqoff context...
Comment 5 Tvrtko Ursulin 2019-09-13 14:01:26 UTC
I was confused by mutex under hardirq but looks that's deliberate from __timeline_mark_lock (d67739268cf0ee928cdc5e8224090c59efacd653).
Comment 6 Chris Wilson 2019-09-13 14:03:41 UTC
Yes, we need a runtime-pm reference, and we use the active engine-pm as that also tells if it was idle and we can skip. The engine-pm put(wakeref) should be irqsafe for this purpose.

I don't know what the lock->wait_lock is -- it doesn't look like one of ours, which makes it more baffling how it got coupled into our wakeref.
Comment 7 CI Bug Log 2019-09-16 10:10:07 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL:  igt@perf_pmu@busy-idle-no-semaphores-bcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ HSW SKL:  igt@perf_pmu@busy-idle-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6896/shard-hsw6/igt@perf_pmu@busy-no-semaphores-bcs0.html
Comment 8 CI Bug Log 2019-09-17 09:20:41 UTC
A CI Bug Log filter associated to this bug has been updated:

{- HSW SKL:  igt@perf_pmu@busy-idle-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ HSW SKL:  igt@perf_pmu@busy-idle-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6898/shard-skl3/igt@perf_pmu@busy-no-semaphores-vcs0.html
Comment 9 CI Bug Log 2019-09-23 13:21:12 UTC
A CI Bug Log filter associated to this bug has been updated:

{- HSW SKL:  igt@perf_pmu@busy-idle-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW SKL:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_376/fi-byt-j1900/igt@perf_pmu@busy-no-semaphores-bcs0.html
Comment 10 Chris Wilson 2019-09-23 13:32:53 UTC
*** Bug 111783 has been marked as a duplicate of this bug. ***
Comment 11 CI Bug Log 2019-10-01 10:20:10 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW SKL:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6977/shard-glk8/igt@perf_pmu@busy-idle-no-semaphores-vcs0.html
Comment 12 CI Bug Log 2019-10-03 08:41:20 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6987/shard-skl5/igt@perf_pmu@busy-no-semaphores-vecs0.html
Comment 13 CI Bug Log 2019-10-03 12:51:37 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW  APL SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6990/shard-apl7/igt@perf_pmu@busy-idle-no-semaphores-vcs0.html
Comment 14 CI Bug Log 2019-10-09 17:39:43 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* BYT: igt@perf_pmu@busy-check-all-vcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_385/fi-byt-j1900/igt@perf_pmu@busy-check-all-vcs0.html
Comment 15 CI Bug Log 2019-10-11 10:03:38 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW  APL SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW  APL SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7052/shard-apl1/igt@perf_pmu@busy-no-semaphores-rcs0.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_387/fi-byt-n2820/igt@perf_pmu@busy-hang-rcs0.html
Comment 16 Chris Wilson 2019-10-11 10:26:44 UTC
*** Bug 111945 has been marked as a duplicate of this bug. ***
Comment 17 CI Bug Log 2019-10-14 12:51:11 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL: igt@gem_persistent_relocs@forked-faulting-reloc-thrashing - dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7060/shard-skl1/igt@gem_persistent_relocs@forked-faulting-reloc-thrashing.html
Comment 18 CI Bug Log 2019-10-17 13:16:03 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW  APL SKL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7084/shard-kbl4/igt@perf_pmu@busy-idle-no-semaphores-vecs0.html
Comment 19 CI Bug Log 2019-10-18 12:56:58 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7103/shard-skl7/igt@perf_pmu@semaphore-busy-vecs0.html
Comment 20 CI Bug Log 2019-10-18 13:47:26 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL: igt@gem_ctx_shared@q-smoketest-bsd1- dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7120/shard-skl9/igt@gem_ctx_shared@q-smoketest-bsd1.html
Comment 21 CI Bug Log 2019-10-22 10:03:37 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL: igt@gem_ctx_shared@q-smoketest-bsd1- dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected -}
{+ SKL KBL: All tests- dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7135/shard-kbl7/igt@gem_persistent_relocs@forked-interruptible-thrash-inactive.html
Comment 22 Chris Wilson 2019-10-28 14:21:15 UTC
*** Bug 112153 has been marked as a duplicate of this bug. ***
Comment 23 Chris Wilson 2019-10-30 10:08:51 UTC
Sigh,

The following commit has been merged into the locking/core branch of tip:
  
  Commit-ID:     a0855d24fc22d49cdc25664fb224caee16998683
  Gitweb:        https://git.kernel.org/tip/a0855d24fc22d49cdc25664fb224caee16998683
  Author:        Davidlohr Bueso <dave@stgolabs.net>
  AuthorDate:    Thu, 24 Oct 2019 20:36:34 -07:00
  Committer:     Ingo Molnar <mingo@kernel.org>
  CommitterDate: Tue, 29 Oct 2019 12:22:52 +01:00
  
  locking/mutex: Complain upon mutex API misuse in IRQ contexts
  
  Add warning checks if mutex_trylock() or mutex_unlock() are used in
  IRQ contexts, under CONFIG_DEBUG_MUTEXES=y.
  
  While the mutex rules and semantics are explicitly documented, this allows
  to expose any abusers and robustifies the whole thing.
  
  While trylock and unlock are non-blocking, calling from IRQ context
  is still forbidden (lock must be within the same context as unlock).
  
  Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Cc: dave@stgolabs.net
  Link: https://lkml.kernel.org/r/20191025033634.3330-1-dave@stgolabs.net
  Signed-off-by: Ingo Molnar <mingo@kernel.org>

next kernel this warning will get louder.
Comment 24 CI Bug Log 2019-10-30 15:25:14 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7214/shard-skl7/igt@perf_pmu@semaphore-busy-bcs0.html
Comment 25 CI Bug Log 2019-10-31 12:10:33 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT: igt@perf_pmu@busy-check-all-vcs0 - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT: igt@perf_pmu@busy-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_395/fi-byt-j1900/igt@perf_pmu@busy-hang-bcs0.html
Comment 26 CI Bug Log 2019-11-11 11:57:32 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL: igt@gem_persistent_relocs@forked-faulting-reloc-thrashing - dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected -}
{+ SKL ICL: igt@gem_persistent_relocs@forked-faulting-reloc-thrashing - dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected, acquire at: __mutex_unlock_slowpath, holding by: intel_gt_retire_requests_timeout +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5265/shard-iclb3/igt@gem_persistent_relocs@forked-interruptible-faulting-reloc-thrashing.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7280/shard-iclb3/igt@gem_persistent_relocs@forked-interruptible-faulting-reloc-thrashing.html
Comment 27 CI Bug Log 2019-11-13 11:26:29 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@busy-(idle)?-no-semaphores-* - dmesg-warn - WARNING: possible irq lock inversion dependency detected -}
{+ BYT HSW  APL SKL KBL GLK:  igt@perf_pmu@* - dmesg-warn - WARNING: possible irq lock inversion dependency detected +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7315/shard-skl6/igt@perf_pmu@semaphore-busy-vcs0.html
Comment 28 Chris Wilson 2019-11-14 23:52:32 UTC
commit c9ad602feabe4271d2adf1bdae5d8b20c2dc84f1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 14 17:25:35 2019 +0000

    drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree
    
    As we want to be able to run inside atomic context for retiring the
    i915_active, and we are no longer allowed to abuse mutex_trylock, split
    the tree management portion of i915_active.mutex into an irq-safe
    spinlock.
    
    References: a0855d24fc22d ("locking/mutex: Complain upon mutex API misuse in IRQ contexts")
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111626
    Fixes: 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Matthew Auld <matthew.auld@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191114172535.1116-1-chris@chris-wilson.co.uk

will get rid of most of the culprits, but we still have the intel_engine_pm_put() from inside i915_pmu.c to worry about. I might just stop pushing for that to be immediate and always schedule the work.
Comment 29 CI Bug Log 2019-11-20 15:45:06 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SKL ICL: igt@gem_persistent_relocs@forked-faulting-reloc-thrashing - dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected, acquire at: __mutex_unlock_slowpath, holding by: intel_gt_retire_requests_timeout -}
{+ BYT SKL ICL: igt@gem_persistent_relocs@forked-faulting-reloc-thrashing - dmesg-warn - WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected, acquire at: __mutex_unlock_slowpath, holding by: intel_gt_retire_requests_timeout +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_407/fi-byt-j1900/igt@gem_persistent_relocs@forked-faulting-reloc-thrashing.html
Comment 30 Chris Wilson 2019-11-20 16:45:31 UTC
commit 07779a76ee1f93f930cf697b22be73d16e14f50c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Nov 20 12:54:33 2019 +0000

    drm/i915: Mark up the calling context for intel_wakeref_put()
    
    Previously, we assumed we could use mutex_trylock() within an atomic
    context, falling back to a worker if contended. However, such trickery
    is illegal inside interrupt context, and so we need to always use a
    worker under such circumstances. As we normally are in process context,
    we can typically use a plain mutex, and only defer to a work when we
    know we are being called from an interrupt path.
    
    Fixes: 51fbd8de87dc ("drm/i915/pmu: Atomically acquire the gt_pm wakeref")
    References: a0855d24fc22d ("locking/mutex: Complain upon mutex API misuse in IRQ contexts")
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111626
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191120125433.3767149-1-chris@chris-wilson.co.uk

Good things come to those wait, or so I am told.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.