Bug 111882 - [CI][SHARDS] GLK: igt@gem_eio@kms - dmesg-warn - WARNING: possible circular locking dependency detected
Summary: [CI][SHARDS] GLK: igt@gem_eio@kms - dmesg-warn - WARNING: possible circular locking dependency detected
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: not set
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-02 13:20 UTC by Lakshmi
Modified: 2019-10-23 08:23 UTC
CC: 1 user

See Also:
i915 platform: GLK
i915 features: GEM/Other



Description Lakshmi 2019-10-02 13:20:48 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6983/shard-glk7/igt@gem_eio@kms.html
<4> [1429.580580] ======================================================
<4> [1429.580583] WARNING: possible circular locking dependency detected
<4> [1429.580588] 5.4.0-rc1-CI-CI_DRM_6983+ #1 Tainted: G     U           
<4> [1429.580592] ------------------------------------------------------
<4> [1429.580596] gem_eio/3607 is trying to acquire lock:
<4> [1429.580599] ffff88825cc502d8 (&mapping->i_mmap_rwsem){++++}, at: unmap_mapping_pages+0x48/0x130
<4> [1429.580612] 
but task is already holding lock:
<4> [1429.580616] ffff88826169c5b0 (&gt->reset.mutex){+.+.}, at: intel_gt_reset+0x5f/0x3c0 [i915]
<4> [1429.580710] 
which lock already depends on the new lock.

<4> [1429.580714] 
the existing dependency chain (in reverse order) is:
<4> [1429.580718] 
-> #3 (&gt->reset.mutex){+.+.}:
<4> [1429.580788]        i915_request_wait+0xf8/0x870 [i915]
<4> [1429.580854]        i915_active_wait+0x13c/0x280 [i915]
<4> [1429.580921]        i915_vma_unbind+0x1c5/0x4a0 [i915]
<4> [1429.580986]        eb_lookup_vmas+0x4ea/0x11a0 [i915]
<4> [1429.581049]        i915_gem_do_execbuffer+0x601/0x2360 [i915]
<4> [1429.581113]        i915_gem_execbuffer2_ioctl+0x11b/0x460 [i915]
<4> [1429.581119]        drm_ioctl_kernel+0xa7/0xf0
<4> [1429.581123]        drm_ioctl+0x2e1/0x390
<4> [1429.581128]        do_vfs_ioctl+0xa0/0x6f0
<4> [1429.581132]        ksys_ioctl+0x35/0x60
<4> [1429.581135]        __x64_sys_ioctl+0x11/0x20
<4> [1429.581140]        do_syscall_64+0x4f/0x210
<4> [1429.581146]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581150] 
-> #2 (i915_active){+.+.}:
<4> [1429.581216]        i915_active_wait+0x4a/0x280 [i915]
<4> [1429.581283]        i915_vma_unbind+0x1c5/0x4a0 [i915]
<4> [1429.581348]        i915_gem_object_unbind+0x153/0x1c0 [i915]
<4> [1429.581415]        userptr_mn_invalidate_range_start+0x9f/0x200 [i915]
<4> [1429.581421]        __mmu_notifier_invalidate_range_start+0xa3/0x180
<4> [1429.581426]        unmap_vmas+0x143/0x150
<4> [1429.581430]        unmap_region+0xa3/0x100
<4> [1429.581434]        __do_munmap+0x25d/0x490
<4> [1429.581437]        __vm_munmap+0x6e/0xc0
<4> [1429.581441]        __x64_sys_munmap+0x12/0x20
<4> [1429.581445]        do_syscall_64+0x4f/0x210
<4> [1429.581449]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581452] 
-> #1 (mmu_notifier_invalidate_range_start){+.+.}:
<4> [1429.581459]        page_mkclean_one+0xda/0x210
<4> [1429.581463]        rmap_walk_file+0xff/0x260
<4> [1429.581467]        page_mkclean+0x9f/0xb0
<4> [1429.581471]        clear_page_dirty_for_io+0xa2/0x300
<4> [1429.581477]        mpage_submit_page+0x1a/0x70
<4> [1429.581481]        mpage_process_page_bufs+0xe7/0x110
<4> [1429.581485]        mpage_prepare_extent_to_map+0x1d2/0x2b0
<4> [1429.581490]        ext4_writepages+0x592/0x1230
<4> [1429.581493]        do_writepages+0x46/0xe0
<4> [1429.581498]        __filemap_fdatawrite_range+0xc6/0x100
<4> [1429.581502]        file_write_and_wait_range+0x3c/0x90
<4> [1429.581506]        ext4_sync_file+0x154/0x500
<4> [1429.581511]        do_fsync+0x33/0x60
<4> [1429.581514]        __x64_sys_fsync+0xb/0x10
<4> [1429.581518]        do_syscall_64+0x4f/0x210
<4> [1429.581522]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581526] 
-> #0 (&mapping->i_mmap_rwsem){++++}:
<4> [1429.581533]        __lock_acquire+0x1328/0x15d0
<4> [1429.581537]        lock_acquire+0xa7/0x1c0
<4> [1429.581541]        down_write+0x33/0x70
<4> [1429.581545]        unmap_mapping_pages+0x48/0x130
<4> [1429.581606]        intel_gt_reset+0x142/0x3c0 [i915]
<4> [1429.581668]        intel_gt_reset_global+0xe0/0x150 [i915]
<4> [1429.581729]        intel_gt_handle_error+0x184/0x3b0 [i915]
<4> [1429.581788]        i915_wedged_set+0x5b/0xc0 [i915]
<4> [1429.581793]        simple_attr_write+0xb0/0xd0
<4> [1429.581799]        full_proxy_write+0x51/0x80
<4> [1429.581803]        vfs_write+0xb9/0x1d0
<4> [1429.581807]        ksys_write+0x9f/0xe0
<4> [1429.581810]        do_syscall_64+0x4f/0x210
<4> [1429.581814]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581818] 
other info that might help us debug this:

<4> [1429.581823] Chain exists of:
  &mapping->i_mmap_rwsem --> i915_active --> &gt->reset.mutex

<4> [1429.581830]  Possible unsafe locking scenario:

<4> [1429.581834]        CPU0                    CPU1
<4> [1429.581837]        ----                    ----
<4> [1429.581840]   lock(&gt->reset.mutex);
<4> [1429.581843]                                lock(i915_active);
<4> [1429.581847]                                lock(&gt->reset.mutex);
<4> [1429.581851]   lock(&mapping->i_mmap_rwsem);
<4> [1429.581854] 
 *** DEADLOCK ***

<4> [1429.581859] 3 locks held by gem_eio/3607:
<4> [1429.581862]  #0: ffff88827652a6b0 (sb_writers#14){.+.+}, at: vfs_write+0x1a4/0x1d0
<4> [1429.581869]  #1: ffff88826168ea58 (&attr->mutex){+.+.}, at: simple_attr_write+0x36/0xd0
<4> [1429.581877]  #2: ffff88826169c5b0 (&gt->reset.mutex){+.+.}, at: intel_gt_reset+0x5f/0x3c0 [i915]
<4> [1429.581941] 
stack backtrace:
<4> [1429.581946] CPU: 1 PID: 3607 Comm: gem_eio Tainted: G     U            5.4.0-rc1-CI-CI_DRM_6983+ #1
<4> [1429.581952] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [1429.581958] Call Trace:
<4> [1429.581964]  dump_stack+0x67/0x9b
<4> [1429.581969]  check_noncircular+0x19b/0x1c0
<4> [1429.581975]  ? __lock_acquire+0x1328/0x15d0
<4> [1429.581979]  __lock_acquire+0x1328/0x15d0
<4> [1429.581985]  ? trace_vbprintk+0x16a/0x210
<4> [1429.581989]  lock_acquire+0xa7/0x1c0
<4> [1429.581994]  ? unmap_mapping_pages+0x48/0x130
<4> [1429.581999]  down_write+0x33/0x70
<4> [1429.582002]  ? unmap_mapping_pages+0x48/0x130
<4> [1429.582007]  unmap_mapping_pages+0x48/0x130
<4> [1429.582068]  intel_gt_reset+0x142/0x3c0 [i915]
<4> [1429.582129]  intel_gt_reset_global+0xe0/0x150 [i915]
<4> [1429.582191]  ? intel_gt_set_wedged+0x70/0x70 [i915]
<4> [1429.582197]  ? queue_work_node+0x70/0x70
<4> [1429.582258]  intel_gt_handle_error+0x184/0x3b0 [i915]
<4> [1429.582320]  i915_wedged_set+0x5b/0xc0 [i915]
<4> [1429.582326]  ? simple_strtoull+0x3b/0x70
<4> [1429.582331]  simple_attr_write+0xb0/0xd0
<4> [1429.582335]  full_proxy_write+0x51/0x80
<4> [1429.582340]  vfs_write+0xb9/0x1d0
<4> [1429.582343]  ksys_write+0x9f/0xe0
<4> [1429.582347]  do_syscall_64+0x4f/0x210
<4> [1429.582352]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.582357] RIP: 0033:0x7fbe9a73c281
<4> [1429.582362] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
<4> [1429.582371] RSP: 002b:00007ffe335e7b78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4> [1429.582377] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe9a73c281
<4> [1429.582381] RDX: 0000000000000002 RSI: 00007fbe9adcf0e2 RDI: 0000000000000009
<4> [1429.582385] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000034
<4> [1429.582389] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fbe9adcf0e2
<4> [1429.582394] R13: 0000000000000009 R14: 00007ffe335e7d60 R15: 0000000000000040
<7> [1429.589338] [IGT] Checking that the GPU recovered
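
For reference, the sketch below is a small, purely illustrative user-space C program (not kernel code) that models the check lockdep is performing above: it records "held -> acquired" edges for the three locks named in the summary chain and flags the final acquisition because it closes a cycle. The enum values, the helpers acquire()/reachable(), and the collapsed three-lock graph are invented for this example.

#include <stdio.h>
#include <stdbool.h>

enum lock_id { RESET_MUTEX, I915_ACTIVE, I_MMAP_RWSEM, NR_LOCKS };

static const char *lock_name[NR_LOCKS] = {
        "&gt->reset.mutex", "i915_active", "&mapping->i_mmap_rwsem",
};

/* dep[a][b] means "b was acquired while a was held" (edge a -> b). */
static bool dep[NR_LOCKS][NR_LOCKS];

/* Depth-first search: is "to" reachable from "from" via recorded edges? */
static bool reachable(int from, int to, bool *visited)
{
        if (from == to)
                return true;
        visited[from] = true;
        for (int next = 0; next < NR_LOCKS; next++)
                if (dep[from][next] && !visited[next] &&
                    reachable(next, to, visited))
                        return true;
        return false;
}

/* Record a new "held -> acquired" edge; warn if it closes a cycle. */
static void acquire(int held, int acquired)
{
        bool visited[NR_LOCKS] = { false };

        if (reachable(acquired, held, visited))
                printf("possible circular locking dependency: %s -> %s closes a cycle\n",
                       lock_name[held], lock_name[acquired]);
        dep[held][acquired] = true;
}

int main(void)
{
        /* Existing dependencies, matching the summary chain in the splat: */
        acquire(I915_ACTIVE, RESET_MUTEX);      /* waiting on requests under i915_active */
        acquire(I_MMAP_RWSEM, I915_ACTIVE);     /* userptr mmu-notifier invalidation path */

        /* New edge from this bug: intel_gt_reset() holds reset.mutex and
         * calls unmap_mapping_pages(), which takes i_mmap_rwsem. */
        acquire(RESET_MUTEX, I_MMAP_RWSEM);     /* checker reports the cycle here */
        return 0;
}

Running it prints the "closes a cycle" warning on the third acquire(), which corresponds to the edge intel_gt_reset() adds in the trace above.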
Comment 2 Chris Wilson 2019-10-02 13:26:20 UTC
The lock inside i915_active_wait() is due for elimination, which will break this cycle. It is not yet clear, though, whether we will simply end up replacing it with another cycle instead.
Comment 3 Chris Wilson 2019-10-04 16:10:41 UTC
commit b1e3177bd1d8f41e2a9cc847e56a96cdc0eefe62
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 14:40:00 2019 +0100

    drm/i915: Coordinate i915_active with its own mutex
    
    Forgo the struct_mutex serialisation for i915_active, and interpose its
    own mutex handling for active/retire.
    
    This is a multi-layered sleight-of-hand. First, we had to ensure that no
    active/retire callbacks accidentally inverted the mutex ordering rules,
    nor assumed that they were themselves serialised by struct_mutex. More
    challenging though, is the rule over updating elements of the active
    rbtree. Instead of the whole i915_active now being serialised by
    struct_mutex, allocations/rotations of the tree are serialised by the
    i915_active.mutex and individual nodes are serialised by the caller
    using the i915_timeline.mutex (we need to use nested spinlocks to
    interact with the dma_fence callback lists).
    
    The pain point here is that instead of a single mutex around execbuf, we
    now have to take a mutex for active tracker (one for each vma, context,
    etc) and a couple of spinlocks for each fence update. The improvement in
    fine grained locking allowing for multiple concurrent clients
    (eventually!) should be worth it in typical loads.
    
    v2: Add some comments that barely elucidate anything :(
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-6-chris@chris-wilson.co.uk
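As a rough, user-space illustration of the locking pattern the commit message describes (each active tracker serialised by its own mutex rather than by a single outer struct_mutex), here is a sketch in C with pthreads. The struct and function names (active_tracker, tracker_add, ...) are invented for this example, and a linked list stands in for the rbtree of nodes; this is not the actual i915 implementation.

#include <pthread.h>
#include <stdlib.h>

struct active_node {
        struct active_node *next;       /* stand-in for the rbtree of nodes */
        unsigned long timeline;
};

struct active_tracker {
        pthread_mutex_t mutex;          /* guards this tracker's nodes only */
        struct active_node *nodes;
};

static void tracker_init(struct active_tracker *t)
{
        pthread_mutex_init(&t->mutex, NULL);
        t->nodes = NULL;
}

/* Add a node under the tracker's own lock; no global lock is taken. */
static int tracker_add(struct active_tracker *t, unsigned long timeline)
{
        struct active_node *node = malloc(sizeof(*node));

        if (!node)
                return -1;
        node->timeline = timeline;

        pthread_mutex_lock(&t->mutex);
        node->next = t->nodes;
        t->nodes = node;
        pthread_mutex_unlock(&t->mutex);
        return 0;
}

static void tracker_fini(struct active_tracker *t)
{
        struct active_node *node = t->nodes;

        while (node) {
                struct active_node *next = node->next;

                free(node);
                node = next;
        }
        pthread_mutex_destroy(&t->mutex);
}

int main(void)
{
        struct active_tracker vma_tracker, ctx_tracker;

        tracker_init(&vma_tracker);
        tracker_init(&ctx_tracker);

        /* Two trackers, two independent locks: updating one never has to
         * wait for the other, which is the fine-grained behaviour the
         * commit message is after. */
        tracker_add(&vma_tracker, 1);
        tracker_add(&ctx_tracker, 2);

        tracker_fini(&vma_tracker);
        tracker_fini(&ctx_tracker);
        return 0;
}

The point of the sketch is only lock ownership: contention is confined to the tracker being updated instead of funnelling every update through one global mutex.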
Comment 4 Lakshmi 2019-10-23 08:23:24 UTC
This issue occurred three times in 4 runs and was last seen in CI_DRM_6983_full (3 weeks old); the current run is 7155. Closing and archiving this bug.
Comment 5 CI Bug Log 2019-10-23 08:23:29 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

