Bug 111882 - [CI][SHARDS] GLK: igt@gem_eio@kms - dmesg-warn - WARNING: possible circular locking dependency detected
Summary: [CI][SHARDS] GLK: igt@gem_eio@kms - dmesg-warn - WARNING: possible circular locking dependency detected
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other
OS: All
Importance: not set / not set
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-02 13:20 UTC by Lakshmi
Modified: 2019-10-04 16:10 UTC
CC List: 1 user

See Also:
i915 platform: GLK
i915 features: GEM/Other


Attachments

Description Lakshmi 2019-10-02 13:20:48 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6983/shard-glk7/igt@gem_eio@kms.html
<4> [1429.580580] ======================================================
<4> [1429.580583] WARNING: possible circular locking dependency detected
<4> [1429.580588] 5.4.0-rc1-CI-CI_DRM_6983+ #1 Tainted: G     U           
<4> [1429.580592] ------------------------------------------------------
<4> [1429.580596] gem_eio/3607 is trying to acquire lock:
<4> [1429.580599] ffff88825cc502d8 (&mapping->i_mmap_rwsem){++++}, at: unmap_mapping_pages+0x48/0x130
<4> [1429.580612] 
but task is already holding lock:
<4> [1429.580616] ffff88826169c5b0 (&gt->reset.mutex){+.+.}, at: intel_gt_reset+0x5f/0x3c0 [i915]
<4> [1429.580710] 
which lock already depends on the new lock.

<4> [1429.580714] 
the existing dependency chain (in reverse order) is:
<4> [1429.580718] 
-> #3 (&gt->reset.mutex){+.+.}:
<4> [1429.580788]        i915_request_wait+0xf8/0x870 [i915]
<4> [1429.580854]        i915_active_wait+0x13c/0x280 [i915]
<4> [1429.580921]        i915_vma_unbind+0x1c5/0x4a0 [i915]
<4> [1429.580986]        eb_lookup_vmas+0x4ea/0x11a0 [i915]
<4> [1429.581049]        i915_gem_do_execbuffer+0x601/0x2360 [i915]
<4> [1429.581113]        i915_gem_execbuffer2_ioctl+0x11b/0x460 [i915]
<4> [1429.581119]        drm_ioctl_kernel+0xa7/0xf0
<4> [1429.581123]        drm_ioctl+0x2e1/0x390
<4> [1429.581128]        do_vfs_ioctl+0xa0/0x6f0
<4> [1429.581132]        ksys_ioctl+0x35/0x60
<4> [1429.581135]        __x64_sys_ioctl+0x11/0x20
<4> [1429.581140]        do_syscall_64+0x4f/0x210
<4> [1429.581146]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581150] 
-> #2 (i915_active){+.+.}:
<4> [1429.581216]        i915_active_wait+0x4a/0x280 [i915]
<4> [1429.581283]        i915_vma_unbind+0x1c5/0x4a0 [i915]
<4> [1429.581348]        i915_gem_object_unbind+0x153/0x1c0 [i915]
<4> [1429.581415]        userptr_mn_invalidate_range_start+0x9f/0x200 [i915]
<4> [1429.581421]        __mmu_notifier_invalidate_range_start+0xa3/0x180
<4> [1429.581426]        unmap_vmas+0x143/0x150
<4> [1429.581430]        unmap_region+0xa3/0x100
<4> [1429.581434]        __do_munmap+0x25d/0x490
<4> [1429.581437]        __vm_munmap+0x6e/0xc0
<4> [1429.581441]        __x64_sys_munmap+0x12/0x20
<4> [1429.581445]        do_syscall_64+0x4f/0x210
<4> [1429.581449]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581452] 
-> #1 (mmu_notifier_invalidate_range_start){+.+.}:
<4> [1429.581459]        page_mkclean_one+0xda/0x210
<4> [1429.581463]        rmap_walk_file+0xff/0x260
<4> [1429.581467]        page_mkclean+0x9f/0xb0
<4> [1429.581471]        clear_page_dirty_for_io+0xa2/0x300
<4> [1429.581477]        mpage_submit_page+0x1a/0x70
<4> [1429.581481]        mpage_process_page_bufs+0xe7/0x110
<4> [1429.581485]        mpage_prepare_extent_to_map+0x1d2/0x2b0
<4> [1429.581490]        ext4_writepages+0x592/0x1230
<4> [1429.581493]        do_writepages+0x46/0xe0
<4> [1429.581498]        __filemap_fdatawrite_range+0xc6/0x100
<4> [1429.581502]        file_write_and_wait_range+0x3c/0x90
<4> [1429.581506]        ext4_sync_file+0x154/0x500
<4> [1429.581511]        do_fsync+0x33/0x60
<4> [1429.581514]        __x64_sys_fsync+0xb/0x10
<4> [1429.581518]        do_syscall_64+0x4f/0x210
<4> [1429.581522]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581526] 
-> #0 (&mapping->i_mmap_rwsem){++++}:
<4> [1429.581533]        __lock_acquire+0x1328/0x15d0
<4> [1429.581537]        lock_acquire+0xa7/0x1c0
<4> [1429.581541]        down_write+0x33/0x70
<4> [1429.581545]        unmap_mapping_pages+0x48/0x130
<4> [1429.581606]        intel_gt_reset+0x142/0x3c0 [i915]
<4> [1429.581668]        intel_gt_reset_global+0xe0/0x150 [i915]
<4> [1429.581729]        intel_gt_handle_error+0x184/0x3b0 [i915]
<4> [1429.581788]        i915_wedged_set+0x5b/0xc0 [i915]
<4> [1429.581793]        simple_attr_write+0xb0/0xd0
<4> [1429.581799]        full_proxy_write+0x51/0x80
<4> [1429.581803]        vfs_write+0xb9/0x1d0
<4> [1429.581807]        ksys_write+0x9f/0xe0
<4> [1429.581810]        do_syscall_64+0x4f/0x210
<4> [1429.581814]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.581818] 
other info that might help us debug this:

<4> [1429.581823] Chain exists of:
  &mapping->i_mmap_rwsem --> i915_active --> &gt->reset.mutex

<4> [1429.581830]  Possible unsafe locking scenario:

<4> [1429.581834]        CPU0                    CPU1
<4> [1429.581837]        ----                    ----
<4> [1429.581840]   lock(&gt->reset.mutex);
<4> [1429.581843]                                lock(i915_active);
<4> [1429.581847]                                lock(&gt->reset.mutex);
<4> [1429.581851]   lock(&mapping->i_mmap_rwsem);
<4> [1429.581854] 
 *** DEADLOCK ***

<4> [1429.581859] 3 locks held by gem_eio/3607:
<4> [1429.581862]  #0: ffff88827652a6b0 (sb_writers#14){.+.+}, at: vfs_write+0x1a4/0x1d0
<4> [1429.581869]  #1: ffff88826168ea58 (&attr->mutex){+.+.}, at: simple_attr_write+0x36/0xd0
<4> [1429.581877]  #2: ffff88826169c5b0 (&gt->reset.mutex){+.+.}, at: intel_gt_reset+0x5f/0x3c0 [i915]
<4> [1429.581941] 
stack backtrace:
<4> [1429.581946] CPU: 1 PID: 3607 Comm: gem_eio Tainted: G     U            5.4.0-rc1-CI-CI_DRM_6983+ #1
<4> [1429.581952] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [1429.581958] Call Trace:
<4> [1429.581964]  dump_stack+0x67/0x9b
<4> [1429.581969]  check_noncircular+0x19b/0x1c0
<4> [1429.581975]  ? __lock_acquire+0x1328/0x15d0
<4> [1429.581979]  __lock_acquire+0x1328/0x15d0
<4> [1429.581985]  ? trace_vbprintk+0x16a/0x210
<4> [1429.581989]  lock_acquire+0xa7/0x1c0
<4> [1429.581994]  ? unmap_mapping_pages+0x48/0x130
<4> [1429.581999]  down_write+0x33/0x70
<4> [1429.582002]  ? unmap_mapping_pages+0x48/0x130
<4> [1429.582007]  unmap_mapping_pages+0x48/0x130
<4> [1429.582068]  intel_gt_reset+0x142/0x3c0 [i915]
<4> [1429.582129]  intel_gt_reset_global+0xe0/0x150 [i915]
<4> [1429.582191]  ? intel_gt_set_wedged+0x70/0x70 [i915]
<4> [1429.582197]  ? queue_work_node+0x70/0x70
<4> [1429.582258]  intel_gt_handle_error+0x184/0x3b0 [i915]
<4> [1429.582320]  i915_wedged_set+0x5b/0xc0 [i915]
<4> [1429.582326]  ? simple_strtoull+0x3b/0x70
<4> [1429.582331]  simple_attr_write+0xb0/0xd0
<4> [1429.582335]  full_proxy_write+0x51/0x80
<4> [1429.582340]  vfs_write+0xb9/0x1d0
<4> [1429.582343]  ksys_write+0x9f/0xe0
<4> [1429.582347]  do_syscall_64+0x4f/0x210
<4> [1429.582352]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1429.582357] RIP: 0033:0x7fbe9a73c281
<4> [1429.582362] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
<4> [1429.582371] RSP: 002b:00007ffe335e7b78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4> [1429.582377] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe9a73c281
<4> [1429.582381] RDX: 0000000000000002 RSI: 00007fbe9adcf0e2 RDI: 0000000000000009
<4> [1429.582385] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000034
<4> [1429.582389] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fbe9adcf0e2
<4> [1429.582394] R13: 0000000000000009 R14: 00007ffe335e7d60 R15: 0000000000000040
<7> [1429.589338] [IGT] Checking that the GPU recovered
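
The splat above is a lockdep prediction rather than an observed hang: earlier code paths have already taught lockdep the orderings i_mmap_rwsem -> mmu_notifier_invalidate_range_start -> i915_active -> gt->reset.mutex (chains #1 to #3), and the reset path now takes i_mmap_rwsem via unmap_mapping_pages() while holding gt->reset.mutex (chain #0), which closes the cycle. As an illustration only, here is a minimal userspace analogy in plain pthreads (not i915 code; reset_mutex, active_mutex and mmap_rwsem are invented names for the sketch) of how two threads taking the same three locks in those conflicting orders can stall each other:

/* Hypothetical analogy of the three-lock cycle in the report above.
 *   thread_reset  mirrors the reset path:        reset_mutex -> mmap_rwsem
 *   thread_unbind mirrors the writeback/munmap
 *   side of the dependency chain:                mmap_rwsem -> active_mutex -> reset_mutex
 * Build with: cc -pthread demo.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t reset_mutex  = PTHREAD_MUTEX_INITIALIZER;  /* ~ gt->reset.mutex       */
static pthread_mutex_t active_mutex = PTHREAD_MUTEX_INITIALIZER;  /* ~ i915_active           */
static pthread_rwlock_t mmap_rwsem  = PTHREAD_RWLOCK_INITIALIZER; /* ~ mapping->i_mmap_rwsem */

static void *thread_reset(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&reset_mutex);    /* the reset path holds its mutex ...              */
    sleep(1);                            /* widen the race window for the demo              */
    pthread_rwlock_wrlock(&mmap_rwsem);  /* ... then wants the mapping lock to zap mmaps    */
    puts("reset path done");
    pthread_rwlock_unlock(&mmap_rwsem);
    pthread_mutex_unlock(&reset_mutex);
    return NULL;
}

static void *thread_unbind(void *arg)
{
    (void)arg;
    pthread_rwlock_wrlock(&mmap_rwsem);  /* writeback/munmap holds the mapping lock ...     */
    pthread_mutex_lock(&active_mutex);   /* ... the notifier path waits on the active lock  */
    sleep(1);
    pthread_mutex_lock(&reset_mutex);    /* ... which in turn can wait on the reset mutex   */
    puts("unbind path done");
    pthread_mutex_unlock(&reset_mutex);
    pthread_mutex_unlock(&active_mutex);
    pthread_rwlock_unlock(&mmap_rwsem);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, thread_reset, NULL);
    pthread_create(&b, NULL, thread_unbind, NULL);
    pthread_join(a, NULL);  /* with this timing, neither join ever returns */
    pthread_join(b, NULL);
    return 0;
}

With that timing, thread_reset holds reset_mutex and waits for mmap_rwsem while thread_unbind holds mmap_rwsem and (via active_mutex) waits for reset_mutex, the same shape as the CPU0/CPU1 table in the report, just with the mmu-notifier step spelled out.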
Comment 2 Chris Wilson 2019-10-02 13:26:20 UTC
The lock inside i915_active_wait() is due for elimination, which will break this cycle. But it is not yet clear whether we will simply end up replacing it with another cycle instead.
Comment 3 Chris Wilson 2019-10-04 16:10:41 UTC
commit b1e3177bd1d8f41e2a9cc847e56a96cdc0eefe62
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 14:40:00 2019 +0100

    drm/i915: Coordinate i915_active with its own mutex
    
    Forgo the struct_mutex serialisation for i915_active, and interpose its
    own mutex handling for active/retire.
    
    This is a multi-layered sleight-of-hand. First, we had to ensure that no
    active/retire callbacks accidentally inverted the mutex ordering rules,
    nor assumed that they were themselves serialised by struct_mutex. More
    challenging though, is the rule over updating elements of the active
    rbtree. Instead of the whole i915_active now being serialised by
    struct_mutex, allocations/rotations of the tree are serialised by the
    i915_active.mutex and individual nodes are serialised by the caller
    using the i915_timeline.mutex (we need to use nested spinlocks to
    interact with the dma_fence callback lists).
    
    The pain point here is that instead of a single mutex around execbuf, we
    now have to take a mutex for active tracker (one for each vma, context,
    etc) and a couple of spinlocks for each fence update. The improvement in
    fine grained locking allowing for multiple concurrent clients
    (eventually!) should be worth it in typical loads.
    
    v2: Add some comments that barely elucidate anything :(
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-6-chris@chris-wilson.co.uk
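
To illustrate the direction of the fix, a hedged userspace sketch follows (plain pthreads, not the actual i915 implementation; struct active_tracker and the active_init/active_acquire/active_retire/active_wait helpers are invented names). The point the commit message makes is that each active tracker is serialised by its own mutex, so waiting for it to idle only touches locks local to that tracker rather than a driver-wide lock that the reset path also depends on:

/* Hedged sketch of "each tracker owns its own lock"; not the real i915 code.
 * Build with: cc -pthread active_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct active_tracker {
    pthread_mutex_t mutex;  /* per-tracker lock, analogous to i915_active.mutex */
    pthread_cond_t  idle;   /* signalled when the last activity retires         */
    unsigned int    count;  /* outstanding activities on this tracker           */
};

static void active_init(struct active_tracker *ref)
{
    pthread_mutex_init(&ref->mutex, NULL);
    pthread_cond_init(&ref->idle, NULL);
    ref->count = 0;
}

/* New work starts using the object tracked by @ref. */
static void active_acquire(struct active_tracker *ref)
{
    pthread_mutex_lock(&ref->mutex);
    ref->count++;
    pthread_mutex_unlock(&ref->mutex);
}

/* Retire path: that work has completed. */
static void active_retire(struct active_tracker *ref)
{
    pthread_mutex_lock(&ref->mutex);
    if (--ref->count == 0)
        pthread_cond_broadcast(&ref->idle);
    pthread_mutex_unlock(&ref->mutex);
}

/* Waiting takes only this tracker's own mutex, so it cannot pull a shared
 * driver-wide lock into the dependency chain needed by the reset path. */
static void active_wait(struct active_tracker *ref)
{
    pthread_mutex_lock(&ref->mutex);
    while (ref->count)
        pthread_cond_wait(&ref->idle, &ref->mutex);
    pthread_mutex_unlock(&ref->mutex);
}

static struct active_tracker ref;

static void *worker(void *arg)
{
    (void)arg;
    sleep(1);               /* stand-in for the GPU finishing its work */
    active_retire(&ref);
    return NULL;
}

int main(void)
{
    pthread_t t;

    active_init(&ref);
    active_acquire(&ref);                   /* the object becomes busy  */
    pthread_create(&t, NULL, worker, NULL);
    active_wait(&ref);                      /* blocks on ref.mutex only */
    puts("tracker idle");
    return pthread_join(t, NULL);
}

The real driver is considerably more involved (per-node i915_timeline.mutex serialisation and nested spinlocks around the dma_fence callback lists, as the message says), but the sketch captures the locking shape that Comment 2 expected this work to produce in order to break the reported cycle.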

