Bug 108375

Summary:	[CI][DRMTIP] igt@gem_userptr_blits@coherency-sync - dmesg-warn - WARNING: possible circular locking dependency detected
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	high	CC:	intel-gfx-bugs
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:	PNV	i915 features:	GEM/Other

Description Martin Peres 2018-10-15 15:13:22 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_126/fi-pnv-d510/igt@gem_userptr_blits@coherency-sync.html

<6> [387.029820] [IGT] gem_userptr_blits: starting subtest coherency-sync
<4> [419.324985] 
<4> [419.325012] ======================================================
<4> [419.325036] WARNING: possible circular locking dependency detected
<4> [419.325065] 4.19.0-rc7-gdec9886eff39-drmtip_126+ #1 Tainted: G     U           
<4> [419.325090] ------------------------------------------------------
<4> [419.325114] kworker/u8:3/113 is trying to acquire lock:
<4> [419.325141] 000000004f6d1966 (&obj->mm.lock){+.+.}, at: cancel_userptr+0x19/0xf0 [i915]
<4> [419.325542] \x0abut task is already holding lock:
<4> [419.325588] 00000000e3332cfd ((work_completion)(&mo->work)){+.+.}, at: process_one_work+0x1bf/0x610
<4> [419.325588] \x0awhich lock already depends on the new lock.\x0a
<4> [419.325588] \x0athe existing dependency chain (in reverse order) is:
<4> [419.325588] \x0a-> #5 ((work_completion)(&mo->work)){+.+.}:
<4> [419.325762]        worker_thread+0x37/0x380
<4> [419.325762]        kthread+0x119/0x130
<4> [419.325762]        ret_from_fork+0x24/0x50
<4> [419.325762] \x0a-> #4 ((wq_completion)"i915-userptr-release"){+.+.}:
<4> [419.325762]        i915_gem_userptr_mn_invalidate_range_start+0x1bb/0x1e0 [i915]
<4> [419.325762]        __mmu_notifier_invalidate_range_start+0xa6/0x120
<4> [419.325762]        try_to_unmap_one+0x706/0x800
<4> [419.325762]        rmap_walk_anon+0x101/0x270
<4> [419.325762]        try_to_unmap+0xbe/0xf0
<4> [419.325762]        shrink_page_list+0x60e/0xcc0
<4> [419.325762]        shrink_inactive_list+0x36d/0x740
<4> [419.325762]        shrink_node_memcg+0x368/0x740
<4> [419.325762]        shrink_node+0xc9/0x450
<4> [419.325762]        kswapd+0x2f8/0x8f0
<4> [419.325762]        kthread+0x119/0x130
<4> [419.325762]        ret_from_fork+0x24/0x50
<4> [419.325762] \x0a-> #3 (&anon_vma->rwsem){++++}:
<4> [419.325762]        __vma_adjust+0x390/0x6c0
<4> [419.326467]        __split_vma+0x16a/0x180
<4> [419.326467]        mprotect_fixup+0x2a5/0x320
<4> [419.326467]        do_mprotect_pkey+0x208/0x2e0
<4> [419.326467]        __x64_sys_mprotect+0x16/0x20
<4> [419.326467]        do_syscall_64+0x55/0x190
<4> [419.326467]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [419.326467] \x0a-> #2 (&mapping->i_mmap_rwsem){++++}:
<4> [419.326467]        rmap_walk_file+0x20f/0x250
<4> [419.326467]        page_referenced+0x14a/0x170
<4> [419.326467]        shrink_active_list+0x216/0x510
<4> [419.326467]        shrink_node_memcg+0x226/0x740
<4> [419.326467]        shrink_node+0xc9/0x450
<4> [419.326467]        kswapd+0x2f8/0x8f0
<4> [419.326467]        kthread+0x119/0x130
<4> [419.326467]        ret_from_fork+0x24/0x50
<4> [419.326467] \x0a-> #1 (fs_reclaim){+.+.}:
<4> [419.326467]        kmem_cache_alloc_trace+0x2a/0x290
<4> [419.326467]        i915_gem_object_get_pages_stolen+0xa2/0x130 [i915]
<4> [419.326467]        ____i915_gem_object_get_pages+0x1d/0xa0 [i915]
<4> [419.326467]        __i915_gem_object_get_pages+0x59/0xb0 [i915]
<4> [419.326467]        _i915_gem_object_create_stolen+0xd6/0x100 [i915]
<4> [419.326467]        i915_gem_object_create_stolen+0x67/0xb0 [i915]
<4> [419.326467]        intel_engine_create_ring+0xdc/0x2b0 [i915]
<4> [419.326467]        intel_init_ring_buffer+0x38/0x130 [i915]
<4> [419.326467]        intel_engines_init+0x4f/0x120 [i915]
<4> [419.326467]        i915_gem_init+0x244/0x850 [i915]
<4> [419.329470]        i915_driver_load+0xc81/0x1520 [i915]
<4> [419.329470]        i915_pci_probe+0x29/0xa0 [i915]
<4> [419.329470]        pci_device_probe+0xa1/0x130
<4> [419.329470]        really_probe+0x25d/0x3c0
<4> [419.329470]        driver_probe_device+0x10a/0x120
<4> [419.329470]        __driver_attach+0xdb/0x100
<4> [419.329470]        bus_for_each_dev+0x74/0xc0
<4> [419.329470]        bus_add_driver+0x15f/0x250
<4> [419.329470]        driver_register+0x56/0xe0
<4> [419.329470]        do_one_initcall+0x58/0x2e0
<4> [419.329470]        do_init_module+0x56/0x1ea
<4> [419.329470]        load_module+0x26f5/0x29d0
<4> [419.329470]        __se_sys_finit_module+0xd3/0xf0
<4> [419.329470]        do_syscall_64+0x55/0x190
<4> [419.329470]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [419.329470] \x0a-> #0 (&obj->mm.lock){+.+.}:
<4> [419.329470]        __mutex_lock+0x89/0x970
<4> [419.329470]        cancel_userptr+0x19/0xf0 [i915]
<4> [419.330583]        process_one_work+0x245/0x610
<4> [419.330583]        worker_thread+0x37/0x380
<4> [419.330583]        kthread+0x119/0x130
<4> [419.330583]        ret_from_fork+0x24/0x50
<4> [419.330583] \x0aother info that might help us debug this:\x0a
<4> [419.330583] Chain exists of:\x0a  &obj->mm.lock --> (wq_completion)"i915-userptr-release" --> (work_completion)(&mo->work)\x0a
<4> [419.330583]  Possible unsafe locking scenario:\x0a
<4> [419.330583]        CPU0                    CPU1
<4> [419.330583]        ----                    ----
<4> [419.330583]   lock((work_completion)(&mo->work));
<4> [419.330583]                                lock((wq_completion)"i915-userptr-release");
<4> [419.330583]                                lock((work_completion)(&mo->work));
<4> [419.330583]   lock(&obj->mm.lock);
<4> [419.330583] \x0a *** DEADLOCK ***\x0a
<4> [419.330583] 2 locks held by kworker/u8:3/113:
<4> [419.330583]  #0: 00000000735f844e ((wq_completion)"i915-userptr-release"){+.+.}, at: process_one_work+0x1bf/0x610
<4> [419.330583]  #1: 00000000e3332cfd ((work_completion)(&mo->work)){+.+.}, at: process_one_work+0x1bf/0x610
<4> [419.330583] \x0astack backtrace:
<4> [419.330583] CPU: 3 PID: 113 Comm: kworker/u8:3 Tainted: G     U            4.19.0-rc7-gdec9886eff39-drmtip_126+ #1
<4> [419.330583] Hardware name:  /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [419.330583] Workqueue: i915-userptr-release cancel_userptr [i915]
<4> [419.330583] Call Trace:
<4> [419.330583]  dump_stack+0x67/0x9b
<4> [419.330583]  print_circular_bug.isra.16+0x1c8/0x2b0
<4> [419.330583]  __lock_acquire+0x1897/0x1b50
<4> [419.330583]  ? preempt_count_add+0x44/0x90
<4> [419.330583]  ? mark_held_locks+0x50/0x80
<4> [419.330583]  ? lock_acquire+0xa6/0x1c0
<4> [419.330583]  lock_acquire+0xa6/0x1c0
<4> [419.330583]  ? cancel_userptr+0x19/0xf0 [i915]
<4> [419.330583]  __mutex_lock+0x89/0x970
<4> [419.330583]  ? cancel_userptr+0x19/0xf0 [i915]
<4> [419.330583]  ? cancel_userptr+0x19/0xf0 [i915]
<4> [419.330583]  ? lock_acquire+0xbc/0x1c0
<4> [419.330583]  ? cancel_userptr+0x19/0xf0 [i915]
<4> [419.330583]  cancel_userptr+0x19/0xf0 [i915]
<4> [419.330583]  process_one_work+0x245/0x610
<4> [419.330583]  worker_thread+0x37/0x380
<4> [419.330583]  ? process_one_work+0x610/0x610
<4> [419.330583]  kthread+0x119/0x130
<4> [419.330583]  ? kthread_park+0x80/0x80
<4> [419.330583]  ret_from_fork+0x24/0x50
<6> [425.368078] [IGT] gem_userptr_blits: exiting, ret=0

Comment 1 Chris Wilson 2018-10-15 15:27:26 UTC

The claimed dep link (esp #0 -> #1) is very confusing. Fwiw, the key link here is the shrinker and so that path looks worth investigating with the interaction with the flush_work() in invalidate_range.

Comment 2 Francesco Balestrieri 2018-11-06 11:55:54 UTC

Raising priority due to possible connection with Bug 108456

Comment 3 Chris Wilson 2019-01-15 17:10:55 UTC

commit 484d9a844d0d0aeaa4cd3cec20885b7de9986a55 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 15 12:44:42 2019 +0000

    drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
    
    Since commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
    notifiers") we have been able to report failure from
    mmu_invalidate_range_start which allows us to use a trylock on the
    struct_mutex to avoid potential recursion and report -EBUSY instead.
    Furthermore, this allows us to pull the work into the main callback and
    avoid the sleight-of-hand in using a workqueue to avoid lockdep.
    
    However, not all paths to mmu_invalidate_range_start are prepared to
    handle failure, so instead of reporting the recursion, deal with it by
    propagating the failure upwards, who can decide themselves to handle it
    or report it.
    
    v2: Mark up the recursive lock behaviour and comment on the various weak
    points.
    
    v3: Follow commit 3824e41975ae ("drm/i915: Use mutex_lock_killable() from
    inside the shrinker") and also use mutex_lock_killable().
    v3.1: No leak on EINTR.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108375
    References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190115124442.3500-1-chris@chris-wilson.co.uk

Comment 4 CI Bug Log 2019-02-14 16:20:44 UTC

The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.