104209 – [CI] igt@gem_userptr_blits@map-fixed-invalidate-gup - dmesg-warn: WARNING: possible circular locking dependency detected

Bug 104209 - [CI] igt@gem_userptr_blits@map-fixed-invalidate-gup - dmesg-warn: WARNING: possible circular locking dependency detected

Summary: [CI] igt@gem_userptr_blits@map-fixed-invalidate-gup - dmesg-warn: WARNING: po...

Status:	CLOSED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-12-11 08:17 UTC by Marta Löfstedt
Modified:	2018-01-12 08:28 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:	HSW
i915 features:	GEM/Other

Attachments

Description Marta Löfstedt 2017-12-11 08:17:09 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3486/shard-hsw1/igt@gem_userptr_blits@map-fixed-invalidate-gup.html

[  243.384464] ======================================================
[  243.384466] WARNING: possible circular locking dependency detected
[  243.384468] 4.15.0-rc2-CI-CI_DRM_3486+ #1 Tainted: G     U          
[  243.384470] ------------------------------------------------------
[  243.384472] kworker/6:2/222 is trying to acquire lock:
[  243.384474]  ((wq_completion)"i915-userptr-release"){+.+.}, at: [<00000000d295be80>] flush_workqueue+0x172/0x500
[  243.384480] 
               but task is already holding lock:
[  243.384482]  (&mapping->i_mmap_rwsem){++++}, at: [<00000000d538f896>] unmap_mapping_range+0x52/0x110
[  243.384488] 
               which lock already depends on the new lock.

[  243.384491] 
               the existing dependency chain (in reverse order) is:
[  243.384493] 
               -> #3 (&mapping->i_mmap_rwsem){++++}:
[  243.384499]        down_write+0x33/0x60
[  243.384501]        unmap_mapping_range+0x52/0x110
[  243.384522]        i915_vma_revoke_mmap+0x7a/0x160 [i915]
[  243.384540]        fence_update+0x1a1/0x3c0 [i915]
[  243.384560]        i915_vma_unbind+0x3f8/0x780 [i915]
[  243.384578]        i915_vma_close+0x51/0x90 [i915]
[  243.384596]        __i915_gem_free_objects+0xcb/0x540 [i915]
[  243.384613]        __i915_gem_free_work+0x68/0x90 [i915]
[  243.384616]        process_one_work+0x227/0x650
[  243.384618]        worker_thread+0x48/0x3a0
[  243.384620]        kthread+0x173/0x1b0
[  243.384622]        ret_from_fork+0x24/0x30
[  243.384623] 
               -> #2 (&dev->struct_mutex){+.+.}:
[  243.384628]        __mutex_lock+0x81/0x9b0
[  243.384646]        cancel_userptr+0x7c/0xf0 [i915]
[  243.384648]        process_one_work+0x227/0x650
[  243.384650]        worker_thread+0x48/0x3a0
[  243.384652]        kthread+0x173/0x1b0
[  243.384654]        ret_from_fork+0x24/0x30
[  243.384656] 
               -> #1 ((work_completion)(&mo->work)){+.+.}:
[  243.384660]        process_one_work+0x1fa/0x650
[  243.384662]        worker_thread+0x48/0x3a0
[  243.384663]        kthread+0x173/0x1b0
[  243.384665]        ret_from_fork+0x24/0x30
[  243.384666] 
               -> #0 ((wq_completion)"i915-userptr-release"){+.+.}:
[  243.384672]        lock_acquire+0xaf/0x200
[  243.384674]        wait_for_common+0x54/0x210
[  243.384676]        flush_workqueue+0x172/0x500
[  243.384696]        i915_gem_userptr_mn_invalidate_range_start+0x133/0x190 [i915]
[  243.384699]        __mmu_notifier_invalidate_range_start+0x73/0xb0
[  243.384702]        zap_page_range_single+0xcc/0xe0
[  243.384704]        unmap_mapping_range+0xe9/0x110
[  243.384723]        __i915_gem_object_release_mmap+0xd9/0x100 [i915]
[  243.384742]        i915_gem_runtime_suspend+0x30/0xb0 [i915]
[  243.384758]        intel_runtime_suspend+0x67/0x260 [i915]
[  243.384761]        pci_pm_runtime_suspend+0x4d/0x160
[  243.384764]        __rpm_callback+0xb1/0x1e0
[  243.384766]        rpm_callback+0x1a/0x70
[  243.384768]        rpm_suspend+0x114/0x700
[  243.384770]        pm_runtime_work+0x6e/0xb0
[  243.384772]        process_one_work+0x227/0x650
[  243.384774]        worker_thread+0x48/0x3a0
[  243.384776]        kthread+0x173/0x1b0
[  243.384778]        ret_from_fork+0x24/0x30
[  243.384780] 
               other info that might help us debug this:

[  243.384783] Chain exists of:
                 (wq_completion)"i915-userptr-release" --> &dev->struct_mutex --> &mapping->i_mmap_rwsem

[  243.384789]  Possible unsafe locking scenario:

[  243.384791]        CPU0                    CPU1
[  243.384792]        ----                    ----
[  243.384794]   lock(&mapping->i_mmap_rwsem);
[  243.384796]                                lock(&dev->struct_mutex);
[  243.384798]                                lock(&mapping->i_mmap_rwsem);
[  243.384801]   lock((wq_completion)"i915-userptr-release");
[  243.384803] 
                *** DEADLOCK ***

[  243.384806] 4 locks held by kworker/6:2/222:
[  243.384807]  #0:  ((wq_completion)"pm"){+.+.}, at: [<000000001855b359>] process_one_work+0x199/0x650
[  243.384812]  #1:  ((work_completion)(&dev->power.work)){+.+.}, at: [<000000001855b359>] process_one_work+0x199/0x650
[  243.384817]  #2:  (&mapping->i_mmap_rwsem){++++}, at: [<00000000d538f896>] unmap_mapping_range+0x52/0x110
[  243.384822]  #3:  (srcu){....}, at: [<000000009d860dbe>] __mmu_notifier_invalidate_range_start+0x0/0xb0
[  243.384826] 
               stack backtrace:
[  243.384829] CPU: 6 PID: 222 Comm: kworker/6:2 Tainted: G     U           4.15.0-rc2-CI-CI_DRM_3486+ #1
[  243.384832] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
[  243.384835] Workqueue: pm pm_runtime_work
[  243.384837] Call Trace:
[  243.384841]  dump_stack+0x5f/0x86
[  243.384844]  print_circular_bug+0x230/0x3b0
[  243.384847]  check_prev_add+0x439/0x7b0
[  243.384849]  ? lockdep_init_map_crosslock+0x20/0x20
[  243.384852]  ? lockdep_init_map_crosslock+0x20/0x20
[  243.384855]  ? __lock_acquire+0x1385/0x15a0
[  243.384858]  __lock_acquire+0x1385/0x15a0
[  243.384861]  lock_acquire+0xaf/0x200
[  243.384863]  ? flush_workqueue+0x172/0x500
[  243.384866]  wait_for_common+0x54/0x210
[  243.384868]  ? flush_workqueue+0x172/0x500
[  243.384870]  ? __mutex_unlock_slowpath+0x38/0x270
[  243.384873]  flush_workqueue+0x172/0x500
[  243.384894]  ? i915_gem_userptr_mn_invalidate_range_start+0x120/0x190 [i915]
[  243.384915]  ? i915_gem_userptr_mn_invalidate_range_start+0x133/0x190 [i915]
[  243.384936]  i915_gem_userptr_mn_invalidate_range_start+0x133/0x190 [i915]
[  243.384939]  __mmu_notifier_invalidate_range_start+0x73/0xb0
[  243.384943]  ? pci_pm_runtime_resume+0x90/0x90
[  243.384945]  zap_page_range_single+0xcc/0xe0
[  243.384948]  ? unmap_mapping_range+0x52/0x110
[  243.384951]  unmap_mapping_range+0xe9/0x110
[  243.384970]  __i915_gem_object_release_mmap+0xd9/0x100 [i915]
[  243.384989]  i915_gem_runtime_suspend+0x30/0xb0 [i915]
[  243.385005]  intel_runtime_suspend+0x67/0x260 [i915]
[  243.385008]  pci_pm_runtime_suspend+0x4d/0x160
[  243.385011]  __rpm_callback+0xb1/0x1e0
[  243.385013]  rpm_callback+0x1a/0x70
[  243.385016]  ? pci_pm_runtime_resume+0x90/0x90
[  243.385018]  rpm_suspend+0x114/0x700
[  243.385020]  pm_runtime_work+0x6e/0xb0
[  243.385023]  process_one_work+0x227/0x650
[  243.385026]  worker_thread+0x48/0x3a0
[  243.385028]  kthread+0x173/0x1b0
[  243.385030]  ? process_one_work+0x650/0x650
[  243.385032]  ? _kthread_create_on_node+0x30/0x30
[  243.385034]  ret_from_fork+0x24/0x30

Comment 1 Chris Wilson 2017-12-11 09:27:03 UTC

Yet another false positive. In i915_vma_revoke_mmap, the mapping is the device-local GTT mapping, which is barred from being used as a source for userptr -- because of this deadlock.

We could overwrite the lockclass for the device local mapping, I think.

Comment 2 Jani Saarinen 2017-12-11 11:55:28 UTC

ref: https://patchwork.freedesktop.org/series/35171/

Comment 3 Daniel Vetter 2017-12-15 08:38:08 UTC

Repeating my analysis from the mail thread: This looks real:

https://www.spinics.net/lists/intel-gfx/msg150141.html

In both cases of the circle we seem to hit exactly the same gtt/drm_device mapping locks. One way to break this is to make sure we do not call flush_workqueue on a range which isn't userptr-mapped (which automatically avoids gtt mmaps, since those are not allowed for userptr, or at least should). That option still has the classification problem, and I'm honestly not sure it's actually deadlock-free.

Or we have to somehow get rid of the flush_work from the mmu_notifier callback entirely. Safest option, but I dunno how right now.

Comment 4 Daniel Vetter 2017-12-15 08:56:53 UTC

Ok, I read the code some more, and the cancel_userptr work_struct trick used to hide the real locking dependencies i915_gem_userptr_mn_invalidate_range_start() from lockdep simply doesn't work anymore with the cross-release stuff. The

queue_work(wq, work);

...

flush_workqueue(wq);

is a dead giveaway that we have the full locking dependencies of whatever runs in work. I'd say lockdep is correct here.

(I still don't know how to fix this)

Comment 5 Marta Löfstedt 2018-01-12 08:28:28 UTC

Last seen CI_DRM_3486: 2017-12-09 / 186 runs ago

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.