https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3486/shard-hsw1/igt@gem_userptr_blits@map-fixed-invalidate-gup.html [ 243.384464] ====================================================== [ 243.384466] WARNING: possible circular locking dependency detected [ 243.384468] 4.15.0-rc2-CI-CI_DRM_3486+ #1 Tainted: G U [ 243.384470] ------------------------------------------------------ [ 243.384472] kworker/6:2/222 is trying to acquire lock: [ 243.384474] ((wq_completion)"i915-userptr-release"){+.+.}, at: [<00000000d295be80>] flush_workqueue+0x172/0x500 [ 243.384480] but task is already holding lock: [ 243.384482] (&mapping->i_mmap_rwsem){++++}, at: [<00000000d538f896>] unmap_mapping_range+0x52/0x110 [ 243.384488] which lock already depends on the new lock. [ 243.384491] the existing dependency chain (in reverse order) is: [ 243.384493] -> #3 (&mapping->i_mmap_rwsem){++++}: [ 243.384499] down_write+0x33/0x60 [ 243.384501] unmap_mapping_range+0x52/0x110 [ 243.384522] i915_vma_revoke_mmap+0x7a/0x160 [i915] [ 243.384540] fence_update+0x1a1/0x3c0 [i915] [ 243.384560] i915_vma_unbind+0x3f8/0x780 [i915] [ 243.384578] i915_vma_close+0x51/0x90 [i915] [ 243.384596] __i915_gem_free_objects+0xcb/0x540 [i915] [ 243.384613] __i915_gem_free_work+0x68/0x90 [i915] [ 243.384616] process_one_work+0x227/0x650 [ 243.384618] worker_thread+0x48/0x3a0 [ 243.384620] kthread+0x173/0x1b0 [ 243.384622] ret_from_fork+0x24/0x30 [ 243.384623] -> #2 (&dev->struct_mutex){+.+.}: [ 243.384628] __mutex_lock+0x81/0x9b0 [ 243.384646] cancel_userptr+0x7c/0xf0 [i915] [ 243.384648] process_one_work+0x227/0x650 [ 243.384650] worker_thread+0x48/0x3a0 [ 243.384652] kthread+0x173/0x1b0 [ 243.384654] ret_from_fork+0x24/0x30 [ 243.384656] -> #1 ((work_completion)(&mo->work)){+.+.}: [ 243.384660] process_one_work+0x1fa/0x650 [ 243.384662] worker_thread+0x48/0x3a0 [ 243.384663] kthread+0x173/0x1b0 [ 243.384665] ret_from_fork+0x24/0x30 [ 243.384666] -> #0 ((wq_completion)"i915-userptr-release"){+.+.}: [ 243.384672] lock_acquire+0xaf/0x200 [ 243.384674] wait_for_common+0x54/0x210 [ 243.384676] flush_workqueue+0x172/0x500 [ 243.384696] i915_gem_userptr_mn_invalidate_range_start+0x133/0x190 [i915] [ 243.384699] __mmu_notifier_invalidate_range_start+0x73/0xb0 [ 243.384702] zap_page_range_single+0xcc/0xe0 [ 243.384704] unmap_mapping_range+0xe9/0x110 [ 243.384723] __i915_gem_object_release_mmap+0xd9/0x100 [i915] [ 243.384742] i915_gem_runtime_suspend+0x30/0xb0 [i915] [ 243.384758] intel_runtime_suspend+0x67/0x260 [i915] [ 243.384761] pci_pm_runtime_suspend+0x4d/0x160 [ 243.384764] __rpm_callback+0xb1/0x1e0 [ 243.384766] rpm_callback+0x1a/0x70 [ 243.384768] rpm_suspend+0x114/0x700 [ 243.384770] pm_runtime_work+0x6e/0xb0 [ 243.384772] process_one_work+0x227/0x650 [ 243.384774] worker_thread+0x48/0x3a0 [ 243.384776] kthread+0x173/0x1b0 [ 243.384778] ret_from_fork+0x24/0x30 [ 243.384780] other info that might help us debug this: [ 243.384783] Chain exists of: (wq_completion)"i915-userptr-release" --> &dev->struct_mutex --> &mapping->i_mmap_rwsem [ 243.384789] Possible unsafe locking scenario: [ 243.384791] CPU0 CPU1 [ 243.384792] ---- ---- [ 243.384794] lock(&mapping->i_mmap_rwsem); [ 243.384796] lock(&dev->struct_mutex); [ 243.384798] lock(&mapping->i_mmap_rwsem); [ 243.384801] lock((wq_completion)"i915-userptr-release"); [ 243.384803] *** DEADLOCK *** [ 243.384806] 4 locks held by kworker/6:2/222: [ 243.384807] #0: ((wq_completion)"pm"){+.+.}, at: [<000000001855b359>] process_one_work+0x199/0x650 [ 243.384812] #1: ((work_completion)(&dev->power.work)){+.+.}, at: [<000000001855b359>] process_one_work+0x199/0x650 [ 243.384817] #2: (&mapping->i_mmap_rwsem){++++}, at: [<00000000d538f896>] unmap_mapping_range+0x52/0x110 [ 243.384822] #3: (srcu){....}, at: [<000000009d860dbe>] __mmu_notifier_invalidate_range_start+0x0/0xb0 [ 243.384826] stack backtrace: [ 243.384829] CPU: 6 PID: 222 Comm: kworker/6:2 Tainted: G U 4.15.0-rc2-CI-CI_DRM_3486+ #1 [ 243.384832] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016 [ 243.384835] Workqueue: pm pm_runtime_work [ 243.384837] Call Trace: [ 243.384841] dump_stack+0x5f/0x86 [ 243.384844] print_circular_bug+0x230/0x3b0 [ 243.384847] check_prev_add+0x439/0x7b0 [ 243.384849] ? lockdep_init_map_crosslock+0x20/0x20 [ 243.384852] ? lockdep_init_map_crosslock+0x20/0x20 [ 243.384855] ? __lock_acquire+0x1385/0x15a0 [ 243.384858] __lock_acquire+0x1385/0x15a0 [ 243.384861] lock_acquire+0xaf/0x200 [ 243.384863] ? flush_workqueue+0x172/0x500 [ 243.384866] wait_for_common+0x54/0x210 [ 243.384868] ? flush_workqueue+0x172/0x500 [ 243.384870] ? __mutex_unlock_slowpath+0x38/0x270 [ 243.384873] flush_workqueue+0x172/0x500 [ 243.384894] ? i915_gem_userptr_mn_invalidate_range_start+0x120/0x190 [i915] [ 243.384915] ? i915_gem_userptr_mn_invalidate_range_start+0x133/0x190 [i915] [ 243.384936] i915_gem_userptr_mn_invalidate_range_start+0x133/0x190 [i915] [ 243.384939] __mmu_notifier_invalidate_range_start+0x73/0xb0 [ 243.384943] ? pci_pm_runtime_resume+0x90/0x90 [ 243.384945] zap_page_range_single+0xcc/0xe0 [ 243.384948] ? unmap_mapping_range+0x52/0x110 [ 243.384951] unmap_mapping_range+0xe9/0x110 [ 243.384970] __i915_gem_object_release_mmap+0xd9/0x100 [i915] [ 243.384989] i915_gem_runtime_suspend+0x30/0xb0 [i915] [ 243.385005] intel_runtime_suspend+0x67/0x260 [i915] [ 243.385008] pci_pm_runtime_suspend+0x4d/0x160 [ 243.385011] __rpm_callback+0xb1/0x1e0 [ 243.385013] rpm_callback+0x1a/0x70 [ 243.385016] ? pci_pm_runtime_resume+0x90/0x90 [ 243.385018] rpm_suspend+0x114/0x700 [ 243.385020] pm_runtime_work+0x6e/0xb0 [ 243.385023] process_one_work+0x227/0x650 [ 243.385026] worker_thread+0x48/0x3a0 [ 243.385028] kthread+0x173/0x1b0 [ 243.385030] ? process_one_work+0x650/0x650 [ 243.385032] ? _kthread_create_on_node+0x30/0x30 [ 243.385034] ret_from_fork+0x24/0x30
Yet another false positive. In i915_vma_revoke_mmap, the mapping is the device-local GTT mapping, which is barred from being used as a source for userptr -- because of this deadlock. We could overwrite the lockclass for the device local mapping, I think.
ref: https://patchwork.freedesktop.org/series/35171/
Repeating my analysis from the mail thread: This looks real: https://www.spinics.net/lists/intel-gfx/msg150141.html In both cases of the circle we seem to hit exactly the same gtt/drm_device mapping locks. One way to break this is to make sure we do not call flush_workqueue on a range which isn't userptr-mapped (which automatically avoids gtt mmaps, since those are not allowed for userptr, or at least should). That option still has the classification problem, and I'm honestly not sure it's actually deadlock-free. Or we have to somehow get rid of the flush_work from the mmu_notifier callback entirely. Safest option, but I dunno how right now.
Ok, I read the code some more, and the cancel_userptr work_struct trick used to hide the real locking dependencies i915_gem_userptr_mn_invalidate_range_start() from lockdep simply doesn't work anymore with the cross-release stuff. The queue_work(wq, work); ... flush_workqueue(wq); is a dead giveaway that we have the full locking dependencies of whatever runs in work. I'd say lockdep is correct here. (I still don't know how to fix this)
Last seen CI_DRM_3486: 2017-12-09 / 186 runs ago
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.