Summary: | [CI][SNB, HSW, APL, KBL] igt@gem_eio@* and igt@gem_userptr_blits@* - dmesg-warn: possible circular locking dependency detected | |
---|---|---|---
Product: | DRI | Reporter: | Marta Löfstedt <marta.lofstedt>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | high | CC: | intel-gfx-bugs, krisman
Version: | DRI git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | ReadyForDev | |
i915 platform: | BXT, HSW, KBL, SNB | i915 features: | GEM/Other
Description
Marta Löfstedt
2017-09-20 06:32:54 UTC
Looks like the culprit in this case is 81c0ed21aa91230 ("drm/i915/fence: Avoid del_timer_sync() from inside a timer"); the timing checks out.

Also on igt@gem_eio@wait:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/shard-hsw4/igt@gem_eio@wait.html

HSW- and SNB-shards, CI_DRM_3118 and CI_DRM_3119, new test igt@gem_eio@execbuf:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/shard-snb2/igt@gem_eio@execbuf.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/shard-hsw1/igt@gem_eio@execbuf.html

SNB-shards, CI_DRM_3118 and CI_DRM_3119, new test igt@gem_eio@throttle:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/shard-snb1/igt@gem_eio@throttle.html

So far, on CI_DRM_3118 and CI_DRM_3119, the new tests show:
igt@gem_eio@wait - dmesg-warn on SNB, HSW, APL, KBL
igt@gem_eio@execbuf - dmesg-warn on SNB, HSW, APL, KBL
igt@gem_eio@throttle - dmesg-warn only on SNB

Also, the "old" tests igt@gem_eio@in-flight-* are passing on all machines and should be removed from cibuglog if that trend continues.

(In reply to Marta Löfstedt from comment #5)
> So far, CI_DRM_3118 and CI_DRM_3119 the new tests:
> igt@gem_eio@wait dmesg-warn SNB, HSW, APL, KBL
> igt@gem_eio@execbuf dmesg-warn SNB, HSW, APL, KBL
> igt@gem_eio@throttle dmesg-warn only SNB
>
> Also, the "old" tests; igt@gem_eio@in-flight-* are passing on all machines
> and should be removed from cibuglog if that trend continues.

No, you should not remove these; the issue is back again on CI_DRM_3121, see for example:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3121/shard-hsw6/igt@gem_eio@throttle.html

*** Bug 102993 has been marked as a duplicate of this bug. ***

*** Bug 102939 has been marked as a duplicate of this bug. ***

From https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3170/shard-snb4/igt@prime_mmap@test_userptr.html the lockdep is no longer reproduced on igt@prime_mmap@test_userptr. However, it is still reproduced by the gem_eio tests in this bug, i.e. Bug 102939 was a false duplicate.

*** Bug 102939 has been marked as a duplicate of this bug. ***

I just realized that lockdep splats are warn-once. So, if other lockdep-triggering tests are scheduled in the same shard, we will miss the next one. This explains why we are seeing flip-flopping lockdeps, and it also makes this a bit more painful.

Just to make things clear: this bug is for the lockdep splat going through i915_gem_set_wedged/stop_machine. Anything else is probably a different splat (even when the two locks that close the loop are the same; unfortunately you can't go by the top few lines of a lockdep splat, you need to look at the full circle, as sketched below).

*** Bug 103096 has been marked as a duplicate of this bug. ***

The lockdep splat in igt@prime_mmap@test_userptr needs to be taken out of this bug here in cibuglog; it is a different kind of lockdep splat. That one goes through i915_gem_userptr_init__mmu_notifier as the critical call, but not through i915_gem_set_wedged. Marta, can you please update cibuglog to create a separate entry and a new bugzilla? I need that for my patch :-)

*** Bug 102939 has been marked as a duplicate of this bug. ***
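To illustrate the "full circle" point above: lockdep does not need any single code path to take all the locks at once; it stitches together lock orderings observed on independent paths and warns when they form a cycle. Below is a purely schematic sketch with made-up lock names (not i915 code) showing how three individually harmless paths close a circle.

```c
/*
 * Schematic illustration only; lock_a/b/c are hypothetical and this is
 * not i915 code.  Each function is deadlock-free on its own, but together
 * they establish the orderings a -> b, b -> c and c -> a, so lockdep
 * reports "possible circular locking dependency detected" the first time
 * the last edge is observed.
 */
#include <linux/mutex.h>

static DEFINE_MUTEX(lock_a);
static DEFINE_MUTEX(lock_b);
static DEFINE_MUTEX(lock_c);

static void path_one(void)
{
	mutex_lock(&lock_a);	/* records the ordering a -> b */
	mutex_lock(&lock_b);
	mutex_unlock(&lock_b);
	mutex_unlock(&lock_a);
}

static void path_two(void)
{
	mutex_lock(&lock_b);	/* records the ordering b -> c */
	mutex_lock(&lock_c);
	mutex_unlock(&lock_c);
	mutex_unlock(&lock_b);
}

static void path_three(void)
{
	mutex_lock(&lock_c);	/* records c -> a and closes the circle */
	mutex_lock(&lock_a);
	mutex_unlock(&lock_a);
	mutex_unlock(&lock_c);
}
```

The splats tracked in this bug have the same shape, just with a much longer chain, which is why two reports ending on the same pair of locks can still be entirely different bugs.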
Now also on gem_userptr_blits:

[ 87.671792] ======================================================
[ 87.671795] WARNING: possible circular locking dependency detected
[ 87.671798] 4.14.0-rc4-CI-CI_DRM_3196+ #1 Tainted: G U
[ 87.671800] ------------------------------------------------------
[ 87.671803] gem_userptr_bli/1662 is trying to acquire lock:
[ 87.671805] (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8109e777>] apply_workqueue_attrs+0x17/0x50
[ 87.671813] but task is already holding lock:
[ 87.671816] (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01fc08a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
[ 87.671854] which lock already depends on the new lock.
[ 87.671857] the existing dependency chain (in reverse order) is:
[ 87.671860] -> #6 (&dev_priv->mm_lock){+.+.}:
[ 87.671865]        __lock_acquire+0x1420/0x15e0
[ 87.671868]        lock_acquire+0xb0/0x200
[ 87.671871]        __mutex_lock+0x86/0x9b0
[ 87.671874]        mutex_lock_nested+0x1b/0x20
[ 87.671892]        i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
[ 87.671910]        i915_gem_userptr_ioctl+0x221/0x2c0 [i915]
[ 87.671914]        drm_ioctl_kernel+0x69/0xb0
[ 87.671917]        drm_ioctl+0x2f9/0x3d0
[ 87.671920]        do_vfs_ioctl+0x94/0x670
[ 87.671922]        SyS_ioctl+0x41/0x70
[ 87.671925]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 87.671927] -> #5 (&mm->mmap_sem){++++}:
[ 87.671932]        __lock_acquire+0x1420/0x15e0
[ 87.671935]        lock_acquire+0xb0/0x200
[ 87.671939]        __might_fault+0x68/0x90
[ 87.671942]        _copy_to_user+0x23/0x70
[ 87.671945]        filldir+0xa5/0x120
[ 87.671948]        dcache_readdir+0xf9/0x170
[ 87.671951]        iterate_dir+0x69/0x1a0
[ 87.671953]        SyS_getdents+0xa5/0x140
[ 87.671956]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 87.671958] -> #4 (&sb->s_type->i_mutex_key#5){++++}:
[ 87.671963]        down_write+0x3b/0x70
[ 87.671966]        handle_create+0xcb/0x1e0
[ 87.671969]        devtmpfsd+0x139/0x180
[ 87.671971]        kthread+0x152/0x190
[ 87.671973]        ret_from_fork+0x27/0x40
[ 87.671975] -> #3 ((complete)&req.done){+.+.}:
[ 87.671980]        __lock_acquire+0x1420/0x15e0
[ 87.671982]        lock_acquire+0xb0/0x200
[ 87.671985]        wait_for_common+0x58/0x210
[ 87.671987]        wait_for_completion+0x1d/0x20
[ 87.671990]        devtmpfs_create_node+0x13d/0x160
[ 87.671992]        device_add+0x5eb/0x620
[ 87.671995]        device_create_groups_vargs+0xe0/0xf0
[ 87.671997]        device_create+0x3a/0x40
[ 87.672000]        msr_device_create+0x2b/0x40
[ 87.672003]        cpuhp_invoke_callback+0xc9/0xbf0
[ 87.672005]        cpuhp_thread_fun+0x17b/0x240
[ 87.672008]        smpboot_thread_fn+0x18a/0x280
[ 87.672011]        kthread+0x152/0x190
[ 87.672013]        ret_from_fork+0x27/0x40
[ 87.672015] -> #2 (cpuhp_state-up){+.+.}:
[ 87.672020]        __lock_acquire+0x1420/0x15e0
[ 87.672022]        lock_acquire+0xb0/0x200
[ 87.672025]        cpuhp_issue_call+0x133/0x1c0
[ 87.672027]        __cpuhp_setup_state_cpuslocked+0x139/0x2a0
[ 87.672030]        __cpuhp_setup_state+0x46/0x60
[ 87.672034]        page_writeback_init+0x43/0x67
[ 87.672036]        pagecache_init+0x3d/0x42
[ 87.672039]        start_kernel+0x3a8/0x3fc
[ 87.672042]        x86_64_start_reservations+0x2a/0x2c
[ 87.672045]        x86_64_start_kernel+0x6d/0x70
[ 87.672049]        verify_cpu+0x0/0xfb
[ 87.672050] -> #1 (cpuhp_state_mutex){+.+.}:
[ 87.672055]        __lock_acquire+0x1420/0x15e0
[ 87.672057]        lock_acquire+0xb0/0x200
[ 87.672060]        __mutex_lock+0x86/0x9b0
[ 87.672062]        mutex_lock_nested+0x1b/0x20
[ 87.672065]        __cpuhp_setup_state_cpuslocked+0x53/0x2a0
[ 87.672067]        __cpuhp_setup_state+0x46/0x60
[ 87.672070]        page_alloc_init+0x28/0x30
[ 87.672072]        start_kernel+0x145/0x3fc
[ 87.672075]        x86_64_start_reservations+0x2a/0x2c
[ 87.672078]        x86_64_start_kernel+0x6d/0x70
[ 87.672080]        verify_cpu+0x0/0xfb
[ 87.672082] -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[ 87.672087]        check_prev_add+0x430/0x840
[ 87.672089]        __lock_acquire+0x1420/0x15e0
[ 87.672092]        lock_acquire+0xb0/0x200
[ 87.672094]        cpus_read_lock+0x3d/0xb0
[ 87.672097]        apply_workqueue_attrs+0x17/0x50
[ 87.672100]        __alloc_workqueue_key+0x1d8/0x4d9
[ 87.672117]        i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
[ 87.672135]        i915_gem_userptr_ioctl+0x221/0x2c0 [i915]
[ 87.672138]        drm_ioctl_kernel+0x69/0xb0
[ 87.672140]        drm_ioctl+0x2f9/0x3d0
[ 87.672143]        do_vfs_ioctl+0x94/0x670
[ 87.672145]        SyS_ioctl+0x41/0x70
[ 87.672148]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 87.672150] other info that might help us debug this:
[ 87.672154] Chain exists of: cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev_priv->mm_lock
[ 87.672160] Possible unsafe locking scenario:
[ 87.672163]        CPU0                    CPU1
[ 87.672165]        ----                    ----
[ 87.672167]   lock(&dev_priv->mm_lock);
[ 87.672169]                               lock(&mm->mmap_sem);
[ 87.672172]                               lock(&dev_priv->mm_lock);
[ 87.672175]   lock(cpu_hotplug_lock.rw_sem);
[ 87.672178] *** DEADLOCK ***
[ 87.672181] 2 locks held by gem_userptr_bli/1662:
[ 87.672183] #0: (&mm->mmap_sem){++++}, at: [<ffffffffa01fc078>] i915_gem_userptr_init__mmu_notifier+0x138/0x270 [i915]
[ 87.672203] #1: (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01fc08a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
[ 87.672222] stack backtrace:
[ 87.672226] CPU: 1 PID: 1662 Comm: gem_userptr_bli Tainted: G U 4.14.0-rc4-CI-CI_DRM_3196+ #1
[ 87.672230] Hardware name: Dell Inc. XPS 8300 /0Y2MRG, BIOS A06 10/17/2011
[ 87.672232] Call Trace:
[ 87.672237]  dump_stack+0x68/0x9f
[ 87.672239]  print_circular_bug+0x235/0x3c0
[ 87.672242]  ? lockdep_init_map_crosslock+0x20/0x20
[ 87.672245]  check_prev_add+0x430/0x840
[ 87.672249]  __lock_acquire+0x1420/0x15e0
[ 87.672251]  ? __lock_acquire+0x1420/0x15e0
[ 87.672254]  ? lockdep_init_map_crosslock+0x20/0x20
[ 87.672257]  lock_acquire+0xb0/0x200
[ 87.672260]  ? apply_workqueue_attrs+0x17/0x50
[ 87.672263]  cpus_read_lock+0x3d/0xb0
[ 87.672265]  ? apply_workqueue_attrs+0x17/0x50
[ 87.672268]  apply_workqueue_attrs+0x17/0x50
[ 87.672271]  __alloc_workqueue_key+0x1d8/0x4d9
[ 87.672273]  ? __lockdep_init_map+0x57/0x1c0
[ 87.672290]  i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
[ 87.672307]  i915_gem_userptr_ioctl+0x221/0x2c0 [i915]
[ 87.672324]  ? i915_gem_userptr_release+0x140/0x140 [i915]
[ 87.672327]  drm_ioctl_kernel+0x69/0xb0
[ 87.672330]  drm_ioctl+0x2f9/0x3d0
[ 87.672345]  ? i915_gem_userptr_release+0x140/0x140 [i915]
[ 87.672350]  ? retint_kernel+0x2d/0x2d
[ 87.672353]  do_vfs_ioctl+0x94/0x670
[ 87.672355]  ? entry_SYSCALL_64_fastpath+0x5/0xb1
[ 87.672358]  ? __this_cpu_preempt_check+0x13/0x20
[ 87.672361]  ? trace_hardirqs_on_caller+0xe3/0x1b0
[ 87.672364]  SyS_ioctl+0x41/0x70
[ 87.672367]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 87.672370] RIP: 0033:0x7fa3ceed4587
[ 87.672372] RSP: 002b:00007ffcf3841858 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 87.672375] RAX: ffffffffffffffda RBX: ffffffff81493f33 RCX: 00007fa3ceed4587
[ 87.672378] RDX: 00007ffcf3841890 RSI: 00000000c0186473 RDI: 0000000000000003
[ 87.672381] RBP: ffffc90000743f88 R08: 0000000000000000 R09: 00007ffcf38418d4
[ 87.672384] R10: 000055ce8b79f23c R11: 0000000000000246 R12: 00007fa3786b9000
[ 87.672387] R13: 0000000000000003 R14: 00000000c0186473 R15: 00007ffcf38418d4
[ 87.672390]  ? __this_cpu_preempt_check+0x13/0x20
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/shard-snb4/igt@gem_userptr_blits@coherency-sync.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3200/shard-hsw6/igt@gem_userptr_blits@create-destroy-sync.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3200/shard-apl3/igt@gem_userptr_blits@dmabuf-sync.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3200/shard-kbl1/igt@gem_userptr_blits@process-exit.html

Should be fixed with:

commit 7741b547b6e000b08e20667bb3bef22e1a362661
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Oct 9 18:44:00 2017 +0200

    drm/i915: Preallocate our mmu notifier workequeu to unbreak cpu hotplug deadlock

Nope, this bug also covers the gem_eio stuff, which isn't yet fixed. Just for tracking purposes: what's fixed already is anything with userptr in it, but not yet the gem_eio stuff. No idea how you want to reflect this in cibuglog; I'd still say we should split the bug.

Ok, now the fix for the locking loop around set_wedged/stop_machine has also landed:

commit af7a8ffad9c58deac791333a65c62d7fc72f9e9c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Oct 11 11:10:19 2017 +0200

    drm/i915: Use rcu instead of stop_machine in set_wedged

I think with that one we can close this bug here.

Fix integrated from CI_DRM_3215. The issue has not been reproduced on BAT/CI.
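For reference, a rough sketch of the idea behind the first fix above: as the backtrace shows, __alloc_workqueue_key() reaches apply_workqueue_attrs(), which takes cpu_hotplug_lock, so the workqueue must not be allocated while dev_priv->mm_lock (itself nested under mmap_sem) is held. Preallocating it at init time removes the mm_lock -> cpu_hotplug_lock edge that closed the circle. The structure and function names below are placeholders made up for illustration; this is not the code from commit 7741b547b6e0.

```c
/*
 * Illustrative sketch only (placeholder names, not the actual i915 change):
 * allocate the workqueue once, up front, while no GEM locks are held, and
 * only consume the preallocated workqueue from the locked ioctl path.
 */
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct userptr_state {
	struct mutex mm_lock;
	struct workqueue_struct *wq;	/* preallocated, never created under mm_lock */
};

/* Called from driver load, outside any mmap_sem/mm_lock nesting. */
static int userptr_state_init(struct userptr_state *s)
{
	mutex_init(&s->mm_lock);
	s->wq = alloc_workqueue("userptr-acquire", WQ_UNBOUND, 0);
	return s->wq ? 0 : -ENOMEM;
}

static void userptr_state_fini(struct userptr_state *s)
{
	destroy_workqueue(s->wq);
}

/* Called from the ioctl path with mmap_sem already held. */
static void userptr_register(struct userptr_state *s)
{
	mutex_lock(&s->mm_lock);
	/* Only use the preallocated s->wq here; no alloc_workqueue() call,
	 * so cpu_hotplug_lock is never taken inside mm_lock. */
	mutex_unlock(&s->mm_lock);
}
```

With this layout the ioctl path never creates the problematic lock ordering, which matches the deadlock scenario lockdep printed above.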