Bug 102886

Summary: [CI][SNB, HSW, APL, KBL] igt@gem_eio@* and igt@gem_userptr_blits@* - dmesg-warn: possible circular locking dependency detected
Product: DRI
Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: high
CC: intel-gfx-bugs, krisman
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: BXT, HSW, KBL, SNB
i915 features: GEM/Other

Description Marta Löfstedt 2017-09-20 06:32:54 UTC
Starting from CI_DRM_3108: 
igt@gem_eio@in-flight
igt@gem_eio@in-flight-context
Starting from CI_DRM_3107:
igt@gem_eio@in-flight-external
igt@gem_eio@in-flight-internal

dmesg-warn on:

[   40.739222] WARNING: possible circular locking dependency detected

Also, 
igt@gem_eio@in-flight-context
igt@gem_eio@in-flight-external
igt@gem_eio@in-flight-internal
appear to be new tests that fail from their first execution.

Some full data links:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3108/shard-snb1/igt@gem_eio@in-flight.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3111/shard-hsw4/igt@gem_eio@in-flight-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3111/shard-apl5/igt@gem_eio@in-flight-external.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3111/shard-kbl4/igt@gem_eio@in-flight-internal.html
Comment 1 krisman 2017-09-21 04:48:16 UTC
Looks like the culprit in this case is 81c0ed21aa91230 ("drm/i915/fence: Avoid del_timer_sync() from inside a timer"). The timing checks out.
Comment 4 Marta Löfstedt 2017-09-22 06:46:59 UTC
SNB-shards CI_DRM_3118, CI_DRM_3119: 

new test:
igt@gem_eio@throttle

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3119/shard-snb1/igt@gem_eio@throttle.html
Comment 5 Marta Löfstedt 2017-09-22 06:53:37 UTC
So far, on CI_DRM_3118 and CI_DRM_3119, the new tests:
igt@gem_eio@wait dmesg-warn SNB, HSW, APL, KBL
igt@gem_eio@execbuf dmesg-warn SNB, HSW, APL, KBL
igt@gem_eio@throttle dmesg-warn only SNB

Also, the "old" tests;  igt@gem_eio@in-flight-* are passing on all machines and should be removed from cibuglog if that trend continues.
Comment 6 Marta Löfstedt 2017-09-22 11:14:08 UTC
(In reply to Marta Löfstedt from comment #5)
> So far, on CI_DRM_3118 and CI_DRM_3119, the new tests:
> igt@gem_eio@wait dmesg-warn SNB, HSW, APL, KBL
> igt@gem_eio@execbuf dmesg-warn SNB, HSW, APL, KBL
> igt@gem_eio@throttle dmesg-warn only SNB
> 
> Also, the "old" tests, igt@gem_eio@in-flight-*, are passing on all machines
> and should be removed from cibuglog if that trend continues.

No, these should not be removed; the issue is back again on CI_DRM_3121.

see for example:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3121/shard-hsw6/igt@gem_eio@throttle.html
Comment 7 Marta Löfstedt 2017-09-26 09:18:29 UTC
*** Bug 102993 has been marked as a duplicate of this bug. ***
Comment 8 krisman 2017-10-02 05:30:11 UTC
*** Bug 102939 has been marked as a duplicate of this bug. ***
Comment 9 Marta Löfstedt 2017-10-04 11:47:48 UTC
From https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3170/shard-snb4/igt@prime_mmap@test_userptr.html

the lockdep is no longer reproduced on igt@prime_mmap@test_userptr. However, it is still reproduced by the gem_eio tests in this bug, i.e. Bug 102939 was a false duplicate.
Comment 10 Marta Löfstedt 2017-10-05 12:26:58 UTC
*** Bug 102939 has been marked as a duplicate of this bug. ***
Comment 11 Marta Löfstedt 2017-10-05 12:28:43 UTC
I just realized that lockdep warnings are warn-once. So, if other lockdep-triggering tests are scheduled in the same shard, we will miss the next one. This explains why we are seeing flip-flopping lockdeps, and it also makes this a bit more painful.
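
For reference, lockdep reports really are warn-once: the first circular-dependency splat disables further checking for the rest of the boot, so anything that would have tripped later in the same shard run stays silent. A minimal C sketch of that behaviour, with made-up names (this is not the actual lockdep implementation):

    /* Illustrative only -- the flag and function names here are assumptions. */
    static int checking_enabled = 1;        /* analogous to the kernel's debug_locks flag */

    static void report_circular_dependency(void)
    {
            if (!checking_enabled)
                    return;                 /* an earlier splat already fired this boot,
                                             * so later lock cycles go unreported */
            checking_enabled = 0;           /* the first report turns validation off */
            /* ... print the dependency chain and the stack backtrace ... */
    }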
Comment 12 Daniel Vetter 2017-10-06 07:12:37 UTC
Just to make things clear: this is for the lockdep splat going through i915_gem_set_wedged/stop_machine. Anything else is probably a different splat (even when the two locks that close the loop might be the same; unfortunately you can't go by the top few lines of a lockdep splat, you need to look at the full circle).
Comment 13 Daniel Vetter 2017-10-06 07:12:51 UTC
*** Bug 103096 has been marked as a duplicate of this bug. ***
Comment 14 Daniel Vetter 2017-10-06 07:16:03 UTC
The lockdep splat in igt@prime_mmap@test_userptr needs to be taken out of this bug here in cibuglog; it is a different kind of lockdep splat. That one goes through i915_gem_userptr_init__mmu_notifier as the critical call, not through i915_gem_set_wedged.

Marta, can you please update cibuglog to create a separate entry and a new bugzilla? I need that for my patch :-)
Comment 15 Chris Wilson 2017-10-06 08:34:13 UTC
*** Bug 102939 has been marked as a duplicate of this bug. ***
Comment 16 Marta Löfstedt 2017-10-10 05:45:47 UTC
Now also on gem_userptr_blits:

[   87.671792] ======================================================
[   87.671795] WARNING: possible circular locking dependency detected
[   87.671798] 4.14.0-rc4-CI-CI_DRM_3196+ #1 Tainted: G     U         
[   87.671800] ------------------------------------------------------
[   87.671803] gem_userptr_bli/1662 is trying to acquire lock:
[   87.671805]  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8109e777>] apply_workqueue_attrs+0x17/0x50
[   87.671813] 
               but task is already holding lock:
[   87.671816]  (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01fc08a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
[   87.671854] 
               which lock already depends on the new lock.

[   87.671857] 
               the existing dependency chain (in reverse order) is:
[   87.671860] 
               -> #6 (&dev_priv->mm_lock){+.+.}:
[   87.671865]        __lock_acquire+0x1420/0x15e0
[   87.671868]        lock_acquire+0xb0/0x200
[   87.671871]        __mutex_lock+0x86/0x9b0
[   87.671874]        mutex_lock_nested+0x1b/0x20
[   87.671892]        i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
[   87.671910]        i915_gem_userptr_ioctl+0x221/0x2c0 [i915]
[   87.671914]        drm_ioctl_kernel+0x69/0xb0
[   87.671917]        drm_ioctl+0x2f9/0x3d0
[   87.671920]        do_vfs_ioctl+0x94/0x670
[   87.671922]        SyS_ioctl+0x41/0x70
[   87.671925]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[   87.671927] 
               -> #5 (&mm->mmap_sem){++++}:
[   87.671932]        __lock_acquire+0x1420/0x15e0
[   87.671935]        lock_acquire+0xb0/0x200
[   87.671939]        __might_fault+0x68/0x90
[   87.671942]        _copy_to_user+0x23/0x70
[   87.671945]        filldir+0xa5/0x120
[   87.671948]        dcache_readdir+0xf9/0x170
[   87.671951]        iterate_dir+0x69/0x1a0
[   87.671953]        SyS_getdents+0xa5/0x140
[   87.671956]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[   87.671958] 
               -> #4 (&sb->s_type->i_mutex_key#5){++++}:
[   87.671963]        down_write+0x3b/0x70
[   87.671966]        handle_create+0xcb/0x1e0
[   87.671969]        devtmpfsd+0x139/0x180
[   87.671971]        kthread+0x152/0x190
[   87.671973]        ret_from_fork+0x27/0x40
[   87.671975] 
               -> #3 ((complete)&req.done){+.+.}:
[   87.671980]        __lock_acquire+0x1420/0x15e0
[   87.671982]        lock_acquire+0xb0/0x200
[   87.671985]        wait_for_common+0x58/0x210
[   87.671987]        wait_for_completion+0x1d/0x20
[   87.671990]        devtmpfs_create_node+0x13d/0x160
[   87.671992]        device_add+0x5eb/0x620
[   87.671995]        device_create_groups_vargs+0xe0/0xf0
[   87.671997]        device_create+0x3a/0x40
[   87.672000]        msr_device_create+0x2b/0x40
[   87.672003]        cpuhp_invoke_callback+0xc9/0xbf0
[   87.672005]        cpuhp_thread_fun+0x17b/0x240
[   87.672008]        smpboot_thread_fn+0x18a/0x280
[   87.672011]        kthread+0x152/0x190
[   87.672013]        ret_from_fork+0x27/0x40
[   87.672015] 
               -> #2 (cpuhp_state-up){+.+.}:
[   87.672020]        __lock_acquire+0x1420/0x15e0
[   87.672022]        lock_acquire+0xb0/0x200
[   87.672025]        cpuhp_issue_call+0x133/0x1c0
[   87.672027]        __cpuhp_setup_state_cpuslocked+0x139/0x2a0
[   87.672030]        __cpuhp_setup_state+0x46/0x60
[   87.672034]        page_writeback_init+0x43/0x67
[   87.672036]        pagecache_init+0x3d/0x42
[   87.672039]        start_kernel+0x3a8/0x3fc
[   87.672042]        x86_64_start_reservations+0x2a/0x2c
[   87.672045]        x86_64_start_kernel+0x6d/0x70
[   87.672049]        verify_cpu+0x0/0xfb
[   87.672050] 
               -> #1 (cpuhp_state_mutex){+.+.}:
[   87.672055]        __lock_acquire+0x1420/0x15e0
[   87.672057]        lock_acquire+0xb0/0x200
[   87.672060]        __mutex_lock+0x86/0x9b0
[   87.672062]        mutex_lock_nested+0x1b/0x20
[   87.672065]        __cpuhp_setup_state_cpuslocked+0x53/0x2a0
[   87.672067]        __cpuhp_setup_state+0x46/0x60
[   87.672070]        page_alloc_init+0x28/0x30
[   87.672072]        start_kernel+0x145/0x3fc
[   87.672075]        x86_64_start_reservations+0x2a/0x2c
[   87.672078]        x86_64_start_kernel+0x6d/0x70
[   87.672080]        verify_cpu+0x0/0xfb
[   87.672082] 
               -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[   87.672087]        check_prev_add+0x430/0x840
[   87.672089]        __lock_acquire+0x1420/0x15e0
[   87.672092]        lock_acquire+0xb0/0x200
[   87.672094]        cpus_read_lock+0x3d/0xb0
[   87.672097]        apply_workqueue_attrs+0x17/0x50
[   87.672100]        __alloc_workqueue_key+0x1d8/0x4d9
[   87.672117]        i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
[   87.672135]        i915_gem_userptr_ioctl+0x221/0x2c0 [i915]
[   87.672138]        drm_ioctl_kernel+0x69/0xb0
[   87.672140]        drm_ioctl+0x2f9/0x3d0
[   87.672143]        do_vfs_ioctl+0x94/0x670
[   87.672145]        SyS_ioctl+0x41/0x70
[   87.672148]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[   87.672150] 
               other info that might help us debug this:

[   87.672154] Chain exists of:
                 cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev_priv->mm_lock

[   87.672160]  Possible unsafe locking scenario:

[   87.672163]        CPU0                    CPU1
[   87.672165]        ----                    ----
[   87.672167]   lock(&dev_priv->mm_lock);
[   87.672169]                                lock(&mm->mmap_sem);
[   87.672172]                                lock(&dev_priv->mm_lock);
[   87.672175]   lock(cpu_hotplug_lock.rw_sem);
[   87.672178] 
                *** DEADLOCK ***

[   87.672181] 2 locks held by gem_userptr_bli/1662:
[   87.672183]  #0:  (&mm->mmap_sem){++++}, at: [<ffffffffa01fc078>] i915_gem_userptr_init__mmu_notifier+0x138/0x270 [i915]
[   87.672203]  #1:  (&dev_priv->mm_lock){+.+.}, at: [<ffffffffa01fc08a>] i915_gem_userptr_init__mmu_notifier+0x14a/0x270 [i915]
[   87.672222] 
               stack backtrace:
[   87.672226] CPU: 1 PID: 1662 Comm: gem_userptr_bli Tainted: G     U          4.14.0-rc4-CI-CI_DRM_3196+ #1
[   87.672230] Hardware name: Dell Inc. XPS 8300  /0Y2MRG, BIOS A06 10/17/2011
[   87.672232] Call Trace:
[   87.672237]  dump_stack+0x68/0x9f
[   87.672239]  print_circular_bug+0x235/0x3c0
[   87.672242]  ? lockdep_init_map_crosslock+0x20/0x20
[   87.672245]  check_prev_add+0x430/0x840
[   87.672249]  __lock_acquire+0x1420/0x15e0
[   87.672251]  ? __lock_acquire+0x1420/0x15e0
[   87.672254]  ? lockdep_init_map_crosslock+0x20/0x20
[   87.672257]  lock_acquire+0xb0/0x200
[   87.672260]  ? apply_workqueue_attrs+0x17/0x50
[   87.672263]  cpus_read_lock+0x3d/0xb0
[   87.672265]  ? apply_workqueue_attrs+0x17/0x50
[   87.672268]  apply_workqueue_attrs+0x17/0x50
[   87.672271]  __alloc_workqueue_key+0x1d8/0x4d9
[   87.672273]  ? __lockdep_init_map+0x57/0x1c0
[   87.672290]  i915_gem_userptr_init__mmu_notifier+0x1fb/0x270 [i915]
[   87.672307]  i915_gem_userptr_ioctl+0x221/0x2c0 [i915]
[   87.672324]  ? i915_gem_userptr_release+0x140/0x140 [i915]
[   87.672327]  drm_ioctl_kernel+0x69/0xb0
[   87.672330]  drm_ioctl+0x2f9/0x3d0
[   87.672345]  ? i915_gem_userptr_release+0x140/0x140 [i915]
[   87.672350]  ? retint_kernel+0x2d/0x2d
[   87.672353]  do_vfs_ioctl+0x94/0x670
[   87.672355]  ? entry_SYSCALL_64_fastpath+0x5/0xb1
[   87.672358]  ? __this_cpu_preempt_check+0x13/0x20
[   87.672361]  ? trace_hardirqs_on_caller+0xe3/0x1b0
[   87.672364]  SyS_ioctl+0x41/0x70
[   87.672367]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[   87.672370] RIP: 0033:0x7fa3ceed4587
[   87.672372] RSP: 002b:00007ffcf3841858 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   87.672375] RAX: ffffffffffffffda RBX: ffffffff81493f33 RCX: 00007fa3ceed4587
[   87.672378] RDX: 00007ffcf3841890 RSI: 00000000c0186473 RDI: 0000000000000003
[   87.672381] RBP: ffffc90000743f88 R08: 0000000000000000 R09: 00007ffcf38418d4
[   87.672384] R10: 000055ce8b79f23c R11: 0000000000000246 R12: 00007fa3786b9000
[   87.672387] R13: 0000000000000003 R14: 00000000c0186473 R15: 00007ffcf38418d4
[   87.672390]  ? __this_cpu_preempt_check+0x13/0x20


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/shard-snb4/igt@gem_userptr_blits@coherency-sync.html
Comment 18 Daniel Vetter 2017-10-10 14:58:35 UTC
Should be fixed with

commit 7741b547b6e000b08e20667bb3bef22e1a362661
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Oct 9 18:44:00 2017 +0200

    drm/i915: Preallocate our mmu notifier workequeu to unbreak cpu hotplug deadlock
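
For context, the idea of that fix is to stop calling alloc_workqueue() from the userptr ioctl path, where it ran while dev_priv->mm_lock (and the caller's mmap_sem) were held and internally took cpu_hotplug_lock, which is exactly the circle lockdep reported above. A rough C sketch of the pattern, with field names that are assumptions rather than the exact i915 code:

    /* Illustrative only -- not the actual patch; dev_priv->mm.userptr_wq is an assumed name. */
    static int userptr_init(struct drm_i915_private *dev_priv)
    {
            /* Allocate the workqueue once at driver init, before mm_lock or
             * mmap_sem can be held, so alloc_workqueue() (which takes
             * cpu_hotplug_lock) never nests inside them. */
            dev_priv->mm.userptr_wq =
                    alloc_workqueue("i915-userptr-acquire", WQ_UNBOUND, 0);
            if (!dev_priv->mm.userptr_wq)
                    return -ENOMEM;
            return 0;
    }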
Comment 19 Daniel Vetter 2017-10-10 14:58:58 UTC
Nope, this also contains the gem_eio stuff, which isn't yet fixed.
Comment 20 Daniel Vetter 2017-10-10 15:00:33 UTC
Just for tracking purposes: What's fixed already is anything with userptr in it, but not yet the gem_eio stuff.

No idea how you want to reflect this in cibuglog. I'd still say we should split the bug.
Comment 21 Daniel Vetter 2017-10-11 15:57:29 UTC
Ok, now also the fix for the locking loop around set_wedged/stop_machine has landed:

commit af7a8ffad9c58deac791333a65c62d7fc72f9e9c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Oct 11 11:10:19 2017 +0200

    drm/i915: Use rcu instead of stop_machine in set_wedged

I think with that one we can close this bug here.
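
Roughly, that change drops the stop_machine() section from set_wedged (stop_machine() is what dragged cpu_hotplug_lock into the dependency chain) and instead publishes the wedged state, then waits for concurrent submission paths with RCU. A hedged C sketch of the general pattern, using illustrative names rather than the exact commit:

    /* Illustrative only -- names are assumptions, not the real i915 code. */
    static void mark_wedged(struct drm_i915_private *i915)
    {
            /* Publish the wedged state; new submissions see the flag and bail out. */
            set_bit(I915_WEDGED, &i915->gpu_error.flags);
            smp_mb__after_atomic();

            /* Instead of stop_machine() (which needs cpu_hotplug_lock), wait for
             * in-flight readers inside rcu_read_lock() sections to drain. */
            synchronize_rcu();

            /* Now the remaining in-flight requests can be completed or cancelled. */
    }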
Comment 22 Marta Löfstedt 2017-10-12 07:29:24 UTC
Fix integrated from CI_DRM_3215. The issue has not been reproduced on BAT/CI.
