Bug 94350

Summary: [BAT BSW] lockdep splat due to stop_machine() in ggtt pte programming
Product: DRI
Reporter: Ville Syrjala <ville.syrjala>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: medium
CC: daniel, intel-gfx-bugs, joonas.lahtinen
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: BSW/CHT
i915 features: GEM/PPGTT

Description Ville Syrjala 2016-03-01 15:53:53 UTC
[  179.762854] ======================================================
[  179.762855] [ INFO: possible circular locking dependency detected ]
[  179.762860] 4.5.0-rc6-gfxbench+ #1 Tainted: G     U         
[  179.762861] -------------------------------------------------------
[  179.762863] rtcwake/5995 is trying to acquire lock:
[  179.762877]  (s_active#6){++++.+}, at: [<ffffffff8124ec70>] kernfs_remove_by_name_ns+0x40/0xa0
[  179.762878] 
but task is already holding lock:
[  179.762885]  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81078c4d>] cpu_hotplug_begin+0x6d/0xc0
[  179.762886] 
which lock already depends on the new lock.

[  179.762887] 
the existing dependency chain (in reverse order) is:
[  179.762891] 
-> #3 (cpu_hotplug.lock){+.+.+.}:
[  179.762895]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.762900]        [<ffffffff817be602>] mutex_lock_nested+0x62/0x3b0
[  179.762903]        [<ffffffff81078911>] get_online_cpus+0x61/0x80
[  179.762907]        [<ffffffff81117f1b>] stop_machine+0x1b/0xe0
[  179.762956]        [<ffffffffa01386bd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[  179.762991]        [<ffffffffa013cc36>] ggtt_bind_vma+0x46/0x70 [i915]
[  179.763027]        [<ffffffffa013e439>] i915_vma_bind+0x109/0x250 [i915]
[  179.763064]        [<ffffffffa0145d07>] i915_gem_object_do_pin+0x897/0xb00 [i915]
[  179.763100]        [<ffffffffa0145f98>] i915_gem_object_pin+0x28/0x30 [i915]
[  179.763138]        [<ffffffffa0159c1e>] intel_init_pipe_control+0xbe/0x210 [i915]
[  179.763176]        [<ffffffffa0156ce2>] intel_logical_rings_init+0xe2/0xde0 [i915]
[  179.763213]        [<ffffffffa0146a33>] i915_gem_init+0xf3/0x130 [i915]
[  179.763252]        [<ffffffffa01c9b07>] i915_driver_load+0xf47/0x1790 [i915]
[  179.763257]        [<ffffffff81515f44>] drm_dev_register+0xa4/0xb0
[  179.763260]        [<ffffffff8151814e>] drm_get_pci_dev+0xce/0x1e0
[  179.763292]        [<ffffffffa01062cf>] i915_pci_probe+0x2f/0x50 [i915]
[  179.763297]        [<ffffffff814430f7>] pci_device_probe+0x87/0xf0
[  179.763302]        [<ffffffff81539a39>] driver_probe_device+0x229/0x450
[  179.763305]        [<ffffffff81539ce3>] __driver_attach+0x83/0x90
[  179.763308]        [<ffffffff81537711>] bus_for_each_dev+0x61/0xa0
[  179.763311]        [<ffffffff81539329>] driver_attach+0x19/0x20
[  179.763314]        [<ffffffff81538e0f>] bus_add_driver+0x1ef/0x290
[  179.763317]        [<ffffffff8153a9ab>] driver_register+0x5b/0xe0
[  179.763320]        [<ffffffff8144202b>] __pci_register_driver+0x5b/0x60
[  179.763323]        [<ffffffff81518336>] drm_pci_init+0xd6/0x100
[  179.763326]        [<ffffffffa023c094>] 0xffffffffa023c094
[  179.763331]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[  179.763335]        [<ffffffff8115a5a5>] do_init_module+0x5b/0x1c6
[  179.763338]        [<ffffffff81106f80>] load_module+0x1c20/0x2490
[  179.763342]        [<ffffffff811079de>] SyS_finit_module+0x7e/0xa0
[  179.763346]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763349] 
-> #2 (&dev->struct_mutex){+.+.+.}:
[  179.763352]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763355]        [<ffffffff81511487>] drm_gem_mmap+0x1c7/0x270
[  179.763360]        [<ffffffff81197ec4>] mmap_region+0x334/0x580
[  179.763363]        [<ffffffff81198474>] do_mmap+0x364/0x410
[  179.763366]        [<ffffffff8117c65d>] vm_mmap_pgoff+0x6d/0xa0
[  179.763370]        [<ffffffff811965a4>] SyS_mmap_pgoff+0x184/0x220
[  179.763373]        [<ffffffff8100a1ed>] SyS_mmap+0x1d/0x20
[  179.763377]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763380] 
-> #1 (&mm->mmap_sem){++++++}:
[  179.763383]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763386]        [<ffffffff8118d0e5>] __might_fault+0x75/0xa0
[  179.763389]        [<ffffffff8124f67a>] kernfs_fop_write+0x8a/0x180
[  179.763393]        [<ffffffff811d2653>] __vfs_write+0x23/0xe0
[  179.763396]        [<ffffffff811d33b4>] vfs_write+0xa4/0x190
[  179.763398]        [<ffffffff811d4254>] SyS_write+0x44/0xb0
[  179.763402]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763406] 
-> #0 (s_active#6){++++.+}:
[  179.763409]        [<ffffffff810cc659>] __lock_acquire+0x1fc9/0x20f0
[  179.763411]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763414]        [<ffffffff8124dca0>] __kernfs_remove+0x210/0x2f0
[  179.763417]        [<ffffffff8124ec70>] kernfs_remove_by_name_ns+0x40/0xa0
[  179.763420]        [<ffffffff81250610>] sysfs_remove_file_ns+0x10/0x20
[  179.763423]        [<ffffffff81535384>] device_del+0x124/0x250
[  179.763426]        [<ffffffff815354c9>] device_unregister+0x19/0x60
[  179.763430]        [<ffffffff8153fb61>] cpu_cache_sysfs_exit+0x51/0xb0
[  179.763433]        [<ffffffff81540138>] cacheinfo_cpu_callback+0x38/0x70
[  179.763437]        [<ffffffff8109b499>] notifier_call_chain+0x39/0xa0
[  179.763439]        [<ffffffff8109b509>] __raw_notifier_call_chain+0x9/0x10
[  179.763442]        [<ffffffff81078b2e>] cpu_notify+0x1e/0x40
[  179.763445]        [<ffffffff81078bc9>] cpu_notify_nofail+0x9/0x20
[  179.763448]        [<ffffffff81078f13>] _cpu_down+0x233/0x340
[  179.763451]        [<ffffffff81079469>] disable_nonboot_cpus+0xc9/0x380
[  179.763455]        [<ffffffff810d36fe>] suspend_devices_and_enter+0x58e/0xbb0
[  179.763458]        [<ffffffff810d42ec>] pm_suspend+0x5cc/0x970
[  179.763461]        [<ffffffff810d2447>] state_store+0x77/0xe0
[  179.763465]        [<ffffffff813fdc1f>] kobj_attr_store+0xf/0x20
[  179.763468]        [<ffffffff81250370>] sysfs_kf_write+0x40/0x50
[  179.763470]        [<ffffffff8124f72c>] kernfs_fop_write+0x13c/0x180
[  179.763473]        [<ffffffff811d2653>] __vfs_write+0x23/0xe0
[  179.763476]        [<ffffffff811d33b4>] vfs_write+0xa4/0x190
[  179.763478]        [<ffffffff811d4254>] SyS_write+0x44/0xb0
[  179.763482]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763483] 
other info that might help us debug this:

[  179.763488] Chain exists of:
  s_active#6 --> &dev->struct_mutex --> cpu_hotplug.lock

[  179.763489]  Possible unsafe locking scenario:

[  179.763490]        CPU0                    CPU1
[  179.763491]        ----                    ----
[  179.763493]   lock(cpu_hotplug.lock);
[  179.763495]                                lock(&dev->struct_mutex);
[  179.763497]                                lock(cpu_hotplug.lock);
[  179.763500]   lock(s_active#6);
[  179.763501] 
 *** DEADLOCK ***

[  179.763502] 8 locks held by rtcwake/5995:
[  179.763509]  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffff811d69e4>] __sb_start_write+0xd4/0xf0
[  179.763515]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8124f651>] kernfs_fop_write+0x61/0x180
[  179.763521]  #2:  (s_active#118){.+.+.+}, at: [<ffffffff8124f659>] kernfs_fop_write+0x69/0x180
[  179.763527]  #3:  (pm_mutex){+.+...}, at: [<ffffffff810d3fbe>] pm_suspend+0x29e/0x970
[  179.763534]  #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffff814762fb>] acpi_scan_lock_acquire+0x12/0x14
[  179.763539]  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810793c4>] disable_nonboot_cpus+0x24/0x380
[  179.763545]  #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff81078be0>] cpu_hotplug_begin+0x0/0xc0
[  179.763550]  #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81078c4d>] cpu_hotplug_begin+0x6d/0xc0
[  179.763551] 
stack backtrace:
[  179.763554] CPU: 0 PID: 5995 Comm: rtcwake Tainted: G     U          4.5.0-rc6-gfxbench+ #1
[  179.763556] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[  179.763561]  0000000000000000 ffff880179833850 ffffffff813fba95 ffffffff825e0190
[  179.763565]  ffffffff825a1220 ffff880179833890 ffffffff810c8cac ffff8801798338f0
[  179.763569]  ffff88007867aec0 ffff88007867a580 0000000000000008 ffff88007867aee8
[  179.763570] Call Trace:
[  179.763573]  [<ffffffff813fba95>] dump_stack+0x67/0x92
[  179.763576]  [<ffffffff810c8cac>] print_circular_bug+0x1fc/0x310
[  179.763578]  [<ffffffff810cc659>] __lock_acquire+0x1fc9/0x20f0
[  179.763581]  [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763584]  [<ffffffff8124ec70>] ? kernfs_remove_by_name_ns+0x40/0xa0
[  179.763587]  [<ffffffff8124dca0>] __kernfs_remove+0x210/0x2f0
[  179.763589]  [<ffffffff8124ec70>] ? kernfs_remove_by_name_ns+0x40/0xa0
[  179.763591]  [<ffffffff8124deb8>] ? kernfs_find_ns+0x78/0x130
[  179.763594]  [<ffffffff8124ec70>] kernfs_remove_by_name_ns+0x40/0xa0
[  179.763597]  [<ffffffff81250610>] sysfs_remove_file_ns+0x10/0x20
[  179.763599]  [<ffffffff81535384>] device_del+0x124/0x250
[  179.763601]  [<ffffffff810ca3cd>] ? trace_hardirqs_on+0xd/0x10
[  179.763603]  [<ffffffff815354c9>] device_unregister+0x19/0x60
[  179.763606]  [<ffffffff8153fb61>] cpu_cache_sysfs_exit+0x51/0xb0
[  179.763608]  [<ffffffff81540138>] cacheinfo_cpu_callback+0x38/0x70
[  179.763610]  [<ffffffff8109b499>] notifier_call_chain+0x39/0xa0
[  179.763613]  [<ffffffff8109b509>] __raw_notifier_call_chain+0x9/0x10
[  179.763615]  [<ffffffff81078b2e>] cpu_notify+0x1e/0x40
[  179.763617]  [<ffffffff81078bc9>] cpu_notify_nofail+0x9/0x20
[  179.763620]  [<ffffffff81078f13>] _cpu_down+0x233/0x340
[  179.763623]  [<ffffffff810e4940>] ? __call_rcu.constprop.61+0x2f0/0x2f0
[  179.763625]  [<ffffffff810e49a0>] ? call_rcu_bh+0x20/0x20
[  179.763628]  [<ffffffff810e0430>] ? trace_raw_output_rcu_utilization+0x60/0x60
[  179.763632]  [<ffffffff810e0430>] ? trace_raw_output_rcu_utilization+0x60/0x60
[  179.763635]  [<ffffffff81079469>] disable_nonboot_cpus+0xc9/0x380
[  179.763638]  [<ffffffff810d36fe>] suspend_devices_and_enter+0x58e/0xbb0
[  179.763641]  [<ffffffff810c7939>] ? __lock_is_held+0x49/0x70
[  179.763643]  [<ffffffff810d42ec>] pm_suspend+0x5cc/0x970
[  179.763646]  [<ffffffff810d2447>] state_store+0x77/0xe0
[  179.763648]  [<ffffffff813fdc1f>] kobj_attr_store+0xf/0x20
[  179.763651]  [<ffffffff81250370>] sysfs_kf_write+0x40/0x50
[  179.763653]  [<ffffffff8124f72c>] kernfs_fop_write+0x13c/0x180
[  179.763656]  [<ffffffff811d2653>] __vfs_write+0x23/0xe0
[  179.763659]  [<ffffffff810c6332>] ? percpu_down_read+0x52/0x90
[  179.763661]  [<ffffffff811d69e4>] ? __sb_start_write+0xd4/0xf0
[  179.763663]  [<ffffffff811d69e4>] ? __sb_start_write+0xd4/0xf0
[  179.763666]  [<ffffffff811d33b4>] vfs_write+0xa4/0x190
[  179.763669]  [<ffffffff811f1b2a>] ? __fget_light+0x6a/0x90
[  179.763671]  [<ffffffff811d4254>] SyS_write+0x44/0xb0
[  179.763674]  [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
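
For readers unfamiliar with this kind of report, the following is a minimal userspace sketch of the check the splat describes. It is not lockdep's actual implementation. The enum names, the `held_before` table and the functions `reaches()` and `add_dependency()` are all hypothetical and chosen only for illustration. Each "-> #N" entry in the chain above is modelled as one edge "lock B was taken while lock A was held", and a new edge is refused when it would close a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock classes named in the splat; the enum itself is illustrative. */
enum lock_class {
    S_ACTIVE,
    MMAP_SEM,
    STRUCT_MUTEX,
    CPU_HOTPLUG_LOCK,
    NR_LOCK_CLASSES
};

/* held_before[a][b] is true once b has been acquired while a was held. */
static bool held_before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

/* Depth-first search: is `to` reachable from `from` along recorded edges? */
bool reaches(enum lock_class from, enum lock_class to, bool *visited)
{
    if (from == to)
        return true;
    visited[from] = true;
    for (int next = 0; next < NR_LOCK_CLASSES; next++) {
        if (held_before[from][next] && !visited[next] &&
            reaches((enum lock_class)next, to, visited))
            return true;
    }
    return false;
}

/*
 * Record "acquire `next` while holding `held`".
 * Returns false, without recording the edge, if `next` already leads
 * back to `held`, because the new edge would then close a cycle.
 */
bool add_dependency(enum lock_class held, enum lock_class next)
{
    bool visited[NR_LOCK_CLASSES] = { false };

    assert(held < NR_LOCK_CLASSES && next < NR_LOCK_CLASSES);
    if (reaches(next, held, visited))
        return false;
    held_before[held][next] = true;
    return true;
}
```

Feeding in the edges from entries #1, #2 and #3 (s_active to mmap_sem, mmap_sem to struct_mutex, struct_mutex to cpu_hotplug.lock) succeeds. The edge from entry #0 (s_active taken under cpu_hotplug.lock) is then rejected, which corresponds to the "possible circular locking dependency" reported here.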
Comment 1 Chris Wilson 2016-03-01 15:59:41 UTC
Surely you meant due to the kernfs might_fault() whilst holding a lock already tainted by mmap_sem.
Comment 2 Ville Syrjala 2016-03-01 16:13:04 UTC
(In reply to Chris Wilson from comment #1)
> Surely you meant due to the kernfs might_fault() whilst holding a lock
> already tainted by mmap_sem.

Whatever works. I stopped reading at stop_machine() :P
Comment 3 Chris Wilson 2016-03-01 16:20:29 UTC
Can you try https://patchwork.freedesktop.org/patch/74733/ ?
Comment 4 Chris Wilson 2016-03-04 12:05:17 UTC
One puzzle I have here: how did rtcwake's mmap_sem become tainted by struct_mutex?
Comment 5 Chris Wilson 2016-03-04 12:18:25 UTC
Or simply that it is not rtcwake's mmap_sem.
Comment 6 Imre Deak 2016-03-17 12:22:41 UTC
Raising severity due to BAT.
Comment 7 Chris Wilson 2016-03-21 09:12:32 UTC
*** Bug 94644 has been marked as a duplicate of this bug. ***
Comment 8 Chris Wilson 2016-04-01 09:06:43 UTC
(In reply to Joonas Lahtinen from comment #3 on bug 94759)
> The patch was merged to our local CI topic branch and seems to have been
> effective for the past two runs (which is still quite a low confidence level):
> 
> commit 6954af8b55f3b00b08f7759f479c41388fbe364f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Mar 31 11:45:06 2016 +0100
> 
>     kernfs: Move faulting copy_user operations outside of the mutex
> 
> Greg K-H will merge the patch upstream for 4.7-rc1.
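
As a rough illustration of the idea in that commit title, here is a userspace sketch, not the kernfs code. It assumes only what the title and the trace above show: the copy that can fault used to run after the file's locks were taken, and the fix moves it before them. The struct `demo_file`, the `lock_held` flag standing in for the mutex, and all `demo_*` functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STAGE_MAX 64

/* Hypothetical stand-in for an open sysfs file. */
struct demo_file {
    bool lock_held;          /* models the mutex being held */
    char data[STAGE_MAX];    /* what the store callback would see */
    size_t len;
};

/*
 * Stand-in for a copy that may fault (and so may take mmap_sem).
 * It refuses to run under the lock, to make the ordering visible.
 */
bool demo_copy_may_fault(const struct demo_file *f, char *dst,
                         const char *src, size_t len)
{
    assert(f != NULL);
    if (f->lock_held)
        return false;
    memcpy(dst, src, len);
    return true;
}

/* Old ordering: take the lock, then copy. The copy is refused. */
bool demo_write_copy_under_lock(struct demo_file *f, const char *src,
                                size_t len)
{
    bool ok;

    if (len > STAGE_MAX)
        return false;
    f->lock_held = true;
    ok = demo_copy_may_fault(f, f->data, src, len);
    if (ok)
        f->len = len;
    f->lock_held = false;
    return ok;
}

/* New ordering: copy into a private buffer first, then lock and publish. */
bool demo_write_copy_first(struct demo_file *f, const char *src, size_t len)
{
    char staged[STAGE_MAX];

    if (len > STAGE_MAX)
        return false;
    if (!demo_copy_may_fault(f, staged, src, len))
        return false;
    f->lock_held = true;
    memcpy(f->data, staged, len);   /* plain memory copy, cannot fault */
    f->len = len;
    f->lock_held = false;
    return true;
}
```

With the copy done first, nothing that can fault runs while the lock is held, so in terms of the chain above the s_active to mmap_sem edge (entry #1) would no longer be recorded.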
Comment 9 Chris Wilson 2016-04-01 09:07:18 UTC
*** Bug 94759 has been marked as a duplicate of this bug. ***
Comment 10 Jari Tahvanainen 2016-10-07 09:05:55 UTC
Closing as verified+fixed since igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c has not produced this failure in the past two months.
