Bug 94350 - [BAT BSW] lockdep splat due to stop_machine() in ggtt pte programming
Summary: [BAT BSW] lockdep splat due to stop_machine() in ggtt pte programming
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 94644 94759
Depends on:
Blocks:
 
Reported: 2016-03-01 15:53 UTC by Ville Syrjala
Modified: 2016-10-07 09:05 UTC

See Also:
i915 platform: BSW/CHT
i915 features: GEM/PPGTT


Description Ville Syrjala 2016-03-01 15:53:53 UTC
[  179.762854] ======================================================
[  179.762855] [ INFO: possible circular locking dependency detected ]
[  179.762860] 4.5.0-rc6-gfxbench+ #1 Tainted: G     U         
[  179.762861] -------------------------------------------------------
[  179.762863] rtcwake/5995 is trying to acquire lock:
[  179.762877]  (s_active#6){++++.+}, at: [<ffffffff8124ec70>] kernfs_remove_by_name_ns+0x40/0xa0
[  179.762878] 
but task is already holding lock:
[  179.762885]  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81078c4d>] cpu_hotplug_begin+0x6d/0xc0
[  179.762886] 
which lock already depends on the new lock.

[  179.762887] 
the existing dependency chain (in reverse order) is:
[  179.762891] 
-> #3 (cpu_hotplug.lock){+.+.+.}:
[  179.762895]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.762900]        [<ffffffff817be602>] mutex_lock_nested+0x62/0x3b0
[  179.762903]        [<ffffffff81078911>] get_online_cpus+0x61/0x80
[  179.762907]        [<ffffffff81117f1b>] stop_machine+0x1b/0xe0
[  179.762956]        [<ffffffffa01386bd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[  179.762991]        [<ffffffffa013cc36>] ggtt_bind_vma+0x46/0x70 [i915]
[  179.763027]        [<ffffffffa013e439>] i915_vma_bind+0x109/0x250 [i915]
[  179.763064]        [<ffffffffa0145d07>] i915_gem_object_do_pin+0x897/0xb00 [i915]
[  179.763100]        [<ffffffffa0145f98>] i915_gem_object_pin+0x28/0x30 [i915]
[  179.763138]        [<ffffffffa0159c1e>] intel_init_pipe_control+0xbe/0x210 [i915]
[  179.763176]        [<ffffffffa0156ce2>] intel_logical_rings_init+0xe2/0xde0 [i915]
[  179.763213]        [<ffffffffa0146a33>] i915_gem_init+0xf3/0x130 [i915]
[  179.763252]        [<ffffffffa01c9b07>] i915_driver_load+0xf47/0x1790 [i915]
[  179.763257]        [<ffffffff81515f44>] drm_dev_register+0xa4/0xb0
[  179.763260]        [<ffffffff8151814e>] drm_get_pci_dev+0xce/0x1e0
[  179.763292]        [<ffffffffa01062cf>] i915_pci_probe+0x2f/0x50 [i915]
[  179.763297]        [<ffffffff814430f7>] pci_device_probe+0x87/0xf0
[  179.763302]        [<ffffffff81539a39>] driver_probe_device+0x229/0x450
[  179.763305]        [<ffffffff81539ce3>] __driver_attach+0x83/0x90
[  179.763308]        [<ffffffff81537711>] bus_for_each_dev+0x61/0xa0
[  179.763311]        [<ffffffff81539329>] driver_attach+0x19/0x20
[  179.763314]        [<ffffffff81538e0f>] bus_add_driver+0x1ef/0x290
[  179.763317]        [<ffffffff8153a9ab>] driver_register+0x5b/0xe0
[  179.763320]        [<ffffffff8144202b>] __pci_register_driver+0x5b/0x60
[  179.763323]        [<ffffffff81518336>] drm_pci_init+0xd6/0x100
[  179.763326]        [<ffffffffa023c094>] 0xffffffffa023c094
[  179.763331]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[  179.763335]        [<ffffffff8115a5a5>] do_init_module+0x5b/0x1c6
[  179.763338]        [<ffffffff81106f80>] load_module+0x1c20/0x2490
[  179.763342]        [<ffffffff811079de>] SyS_finit_module+0x7e/0xa0
[  179.763346]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763349] 
-> #2 (&dev->struct_mutex){+.+.+.}:
[  179.763352]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763355]        [<ffffffff81511487>] drm_gem_mmap+0x1c7/0x270
[  179.763360]        [<ffffffff81197ec4>] mmap_region+0x334/0x580
[  179.763363]        [<ffffffff81198474>] do_mmap+0x364/0x410
[  179.763366]        [<ffffffff8117c65d>] vm_mmap_pgoff+0x6d/0xa0
[  179.763370]        [<ffffffff811965a4>] SyS_mmap_pgoff+0x184/0x220
[  179.763373]        [<ffffffff8100a1ed>] SyS_mmap+0x1d/0x20
[  179.763377]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763380] 
-> #1 (&mm->mmap_sem){++++++}:
[  179.763383]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763386]        [<ffffffff8118d0e5>] __might_fault+0x75/0xa0
[  179.763389]        [<ffffffff8124f67a>] kernfs_fop_write+0x8a/0x180
[  179.763393]        [<ffffffff811d2653>] __vfs_write+0x23/0xe0
[  179.763396]        [<ffffffff811d33b4>] vfs_write+0xa4/0x190
[  179.763398]        [<ffffffff811d4254>] SyS_write+0x44/0xb0
[  179.763402]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763406] 
-> #0 (s_active#6){++++.+}:
[  179.763409]        [<ffffffff810cc659>] __lock_acquire+0x1fc9/0x20f0
[  179.763411]        [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763414]        [<ffffffff8124dca0>] __kernfs_remove+0x210/0x2f0
[  179.763417]        [<ffffffff8124ec70>] kernfs_remove_by_name_ns+0x40/0xa0
[  179.763420]        [<ffffffff81250610>] sysfs_remove_file_ns+0x10/0x20
[  179.763423]        [<ffffffff81535384>] device_del+0x124/0x250
[  179.763426]        [<ffffffff815354c9>] device_unregister+0x19/0x60
[  179.763430]        [<ffffffff8153fb61>] cpu_cache_sysfs_exit+0x51/0xb0
[  179.763433]        [<ffffffff81540138>] cacheinfo_cpu_callback+0x38/0x70
[  179.763437]        [<ffffffff8109b499>] notifier_call_chain+0x39/0xa0
[  179.763439]        [<ffffffff8109b509>] __raw_notifier_call_chain+0x9/0x10
[  179.763442]        [<ffffffff81078b2e>] cpu_notify+0x1e/0x40
[  179.763445]        [<ffffffff81078bc9>] cpu_notify_nofail+0x9/0x20
[  179.763448]        [<ffffffff81078f13>] _cpu_down+0x233/0x340
[  179.763451]        [<ffffffff81079469>] disable_nonboot_cpus+0xc9/0x380
[  179.763455]        [<ffffffff810d36fe>] suspend_devices_and_enter+0x58e/0xbb0
[  179.763458]        [<ffffffff810d42ec>] pm_suspend+0x5cc/0x970
[  179.763461]        [<ffffffff810d2447>] state_store+0x77/0xe0
[  179.763465]        [<ffffffff813fdc1f>] kobj_attr_store+0xf/0x20
[  179.763468]        [<ffffffff81250370>] sysfs_kf_write+0x40/0x50
[  179.763470]        [<ffffffff8124f72c>] kernfs_fop_write+0x13c/0x180
[  179.763473]        [<ffffffff811d2653>] __vfs_write+0x23/0xe0
[  179.763476]        [<ffffffff811d33b4>] vfs_write+0xa4/0x190
[  179.763478]        [<ffffffff811d4254>] SyS_write+0x44/0xb0
[  179.763482]        [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
[  179.763483] 
other info that might help us debug this:

[  179.763488] Chain exists of:
  s_active#6 --> &dev->struct_mutex --> cpu_hotplug.lock

[  179.763489]  Possible unsafe locking scenario:

[  179.763490]        CPU0                    CPU1
[  179.763491]        ----                    ----
[  179.763493]   lock(cpu_hotplug.lock);
[  179.763495]                                lock(&dev->struct_mutex);
[  179.763497]                                lock(cpu_hotplug.lock);
[  179.763500]   lock(s_active#6);
[  179.763501] 
 *** DEADLOCK ***

[  179.763502] 8 locks held by rtcwake/5995:
[  179.763509]  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffff811d69e4>] __sb_start_write+0xd4/0xf0
[  179.763515]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8124f651>] kernfs_fop_write+0x61/0x180
[  179.763521]  #2:  (s_active#118){.+.+.+}, at: [<ffffffff8124f659>] kernfs_fop_write+0x69/0x180
[  179.763527]  #3:  (pm_mutex){+.+...}, at: [<ffffffff810d3fbe>] pm_suspend+0x29e/0x970
[  179.763534]  #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffff814762fb>] acpi_scan_lock_acquire+0x12/0x14
[  179.763539]  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810793c4>] disable_nonboot_cpus+0x24/0x380
[  179.763545]  #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff81078be0>] cpu_hotplug_begin+0x0/0xc0
[  179.763550]  #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81078c4d>] cpu_hotplug_begin+0x6d/0xc0
[  179.763551] 
stack backtrace:
[  179.763554] CPU: 0 PID: 5995 Comm: rtcwake Tainted: G     U          4.5.0-rc6-gfxbench+ #1
[  179.763556] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[  179.763561]  0000000000000000 ffff880179833850 ffffffff813fba95 ffffffff825e0190
[  179.763565]  ffffffff825a1220 ffff880179833890 ffffffff810c8cac ffff8801798338f0
[  179.763569]  ffff88007867aec0 ffff88007867a580 0000000000000008 ffff88007867aee8
[  179.763570] Call Trace:
[  179.763573]  [<ffffffff813fba95>] dump_stack+0x67/0x92
[  179.763576]  [<ffffffff810c8cac>] print_circular_bug+0x1fc/0x310
[  179.763578]  [<ffffffff810cc659>] __lock_acquire+0x1fc9/0x20f0
[  179.763581]  [<ffffffff810cd09b>] lock_acquire+0xdb/0x1f0
[  179.763584]  [<ffffffff8124ec70>] ? kernfs_remove_by_name_ns+0x40/0xa0
[  179.763587]  [<ffffffff8124dca0>] __kernfs_remove+0x210/0x2f0
[  179.763589]  [<ffffffff8124ec70>] ? kernfs_remove_by_name_ns+0x40/0xa0
[  179.763591]  [<ffffffff8124deb8>] ? kernfs_find_ns+0x78/0x130
[  179.763594]  [<ffffffff8124ec70>] kernfs_remove_by_name_ns+0x40/0xa0
[  179.763597]  [<ffffffff81250610>] sysfs_remove_file_ns+0x10/0x20
[  179.763599]  [<ffffffff81535384>] device_del+0x124/0x250
[  179.763601]  [<ffffffff810ca3cd>] ? trace_hardirqs_on+0xd/0x10
[  179.763603]  [<ffffffff815354c9>] device_unregister+0x19/0x60
[  179.763606]  [<ffffffff8153fb61>] cpu_cache_sysfs_exit+0x51/0xb0
[  179.763608]  [<ffffffff81540138>] cacheinfo_cpu_callback+0x38/0x70
[  179.763610]  [<ffffffff8109b499>] notifier_call_chain+0x39/0xa0
[  179.763613]  [<ffffffff8109b509>] __raw_notifier_call_chain+0x9/0x10
[  179.763615]  [<ffffffff81078b2e>] cpu_notify+0x1e/0x40
[  179.763617]  [<ffffffff81078bc9>] cpu_notify_nofail+0x9/0x20
[  179.763620]  [<ffffffff81078f13>] _cpu_down+0x233/0x340
[  179.763623]  [<ffffffff810e4940>] ? __call_rcu.constprop.61+0x2f0/0x2f0
[  179.763625]  [<ffffffff810e49a0>] ? call_rcu_bh+0x20/0x20
[  179.763628]  [<ffffffff810e0430>] ? trace_raw_output_rcu_utilization+0x60/0x60
[  179.763632]  [<ffffffff810e0430>] ? trace_raw_output_rcu_utilization+0x60/0x60
[  179.763635]  [<ffffffff81079469>] disable_nonboot_cpus+0xc9/0x380
[  179.763638]  [<ffffffff810d36fe>] suspend_devices_and_enter+0x58e/0xbb0
[  179.763641]  [<ffffffff810c7939>] ? __lock_is_held+0x49/0x70
[  179.763643]  [<ffffffff810d42ec>] pm_suspend+0x5cc/0x970
[  179.763646]  [<ffffffff810d2447>] state_store+0x77/0xe0
[  179.763648]  [<ffffffff813fdc1f>] kobj_attr_store+0xf/0x20
[  179.763651]  [<ffffffff81250370>] sysfs_kf_write+0x40/0x50
[  179.763653]  [<ffffffff8124f72c>] kernfs_fop_write+0x13c/0x180
[  179.763656]  [<ffffffff811d2653>] __vfs_write+0x23/0xe0
[  179.763659]  [<ffffffff810c6332>] ? percpu_down_read+0x52/0x90
[  179.763661]  [<ffffffff811d69e4>] ? __sb_start_write+0xd4/0xf0
[  179.763663]  [<ffffffff811d69e4>] ? __sb_start_write+0xd4/0xf0
[  179.763666]  [<ffffffff811d33b4>] vfs_write+0xa4/0x190
[  179.763669]  [<ffffffff811f1b2a>] ? __fget_light+0x6a/0x90
[  179.763671]  [<ffffffff811d4254>] SyS_write+0x44/0xb0
[  179.763674]  [<ffffffff817c2e9b>] entry_SYSCALL_64_fastpath+0x16/0x73
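
Editorial sketch (not part of the original report): the dependency chain in the splat above can be restated as a small directed graph, where an edge "A -> B" means B was acquired while A was held. The lock names and edges below are transcribed from entries #3 to #0 of the trace; the graph representation and the `find_cycle` helper are illustrative assumptions and not how lockdep is actually implemented.

```python
# Edges read "held -> acquired next", transcribed from the splat.
# The dict layout and find_cycle() are hypothetical, for illustration only.
DEPENDENCIES = {
    "s_active": ["mmap_sem"],              # #1: kernfs_fop_write -> __might_fault
    "mmap_sem": ["struct_mutex"],          # #2: mmap_region -> drm_gem_mmap
    "struct_mutex": ["cpu_hotplug.lock"],  # #3: ggtt_bind_vma -> stop_machine
    "cpu_hotplug.lock": ["s_active"],      # #0: _cpu_down -> __kernfs_remove
}


def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None."""
    visiting = []  # current depth-first path
    done = set()   # nodes already proven cycle-free

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge found: the cycle is the tail of the current path.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    cycle = find_cycle(DEPENDENCIES)
    print(" -> ".join(cycle) if cycle else "no cycle")
```

In this toy model, dropping any single edge breaks the cycle. For example, if the faulting user copy no longer happened while s_active was held, the "s_active -> mmap_sem" edge would go away and `find_cycle` would return None.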
Comment 1 Chris Wilson 2016-03-01 15:59:41 UTC
Surely you meant due to the kernfs might_fault() whilst holding a lock already tainted by mmap_sem.
Comment 2 Ville Syrjala 2016-03-01 16:13:04 UTC
(In reply to Chris Wilson from comment #1)
> Surely you meant due to the kernfs might_fault() whilst holding a lock
> already tainted by mmap_sem.

Whatever works. I stopped reading at stop_machine() :P
Comment 3 Chris Wilson 2016-03-01 16:20:29 UTC
Can you try https://patchwork.freedesktop.org/patch/74733/ ?
Comment 4 Chris Wilson 2016-03-04 12:05:17 UTC
One puzzle I have here is how rtcwake's mmap_sem became tainted by struct_mutex.
Comment 5 Chris Wilson 2016-03-04 12:18:25 UTC
Or simply that it is not rtcwake's mmap_sem.
Comment 6 Imre Deak 2016-03-17 12:22:41 UTC
Raising severity due to BAT.
Comment 7 Chris Wilson 2016-03-21 09:12:32 UTC
*** Bug 94644 has been marked as a duplicate of this bug. ***
Comment 8 Chris Wilson 2016-04-01 09:06:43 UTC
(In reply to Joonas Lahtinen from comment #3 on bug 94759)
> Patch was merged to our local CI topic branch, seems to have been effective
> for the past two runs (which is still quite a low confidence level):
> 
> commit 6954af8b55f3b00b08f7759f479c41388fbe364f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Mar 31 11:45:06 2016 +0100
> 
>     kernfs: Move faulting copy_user operations outside of the mutex
> 
> Greg K-H will merge the patch upstream for 4.7-rc1.
Comment 9 Chris Wilson 2016-04-01 09:07:18 UTC
*** Bug 94759 has been marked as a duplicate of this bug. ***
Comment 10 Jari Tahvanainen 2016-10-07 09:05:55 UTC
Closing as verified+fixed since igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c has not produced this failure in the past two months.