Bug 93294

Summary: [BAT SKL] lockdep splat
Product: DRI Reporter: Daniel Vetter <daniel>
Component: DRM/IntelAssignee: Daniel Vetter <daniel>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: highest CC: intel-gfx-bugs, przanoni, tomi.p.sarvela
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: SKL i915 features:
Attachments:
Description Flags
SKL CPU lockdep splat traces none

Description Daniel Vetter 2015-12-08 09:55:23 UTC
Looks like locking recursion around cpu hotplug (so not directly a i915 problem), but it's causing fails in our CI. We need at least a hack to get the lockdep splat shut up, but probably need to chase this with upstream.

Affects all skl CI platforms: skl-i5k-2 and skl-i7k-2

Only one testcase seems to hit this reliably: igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b

Full dmesg splat:

[  185.245810] Suspending console(s) (use no_console_suspend to debug)
[  185.250126] sd 3:0:0:0: [sda] Synchronizing SCSI cache
[  185.250333] sd 3:0:0:0: [sda] Stopping disk

[  185.556664] ======================================================
[  185.556664] [ INFO: possible circular locking dependency detected ]
[  185.556665] 4.4.0-rc4-gfxbench+ #1 Tainted: G     U  W      
[  185.556665] -------------------------------------------------------
[  185.556666] rtcwake/6380 is trying to acquire lock:
[  185.556670]  (&policy->rwsem){+++++.}, at: [<ffffffff81632dee>] cpufreq_offline_prepare+0x7e/0x200
[  185.556670] 
but task is already holding lock:
[  185.556673]  (cpu_hotplug.lock#2){+.+.+.}, at: [<ffffffff81074cfd>] cpu_hotplug_begin+0x6d/0xc0
[  185.556673] 
which lock already depends on the new lock.

[  185.556673] 
the existing dependency chain (in reverse order) is:
[  185.556675] 
-> #1 (cpu_hotplug.lock#2){+.+.+.}:
[  185.556677]        [<ffffffff810c85e3>] lock_acquire+0xc3/0x1d0
[  185.556679]        [<ffffffff81795962>] mutex_lock_nested+0x62/0x3b0
[  185.556681]        [<ffffffff814878ed>] acpi_processor_ppc_notifier+0x3e/0xa7
[  185.556683]        [<ffffffff810970a9>] notifier_call_chain+0x39/0xa0
[  185.556684]        [<ffffffff810974a8>] __blocking_notifier_call_chain+0x48/0x70
[  185.556685]        [<ffffffff810974e1>] blocking_notifier_call_chain+0x11/0x20
[  185.556686]        [<ffffffff81633200>] cpufreq_set_policy+0x70/0x340
[  185.556687]        [<ffffffff81633c4f>] cpufreq_init_policy+0x6f/0xc0
[  185.556688]        [<ffffffff81634100>] cpufreq_online+0x290/0x7d0
[  185.556689]        [<ffffffff816346ed>] cpufreq_add_dev+0x5d/0x90
[  185.556690]        [<ffffffff815195e7>] subsys_interface_register+0xa7/0xf0
[  185.556692]        [<ffffffff81633946>] cpufreq_register_driver+0x106/0x1f0
[  185.556693]        [<ffffffff81da9aa6>] intel_pstate_init+0x37f/0x45c
[  185.556695]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[  185.556697]        [<ffffffff81d63fcf>] kernel_init_freeable+0x113/0x19b
[  185.556698]        [<ffffffff81791c69>] kernel_init+0x9/0xe0
[  185.556699]        [<ffffffff8179a97f>] ret_from_fork+0x3f/0x70
[  185.556700] 
-> #0 (&policy->rwsem){+++++.}:
[  185.556702]        [<ffffffff810c7af6>] __lock_acquire+0x1cf6/0x1f70
[  185.556703]        [<ffffffff810c85e3>] lock_acquire+0xc3/0x1d0
[  185.556704]        [<ffffffff81797e2f>] down_write+0x3f/0x70
[  185.556705]        [<ffffffff81632dee>] cpufreq_offline_prepare+0x7e/0x200
[  185.556706]        [<ffffffff81634685>] cpufreq_cpu_callback+0x45/0x50
[  185.556707]        [<ffffffff810970a9>] notifier_call_chain+0x39/0xa0
[  185.556709]        [<ffffffff81097119>] __raw_notifier_call_chain+0x9/0x10
[  185.556709]        [<ffffffff81074e2e>] _cpu_down+0x9e/0x340
[  185.556710]        [<ffffffff81075519>] disable_nonboot_cpus+0xc9/0x350
[  185.556712]        [<ffffffff810ceb81>] suspend_devices_and_enter+0x5a1/0xb50
[  185.556713]        [<ffffffff810cf673>] pm_suspend+0x543/0x8d0
[  185.556714]        [<ffffffff810cd8f7>] state_store+0x77/0xe0
[  185.556716]        [<ffffffff813e199f>] kobj_attr_store+0xf/0x20
[  185.556717]        [<ffffffff8123cfe0>] sysfs_kf_write+0x40/0x50
[  185.556718]        [<ffffffff8123c3cc>] kernfs_fop_write+0x13c/0x180
[  185.556720]        [<ffffffff811c2323>] __vfs_write+0x23/0xe0
[  185.556721]        [<ffffffff811c2b34>] vfs_write+0xa4/0x190
[  185.556722]        [<ffffffff811c3774>] SyS_write+0x44/0xb0
[  185.556723]        [<ffffffff8179a5db>] entry_SYSCALL_64_fastpath+0x16/0x73
[  185.556724] 
other info that might help us debug this:

[  185.556724]  Possible unsafe locking scenario:

[  185.556724]        CPU0                    CPU1
[  185.556724]        ----                    ----
[  185.556725]   lock(cpu_hotplug.lock#2);
[  185.556726]                                lock(&policy->rwsem);
[  185.556727]                                lock(cpu_hotplug.lock#2);
[  185.556727]   lock(&policy->rwsem);
[  185.556727] 
 *** DEADLOCK ***

[  185.556728] 8 locks held by rtcwake/6380:
[  185.556730]  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffff811c5d34>] __sb_start_write+0xd4/0xf0
[  185.556732]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8123c2f1>] kernfs_fop_write+0x61/0x180
[  185.556733]  #2:  (s_active#229){.+.+.+}, at: [<ffffffff8123c2f9>] kernfs_fop_write+0x69/0x180
[  185.556735]  #3:  (pm_mutex){+.+...}, at: [<ffffffff810cf1c7>] pm_suspend+0x97/0x8d0
[  185.556738]  #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffff81459363>] acpi_scan_lock_acquire+0x12/0x14
[  185.556739]  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81075474>] disable_nonboot_cpus+0x24/0x350
[  185.556741]  #6:  (cpu_hotplug.lock){++++++}, at: [<ffffffff81074c90>] cpu_hotplug_begin+0x0/0xc0
[  185.556743]  #7:  (cpu_hotplug.lock#2){+.+.+.}, at: [<ffffffff81074cfd>] cpu_hotplug_begin+0x6d/0xc0
[  185.556743] 
stack backtrace:
[  185.556744] CPU: 1 PID: 6380 Comm: rtcwake Tainted: G     U  W       4.4.0-rc4-gfxbench+ #1
[  185.556745] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 0408 08/28/2015
[  185.556747]  ffffffff825d90b0 ffff880241523940 ffffffff813df80c ffffffff8260c9e0
[  185.556748]  ffff880241523980 ffffffff810c441c ffff8802415239e0 ffff880244a44e80
[  185.556749]  ffff880244a45760 ffff880244a45788 0000000000000008 ffff880244a45788
[  185.556749] Call Trace:
[  185.556750]  [<ffffffff813df80c>] dump_stack+0x4e/0x82
[  185.556752]  [<ffffffff810c441c>] print_circular_bug+0x1fc/0x310
[  185.556753]  [<ffffffff810c7af6>] __lock_acquire+0x1cf6/0x1f70
[  185.556754]  [<ffffffff810c85e3>] lock_acquire+0xc3/0x1d0
[  185.556755]  [<ffffffff81632dee>] ? cpufreq_offline_prepare+0x7e/0x200
[  185.556756]  [<ffffffff81797e2f>] down_write+0x3f/0x70
[  185.556757]  [<ffffffff81632dee>] ? cpufreq_offline_prepare+0x7e/0x200
[  185.556758]  [<ffffffff81632dee>] cpufreq_offline_prepare+0x7e/0x200
[  185.556759]  [<ffffffff81634685>] cpufreq_cpu_callback+0x45/0x50
[  185.556760]  [<ffffffff810970a9>] notifier_call_chain+0x39/0xa0
[  185.556762]  [<ffffffff81097119>] __raw_notifier_call_chain+0x9/0x10
[  185.556762]  [<ffffffff81074e2e>] _cpu_down+0x9e/0x340
[  185.556763]  [<ffffffff810d24cf>] ? vprintk_emit+0x2cf/0x610
[  185.556764]  [<ffffffff810d250b>] ? vprintk_emit+0x30b/0x610
[  185.556766]  [<ffffffff810d294a>] ? vprintk_default+0x1a/0x20
[  185.556767]  [<ffffffff810c30b9>] ? __lock_is_held+0x49/0x70
[  185.556768]  [<ffffffff81075519>] disable_nonboot_cpus+0xc9/0x350
[  185.556769]  [<ffffffff810ceb81>] suspend_devices_and_enter+0x5a1/0xb50
[  185.556770]  [<ffffffff810c30b9>] ? __lock_is_held+0x49/0x70
[  185.556771]  [<ffffffff810cf673>] pm_suspend+0x543/0x8d0
[  185.556772]  [<ffffffff810cd8f7>] state_store+0x77/0xe0
[  185.556773]  [<ffffffff813e199f>] kobj_attr_store+0xf/0x20
[  185.556774]  [<ffffffff8123cfe0>] sysfs_kf_write+0x40/0x50
[  185.556774]  [<ffffffff8123c3cc>] kernfs_fop_write+0x13c/0x180
[  185.556775]  [<ffffffff811c2323>] __vfs_write+0x23/0xe0
[  185.556776]  [<ffffffff810c1ad2>] ? percpu_down_read+0x52/0x90
[  185.556777]  [<ffffffff811c5d34>] ? __sb_start_write+0xd4/0xf0
[  185.556778]  [<ffffffff811c5d34>] ? __sb_start_write+0xd4/0xf0
[  185.556779]  [<ffffffff811c2b34>] vfs_write+0xa4/0x190
[  185.556780]  [<ffffffff811e091a>] ? __fget_light+0x6a/0x90
[  185.556781]  [<ffffffff811c3774>] SyS_write+0x44/0xb0
[  185.556782]  [<ffffffff8179a5db>] entry_SYSCALL_64_fastpath+0x16/0x73
[  185.573654] Broke affinity for irq 121
[  185.582651] Broke affinity for irq 121
[  185.592586]  cache: parent cpu1 should not be sleeping
[  185.598049]  cache: parent cpu2 should not be sleeping
[  185.603869]  cache: parent cpu3 should not be sleeping
[  185.643441] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643445] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643448] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643451] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643453] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643456] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643458] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643461] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.643464] snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x0f0000
[  185.661028] sd 3:0:0:0: [sda] Starting disk
[  185.787456] snd_hda_intel 0000:00:1f.3: Codec #2 probe error; disabling it...
[  186.107210] ------------[ cut here ]------------
[  186.107239] WARNING: CPU: 1 PID: 6375 at drivers/gpu/drm/i915/intel_display.c:13778 skl_max_scale.part.87+0x67/0x70 [i915]()
[  186.107240] WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)
[  186.107241] Modules linked in:
[  186.107242]  snd_hda_intel i915 ax88179_178a usbnet mii snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me mei i2c_hid e1000e ptp pps_core [last unloaded: i915]
[  186.107254] CPU: 1 PID: 6375 Comm: kms_pipe_crc_ba Tainted: G     U  W       4.4.0-rc4-gfxbench+ #1
[  186.107255] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 0408 08/28/2015
[  186.107256]  ffffffffa034bab0 ffff8802481fba40 ffffffff813df80c ffff8802481fba88
[  186.107258]  ffff8802481fba78 ffffffff810747c1 0000000000010000 ffff88003680cf38
[  186.107260]  ffff8802497d8000 ffff88003680cf38 ffffffffa032d620 ffff8802481fbad8
[  186.107262] Call Trace:
[  186.107265]  [<ffffffff813df80c>] dump_stack+0x4e/0x82
[  186.107267]  [<ffffffff810747c1>] warn_slowpath_common+0x81/0xc0
[  186.107268]  [<ffffffff81074847>] warn_slowpath_fmt+0x47/0x50
[  186.107278]  [<ffffffffa02d48a7>] skl_max_scale.part.87+0x67/0x70 [i915]
[  186.107288]  [<ffffffffa02d4958>] intel_check_primary_plane+0xa8/0xc0 [i915]
[  186.107296]  [<ffffffffa02c6ab9>] intel_plane_atomic_check+0x129/0x1f0 [i915]
[  186.107299]  [<ffffffff814ea153>] drm_atomic_helper_check_planes+0x43/0x1d0
[  186.107309]  [<ffffffffa02e00d5>] intel_atomic_check+0x2f5/0x1090 [i915]
[  186.107311]  [<ffffffff815001db>] ? drm_property_unreference_blob+0x4b/0xf0
[  186.107312]  [<ffffffff81500101>] ? drm_property_free_blob+0x21/0x60
[  186.107314]  [<ffffffff8150dc72>] drm_atomic_check_only+0x142/0x5e0
[  186.107315]  [<ffffffff8150d7c0>] ? drm_atomic_set_mode_prop_for_crtc+0x20/0x110
[  186.107316]  [<ffffffff8150e122>] drm_atomic_commit+0x12/0x50
[  186.107318]  [<ffffffff814ead55>] drm_atomic_helper_set_config+0x75/0xb0
[  186.107319]  [<ffffffff814fcc60>] drm_mode_set_config_internal+0x60/0x110
[  186.107321]  [<ffffffff81501876>] drm_mode_setcrtc+0x186/0x4e0
[  186.107322]  [<ffffffff814f3b5d>] drm_ioctl+0x13d/0x590
[  186.107324]  [<ffffffff815016f0>] ? drm_mode_setplane+0x1b0/0x1b0
[  186.107326]  [<ffffffff811d4eac>] do_vfs_ioctl+0x2fc/0x550
[  186.107327]  [<ffffffff8118d554>] ? do_munmap+0x254/0x3d0
[  186.107329]  [<ffffffff8118d71a>] ? vm_munmap+0x4a/0x60
[  186.107330]  [<ffffffff811e091a>] ? __fget_light+0x6a/0x90
[  186.107331]  [<ffffffff811d513c>] SyS_ioctl+0x3c/0x70
[  186.107333]  [<ffffffff8179a5db>] entry_SYSCALL_64_fastpath+0x16/0x73
[  186.107334] ---[ end trace 29c112d5b40b86c8 ]---


Note that the cdclk splat is already tracked in bug #93293
Comment 1 Chris Wilson 2015-12-08 11:59:59 UTC
Disable CONFIG_HOTPLUG_CPU?
Comment 2 Daniel Vetter 2015-12-10 11:54:53 UTC
(In reply to Chris Wilson from comment #1)
> Disable CONFIG_HOTPLUG_CPU?

Tomi, can you please try to change the kernel build for CI and see whether that gets rid of these failures here on skl? If that's the case it's the fastest way to get CI into shape again since this isn't our bug really (I think at least).
Comment 3 Tomi Sarvela 2015-12-10 15:43:38 UTC
Changed CONFIG_HOTPLUG_CPU=y to =n in kernel debug config.
Comment 4 Tomi Sarvela 2016-01-07 09:54:58 UTC
Changing HOTPLUG_CPU and ACPI_HOTPLUG_CPU is not enough to keep this lock out of code. Looking at Kconfig: (disabling PM is counterproductive to testing PM functionality with IGT)

config HOTPLUG_CPU
        bool "Support for hot-pluggable CPUs"
        depends on SMP
        ---help---
          Say Y here to allow turning CPUs off and on. CPUs can be
          controlled through /sys/devices/system/cpu.
          ( Note: power management support will enable this option
            automatically on SMP systems. )
          Say N if you want to disable CPU hotplug.
Comment 5 Joonas Lahtinen 2016-01-22 12:08:53 UTC
There must be something fishy with lockdep tracking. The reported trace doesn't involve cpu_hotplug.lock in any way. Continuing with debugging.
Comment 6 Joonas Lahtinen 2016-01-25 13:26:15 UTC
This seems to have to do with the fact that the lockdep_map is being held while schedule() is called, leading to the confusing bug report. Still need to figure out the correct way of fixing this, while still maintaining proper lockdep tracking.
Comment 7 Chris Wilson 2016-01-28 13:12:50 UTC
Reverse the polarity of the tachyon streams:

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5b9d39633ce9..58a0ca0789a0 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -148,7 +148,6 @@ void cpu_hotplug_begin(void)
        DEFINE_WAIT(wait);
 
        cpu_hotplug.active_writer = current;
-       cpuhp_lock_acquire();
 
        for (;;) {
                mutex_lock(&cpu_hotplug.lock);
@@ -159,13 +158,14 @@ void cpu_hotplug_begin(void)
                schedule();
        }
        finish_wait(&cpu_hotplug.wq, &wait);
+       cpuhp_lock_acquire();
 }
 
 void cpu_hotplug_done(void)
 {
+       cpuhp_lock_release();
        cpu_hotplug.active_writer = NULL;
        mutex_unlock(&cpu_hotplug.lock);
-       cpuhp_lock_release();
 }
 
 /*
Comment 8 Joonas Lahtinen 2016-02-03 14:16:41 UTC
OK, found the real cause. P-state init ends up locking first the policy->rwsem and then the cpu_hotplug.lock. I just wonder how this does not happen on all platforms?

I attached a description of the call traces. I think our P-state driver should not call get_online_cpus() during the policy init, if possible, as not many of the other drivers call it either.

Or doing the init under cpu_hotplug_begin(). I'll see about this tomorrow.

$ fgrep get_online_cpus *
acpi-cpufreq.c:	get_online_cpus();
cpufreq.c:	get_online_cpus();
cpufreq.c:	get_online_cpus();
cpufreq.c:	get_online_cpus();
cpufreq_ondemand.c:	get_online_cpus();
intel_pstate.c:	get_online_cpus();
intel_pstate.c:	get_online_cpus();
powernow-k8.c:	get_online_cpus();
Comment 9 Joonas Lahtinen 2016-02-03 14:17:14 UTC
Created attachment 121490 [details]
SKL CPU lockdep splat traces
Comment 10 Joonas Lahtinen 2016-02-15 12:35:26 UTC
Some further investigation revealed that the using cpu_hotplug.lock for both locking the access to refcount and actual CPU hotplugging section will potentially cause more trouble, so avoiding it in Intel P-state driver is not a good resolution, especially due to measures been taken in the past to make sure get/put_online_cpus can be called recursively.

Scenario that goes wrong in a couple of CPUfreq drivers (inlcuding Intel P-state); policy->rwsem is locked during driver initialization and the functions called during init that apply CPU limits use get_online_cpus (because they have other calling paths too), which will briefly lock cpu_hotplug.lock to increase cpu_hotplug.refcount. On the other scenarion when doing a suspend, when cpu_hotplug_begin() is called in the disable_nonboot_cpus(), callbacks to CPUfreq functions get called, which will lock policy->rwsem after holding cpu_hotplug.lock and we do have a potential deadlock scenario (though very unlikely).

This is solved by using a lockref for locked reference counting and having a second wait queue for readers during a CPU hotplug operation. I've written a patch that resolves it on my SKL, and set it for comments.
Comment 11 Joonas Lahtinen 2016-02-26 10:34:49 UTC
The fix is in linux-pm tree, awaiting to get to upstream. Can be merged to our -misc branch in the meanwhile.

https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=41cfd64cf49fc84837341732a142f3d4cdc1e83a
Comment 12 Paulo Zanoni 2016-03-08 18:05:42 UTC
I see this commit on my drm-intel-nightly tree, and the CI pages for the SKL machines seem to have got rid of these messages after around CI_DRM_1100. Close the bug?
Comment 13 cprigent 2016-03-09 10:52:17 UTC
igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b is Pass. I don't see such log in kernel log.

Hardware: 
Motherboard: Skylake Y 
cpu model name : Intel(R) Core(TM) m5-6Y54 CPU @ 1.10GHz 
cpu model : 78 
cpu family : 6 
Graphic card: Sky Lake Integrated Graphics (rev 07)
Software:
Ubuntu 14.04.4 LTS 
Bios: SKLSE2R1.R00.X100.B01.1509220551 
Libdrm: 2.4.64 
Kernel 4.5.0-rc6 drm-intel-nightly from git://anongit.freedesktop.org/drm-intel
  commit f9cadb616ff17d482312fba07db772b6604ce799
  Author: Imre Deak <imre.deak@intel.com>
  Date:   Tue Mar 1 19:17:18 2016 +0200
  drm-intel-nightly: 2016y-03m-01d-17h-16m-32s UTC integration manifest
Comment 14 cprigent 2016-03-09 10:52:32 UTC
So closed

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.