| Summary: | [BAT SKL] lockdep splat | | |
|---|---|---|---|
| Product: | DRI | Reporter: | Daniel Vetter <daniel> |
| Component: | DRM/Intel | Assignee: | Daniel Vetter <daniel> |
| Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity: | normal | | |
| Priority: | highest | CC: | intel-gfx-bugs, przanoni, tomi.p.sarvela |
| Version: | XOrg git | | |
| Hardware: | Other | | |
| OS: | All | | |
| Whiteboard: | | | |
| i915 platform: | SKL | i915 features: | |
| Attachments: | | | |

Description
Daniel Vetter
2015-12-08 09:55:23 UTC
Disable CONFIG_HOTPLUG_CPU?

(In reply to Chris Wilson from comment #1)
> Disable CONFIG_HOTPLUG_CPU?

Tomi, can you please try to change the kernel build for CI and see whether that gets rid of these failures here on skl? If that's the case it's the fastest way to get CI into shape again since this isn't our bug really (I think at least).

Changed CONFIG_HOTPLUG_CPU=y to =n in kernel debug config.

Changing HOTPLUG_CPU and ACPI_HOTPLUG_CPU is not enough to keep this lock out of code. Looking at Kconfig: (disabling PM is counterproductive to testing PM functionality with IGT)

config HOTPLUG_CPU
	bool "Support for hot-pluggable CPUs"
	depends on SMP
	---help---
	  Say Y here to allow turning CPUs off and on. CPUs can be
	  controlled through /sys/devices/system/cpu.
	  ( Note: power management support will enable this option
	    automatically on SMP systems. )
	  Say N if you want to disable CPU hotplug.

There must be something fishy with lockdep tracking. The reported trace doesn't involve cpu_hotplug.lock in any way. Continuing with debugging.

This seems to have to do with the fact that the lockdep_map is being held while schedule() is called, leading to the confusing bug report. Still need to figure out the correct way of fixing this, while still maintaining proper lockdep tracking.

Reverse the polarity of the tachyon streams:

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5b9d39633ce9..58a0ca0789a0 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -148,7 +148,6 @@ void cpu_hotplug_begin(void)
 	DEFINE_WAIT(wait);
 
 	cpu_hotplug.active_writer = current;
-	cpuhp_lock_acquire();
 
 	for (;;) {
 		mutex_lock(&cpu_hotplug.lock);
@@ -159,13 +158,14 @@ void cpu_hotplug_begin(void)
 		schedule();
 	}
 	finish_wait(&cpu_hotplug.wq, &wait);
+	cpuhp_lock_acquire();
 }
 
 void cpu_hotplug_done(void)
 {
+	cpuhp_lock_release();
 	cpu_hotplug.active_writer = NULL;
 	mutex_unlock(&cpu_hotplug.lock);
-	cpuhp_lock_release();
 }
 
 /*

OK, found the real cause.

P-state init ends up locking first the policy->rwsem and then the cpu_hotplug.lock. I just wonder how this does not happen on all platforms? I attached a description of the call traces.

I think our P-state driver should not call get_online_cpus() during the policy init, if possible, as not many of the other drivers call it either. Or doing the init under cpu_hotplug_begin(). I'll see about this tomorrow.

$ fgrep get_online_cpus *
acpi-cpufreq.c: get_online_cpus();
cpufreq.c: get_online_cpus();
cpufreq.c: get_online_cpus();
cpufreq.c: get_online_cpus();
cpufreq_ondemand.c: get_online_cpus();
intel_pstate.c: get_online_cpus();
intel_pstate.c: get_online_cpus();
powernow-k8.c: get_online_cpus();

Created attachment 121490 [details]
SKL CPU lockdep splat traces

Some further investigation revealed that using cpu_hotplug.lock both for protecting access to the refcount and for the actual CPU hotplug section will potentially cause more trouble, so avoiding it in the Intel P-state driver is not a good resolution, especially given the measures taken in the past to make sure get/put_online_cpus can be called recursively.

The scenario that goes wrong in a couple of CPUfreq drivers (including Intel P-state): policy->rwsem is locked during driver initialization, and the functions called during init that apply CPU limits use get_online_cpus (because they have other calling paths too), which briefly locks cpu_hotplug.lock to increase cpu_hotplug.refcount. In the other scenario, during suspend, cpu_hotplug_begin() is called in disable_nonboot_cpus() and callbacks to CPUfreq functions get called, which lock policy->rwsem while holding cpu_hotplug.lock, so we do have a potential deadlock scenario (though a very unlikely one).

This is solved by using a lockref for locked reference counting and having a second wait queue for readers during a CPU hotplug operation. I've written a patch that resolves it on my SKL, and sent it for comments.

The fix is in the linux-pm tree, waiting to get upstream. It can be merged to our -misc branch in the meantime.
https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=41cfd64cf49fc84837341732a142f3d4cdc1e83a

I see this commit in my drm-intel-nightly tree, and the CI pages for the SKL machines seem to have got rid of these messages after around CI_DRM_1100. Close the bug?

igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b passes. I don't see any such message in the kernel log.

Hardware:
Motherboard: Skylake Y
cpu model name : Intel(R) Core(TM) m5-6Y54 CPU @ 1.10GHz
cpu model : 78
cpu family : 6
Graphic card: Sky Lake Integrated Graphics (rev 07)

Software:
Ubuntu 14.04.4 LTS
Bios: SKLSE2R1.R00.X100.B01.1509220551
Libdrm: 2.4.64
Kernel 4.5.0-rc6 drm-intel-nightly from git://anongit.freedesktop.org/drm-intel

commit f9cadb616ff17d482312fba07db772b6604ce799
Author: Imre Deak <imre.deak@intel.com>
Date: Tue Mar 1 19:17:18 2016 +0200

    drm-intel-nightly: 2016y-03m-01d-17h-16m-32s UTC integration manifest

So closed
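
For readers following the thread, the two lock orders discussed in the comments (policy->rwsem then cpu_hotplug.lock during policy init, and the reverse during suspend) can be illustrated with a small userspace sketch. This is not lockdep's implementation and not kernel code. The tracker, the lock class names, and the two path functions are hypothetical and exist only to show why taking the same pair of locks in opposite orders gets flagged even when no deadlock has actually happened.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes for illustration; not kernel identifiers. */
enum { POLICY_RWSEM, CPU_HOTPLUG_LOCK, NUM_CLASSES };

struct order_tracker {
	/* before[a][b] is true once a has been held while b was taken. */
	bool before[NUM_CLASSES][NUM_CLASSES];
	bool held[NUM_CLASSES];
};

/* Is there a recorded ordering chain from -> ... -> to? */
static bool reaches(const struct order_tracker *t, int from, int to,
		    bool visited[NUM_CLASSES])
{
	if (from == to)
		return true;
	visited[from] = true;
	for (int next = 0; next < NUM_CLASSES; next++)
		if (t->before[from][next] && !visited[next] &&
		    reaches(t, next, to, visited))
			return true;
	return false;
}

/* Returns false if taking `lock` now contradicts an order seen earlier. */
static bool track_acquire(struct order_tracker *t, int lock)
{
	bool ok = true;

	assert(lock >= 0 && lock < NUM_CLASSES);
	for (int h = 0; h < NUM_CLASSES; h++) {
		bool visited[NUM_CLASSES] = { false };

		if (!t->held[h] || h == lock)
			continue;
		/* A chain lock -> ... -> h means the edge h -> lock closes a cycle. */
		if (reaches(t, lock, h, visited))
			ok = false;
		else
			t->before[h][lock] = true;
	}
	t->held[lock] = true;
	return ok;
}

static void track_release(struct order_tracker *t, int lock)
{
	assert(lock >= 0 && lock < NUM_CLASSES);
	t->held[lock] = false;
}

/* Init path from the thread: policy->rwsem first, then cpu_hotplug.lock. */
static bool policy_init_path(struct order_tracker *t)
{
	bool ok = track_acquire(t, POLICY_RWSEM);

	ok = track_acquire(t, CPU_HOTPLUG_LOCK) && ok; /* get_online_cpus() */
	track_release(t, CPU_HOTPLUG_LOCK);
	track_release(t, POLICY_RWSEM);
	return ok;
}

/* Suspend path from the thread: cpu_hotplug.lock first, then policy->rwsem. */
static bool suspend_path(struct order_tracker *t)
{
	bool ok = track_acquire(t, CPU_HOTPLUG_LOCK); /* cpu_hotplug_begin() */

	ok = track_acquire(t, POLICY_RWSEM) && ok; /* cpufreq callback */
	track_release(t, POLICY_RWSEM);
	track_release(t, CPU_HOTPLUG_LOCK);
	return ok;
}
```

With a zero-initialized tracker, whichever path runs first records its order and succeeds, and the other path then returns false. Both paths ran to completion one after the other, so nothing deadlocked, which matches the "though very unlikely" remark in the thread: the report is about a possible interleaving, not one that was observed.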