Bug 98211

Summary: i915 / drm crash when undocking from DP monitors
Product: DRI Reporter: Vadim Lobanov <vadimplobanov>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker    
Priority: highest CC: alejandro_aero, freedesktop, intel-gfx-bugs, nutello
Version: unspecified Keywords: bisect_pending, regression
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: BDW i915 features: display/DP
Attachments:
Description Flags
dmesg booting with drm.debug=14 none

Description Vadim Lobanov 2016-10-12 02:45:43 UTC
Hi folks,

I'm seeing a repeatable crash on my HP EliteBook 840 G2/2216 when booting it while in a docking station connected to two external DisplayPort monitors, undocking, and then either logging out or shutting down -- regardless of whether I've redocked it beforehand or not. Both logout and shutdown work great if I do not undock the laptop at all, so the badness correlates with the DP monitors going away.

This is a regression introduced somewhere in the v4.6 -> v4.7 development timeframe: 4.6.0 works, 4.7.0 fails as described, and 4.8.0 crashes earlier still when undocking.

The graphics hardware involved is:

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 5500 (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Hewlett-Packard Company ZBook 15u G2 Mobile Workstation
	Flags: bus master, fast devsel, latency 0, IRQ 49
	Memory at c0000000 (64-bit, non-prefetchable) [size=16M]
	Memory at b0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 5000 [size=64]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel driver in use: i915
	Kernel modules: i915

And the crash that I see is similar to this:

Oct 07 17:47:16 localhost.localdomain kernel: BUG: unable to handle kernel paging request at 0000000000018c70
Oct 07 17:47:16 localhost.localdomain kernel: IP: [<ffffffff960ecd48>] queued_spin_lock_slowpath+0x108/0x190
Oct 07 17:47:16 localhost.localdomain kernel: PGD 0 
Oct 07 17:47:16 localhost.localdomain kernel: Oops: 0002 [#1] SMP
Oct 07 17:47:16 localhost.localdomain kernel: Modules linked in: rfcomm ccm xt_CHECKSUM tun ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_
Oct 07 17:47:16 localhost.localdomain kernel:  sparse_keymap ppdev irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iwlwifi intel_cstate intel_uncore intel_rapl_perf cfg80211 joydev uvcvideo lpc_ich r
Oct 07 17:47:16 localhost.localdomain kernel: CPU: 2 PID: 855 Comm: systemd-logind Not tainted 4.7.5-200.fc24.x86_64 #1
Oct 07 17:47:16 localhost.localdomain kernel: Hardware name: Hewlett-Packard HP EliteBook 840 G2/2216, BIOS M71 Ver. 01.04 02/24/2015
Oct 07 17:47:16 localhost.localdomain kernel: task: ffff88043a120000 ti: ffff880035d50000 task.ti: ffff880035d50000
Oct 07 17:47:16 localhost.localdomain kernel: RIP: 0010:[<ffffffff960ecd48>]  [<ffffffff960ecd48>] queued_spin_lock_slowpath+0x108/0x190
Oct 07 17:47:16 localhost.localdomain kernel: RSP: 0018:ffff880035d53908  EFLAGS: 00010202
Oct 07 17:47:16 localhost.localdomain kernel: RAX: 0000000000018c70 RBX: ffff880438716a50 RCX: ffff88044f498c40
Oct 07 17:47:16 localhost.localdomain kernel: RDX: 0000000000001b9a RSI: 000000006e6f746f RDI: ffff880438716a54
Oct 07 17:47:16 localhost.localdomain kernel: RBP: ffff880035d53908 R08: 00000000000c0000 R09: 0000000000000000
Oct 07 17:47:16 localhost.localdomain kernel: R10: ffff880096e4e780 R11: 0000000000000898 R12: ffff88043ab3ec40
Oct 07 17:47:16 localhost.localdomain kernel: R13: ffff880438716a58 R14: ffff880427ebd800 R15: ffff8804396bd000
Oct 07 17:47:16 localhost.localdomain kernel: FS:  00007f22e2cb5900(0000) GS:ffff88044f480000(0000) knlGS:0000000000000000
Oct 07 17:47:16 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 07 17:47:16 localhost.localdomain kernel: CR2: 0000000000018c70 CR3: 000000043a095000 CR4: 00000000003406e0
Oct 07 17:47:16 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 07 17:47:16 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 07 17:47:16 localhost.localdomain kernel: Stack:
Oct 07 17:47:16 localhost.localdomain kernel:  ffff880035d53918 ffffffff967ec350 ffff880035d53940 ffffffff967e9f2f
Oct 07 17:47:16 localhost.localdomain kernel:  ffff88043ab3ec40 ffff880438716a50 ffff880438714800 ffff880035d53970
Oct 07 17:47:16 localhost.localdomain kernel:  ffffffffc00a155e ffff880427e49800 ffff880438716800 ffff880427ebd800
Oct 07 17:47:16 localhost.localdomain kernel: Call Trace:
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff967ec350>] _raw_spin_lock+0x20/0x30
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff967e9f2f>] __ww_mutex_lock+0x6f/0xa0
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc00a155e>] drm_modeset_lock+0x4e/0xd0 [drm]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc00a2044>] drm_atomic_get_connector_state+0x34/0x1c0 [drm]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc014ff90>] __drm_atomic_helper_set_config+0x2a0/0x360 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc01511da>] restore_fbdev_mode+0x22a/0x260 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc01535d4>] drm_fb_helper_restore_fbdev_mode_unlocked+0x34/0x80 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc015364d>] drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffffc023da4a>] intel_fbdev_set_par+0x1a/0x60 [i915]
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9645a6b6>] fb_set_var+0x236/0x460
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff960d98e8>] ? enqueue_task_fair+0xa8/0x960
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff961bf0df>] ? free_hot_cold_page_list+0x3f/0xa0
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9645074f>] fbcon_blank+0x30f/0x350
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9624c200>] ? chrdev_open+0xb0/0x180
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff964db0b2>] do_unblank_screen+0xd2/0x1a0
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff964d0ef6>] vt_ioctl+0x4f6/0x1270
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9625abf9>] ? fasync_remove_entry+0x29/0xb0
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff964c537a>] tty_ioctl+0x35a/0xc50
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff964cdf79>] ? tty_unlock+0x29/0x50
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9625f909>] ? dput+0xd9/0x260
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff96268ae4>] ? mntput+0x24/0x40
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9625b4b2>] do_vfs_ioctl+0xa2/0x5d0
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff962497ce>] ? ____fput+0xe/0x10
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff960be9b8>] ? task_work_run+0x88/0xb0
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff9625ba59>] SyS_ioctl+0x79/0x90
Oct 07 17:47:16 localhost.localdomain kernel:  [<ffffffff967ec572>] entry_SYSCALL_64_fastpath+0x1a/0xa4
Oct 07 17:47:16 localhost.localdomain kernel: Code: 02 89 c2 45 31 c9 c1 e2 10 85 d2 74 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 8c 01 00 48 03 04 d5 40 58 d3 96 <48> 89 08 8b 41 08 85 c0 75 0
Oct 07 17:47:16 localhost.localdomain kernel: RIP  [<ffffffff960ecd48>] queued_spin_lock_slowpath+0x108/0x190
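
One way to read the register dump above is to check whether the faulting address can be rebuilt from the other registers. The sketch below is only that, a sketch: it assumes RSI holds the value that was read from the spinlock, and that the shifts by 16 and 18 visible in the Code: bytes (c1 e2 10, c1 ea 12) are the decoding of the lock's tail field. The helper decode_tail is a hypothetical name used for illustration, not a kernel function.

```python
# Register values copied from the oops above.
RSI = 0x6E6F746F       # assumption: the word read from the spinlock
RDX = 0x1B9A
RAX = 0x18C70          # same value as CR2, the faulting address
NODE_OFFSET = 0x18C40  # immediate of the "48 05 40 8c 01 00" add in Code:


def decode_tail(lock_word):
    """Split a lock word into (cpu, idx) using the shifts seen in Code:."""
    idx = (lock_word >> 16) & 0x3
    cpu = (lock_word >> 18) - 1
    return cpu, idx


cpu, idx = decode_tail(RSI)

# RAX = (idx << 4) + NODE_OFFSET + per-CPU base, so solve for the base.
per_cpu_base = RAX - NODE_OFFSET - (idx << 4)

# The same word interpreted as four little-endian ASCII bytes.
as_text = RSI.to_bytes(4, "little").decode("ascii")

print(f"cpu={cpu:#x} idx={idx} per_cpu_base={per_cpu_base:#x} text={as_text!r}")
```

Under those assumptions the decoded CPU number matches RDX and is far beyond anything this laptop has, and the word itself is printable ASCII. That would be consistent with the lock sitting in memory that had already been freed and reused, but the trace alone does not prove it.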

I tried to bisect the crash, what with it being nicely reproducible and all, but that effort didn't yield much useful information as many of the intermediate commits either did not build, or the resulting kernel did not recognize the DP monitors at all (thought they were disconnected) when booted up.

Here's the bisect log so far, note the skips due to the issues described above:

git bisect start
# good: [2dcd0af568b0cf583645c8a317dd12e344b1c72a] Linux 4.6
git bisect good 2dcd0af568b0cf583645c8a317dd12e344b1c72a
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
# good: [0694f0c9e20c47063e4237e5f6649ae5ce5a369a] radix tree test suite: remove dependencies on height
git bisect good 0694f0c9e20c47063e4237e5f6649ae5ce5a369a
# bad: [e4f7bdc2ec0d0dcc27f7d70db27a620dfdc1f697] Merge branch 'for-4.7-zac' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
git bisect bad e4f7bdc2ec0d0dcc27f7d70db27a620dfdc1f697
# good: [2f37dd131c5d3a2eac21cd5baf80658b1b02a8ac] Merge tag 'staging-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good 2f37dd131c5d3a2eac21cd5baf80658b1b02a8ac
# bad: [2b669875332fbdff0a7ad559e8662e875e7a1526] drm/msm: Drop load/unload drm_driver ops
git bisect bad 2b669875332fbdff0a7ad559e8662e875e7a1526
# skip: [560ce1dc7c87ade27faaf07d381a9a5a2ffc9934] drm/i915: use drm_crtc_send_vblank_event()
git bisect skip 560ce1dc7c87ade27faaf07d381a9a5a2ffc9934
# good: [bf16200689118d19de1b8d2a3c314fc21f5dc7bb] Linux 4.6-rc3
git bisect good bf16200689118d19de1b8d2a3c314fc21f5dc7bb
# skip: [9cd47424fb410e478e5a97e83ac10263c13ed65c] drm/mode: reduce scope of fb_lock in framebuffer init
git bisect skip 9cd47424fb410e478e5a97e83ac10263c13ed65c
# skip: [187a1c07ec3c19d0c965f95741ed260bbc02040e] drm/i915: Fix oops in vlv_force_pll_on()
git bisect skip 187a1c07ec3c19d0c965f95741ed260bbc02040e
# good: [fbf6d8798fceb1f64eb0e5fd7cd541becfc376cd] drm/i915: Add locking to pll updates, v3.
git bisect good fbf6d8798fceb1f64eb0e5fd7cd541becfc376cd
# skip: [b5bf0f1ea3658254bd72ef64abc97786e8a32255] drm/exynos: clean up register definions for fimd and decon
git bisect skip b5bf0f1ea3658254bd72ef64abc97786e8a32255
# skip: [528948745f6f52f36839b76beeab0632a9f16471] drm/i915: Move gt/pm irq handling out from irq disabled section on VLV
git bisect skip 528948745f6f52f36839b76beeab0632a9f16471
# skip: [71cbf451eb2715865e3dbd0ec55837dac1148d23] drm/radeon: Use lockless gem BO free callback
git bisect skip 71cbf451eb2715865e3dbd0ec55837dac1148d23
# skip: [7c8f6d2577c7565f67ba3f6b9b76f7422710d66e] drm/mode: rework drm_mode_object_put to drm_mode_object_unregister.
git bisect skip 7c8f6d2577c7565f67ba3f6b9b76f7422710d66e
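
To see where the skips cluster, the verdict lines of the log above can be tallied. This is a minimal sketch, not part of the original report: parse_bisect_log is a hypothetical helper, and it only reads the "# verdict: [sha] subject" comment lines that git bisect log emits.

```python
import re
from collections import Counter

# Verdict comment lines copied from the bisect log above.
BISECT_LOG = """\
# good: [2dcd0af568b0cf583645c8a317dd12e344b1c72a] Linux 4.6
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
# good: [0694f0c9e20c47063e4237e5f6649ae5ce5a369a] radix tree test suite: remove dependencies on height
# bad: [e4f7bdc2ec0d0dcc27f7d70db27a620dfdc1f697] Merge branch 'for-4.7-zac' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
# good: [2f37dd131c5d3a2eac21cd5baf80658b1b02a8ac] Merge tag 'staging-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
# bad: [2b669875332fbdff0a7ad559e8662e875e7a1526] drm/msm: Drop load/unload drm_driver ops
# skip: [560ce1dc7c87ade27faaf07d381a9a5a2ffc9934] drm/i915: use drm_crtc_send_vblank_event()
# good: [bf16200689118d19de1b8d2a3c314fc21f5dc7bb] Linux 4.6-rc3
# skip: [9cd47424fb410e478e5a97e83ac10263c13ed65c] drm/mode: reduce scope of fb_lock in framebuffer init
# skip: [187a1c07ec3c19d0c965f95741ed260bbc02040e] drm/i915: Fix oops in vlv_force_pll_on()
# good: [fbf6d8798fceb1f64eb0e5fd7cd541becfc376cd] drm/i915: Add locking to pll updates, v3.
# skip: [b5bf0f1ea3658254bd72ef64abc97786e8a32255] drm/exynos: clean up register definions for fimd and decon
# skip: [528948745f6f52f36839b76beeab0632a9f16471] drm/i915: Move gt/pm irq handling out from irq disabled section on VLV
# skip: [71cbf451eb2715865e3dbd0ec55837dac1148d23] drm/radeon: Use lockless gem BO free callback
# skip: [7c8f6d2577c7565f67ba3f6b9b76f7422710d66e] drm/mode: rework drm_mode_object_put to drm_mode_object_unregister.
"""

LINE_RE = re.compile(r"^# (good|bad|skip): \[([0-9a-f]+)\] (.*)$")


def parse_bisect_log(text):
    """Return (verdict, sha, subject) for each '# verdict: [sha] subject' line."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            entries.append(match.groups())
    return entries


entries = parse_bisect_log(BISECT_LOG)
counts = Counter(verdict for verdict, _, _ in entries)

# Group skipped commits by the subsystem prefix of their subject line.
skipped_prefixes = Counter(
    subject.split(":", 1)[0] for verdict, _, subject in entries if verdict == "skip"
)

print(dict(counts))
for prefix, n in skipped_prefixes.most_common():
    print(f"{n} skipped under {prefix}")
```

All of the skipped commits carry a drm/ prefix. A saved log can be fed back with git bisect replay, and git bisect start accepts path arguments after -- to consider only commits touching those paths (for example drivers/gpu/drm); whether that would avoid the unbuildable commits here is untested.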
Comment 1 Vadim Lobanov 2016-10-12 02:46:42 UTC
Created attachment 127229 [details]
dmesg booting with drm.debug=14
Comment 2 Vadim Lobanov 2016-10-12 18:31:44 UTC
Note that bisect_pending is a bit of a misnomer since I have no idea how to make forward progress on the bisect with the intervening compile/disconnected-port issues that crop up, noted above.
Comment 3 Alejandro Lorenzo 2016-10-24 20:29:32 UTC
I also seem to be affected by this. My system is a Dell Precision 9550 connected to the Dell TB15 docking station over DisplayPort.

Kernel 4.6 works well; 4.7 hangs when switching to a tty if the DisplayPort display has been disconnected at any earlier point. (That is why the system freezes when logging out or powering off: both switch to a tty.)

In my case, the device:

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06) (prog-if 00 [VGA controller])
        DeviceName:  Onboard IGD
        Subsystem: Dell HD Graphics 530
        Flags: bus master, fast devsel, latency 0, IRQ 141
        Memory at db000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 70000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=64]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
Comment 4 Alejandro Lorenzo 2016-10-24 20:32:44 UTC
Correction, the docking station is the WD15
Comment 5 Alejandro Lorenzo 2016-10-24 21:17:03 UTC
Could this be related (or even a duplicate) of #98211 ?
Comment 6 Alejandro Lorenzo 2016-10-24 21:17:32 UTC
(In reply to Alejandro Lorenzo from comment #5)
> Could this be related (or even a duplicate) of #98211 ?

I meant 96938.
Comment 7 Jani Nikula 2016-10-25 09:29:22 UTC
(In reply to Alejandro Lorenzo from comment #6)
> (In reply to Alejandro Lorenzo from comment #5)
> > Could this be related (or even a duplicate) of #98211 ?
> 
> I meant 96938.

And if you say bug 96938 you'll get a nice link.
Comment 8 Alejandro Lorenzo 2016-11-29 08:55:33 UTC
Kernel 4.8.11 seems to improve the situation a lot. The computer no longer crashes.

Still, I think something is not quite right in the GPU after the disconnection, because of some odd behavior I saw, but more testing is required.

However, for those of you affected by this, I would give 4.8.11 a spin :D
Comment 9 Vadim Lobanov 2016-11-29 23:55:09 UTC
I'm running the most up-to-date Fedora 25 environment (kernel 4.8.8-300.fc25.x86_64) and I no longer see the crashes described here. Not sure if it's fixed upstream in vanilla or if it's a local patch that Fedora is carrying.
Comment 10 Alejandro Lorenzo 2016-11-30 09:06:51 UTC
When I tested 4.8.8, the crashes did indeed stop, but DP would not work properly.
At least with my MST DP, the monitor would work OK as long as it was plugged in at boot time. However, if you unplug and replug the monitor, the computer wouldn't crash, but i wouldn't have signal for the DP monitor either.
In 4.8.11 this works.
Comment 11 Andreas Kloeckner 2016-12-01 01:52:31 UTC
Bug 98919 is possibly related.
Comment 12 Jari Tahvanainen 2016-12-19 10:56:16 UTC
Highest+Blocker, as this is a regression without a workaround.
Comment 13 Jari Tahvanainen 2017-01-27 14:04:37 UTC
Vadim, Alejandro, I'm proposing that this be closed based on comment 9 and comment 10, where you said that the crash does not happen anymore.
Let's follow up on the "i wouldn't have signal for the DP monitor either." issue in bug 98919, or if that does not apply, please create a new bug.
Comment 14 Vadim Lobanov 2017-01-27 17:27:56 UTC
Hi Jari,

I would agree. This bug does not happen on my latest kernels any longer.
Comment 15 yann 2017-01-29 09:23:25 UTC
(In reply to Vadim Lobanov from comment #14)
> Hi Jari,
> 
> I would agree. This bug does not happen on my latest kernels any longer.

Thanks for your feedback, Vadim Lobanov. So closing this bug as fixed.
Comment 16 Alejandro Lorenzo 2017-01-30 09:54:16 UTC
Sorry for the late answer. I agree, this doesn't happen in new releases of the kernel.
