After having my kernel crash a couple times randomly when restarting X, I've identified a way to reproduce this in a (what seems to be) reliable fashion. I have two displays (actually, one display with two inputs right now, though it's crashed with two displays as well). One is HDMI, the other VGA. Here's the scenario: 0: boot with HDMI (only) connected 1: connect the VGA cable (screen goes black for a second, then comes back) 2: run startx 3: run xrandr --output VGA1 --off 4: disconnect the VGA cable. 5: press ctrl+alt+backspace to terminate X. Repeat steps 1 through 5 three times, and as soon as you press ctrl+alt+backspace for the third time, the kernel will BUG: Jul 22 10:59:30 emergent kernel: kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4195! Jul 22 10:59:30 emergent kernel: invalid opcode: 0000 [#1] PREEMPT SMP Jul 22 10:59:30 emergent kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:02.1/resource Jul 22 10:59:30 emergent kernel: CPU 0 Jul 22 10:59:30 emergent kernel: Modules linked in: nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg usb_storage ext2 evdev sr_mod cdrom loop tun kqemu acpi_cpufreq mperf ehci_hcd e1000e Jul 22 10:59:30 emergent kernel: Jul 22 10:59:30 emergent kernel: Pid: 24079, comm: X Not tainted 2.6.35-rc4-00043-g140236b #9 WG43M/Aspire X3810 Jul 22 10:59:30 emergent kernel: RIP: 0010:[<ffffffff811cbdca>] [<ffffffff811cbdca>] i915_gem_object_unpin+0x41/0xc2 Jul 22 10:59:30 emergent kernel: RSP: 0018:ffff880124b298d0 EFLAGS: 00010282 Jul 22 10:59:30 emergent kernel: RAX: 00000000fffffff8 RBX: ffff88013fece800 RCX: 000000000002047e Jul 22 10:59:30 emergent kernel: RDX: ffff88013fece000 RSI: ffff88013fec2000 RDI: ffff88013f13bc00 Jul 22 10:59:30 emergent kernel: RBP: ffff88013fec2000 R08: 0000000000020000 R09: 0000000000000000 Jul 22 10:59:30 emergent kernel: R10: ffff88013fcb7a70 R11: 0000000000000000 R12: ffff88013fece000 Jul 22 10:59:30 emergent kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Jul 22 10:59:30 emergent kernel: FS: 00007fc9436f3840(0000) GS:ffff880001600000(0000) knlGS:0000000000000000 Jul 22 10:59:30 emergent kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jul 22 10:59:30 emergent kernel: CR2: 00007fc941f44cb0 CR3: 0000000101886000 CR4: 00000000000406f0 Jul 22 10:59:30 emergent kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 22 10:59:30 emergent kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jul 22 10:59:30 emergent kernel: Process X (pid: 24079, threadinfo ffff880124b28000, task ffff88007579f1a0) Jul 22 10:59:30 emergent kernel: Stack: Jul 22 10:59:30 emergent kernel: ffffffff811d6b42 0000000000000000 0000000000000000 ffff880100003600 Jul 22 10:59:30 emergent kernel: <0> ffff880124b29918 ffff88010adb1000 ffffffff00000000 ffff880077a8ae00 Jul 22 10:59:30 emergent kernel: <0> 00070180812cefcb 00070188000701a4 000701840007019c ffff88013fece020 Jul 22 10:59:30 emergent kernel: Call Trace: Jul 22 10:59:30 emergent kernel: [<ffffffff811d6b42>] ? intel_pipe_set_base+0x352/0x3d2 Jul 22 10:59:30 emergent kernel: [<ffffffff811d7e77>] ? intel_crtc_mode_set+0x12b5/0x1305 Jul 22 10:59:30 emergent kernel: [<ffffffff8103b30a>] ? msleep+0x16/0x1e Jul 22 10:59:30 emergent kernel: [<ffffffff811d67a6>] ? i9xx_crtc_dpms+0x262/0x2ac Jul 22 10:59:30 emergent kernel: [<ffffffff811ae8d1>] ? drm_crtc_helper_set_mode+0x1dc/0x2fb Jul 22 10:59:30 emergent kernel: [<ffffffff811af426>] ? drm_crtc_helper_set_config+0x5ae/0x770 Jul 22 10:59:30 emergent kernel: [<ffffffff811be2b0>] ? drm_mode_setcrtc+0x2a9/0x2d2 Jul 22 10:59:30 emergent kernel: [<ffffffff811be007>] ? drm_mode_setcrtc+0x0/0x2d2 Jul 22 10:59:30 emergent kernel: [<ffffffff811b33c3>] ? drm_ioctl+0x211/0x2f2 Jul 22 10:59:30 emergent kernel: [<ffffffff810aaf70>] ? vfs_ioctl+0x23/0x93 Jul 22 10:59:30 emergent kernel: [<ffffffff810ab432>] ? do_vfs_ioctl+0x3e0/0x417 Jul 22 10:59:30 emergent kernel: [<ffffffff812d2ca9>] ? do_page_fault+0x22f/0x271 Jul 22 10:59:30 emergent kernel: [<ffffffff810ab4a5>] ? sys_ioctl+0x3c/0x5c Jul 22 10:59:30 emergent kernel: [<ffffffff8100292b>] ? system_call_fastpath+0x16/0x1b Jul 22 10:59:30 emergent kernel: Code: 81 e1 ff 3f fc ff c1 f8 1c 41 b8 0f 00 00 00 ff c8 c1 e0 04 c0 f8 04 41 21 c0 41 c1 e0 0e 44 09 c1 84 c0 89 8f b0 00 00 00 79 04 <0f> 0b eb fe 48 83 bf 88 00 00 00 00 75 04 0f 0b eb fe 81 e1 00 Jul 22 10:59:30 emergent kernel: RIP [<ffffffff811cbdca>] i915_gem_object_unpin+0x41/0xc2 Jul 22 10:59:30 emergent kernel: RSP <ffff880124b298d0> Jul 22 10:59:30 emergent kernel: ---[ end trace 0050d9db11a1fa89 ]--- and the display is now dead. The above trace was from around 2.6.35-rc4 but the problem persists in -rc6 (however the line number is now 4196). I can try to get a trace from -rc6 if it is desired. Running latest git everything on an intel G45.
Possibly fixed by: https://patchwork.kernel.org/patch/112571/ ?
Different bug. This is a double unpin, smells like a use-after-free but I think it is a different one than https://bugzilla.kernel.org/attachment.cgi?id=27229 I'll if I can reproduce it on a laptop.
Not having much luck with my t61. It's a stubborn beast that obstinately refuses to crash, see bug 28811. Will have to do this the old fashioned way and read code.
FWIW, I can't reproduce this on my T500 (GM45) either: only on the desktop machine (G45). However, the connectors on the two systems are different: LVDS & VGA on the laptop, HDMI & VGA on the desktop.
Nick, can you trigger the OOPs whilst you have drm.debug=0xc? I am trying to order the sequence of events. Either we have freed the current fb object or have already unpinned it but left it attached. As always does your xsession fire up compiz, or is that a plain X server? (Just wondering if we have other races going on as well.)
int drm_mode_rmfb() { /* TODO release all crtc connected to the framebuffer */ /* TODO unhock the destructor from the buffer object */ } * sigh. I think -intel avoids that trap.
Created attachment 37399 [details] [review] Warn, not free, an active fb Nick, can you try reproducing with this patch? If I am on the right lines we should just start getting errors in dmesg rather than oopses.
Created attachment 37400 [details] Full kernel log with drm.debug=0xc > Nick, can you try reproducing with this patch? If I am on the right lines we > should just start getting errors in dmesg rather than oopses. With the patch, I get the new error messages [drm:drm_mode_rmfb] *ERROR* tried to remove an active fb between steps 3 and 4 (i.e., after running xrandr but before removing the cable). However, the kernel still crashes in the same way. > As always does your xsession fire up compiz, or is that a plain X server? > (Just wondering if we have other races going on as well.) No compositing or 3D, just FVWM and a terminal (my normal setup). Note that this does not feel like a race: it is completely 100% reproducible. The crash always occurs after exactly three iterations of the steps in the original report.
The TODO comment in that function is misleading. The fb is in fact detached by drm_framebuffer_cleanup() called from intel_user_framebuffer_destroy() [via fb->funcs->destroy()]. So the patch is useless.
Nick, can you try http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=drm-testing . Mainly out of curiosity, but it does come with a few interesting patches. ;-)
OK. First off, I've updated my userspace since first reporting this bug. On Linux 2.6.35, I can still reliably reproduce this issue by following the steps in the original report. Now, on 2.6.36-rc3, I can no longer reproduce this reliably. However, I just encountered this issue (or something very similar to it) today when I restarted my X session, so it doesn't seem to have been fixed. I've just checked out your drm-testing branch and if it crashes I'll let you know (unfortunately, I was using 2.6.36-rc3 for some time before it died on me...). When I have some spare time, I'll try to bisect the first commit where my original reproduction steps fail.
So I don't know if _this_ bug is fixed, but with the drm-testing branch the following gets spammed repeatedly to the console after I start X. While the ill effects appear to be limited to the warning, the console spam is to the detriment of the system's usefulness and I'm afraid I'm not able to test this branch for very long. ------------[ cut here ]------------ WARNING: at lib/kref.c:34 kref_get+0x1b/0x20() Hardware name: Aspire X3810 Modules linked in: bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf e1000e Pid: 2324, comm: X Not tainted 2.6.36-rc3-00054-gc83c440 #1 Call Trace: [<ffffffff810329e8>] ? warn_slowpath_common+0x78/0x8c [<ffffffff811342c7>] ? kref_get+0x1b/0x20 [<ffffffff811b78b0>] ? drm_gem_handle_create+0x73/0x7e [<ffffffff811d25a0>] ? i915_gem_create_ioctl+0x44/0x68 [<ffffffff811b5fb4>] ? drm_ioctl+0x236/0x2ea [<ffffffff81087040>] ? __do_fault+0x358/0x393 [<ffffffff811d255c>] ? i915_gem_create_ioctl+0x0/0x68 [<ffffffff81088ebc>] ? handle_mm_fault+0x3f4/0x7ab [<ffffffff810ae4ee>] ? do_vfs_ioctl+0x42f/0x47c [<ffffffff812d5e97>] ? do_page_fault+0x22f/0x271 [<ffffffff810ae577>] ? sys_ioctl+0x3c/0x5c [<ffffffff8100292b>] ? system_call_fastpath+0x16/0x1b ---[ end trace 6607a7076c60d2a6 ]---
Working through bug 29857 uncovered the opposite bug.... So wondering if there are in fact two sides to the same bug, can you please test: git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-staging In particular: http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=a9878eb9c2a68e5d06169dc315f8eb91e13c1c20 and http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=074bdaab420b5bec20f2c46772e6f720cefc446f
OK. I did two tests: First, since I can reproduce the issue on 2.6.35 more easily, I checked out 2.6.35 and cherry-picked those two commits. I also needed to cherry-pick 5c8d7171cc4984 ("drm/kms: add crtc disable function") for the second commit to build. The result was worse than before: the kernel crashed at the second 'startx' (rather than the third ctrl+alt+bksp). Second, I checked out Linus' master and merged drm-intel-staging into it. There was one conflict, in drm_crtc_helper (line 105..108 in Linus' master), which I resolved by combining the two changes: } else { connector->status = connector->funcs->detect(connector, false); drm_helper_hpd_irq_event(dev); } This kernel crashed after *two* iterations of the steps in the original report. However, the bug was _slightly_ different -- this time the crash was in _pin as opposed to _unpin. (unfortunately, not all of the output made it to disk). ------------[ cut here ]------------ kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4030! invalid opcode: 0000 [#1] PREEMPT SMP last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent CPU 2 Modules linked in: nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf e1000e Pid: 2388, comm: X Not tainted 2.6.36-rc3-00344-gc8f3615 #27 WG43M/Aspire X3810 RIP: 0010:[<ffffffff811e6491>] [<ffffffff811e6491>] i915_gem_object_pin+0x29/0x170 RSP: 0018:ffff880136be9618 EFLAGS: 00010246 [output truncated] I am currently running Linus' master, which so far seems to be the same as -rc3 though I haven't seen it crash yet.
Thanks Nick, the pin leak trumps the overzealous unpinning!
Just to confirm: Linus' master (2.6.36-rc6-00119-g3c06806) still crashes after a while, with the the BUG in i915_gem_object_pin (i915_gem.c line 4045). The crash always occurs upon terminating the X server, but I can't reproduce it on demand anymore. It also seems that the connectors don't actually have to be yanked: the last several crashes have occurred without my touching any of the connectors at all.
Created attachment 41770 [details] [review] Restore old fb after modesetting failure I think I finally spotted it!
The good news is that I can reliably reproduce this issue again on 2.6.37. The bad news is that the above patch, applied on top of 2.6.37, does not fix the issue: the crash is the same as the previous report (see below). Annoyingly, I'm not sure how to get the full crash log. Only the first few lines make it to the system log, and despite having netconsole set up, _none_ of the output made it to the remote machine. Annoyingly, this computer does not have a serial port. ------------[ cut here ]------------ kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4190! invalid opcode: 0000 [#1] PREEMPT SMP last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent CPU 0 Modules linked in: netconsole nfs nfs_acl autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt2800pci rt2800lib crc_ccitt rt2x00pci rt2x00lib mac80211 cfg80211 eeprom_93cx6 e1000e [last unloaded: netconsole] Pid: 3576, comm: X Not tainted 2.6.37-00001-g3ec088a #114 WG43M/Aspire X3810 RIP: 0010:[<ffffffff811ecbbf>] [<ffffffff811ecbbf>] i915_gem_object_pin+0x30/0x179 RSP: 0018:ffff880134e335e8 EFLAGS: 00010246 [output truncated]
Created attachment 42635 [details] [review] Don't switch fb after a no-op (when disabling)
(In reply to comment #19) > Created an attachment (id=42635) [details] > Don't switch fb after a no-op (when disabling) This looks very promising: I tried latest Linus' git without this patch and crashed the box (interestingly, it crashed in a different way from usual). After applying this patch, I was unable to crash the box. I'll try 2.6.37 later today and see if it fixes the problem there, too (as that kernel behaved almost exactly as described in the original report).
I do think we have this sussed! Finally! commit 9334ef755f060e251f3f395caeda1a58b6834ea3 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Jan 28 11:53:03 2011 +0000 drm: Don't switch fb when disabling an output In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which may return early and do no operation if the crtc is to be disabled. In this case we merrily swap to the new fb, discarding the old_fb believing that it has been cleaned up. However, due to the early return, the old_fb was not presented to the backend for correct reaping, and nor was the new one - which is about to be reaped via the drm_helper_disable_unused_functions(), leading to incorrect refcounting of the pinned objects. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27722 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29230 Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.