Summary: | Occasional kernel BUG when switching connectors/restarting X
---|---
Product: | DRI
Reporter: | Nick Bowler <nbowler>
Component: | DRM/Intel
Assignee: | Chris Wilson <chris>
Status: | CLOSED FIXED
QA Contact: |
Severity: | major
Priority: | medium
Keywords: | NEEDINFO
Version: | unspecified
Hardware: | Other
OS: | All
Whiteboard: |
i915 platform: |
i915 features: |
Attachments: |
Description
Nick Bowler
2010-07-23 13:38:25 UTC
Possibly fixed by: https://patchwork.kernel.org/patch/112571/ ?

Different bug. This is a double unpin, smells like a use-after-free but I think it is a different one than https://bugzilla.kernel.org/attachment.cgi?id=27229

I'll see if I can reproduce it on a laptop.

Not having much luck with my t61. It's a stubborn beast that obstinately refuses to crash, see bug 28811. Will have to do this the old fashioned way and read code.

FWIW, I can't reproduce this on my T500 (GM45) either: only on the desktop machine (G45). However, the connectors on the two systems are different: LVDS & VGA on the laptop, HDMI & VGA on the desktop.

Nick, can you trigger the OOPs whilst you have drm.debug=0xc? I am trying to order the sequence of events. Either we have freed the current fb object or have already unpinned it but left it attached.

As always, does your xsession fire up compiz, or is that a plain X server? (Just wondering if we have other races going on as well.)

int drm_mode_rmfb()
{
    /* TODO release all crtc connected to the framebuffer */
    /* TODO unhock the destructor from the buffer object */
}

* sigh. I think -intel avoids that trap.

Created attachment 37399: Warn, not free, an active fb

Nick, can you try reproducing with this patch? If I am on the right lines we should just start getting errors in dmesg rather than oopses.

Created attachment 37400: Full kernel log with drm.debug=0xc

> Nick, can you try reproducing with this patch? If I am on the right lines we
> should just start getting errors in dmesg rather than oopses.

With the patch, I get the new error messages

[drm:drm_mode_rmfb] *ERROR* tried to remove an active fb

between steps 3 and 4 (i.e., after running xrandr but before removing the cable). However, the kernel still crashes in the same way.

> As always does your xsession fire up compiz, or is that a plain X server?
> (Just wondering if we have other races going on as well.)

No compositing or 3D, just FVWM and a terminal (my normal setup).
Note that this does not feel like a race: it is completely 100% reproducible. The crash always occurs after exactly three iterations of the steps in the original report.

The TODO comment in that function is misleading. The fb is in fact detached by drm_framebuffer_cleanup() called from intel_user_framebuffer_destroy() [via fb->funcs->destroy()]. So the patch is useless.

Nick, can you try http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=drm-testing . Mainly out of curiosity, but it does come with a few interesting patches. ;-)

OK. First off, I've updated my userspace since first reporting this bug. On Linux 2.6.35, I can still reliably reproduce this issue by following the steps in the original report.

Now, on 2.6.36-rc3, I can no longer reproduce this reliably. However, I just encountered this issue (or something very similar to it) today when I restarted my X session, so it doesn't seem to have been fixed. I've just checked out your drm-testing branch and if it crashes I'll let you know (unfortunately, I was using 2.6.36-rc3 for some time before it died on me...).

When I have some spare time, I'll try to bisect the first commit where my original reproduction steps fail.

So I don't know if _this_ bug is fixed, but with the drm-testing branch the following gets spammed repeatedly to the console after I start X. While the ill effects appear to be limited to the warning, the console spam is to the detriment of the system's usefulness and I'm afraid I'm not able to test this branch for very long.
------------[ cut here ]------------
WARNING: at lib/kref.c:34 kref_get+0x1b/0x20()
Hardware name: Aspire X3810
Modules linked in: bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf e1000e
Pid: 2324, comm: X Not tainted 2.6.36-rc3-00054-gc83c440 #1
Call Trace:
[<ffffffff810329e8>] ? warn_slowpath_common+0x78/0x8c
[<ffffffff811342c7>] ? kref_get+0x1b/0x20
[<ffffffff811b78b0>] ? drm_gem_handle_create+0x73/0x7e
[<ffffffff811d25a0>] ? i915_gem_create_ioctl+0x44/0x68
[<ffffffff811b5fb4>] ? drm_ioctl+0x236/0x2ea
[<ffffffff81087040>] ? __do_fault+0x358/0x393
[<ffffffff811d255c>] ? i915_gem_create_ioctl+0x0/0x68
[<ffffffff81088ebc>] ? handle_mm_fault+0x3f4/0x7ab
[<ffffffff810ae4ee>] ? do_vfs_ioctl+0x42f/0x47c
[<ffffffff812d5e97>] ? do_page_fault+0x22f/0x271
[<ffffffff810ae577>] ? sys_ioctl+0x3c/0x5c
[<ffffffff8100292b>] ? system_call_fastpath+0x16/0x1b
---[ end trace 6607a7076c60d2a6 ]---

Working through bug 29857 uncovered the opposite bug.... So wondering if there are in fact two sides to the same bug, can you please test:

git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-staging

In particular:

http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=a9878eb9c2a68e5d06169dc315f8eb91e13c1c20

and

http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=074bdaab420b5bec20f2c46772e6f720cefc446f

OK. I did two tests:

First, since I can reproduce the issue on 2.6.35 more easily, I checked out 2.6.35 and cherry-picked those two commits.
I also needed to cherry-pick 5c8d7171cc4984 ("drm/kms: add crtc disable function") for the second commit to build. The result was worse than before: the kernel crashed at the second 'startx' (rather than the third ctrl+alt+bksp).

Second, I checked out Linus' master and merged drm-intel-staging into it. There was one conflict, in drm_crtc_helper (line 105..108 in Linus' master), which I resolved by combining the two changes:

} else {
    connector->status = connector->funcs->detect(connector, false);
    drm_helper_hpd_irq_event(dev);
}

This kernel crashed after *two* iterations of the steps in the original report. However, the bug was _slightly_ different -- this time the crash was in _pin as opposed to _unpin. (unfortunately, not all of the output made it to disk).

------------[ cut here ]------------
kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4030!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent
CPU 2
Modules linked in: nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf e1000e
Pid: 2388, comm: X Not tainted 2.6.36-rc3-00344-gc8f3615 #27 WG43M/Aspire X3810
RIP: 0010:[<ffffffff811e6491>] [<ffffffff811e6491>] i915_gem_object_pin+0x29/0x170
RSP: 0018:ffff880136be9618 EFLAGS: 00010246
[output truncated]

I am currently running Linus' master, which so far seems to be the same as -rc3 though I haven't seen it crash yet.

Thanks Nick, the pin leak trumps the overzealous unpinning!
Just to confirm: Linus' master (2.6.36-rc6-00119-g3c06806) still crashes after a while, with the BUG in i915_gem_object_pin (i915_gem.c line 4045). The crash always occurs upon terminating the X server, but I can't reproduce it on demand anymore. It also seems that the connectors don't actually have to be yanked: the last several crashes have occurred without my touching any of the connectors at all.

Created attachment 41770: Restore old fb after modesetting failure

I think I finally spotted it!

The good news is that I can reliably reproduce this issue again on 2.6.37. The bad news is that the above patch, applied on top of 2.6.37, does not fix the issue: the crash is the same as the previous report (see below).

Annoyingly, I'm not sure how to get the full crash log. Only the first few lines make it to the system log, and despite having netconsole set up, _none_ of the output made it to the remote machine. Annoyingly, this computer does not have a serial port.

------------[ cut here ]------------
kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4190!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent
CPU 0
Modules linked in: netconsole nfs nfs_acl autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt2800pci rt2800lib crc_ccitt rt2x00pci rt2x00lib mac80211 cfg80211 eeprom_93cx6 e1000e [last unloaded: netconsole]
Pid: 3576, comm: X Not tainted 2.6.37-00001-g3ec088a #114 WG43M/Aspire X3810
RIP: 0010:[<ffffffff811ecbbf>] [<ffffffff811ecbbf>] i915_gem_object_pin+0x30/0x179
RSP: 0018:ffff880134e335e8 EFLAGS: 00010246
[output truncated]

Created attachment 42635: Don't switch fb after a no-op (when disabling)

(In reply to comment #19)
> Created an attachment (id=42635)
> Don't switch fb after a no-op (when disabling)

This looks very promising: I tried latest Linus' git without this patch and crashed the box (interestingly, it crashed in a different way from usual). After applying this patch, I was unable to crash the box.

I'll try 2.6.37 later today and see if it fixes the problem there, too (as that kernel behaved almost exactly as described in the original report).

I do think we have this sussed! Finally!

commit 9334ef755f060e251f3f395caeda1a58b6834ea3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Jan 28 11:53:03 2011 +0000

drm: Don't switch fb when disabling an output

In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which may return early and do no operation if the crtc is to be disabled.
In this case we merrily swap to the new fb, discarding the old_fb believing that it has been cleaned up. However, due to the early return, the old_fb was not presented to the backend for correct reaping, and nor was the new one - which is about to be reaped via the drm_helper_disable_unused_functions(), leading to incorrect refcounting of the pinned objects.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27722
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29230
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
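
For readers following the commit message, here is a minimal toy model of the refcounting mismatch it describes. It is a sketch only, not the kernel code: every name (toy_fb, toy_crtc, toy_set_mode, toy_disable_unused, toy_set_config, toy_run_scenario) is hypothetical, and the pin counter is a plain integer standing in for the real pinned-object accounting. The swap_on_disable flag is an assumption introduced here to select between the behaviour the commit message criticises and the behaviour its title asks for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real DRM/i915 structures. */
struct toy_fb {
    int pin_count;          /* pins held on the fb's backing object */
};

struct toy_crtc {
    struct toy_fb *fb;      /* framebuffer currently attached */
};

/* Models the mode-set step: when disabling, it returns early and
 * neither pins the new fb nor unpins the old one. */
static bool toy_set_mode(struct toy_fb *new_fb, struct toy_fb *old_fb,
                         bool enable)
{
    if (!enable)
        return true;        /* early return, no pin bookkeeping done */
    new_fb->pin_count++;
    if (old_fb)
        old_fb->pin_count--;
    return true;
}

/* Models the later cleanup of a disabled crtc: unpin whatever fb is
 * attached at that point and detach it. */
static void toy_disable_unused(struct toy_crtc *crtc)
{
    if (crtc->fb) {
        crtc->fb->pin_count--;
        crtc->fb = NULL;
    }
}

/* swap_on_disable == true swaps to the new fb even when disabling;
 * false leaves the old fb attached so cleanup unpins the right one. */
static void toy_set_config(struct toy_crtc *crtc, struct toy_fb *new_fb,
                           bool enable, bool swap_on_disable)
{
    struct toy_fb *old_fb = crtc->fb;

    if (enable || swap_on_disable)
        crtc->fb = new_fb;
    toy_set_mode(new_fb, old_fb, enable);
    if (!enable)
        toy_disable_unused(crtc);
}

/* Enable the crtc with fb A, then disable it while passing fb B,
 * and report the final pin counts of both. */
static void toy_run_scenario(bool swap_on_disable, int *a_pins, int *b_pins)
{
    struct toy_fb a = { 0 }, b = { 0 };
    struct toy_crtc crtc = { NULL };

    toy_set_config(&crtc, &a, true, swap_on_disable);
    assert(a.pin_count == 1);   /* A is pinned once while displayed */
    toy_set_config(&crtc, &b, false, swap_on_disable);

    *a_pins = a.pin_count;
    *b_pins = b.pin_count;
}
```

In this model, calling toy_run_scenario(true, ...) leaves A with one leaked pin and drives B's count to -1 (an unpin without a matching pin), while toy_run_scenario(false, ...) leaves both counts at zero. That mirrors the "pin leak" versus "overzealous unpinning" pair discussed in the thread, under the simplifying assumptions stated above.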