Bug 29230

Summary: Occasional kernel BUG when switching connectors/restarting X
Product: DRI Reporter: Nick Bowler <nbowler>
Component: DRM/IntelAssignee: Chris Wilson <chris>
Status: CLOSED FIXED QA Contact:
Severity: major    
Priority: medium Keywords: NEEDINFO
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Warn, not free, an active fb
none
Full kernel log with drm.debug=0xc
none
Restore old fb after modesetting failure
none
Don't switch fb after a no-op (when disabling) none

Description Nick Bowler 2010-07-23 13:38:25 UTC
After having my kernel crash a couple times randomly when restarting X, I've
identified a way to reproduce this in a (what seems to be) reliable fashion.

I have two displays (actually, one display with two inputs right now, though
it's crashed with two displays as well).  One is HDMI, the other VGA.  Here's
the scenario:

  0:  boot with HDMI (only) connected
  
  1:  connect the VGA cable (screen goes black for a second, then comes back)
  2:  run startx
  3:  run xrandr --output VGA1 --off
  4:  disconnect the VGA cable.
  5:  press ctrl+alt+backspace to terminate X.

Repeat steps 1 through 5 three times, and as soon as you press
ctrl+alt+backspace for the third time, the kernel will BUG:

  Jul 22 10:59:30 emergent kernel: kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4195!
  Jul 22 10:59:30 emergent kernel: invalid opcode: 0000 [#1] PREEMPT SMP 
  Jul 22 10:59:30 emergent kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:02.1/resource
  Jul 22 10:59:30 emergent kernel: CPU 0 
  Jul 22 10:59:30 emergent kernel: Modules linked in: nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg usb_storage ext2 evdev sr_mod cdrom loop tun kqemu acpi_cpufreq mperf ehci_hcd e1000e
  Jul 22 10:59:30 emergent kernel: 
  Jul 22 10:59:30 emergent kernel: Pid: 24079, comm: X Not tainted 2.6.35-rc4-00043-g140236b #9 WG43M/Aspire X3810
  Jul 22 10:59:30 emergent kernel: RIP: 0010:[<ffffffff811cbdca>]  [<ffffffff811cbdca>] i915_gem_object_unpin+0x41/0xc2
  Jul 22 10:59:30 emergent kernel: RSP: 0018:ffff880124b298d0  EFLAGS: 00010282
  Jul 22 10:59:30 emergent kernel: RAX: 00000000fffffff8 RBX: ffff88013fece800 RCX: 000000000002047e
  Jul 22 10:59:30 emergent kernel: RDX: ffff88013fece000 RSI: ffff88013fec2000 RDI: ffff88013f13bc00
  Jul 22 10:59:30 emergent kernel: RBP: ffff88013fec2000 R08: 0000000000020000 R09: 0000000000000000
  Jul 22 10:59:30 emergent kernel: R10: ffff88013fcb7a70 R11: 0000000000000000 R12: ffff88013fece000
  Jul 22 10:59:30 emergent kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  Jul 22 10:59:30 emergent kernel: FS:  00007fc9436f3840(0000) GS:ffff880001600000(0000) knlGS:0000000000000000
  Jul 22 10:59:30 emergent kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  Jul 22 10:59:30 emergent kernel: CR2: 00007fc941f44cb0 CR3: 0000000101886000 CR4: 00000000000406f0
  Jul 22 10:59:30 emergent kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Jul 22 10:59:30 emergent kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Jul 22 10:59:30 emergent kernel: Process X (pid: 24079, threadinfo ffff880124b28000, task ffff88007579f1a0)
  Jul 22 10:59:30 emergent kernel: Stack:
  Jul 22 10:59:30 emergent kernel: ffffffff811d6b42 0000000000000000 0000000000000000 ffff880100003600
  Jul 22 10:59:30 emergent kernel: <0> ffff880124b29918 ffff88010adb1000 ffffffff00000000 ffff880077a8ae00
  Jul 22 10:59:30 emergent kernel: <0> 00070180812cefcb 00070188000701a4 000701840007019c ffff88013fece020
  Jul 22 10:59:30 emergent kernel: Call Trace:
  Jul 22 10:59:30 emergent kernel: [<ffffffff811d6b42>] ? intel_pipe_set_base+0x352/0x3d2
  Jul 22 10:59:30 emergent kernel: [<ffffffff811d7e77>] ? intel_crtc_mode_set+0x12b5/0x1305
  Jul 22 10:59:30 emergent kernel: [<ffffffff8103b30a>] ? msleep+0x16/0x1e
  Jul 22 10:59:30 emergent kernel: [<ffffffff811d67a6>] ? i9xx_crtc_dpms+0x262/0x2ac
  Jul 22 10:59:30 emergent kernel: [<ffffffff811ae8d1>] ? drm_crtc_helper_set_mode+0x1dc/0x2fb
  Jul 22 10:59:30 emergent kernel: [<ffffffff811af426>] ? drm_crtc_helper_set_config+0x5ae/0x770
  Jul 22 10:59:30 emergent kernel: [<ffffffff811be2b0>] ? drm_mode_setcrtc+0x2a9/0x2d2
  Jul 22 10:59:30 emergent kernel: [<ffffffff811be007>] ? drm_mode_setcrtc+0x0/0x2d2
  Jul 22 10:59:30 emergent kernel: [<ffffffff811b33c3>] ? drm_ioctl+0x211/0x2f2
  Jul 22 10:59:30 emergent kernel: [<ffffffff810aaf70>] ? vfs_ioctl+0x23/0x93
  Jul 22 10:59:30 emergent kernel: [<ffffffff810ab432>] ? do_vfs_ioctl+0x3e0/0x417
  Jul 22 10:59:30 emergent kernel: [<ffffffff812d2ca9>] ? do_page_fault+0x22f/0x271
  Jul 22 10:59:30 emergent kernel: [<ffffffff810ab4a5>] ? sys_ioctl+0x3c/0x5c
  Jul 22 10:59:30 emergent kernel: [<ffffffff8100292b>] ? system_call_fastpath+0x16/0x1b
  Jul 22 10:59:30 emergent kernel: Code: 81 e1 ff 3f fc ff c1 f8 1c 41 b8 0f 00 00 00 ff c8 c1 e0 04 c0 f8 04 41 21 c0 41 c1 e0 0e 44 09 c1 84 c0 89 8f b0 00 00 00 79 04 <0f> 0b eb fe 48 83 bf 88 00 00 00 00 75 04 0f 0b eb fe 81 e1 00 
  Jul 22 10:59:30 emergent kernel: RIP  [<ffffffff811cbdca>] i915_gem_object_unpin+0x41/0xc2
  Jul 22 10:59:30 emergent kernel: RSP <ffff880124b298d0>
  Jul 22 10:59:30 emergent kernel: ---[ end trace 0050d9db11a1fa89 ]---

and the display is now dead.  The above trace was from around 2.6.35-rc4 but
the problem persists in -rc6 (however the line number is now 4196).  I can
try to get a trace from -rc6 if it is desired.

Running latest git everything on an intel G45.
Comment 1 Jesse Barnes 2010-07-23 13:44:08 UTC
Possibly fixed by:

https://patchwork.kernel.org/patch/112571/

?
Comment 2 Chris Wilson 2010-07-23 14:55:20 UTC
Different bug. This is a double unpin, smells like a use-after-free but I think it is a different one than https://bugzilla.kernel.org/attachment.cgi?id=27229

I'll if I can reproduce it on a laptop.
Comment 3 Chris Wilson 2010-07-26 07:16:38 UTC
Not having much luck with my t61. It's a stubborn beast that obstinately refuses to crash, see bug 28811. Will have to do this the old fashioned way and read code.
Comment 4 Nick Bowler 2010-07-26 07:30:15 UTC
FWIW, I can't reproduce this on my T500 (GM45) either: only on the desktop
machine (G45).  However, the connectors on the two systems are different:
LVDS & VGA on the laptop, HDMI & VGA on the desktop.
Comment 5 Chris Wilson 2010-07-26 08:41:02 UTC
Nick, can you trigger the OOPs whilst you have drm.debug=0xc? I am trying to order the sequence of events. Either we have freed the current fb object or have already unpinned it but left it attached.

As always does your xsession fire up compiz, or is that a plain X server? (Just wondering if we have other races going on as well.)
Comment 6 Chris Wilson 2010-07-26 09:03:24 UTC
int drm_mode_rmfb() {
 /* TODO release all crtc connected to the framebuffer */
 /* TODO unhock the destructor from the buffer object */
}

* sigh.

I think -intel avoids that trap.
Comment 7 Chris Wilson 2010-07-26 09:18:54 UTC
Created attachment 37399 [details] [review]
Warn, not free, an active fb

Nick, can you try reproducing with this patch? If I am on the right lines we should just start getting errors in dmesg rather than oopses.
Comment 8 Nick Bowler 2010-07-26 10:25:04 UTC
Created attachment 37400 [details]
Full kernel log with drm.debug=0xc

> Nick, can you try reproducing with this patch? If I am on the right lines we
> should just start getting errors in dmesg rather than oopses.

With the patch, I get the new error messages

  [drm:drm_mode_rmfb] *ERROR* tried to remove an active fb

between steps 3 and 4 (i.e., after running xrandr but before removing the
cable).  However, the kernel still crashes in the same way.

> As always does your xsession fire up compiz, or is that a plain X server?
> (Just wondering if we have other races going on as well.)

No compositing or 3D, just FVWM and a terminal (my normal setup).  Note that
this does not feel like a race: it is completely 100% reproducible.  The crash
always occurs after exactly three iterations of the steps in the original
report.
Comment 9 Chris Wilson 2010-08-01 05:14:30 UTC
The TODO comment in that function is misleading. The fb is in fact detached by drm_framebuffer_cleanup() called from intel_user_framebuffer_destroy() [via fb->funcs->destroy()].

So the patch is useless.
Comment 10 Chris Wilson 2010-08-20 03:06:58 UTC
Nick, can you try http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=drm-testing . Mainly out of curiosity, but it does come with a few interesting patches. ;-)
Comment 11 Nick Bowler 2010-09-07 07:47:25 UTC
OK.  First off, I've updated my userspace since first reporting this bug.  On
Linux 2.6.35, I can still reliably reproduce this issue by following the steps
in the original report.

Now, on 2.6.36-rc3, I can no longer reproduce this reliably.  However, I just
encountered this issue (or something very similar to it) today when I restarted
my X session, so it doesn't seem to have been fixed.  I've just checked out
your drm-testing branch and if it crashes I'll let you know (unfortunately,
I was using 2.6.36-rc3 for some time before it died on me...).

When I have some spare time, I'll try to bisect the first commit where my
original reproduction steps fail.
Comment 12 Nick Bowler 2010-09-07 10:50:34 UTC
So I don't know if _this_ bug is fixed, but with the drm-testing branch the
following gets spammed repeatedly to the console after I start X.  While the
ill effects appear to be limited to the warning, the console spam is to the
detriment of the system's usefulness and I'm afraid I'm not able to test
this branch for very long.

  ------------[ cut here ]------------
  WARNING: at lib/kref.c:34 kref_get+0x1b/0x20()
  Hardware name: Aspire X3810
  Modules linked in: bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf e1000e
  Pid: 2324, comm: X Not tainted 2.6.36-rc3-00054-gc83c440 #1
  Call Trace:
  [<ffffffff810329e8>] ? warn_slowpath_common+0x78/0x8c
  [<ffffffff811342c7>] ? kref_get+0x1b/0x20
  [<ffffffff811b78b0>] ? drm_gem_handle_create+0x73/0x7e
  [<ffffffff811d25a0>] ? i915_gem_create_ioctl+0x44/0x68
  [<ffffffff811b5fb4>] ? drm_ioctl+0x236/0x2ea
  [<ffffffff81087040>] ? __do_fault+0x358/0x393
  [<ffffffff811d255c>] ? i915_gem_create_ioctl+0x0/0x68
  [<ffffffff81088ebc>] ? handle_mm_fault+0x3f4/0x7ab
  [<ffffffff810ae4ee>] ? do_vfs_ioctl+0x42f/0x47c
  [<ffffffff812d5e97>] ? do_page_fault+0x22f/0x271
  [<ffffffff810ae577>] ? sys_ioctl+0x3c/0x5c
  [<ffffffff8100292b>] ? system_call_fastpath+0x16/0x1b
  ---[ end trace 6607a7076c60d2a6 ]---
Comment 13 Chris Wilson 2010-09-10 05:05:05 UTC
Working through bug 29857 uncovered the opposite bug.... So wondering if there are in fact two sides to the same bug, can you please test:

git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-staging

In particular:

http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=a9878eb9c2a68e5d06169dc315f8eb91e13c1c20

and

http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=074bdaab420b5bec20f2c46772e6f720cefc446f
Comment 14 Nick Bowler 2010-09-10 07:06:51 UTC
OK.  I did two tests:

First, since I can reproduce the issue on 2.6.35 more easily, I checked out
2.6.35 and cherry-picked those two commits.  I also needed to cherry-pick
5c8d7171cc4984 ("drm/kms: add crtc disable function") for the second commit to
build.  The result was worse than before: the kernel crashed at the second
'startx' (rather than the third ctrl+alt+bksp).

Second, I checked out Linus' master and merged drm-intel-staging into it.
There was one conflict, in drm_crtc_helper (line 105..108 in Linus' master),
which I resolved by combining the two changes:

  } else {
    connector->status = connector->funcs->detect(connector, false);
    drm_helper_hpd_irq_event(dev);
  }

This kernel crashed after *two* iterations of the steps in the original report.
However, the bug was _slightly_ different -- this time the crash was in _pin as
opposed to _unpin.  (unfortunately, not all of the output made it to disk).

  ------------[ cut here ]------------
  kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4030!
  invalid opcode: 0000 [#1] PREEMPT SMP 
  last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent
  CPU 2 
  Modules linked in: nfs nfs_acl bridge stp llc autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf e1000e
  
  Pid: 2388, comm: X Not tainted 2.6.36-rc3-00344-gc8f3615 #27 WG43M/Aspire X3810
  RIP: 0010:[<ffffffff811e6491>]  [<ffffffff811e6491>] i915_gem_object_pin+0x29/0x170
  RSP: 0018:ffff880136be9618  EFLAGS: 00010246
  [output truncated]

I am currently running Linus' master, which so far seems to be the same as -rc3
though I haven't seen it crash yet.
Comment 15 Chris Wilson 2010-09-11 01:12:47 UTC
Thanks Nick, the pin leak trumps the overzealous unpinning!
Comment 16 Nick Bowler 2010-10-06 10:55:35 UTC
Just to confirm: Linus' master (2.6.36-rc6-00119-g3c06806) still crashes after
a while, with the the BUG in i915_gem_object_pin (i915_gem.c line 4045).  The
crash always occurs upon terminating the X server, but I can't reproduce it on
demand anymore.

It also seems that the connectors don't actually have to be yanked: the last
several crashes have occurred without my touching any of the connectors at
all.
Comment 17 Chris Wilson 2011-01-08 07:16:06 UTC
Created attachment 41770 [details] [review]
Restore old fb after modesetting failure

I think I finally spotted it!
Comment 18 Nick Bowler 2011-01-11 12:29:49 UTC
The good news is that I can reliably reproduce this issue again on 2.6.37.

The bad news is that the above patch, applied on top of 2.6.37, does not fix
the issue: the crash is the same as the previous report (see below).

Annoyingly, I'm not sure how to get the full crash log.  Only the first few
lines make it to the system log, and despite having netconsole set up, _none_
of the output made it to the remote machine.  Annoyingly, this computer does
not have a serial port.

  ------------[ cut here ]------------
  kernel BUG at /scratch_space/linux-2.6/drivers/gpu/drm/i915/i915_gem.c:4190!
  invalid opcode: 0000 [#1] PREEMPT SMP 
  last sysfs file: /sys/devices/virtual/vtconsole/vtcon1/uevent
  CPU 0 
  Modules linked in: netconsole nfs nfs_acl autofs4 nfsd lockd sunrpc exportfs ipv6 iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc sg evdev usb_storage ext2 ehci_hcd sr_mod cdrom loop tun acpi_cpufreq mperf arc4 ecb crypto_blkcipher cryptomgr aead crypto_algapi rt2800pci rt2800lib crc_ccitt rt2x00pci rt2x00lib mac80211 cfg80211 eeprom_93cx6 e1000e [last unloaded: netconsole]
  
  Pid: 3576, comm: X Not tainted 2.6.37-00001-g3ec088a #114 WG43M/Aspire X3810
  RIP: 0010:[<ffffffff811ecbbf>]  [<ffffffff811ecbbf>] i915_gem_object_pin+0x30/0x179
  RSP: 0018:ffff880134e335e8  EFLAGS: 00010246
  [output truncated]
Comment 19 Chris Wilson 2011-01-28 04:03:09 UTC
Created attachment 42635 [details] [review]
Don't switch fb after a no-op (when disabling)
Comment 20 Nick Bowler 2011-01-28 09:18:55 UTC
(In reply to comment #19)
> Created an attachment (id=42635) [details]
> Don't switch fb after a no-op (when disabling)

This looks very promising: I tried latest Linus' git without this patch and
crashed the box (interestingly, it crashed in a different way from usual).
After applying this patch, I was unable to crash the box.  I'll try 2.6.37
later today and see if it fixes the problem there, too (as that kernel
behaved almost exactly as described in the original report).
Comment 21 Chris Wilson 2011-02-01 01:31:48 UTC
I do think we have this sussed! Finally!

commit 9334ef755f060e251f3f395caeda1a58b6834ea3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 28 11:53:03 2011 +0000

    drm: Don't switch fb when disabling an output
    
    In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which
    may return early and do no operation if the crtc is to be disabled. In
    this case we merrily swap to the new fb, discarding the old_fb believing
    that it has been cleaned up. However, due to the early return, the
    old_fb was not presented to the backend for correct reaping, and nor was
    the new one - which is about to be reaped via the
    drm_helper_disable_unused_functions(), leading to incorrect refcounting
    of the pinned objects.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27722
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29230
    Tested-by: Takashi Iwai <tiwai@suse.de>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.