34787 – SIGSEGV on monitor hot-unplug

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team


Reported:	2011-02-26 21:06 UTC by Bernie Innocenti
Modified:	2012-04-19 15:51 UTC
CC List:	0 users

Attachments
dmesg while the X server was hung (123.48 KB, application/octet-stream) 2011-03-09 12:43 UTC, Bernie Innocenti
Xorg.log from the hung X server (115.08 KB, application/octet-stream) 2011-03-09 12:44 UTC, Bernie Innocenti
gdb backtrace of Compiz (hung in glXSwapBuffers()) (3.71 KB, application/octet-stream) 2011-03-09 12:47 UTC, Bernie Innocenti
strace of the Xorg process while it's hung (fd 8 is /dev/dri/card0) (108.29 KB, application/x-gzip) 2011-03-09 12:51 UTC, Bernie Innocenti
dmesg output of 2.6.38, taken while Xorg was hung with a black screen (121.97 KB, application/octet-stream) 2011-03-21 20:45 UTC, Bernie Innocenti

Description Bernie Innocenti 2011-02-26 21:06:33 UTC

The X server sometimes crashes on monitor unplug, with this error:


[ 54874.313] (II) config/udev: removing device Targus Soft-Touch Bluetooth Mouse
[ 54874.314] (II) Targus Soft-Touch Bluetooth Mouse: Close
[ 54874.314] (II) UnloadModule: "evdev"
[ 54874.315] (II) Unloading evdev
[ 85626.868] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[ 85733.985] (II) intel(0): EDID vendor "LEN", prod id 16401
[ 85733.985] (II) intel(0): Printing DDC gathered Modelines:
[ 85733.985] (II) intel(0): Modeline "1280x800"x0.0   68.94  1280 1296 1344 1408  800 801 804 816 -hsync -vsync (49.0 kHz)
[ 85733.985] (II) intel(0): Modeline "1280x800"x0.0   60.96  1280 1328 1360 1478  800 803 809 825 -hsync -vsync (41.2 kHz)
[ 85734.185] (II) intel(0): EDID vendor "LEN", prod id 16401
[ 85734.185] (II) intel(0): Printing DDC gathered Modelines:
[ 85734.185] (II) intel(0): Modeline "1280x800"x0.0   68.94  1280 1296 1344 1408  800 801 804 816 -hsync -vsync (49.0 kHz)
[ 85734.185] (II) intel(0): Modeline "1280x800"x0.0   60.96  1280 1328 1360 1478  800 803 809 825 -hsync -vsync (41.2 kHz)
[ 85734.562] (II) intel(0): Allocated new frame buffer 1920x1080 stride 7680, tiled
[ 85735.126] (EE) intel(0): [DRI2] DRI2SwapBuffers: drawable has no back or front?
[ 85735.419] 
Backtrace:
[ 85735.422] 0: /usr/bin/Xorg (xorg_backtrace+0x2f) [0x4a120f]
[ 85735.423] 1: /usr/bin/Xorg (0x400000+0x61da6) [0x461da6]
[ 85735.423] 2: /lib64/libc.so.6 (0x3000400000+0x33140) [0x3000433140]
[ 85735.423] 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f2a08675000+0x23b22) [0x7f2a08698b22]
[ 85735.423] 4: /usr/lib64/xorg/modules/extensions/libdri2.so (0x7f2a088c5000+0x2370) [0x7f2a088c7370]
[ 85735.423] 5: /usr/lib64/xorg/modules/extensions/libdri2.so (DRI2GetBuffersWithFormat+0x14) [0x7f2a088c74a4]
[ 85735.423] 6: /usr/lib64/xorg/modules/extensions/libdri2.so (0x7f2a088c5000+0x3d1c) [0x7f2a088c8d1c]
[ 85735.423] 7: /usr/bin/Xorg (0x400000+0x2e6a1) [0x42e6a1]
[ 85735.423] 8: /usr/bin/Xorg (0x400000+0x2292a) [0x42292a]
[ 85735.423] 9: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x300041ee5d]
[ 85735.423] 10: /usr/bin/Xorg (0x400000+0x22c11) [0x422c11]
[ 85735.423] Segmentation fault at address (nil)
[ 85735.423] 
Fatal server error:
[ 85735.423] Caught signal 11 (Segmentation fault). Server aborting


The 0x23b22 offset in intel_drv.so corresponds to:

/usr/src/debug/xf86-video-intel-2.14.0/src/intel_dri.c:388
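
The offset comes straight from frame 3 of the backtrace, which prints the module's load base, the offset, and the absolute address. A minimal sketch of that arithmetic follows; `module_offset` is a hypothetical helper name, not part of any X.org tool. The resulting offset is what one would typically hand to a symbolizer such as addr2line, together with the matching debuginfo.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the load base of a shared object and an absolute address from a
 * backtrace frame, return the offset inside the object.
 * (Hypothetical helper for illustration only.)
 */
uint64_t module_offset(uint64_t load_base, uint64_t absolute_addr)
{
    /* A frame address below the load base cannot belong to this module. */
    assert(absolute_addr >= load_base);
    return absolute_addr - load_base;
}
```

For frame 3 above, the base is 0x7f2a08675000 and the absolute address is 0x7f2a08698b22, which gives the 0x23b22 quoted here.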

The line is in I830DRI2DestroyBuffer():
387         I830DRI2BufferPrivatePtr private = buffer->driverPrivate;
388         if (--private->refcnt == 0) {
389             ScreenPtr screen = private->pixmap->drawable.pScreen;

So buffer->driverPrivate was probably NULL?
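
If driverPrivate really can be NULL at this point, a guard along the following lines would avoid the dereference. This is only a sketch under that assumption: the structs are simplified stand-ins rather than the real DRI2 or xf86-video-intel types, and `destroy_buffer_guarded` is a hypothetical name, not the driver's function or the fix that was actually committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the driver's per-buffer private data. */
struct buffer_private {
    int refcnt;
};

/* Simplified stand-in for a DRI2 buffer. */
struct dri2_buffer {
    struct buffer_private *driverPrivate;
};

/*
 * Drop one reference on a buffer.
 * Returns  1 if the last reference was dropped and the buffer was freed,
 *          0 if references remain,
 *         -1 if the buffer was NULL or carried no private data.
 */
int destroy_buffer_guarded(struct dri2_buffer *buffer)
{
    if (buffer == NULL)
        return -1;

    struct buffer_private *private = buffer->driverPrivate;
    if (private == NULL) {
        /* Not one of ours: there is no refcount to touch, free the shell. */
        free(buffer);
        return -1;
    }

    assert(private->refcnt > 0);
    if (--private->refcnt == 0) {
        free(private);
        free(buffer);
        return 1;
    }
    return 0;
}
```

A guard like this only hides the symptom; it does not explain how a buffer without private data reached the driver in the first place.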

Comment 1 Chris Wilson 2011-02-27 02:55:01 UTC

The commit below looks to be papering over an underlying bug. Can you keep your eyes open for more "DRI2SwapBuffers: drawable has no back or front?" messages?

commit e889d3a709b55a0731ab098b17a3364b9bf39387
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Feb 27 10:51:50 2011 +0000

    dri: Protect against destroying a foreign DRI drawable
    
    I have no clue as to how such an alien drawable reached us, but we have
    the evidence of a segfault to say it can happen.
    
    Reported-by: Bernie Innocenti <bernie@codewiz.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34787
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 2 Bernie Innocenti 2011-02-27 19:14:08 UTC

(In reply to comment #1)
> This looks to be papering over an underlying bug. Can you keep your eyes open
> for more "DRI2SwapBuffers: drawable has no back or front?"

Thanks for the fast response.

I've rebuilt the intel driver package from a git snapshot and I'm currently testing it. I'll check the Xorg logs from time to time for more instances of this error.

Comment 3 Bernie Innocenti 2011-03-08 08:33:46 UTC

(In reply to comment #2)
> I've rebuilt the intel driver package from a git snapshot and I'm currently
> testing it. I'll check the Xorg logs from time to time for more instances of
> this error.

Unfortunately, I experienced a new crash on hot-unplug after applying the
proposed patch.

Here's the backtrace:

[160859.189] 0: /usr/bin/Xorg (xorg_backtrace+0x2f) [0x4a120f]
[160859.189] 1: /usr/bin/Xorg (0x400000+0x61da6) [0x461da6]
[160859.189] 2: /lib64/libc.so.6 (0x3000400000+0x33140) [0x3000433140]
[160859.189] 3: /lib64/libc.so.6 (cfree+0x3c) [0x300047a52c]
[160859.189] 4: /usr/lib64/xorg/modules/extensions/libdri2.so (0x7f3462f9c000+0x2370) [0x7f3462f9e370]
[160859.189] 5: /usr/lib64/xorg/modules/extensions/libdri2.so (DRI2GetBuffersWithFormat+0x14) [0x7f3462f9e4a4]
[160859.189] 6: /usr/lib64/xorg/modules/extensions/libdri2.so (0x7f3462f9c000+0x3d1c) [0x7f3462f9fd1c]
[160859.189] 7: /usr/bin/Xorg (0x400000+0x2e6a1) [0x42e6a1]
[160859.189] 8: /usr/bin/Xorg (0x400000+0x2292a) [0x42292a]
[160859.189] 9: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x300041ee5d]
[160859.189] 10: /usr/bin/Xorg (0x400000+0x22c11) [0x422c11]
[160859.189] Segmentation fault at address (nil)


The last call in libdri2.so is in do_get_buffers(), line 501:

 499     for (i = 0; i < count; i++) {
 500         if (buffers[i] != NULL)
 501             (*ds->DestroyBuffer)(pDraw, buffers[i]);  <---
 502     }

The call to cfree() in glibc seems bogus. Maybe we crashed somewhere in
I830DRI2DestroyBuffer()? It seems likely that the driverPrivate may be NULL
here as well.

Comment 4 Bernie Innocenti 2011-03-08 08:42:48 UTC

This morning I experienced a different failure mode, possibly related to this bug:

1. I unplugged my DisplayPort monitor.
2. The LVDS output lit up with a uniform gray background (which matches my GNOME background).
3. The cursor continued to move, but nothing else happened.
4. I could switch to the console by hitting CTRL-ALT-F1.
5. On the console, I saw GPU hung messages (I don't remember the exact text), approximately one per second.
6. There were definitely other messages intermixed on the console, but I can't remember what they said.
7. After switching virtual consoles a few times, I finally managed to hang the machine completely.

I was running 2.6.35.11-83.fc14.x86_64. Newer kernels hang during PM resume on my Lenovo X201, so I cannot test them. If you can point me to a drm patch that applies to 2.6.35, I could build a custom kernel.

Comment 5 Bernie Innocenti 2011-03-09 12:41:37 UTC

Today it happened again, but the console did not hang, so I could collect more data (attached).

Compiz was definitely involved in some way: when I killed it, the X server suddenly came back to life!

Comment 6 Bernie Innocenti 2011-03-09 12:43:39 UTC

Created attachment 44280
dmesg while the X server was hung

Comment 7 Bernie Innocenti 2011-03-09 12:44:24 UTC

Created attachment 44281
Xorg.log from the hung X server

Comment 8 Bernie Innocenti 2011-03-09 12:47:59 UTC

Created attachment 44282
gdb backtrace of Compiz (hung in glXSwapBuffers())

Comment 9 Bernie Innocenti 2011-03-09 12:51:15 UTC

Created attachment 44283
strace of the Xorg process while it's hung (fd 8 is /dev/dri/card0)

Comment 10 Bernie Innocenti 2011-03-09 15:27:57 UTC

I filed a bug in Fedora against Compiz which is loosely related to this one:

  https://bugzilla.redhat.com/show_bug.cgi?id=664094

Comment 11 Bernie Innocenti 2011-03-21 20:36:55 UTC

Today, when unplugging a VGA monitor, I got this kernel oops:

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:1a.0/usb1/idVendor
CPU 0 
Modules linked in: pl2303 hidp fuse usb_storage rfcomm sco bnep l2cap vboxnetadp vboxnetflt vboxdrv coretemp ipv6 cpufreq_ondemand acpi_cpufreq freq_table mperf kvm_intel kvm uinput qcserial usb_wwan arc4 i2400m_usb i2400m ecb wimax snd_hda_codec_intelhdmi btusb bluetooth usbserial snd_hda_codec_conexant iwlagn snd_hda_intel snd_hda_codec snd_hwdep snd_seq iwlcore snd_seq_device snd_pcm mac80211 thinkpad_acpi cfg80211 snd_timer e1000e snd snd_page_alloc joydev iTCO_wdt i2c_i801 wmi iTCO_vendor_support rfkill microcode soundcore i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
Pid: 2080, comm: kslowd002 Not tainted 2.6.35.11-83.fc14.x86_64 #1 3249CTO/3249CTO
RIP: 0010:[<ffffffff812266a0>]  [<ffffffff812266a0>] list_del+0x10/0x8b
RSP: 0018:ffff88012d9cfc90  EFLAGS: 00010286
RAX: dead000000200200 RBX: ffff88008d8b09c8 RCX: ffff880130648000
RDX: 0000000000000000 RSI: ffff88012d9cfff8 RDI: ffff88008d8b09c8
RBP: ffff88012d9cfca0 R08: ffff88012d9cfb5c R09: ffffffff8100aae0
R10: ffff8801ad9cf98f R11: 0000000000000000 R12: ffff88012fd27800
R13: ffff88012fd25000 R14: ffffffffa009ea10 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f2168285018 CR3: 0000000132253000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kslowd002 (pid: 2080, threadinfo ffff88012d9ce000, task ffff880130648000)
Stack:
 ffff88012fd27800 ffff88008d8b09c0 ffff88012d9cfcf0 ffffffffa0035425
<0> 0000000000000000 0000000000000001 ffff88012fd27800 ffff88012fd27800
<0> 0000000000000000 ffff88012fd27a68 ffffffffa009ea10 0000000000000438
Call Trace:
 [<ffffffffa0035425>] drm_mode_connector_update_edid_property+0x3f/0xf3 [drm]
 [<ffffffffa0067c40>] drm_helper_probe_single_connector_modes+0x110/0x29d [drm_kms_helper]
 [<ffffffffa00654e5>] drm_fb_helper_probe_connector_modes+0x47/0x5f [drm_kms_helper]
 [<ffffffffa0066756>] drm_fb_helper_hotplug_event+0xac/0xc9 [drm_kms_helper]
 [<ffffffffa0094888>] intel_fb_output_poll_changed+0x1c/0x20 [i915]	
 [<ffffffffa006757b>] output_poll_execute+0xf2/0x12e [drm_kms_helper]
 [<ffffffff810c9f1b>] slow_work_execute+0x1a2/0x2cc
 [<ffffffff810ca3c5>] slow_work_thread+0x173/0x2a4
 [<ffffffff81066633>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff81469fcf>] ? _raw_spin_unlock_irqrestore+0x17/0x19
 [<ffffffff810ca252>] ? slow_work_thread+0x0/0x2a4
 [<ffffffff81066199>] kthread+0x7f/0x87
 [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106611a>] ? kthread+0x0/0x87
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Code: 00 00 00 74 05 e8 8b 74 e2 ff 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f c9 c3 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 <4c> 8b 00 49 39 f8 74 1d 48 89 f9 48 c7 c2 fb 18 7b 81 be 30 00 
RIP  [<ffffffff812266a0>] list_del+0x10/0x8b
 RSP <ffff88012d9cfc90>
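
One detail in the register dump may be worth noting: RAX holds dead000000200200, which matches the value the x86-64 kernel of this era uses for LIST_POISON2 (0x00200200 plus the 0xdead000000000000 poison offset), i.e. what list_del() writes into an entry's prev pointer after unlinking it. That would be consistent with the same entry being deleted twice, though the oops alone does not prove it. The sketch below is a userspace imitation, not kernel code; the `toy_` names are hypothetical, and it assumes 64-bit pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitation of a kernel-style doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values as used on x86-64 kernels of that era (assumes 64-bit pointers). */
#define TOY_POISON1 ((struct list_head *)(uintptr_t)UINT64_C(0xdead000000100100))
#define TOY_POISON2 ((struct list_head *)(uintptr_t)UINT64_C(0xdead000000200200))

void toy_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
void toy_list_add(struct list_head *entry, struct list_head *head)
{
    assert(entry != NULL && head != NULL);
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink entry and poison its pointers.
 * Returns 0 on success, -1 if the entry already carries poison, which
 * means it was unlinked before. Without this check the second call would
 * dereference the poison pointer, as list_del() did in the oops above.
 */
int toy_list_del_checked(struct list_head *entry)
{
    if (entry->next == TOY_POISON1 || entry->prev == TOY_POISON2)
        return -1;

    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = TOY_POISON1;
    entry->prev = TOY_POISON2;
    return 0;
}
```

In the kernel, the proper fix for a double deletion is serialising the two paths that touch the list rather than checking for poison, which is what the locking change later in this report addresses.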

Comment 12 Bernie Innocenti 2011-03-21 20:38:20 UTC

Also, a few days ago I was able to reproduce the GPU lockup without plugging or unplugging a monitor. All I did was close and reopen the lid, causing a suspend-to-RAM and resume cycle.

Comment 13 Bernie Innocenti 2011-03-21 20:44:13 UTC

Another hint, maybe useful: on kernel-2.6.38-1.fc15.x86_64, plugging a monitor into the external VGA connector of my Lenovo X201 often results in a flashing red/black screen!

Moreover, it seems to be a lot easier to make X hang with a black screen while running a 2.6.38 kernel, but I can't see any output on the console to confirm it. I'm attaching a dmesg.out taken while X was hung.

Comment 14 Bernie Innocenti 2011-03-21 20:45:45 UTC

Created attachment 44702
dmesg output of 2.6.38, taken while Xorg was hung with a black screen

Comment 15 Bernie Innocenti 2011-04-24 20:03:02 UTC

After several days of testing, I couldn't reproduce this bug with
kernel-2.6.39-0.rc3.git2.0.fc16.x86_64.

However, I'm still seeing several other bugs on monitor plug/unplug while
running a compositing GL window manager (compiz in my case).

Comment 16 Chris Wilson 2011-06-26 06:55:24 UTC

So the kernel bugs were (upstream):

commit 752d2635ebb12b6122ba05775f7d1ccfef14b275
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 22 11:03:57 2011 +0100

    drm: Take lock around probes for drm_fb_helper_hotplug_event
    
    We need to hold the dev->mode_config.mutex whilst detecting the output
    status. But we also need to drop it for the call into
    drm_fb_helper_single_fb_probe(), which indirectly acquires the lock when
    attaching the fbcon.
    
    Failure to do so exposes a race with normal output probing. Detected by
    adding some warnings that the mutex is held to the backend detect routines
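
The locking pattern this commit message describes (hold the mutex for the probe, release it before calling into a path that takes the same mutex itself) can be sketched as follows. This is a toy model, not the DRM code: `toy_mutex` is a non-recursive flag standing in for dev->mode_config.mutex, and every function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock; a second acquire fails instead of deadlocking. */
struct toy_mutex {
    bool held;
};

static bool toy_lock(struct toy_mutex *m)
{
    if (m->held)
        return false; /* a real non-recursive mutex would deadlock here */
    m->held = true;
    return true;
}

static void toy_unlock(struct toy_mutex *m)
{
    assert(m->held);
    m->held = false;
}

struct device {
    struct toy_mutex mode_config_mutex;
    int probed_connectors;
    bool fb_attached;
};

/* Stand-in for the backend detect routines: the caller must hold the lock. */
static void probe_connectors(struct device *dev)
{
    assert(dev->mode_config_mutex.held);
    dev->probed_connectors++;
}

/* Stand-in for the fb probe path: it acquires the lock on its own. */
static bool attach_fb(struct device *dev)
{
    if (!toy_lock(&dev->mode_config_mutex))
        return false;
    dev->fb_attached = true;
    toy_unlock(&dev->mode_config_mutex);
    return true;
}

/* Hotplug handler: lock around the probe, drop the lock before attaching. */
bool hotplug_event(struct device *dev)
{
    if (!toy_lock(&dev->mode_config_mutex))
        return false;
    probe_connectors(dev);
    toy_unlock(&dev->mode_config_mutex);

    return attach_fb(dev);
}
```

Calling attach_fb() while still holding the lock would fail in this model, which mirrors why the real code has to drop the mutex at that point.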

and


commit 9a362dd718119042cbe2821edd277c8b98c7fa65
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 16 12:59:17 2011 +0100

    drm/i915: Finish any pending operations on the framebuffer before disabling
    
    Similar to the case where we are changing from one framebuffer to
    another, we need to be sure that there are no pending WAIT_FOR_EVENTs on
    the pipe for the current framebuffer before switching. If we disable the
    pipe, and then try to execute a WAIT_FOR_EVENT it will block
    indefinitely and cause a GPU hang.
    
    We attempted to fix this in commit 85345517fe6d4de27b0d6ca19fef9d28ac947c4a
    (drm/i915: Retire any pending operations on the old scanout when switching)
    for the case of mode switching, but this leaves the condition where we
    are switching off the pipe vulnerable.
    
    There still remains the race condition were a display may be unplugged,
    switched off by the core, a uevent sent to notify the DDX and the DDX
    may issue a WAIT_FOR_EVENT before it processes the uevent. This window
    does not exist if the pipe is only switched off in response to the
    uevent. Time to make sure that is so...

Comment 17 Bernie Innocenti 2011-07-03 22:50:09 UTC

Which tree has the latter patch (9a362dd718119042cbe2821edd277c8b98c7fa65)?

Comment 18 Chris Wilson 2012-04-11 08:20:41 UTC

We are getting closer to having the flush-before-disable fix upstream.

Comment 19 Chris Wilson 2012-04-14 07:19:55 UTC

Step 1 of the flush fixes is upstream:

commit 14667a4bde4361b7ac420d68a2e9e9b9b2df5231
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 3 17:58:35 2012 +0100

    drm/i915: Finish any pending operations on the framebuffer before disabling

Comment 20 Chris Wilson 2012-04-19 15:51:13 UTC

And the second step is upstream as well, so I think we have all the pieces in place for this.

commit 0f91128d88bbb8b0a8e7bb93df2c40680871d45a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 17 10:05:38 2012 +0100

    drm/i915: Wait for all pending operations to the fb before disabling the pip
    
    During modeset we have to disable the pipe to reconfigure its timings
    and maybe its size. Userspace may have queued up command buffers that
    depend upon the pipe running in a certain configuration and so the
    commands may become confused across the modeset. At the moment, we use a
    less than satisfactory kick-scanline-waits should the GPU hang during
    the modeset. It should be more reliable to wait for the pending
    operations to complete first, even though we still have a window for
    userspace to submit a broken command buffer during the modeset.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>