Bug 73566 - SIGSEGV: vt-switching + logind + multi-head crashes weston
Summary: SIGSEGV: vt-switching + logind + multi-head crashes weston
Status: VERIFIED FIXED
Alias: None
Product: Wayland
Classification: Unclassified
Component: weston (show other bugs)
Version: unspecified
Hardware: Other All
: medium critical
Assignee: Wayland bug list
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-01-13 18:11 UTC by U. Artie Eoff
Modified: 2014-04-14 20:34 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
gdb backtrace 1 (8.99 KB, text/plain)
2014-01-13 18:11 UTC, U. Artie Eoff
Details
gdb backtrace 2 (9.03 KB, text/plain)
2014-01-13 18:11 UTC, U. Artie Eoff
Details
gdb backtrace 3 (9.35 KB, text/plain)
2014-01-13 18:13 UTC, U. Artie Eoff
Details
1.3.93 gdb backtrace 1 (9.73 KB, text/plain)
2014-01-21 17:11 UTC, U. Artie Eoff
Details
1.3.93 gdb backtrace 2 (9.67 KB, text/plain)
2014-01-21 17:13 UTC, U. Artie Eoff
Details

Description U. Artie Eoff 2014-01-13 18:11:05 UTC
Created attachment 91966 [details]
gdb backtrace 1

When using weston with logind + multihead setup, rapid vt-switching triggers segmentation fault in drm_output_set_cursor().  The SEGV stack trace is not consistent however.  See attached gdb backtraces for more info.

kernel 3.11.10-301.fc20.x86_64
systemd (master) heads/master-0-g63966da
wayland (master) 1.3.92-0-gc102c20
drm (master) libdrm-2.4.50-0-g4c5de72
mesa (master) heads/master-0-g532b1fe
libva (master) libva-1.2.1-0-g88ed1eb
intel-driver (master) 1.2.1-0-g8f306e3
weston (master) 1.3.92-0-gb637a40
Comment 1 U. Artie Eoff 2014-01-13 18:11:39 UTC
Created attachment 91967 [details]
gdb backtrace 2
Comment 2 U. Artie Eoff 2014-01-13 18:13:24 UTC
Created attachment 91968 [details]
gdb backtrace 3
Comment 3 U. Artie Eoff 2014-01-13 18:15:40 UTC
Additionally, I could not reproduce this issue when:

1. using weston-launch (i.e. no logind)
2. using weston + logind + single-head
Comment 4 U. Artie Eoff 2014-01-21 17:09:59 UTC
(In reply to comment #3)
> Additionally, I could not reproduce this issue when:
> 
> 1. using weston-launch (i.e. no logind)
> 2. using weston + logind + single-head

Reproduced #2 with Weston 1.3.93 (i.e. with single-head):

systemd (master) heads/master-0-g63966da
wayland (master) heads/master-0-g47bbc6b
drm (master) libdrm-2.4.50-0-g4c5de72
mesa (master) heads/master-0-g4cd8011
libva (master) libva-1.2.1-0-g88ed1eb
intel-driver (master) 1.2.1-0-g8f306e3
weston (master) 1.3.93-0-g7cccfca
Comment 5 U. Artie Eoff 2014-01-21 17:11:53 UTC
Created attachment 92527 [details]
1.3.93 gdb backtrace 1
Comment 6 U. Artie Eoff 2014-01-21 17:13:17 UTC
Created attachment 92528 [details]
1.3.93 gdb backtrace 2
Comment 7 U. Artie Eoff 2014-01-21 17:23:05 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Additionally, I could not reproduce this issue when:
> > 
> > 1. using weston-launch (i.e. no logind)
> > 2. using weston + logind + single-head
> 
> Reproduced #2 with Weston 1.3.93 (i.e. with single-head):
> 
> systemd (master) heads/master-0-g63966da
> wayland (master) heads/master-0-g47bbc6b
> drm (master) libdrm-2.4.50-0-g4c5de72
> mesa (master) heads/master-0-g4cd8011
> libva (master) libva-1.2.1-0-g88ed1eb
> intel-driver (master) 1.2.1-0-g8f306e3
> weston (master) 1.3.93-0-g7cccfca

Hrmm... we are hitting a different bug now since vt-switching is also broken without logind usage.  Will file a new bug for that.
Comment 8 Bryce Harrington 2014-01-22 01:34:19 UTC
Looks like two different stack traces; perhaps two different bugs in a race condition as to which will crash first?

Two of the backtraces show a crash in the wl_list_insert routine:

        elm->next->prev = elm;

Tracing up the stack, this implies that:

  new_output->destroy_signal.listener_list.prev->next

is invalid (e.g. maybe NULL).

Not sure what to make of that though.  I think this will require more than just a stack trace to sort out; perhaps instrumenting the weston_view_assign_output() routine could reveal something?
Comment 9 Bryce Harrington 2014-01-22 01:46:44 UTC
The other two backtraces show a crash in drm_output_set_cursor()

                stride = wl_shm_buffer_get_stride(buffer->shm_buffer);
		s = wl_shm_buffer_get_data(buffer->shm_buffer);
                wl_shm_buffer_begin_access(buffer->shm_buffer);
                for (i = 0; i < ev->surface->height; i++)
                        memcpy(buf + i * 64, s + i * stride,
                               ev->surface->width * 4);


But both s and stride appear to be invalid here:

#1  0x00007f31e46dd58f in drm_output_set_cursor (output=0x13f20f0) at compositor-drm.c:1007
        ev = 0x142e810
        buffer = 0x7f31e5cb8a88 <main_arena+808>
        stride = -439645528
        buf = {0 <repeats 4096 times>}
        x = 4223888
        c = 0x1291940
        handle = 0
        bo = 0x12a8ea0
        s = 0x7f31cb971560 <Address 0x7f31cb971560 out of bounds>
        i = 0
        y = 32561

In the other stacktrace, s looks ok, but stride looks invalid still:

#1  0x00007fc70e76a506 in drm_output_set_cursor (output=0x1685ef0) at compositor-drm.c:1004
        ev = 0x182d7f0
        buffer = 0x7fc70fd45a88 <main_arena+808>
        stride = 265575080
        buf = {0 <repeats 4096 times>}
        x = 4223888
        c = 0x166e940
        handle = 0
        bo = 0x17cec50
        s = 0x1685ef0 ""
        i = 0
        y = 32711

My guess would be that buffer->shm_buffer is uninitialized when this code gets hit, or has been partially destroyed.  If this is relatively easily reproduced, perhaps instrumenting the wl_shm_buffer creation and destruction routines would reveal something.  Maybe also check reference counts in shm_pool_unref().
Comment 10 Kristian Høgsberg 2014-01-24 00:46:10 UTC
commit b3955b095317466afd05b474666eef938b7f72d0
Author: Kristian Høgsberg <krh@bitplanet.net>
Date:   Thu Jan 23 16:25:06 2014 -0800

    compositor-drm: Set cursor surface to NULL when pageflip fails
    
    If we VT switch away between  picking a cursor surface and actually doing
    the pageflip in drm_output_repaint(), we never set output->cursor_view to
    NULL.  Then we unplug all the input devices and as the last pointer device
    goes away we destroy the cursor surface.  Then when we switch back, we
    call drm_output_set_cursor() with an invalid surface and crashes.
    
    Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=73566


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.