Bug 100878

Summary: Request for wl_list API change
Product: Wayland Reporter: worknesday
Component: westonAssignee: Wayland bug list <wayland-bugs>
Status: RESOLVED WONTFIX QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: All   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=100897
Whiteboard:
i915 platform: i915 features:

Description worknesday 2017-04-29 04:11:22 UTC
As seen on SHA 9ad4de1f7ad411fcb5a62eb85e17cf96ae076a0f; no modification

At the bottom of the description, you can find stacktrace.

Analysis:
When I saw this issue, I did some digging around on wl_list type and I suspect the seg fault arose from the design of the circular-doubly-linked lists themselves.

If we look at wayland-util.h (where these list functions are declared), you can see something odd and (potentially) dangerous:

    void
    wl_list_insert(struct wl_list *list, struct wl_list *elm);

    void
    wl_list_remove(struct wl_list *elm); // 

The problem here is that the *identity* of the list is subject to change as you add/remove nodes -- especially wl_list_remove. If, say, you have a list

    struct wl_list *my_list; // initialized to some list with 3 elements

then you do

    wl_list_remove(my_list)

then you will lose the other elements and my_list is no longer valid nor circular because wl_list_remove sets prev and next to NULL. Indeed, wl_list_remove says so explicitly:

    /**
     * Removes an element from the list.
     *
     * \note This operation leaves \p elm in an invalid state.
     *
     * \param elm Link of the containing struct to remove from the list
     *
     * \memberof wl_list
     */

I think a better (robust) design is to wrap and hide the actual node pointers so that
    1. we preserve the identity (pointer) of the list itself
    2. we can call wl_list_remove multiple times

I envision something like

    struct wl_list_node {
	    /** Previous list element */
	    struct wl_list_node *prev;
	    /** Next list element */
	    struct wl_list_node *next;
    };

and wl_list will be

    struct wl_list {
	    struct wl_list_node *node;
    };

Then, removing a node would be something like

    // assert that node is an element of list
    void
    wl_list_remove(struct wl_list *list, struct wl_list_node *node);

So, you can snip node out of list. If list->node == node, then you can perform a fixup (e.g. nudge list->node to the "next" element).

Thus, you ensure the integrity of the list.

I think this is the best option given the circumstance and I hope that I'm not stepping on peoples' toes.


Stack trace:
    #0  0x00007ffff79a0313 in wl_list_insert () from /usr/lib64/libwayland-server.so.0
    #1  0x00007fffeeae16cf in wl_signal_add (signal=<optimized out>, listener=<optimized out>) at /usr/include/wayland-server-core.h:320
    #2  focus_state_set_focus (state=<optimized out>, surface=<optimized out>) at desktop-shell/shell.c:735
    #3  0x00007fffeeae7435 in activate (shell=0x7820e0, view=<optimized out>, seat=0x622100, flags=3) at desktop-shell/shell.c:3686
    #4  0x00007ffff7bc5d5d in weston_compositor_run_button_binding (compositor=compositor@entry=0x614c60, pointer=pointer@entry=0x76c900, 
        time=time@entry=3071980015, button=button@entry=272, state=state@entry=WL_POINTER_BUTTON_STATE_PRESSED) at libweston/bindings.c:368
    #5  0x00007ffff7bc13cd in notify_button (seat=seat@entry=0x622100, time=3071980015, button=button@entry=272, 
        state=WL_POINTER_BUTTON_STATE_PRESSED) at libweston/input.c:1673
    #6  0x00007ffff531065b in x11_backend_deliver_button_event (event=0x7bc180, b=0x622070) at libweston/compositor-x11.c:1230
    #7  x11_backend_handle_event (fd=<optimized out>, mask=<optimized out>, data=0x622070) at libweston/compositor-x11.c:1422
    #8  0x00007ffff799dd12 in wl_event_loop_dispatch () from /usr/lib64/libwayland-server.so.0
    #9  0x00007ffff799c1b5 in wl_display_run () from /usr/lib64/libwayland-server.so.0
    #10 0x0000000000405719 in main (argc=1, argv=<optimized out>) at compositor/main.c:1969

surface data:
    (gdb) p *es
    $6 = {resource = 0x7b2fb0, destroy_signal = {listener_list = {prev = 0x622190, next = 0x793a90}}, ...

    (gdb) p *es->destroy_signal.listener_list.prev
    $10 = {prev = 0x0, next = 0x7e1348}
Comment 1 worknesday 2017-04-29 04:17:12 UTC
I don't have any reproduction steps, it was out-of-the blue as I was playing with weston-editor. However, weston itself came crashing down
Comment 2 Daniel Stone 2017-04-30 13:48:24 UTC
The clue, however, is in the names of the parameters to the functions. For instance, you do:
    struct wl_list list;
    struct {
        int something;
        struct wl_list link;
    } foo;

    wl_list_init(&list);
    wl_list_insert(&list, &foo->link);
    wl_list_remove(&foo->link);

At this point, foo has been removed from list (making its link member dangling), but the overall list itself is still valid. Trying to remove a list from itself does not make any real sense; instead, you remove an element from that list.
Comment 3 worknesday 2017-04-30 16:56:38 UTC
(In reply to Daniel Stone from comment #2)
> The clue, however, is in the names of the parameters to the functions. For
> instance, you do:
>     struct wl_list list;
>     struct {
>         int something;
>         struct wl_list link;
>     } foo;
> 
>     wl_list_init(&list);
>     wl_list_insert(&list, &foo->link);
>     wl_list_remove(&foo->link);
> 
> At this point, foo has been removed from list (making its link member
> dangling), but the overall list itself is still valid. Trying to remove a
> list from itself does not make any real sense; instead, you remove an
> element from that list.

ah, I see! So, does that mean the list itself is represented as an always-present dummy-node that resides in the list?

However, it looks like the crash occurred as a result of es->destroy_signal.listener_list.prev being NULL. Somehow, this list ceased to be circular (at least while traversing backwards)
Comment 4 Pekka Paalanen 2017-05-01 08:51:28 UTC
(In reply to worknesday from comment #0)
> The problem here is that the *identity* of the list is subject to change as
> you add/remove nodes -- especially wl_list_remove. If, say, you have a list
> 
>     struct wl_list *my_list; // initialized to some list with 3 elements
> 
> then you do
> 
>     wl_list_remove(my_list)

This is stricty in the land of "don't do that". You are removing the list's head from the circular list. It is a very handy trick but rarely useful: it's purpose is to be able to free() the list head element while leaving all the nodes still valid for a wl_list_remove() call later. It is no longer possible to iterate the list, so this is only useful in destruction paths.

> I think a better (robust) design is to wrap and hide the actual node
> pointers so that
>     1. we preserve the identity (pointer) of the list itself
>     2. we can call wl_list_remove multiple times

The current list API may not be very intuitive at first, but it probably cannot be changed in libwayland at least due to stable ABI.

> Then, removing a node would be something like
> 
>     // assert that node is an element of list
>     void
>     wl_list_remove(struct wl_list *list, struct wl_list_node *node);

That is inconvenient. We use a lot the feature of wl_list_remove() that you do not need to know in which list the element is in to just remove it. If we had to look up the identity of the list to remove the node from it, that would lead into a lot of new code for not much benefit. For instance, all object destroy functions can currently simply wl_list_remove() without regard to which list the node is in, as long as the node is initialized (does not even need to be in a list really).

(In reply to worknesday from comment #3)
> ah, I see! So, does that mean the list itself is represented as an
> always-present dummy-node that resides in the list?

Yes, exactly. The list head is always part of the circular list, but the head is not a list node - dereferencing the head as a node would produce a garbage pointer. This is a design choice. To iterate a list, one always needs the list head. A list cannot be iterated starting from an arbitrary node without knowing the head, as one must never dereference the head.

> However, it looks like the crash occurred as a result of
> es->destroy_signal.listener_list.prev being NULL. Somehow, this list ceased
> to be circular (at least while traversing backwards)

Yes, that is a side-effect of list corruption. Something else is corrupting the list, and you happen to see a crash in a list manipulation function. Another very common effect from list corruption is an endless loop.


Since this report is full of discussion about wl_list design rather than an actual bug, I would suggest to rename this bug, close it, and open a new one concentrating on the actual crash.
Comment 5 worknesday 2017-05-01 21:30:10 UTC
close as suggested by Pekka

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.