Bug 83304

Summary: wl_display_roundtrip() seems not thread-safe
Product: Wayland Reporter: Remi Denis-Courmont <courmisch>
Component: waylandAssignee: Wayland bug list <wayland-bugs>
Status: RESOLVED NOTABUG QA Contact:
Severity: normal    
Priority: medium CC: daniel
Version: 1.5.0   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 83429    

Description Remi Denis-Courmont 2014-08-31 17:50:04 UTC
Hello,

It seems that calling wl_display_roundtrip() from a thread while another thread runs the Wayland client "mainloop" is prone to deadlocks. After a while, I always keep getting the calling thread stuck in wl_display_dispatch_queue() even though the sync event is already completed.

From a quick look at the code, the use of the "done" variable seems to lack memory barriers if the function was intended for use across multiple threads. I would guess there is a race condition between setting the "done" flag and wl_display_dispatch_queue().

Now, I don't know if this is supposed to work, or if it is not. But even if it is not, the lack of explicit warning against it in the documentation would qualify as a bug in my humble opinion.

This is the program's stack dump (where thread 2 runs the mainloop):

Thread 3 (Thread 0x7f19b3d8f700 (LWP 3112)):
#0  0x00007f19de4de0ed in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f19dfb6249a in poll () from /usr/lib/x86_64-linux-gnu/libasan.so.1
#2  0x00007f19b38703f0 in poll (__timeout=-1, __nfds=1, __fds=0x7f19b3d8e9c0)
    at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#3  wl_display_dispatch_queue (display=display@entry=0x61400007fc40, 
---Type <return> to continue, or q <return> to quit---
    queue=queue@entry=0x61400007fd28) at ../src/wayland-client.c:1334
#4  0x00007f19b38704cc in wl_display_dispatch (
    display=display@entry=0x61400007fc40) at ../src/wayland-client.c:1414
#5  0x00007f19b387052c in wl_display_roundtrip (display=0x61400007fc40)
    at ../src/wayland-client.c:856
#6  0x00007f19b3a81cfe in Display (vd=0x6150000af398, pic=0x6130000a44c0, 
    subpic=0x0) at ../../../modules/video_output/wl/shm.c:248
#7  0x00007f19dd47128f in vout_display_Display (vd=0x6150000af398, 
    picture=0x6130000a44c0, subpicture=0x0)
    at ../../include/vlc_vout_wrapper.h:57
#8  0x00007f19dd47d64f in ThreadDisplayRenderPicture (vout=0x61a00012af18, 
    is_forced=false) at ../../src/video_output/video_output.c:1061
#9  0x00007f19dd47e2e3 in ThreadDisplayPicture (vout=0x61a00012af18, 
    deadline=0x7f19b3d8ede0) at ../../src/video_output/video_output.c:1135
#10 0x00007f19dd48394f in Thread (object=0x61a00012af18)
    at ../../src/video_output/video_output.c:1572
#11 0x00007f19de9b50a4 in start_thread (arg=0x7f19b3d8f700)
    at pthread_create.c:309
#12 0x00007f19de4e6c2d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f19b3359700 (LWP 3113)):
#0  0x00007f19de4de0ed in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f19dfb6249a in poll () from /usr/lib/x86_64-linux-gnu/libasan.so.1
#2  0x00007f19b3663f5e in Thread (data=0x6110001d5f98)
    at ../../../modules/video_output/wl/shell_surface.c:77
#3  0x00007f19de9b50a4 in start_thread (arg=0x7f19b3359700)
    at pthread_create.c:309
#4  0x00007f19de4e6c2d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(That was generated with libwayland-client 1.5.0, so no _queue, but the problem should remain identical.)

Meanwhile, the following ugly work-around solves it for me. But obviously, it assumes that another thread is processing events to invoke the callback:

----8<--------8<--------8<--------8<--------8<--------8<--------8<----

#include <semaphore.h>

static void done_cb(void *data, struct wl_callback *callback, uint32_t serial)
{
    sem_t *done = data;

    sem_post(done);
    wl_callback_destroy(callback);
    (void) serial;
}

static const struct wl_callback_listener sync_cbs =
{
    done_cb,
};

static int wl_display_roundtrip_(struct wl_display *display)
{
    sem_t done;

    struct wl_callback *callback = wl_display_sync(display);
    if (callback == NULL)
        return -1;

    sem_init(&done, 0, 0);
    wl_callback_add_listener(callback, &sync_cbs, &done);
    wl_display_flush(display);
    sem_wait(&done);
    sem_destroy(&done);
    return 0;
}

---->8-------->8-------->8-------->8-------->8-------->8-------->8----
Comment 1 Daniel Stone 2014-09-01 18:47:20 UTC
Can you please post a link to the code? In particular, how are you calling poll() from the main thread; is it polling on the Wayland display fd, and if so, is it doing so strictly according to the documentation for wl_display_dispatch_queue()?
Comment 2 Pekka Paalanen 2014-09-01 18:59:12 UTC
Yeah, I don't think wl_display_roundtrip() is thread-safe.

It's really the same thing that dispatching the same queue from multiple threads is a bad idea.

You should really have a queue dedicated to each thread, and dispatch that queue only from that exact thread, never from any other thread. Otherwise your event handlers become quite a bit complex and hard to make reliable. It might even be impossible to implement correct locking for your own data structures without a dedicated queue. You always want to know in which thread your callbacks will run, and creating a new queue and using it consistently is the tool to achieve that.

Unless you are running in the thread that dispatches the default queue (the queue built in to wl_display), do not use wl_display_roundtrip(). Instead, use wl_display_roundtrip_queue() which is new in 1.6, or open-code that if you need to work with older libwayland-client.

Also, I think the docs on wl_display_dispatch_queue are outdated a bit:

 * This function blocks if there are no events to dispatch. If calling from
 * the main thread, it will block reading data from the display fd. For other
 * threads this will block until the main thread queues events on the queue
 * passed as argument.

Nowdays, any thread can read the fd, but it all works, if all threads follow the polling protocol documented for wl_display_prepare_read().

I'm leaving this bug open, so that we would remember to check the documentation.
Comment 3 Remi Denis-Courmont 2014-09-01 19:16:23 UTC
Pekka, the mainloop is simplistic and tries to follow the documentation:
https://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_output/wl/shell_surface.c;hb=2.2.0-git-662-ge08f4fc#l48

(Admittedly the cleanup path may look unorthodox, but you can ignore it as the deadlock predates it.) The deadlock is there, right below FIXME:

https://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_output/wl/shm.c;hb=2.2.0-git-662-ge08f4fc#l184

(NOTE: I know that wl_shm is not the best option for media playback. In the long term, we definitely want to avoid software scaling and colour space conversion. But that is outside the scope of this bug.)

I tried to use wl_event_queue, but it gave me crashes in the mainloop after destroying the supposedly no longer used queue. I guess I don't know how/when it is safe to delete a queue.

So it does not seem to me that "it all works if all threads follow the polling protocol documented".
Comment 4 Pekka Paalanen 2014-09-01 19:39:04 UTC
(In reply to comment #3)
> Pekka, the mainloop is simplistic and tries to follow the documentation:
> https://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_output/wl/
> shell_surface.c;hb=2.2.0-git-662-ge08f4fc#l48

That looks fine.

> (Admittedly the cleanup path may look unorthodox, but you can ignore it as
> the deadlock predates it.) The deadlock is there, right below FIXME:
> 
> https://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_output/wl/shm.c;
> hb=2.2.0-git-662-ge08f4fc#l184

That is broken like I explained above, it runs in a different thread, doesn't it?

> I tried to use wl_event_queue, but it gave me crashes in the mainloop after
> destroying the supposedly no longer used queue. I guess I don't know
> how/when it is safe to delete a queue.

Can you show the code for that? There likely are several details to get right, like assigning proxies to queues, and when to create and destroy queues.
Comment 5 Pekka Paalanen 2014-09-01 19:40:53 UTC
(In reply to comment #4)
> > I tried to use wl_event_queue, but it gave me crashes in the mainloop after
> > destroying the supposedly no longer used queue. I guess I don't know
> > how/when it is safe to delete a queue.
> 
> Can you show the code for that? There likely are several details to get
> right, like assigning proxies to queues, and when to create and destroy
> queues.

Sorry, this is actually the wrong forum to discuss this. If you cannot hang around in IRC long enough to get replied, please ask on the wayland-devel@ mailing list with your code examples.

This bug is about wl_display_roundtrip() documentation.
Comment 6 Remi Denis-Courmont 2014-09-01 19:50:27 UTC
(In reply to comment #4)
> > The deadlock is there, right below FIXME:
> > 
> > https://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_output/wl/shm.c;
> > hb=2.2.0-git-662-ge08f4fc#l184
> 
> That is broken like I explained above, it runs in a different thread,
> doesn't it?

Yes. But now you seem to imply that running wl_display_roundtrip() in parallel to a threaded mainloop is not meant to work. Well, the Wayland developers are free to "define" their API the way they feel like, so that is fine.

But then I don't understand what you meant by "it all works if all threads follow the polling protocol documented". That seems contradictory.
Comment 7 Pekka Paalanen 2014-09-02 06:27:18 UTC
(In reply to comment #6)
> (In reply to comment #4)
> > > The deadlock is there, right below FIXME:
> > > 
> > > https://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_output/wl/shm.c;
> > > hb=2.2.0-git-662-ge08f4fc#l184
> > 
> > That is broken like I explained above, it runs in a different thread,
> > doesn't it?
> 
> Yes. But now you seem to imply that running wl_display_roundtrip() in
> parallel to a threaded mainloop is not meant to work. Well, the Wayland
> developers are free to "define" their API the way they feel like, so that is
> fine.
> 
> But then I don't understand what you meant by "it all works if all threads
> follow the polling protocol documented". That seems contradictory.

The major point here is, that dispatching a queue from more than one thread is a bad idea. The event handlers called by dispatch are usually not written to be totally thread and async safe. Sometimes it is practically impossible to write them like that.

wl_display_roundtrip() must dispatch the default queue, because it is waiting for an event on the default queue. That means you now have two different threads dispatching the same queue, and the event handler functions will be called in an arbitrary thread context.

As you found out yourself, wl_display_roundtrip() itself (the handling of 'done' variable) is not written to work in that scenario. The algorithm there is fundamentally racy is multiple threads are involved in dispatch, and no barriers can fix it.

Your semaphore replacement looks like it should work, but then it assumes that there actually is another thread dispatching the default queue. If you call your semaphore version from the "main" thread that is responsible for dispatching the default 

All threads following the polling protocol is one thing. Not dispatching a queue from multiple threads is another important thing.
Comment 8 Pekka Paalanen 2014-09-02 06:30:04 UTC
(In reply to comment #7)
> Your semaphore replacement looks like it should work, but then it assumes
> that there actually is another thread dispatching the default queue. If you
> call your semaphore version from the "main" thread that is responsible for
> dispatching the default 

Argh, finger slipped to a wrong key...

If you call your semaphore version from the "main" thread that is responsible for dispatching the default queue, then the roundtrip will never complete, because nothing is dispatching the queue.
Comment 9 Daniel Stone 2018-06-04 06:47:10 UTC
This bug was completed and answered quite a while ago; closing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.