Bug 91273 - Setting a queue on a wl_proxy is racy if some other thread is dispatching the default queue
Summary: Setting a queue on a wl_proxy is racy if some other thread is dispatching the...
Status: RESOLVED FIXED
Alias: None
Product: Wayland
Classification: Unclassified
Component: wayland (show other bugs)
Version: unspecified
Hardware: All All
: medium normal
Assignee: Wayland bug list
QA Contact:
URL:
Whiteboard:
Keywords:
: 91750 (view as bug list)
Depends on:
Blocks: 91768 91769
  Show dependency treegraph
 
Reported: 2015-07-09 02:28 UTC by Boram
Modified: 2016-04-29 13:09 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:


Attachments
simple test application to reproduce this bug (1.53 KB, text/plain)
2015-07-09 02:28 UTC, Boram
Details
simple test application to reproduce this bug (3.07 KB, text/plain)
2015-07-09 02:37 UTC, Boram
Details

Description Boram 2015-07-09 02:28:59 UTC
Created attachment 117010 [details]
simple test application to reproduce this bug

Condition:
2 threads polling on same display fd in both threads.

Result:
When one thread awakes, if it quickly reads all events from display fd before kernel wakes up another, then another can't awake from poll(). I attached a very simple test application to reproduce this.

Real example:
gstreamer's waylandsink can do polling on same diplay fd of toolkit's main-thread(Such like GTK and EFL...).
(http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/tree/ext/wayland/wldisplay.c)
If waylandsink's thread reads all event using wl_display_read_events before toolkit's main-thread awakes, although all events are queued into corresponding event queues, default queue's events can't be handled because main-thread is still sleeping.

Comment:
poll(), select()... seems not guarantee that kernel will wake all threads up when events happen. Creating new queue seems a good solution if we want to handle  events of a certain wl_proxy. But not in this case. Maybe using a different display fd will be a good solution. But for using a different display fd, we have to create new display, then waylandsink can't use parent surface and create subsurface. Or else, can it be solution creating new fd for each queue?
Comment 1 Boram 2015-07-09 02:37:28 UTC
Created attachment 117011 [details]
simple test application to reproduce this bug
Comment 2 Jonas Ådahl 2015-07-09 02:52:22 UTC
(In reply to Boram from comment #0)
> Created attachment 117010 [details]
> simple test application to reproduce this bug

Please attach the source code for the test application instead of the binary (I only see garbage when opening the attachment in the browser).

> 
> Condition:
> 2 threads polling on same display fd in both threads.
> 
> Result:
> When one thread awakes, if it quickly reads all events from display fd
> before kernel wakes up another, then another can't awake from poll(). I
> attached a very simple test application to reproduce this.

The API is made so that when there are multiple threads reading from the fd, they don't do so until every thread has returned from poll(). When every thread that want data from the fd calls "prepare read", only one will actually read it in the end, after all threads that are poll():ing should have seen that data is on the fd.

> 
> Real example:
> gstreamer's waylandsink can do polling on same diplay fd of toolkit's
> main-thread(Such like GTK and EFL...).
> (http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/tree/ext/wayland/
> wldisplay.c)
> If waylandsink's thread reads all event using wl_display_read_events before
> toolkit's main-thread awakes, although all events are queued into
> corresponding event queues, default queue's events can't be handled because
> main-thread is still sleeping.

All threads that called prepare_read need to either cancel or read, and not until then will one thread actually read the data.

> 
> Comment:
> poll(), select()... seems not guarantee that kernel will wake all threads up
> when events happen. Creating new queue seems a good solution if we want to
> handle  events of a certain wl_proxy. But not in this case. Maybe using a
> different display fd will be a good solution. But for using a different
> display fd, we have to create new display, then waylandsink can't use parent
> surface and create subsurface. Or else, can it be solution creating new fd
> for each queue?

I suspect the issue you are seeing is the race between creating the wl_callback and setting the queue and listener, meaning the callback might be dispatched on the main thread before the proxy got its listener configured.

The issue was discussed on IRC and also on the mailing list starting here: http://lists.freedesktop.org/archives/wayland-devel/2015-June/023054.html

There is currently a potentially working work-around (with some side effects) which is to wrap the sync callback proxy creation and setup inside a prepare_read and a cancel_read in order to block any other thread from reading and dispatching during the setup of the proxy.
Comment 3 Jonas Ådahl 2015-07-09 02:53:14 UTC
(In reply to Jonas Ådahl from comment #2)
> (In reply to Boram from comment #0)
> > Created attachment 117010 [details]
> > simple test application to reproduce this bug
> 
> Please attach the source code for the test application instead of the binary
> (I only see garbage when opening the attachment in the browser).

Ah, I see you already updated it. Disregard this!
Comment 4 Jonas Ådahl 2015-07-09 02:57:03 UTC
(In reply to Boram from comment #1)
> Created attachment 117011 [details]
> simple test application to reproduce this bug

It seems you are misusing the API; you are mixing wl_display_dispatch() with wl_display_prepare_read() and friends. See the paragraph of multi threaded environments in the API documentation: http://wayland.freedesktop.org/docs/html/apb.html#Client-classwl__display_1a30a9c4f020f3e77581c7a81ecdb4913d
Comment 5 Boram 2015-07-09 04:38:19 UTC
(In reply to Jonas Ådahl from comment #4)
> (In reply to Boram from comment #1)
> > Created attachment 117011 [details]
> > simple test application to reproduce this bug
> 
> It seems you are misusing the API; you are mixing wl_display_dispatch() with
> wl_display_prepare_read() and friends. See the paragraph of multi threaded
> environments in the API documentation:
> http://wayland.freedesktop.org/docs/html/apb.html#Client-
> classwl__display_1a30a9c4f020f3e77581c7a81ecdb4913d

Actually, I only simplified what gtk & waylandsink do, and combined them into one test application to reproduce and explain bug easily. GTK, EFL and weston-client applications use same APIs. They call wl_display_flush & wl_display_dispatch_pending for prepare step and call wl_display_dispatch for read & dispatch step. 

I already read the API documentation as well as wayland-client source code also. And I fully understand what you say. I had questions why these toolkits and weston-client applications use wl_display_dispatch instead of prepare_event, read_event? why gstreamer's waylandsink uses prepare, read_event?  After reading documentation, now I understand.
The main-thread of client applications & toolkits with mesa gl acceleration can be blocked if they use prepare_read & read_event because mesa call wl_display_dispatch_queue internally. And gstreamer's waylandsink faithfully follows the API documentation. So, I couldn't blame one of them. I guess toolkits, weston-client and gstreamer have their own good reasons why they use that kind of APIs.

Anyway, my point of this report is "In case of sharing and polling fd, it's possible for kernel not to wake up the thread. Then, the thread can be blocked".
Even if we use properly APIs as the API docuementation, I think this bug can't be solved.
Comment 6 Pekka Paalanen 2015-07-14 05:53:52 UTC
All libraries, including Mesa, need to be fixed to use the prepare_read/read_events API, because that is the only API we even think as thread-safe.

I suppose it could be nice to file bugs per library, and add them here as blockers (or just references in case the library does not use fd.o bugzilla).

Lazily written application can use the non-safe API just fine, when they are not threaded and assuming the libraries they use do not use the same wl_display object from other threads. The non-safe API was there first, so apps that don't require thread-safety have not been ported. Mixing non-safe and safe APIs is only dangerous if their use may overlap. If you only ever have a single thread, I believe these can be mixed safely if paying attention to the details.

As for sharing the wl_display *file descriptor*, that is never intended to work. I mean, you cannot create multiple wl_displays from the same fd and expect that to work. I don't think anyone even tries to, just clarifying.

Jonas' "proxy wrapper" is an attempt to solve one more threading issue.

I still don't understand what you mean by "the kernel may not wake up the thread" if it refers to a different issue than the two mentioned above.
Comment 7 Boram 2015-07-21 01:41:00 UTC
> I still don't understand what you mean by "the kernel may not wake up the
> thread" if it refers to a different issue than the two mentioned above.

When A thread is polling on a display fd and B thread is polling on the same display fd, if a server sends messages, sometimes only one thread(Let's say it is B thread) awakes and reads all events so fast. Then A thread still is in the polling status and can't awake. Even if B thread reads and queue all events into corresponding queue_lists, the events of A thread's queue_list can't be handled because A thread is still in sleep.
Comment 8 Jonas Ådahl 2015-07-21 03:08:31 UTC
(In reply to Boram from comment #7)
> > I still don't understand what you mean by "the kernel may not wake up the
> > thread" if it refers to a different issue than the two mentioned above.
> 
> When A thread is polling on a display fd and B thread is polling on the same
> display fd, if a server sends messages, sometimes only one thread(Let's say
> it is B thread) awakes and reads all events so fast. Then A thread still is
> in the polling status and can't awake. Even if B thread reads and queue all
> events into corresponding queue_lists, the events of A thread's queue_list
> can't be handled because A thread is still in sleep.

When the API is used correctly, all threads will have exited the poll() call and either canceled the read, or entered the read phase, so the scenario you describe won't happen. Note that no thread will ever read from the fd until *all* threads that called prepare_read has either called cancel_read or read_events.
Comment 9 Kristian Høgsberg 2015-07-21 04:26:34 UTC
(In reply to Boram from comment #7)
> > I still don't understand what you mean by "the kernel may not wake up the
> > thread" if it refers to a different issue than the two mentioned above.
> 
> When A thread is polling on a display fd and B thread is polling on the same
> display fd, if a server sends messages, sometimes only one thread(Let's say
> it is B thread) awakes and reads all events so fast. Then A thread still is
> in the polling status and can't awake. Even if B thread reads and queue all
> events into corresponding queue_lists, the events of A thread's queue_list
> can't be handled because A thread is still in sleep.

All threads call wl_display_read_events() after returning from poll(), but only the last thread reads the data from the fd.  That way we make sure all threads who called wl_display_prepare_read_queue() will come out of poll before we read the data.

The kernel may only wake one thread, but that's only an optimization.  If that thread doesn't read the data but goes to sleep on a pthread condition (the case for the n-1 first threads), the other threads that block on poll will eventually run and rendezvous in wl_display_read_events().
Comment 10 Jonas Ådahl 2015-08-25 07:18:19 UTC
*** Bug 91750 has been marked as a duplicate of this bug. ***
Comment 11 Boram 2015-09-07 08:07:38 UTC
> All threads call wl_display_read_events() after returning from poll(), but
> only the last thread reads the data from the fd.  That way we make sure all
> threads who called wl_display_prepare_read_queue() will come out of poll
> before we read the data.

In multi-thread application, looks we need to avoid to use dispatch()/dipatch_queue(). And there might be no problem if every threads call prepare_read(), poll() and read_events(). For toolkits to support multi-thread, they also need to be fixed to use prepare_read/read_events, as well as mesa Pekka Paalanen mentioned above.

To replace roundtrip()/dispatch()/dispatch_queue(), can wayland offer a new convenient API which internally calls poll() and read_events()? i.e, wl_display_poll_read()? It will be more convenient than calling poll() and read_events() directly by each library.
Comment 12 Jonas Ådahl 2015-09-07 08:12:34 UTC
(In reply to Boram from comment #11)
> > All threads call wl_display_read_events() after returning from poll(), but
> > only the last thread reads the data from the fd.  That way we make sure all
> > threads who called wl_display_prepare_read_queue() will come out of poll
> > before we read the data.
> 
> In multi-thread application, looks we need to avoid to use
> dispatch()/dipatch_queue(). And there might be no problem if every threads
> call prepare_read(), poll() and read_events(). For toolkits to support
> multi-thread, they also need to be fixed to use prepare_read/read_events, as
> well as mesa Pekka Paalanen mentioned above.
> 
> To replace roundtrip()/dispatch()/dispatch_queue(), can wayland offer a new
> convenient API which internally calls poll() and read_events()? i.e,
> wl_display_poll_read()? It will be more convenient than calling poll() and
> read_events() directly by each library.

It is more or less fine to use dispatch()/dispatch_queue() in a multi-threaded application as long as one deals with tear-down appropriately and dispatches the right queue on the right thread. See bug 91767 and bug 91769 for more details.

We aim to fix these issues shortly after the coming release.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.