Summary: send_with_reply_and_block doesn't work with threads
Product: dbus
Component: core
Status: RESOLVED WORKSFORME
Severity: normal
Priority: high
Version: unspecified
Hardware: x86 (IA32)
OS: All
Reporter: Olivier Andrieu <oliv__a>
Assignee: Havoc Pennington <hp>
CC: danny.smit.0, desrt, ht990332, jilles+freedesktop, johangy, mlauer, stian, walters
Attachments:
- Bug-857-Update-DBusConnection-documentation-to-war.patch
- dbus-1.2.12.patch
- dbus-1.2.12.patch
Description
Olivier Andrieu
2004-07-12 05:26:00 UTC
There are actually two problems here:

(1) The main loop runs while the connection is unlocked (in send_with_reply_and_block), between connection_send and _connection_block_for_reply.

(2) Two instances of send_with_reply_and_block run at the same time, and the second one to send its message starts blocking for the reply (or the messages come back out of order).

The solution is probably something along the lines of making 'internal' unlocked versions of everything that connection_send calls, then wrapping connection_send in one big instance of the DBusConnection lock. For now I'm just using a lock in my own code before calling send_with_reply_and_block.

Any word on whether this is fixed?

I won't have time to investigate threads for a while; I know it's sort of broken. There is also a collection of errors reported by "valgrind --tool=helgrind" that are plausibly real issues. If you're using dbus from multiple threads, I think you'll definitely want to spend a day or two fixing up dbus.

Is this still an issue? Threading support has been improved greatly in recent versions of D-Bus. If this is still an issue, please reopen.

dbus_connection_send_with_reply_and_block() usually works, but not always. It may block needlessly for 25 seconds when a main loop (the watch/timeout machinery) runs in a different thread. What seems to happen is that the main loop notices the incoming reply first and calls the watch handler, reading the reply from the socket buffer. At that point, the thread calling dbus_connection_send_with_reply_and_block() has already checked whether it needs to wait, and goes into poll(2) itself. Because the reply has already been read, it blocks for the full timeout (which is 25 seconds). After that it processes the reply normally. This can also be seen by attaching a debugger to the dbus-daemon during the 25-second wait: it has no effect, and the call still completes normally when the 25 seconds are over.
To reproduce the problem it is best to use an otherwise idle bus; it also seems more likely to trigger when calling a method on the bus itself, such as via dbus_bus_request_name(). I am using libdbus directly, but I suppose the problem can also occur with dbus-glib when making method calls from a different thread than the main loop.

More technical details: _dbus_connection_block_pending_call() tries to check whether the reply has already arrived before blocking, but does not lock this down properly. _dbus_connection_do_iteration_unlocked() temporarily unlocks the connection by calling _dbus_connection_acquire_io_path(). At that point the reply may be read from the socket by the main loop thread.

This is on Ubuntu 8.10: dbus 1.2.4-0ubuntu1 (simple interprocess messaging system).

*** Bug 19796 has been marked as a duplicate of this bug. ***

In bug 19796, a patch is provided which adds a new dbus_connection API that is intended to be more thread-safe. A downstream instance of this: https://bugzilla.redhat.com/show_bug.cgi?id=484553

It would be very nice if the poor thread support were documented better. For example, http://dbus.freedesktop.org/doc/dbus/api/html/group__DBusConnection.html#_details pretends that DBusConnection works very nicely with multiple threads. In fact, a DBusConnection should be used from only one thread; if multiple threads need to use dbus, they should pass messages to and from a specific thread, or use their own DBusConnection. To help programs comply with that rule, it would be good if dbus_connection_open() always behaved like dbus_connection_open_private(), and similarly for dbus_bus_get(). I think it is very hard to fix the thread support, so the problems should be documented.

Created attachment 22759 [details] [review] Bug-857-Update-DBusConnection-documentation-to-war.patch

I would personally say that it is better to fix the multithreading issues than to document that the library isn't 100% thread-safe, just about 95%.
The dbus source code (and its language bindings) has mutexes all over the place, but fixing the thread issues can be a big job, yes. Somebody who understands the internals of dbus needs to go over the entire code base and rethink the behaviour, keeping in mind that several parallel requests might be in flight.

(In reply to comment #8)
> In bug 19796, a patch is provided which adds a new dbus_connection API which
> is intended to be more thread-safe.

It actually only contains a patch fixing send_with_reply (used by dbus-glib during async calls, among others). send_with_reply_and_block() can be fixed as well, but that would require a bit more work, since it depends on the underlying code dispatching messages with mutexes unlocked before reaching certain points, while not allowing other threads to make the core read the response. What if we could acquire (and release) a read block on the transport layer from a high-level call like send_with_reply_and_block()?

Right, obviously better to fix it, and we should. The primary intent of the patch is to get the website updated so people are aware of the issue that exists in current libdbus.

(In reply to comment #7)
> *** Bug 19796 has been marked as a duplicate of this bug. ***

It seems the report in bug 19796 is no longer related to this bug, other than being a thread issue in almost the same function.

Created attachment 22777 [details] [review] dbus-1.2.12.patch

Does this solve the issue? I just took a quick peek at the code, and to me it looks wrong, and dangerous, to check whether we are completed without holding the lock. As I'm not able to trigger this bug myself, can somebody else please test this?
(I have only done a compile check.)

Created attachment 22778 [details] [review] dbus-1.2.12.patch

Use the correct flags on diff :-p

(In reply to comment #11)
> Created an attachment (id=22759) [details]
> Bug-857-Update-DBusConnection-documentation-to-war.patch

This is good and will save a lot of time and annoyance for people trying to use dbus in threaded applications, but I have some comments. As noted in bug 19796, the dbus_connection_send_with_reply() API may be problematic with threading, so

+ * The API has been designed such that it should be possible to make it
+ * threadsafe in the future.

may not be true. Also, I think the API may allow too many use patterns, which makes it hard to test everything.

(In reply to comment #17)
> Created an attachment (id=22778) [details]
> dbus-1.2.12.patch

Your patch seems to close a small race window, but does not fix the issue I had in mind. I now notice that it is more complicated than I thought. There are not two but three states the reply can be in after it has arrived in the socket buffer: still in the socket buffer, in the connection's queue of incoming messages, or completed (complete_pending_call_and_unlock() has been done). Both of the latter two need to be checked in a race-free manner.

From an earlier comment:
> What if we could acquire (and release) a read block on the transport layer
> from a high-level call like send_with_reply_and_block()?

You cannot fix it that way in dbus_connection_send_with_reply_and_block(); it would block the entire connection while waiting for the (possibly slow) application on the other end. As mentioned in the API documentation, this can cause a deadlock if the other application needs you to reply to a method call; such a deadlock would be unexpected if the main loop runs in a separate thread. Such a read block really already exists: _dbus_connection_acquire_io_path() and _dbus_connection_release_io_path(). Having random threads do I/O on the socket seems a little overcomplicated to me and might have more bugs.
For connections that have a main loop in a separate thread, how about using the fixed dbus_connection_send_with_reply() and waiting for the DBusPendingCall to complete using a condition variable? That does not solve the problem of two threads simultaneously calling dbus_connection_send_with_reply_and_block() on a connection without a main loop, but that can be fixed by disallowing it (libdbus possibly allows too many different ways of using it). In the main loop thread, and on connections used by a single thread only (including single-threaded applications), _dbus_connection_block_pending_call() works properly, although it may leave a rather large queue of incoming messages (because the other messages must be read to detect the reply, the kernel will allow the bus daemon to send more and more messages). There may also be bugs in the case where sending to the socket would block. This does not usually happen, so those bugs do not show up as easily. The actual send could be delegated to the main loop thread (if any) as well.

Completely agree about the random-threads-doing-I/O problem. Here are the high-level scenarios I think make sense:

1) One thread, no main loop, calling dbus_connection_send_with_reply_and_block().
2) One thread, main loop.
3) Multiple threads, one thread pumping the main loop, a worker thread calling dbus_connection_send() (without a reply; the idea here is that the worker thread itself is sending a reply message).
4) Multiple threads, one thread pumping the main loop, another thread calling dbus_connection_send_with_reply_and_block().

In particular, I consider calling dbus_connection_send_with_reply() from a worker thread nonsensical. It would imply keeping track of which threads initiated particular actions and ensuring we invoke callbacks on those threads somehow... I don't even know how that would work. The scenario I'm somewhat unsure about is whether we want to support multiple threads with no main loop calling dbus_connection_send_with_reply_and_block().
Scenario 4 is the one that can hit people who are just using libraries which internally use libdbus, and is really the most important one to fix, IMO. Scenario 3 is definitely a "would be nice", but if you're implementing a server you can also just do the equivalent of g_idle_add() to send the reply message you've constructed back from the main loop thread.

The intent of dbus_connection_send_with_reply_and_block(), as far as I can see, is to provide a simple call for waiting for a reply (turning the naturally asynchronous design of dbus into a synchronous one). In the current design, any thread can trigger pending-call callbacks as it makes the dbus core read messages in from the transport layer. That is not a great design if you actually have a main loop that could do this job for us, but it works to a certain degree. Also, the available thread API lacks support for condition variables, which could have eased the design when multiple dbus_connection_send_with_reply_and_block() calls run in parallel on the same connection, or could have supported a proper invoke mechanism. If dbus_connection_send_with_reply_and_block() never released the lock it holds on the connection while doing its job, would it ever be a thread issue then (apart from thread starvation and the fact that the connection stops responding to RPC calls while waiting for the reply, but that is what happens anyway if you run with one thread and no main loop, isn't it)?

The basic intent of the thread support was cases 3 and 4, I think, i.e. to allow blocking calls outside of the main loop's thread, so you can spawn a thread that does a particular method call. Secondarily, you could write an app with no main loop using only threads (sort of how Java server apps usually work): one thread would block draining the incoming message queue while other threads would make outgoing calls, or something along those lines. Most people just use a main loop, since it's a lot easier, of course.
If you do a sync call on a connection, either it should be acceptable that the connection might block other callers, or dbus needs to implement condition variables in its thread API. The second option would make dbus no longer backward compatible. Or we could fall back to blocking the connection while doing a sync call if condition variables aren't supported?

> Or we could fall back to blocking the connection while doing a sync call if
> condition variables aren't supported?

I think this is OK, if it's the only solution. People using threads are mostly doing embedded stuff with a custom-built dbus anyway.
Relevant mailing list threads:

http://lists.freedesktop.org/archives/dbus/2009-February/010945.html
http://lists.freedesktop.org/archives/dbus/2009-February/010993.html

Additional note: if it's indeed the case that dbus_bus_get_private() is recommended in multithreaded apps, it should be mentioned in the documentation. I'm currently testing whether it fixes my (occasional) crashes.

My main recommendation if using threads is that people have the dbus source and are prepared to dig into it and help improve libdbus thread support. I don't think there's any special need to use a private connection, though it may avoid certain bugs, I'm not sure. Seems better to fix the bugs.

(In reply to comment #26)
> I don't think there's any special need to use a private connection, though it
> may avoid certain bugs, I'm not sure. Seems better to fix the bugs.

I can confirm that switching to a private connection did *not* fix my problems. I keep getting crashes like:

Signal 11 (SIGSEGV)
#0 _dbus_connection_lock (connection=0xd08) at dbus-connection.c:355
#1 0x40054eb4 _dbus_pending_call_get_connection_and_lock (pending=0x37ab00) at dbus-pending-call.c:309
#2 0x40047b18 reply_handler_timeout (data=0xd08) at dbus-connection.c:3116
#3 0x40059acc dbus_timeout_handle (timeout=0x37ab30) at dbus-timeout.c:473
#4 0x40e8f9dc timeout_handler_dispatch () at dbus-gmain.c:343
#5 0x408166b8 g_timeout_dispatch (source=0x37aba0, callback=0x40e8f9cc <timeout_handler_dispatch>, user_data=0xd08)

It seems user_data in reply_handler_timeout is rubbish.

(In reply to comment #27)
> I can confirm that switching to a private connection did *not* fix my problems.
A private connection only helps if you have two or more places otherwise sharing the same connection, for example if a library calls dbus_bus_get() itself. In a multithreaded app, this sharing makes it hard to ensure each connection is used by only a single thread. The sharing also makes it hard to avoid triggering assertions in bindings. So a private connection is still a good idea in most cases. If, as I'm guessing from the backtrace, you are using the same connection from multiple threads, you are already in trouble, and a private connection is not going to fix it for you. Read the previous comments for some of the problems.

(In reply to comment #25)
> If it's indeed the case that dbus_bus_get_private is recommended in
> multithreaded apps, it should be mentioned in the documentation. I'm currently
> trying whether it fixes my (occasional) crashes.

Now it seems our problem is resolved. Our code had this:

- if (conn && dbus_connection_send_with_reply(conn, adapterMsg, &call, -1)) {
-     dbus_pending_call_set_notify(call, get_properties_cb, NULL, NULL);
-     dbus_pending_call_block(call);
-     dbus_pending_call_unref(call);

Changing it to dbus_connection_send_with_reply_and_block() seems to have fixed the crash. So there is probably a race condition between send_with_reply(), set_notify() and _block() while other threads are using dbus in the background.
The following sources were the inspiration for this fix:

http://lists.freedesktop.org/archives/dbus/2009-March/011031.html

gvfs workaround:
http://lists.freedesktop.org/archives/dbus/2009-March/011037.html
http://svn.gnome.org/viewvc/gvfs/trunk/common/gdbusutils.c?revision=2262&view=markup

I don't think this bug being open is of any benefit to anyone any more: it hasn't been touched for 5 years despite several threading bugs having been fixed during that time, and it is not clear what concrete change or fix, if any, would close it. So I'm resolving it as WORKSFORME, with the various concrete found-and-fixed threading bugs as justification.

If someone knows of a specific threading problem that can be reproduced in a supported branch of dbus (currently 1.8+), please open a specific bug describing that problem. Please feel free to include URLs to particular comments in this one if they are relevant, but reopening this one is unlikely to be constructive, IMO.

Similarly, if anyone has concrete suggestions for documentation improvements, please open an 'enhancement' bug with a patch. I've been working on libdbus (on and off) for several years now, so I "know too much", and am unlikely to choose the correct new-developer-visible places for warnings; so I'm afraid my response to anyone saying "you should have a warning..." is likely to be "if you think so, please write it and send a patch".

Thanks, S

----

(In reply to comment #18)
> Also, I think the api may allow too many use patterns that make it hard to
> test everything.

This is certainly very true. libdbus tries to be all things to all people, and (inevitably) fails. However, we can't fix the API of an existing library without making people port everything to the new API anyway, and I happen to think that GDBus is pretty much what a C API for D-Bus should look like; so, please consider GDBus to be "the new API" for C clients of D-Bus.
In C++-land, QtDBus also does well, despite just being a libdbus binding behind the scenes, by forcing its API users into particular threading/main-loop patterns that are known to work acceptably, using Qt's event loop.

As a general thing that I would say to the original D-Bus designers if time-travel-based shenanigans gave me the opportunity: writing good async APIs that assume a particular main loop (GLib's, or Qt's) is, if not easy, then at least something that can be done if you're careful. Writing good async APIs that are main-loop-agnostic is really hard. Writing good async APIs that don't even assume that you *have* a main loop is just not feasible.

(In reply to comment #26)
> My main recommendation if using threads is that people have the dbus source
> and are prepared to dig into it and help improve libdbus thread support.

Yes. Documentation patches that would have warned you off problematic use-patterns would be welcome: I think by now I've used libdbus far too much to know which bits a beginner is going to look at.

> I don't think there's any special need to use a private connection, though
> it may avoid certain bugs, I'm not sure. Seems better to fix the bugs.

Notably, dbus < 1.6 (which was current when this bug was last touched, but is now very obsolete) had a faulty implementation of recursive mutexes on Unix, which meant that all the code we thought should be thread-safe wasn't. It took years for this to be found (via better tools, namely helgrind) and fixed (by using standard pthreads primitives).