Created attachment 22331 [details]
DBUS_VERBOSE=1 and gdb-backstraces (with comments).txt
dbus_connection_send_with_reply and all its users like dbus_connection_send_with_reply_and_block and dbus_g_proxy_begin_call_internal can randomly miss reply messages when using pending (async).
With the current design, the following scenario might happen (and sometime does) if the peer you are talking to is fast enough:
dbus internals might send off the message straight away, and also retrieve a reply straigh away
reply message finds its matching DBusPendingCall
DBusPendingCall->function is NULL, so no call is done
* dbus_connection_send_with_reply finally returns
* dbus_pending_call_set_notify() sets the return notify-function
And there you are, your notify function is never called (you will never even receive a timeout error), and memory for maintaining the DBusPendingCall is leaked.
dbus_connection_send_with_reply() should perhaps have been replaced with a new function (with a new name) that has a new argument, a callback functions that configures the newly created DBusPendingCall structure, and
dbus_connection_send_with_reply tagged dangerous/deprecated.
Sorry for tagging this blocker/highest, but atleast I find this bug to be quite severe, due to the main functions of dbus fail to function as expected/designed.
Created attachment 22384 [details] [review]
This is a code-suggestion (no documentation updated) about how to fix dbus_connection_send_with_reply atleast. If dbus_connection_send_with_reply_and_block needs change I don't know (I don't know dbus internals much)
Created attachment 22386 [details] [review]
Take advantage of dbus-1.2.12-dbus_connection_send_with_reply.patch. With both these applied, async replies are much more reliable. Without them, replies never reach me unless I slow my system down with valgrind etc.
Can we fix dbus_pending_call_set_notify in some way?
Would be good to have a test case for this. I wonder how we didn't hit it earlier.
I'm investigating more closely now.
(In reply to comment #3)
> Can we fix dbus_pending_call_set_notify in some way?
No, since we end up to call this function way too late, which is the actual cause of this, unless you think of dbus_pending_call_set_notify() checking if reply already has come, and calls the callback function itself if that has happened.
Either way what the fix is, it should be documented that the function we fix might call the callback immidiantly.
> Would be good to have a test case for this. I wonder how we didn't hit it
> I'm investigating more closely now.
Might be that the orignal dbus code didn't attempt to read a reply just after a send perhaps, I don't know. Users of glib-dbus who does not use async, seems not to be affected by this.
test-case... I'll what I can fix next week. I'm on a tight schedule, and the current test I have is a big project, so it doesn't fit as a simple test-case.
With patch-set submitted, my project runs fine.
(In reply to comment #4)
> (In reply to comment #3)
> > Can we fix dbus_pending_call_set_notify in some way?
> No, since we end up to call this function way too late, which is the actual
> cause of this, unless you think of dbus_pending_call_set_notify() checking if
> reply already has come, and calls the callback function itself if that has
> Either way what the fix is, it should be documented that the function we fix
> might call the callback immidiantly.
Replying to myself here:
Fixing as I stated above should be doable, since when we call dbus_connection_send_with_reply(), the given DBusPendingCall() is returned with a ref-count, so it should be possible and safe to add some intelligence in there for handling dbus_pending_call_set_notify() beeing called after the reply has been received.
Attempt in patching dbus_pending_call_set_notify() seems not to be a good idea, since dbus-glib expects dbus_pending_call_set_notify() to return without the callback beeing called, since it wants to append the Pending to a hash-list first. And callback expects to find this entry, which then failes.
Process 10506: arguments to dbus_pending_call_block() were incorrect, assertion "pending != NULL" failed in file dbus-pending-call.c line 713.
This is normally a bug in some application using the D-Bus library.
D-Bus not built with -rdynamic so unable to print a backtrace
Program received signal SIGABRT, Aborted.
[Switching to Thread 0xb7455b90 (LWP 10509)]
0xb80d9424 in __kernel_vsyscall ()
#0 0xb80d9424 in __kernel_vsyscall ()
#1 0xb7d573c1 in raise () from /lib/libc.so.6
#2 0xb7d58cf4 in abort () from /lib/libc.so.6
#3 0xb8078135 in _dbus_abort () at dbus-sysdeps.c:88
#4 0xb8073ba9 in _dbus_warn_check_failed (format=0xb8080c20 "arguments to %s() were incorrect, assertion \"%s\" failed in file %s line %d.\nThis is normally a bug in some application using the D-Bus library.\n") at dbus-internals.c:283
#5 0xb8069a6c in dbus_pending_call_block (pending=0x0) at dbus-pending-call.c:713
#6 0xb809115f in dbus_g_proxy_end_call_internal (proxy=0x808fb18, call_id=0x1, error=0xb7454efc, first_arg_type=0x0, args=0xb7454ef0 "��\b\b�/\b�(OE�") at dbus-gproxy.c:2264
#7 0xb8091645 in dbus_g_proxy_end_call (proxy=0x808fb18, call=0x1, error=0xb7454efc, first_arg_type=0x0) at dbus-gproxy.c:2541
#8 0xb80c2c86 in org_freesmartphone_GSM_SIM_send_auth_code_async_callback (proxy=0x808fb18, call=0x1, user_data=0x8086960) at ogsmd/dbus/sim.h:80
#9 0xb808fc2f in d_pending_call_notify (dcall=0x812b850, data=0x813dd68) at dbus-gproxy.c:1763
#10 0xb8069ed6 in dbus_pending_call_set_notify (pending=0x812b850, function=0xb808fc10 <d_pending_call_notify>, user_data=0x813dd68, free_user_data=0xb808fc40 <d_pending_call_free>) at dbus-pending-call.c:615
#11 0xb80910d9 in dbus_g_proxy_begin_call_internal (proxy=0x808fb18, method=<value optimized out>, notify=0xb80c2c46 <org_freesmartphone_GSM_SIM_send_auth_code_async_callback>, user_data=0x8086960, destroy=0xb80c3b37 <_dbus_glib_async_data_free>, args=0x80b95a0, timeout=0xffffffff) at dbus-gproxy.c:2223
#12 0xb80920cf in dbus_g_proxy_begin_call (proxy=0x808fb18, method=0xb80d1dc5 "SendAuthCode", notify=0xb80c2c46 <org_freesmartphone_GSM_SIM_send_auth_code_async_callback>, user_data=0x8086960, destroy=0xb80c3b37 <_dbus_glib_async_data_free>, first_arg_type=0x40) at dbus-gproxy.c:2445
#13 0xb80c3ccc in org_freesmartphone_GSM_SIM_send_auth_code_async (proxy=0x808fb18, IN_pin=0x80f1029 "8800", callback=0xb80c3b62 <ogsmd_sim_send_auth_code_callback>, userdata=0x80f7790) at ogsmd/dbus/sim.h:97
Looking at this in a bit more depth, I believe the way this is supposed to work is simply that when we do dbus_connection_send_with_reply, we only attempt to write to the socket if possible; i.e. we should not do a read which would cause us to process replies.
The detail looks like this, with some functions omitted:
_dbus_connection_do_iteration_unlocked, notably with the flag DBUS_ITERATION_DO_WRITING ->
The key bit is in socket_do_iteration, since we're authenticated we drop into the first if clause, then the inner clause should be true, we call do_writing which eventually just calls Unix write(). Then if there's still more to do we drop into poll(), but again we should just be calling do_writing.
Now, I'm not claiming there couldn't be bugs here - for example how authentication works is fairly complex, and I could imagine there could be a bug where we might inadvertently process message reads immediately after authentication, or something like that.
Anyways, writing a test case now.
Created attachment 22536 [details] [review]
This patch creates many pending calls (50000, with up to 200 in flight simultaneously). If the test is correct (as I believe it is), then no reply messages are being lost.
Just some thoughts I have:
Might be that the project I'm testing with enables threading, so I did some further debug:
when _dbus_connection_update_dispatch_status_and_unlock() does
/* We drop the lock */
Another thread reaches this point
#0 socket_handle_watch (transport=0x808d978, watch=0x808da28, flags=0x1) at dbus-transport-socket.c:826
#1 0xb7ec86f3 in _dbus_transport_handle_watch (transport=0x808d978, watch=0x808da28, condition=0x1) at dbus-transport.c:865
#2 0xb7eb4b98 in _dbus_connection_handle_watch (watch=0x808da28, condition=0x1, data=0x808dd18) at dbus-connection.c:1435
#3 0xb7ecb47f in dbus_watch_handle (watch=0x808da28, flags=0x1) at dbus-watch.c:663
#4 0xb7ee958f in io_handler_dispatch (source=0x808f780, condition=0, data=0x8060378) at dbus-gmain.c:232
#5 0xb7df4327 in g_io_unix_dispatch (source=0x808f7c8, callback=0xb7ee9520 <io_handler_dispatch>, user_data=0x8060378) at giounix.c:162
#6 0xb7db60f8 in g_main_dispatch (context=0x8089f58) at gmain.c:2144
#7 0xb7db75be in IA__g_main_context_dispatch (context=0x8089f58) at gmain.c:2697
#8 0xb7db7b12 in g_main_context_iterate (context=0x8089f58, block=0x1, dispatch=0x1, self=0x805b2b8) at gmain.c:2778
#9 0xb7db82e9 in IA__g_main_loop_run (loop=0x805e270) at gmain.c:2986
#10 0x0804a2c9 in main (argc=0x1, argv=0xbfd54c94) at ophonekitd-main.c:92
Which then again leads incomming data beeing parsed and handled, before _dbus_connection_update_dispatch_status_and_unlock gets any further (single core CPU here)
So this is a very hard bug to trigger which my sytems seems to enjoy triggering all the time in my project (every time actually).
Ahh, yes, we have longstanding problems with threads =/
Fixing it would obviously be the best thing to do. In the absence of a fix, it's really our fault for not having bigger, more obvious warnings about the current state of thread support.
There's an existing bug which this one seems to duplicate now that I look at the full list: bug 857.
*** This bug has been marked as a duplicate of bug 857 ***
I would like to request this bug and the patch in comment #1 to be reconsidered. Like bug #857 this issue is only apparent in multi-threaded applications, however fixing just this bug allows, in my experience, reliable asynchronous method calls to be made from multi-threaded clients.
Conversely, bug #857 describes threading issues in send_with_reply_and_block which is for performing (pseudo) synchronous method calls and has its own set of issues.
The patch provided in comment #1 seems reasonable to me and solves the obvious race condition when making async calls. I have applied this patch in my environment and can confirm it works as intended.
Notes on the patch:-
* the _dbus_connection_update_dispatch_status function should be removed, it seems to have been included by mistake?
* dbus_connection_send_with_reply_and_block needs to be modified to use dbus_connection_send_with_reply_setup instead of dbus_connection_send_with_reply (since the latter is deprecated)
* there is a typo in the warning message: discuraged/discouraged.
If applied this patch will require some downstream work in the various bindings. glib patch is provided in comment #2 (this I haven't tested). The work required is very minimal however.
Further, bug #857 could be marked as having a dependency on this bug since send_with_reply_and_block uses send_with_reply.
Created attachment 26440 [details] [review]
Applying my own suggested fixes to original patch:
* the _dbus_connection_update_dispatch_status function is removed
* dbus_connection_send_with_reply_and_block modified to use
dbus_connection_send_with_reply_setup instead of
dbus_connection_send_with_reply (since the latter is deprecated)
* fix typo in the warning message: discuraged/discouraged.
(In reply to comment #12)
Thanks Mark, I have attached this patch in OpenEmbedded, so all of our distros benefit from it. We're getting swamped by the dbus_warn now though -- what else in dbus-glib or dbus needs to be patched to get rid of the warnings / fix the remaining issues?
Looking at this again, I agree it's useful if we can make a targeted fix to send_with_reply to be threadsafe.
I'm unsure though about the model of allowing random threads to do I/O. This targeted patch may work for now, but I'd like to get a rough consensus of how we tackle thread issues in general in bug 896, and then review specific patches.
For example with this bug, I'm not sure about straight up deprecating _send_with_reply; a lot of people are using it, and it's fine if you're not using threads. We might instead mark it as not threadsafe, say.
Comment on attachment 26440 [details] [review]
This seems pretty clunky.
Might be cleaner to just have a variant that passes in the set_notify args.
Also, no need to deprecate the other one, I don't think, just document the thread issue.
(In reply to comment #16)
> (From update of attachment 26440 [details] [review])
> This seems pretty clunky.
> Might be cleaner to just have a variant that passes in the set_notify args.
> Also, no need to deprecate the other one, I don't think, just document the
> thread issue.
Programs already written that are threaded should be catched somehow. How about issuing a warning at runtime if the functions is called and it is detected that glib is initialised with threading enabled?
we pretty much always enable threads though in any real app.
however, I don't think using dbus from multiple threads is very common (if only because it has never worked well), and when people do it they've often known about this issue... I don't know.
> however, I don't think using dbus from multiple threads is very common (if only
> because it has never worked well), and when people do it they've often known
> about this issue... I don't know.
Some applications developers does not know this. I was one of them. I randomly hit this bug when hacking on an already existing applications which is using dbus threaded.
DBus with threading in general should probably have its own tracking bug and being some future milestone
(In reply to comment #6)
> Attempt in patching dbus_pending_call_set_notify() seems not to be a good
> idea, since dbus-glib expects dbus_pending_call_set_notify() to return
> without the callback beeing called
Unexpected reentrancy is bad, yes. Let's not do that.
(In reply to comment #15)
> I'm unsure though about the model of allowing random threads to do I/O.
In the bug as reported, which thread would the notify-function get executed in? I think it's the thread that is calling dbus_connection_dispatch(). So there are two threads of execution:
dbus_connection_send_with_reply (..., &pc, ...);
dbus_pending_call_set_notify (pc, on_pending_call_notify, ...);
dbus_connection_read_write_dispatch (...); /* or something */
As far as I can see, on_pending_call_notify() will be called from the dispatching thread, and there's no way to know whether it will be before or after do_something_else(). This seems a poor model.
I'm pretty sure that the right way to do asynchronous I/O among threads looks a lot like GMainContext <https://tecnocode.co.uk/2014/03/27/what-is-gmaincontext/> - whenever you pass a callback to a function, it should be obvious which context ("which thread") will call that callback.
In GDBus, when you call a method, the answer is "the thread-default main context of the thread that called the method", which in practice means the calling thread doesn't get called back until it goes back to its main loop. However, libdbus can't do that, because libdbus has no concept of main-contexts, and cannot even assume that its user has a "main loop" at all.
I think the answer to this is "use a higher-level thing, which can impose a particular threading/concurrency/main-loop model". GLib is one example, and GDBus (a compatible reimplementation of D-Bus) uses GLib's main loop model. Qt is another example, and QtDBus (a libdbus wrapper) uses Qt's threading model.
libdbus' documentation does specifically say "if you use this, you're signing up for some pain" and it does what it says on the tin. In particular...
(In reply to comment #19)
> DBus with threading in general should probably have its own tracking bug and
> being some future milestone
... if that bug existed, I would close it WONTFIX and recommend that people switch to GDBus or QtDBus, because with the limited assumptions that libdbus allows itself to make, I don't think we can ever support all the weird ways in which people want to do threading.