Bug 102839 - IO lock contention, causing timeouts
Summary: IO lock contention, causing timeouts
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core
Version: 1.8
Hardware: ARM Linux (All)
Importance: medium normal
Assignee: Simon McVittie
QA Contact: D-Bus Maintainers
URL:
Whiteboard: review+
Keywords: patch
Depends on:
Blocks:
 
Reported: 2017-09-18 13:13 UTC by Michael Searle
Modified: 2018-02-12 10:48 UTC
CC List: 1 user

See Also:


Attachments
Patch to fix the described bug (4.87 KB, patch)
2017-09-18 13:13 UTC, Michael Searle
Details | Splinter Review
Test patch to trigger the bug (749 bytes, patch)
2017-09-19 15:38 UTC, Michael Searle
Details | Splinter Review
Simplified patch, connection flag only (2.01 KB, patch)
2017-09-20 10:01 UTC, Michael Searle
Details | Splinter Review
Patch with sleeps (3.98 KB, patch)
2017-09-25 17:15 UTC, Michael Searle
Details | Splinter Review
test code to reproduce the problem (20.00 KB, application/x-tar)
2018-01-25 12:39 UTC, Manish Narang
Details
patch to resolve pending call blocking on I/O despite having received a reply (3.79 KB, patch)
2018-01-25 12:41 UTC, Manish Narang
Details | Splinter Review
[1/3] DBusConnection: Pass a pending call around more often (1.71 KB, patch)
2018-01-25 13:17 UTC, Simon McVittie
Details | Splinter Review
[2/3] DBusPendingCall: Only update ->completed under the connection lock (4.07 KB, patch)
2018-01-25 13:22 UTC, Simon McVittie
Details | Splinter Review
[3/3] DBusPendingCall: Improve doc-comments around completed flag (2.03 KB, patch)
2018-01-25 13:24 UTC, Simon McVittie
Details | Splinter Review
test code to reproduce the problem with updated copyright information (20.00 KB, application/x-tar)
2018-01-25 14:00 UTC, Manish Narang
Details
[4/5] test_try_connect_to_bus: Allow skipping the use of a DBusLoop (906 bytes, patch)
2018-02-01 20:09 UTC, Simon McVittie
Details | Splinter Review
[5/5] test-thread-blocking: New test for waiting on pending calls in threads (9.96 KB, patch)
2018-02-01 20:09 UTC, Simon McVittie
Details | Splinter Review
[5/5 v2] test-thread-blocking: New test for waiting on pending calls in threads (10.02 KB, patch)
2018-02-01 20:21 UTC, Simon McVittie
Details | Splinter Review
[5/5 v3] test-thread-blocking: New test for waiting on pending calls in threads (11.05 KB, patch)
2018-02-06 19:07 UTC, Simon McVittie
Details | Splinter Review
[4/5, 1.12 edition] test_connect_to_bus: Allow skipping the use of a DBusLoop (1.06 KB, patch)
2018-02-06 19:20 UTC, Simon McVittie
Details | Splinter Review
[5/5 v4] Add new test for waiting on pending calls in threads (11.07 KB, patch)
2018-02-06 22:52 UTC, Simon McVittie
Details | Splinter Review

Description Michael Searle 2017-09-18 13:13:21 UTC
Created attachment 134315 [details] [review]
Patch to fix the described bug

On BrightSign players (embedded dual-core ARM Linux systems), the multi-threaded application (which uses libdbus-c++ and dbus 1.8.20) occasionally pauses for 25 seconds during continuous D-Bus activity (around 3000 transactions/second, as fast as it can go).

What appears to be going wrong is that the main application thread checks for a message, which has not yet been processed, so it then attempts to acquire the IO path.
But the dispatcher contends for the IO path at almost the same time, and wins.
The main thread then blocks while the dispatcher processes the message - but fails to spot that the message has been processed and thinks more IO is needed.
The result is that _dbus_connection_do_iteration_unlocked() calls _dbus_transport_do_iteration() with the default 25000ms timeout, when it should have returned already because the required message is already on its way.

My suggested fix for this has three parts.
First, add a flag to the pending call structure: initially clear, set when the call is processed by dispatch, and remaining set until the call is destroyed. Also add a block, similar to the existing tests in _dbus_connection_do_iteration_unlocked, to check this flag and avoid entering the transport (and so blocking).
This catches some but not all instances of this bug, because it is also possible for the pending pointer to be NULL at this stage.
So a second block is added to _dbus_connection_do_iteration_unlocked, testing for a new connection flag. This is set in the critical section in dbus_connection_dispatch.
Finally, to prevent these mechanisms from being bypassed, the NULL parameter is replaced with pending in the call to _dbus_connection_do_iteration_unlocked.

The patch (attached) is for 1.8.20; however, the bug has been reproduced in 1.11.16 and 1.10.14. (The patch has not been tested against any other version, though.)
Comment 1 Michael Searle 2017-09-19 15:38:24 UTC
Created attachment 134343 [details] [review]
Test patch to trigger the bug
Comment 2 Michael Searle 2017-09-19 15:39:26 UTC
The bug is difficult to reproduce, as it typically appears only once in hundreds of thousands of repetitions and is sensitive to changes in timing, such as those introduced by verbose debug prints. So I have a patch (attached) which triggers the bug most times it runs: it increases the delay in the critical section of the dispatcher with a usleep() call. The exact length of the delay is not important - I have used 1000 and 100000 us with similar results. The original fix has also been tested and found to be still effective with this patch in place.
Comment 3 Michael Searle 2017-09-20 10:01:10 UTC
A simplified version of this patch is attached. The connection flag and check remain unchanged, but the pending call flag and check are removed. This has been tested (using the above patch to trigger the bug frequently) and appears effective.
Comment 4 Michael Searle 2017-09-20 10:01:34 UTC
Created attachment 134357 [details] [review]
Simplified patch, connection flag only
Comment 5 Simon McVittie 2017-09-22 21:08:55 UTC
This might be the same issue as Bug #90135, and possibly even Bug #19796. The dispatching/main loop stuff is all about a decade old, so any bugs in this area are very long-standing.

Thank you for digging into this. I'll try to review the patches here when I have more time, but given the maintainers' current lack of understanding of the dispatch code and how fundamental it is, we shouldn't rush into merging anything. (In particular, accidentally breaking the dbus-daemon would be Very Bad.)

Some more general comments about threads:

What you are doing here is probably *meant* to work - libdbus was designed to be fully thread-safe. However, its original authors didn't use that facility heavily, and it has never been considered particularly reliable. The current recommendations for reliable threaded D-Bus are:

* Don't use libdbus (or its "thin" wrappers/bindings like dbus-c++) if
  you can help it. The good implementations, in no particular order, are
  GLib's GDBus (a reimplementation), Qt's QtDBus (a "larger" wrapper
  around libdbus) and systemd's sd-bus (another reimplementation).
  libdbus is the reference implementation, which isn't the same thing
  as the best implementation; and it has some very weird patterns
  coming from the fact that it tries to be suitable for the
  dbus-daemon, which has unusual requirements, as well as application code.

  Unfortunately, there's a long history of similarly "minimal"
  third party C++ bindings for libdbus (I think we've had dbus-c++,
  dbus-cpp, dbus-cxx and dbus-cplusplus, possibly not all distinct)
  which inherit some of its design mistakes, like trying to support
  any or no main loop from one codebase.

* Try to have this principle: for each object (connection, message,
  pending call, whatever), at any given time, you should be able to
  point to one thread and say: this thread owns the object, and no other
  thread is interacting with it. (GDBus doesn't require user code to
  have this pattern, but it works like this behind the scenes: all of
  its D-Bus interactions happen in a worker thread that is not accessible
  to user code, and API calls just post messages to the worker thread.
  Conversely, sd-bus absolutely requires user code to have this pattern,
  because it doesn't do any locking or thread-safety itself.)

* Have a main loop model/abstraction that helps to guide you into
  sensible patterns, like GLib's GMainContext. With hindsight,
  libdbus' biggest design mistake was trying to be main-loop-agnostic:
  the fact that it doesn't have a conceptual model for how it's meant to
  be used ensures that every multi-threaded use is doing something
  unexpected.

My comments from Bug #19796 also still apply here:

"""
I'm pretty sure that the right way to do asynchronous I/O among threads looks a lot like GMainContext <https://tecnocode.co.uk/2014/03/27/what-is-gmaincontext/> - whenever you pass a callback to a function, it should be obvious which context ("which thread") will call that callback.

In GDBus, when you call a method, the answer is "the thread-default main context of the thread that called the method", which in practice means the calling thread doesn't get called back until it goes back to its main loop. However, libdbus can't do that, because libdbus has no concept of main-contexts, and cannot even assume that its user has a "main loop" at all.

I think the answer to this is "use a higher-level thing, which can impose a particular threading/concurrency/main-loop model". GLib is one example, and GDBus (a compatible reimplementation of D-Bus) uses GLib's main loop model. Qt is another example, and QtDBus (a libdbus wrapper) uses Qt's threading model.

libdbus' documentation does specifically say "if you use this, you're signing up for some pain" and it does what it says on the tin. 
"""
Comment 6 Simon McVittie 2017-09-25 16:28:23 UTC
(In reply to Michael Searle from comment #0)
> What appears to be going wrong is that the main application thread checks
> for a message, which has not yet been processed, so it then attempts to
> acquire the IO path.

What is going on in this main application thread? Saying it "checks for a message" is non-specific. Is it sending a message, or blocking on a pending call, or flushing the connection, or dispatching the connection? From your proposed solution it seems it might be blocking on a pending call?

Similarly, what does the dispatcher thread do? From the description below it seems it might be calling dbus_connection_dispatch()?

It would be great if you could put together a minimal test-case for this, ideally using plain libdbus rather than dbus-c++, and ideally under a license suitable for integrating into the dbus test suite (MIT/X11 like test/dbus-daemon.c, or GPL/AFL like dbus/dbus-connection.c).

Failing that, a stack trace for both the interesting threads in the stuck situation would show me what is happening. Feel free to replace your own application-specific stack frames with "(censored)" or something if they are considered confidential. I would also be interested to see an outline of what the dispatching thread does, in terms of libdbus and dbus-c++ API calls.
Comment 7 Simon McVittie 2017-09-25 16:42:24 UTC
Having the "main application thread" block on a pending call and having a worker thread dispatch the connection seems an unusual pattern, particularly if the main thread is also the one responsible for running the UI as is conventional on Linux and other Unix systems (libX11, GTK+, Qt, etc. require all UI interaction to be done in the main thread).

The usual pattern is to run the GLib main loop, the Qt main loop, libevent, libev or some similar event loop in the main thread, make sure nothing in that thread is going to block, and use it for everything asynchronous, including GUI, networking and D-Bus.

In libdbus, that can be achieved by using dbus_connection_set_dispatch_status_function(), dbus_connection_set_watch_functions(), dbus_connection_set_timeout_functions() and dbus_connection_set_wakeup_main_function() to make libdbus tell your main loop what is going on. dbus/dbus-gmain.c in dbus-glib is believed to be an example of reasonably correct main-loop integration, although the rest of dbus-glib is strongly deprecated and should not be used or imitated.
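
As a rough skeleton (not taken from dbus-gmain.c - the callback bodies here are placeholders), the registration side of that integration looks something like this; a real integration must feed the watches and timeouts into its own poll loop and call dbus_connection_dispatch() when asked to:

#include <dbus/dbus.h>
#include <stdio.h>

/* Placeholder callbacks: a real main loop would add/remove the watch's fd
 * (dbus_watch_get_unix_fd()) in its poll set and schedule the timeouts. */
static dbus_bool_t add_watch (DBusWatch *watch, void *data) { return TRUE; }
static void remove_watch (DBusWatch *watch, void *data) { }
static void toggle_watch (DBusWatch *watch, void *data) { }

static dbus_bool_t add_timeout (DBusTimeout *timeout, void *data) { return TRUE; }
static void remove_timeout (DBusTimeout *timeout, void *data) { }
static void toggle_timeout (DBusTimeout *timeout, void *data) { }

static void
dispatch_status (DBusConnection *conn, DBusDispatchStatus status, void *data)
{
  /* If status is DBUS_DISPATCH_DATA_REMAINS, arrange for the main loop to
   * call dbus_connection_dispatch() soon (do not dispatch from here). */
}

static void
wakeup_main (void *data)
{
  /* Poke the main loop (e.g. write to a self-pipe) so it re-checks state. */
}

int
main (void)
{
  DBusError error = DBUS_ERROR_INIT;
  DBusConnection *conn = dbus_bus_get (DBUS_BUS_SESSION, &error);

  if (conn == NULL)
    {
      fprintf (stderr, "%s\n", error.message);
      return 1;
    }

  dbus_connection_set_watch_functions (conn, add_watch, remove_watch,
                                       toggle_watch, NULL, NULL);
  dbus_connection_set_timeout_functions (conn, add_timeout, remove_timeout,
                                         toggle_timeout, NULL, NULL);
  dbus_connection_set_dispatch_status_function (conn, dispatch_status, NULL, NULL);
  dbus_connection_set_wakeup_main_function (conn, wakeup_main, NULL, NULL);

  /* ... run the application's own poll/select loop here ... */
  return 0;
}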

A few users of D-Bus also make blocking calls (block on a pending call) on other threads while the main thread is dispatching the connection, but this pattern is somewhat discredited (perhaps because of this bug). A safer approach is to pass messages between the single thread that is responsible for dispatching, and any other threads that are interested in interacting with D-Bus.
Comment 8 Simon McVittie 2017-09-25 16:56:26 UTC
(In reply to Michael Searle from comment #0)
> continuous dbus activity (around 3000 transactions/second, as fast as it can)

This is not really what D-Bus (the protocol) and dbus (the reference implementation) are designed for. There is considerable per-transaction overhead, which is a trade-off made during design in order to get the other properties its designers wanted. This doesn't mean you can't use D-Bus, but it might not be the best possible hammer for your particular nail.

D-Bus is intended to be a signalling rather than bulk data protocol - often measured in seconds per message, rather than messages per second. There should usually be a brief flurry of D-Bus messages when something "interesting" happens, but little or no D-Bus traffic in a steady state. One of the classic examples is the Telepathy framework for real-time communications, which I used to work on. It uses D-Bus messages to represent text messages and users going online or offline, and you'd see a burst of D-Bus messages at the beginning and end of a voice or video call to do the setup and teardown, but the actual voice or video data goes out-of-band and is never seen on D-Bus.
Comment 9 Michael Searle 2017-09-25 17:15:28 UTC
Created attachment 134467 [details] [review]
Patch with sleeps
Comment 10 Michael Searle 2017-09-25 17:20:35 UTC
OK, some background.
The script doing 3000 calls per second is a test case which was the quickest way to get the bug to appear, not a real application.
(In real life, it was rare, but common enough for people with large numbers of units to notice.)
The main thread is a script interpreter which wants to do synchronous calls - in this case to a registry daemon which looks up requests in a hash table mapping strings to strings and returns them.

The above patch exists because the main thread must spin until the dispatcher is ready: it retries once, then calls sched_yield(), then calls nanosleep() for 1 ms at a time until it can make progress. This is not as generally applicable as the previous patch, but it does at least point at a known problem with that approach. (It can't just use a mutex, because the mutex would have to be held while making a callback.)
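
As a rough illustration of that backoff (a sketch, not the attached patch):

#include <sched.h>
#include <time.h>

/* Back off progressively while waiting for the dispatcher: retry immediately
 * once, then yield the CPU, then sleep 1 ms per further attempt. */
static void
backoff (unsigned int attempt)
{
  if (attempt == 0)
    return;                 /* first retry: just try again immediately */

  if (attempt == 1)
    {
      sched_yield ();       /* second retry: give up the rest of the timeslice */
      return;
    }

  struct timespec ts = { 0, 1000000 };  /* 1 ms */
  nanosleep (&ts, NULL);
}

/* Usage (can_make_progress() is a stand-in for the real condition):
 *   unsigned int attempt = 0;
 *   while (!can_make_progress ())
 *     backoff (attempt++);
 */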
Comment 11 Simon McVittie 2017-09-25 17:31:47 UTC
Comment on attachment 134357 [details] [review]
Simplified patch, connection flag only

Review of attachment 134357 [details] [review]:
-----------------------------------------------------------------

It looks as though the new struct member @processing is really @completing_pending_call_from_dispatch, and indicates that the thread that is dispatching the connection has taken responsibility for calling the pending call's callback.

If you are in _dbus_connection_do_iteration_unlocked() with a non-NULL pending call on another thread, and meanwhile you have dbus_connection_dispatch() trying to complete that same pending call, then I think that means you have two threads both trying to push forward the status of the connection, and whichever one gets the appropriate locks first, the pending call's callback will be called in that thread. This seems an unwise pattern: when user code (the pending call's callback) executes in an unknown thread, it can't know what is safe for it to do.

::: dbus-1.8.20/dbus/dbus-connection.c
@@ +1225,5 @@
>          }
> +      else if ( connection->processing )
> +        {
> +          _dbus_verbose ("connection processing while acquiring I/O path");
> +        }

Let's imagine that this check has become significant.

What that means: the thread that's blocking on a pending call held the connection lock at the beginning of do_iteration_unlocked. That means the dispatching thread can't have been in most parts of the body of dbus_connection_dispatch() at that time.

However, while the blocking-on-pending-call thread had temporarily dropped its connection lock to acquire the I/O path, the dispatching thread got into the critical section that you've marked with your new flag and started to complete the pending call. It got far enough to unref the pending call in _dbus_connection_detach_pending_call_and_unlock(), which drops the connection lock.

Now the blocking-on-pending-call thread gets its lock back and returns from acquiring the I/O path. There is a race. If the dispatching thread wins and the pending call is marked as completed first, then we get "pending call completed while acquiring I/O path" and we're happy. If the blocking-on-pending-call thread wins and checks whether the pending call has been marked as completed before the dispatching thread can get there, then we block.

This immediately makes me think that we have a race condition that is not fixed here: we are updating pending->completed without holding any locks, and other threads might query pending->completed.

@@ +2530,4 @@
>          {          
>            /* block again, we don't have the reply buffered yet. */
>            _dbus_connection_do_iteration_unlocked (connection,
> +                                                  pending,

This bit certainly looks like a bug: all the other calls to do_iteration in this function pass the pending call in.

@@ +4627,4 @@
>                                           reply_serial);
>    if (pending)
>      {
> +      connection->processing = 1;

I don't think a connection-wide processing flag, that disables waiting for *any* pending call on the connection, is conceptually correct. What happens if threads A, B, C are all blocking on pending calls, while thread D is dispatching? Whenever thread D is in this critical section, all of A, B and C will busy-wait.
Comment 12 Simon McVittie 2017-09-25 17:43:24 UTC
(In reply to Michael Searle from comment #0)
> First, add a flag to the pending call structure - initially clear, set when
> processed by dispatch and remaining set until destroyed.

This seems more conceptually correct. However, I can't help thinking that the existing ->completed flag should be this. We already set that flag before calling the callback, so it would seem OK to stretch it to meaning that we have merely *taken responsibility for* calling the callback.

The brute-force way would be to re-lock the connection around setting the completed flag, but that's ugly and slow.

I think another way would be to split _dbus_pending_call_complete() into _dbus_pending_call_start_completion_unlocked() (which is called under the lock and sets the flag) and _dbus_pending_call_finish_completion() (which is called outside the lock and calls the callback). Or something like that.

> This catches some but not all instances of this bug, because it is also
> possible for the pending pointer to be NULL at this stage.

If we change the last _dbus_connection_do_iteration_unlocked() in _dbus_connection_block_pending_call() so it *does* set pending != NULL, I think this problem should go away?

Would combining those two (make sure to hold the connection lock while setting ->completed = TRUE, and passing the pending call to _dbus_connection_do_iteration_unlocked() in one more place) solve your problem?
Comment 13 Simon McVittie 2017-09-25 17:47:51 UTC
(In reply to Michael Searle from comment #10)
> The script doing 3000 calls per second is a test case which was the quickest
> way to get the bug to appear, not a real application.

OK, that's good to know. Hammering D-Bus with calls like this is a valid way to shake out race conditions by giving them more opportunities to arise, just not something you should do in production.
Comment 14 Simon McVittie 2017-09-25 17:50:53 UTC
(In reply to Michael Searle from comment #10)
> The main thread is a script interpreter which wants to do synchronous calls
> - in this case to a registry daemon which looks up
> requests in a hash table mapping strings to strings and returns them.

In that case I would suggest doing some sort of internal message-passing from the main thread to the dispatcher thread, blocking on that internal message-passing, and putting all your D-Bus interaction in the dispatcher thread. Here, the dispatcher thread is effectively in the role that would normally go to the main thread of a desktop app (do asynchronous things as requested).
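
The shape of that hand-off is roughly this (an illustrative sketch only - the names are hypothetical, and initialization/cleanup are omitted):

#include <pthread.h>
#include <stdbool.h>

/* One request handed from the script (main) thread to the dispatcher thread.
 * The mutex and condition variable must be initialized before use
 * (e.g. PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER). */
typedef struct
{
  pthread_mutex_t lock;
  pthread_cond_t  done;
  const char     *key;       /* what to look up via D-Bus */
  char           *value;     /* filled in by the dispatcher thread */
  bool            finished;
} LookupRequest;

/* Called from the main thread: queue the request and block on the internal
 * condition variable, never on libdbus itself. */
char *
lookup_blocking (LookupRequest *req, void (*enqueue) (LookupRequest *))
{
  pthread_mutex_lock (&req->lock);
  req->finished = false;
  enqueue (req);                        /* wakes the dispatcher thread */

  while (!req->finished)
    pthread_cond_wait (&req->done, &req->lock);

  pthread_mutex_unlock (&req->lock);
  return req->value;
}

/* Called from the dispatcher thread after it has made the D-Bus call and
 * received the reply. */
void
lookup_complete (LookupRequest *req, char *value)
{
  pthread_mutex_lock (&req->lock);
  req->value = value;
  req->finished = true;
  pthread_cond_signal (&req->done);
  pthread_mutex_unlock (&req->lock);
}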
Comment 15 Simon McVittie 2017-09-25 18:29:58 UTC
I don't think either of the potential patches here is ready, but Comment #12 describes some possible routes.
Comment 16 Michael Searle 2017-09-27 10:00:15 UTC
It doesn't appear that moving the completion point (as described in comment 12) will be quite sufficient.  The original problem was that a pending call was removed from the pending queue, and so was not detected by the "other" thread, which then blocked, because it thought IO was required in order to make progress.  Setting the completion flag later would help with some potential races, but not that one.  So while it would probably be better to set the completed flag under the connection lock, it is still necessary to drop all locks for the callback, at which point there is a hazard.
Comment 17 Simon McVittie 2017-09-27 14:03:35 UTC
(In reply to Michael Searle from comment #16)
> The original problem was that a pending call
> was removed from the pending queue

Please tell me which call stacks, function names, variables are involved?

I asked for some more information in Comment #6 which is also relevant:

(In reply to Simon McVittie from comment #6)
> It would be great if you could put together a minimal test-case for this,
> ideally using plain libdbus rather than dbus-c++, and ideally under a
> license suitable for integrating into the dbus test suite (MIT/X11 like
> test/dbus-daemon.c, or GPL/AFL like dbus/dbus-connection.c).
> 
> Failing that, a stack trace for both the interesting threads in the stuck
> situation would show me what is happening. Feel free to replace your own
> application-specific stack frames with "(censored)" or something if they are
> considered confidential. I would also be interested to see an outline of
> what the dispatching thread does, in terms of libdbus and dbus-c++ API calls.

(In reply to Michael Searle from comment #16)
> Setting the completion flag later would help with some potential races, but
> not that one.  So while it would probably be better to set the completed
> flag under the connection lock, it is still necessary to drop all locks
> for the callback, at which point there is a hazard.

My suggestion in Comment #12 was to set pending->completed = TRUE earlier, not later. Specifically, we could set it before releasing the lock in complete_pending_call_and_unlock() - perhaps immediately before _dbus_connection_detach_pending_call_and_unlock().

If you assume that that change has been made, and also that the last call from _dbus_connection_block_pending_call() to _dbus_connection_do_iteration_unlocked() is changed to pass the missing pending call argument, what race condition or thread-safety hazard remains?

Please be specific. It's entirely possible that one of your earlier patches is necessary/more correct for a reason that I'm not yet seeing - but I am not going to merge anything to this poorly-understood part of the codebase without understanding exactly what race condition we are fixing (and when some future maintainer looks at this bug, they should be able to understand it too).
Comment 18 Michael Searle 2017-09-28 15:13:47 UTC
A stack trace follows, cut down to show the two relevant threads - thread 1 is the main thread, thread 31 is the dispatcher.
It has been generated by gdb from a dump made by causing a crash when a long wait was detected - that's the reason for the main thread being in abort().
Is this stack trace useful? (I could provide a similar dump made in the same way but at a different point if that would be more helpful.)

Thread 31 (LWP 1219):
#0  0xb1475630 in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0xb6e0b310 in DBus::DefaultMainLoop::dispatch (this=0x12d240c)
    at ../../clone/src/eventloop.cpp:211
#2  0xb6e0bba2 in DBus::BusDispatcher::enter (this=0x12d23b8)
    at ../../clone/src/eventloop-integration.cpp:100
#3  0x00ec61e0 in EnsureDBusDispatcher::thread () at lib/platform/dbus_dispatcher.cpp:37
#4  0x00ec69a2 in boost::detail::thread_data<bool (*)()>::run (this=0x12d32f8)
    at ../../../../build-bcm7425/tmp-glibc/sysroots/impala-td/include/boost/thread/detail/thread.hpp:116
#5  0xb2cc0d5c in boost::(anonymous namespace)::thread_proxy (param=<optimized out>)
    at libs/thread/src/pthread/thread.cpp:168
#6  0xb6e217a2 in start_thread (arg=0xbeb5cb77) at pthread_create.c:339
#7  0xb147f29c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89


Thread 1 (LWP 1212):
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xb1408adc in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#2  0xb140b52e in __GI_abort () at abort.c:89
#3  0xb6dab384 in _dbus_connection_do_iteration_unlocked (connection=0xb0087000, 
    connection@entry=0x1388698, pending=pending@entry=0x154b350, flags=1, flags@entry=6, 
    timeout_milliseconds=timeout_milliseconds@entry=25000)
    at ../../dbus-1.8.20/dbus/dbus-connection.c:1230
#4  0xb6dabb2a in _dbus_connection_block_pending_call (pending=pending@entry=0x154b350)
    at ../../dbus-1.8.20/dbus/dbus-connection.c:2432
#5  0xb6db7ed0 in dbus_pending_call_block (pending=pending@entry=0x154b350)
    at ../../dbus-1.8.20/dbus/dbus-pending-call.c:741
#6  0xb6dabf28 in dbus_connection_send_with_reply_and_block (connection=<optimized out>, 
    message=0x1572840, timeout_milliseconds=-1, error=0xbeb79650)
    at ../../dbus-1.8.20/dbus/dbus-connection.c:3575
#7  0xb6e097be in DBus::Connection::send_blocking (this=this@entry=0x12d5c8c, msg=..., 
    timeout=-1) at ../../clone/src/connection.cpp:436
#8  0xb6e113da in DBus::ObjectProxy::_invoke_method (this=0x12d5c88, call=...)
    at ../../clone/src/object.cpp:361
#9  0xb6e1140a in virtual thunk to DBus::ObjectProxy::_invoke_method(DBus::CallMessage&)
    () at ../../clone/src/object.cpp:356
#10 0xb6e0cedc in DBus::InterfaceProxy::invoke_method (this=<optimized out>, call=...)
    at ../../clone/src/interface.cpp:176
    
...
    
#38 0x00aac58e in main ()
Comment 19 Simon McVittie 2018-01-25 12:21:02 UTC
(In reply to Simon McVittie from comment #17)
> My suggestion in Comment #12 was to set pending->completed = TRUE earlier,
> not later. Specifically, we could set it before releasing the lock in
> complete_pending_call_and_unlock() - perhaps immediately before
> _dbus_connection_detach_pending_call_and_unlock().
> 
> If you assume that that change has been made, and also that the last call
> from _dbus_connection_block_pending_call() to
> _dbus_connection_do_iteration_unlocked() is changed to pass the missing
> pending call argument, what race condition or thread-safety hazard remains?

Manish Narang reported a similar blocking issue on the mailing list, and reports that a patch similar to what I described in Comment #12 resolves it. I'm going to break up their patch into orthogonal changes, which, taken together, will hopefully solve this bug.
Comment 20 Manish Narang 2018-01-25 12:39:39 UTC
Created attachment 136953 [details]
test code to reproduce the problem
Comment 21 Manish Narang 2018-01-25 12:41:17 UTC
Created attachment 136954 [details] [review]
patch to resolve pending call blocking on I/O despite having received a reply
Comment 22 Simon McVittie 2018-01-25 13:17:47 UTC
Created attachment 136955 [details] [review]
[1/3] DBusConnection: Pass a pending call around more often

From: Manish Narang <Manish.Narang@kpit.com>

If a pending call is provided, _dbus_connection_do_iteration_unlocked
checks whether it has completed or has a reply ready as soon as it
acquires the I/O path. If that's the case, then the iteration
terminates without trying to carry out I/O, so that the pending call
can be dispatched immediately, without blocking until a timeout is
reached. This change is believed to be necessary, but not sufficient,
to resolve #102839.

Based on part of a patch from Michael Searle on
<https://bugs.freedesktop.org/show_bug.cgi?id=102839>.
Commit message added by Simon McVittie.

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=102839
Reviewed-by: Simon McVittie <smcv@collabora.com>
Comment 23 Simon McVittie 2018-01-25 13:22:39 UTC
Created attachment 136956 [details] [review]
[2/3] DBusPendingCall: Only update ->completed under the connection lock

If one thread is blocking on a pending call, and another thread is
dispatching the connection, then we need them to agree on the value
of the completed flag by protecting all accesses with a lock. Reads
for this member seem to have the connection lock already, so it's
sufficient to make sure that the only write also happens under the
connection lock.

We already set the completed flag before calling the callback, so it
seems OK to stretch it to meaning that some thread has merely *taken
responsibility for* calling the callback.

The completed flag shares a bitfield with timeout_added, but that
flag is protected by the connection lock already.

Based on suggestions from Simon McVittie on
<https://bugs.freedesktop.org/show_bug.cgi?id=102839>.

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=102839
[smcv: Revert indentation changes; add commit message]
Reviewed-by: Simon McVittie <smcv@collabora.com>

---

This is the other half of what I suggested in Comment #12. I've broken it up into separate patches because we strongly prefer to have each commit be a single correct change that is justified by its commit message, to make it easier for future maintainers to work out what is going on.

Manish, please check that the combination of this and Attachment #136955 [details] still solves your multi-threading issues, and that the commit messages seem accurate? If the attribution or commit message that I've added are wrong, please propose something better, in `git-format-patch` format.

Michael, please could you see whether the combination of this and Attachment #136955 [details] solves the threading issue for you? If it doesn't, or if you review it and find a thread-safety hazard, please could you try to clarify what the remaining problems are?
Comment 24 Simon McVittie 2018-01-25 13:23:11 UTC
(In reply to Simon McVittie from comment #23)
> [2/3] DBusPendingCall: Only update ->completed under the connection lock

Forgot to add that this is also:

From: Manish Narang <Manish.Narang@kpit.com>
Comment 25 Simon McVittie 2018-01-25 13:24:11 UTC
Created attachment 136957 [details] [review]
[3/3] DBusPendingCall: Improve doc-comments around completed flag

---

Attachment #136956 [details] seems correct, but doesn't really explain the locking regime that is now in use. I think it deserves some comments; here's a proposed change.
Comment 26 Simon McVittie 2018-01-25 13:33:51 UTC
(In reply to Manish Narang from comment #20)
> test code to reproduce the problem

Who is the copyright holder for this test code? You? Your employer? (Please add a copyright notice for whoever owns it and is contributing it to dbus.)

This isn't ready for inclusion in the dbus automated tests as-is, but could be adapted to be something we can use - thanks!

Am I right to assume that the intended use is this?

* have a D-Bus session bus available
* make
* ./add-server &
* ./add-client
* successful result: add-client runs until killed, and the numbers
  in the "Replying to/Reply for" messages keep going up (10K replies
  reached in about 12 seconds on my laptop)
* unsuccessful result: after less than 1000 replies (typically between
  50 and 200 on my laptop), add-client hangs and stops producing
  further output
Comment 27 Manish Narang 2018-01-25 14:00:58 UTC
Created attachment 136959 [details]
test code to reproduce the problem with updated copyright information

Test code re-uploaded with updated copyright information.
Comment 28 Manish Narang 2018-01-25 14:05:30 UTC
> Who is the copyright holder for this test code? You? Your employer? (Please
> add a copyright notice for whoever owns it and is contributing it to dbus.)
Updated code has been attached with the copyright information in the header. Copyright is co-owned by me and my organization.

> This isn't ready for inclusion in the dbus automated tests as-is, but could
> be adapted to be something we can use - thanks!
Glad I could be of help.

> Am I right to assume that the intended use is this?
> 
> * have a D-Bus session bus available
> * make
> * ./add-server &
> * ./add-client
> * successful result: add-client runs until killed, and the numbers
>   in the "Replying to/Reply for" messages keep going up (10K replies
>   reached in about 12 seconds on my laptop)
> * unsuccessful result: after less than 1000 replies (typically between
>   50 and 200 on my laptop), add-client hangs and stops producing
>   further output
That is correct. A watchdog implementation can help detect whether a call blocks for more than 5 seconds, removing the need to confirm the hang manually with gdb.
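
Something along these lines (a hypothetical sketch, not part of the attached test code):

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Arm a SIGALRM before the blocking call and abort if it has not returned
 * within 5 seconds, so a hang produces a core dump instead of needing gdb. */
static void
watchdog_fired (int sig)
{
  (void) sig;
  static const char msg[] = "blocking call exceeded 5 seconds, aborting\n";
  write (STDERR_FILENO, msg, sizeof msg - 1);   /* async-signal-safe */
  abort ();
}

static void
call_with_watchdog (void (*blocking_call) (void))
{
  signal (SIGALRM, watchdog_fired);
  alarm (5);           /* watchdog: 5 seconds */
  blocking_call ();    /* e.g. wraps dbus_connection_send_with_reply_and_block() */
  alarm (0);           /* disarm on success */
}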
Comment 29 Manish Narang 2018-01-25 14:37:46 UTC
> Manish, please check that the combination of this and Attachment #136955 [details]
> [details] still solves your multi-threading issues, and that the commit
> messages seem accurate? If the attribution or commit message that I've added
> are wrong, please propose something better, in `git-format-patch` format.
Applied 136955, 136956 & 136957 (in that order). It still solves the multi-threading issue. The commit messages are good to go.
Comment 30 Michael Searle 2018-01-30 10:50:33 UTC
I have tested the patches (the combination of 136955, 136956 & 136957) and run our test script for some time (a total of 6 million repetitions, sufficient that without any fix I would have expected the bug to have appeared) with no problems seen.
Comment 31 Philip Withnall 2018-01-30 10:59:35 UTC
(In reply to Michael Searle from comment #30)
> I have tested the patches (the combination of 136955, 136956 & 136957) and
> run our test script for some time (a total of 6 million repetitions,
> sufficient that without any fix I would have expected the bug to have
> appeared) with no problems seen.

Thanks for testing, but this bug isn’t fixed until the patches have been reviewed and pushed to git master.
Comment 32 Simon McVittie 2018-01-30 14:00:18 UTC
(In reply to Philip Withnall from comment #31)
> Thanks for testing, but this bug isn’t fixed until the patches have been
> reviewed and pushed to git master.

Now that we have positive test results from both Michael and Manish, for patches that I understand, I think these are ready to go.

Patches 1 and 2 are by Manish, Reviewed-by: me (with the caveat that I think patch 2 deserves more comments). Patch 3 is by me and adds the comments I wanted; I'll push that one unreviewed soon if nobody wants to veto it, because it doesn't actually change any code. Second opinions welcome, of course.

I'm intending to put these in 1.12 and master, because the other maintained branches are basically security-fixes-only at this point; but for those interested in older branches, I assume they'll backport cleanly (the code nearby hasn't changed for a long time).

Please leave the bug open until someone (possibly me later this week?) converts Attachment #136959 [details] into something we can use in the test suite.
Comment 33 Simon McVittie 2018-01-30 14:01:13 UTC
Comment on attachment 134315 [details] [review]
Patch to fix the described bug

Superseded by Attachment #136955 [details], Attachment #136956 [details], Attachment #136957 [details]
Comment 34 Simon McVittie 2018-01-30 14:01:27 UTC
Comment on attachment 134357 [details] [review]
Simplified patch, connection flag only

Superseded by Attachment #136955 [details], Attachment #136956 [details], Attachment #136957 [details]
Comment 35 Simon McVittie 2018-02-01 20:09:16 UTC
Created attachment 137120 [details] [review]
[4/5] test_try_connect_to_bus: Allow skipping the use of a DBusLoop

DBusLoop isn't thread-safe, so we can't use it to test multi-threaded
situations.
Comment 36 Simon McVittie 2018-02-01 20:09:50 UTC
Created attachment 137121 [details] [review]
[5/5] test-thread-blocking: New test for waiting on pending calls in threads

Based on code contributed by Manish Narang.
Comment 37 Simon McVittie 2018-02-01 20:20:00 UTC
(In reply to Manish Narang from comment #27)
> Created attachment 136959 [details]
> test code to reproduce the problem with updated copyright information

FYI, this test code also contains various race conditions: there's no locking or other cross-thread synchronization around threads_alive, and thread_func() races with the main thread (if the main thread wins the race every time, then it will think no threads are alive and just exit).

I've simplified it into a single program by moving the service into another thread, switched to GLib's test framework instead of raw pthreads, made it go faster by taking out the usleep() calls, and reduced the number of threads and method calls by default. Hopefully this version will 

These changes make it fail reliably on my laptop (x86, 2 cores, 4 hyperthreads) without the patches, even when not run with "-m slow" to restore your original number of threads and method calls. I try to avoid the automated tests hammering slow platforms too much, because that makes them fail with timeouts on various distributions' autobuilder infrastructure - hopefully the version with 6 threads rather than 22 will avoid that!

Code reviews welcome (from maintainers or anyone else).
Comment 38 Simon McVittie 2018-02-01 20:20:30 UTC
(In reply to Simon McVittie from comment #37)
> Hopefully this version will

... work reliably.
Comment 39 Simon McVittie 2018-02-01 20:21:53 UTC
Created attachment 137122 [details] [review]
[5/5 v2] test-thread-blocking: New test for waiting on pending calls in threads

---

Same as previous, but when run with "-m slow", multiply the SIGALRM watchdog timeout by 10 so we don't just time out and fail.
Comment 40 Manish Narang 2018-02-06 11:31:37 UTC
Simon, 

Any idea when the changes will be merged in the remote?
I'm asking because I am planning to migrate to the latest libdbus version, as per your suggestion.

Regards,
Manish
Comment 41 Philip Withnall 2018-02-06 11:58:35 UTC
Comment on attachment 136957 [details] [review]
[3/3] DBusPendingCall: Improve doc-comments around completed flag

Review of attachment 136957 [details] [review]:
-----------------------------------------------------------------

r+
Comment 42 Philip Withnall 2018-02-06 12:15:17 UTC
Comment on attachment 137120 [details] [review]
[4/5] test_try_connect_to_bus: Allow skipping the use of a DBusLoop

Review of attachment 137120 [details] [review]:
-----------------------------------------------------------------

r+
Comment 43 Philip Withnall 2018-02-06 12:17:39 UTC
Comment on attachment 137122 [details] [review]
[5/5 v2] test-thread-blocking: New test for waiting on pending calls in threads

Review of attachment 137122 [details] [review]:
-----------------------------------------------------------------

Needs to be added to cmake/test/CMakeLists.txt. :-(

Other than that, r+.
Comment 44 Simon McVittie 2018-02-06 12:40:29 UTC
(In reply to Manish Narang from comment #40)
> Any idea when the changes will be merged in the remote?

Now that Philip has reviewed them: hopefully today, and I'm hoping to have time to do 1.12.4 and 1.13.0 releases this week.
Comment 45 Simon McVittie 2018-02-06 19:07:55 UTC
Created attachment 137190 [details] [review]
[5/5 v3] test-thread-blocking: New test for waiting on pending calls in threads

Based on code contributed by Manish Narang.

Reviewed-by: Philip Withnall <withnall@endlessm.com>
[smcv: Add the test to the CMake build system too, as requested]
Signed-off-by: Simon McVittie <smcv@collabora.com>
Bug: https://bugs.freedesktop.org/show_bug.cgi?id=102839

---

I'm going to push this without further review if nobody actively vetoes it in the near future, since the CMake part only differs trivially from the tests above and below it.
Comment 46 Simon McVittie 2018-02-06 19:20:24 UTC
Created attachment 137191 [details] [review]
[4/5, 1.12 edition] test_connect_to_bus: Allow skipping the use of a DBusLoop

DBusLoop isn't thread-safe, so we can't use it to test multi-threaded
situations.

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=102839
Signed-off-by: Simon McVittie <smcv@collabora.com>
Reviewed-by: Philip Withnall <withnall@endlessm.com>
(cherry picked from commit f127c8e110faed76039f96dbc53a87f093fea312)
[smcv: Adjust for older codebase]

---

This is pretty trivial, so I'm going to just push it if Travis-CI has succeeded by the time I get another chance to look at this bug; but in case someone wants to review it, here is the 1.12.x version of [4/5].

[5/5] also requires commit 5c91d85f "tests: Add the ability to multiply up test timeouts" to have been cherry-picked from master. No backporting was required for this one, and it only touches test code.
Comment 47 Simon McVittie 2018-02-06 19:25:08 UTC
(In reply to Simon McVittie from comment #46)
> This is pretty trivial, so I'm going to just push it if Travis-CI has
> succeeded

It didn't. It looks like the timeouts (or thread counts) in this test are too optimistic for Travis-CI: I've seen these method calls taking up to 17 seconds out of the 10 seconds allowed.

We might have to skip this test by default, or set a stupidly long timeout, or something :-(
Comment 48 Simon McVittie 2018-02-06 19:25:55 UTC
(In reply to Simon McVittie from comment #46)
> [5/5] also requires commit 5c91d85f "tests: Add the ability to multiply up
> test timeouts" to have been cherry-picked from master. No backporting was
> required for this one, and it only touches test code.

For ye olde GLib on Travis-CI, it also needs our backport of g_steal_pointer().
Comment 49 Simon McVittie 2018-02-06 21:30:33 UTC
(In reply to Simon McVittie from comment #48)
> For ye olde GLib on Travis-CI, it also needs our backport of
> g_steal_pointer().

I used 86e0416d.

(In reply to Simon McVittie from comment #47)
> It looks like the timeouts (or thread counts) in this test are
> too optimistic for Travis-CI: I've seen these method calls taking up to 17
> seconds out of the 10 seconds allowed.
> 
> We might have to skip this test by default, or set a stupidly long timeout,
> or something :-(

I tried out these mitigations:

* set a 60s timeout, but fail if it takes >= 30s (was 30/10)
* use MAX(4,n) threads for quick tests, where n is g_get_num_processors()
  (was 4)
* use MAX(20, n*5) threads for slow tests (was 20)
* only do 200 calls for quick tests (was 1000)

but they weren't sufficient, so I'm not confident that it will work reliably on autobuilders. I think I'll have to turn this into a manual test, or skip it entirely unless run with -m slow.
Comment 50 Simon McVittie 2018-02-06 22:52:33 UTC
Created attachment 137200 [details] [review]
[5/5 v4] Add new test for waiting on pending calls in threads

Based on code contributed by Manish Narang. This is not included in the
automated test suite, because it isn't reliable on heavily-loaded
automatic test infrastructure like Travis-CI.

Reviewed-by: Philip Withnall <withnall@endlessm.com>
[smcv: Add the test to the CMake build system too, as requested]
[smcv: Convert into a manual test]
Signed-off-by: Simon McVittie <smcv@collabora.com>
Bug: https://bugs.freedesktop.org/show_bug.cgi?id=102839

---

Again, I'll probably insta-commit this at some point if not vetoed.
Comment 51 Philip Withnall 2018-02-08 12:21:08 UTC
Comment on attachment 137200 [details] [review]
[5/5 v4] Add new test for waiting on pending calls in threads

Review of attachment 137200 [details] [review]:
-----------------------------------------------------------------

r+
Comment 52 Simon McVittie 2018-02-12 10:48:05 UTC
This was fixed in 1.12.4 and 1.13.0.

