Bug 89297 - reference count for DBusCounter is not thread safe (and should be)
Summary: reference count for DBusCounter is not thread safe (and should be)
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: git master
Hardware: ARM Linux (All)
: medium normal
Assignee: D-Bus Maintainers
QA Contact: D-Bus Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-24 10:26 UTC by Adrian Szyndela
Modified: 2015-05-05 13:23 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
A patch for making DBusCounter refcount atomic. (1.27 KB, text/plain)
2015-02-24 10:26 UTC, Adrian Szyndela
Details
A patch that uses mutex to protect DBusCounter data (3.80 KB, patch)
2015-03-26 15:30 UTC, Adrian Szyndela
Details | Splinter Review
test case (773 bytes, text/plain)
2015-03-27 19:34 UTC, Ralf Habacker
Details
test program in C (1.19 KB, text/plain)
2015-03-30 10:21 UTC, Adrian Szyndela
Details
A patch that uses mutex to protect DBusCounter data with corrected indentation and clarified precedence (3.81 KB, patch)
2015-04-13 14:07 UTC, Adrian Szyndela
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Adrian Szyndela 2015-02-24 10:26:30 UTC
Created attachment 113787 [details]
A patch for making DBusCounter refcount atomic.

With a simple test program I can generate a crash in libdbus.
The program:
1. creates a connection;
2. creates a few threads that share the connection;
3. then, threads perform method calls over the connection.

At some point, the program gets into a following state:
(gdb) info threads
  Id   Target Id         Frame
  2   Thread 0xb4fdd460 (LWP 3073) "tt" _dbus_counter_unref
(counter=0x117d8) at dbus-resources.c:128
* 1    Thread 0xb5bdd460 (LWP 3067) "tt" _dbus_counter_ref
(counter=counter@entry=0x117d8) at dbus-resources.c:112

Call stacks:
Thread 2:
_dbus_counter_unref (counter=0x117d8)
dbus_message_cache_or_finalize (message=0xb4c004d0)
dbus_message_unref (message=0xb4c004d0)
...

Thread 1:
_dbus_counter_ref (counter=counter@entry=0x117d8)
_dbus_message_add_counter (message=message@entry=0xb4700548, counter=0x117d8)
_dbus_transport_queue_messages (...)
< ... transport functions ... >
_dbus_transport_do_iteration (...)
_dbus_connection_do_iteration_unlocked (...)
_dbus_connection_block_pending_call (...)
dbus_pending_call_block (...)
dbus_connection_send_with_reply_and_block (...)

Two threads access same counter. One of them wants to add a reference, the other wants to delete a reference.

There are two messages involved, but both point to the same counter
(taken from connection->transport->live_messages).

The rer/unref functions on an ARM are compiled to:
Dump of assembler code for function _dbus_counter_ref:
         counter->refcount += 1;
<+0>:    ldr    r2, [r0, #0]         ; read
<+2>:    adds    r2, #1              ; modify
<+4>:    str    r2, [r0, #0]         ; store
...

Dump of assembler code for function _dbus_counter_unref:
         counter->refcount -= 1;
<+0>:    ldr    r3, [r0, #0]         ; read
<+2>:    subs    r3, #1              ; modify
<+4>:    str    r3, [r0, #0]         ; store
...

The crash appears when context is switched in "unrefing" thread after
reading, but before storing. 

This leads to too low reference count value. Then, after some unrefs, the counter is freed, while the pointer is still in connection->transport->live_messages. After a new message is received this is used again, and when the message is being freed, freed counter memory is accessed. It eventually ends with a crash.

Attached is a patch, that makes reference count in DBusCounter atomic.
Comment 1 Adrian Szyndela 2015-03-26 15:30:41 UTC
Created attachment 114648 [details] [review]
A patch that uses mutex to protect DBusCounter data

It seems using a mutex is better than using atomic, because it protects whole object instead of protecting only reference count.
Comment 2 Ralf Habacker 2015-03-27 19:34:53 UTC
Created attachment 114668 [details]
test case

(In reply to Adrian Szyndela from comment #0)
> Created attachment 113787 [details]
> A patch for making DBusCounter refcount atomic.
> 
> With a simple test program I can generate a crash in libdbus.
> The program:
> 1. creates a connection;
> 2. creates a few threads that share the connection;
> 3. then, threads perform method calls over the connection.
I tried to verify that the patch is working with the appended test case (written in python) on linux, because I do not have an ARM based system.
The test case freezes with or without the patch. Not sure, if it matches this bug.
Comment 3 Adrian Szyndela 2015-03-30 10:21:29 UTC
Created attachment 114726 [details]
test program in C

I attach a test program in C that I used.
Comment 4 Ralf Habacker 2015-04-13 12:30:58 UTC
(In reply to Adrian Szyndela from comment #3)
> Created attachment 114726 [details]
> test program in C
> 
> I attach a test program in C that I used.

I did run the test app on linux without any reproducable problem on linux. How did you see the bug condition ? Do you run the app, wait for a visible error condition (please describe how do you detect the error case) and then attached a debugger or something else ?
Comment 5 Adrian Szyndela 2015-04-13 13:12:03 UTC
First, it appeared during massive automatic tests (lots of devices for many hours). It was not really reproducible in controlled environment in reasonable time. All I had was a core dump. Well, lots of core dumps.

To generate the bug with the attached program I ran it under gdb in non-stop mode (https://sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html). I put breakpoints in _dbus_counter_unref() and _dbus_counter_ref(). Then, by continuing selected threads manually I was able to get the program to the state, where one thread was refing a counter, and another thread was unrefing it, as in bug description (there are copied&pasted values from actual gdb session).

So, maybe there is a way to modify the test program to "see" the bug without gdb in reasonable time, but I was already pretty sure of its existence at this point and didn't feel the need to push it further.
Comment 6 Ralf Habacker 2015-04-13 13:14:52 UTC
@simon: any hints how to proceed ?
Comment 7 Simon McVittie 2015-04-13 13:50:29 UTC
Comment on attachment 114648 [details] [review]
A patch that uses mutex to protect DBusCounter data

Review of attachment 114648 [details] [review]:
-----------------------------------------------------------------

The overall problem here is that DBusCounter is sort of vaguely linked to a DBusConnection, but is not actually guaranteed to be protected by that connection's mutex; and a DBusMessage can carry a reference to the DBusCounter, resulting in freeing that DBusMessage having an effect on the DBusCounter.

So the weird thing that you're doing, which other libdbus users have apparently not done, seems to be:

* process a DBusConnection in one or more threads
* get a DBusMessage from that DBusConnection
* keep a reference to the message
* free the DBusMessage later, when another thread happens to be processing the DBusConnection

This is a facet of the general problem that libdbus is not very well-designed for multi-threaded use: it tries to support every possible use-case (the most-general, lowest-performance model) and as a result it supports them all poorly. A more opinionated design (for instance "each DBusConnection belongs to one thread, all other uses are invalid") would not have had this problem.

Potential fixes:

* this
* make the refcount atomic instead
* re-document libdbus as "was never actually thread safe"

The last one is unfortunately not really feasible, as much as I would like it to be.

Making the refcount atomic would involve less messing about with mutexes, so it is superficially attractive. However, the potential problem with that approach is that it does not protect the notify function. Is there a valid situation in which one thread tries to notify while another thread is changing the notify function?

The call to _dbus_counter_set_notify() in _dbus_transport_init_base() is on a newly allocated counter and nobody can be holding a ref to it yet, so that's safe.

The call in _dbus_transport_finalize_base() is, unfortunately, probably not protected by any specific lock.

The call in _dbus_transport_set_max_message_size() is under the DBusConnection's lock, but that's no good, because _dbus_counter_notify() isn't.

So unfortunately making the reference count atomic would not work either, leaving this as the only valid solution without more extensive design changes.

::: dbus/dbus-resources.c
@@ +99,5 @@
> +  _dbus_rmutex_new_at_location (&counter->mutex);
> +  if (counter->mutex == NULL)
> +  {
> +    dbus_free (counter);
> +	counter = NULL;

(Minor: something isn't right with the indentation here. Please indent new code with spaces, not "hard tabs"; if there are hard tabs in the source already, they should always be treated as advancing the cursor to the next multiple of 8 spaces, like ts=8 in vim.)

@@ +142,4 @@
>    _dbus_assert (counter->refcount > 0);
>  
>    counter->refcount -= 1;
> +  last_ref = counter->refcount == 0;

minor: "last_ref = (counter->refcount == 0);" please, to clarify precedence

@@ +213,5 @@
>    if (counter->notify_pending)
>      {
>        counter->notify_pending = FALSE;
> +	  notify_function = counter->notify_function;
> +	  notify_data = counter->notify_data;

(Minor: something isn't right with the indentation here, as above.)
Comment 8 Adrian Szyndela 2015-04-13 14:07:32 UTC
Created attachment 115058 [details] [review]
A patch that uses mutex to protect DBusCounter data with corrected indentation and clarified precedence
Comment 9 Simon McVittie 2015-04-13 16:31:57 UTC
I'll give Thiago a chance to respond since he seemed to have strong opinions on this, but if there's no veto I'll test and probably apply that patch. Thanks!
Comment 10 Ralf Habacker 2015-04-20 19:03:43 UTC
(In reply to Simon McVittie from comment #7)

> So the weird thing that you're doing, which other libdbus users have
> apparently not done, seems to be:
> 
> * process a DBusConnection in one or more threads
> * get a DBusMessage from that DBusConnection
> * keep a reference to the message
> * free the DBusMessage later, when another thread happens to be processing
> the DBusConnection

would be test-relay the best base for creating a related test case or something else ?
Comment 11 Simon McVittie 2015-04-21 08:50:59 UTC
(In reply to Ralf Habacker from comment #10)
> would be test-relay the best base for creating a related test case or
> something else ?

The structure of test-loopback, where the program just "talks to itself", would probably be sufficient: test-relay is a bit more complicated (4 connections instead of 2) which I don't think is actually necessary here.

I think this would be best as a separate test executable: it will probably need to be fairly abusive (lots of threads, lots of iterations, arbitrary delays) in order to reproduce the bug.
Comment 12 Simon McVittie 2015-05-05 13:13:47 UTC
Fixed in git for 1.8.18 and 1.9.16. If someone wants to add a test-case that would be great, but I'm not going to wait for it.
Comment 13 Ralf Habacker 2015-05-05 13:23:15 UTC
(In reply to Simon McVittie from comment #12)
> Fixed in git for 1.8.18 and 1.9.16. If someone wants to add a test-case that
> would be great, but I'm not going to wait for it.

I tried several variants of the appended test case running countless hours, but was not able to reproduce this problem on a linux x64_86 system. The only catch I got has been reported with bug 90316, which may or may not be related.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.