Bug 34393 - reaching incoming message limit can cause locking violations
Summary: reaching incoming message limit can cause locking violations
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: 1.4.x
Hardware: Other All
: medium major
Assignee: Simon McVittie
QA Contact: John (J5) Palmieri
URL: http://git.collabora.co.uk/?p=user/sm...
Whiteboard:
Keywords: patch
Depends on: 34570
Blocks: dbus-1.5
  Show dependency treegraph
 
Reported: 2011-02-17 06:41 UTC by Simon McVittie
Modified: 2011-07-28 10:35 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Add a hackish script to reproduce fd.o #34393 (4.33 KB, patch)
2011-02-25 10:21 UTC, Simon McVittie
Details | Splinter Review
dbus_bus_register: don't unref the messages with the lock held (2.15 KB, patch)
2011-02-25 10:22 UTC, Simon McVittie
Details | Splinter Review
_dbus_connection_message_sent: rename to _unlocked (3.07 KB, patch)
2011-02-25 10:22 UTC, Simon McVittie
Details | Splinter Review
Add _dbus_counter_notify and call it after every adjustment (5.18 KB, patch)
2011-02-25 10:23 UTC, Simon McVittie
Details | Splinter Review
Comment some places where it's OK to unref a message despite holding locks (1.59 KB, patch)
2011-02-25 10:23 UTC, Simon McVittie
Details | Splinter Review
When attaching counters to messages, don't automatically notify callbacks (4.20 KB, patch)
2011-02-25 10:23 UTC, Simon McVittie
Details | Splinter Review
Don't inline the contents of _dbus_connection_unlock (1.60 KB, patch)
2011-02-25 10:24 UTC, Simon McVittie
Details | Splinter Review
Don't finalize sent or dispatched messages while under the connection lock (6.59 KB, patch)
2011-02-25 10:24 UTC, Simon McVittie
Details | Splinter Review
_dbus_connection_peer_filter_unlocked_no_update: delay freeing reply (4.56 KB, patch)
2011-02-25 10:25 UTC, Simon McVittie
Details | Splinter Review
dbus_connection_dispatch: avoid freeing UnknownMethod reply until we unlock (2.25 KB, patch)
2011-02-25 10:25 UTC, Simon McVittie
Details | Splinter Review
Remove the per-connection link cache (4.51 KB, patch)
2011-03-07 09:55 UTC, Simon McVittie
Details | Splinter Review
Add a regression test that can reproduce fd.o #34393 (9.83 KB, patch)
2011-03-10 10:06 UTC, Simon McVittie
Details | Splinter Review
Run some trivial integration tests on the installed dbus binaries (1.38 KB, patch)
2011-03-10 10:47 UTC, Simon McVittie
Details | Splinter Review
Add a regression test that can reproduce fd.o #34393 (4.31 KB, patch)
2011-03-11 05:45 UTC, Simon McVittie
Details | Splinter Review
Remove the per-connection link cache (v2) (4.63 KB, patch)
2011-04-29 07:15 UTC, Simon McVittie
Details | Splinter Review
_dbus_message_remove_counter: remove ability to return the link (3.31 KB, patch)
2011-04-29 08:07 UTC, Simon McVittie
Details | Splinter Review
Add a regression test that can reproduce fd.o #34393 (v2) (5.29 KB, patch)
2011-06-22 05:55 UTC, Simon McVittie
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Simon McVittie 2011-02-17 06:41:13 UTC
To reproduce:

- have dbus-echo, dbus-spam from Bug #34140

- --enable-asserts, so we check for locking violations

- launch a dbus-daemon with <limit name="max_incoming_bytes">1</limit> [1]
  and attach gdb

- ./tools/dbus-echo --name=com.example.Echo &

- ./tools/dbus-spam --dest=com.example.Echo --count=N --flood
  where N is a large number

This sometimes leads to the following locking violation (it seems to be more likely for large N, 8192 is typically enough):

- connection C1's read watch is currently disabled because it has too many
  live messages (in this setup, any message will exceed the limit)

- a message M is dispatched from sending connection C1 to receiving
  connection C2

- _dbus_connection_message_sent is called on C2, with C2's lock held;
  it unrefs the message

- this is the last ref, so dbus_message_cache_or_finalize is called

- all of the DBusCounter instances for the message are removed, subtracting
  the message's size from their values

- notably, the counter for C1's transport is decremented past the limit on
  the size of live messages

- C1's transport toggles its read watch to enabled

- C1's lock isn't currently held - assertion failure

[1] This actually makes sense, because dbus-daemon allows the limit to be exceeded by up to one message or one 2048-byte read, whichever is greater; so setting a 1-byte limit basically means "don't read a second message until you've fully dispatched the first one"

#0  0x00007f288d2ba165 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f288d2bcf70 in abort () at abort.c:92
#2  0x00007f288e31b195 in _dbus_abort () at ../../dbus/dbus-sysdeps.c:94
#3  0x00007f288e3019f1 in _dbus_real_assert (condition=<value optimized out>, condition_text=0x7f288e33da38 "(connection)->have_connection_lock", file=0x7f288e33d5c7 "../../dbus/dbus-connection.c", line=680, func=<value optimized out>) at ../../dbus/dbus-internals.c:902
#4  0x00007f288e2d3981 in protected_change_watch (connection=0x7f288c426600, watch=0x7f288c429940, add_function=0, remove_function=0, toggle_function=0x7f288e2fd460 <_dbus_watch_list_toggle_watch>, enabled=7) at ../../dbus/dbus-connection.c:680
#5  0x00007f288e2fb2c5 in check_read_watch (transport=0x7f288c409200) at ../../dbus/dbus-transport-socket.c:227
#6  0x00007f288e2f93ee in live_messages_notify (counter=<value optimized out>, user_data=<value optimized out>) at ../../dbus/dbus-transport.c:79
#7  0x00007f288e2ea3f0 in free_counter (element=0xc47, data=0x7f288c42a600) at ../../dbus/dbus-message.c:597
#8  0x00007f288e2e976b in dbus_message_cache_or_finalize (message=0x7f288c42a600) at ../../dbus/dbus-message.c:624
#9  dbus_message_unref (message=0x7f288c42a600) at ../../dbus/dbus-message.c:1592
#10 0x00007f288e2da5ed in _dbus_connection_message_sent (connection=0x7f288c422280, message=0x7f288c42a600) at ../../dbus/dbus-connection.c:656
#11 0x00007f288e2fc335 in do_writing (transport=0x7f288c409600) at ../../dbus/dbus-transport-socket.c:680
#12 0x00007f288e2fcd48 in socket_handle_watch (transport=0x7f288c409600, watch=0x7f288c429f00, flags=2) at ../../dbus/dbus-transport-socket.c:953
#13 0x00007f288e2f9692 in _dbus_transport_handle_watch (transport=0x7f288c409600, watch=0x7f288c429f00, condition=2) at ../../dbus/dbus-transport.c:891
#14 0x00007f288e2dd0e0 in _dbus_connection_handle_watch (watch=0x7f288c429f00, condition=2, data=0x7f288c422280) at ../../dbus/dbus-connection.c:1509
#15 0x00007f288e2fd28a in dbus_watch_handle (watch=0x7f288c429f00, flags=2) at ../../dbus/dbus-watch.c:683
#16 0x00007f288e31c0a6 in _dbus_loop_iterate (loop=0x7f288c40f040, block=<value optimized out>) at ../../dbus/dbus-mainloop.c:820
#17 0x00007f288e31c17d in _dbus_loop_run (loop=0x7f288c40f040) at ../../dbus/dbus-mainloop.c:893
#18 0x00007f288e2d0e1f in main (argc=<value optimized out>, argv=<value optimized out>) at ../../bus/main.c:520
Comment 1 Simon McVittie 2011-02-25 10:21:23 UTC
Here's a branch. This depends on Bug #34140 for the test script.
Comment 2 Simon McVittie 2011-02-25 10:21:52 UTC
Created attachment 43814 [details] [review]
Add a hackish script to reproduce fd.o #34393

The number of messages is arbitrary; the more messages, the more likely
the crash is. 2000 messages seem to cause it reliably, on this laptop.

The script may have race conditions (there are a couple of hard-coded sleep
calls), so I haven't set it up to be run automatically.
Comment 3 Simon McVittie 2011-02-25 10:22:18 UTC
Created attachment 43815 [details] [review]
dbus_bus_register: don't unref the messages with the lock held

Finalizing a message can call out to user code via dbus_message_set_data
(or to internal code not expecting locks to be held, via DBusCounter).
Comment 4 Simon McVittie 2011-02-25 10:22:35 UTC
Created attachment 43816 [details] [review]
_dbus_connection_message_sent: rename to _unlocked

It's called with the connection's lock held.
Comment 5 Simon McVittie 2011-02-25 10:23:16 UTC
Created attachment 43817 [details] [review]
Add _dbus_counter_notify and call it after every adjustment

When fd-passing is implemented, adjustments happen in pairs; in that case
we coalesce the two calls into one.
Comment 6 Simon McVittie 2011-02-25 10:23:32 UTC
Created attachment 43818 [details] [review]
Comment some places where it's OK to unref a message despite holding locks

In general, dbus_message_unref should be avoided while holding locks,
because it can invoke arbitrary user callbacks (via attached data, or
via DBusCounter).
Comment 7 Simon McVittie 2011-02-25 10:23:48 UTC
Created attachment 43819 [details] [review]
When attaching counters to messages, don't automatically notify callbacks

In all the places where counters are added, we're under a lock. The caller
knows what effect adding the counter might have, and can replicate it
in a lock-safe way if necessary.
Comment 8 Simon McVittie 2011-02-25 10:24:07 UTC
Created attachment 43820 [details] [review]
Don't inline the contents of _dbus_connection_unlock

It's about to become more complex, to handle delayed deallocation of
messages in order to avoid triggering callbacks while locked.
Comment 9 Simon McVittie 2011-02-25 10:24:52 UTC
Created attachment 43821 [details] [review]
Don't finalize sent or dispatched messages while under the connection lock

Finalizing a message can trigger callbacks; that's bad, if we have a
connection locked.
 
In particular, if a message is received by the "left side", passed to
the "right side" and sent (as in test/relay.c (see the diagram there)
or in dbus-daemon), then finalizing that message could result in the
live messages counter for the left side, and the outgoing messages counter
for the right side, both being decremented while under either side's
lock.

After a message is dispatched on the left side, finalizing it now drops         the lock temporarily, to avoid this problem.
   
After a message is sent on the right side, finalizing it is now deferred
until the right side unlocks, by moving it to a new queue of
"expired messages" which is automatically cleared every time we release
the lock.

The "live messages" counter for the "left" connection will now explicitly
take the left connection's lock before decrementing, to avoid
manipulating watches without a lock.

---

The comment about relay.c refers to Bug #34570.
Comment 10 Simon McVittie 2011-02-25 10:25:10 UTC
Created attachment 43822 [details] [review]
_dbus_connection_peer_filter_unlocked_no_update: delay freeing reply

Finalizing the reply could conceivably call callbacks, so wait til we
unlock.
Comment 11 Simon McVittie 2011-02-25 10:25:27 UTC
Created attachment 43823 [details] [review]
dbus_connection_dispatch: avoid freeing UnknownMethod reply until we unlock
Comment 12 Simon McVittie 2011-02-25 10:25:56 UTC
... and that's all for now.
Comment 13 Simon McVittie 2011-03-07 09:55:06 UTC
Created attachment 44207 [details] [review]
Remove the per-connection link cache

With fd.o#34393 fixed, retaking the lock to cache unused links
significantly adds to locking overhead (-18% throughput in a synthetic
benchmark on an ARM device). The cache is also unlimited in size, and
probably contributes to memory growth and fragmentation by not being
under the system malloc's control.

Fixing fd.o #34393 [with the sequence of patches already attached here],
but also dropping this cache entirely, turns out to lead to a 5% increase
in throughput on the same synthetic benchmark.
Comment 14 Cosimo Alfarano 2011-03-10 05:54:10 UTC
Review of attachment 43815 [details] [review]:

R+ for me, but no dev-hat on.
Comment 15 Simon McVittie 2011-03-10 10:06:09 UTC
Created attachment 44314 [details] [review]
Add a regression test that can reproduce fd.o #34393

The number of messages is arbitrary; the more messages, the more likely
the crash is. 2000 messages seem to cause it reliably, on this laptop.

This commit depends on the branch from Bug #34570, for infrastructure. The actual bugfix does not depend on that branch.

It should also be possible to run this test as a standalone binary on any system with dbus-daemon, libdbus and dbus-glib, by setting TEST_DBUS_DAEMON to the path of the installed dbus-daemon and TEST_CONFIG_FILES to a directory containing a copy of incoming-limit.conf.
Comment 16 Simon McVittie 2011-03-10 10:47:56 UTC
Created attachment 44317 [details] [review]
Run some trivial integration tests on the installed dbus binaries

This is really part of Bug #34570, except that the addition of installcheck_environment is part of Attachment #44314 [details]; it could be rebased if people are feeling particularly pedantic, but tbh I'd prefer to just get both branches merged.
Comment 17 Simon McVittie 2011-03-11 05:45:16 UTC
Created attachment 44353 [details] [review]
Add a regression test that can reproduce fd.o #34393

The number of messages is arbitrary; the more messages, the more likely
the crash is. 2000 messages seem to cause it reliably on this laptop,
but I've set it to 10000 to be safe.

---

This obsoletes Attachment #44314 [details] using infrastructure I added to Bug #34570
(which obsoletes Attachment #44317 [details]).
Comment 18 Simon McVittie 2011-03-21 09:55:42 UTC
Review of attachment 43821 [details] [review]:

Looking back at my own patches after a few weeks...

::: dbus/dbus-connection.c
@@ +4836,3 @@
+       * We have a reference to the connection, and we don't use any cached
+       * pointers to the connection's internals below this point, so it should
+       * be safe to drop the lock and take it back. */

Whoever reviews this: please pay particular attention to what happens in this function, it looks like the part most likely to be wrong.

::: dbus/dbus-transport.c
@@ +82,3 @@
+      (* transport->vtable->live_messages_changed) (transport);
+      _dbus_connection_unlock (transport->connection);
+    }

Possibly this should be reversed to:

    lock
    if (l_m_c)
      l_m_c ()
    unlock

but neither the vtable pointer nor its contents should change between a transport object's creation and finalization, and we're holding a ref, so I think the patch is safe.
Comment 19 Simon McVittie 2011-04-29 07:15:56 UTC
Created attachment 46155 [details] [review]
Remove the per-connection link cache (v2)

Simplified to avoid pointless and slightly confusing reuse of a variable.
Comment 20 Colin Walters 2011-04-29 07:55:01 UTC
Review of attachment 46155 [details] [review]:

Minor comments; otherwise great - I love patches with benchmark references.

::: dbus/dbus-connection.c
@@ +394,2 @@
       iter != NULL;
+      iter = _dbus_list_pop_first_link (&expired_messages))

while ((iter = _dbus_list_pop_first_link (&expired_messages)) != NULL)

@@ +657,3 @@
   /* It's OK that in principle we call the notify function, because for the
    * outgoing limit, there isn't one */
+  _dbus_message_remove_counter (message, connection->outgoing_counter, NULL);

Might as well remove the link_return argument to  _dbus_message_remove_counter too.
Comment 21 Simon McVittie 2011-04-29 08:06:28 UTC
(In reply to comment #20)
> while ((iter = _dbus_list_pop_first_link (&expired_messages)) != NULL)

I prefer not to do assignments inside conditionals if I can avoid it... but, if you insist, I can use that format.

The benchmark is only valid after the rest of the branch that's attached here: the link cache is probably a net performance win in current dbus, but with this bug fixed, the extra lock-retaking step that becomes required makes it considerably more expensive. Any chance you could review the bugfix?

I'd also really appreciate review of Bug #34570 so we have a framework for improving "realistic" (public-API-only) test coverage, like the test that this branch uses; eventually I'd like to have most of our tests be small and self-contained like that one.

> @@ +657,3 @@
>    /* It's OK that in principle we call the notify function, because for the
>     * outgoing limit, there isn't one */
> +  _dbus_message_remove_counter (message, connection->outgoing_counter, NULL);
> 
> Might as well remove the link_return argument to  _dbus_message_remove_counter
> too.

Well spotted, this was its last use. I'll attach a patch to simplify it.
Comment 22 Simon McVittie 2011-04-29 08:07:56 UTC
Created attachment 46157 [details] [review]
_dbus_message_remove_counter: remove ability to return the link
Comment 23 Simon McVittie 2011-06-22 05:53:58 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > while ((iter = _dbus_list_pop_first_link (&expired_messages)) != NULL)
> 
> I prefer not to do assignments inside conditionals if I can avoid it...
> but, if you insist, I can use that format.

Do you insist or not? I tend to think assignments in conditionals violate least-astonishment.

> Any chance you could review the bugfix?

That, please? Also, should it be targeting dbus-1.4 or master? (The branch is based on dbus-1.4, but perhaps master is more appropriate for this sort of intrusive bugfix.)

> I'd also really appreciate review of Bug #34570 so we have a framework for
> improving "realistic" (public-API-only) test coverage, like the test that this
> branch uses; eventually I'd like to have most of our tests be small and
> self-contained like that one.

wjt reviewed this, so I needed to rebase the test-case a bit, I'll attach the new version here.

> Well spotted, this was its last use. I'll attach a patch to simplify it.

Comment #22, Attachment #46157 [details].
Comment 24 Simon McVittie 2011-06-22 05:55:22 UTC
Created attachment 48281 [details] [review]
Add a regression test that can reproduce fd.o #34393 (v2)

Same as Attachment #44353 [details], but adapted for changes made in dbus-1.4 since that test was written (installable tests, etc.)
Comment 25 Simon McVittie 2011-07-28 10:35:59 UTC
Fixed in git for 1.5.6. I think this is too big for 1.4.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct.