Description
Simon McVittie
2011-02-17 06:41:13 UTC
Here's a branch. This depends on Bug #34140 for the test script. Created attachment 43814 [details] [review] Add a hackish script to reproduce fd.o #34393 The number of messages is arbitrary; the more messages, the more likely the crash is. 2000 messages seem to cause it reliably, on this laptop. The script may have race conditions (there are a couple of hard-coded sleep calls), so I haven't set it up to be run automatically. Created attachment 43815 [details] [review] dbus_bus_register: don't unref the messages with the lock held Finalizing a message can call out to user code via dbus_message_set_data (or to internal code not expecting locks to be held, via DBusCounter). Created attachment 43816 [details] [review] _dbus_connection_message_sent: rename to _unlocked It's called with the connection's lock held. Created attachment 43817 [details] [review] Add _dbus_counter_notify and call it after every adjustment When fd-passing is implemented, adjustments happen in pairs; in that case we coalesce the two calls into one. Created attachment 43818 [details] [review] Comment some places where it's OK to unref a message despite holding locks In general, dbus_message_unref should be avoided while holding locks, because it can invoke arbitrary user callbacks (via attached data, or via DBusCounter). Created attachment 43819 [details] [review] When attaching counters to messages, don't automatically notify callbacks In all the places where counters are added, we're under a lock. The caller knows what effect adding the counter might have, and can replicate it in a lock-safe way if necessary. Created attachment 43820 [details] [review] Don't inline the contents of _dbus_connection_unlock It's about to become more complex, to handle delayed deallocation of messages in order to avoid triggering callbacks while locked. Created attachment 43821 [details] [review] Don't finalize sent or dispatched messages while under the connection lock Finalizing a message can trigger callbacks; that's bad, if we have a connection locked. In particular, if a message is received by the "left side", passed to the "right side" and sent (as in test/relay.c (see the diagram there) or in dbus-daemon), then finalizing that message could result in the live messages counter for the left side, and the outgoing messages counter for the right side, both being decremented while under either side's lock. After a message is dispatched on the left side, finalizing it now drops the lock temporarily, to avoid this problem. After a message is sent on the right side, finalizing it is now deferred until the right side unlocks, by moving it to a new queue of "expired messages" which is automatically cleared every time we release the lock. The "live messages" counter for the "left" connection will now explicitly take the left connection's lock before decrementing, to avoid manipulating watches without a lock. --- The comment about relay.c refers to Bug #34570. Created attachment 43822 [details] [review] _dbus_connection_peer_filter_unlocked_no_update: delay freeing reply Finalizing the reply could conceivably call callbacks, so wait til we unlock. Created attachment 43823 [details] [review] dbus_connection_dispatch: avoid freeing UnknownMethod reply until we unlock ... and that's all for now. Created attachment 44207 [details] [review] Remove the per-connection link cache With fd.o#34393 fixed, retaking the lock to cache unused links significantly adds to locking overhead (-18% throughput in a synthetic benchmark on an ARM device). The cache is also unlimited in size, and probably contributes to memory growth and fragmentation by not being under the system malloc's control. Fixing fd.o #34393 [with the sequence of patches already attached here], but also dropping this cache entirely, turns out to lead to a 5% increase in throughput on the same synthetic benchmark. Review of attachment 43815 [details] [review]: R+ for me, but no dev-hat on. Created attachment 44314 [details] [review] Add a regression test that can reproduce fd.o #34393 The number of messages is arbitrary; the more messages, the more likely the crash is. 2000 messages seem to cause it reliably, on this laptop. This commit depends on the branch from Bug #34570, for infrastructure. The actual bugfix does not depend on that branch. It should also be possible to run this test as a standalone binary on any system with dbus-daemon, libdbus and dbus-glib, by setting TEST_DBUS_DAEMON to the path of the installed dbus-daemon and TEST_CONFIG_FILES to a directory containing a copy of incoming-limit.conf. Created attachment 44317 [details] [review] Run some trivial integration tests on the installed dbus binaries This is really part of Bug #34570, except that the addition of installcheck_environment is part of Attachment #44314 [details]; it could be rebased if people are feeling particularly pedantic, but tbh I'd prefer to just get both branches merged. Created attachment 44353 [details] [review] Add a regression test that can reproduce fd.o #34393 The number of messages is arbitrary; the more messages, the more likely the crash is. 2000 messages seem to cause it reliably on this laptop, but I've set it to 10000 to be safe. --- This obsoletes Attachment #44314 [details] using infrastructure I added to Bug #34570 (which obsoletes Attachment #44317 [details]). Review of attachment 43821 [details] [review]: Looking back at my own patches after a few weeks... ::: dbus/dbus-connection.c @@ +4836,3 @@ + * We have a reference to the connection, and we don't use any cached + * pointers to the connection's internals below this point, so it should + * be safe to drop the lock and take it back. */ Whoever reviews this: please pay particular attention to what happens in this function, it looks like the part most likely to be wrong. ::: dbus/dbus-transport.c @@ +82,3 @@ + (* transport->vtable->live_messages_changed) (transport); + _dbus_connection_unlock (transport->connection); + } Possibly this should be reversed to: lock if (l_m_c) l_m_c () unlock but neither the vtable pointer nor its contents should change between a transport object's creation and finalization, and we're holding a ref, so I think the patch is safe. Created attachment 46155 [details] [review] Remove the per-connection link cache (v2) Simplified to avoid pointless and slightly confusing reuse of a variable. Review of attachment 46155 [details] [review]: Minor comments; otherwise great - I love patches with benchmark references. ::: dbus/dbus-connection.c @@ +394,2 @@ iter != NULL; + iter = _dbus_list_pop_first_link (&expired_messages)) while ((iter = _dbus_list_pop_first_link (&expired_messages)) != NULL) @@ +657,3 @@ /* It's OK that in principle we call the notify function, because for the * outgoing limit, there isn't one */ + _dbus_message_remove_counter (message, connection->outgoing_counter, NULL); Might as well remove the link_return argument to _dbus_message_remove_counter too. (In reply to comment #20) > while ((iter = _dbus_list_pop_first_link (&expired_messages)) != NULL) I prefer not to do assignments inside conditionals if I can avoid it... but, if you insist, I can use that format. The benchmark is only valid after the rest of the branch that's attached here: the link cache is probably a net performance win in current dbus, but with this bug fixed, the extra lock-retaking step that becomes required makes it considerably more expensive. Any chance you could review the bugfix? I'd also really appreciate review of Bug #34570 so we have a framework for improving "realistic" (public-API-only) test coverage, like the test that this branch uses; eventually I'd like to have most of our tests be small and self-contained like that one. > @@ +657,3 @@ > /* It's OK that in principle we call the notify function, because for the > * outgoing limit, there isn't one */ > + _dbus_message_remove_counter (message, connection->outgoing_counter, NULL); > > Might as well remove the link_return argument to _dbus_message_remove_counter > too. Well spotted, this was its last use. I'll attach a patch to simplify it. Created attachment 46157 [details] [review] _dbus_message_remove_counter: remove ability to return the link (In reply to comment #21) > (In reply to comment #20) > > while ((iter = _dbus_list_pop_first_link (&expired_messages)) != NULL) > > I prefer not to do assignments inside conditionals if I can avoid it... > but, if you insist, I can use that format. Do you insist or not? I tend to think assignments in conditionals violate least-astonishment. > Any chance you could review the bugfix? That, please? Also, should it be targeting dbus-1.4 or master? (The branch is based on dbus-1.4, but perhaps master is more appropriate for this sort of intrusive bugfix.) > I'd also really appreciate review of Bug #34570 so we have a framework for > improving "realistic" (public-API-only) test coverage, like the test that this > branch uses; eventually I'd like to have most of our tests be small and > self-contained like that one. wjt reviewed this, so I needed to rebase the test-case a bit, I'll attach the new version here. > Well spotted, this was its last use. I'll attach a patch to simplify it. Comment #22, Attachment #46157 [details]. Created attachment 48281 [details] [review] Add a regression test that can reproduce fd.o #34393 (v2) Same as Attachment #44353 [details], but adapted for changes made in dbus-1.4 since that test was written (installable tests, etc.) Fixed in git for 1.5.6. I think this is too big for 1.4. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.