11454 – no way to exit a dbus server without creating a race condition

Bug 11454 - no way to exit a dbus server without creating a race condition

Summary: no way to exit a dbus server without creating a race condition

Status:	RESOLVED MOVED

Alias:	None

Product:	dbus
Classification:	Unclassified
Component:	core (show other bugs)
Version:	1.5
Hardware:	All All

Importance:	medium normal
Assignee:	D-Bus Maintainers
QA Contact:	D-Bus Maintainers

URL:
Whiteboard:
Keywords:

Duplicates (1):	22258 (view as bug list)
Depends on:
Blocks:

Reported:	2007-07-02 14:19 UTC by Allison Lortie (desrt)
Modified:	2018-10-12 21:04 UTC (History)
CC List:	4 users (show)

See Also:
i915 platform:
i915 features:

Attachments
a server that tries to exit (1.02 KB, text/plain) 2007-07-02 14:23 UTC, Allison Lortie (desrt)	Details
View All

Description Allison Lortie (desrt) 2007-07-02 14:19:40 UTC

i believe that it is sane to have some dbus services exit (for example, after a period of activity).  as far as i can tell (and from some simple testing) there is absolutely no way to do this without creating a race condition.

if a message is delivered to the server just as it is exiting then the message becomes lost forever and the client receives a timeout error.

ideally, for the activation case the server should be reactivated and the message resent.  for the non-activation case, the client should receive the same error message that they receive if the server is not running (since a timeout error indicates that probably the server has become wedged).

see the attached example.

Comment 1 Allison Lortie (desrt) 2007-07-02 14:23:47 UTC

Created attachment 10558 [details]
a server that tries to exit

compile this and create an activation file for it

[D-BUS Service]
Name=ca.desrt.Ping
Exec=/whatever/you/like/ping-server

then run this command twice quickly:

dbus-send --session --dest=ca.desrt.Ping --print-reply=yes /ca/desrt/Ping ca.desrt.Ping.Foo

the first one activates the server and receives the reply normally.  then the server quits (and takes it time doing so).

for the second attempt you get:
Error org.freedesktop.DBus.Error.NoReply: Message did not receive a reply (timeout by message bus)

as far as i know, the code here is more or less the "best" i can do.  note that disconnecting the session bus is considered an error (and the library will even prevent you from doing it).  also note that the location of the sleep() doesn't really matter.  suggesting "exit quickly after flushing" doesn't fix the race... it just narrows the window a bit.

Comment 2 Havoc Pennington 2007-07-02 16:44:22 UTC

I suppose the bus daemon could keep the method call message until a reply to it is received, since it already tracks whether a reply is pending in order to do the timeout; and if the recipient disconnects before sending a reply, resend the method call (possibly causing auto-activation).

Problems with this include:

1) the service may have processed the method call, and simply failed to send a reply. For a method call with side effects, in other words, resending it could be considered bad.

The corresponding current problem is that if you get a timeout, there's no way to know whether the server processed the method call already before it decided not to reply. i.e. if you get a timeout, you don't know whether to resend.

problem 1) stated succintly is that the bus daemon doesn't know whether to resend either.

2) resending would necessarily either reorder messages, or force blocking for a reply in between messages. That is, the bus daemon has to either keep the message to resend, continue to write out other messages, then resend if needed; or the bus daemon has to wait to write out any other messages until it gets a reply, so it knows there will be no need to resend. Reordering results in weird semantics, while blocking for replies is a major performance hit.

Overall, the current dbus rule is that if you need reliable delivery, you must wait for a reply (and handle it if the reply is an error). Certainly over TCP or when no bus daemon is involved, there is no way around this no matter what dbus does.

For a bus daemon and unix domain sockets, perhaps we can do better, since really the only possible error is that the app you are talking to crashes or exits. Perhaps the bus daemon could help automate handling error replies in this case, e.g. by automatically resending if no reply is received. However, the key question is whether it is possible to "do the right thing" automatically; do all apps need the same semantics or not.

Surveying a bunch of existing dbus APIs to see what resending their method calls would do might be one way to get data on this.

Another approach might be to add an AUTO_RESEND message flag, so apps that want reliable delivery but don't care about the reordering or multiple-delivery dangers can set that flag.

Hmm. Now that I wrote all of the above, here's another answer for your specific case: drop the well-known bus name *first*, before exiting. The reply to ReleaseName will be in your DBusConnection queue *after* all messages sent to the well-known name, i.e. if you dispatch the DBusConnection in order, once you see the ReleaseName reply, you will know that you can exit without losing any messages. Note that the "block for reply"/DBusPendingCall stuff in DBusConnection will pull the reply out of the queue (effectively reorder the reply), so getting the reply from these functions won't be enough; you would have to manually dispatch the message queue item by item, up to the reply.

I think the above method gives a way to exit without a race, though it is hardly convenient, I think it would be appropriate for dbus bindings to have high-level support for bus name ownership that included this feature.
There may also be a useful libdbus function or two to make it easier to code.

Comment 3 Allison Lortie (desrt) 2007-07-02 17:45:37 UTC

i had considered the method that you gave but it has a problem: a lot of people write dbus services as a method of making sure only one process has access to a shared resource at a time. as such, a very natural "lock" mechanism to use for this shared resource is owning the name on the bus.

if you release the name and then process some extra requests (that were in the queue before you released the name) then you have a problem with possibly multiple instances of the service sharing the resource.

there are two basic solutions to this problem that i see:

1) attack the core problem that packets can be sent to the client that the client won't notice. require acknowledgement on receipt of messages. a message isn't considered by the bus to be "sent" until the client library says "ok, got it!". this probably implies increased numbers of roundtrips but it need not be too bad. merely sending a reply to the original request (in the case of method calls) could count as an acknowledgement.

the best way would probably be something like tagging each request to the client with a sequence. the client would then send an "ok, i got it [sequence #]" in the next outgoing packet to the bus. in the case that the client sends a reply to a method call, this would happen almost for free. in the case that we get back to the dbus core after dispatching to all of the filters/object/fallback handlers and no outgoing traffic was sent we could send a quick extra "ok. got it." packet to the bus. it's probably reasonable to subject the wait for the acknowledgement packet to the normal timeout rules (so that the bus doesn't buffer forever).

or you could just have the client always send "ok. got it." packets immediately when the request is noticed (even before dispatch).

this doesn't imply that anything has to be sent out of sequence or block. you still send all packets normally -- you just don't consider them "sent" until you hear that they have been actually received. in the normal case, you may have a few requests "in the tubes" at a given time. in the case that the service exits, it stops sending "ok, got seq #" packets you'll want to start up the new service instance and send all of those packets (in order) to the new instance.

2) another approach is a bit closer to your solution with some sort of extra locking -- have a two-stage disconnect. the service says to the bus "ok. i want to go away now". the bus stops sending packets to this instance and queues up the remaining packets somewhere. the bus sends a "you may exit now" packet to the service after which absolutely no other packets to the service name are sent. the bus doesn't actually start the new instance (or allow anyone else to have the name) until either

a) the service sends a "service is fully done" packet
-or-
b) the service exits, closing the socket connection
-or-
c) the service reclaims the name itself (in which case the packets are just sent to the service as if nothing happened)

(c) might be nice for the case where the service wants to shut down because it hasn't seen any messages in 5 minutes. if it receives a new message between the "i want to go away now" and the bus saying "ok, you're free to go" then it might decide that it actually wants to stay awake. it's a nice optimisation possibility but obviously not very important.

of the two presented methods, method 1 has the advantage that the server can decide exactly when it wants to exit by merely flushing outgoing traffic and exit()ing. this means that the server can do things like process -exactly- one request and die. i'm not sure why this might be useful but i'm sure i could think of something :)

method 2 has the advantage that it doesn't impose changes to the working of the entire protocol for the purpose of handling an unusual race condition. services might have to process extra requests after they signal their intent to exit.

Comment 4 Havoc Pennington 2007-07-03 07:06:12 UTC

I don't understand how the 'ack packets' in suggestion 1 are different from the current method call replies. The message replies are an acknowledgement of receipt, that's what they do. Signals don't have replies, true, but they also should not be addressed to a particular bus name. Replies also don't have replies, but that's because it would get infinite-loopish to do that. Other than methods, signals, and replies there are no other message types.

There is a solution to the locking problem: use two bus names, one as a resource lock and one as an 'address' for method calls. Again, not convenient, but possible.

Your idea 2) is a possible convenience thing to avoid the extra resource lock; maybe a simple form of it would be something like:
FreezeDeliveryToName(in STRING alreadyOwnedName)

which would allow saying "dispatch no more messages, leave them in the queue" and then you could process the queue up to the FreezeDeliveryToName reply while still holding the bus name, and know there is nothing else in the queue. vs. the suggestion I gave earlier, the significance of this is just that it lets you keep holding the bus name while draining the last of the queue, thus avoiding the need for an extra resource lock (extra resource lock could be outside dbus, or a second bus name).

One implementation approach might be to handle the queuing up of messages as we do when we're auto-starting a service; this would be easier to implement if the freeze were not reversible for the current owner, i.e. the freeze is effectively saying "if you activate a new name owner, don't start delivering to it until I say" - but the bus name can be dropped right away for the current owner - not sure what that method would be called,
FreezeNewNameOwner(in STRING alreadyOwnedName)

So an app could do:
FreezeNewNameOwner("myname")
ReleaseName("myname")
process_all_queued_messages()
UnfreezeNewNameOwner("myname") // optional, disconnect also does it
disconnect / exit

That isn't so bad, and possibly easy to implement as a flag on our pending activation data structure. Potentially easier to implement than a general FreezeDeliveryToName feature.

(Though the implementation may be complicated if the freeze has to pertain to new name owners that the bus doesn't activate, and to a potential series of new name owners instead of just one. These might make FreezeNewNameOwner just as hard to code as general FreezeDeliveryToName, I guess.)

I don't know, would have to look into the code more and mess with it. I think the "use a separate resource lock" workaround is the simplest short-term answer for you, though I'm not opposed to some kind of help provided by the bus as long as it isn't too messy and has solid test coverage for the corner cases.

I would also say, it might be best practice to have a separate resource lock anyway, if you support "--replace" type functionality you almost have to have that. And --replace certainly is convenient.

Comment 5 Allison Lortie (desrt) 2010-06-17 13:54:10 UTC

I was talking to Colin just now and I proposed a solution.  I then came back to this bug to discover that I had proposed (and Havoc had supported) essentially the same solution in 2007.

I'm quite convinced that the way to do this is to have a "Freeze" call that you can call as the owner of the name.  The freeze remains effective as long as you are the owner of the name (ie: until you release it or exit).  If messages are ever queued on an unfrozen name then activation occurs.

We could add a Thaw call too, but this can certainly come later (or not at all).


Meanwhile, I thought of another related race: when doing replacement, you could be notified that you lost your name while you're in the middle of processing a request for that name (and meanwhile the new owner starts doing things too).  Might just be one of those "don't do that" cases.

Comment 6 Havoc Pennington 2010-06-17 18:46:12 UTC

right, my last paragraph in comment 4 was probably about that replacement race.

Re-reading this stuff much after I wrote it, a separate resource lock really does seem more robust and easy to understand.

Something we didn't discuss last time is that this whole discussion assumes that no method on the service assumes anything about previous methods called on the service. i.e. we're assuming there is no setup - nothing where you'd register with the service, or get an object from the service then use that object.
How many services meet this "stateless" criteria?

When a service does NOT meet that criteria, the way clients generally have to be coded is:

* watch for service name
* on seeing service name, do any setup
* on seeing service name go away, do any teardown
* do all method calls - all setup and teardown and in between - with unique bus name, not well-known. when you get a "no such name" error, then stop doing anything and a name lost signal will be coming in to tear you down

I had discussed this model with davidz and iirc the new gdbus stuff supports it nicely.

The "freeze" approach assumes that you don't care which instance of a service handles the message, which assumes the service is completely stateless.
The "resource lock" approach will work even if the service has some state.

Some services are probably better suited for one or the other model. However, a service that "mixed" the models could be almost impossible to use Correctly(tm) because using well-known name would break stateful stuff and using unique name would break stateless stuff.

Further complicating matters, if you're using unique name that will never auto-activate.

There may not be one true way, maybe we just need docs describing some options and considerations, and mechanisms supporting the useful approaches.

Comment 7 Allison Lortie (desrt) 2010-06-18 06:03:06 UTC

There's one more reason that the resource lock is insufficient:

Take dconf as an example.

It watches for signal being emitted by the service (which owns a bus name).  Imagine dconf is just about to exit due to inactivity (and drops its name) as a message comes in...

dconf has to send out a Notify signal for the change that happened as a result of this.  It no longer has ownership of the bus name (and maybe can't get it back), so the Notify signal will not be coming from the owner of the name...

The only way to properly cover this situation presently is with two DBus names:

  - one used for incoming requests (and activation)
  - one used for sending signals

which seems quite odd...

Comment 8 Havoc Pennington 2010-06-18 09:32:05 UTC

The notify signal coming without owning the name is just fine though in the model I described where the apps are using the unique name. Apps would connect to the signal by unique name (and reload the current values) in the "setup" step whenever a new name owner appears.

Maybe the conclusion from this bug is, relying on being stateless and only using the well-known name is kind of a race-fest pain in the ass ;-)

Comment 9 Allison Lortie (desrt) 2010-06-18 13:51:58 UTC

What you're describing is how GDBus used to function.  I complained to David about that because it's inherently racy too: when the service first starts up there is a decent chance it will send its first signal very soon after acquiring its name.

The client that is only adding watches for unique names is now in trouble because it has a very tight window to notice that the bus has given the well-known name to a new connection and register its watch for that before the signal is (immediately) sent.  In practice I guess this would actually always fail...

Comment 10 Havoc Pennington 2010-06-20 15:56:42 UTC

I don't understand that problem. It is always broken to watch for a "state changed" signal without also querying the initial state.

Comment 11 Simon McVittie 2011-01-19 10:27:20 UTC

(In reply to comment #1)
> the first one activates the server and receives the reply normally.  then the
> server quits (and takes it time doing so).
> 
> for the second attempt you get:
> Error org.freedesktop.DBus.Error.NoReply: Message did not receive a reply
> (timeout by message bus)

test/name-test/test-privserver turns out to suffer from a similar race. The client tells it to quit between iterations, but if the client starts another iteration before the server has actually exited (and dbus-daemon has detected it exiting), the test will fail.

Unfortunately, the client seems to win the race more often on the branch where I'm refactoring the main loop for Bug #23194...

Comment 12 Simon McVittie 2011-01-19 10:32:07 UTC

On Bug #22258 (which I'm about to close as a dup), Colin suggested:
> Service activation works quite well for starting services on-demand.  However,
> over time memory usage will continue to go up.  The problem is that if a
> service decides to just call exit() at some arbitrary point in time, there is a
> race condition where a message for it may be in-flight.
> 
> There are several possible in-flight situations.  
> 
> 1) A message is in the queue from a client to the bus, unread by the bus.
> 2) A message is in the bus pending a write to the service
> 3) A message has been written to the service, but not read by the service
> 
> What would be nice is if DBus had a way for a process to explicitly say "hold
> any messages for org.foo.MyService, I'm planning to exit".  
> 
> For situation 1), dbus doesn't yet know about the message, so it says "OK". 
> Then later when it does process the message, if the service has not yet
> relinquished the service name, the bus holds the message until that event
> happens.  If the service has released the name in some way, service activation
> proceeds.
> 
> For situation 2) and 3), dbus returns a reply that says "messages pending,
> don't exit".
> 
> Needs a bit more analysis but it seems plausible.

Comment 13 Simon McVittie 2011-01-19 10:32:24 UTC

*** Bug 22258 has been marked as a duplicate of this bug. ***

Comment 14 Simon McVittie 2011-01-19 10:44:29 UTC

(In reply to comment #12)
> On Bug #22258 (which I'm about to close as a dup), Colin suggested:
[some stuff]

to which Havoc replied:
> This has come up a few times before. I think you just need to 1) release the
> name then 2) process the pending queue.
> 
> The exception is if your app needs an exclusive lock on some resource, like the
> sound card. In that case, one solution is to have a separate lock: which can be
> a second bus name, or an fcntl lock, or whatever else. And the starting-up
> replacement service would wait for this lock before it starts processing the
queue.
> 
> Downsides of doing this with some special feature are:
> * older dbus wouldn't have it anyway
> * it complicates bus names and message routing even more, and nobody
> understands it already
> 
> Two bus names (one for exclusivity, one for message routing) I'm pretty sure
> works perfectly and is pretty straightforward. An app needs both names to begin
> processing the queue, only the "routing" name to receive messages, and only the
> "lock" name to continue processing messages (the routing name has to be owned
> to _begin_ processing but the lock name to continue).
> 
> In your cases:
> 1) and 2) should both go to the new service instance if the old has dropped the
> routing name. The old one processes what remains of the queue, then drops lock
> name. When the new instance gets the lock name, it can begin processing the new
> messages.
> 3) should be addressed if the old instance iterates the dbus main loop source
> until there's nothing to read and no messages left.
> 
> 3) could maybe be enhanced by a call that just says "close() my connection from
> the bus side" which would allow a more deterministic way to read everything
> still in the socket buffer. The bus could only close() once the
> currently-in-process message is written out in full. However, in practice I
> think this is mostly hygiene and Correctness, while just iterating the main
> loop source until it stops and calling close() on the client side probably
> works OK in practice.

Comment 15 Colin Walters 2015-05-14 12:40:24 UTC

Another thread on this:

http://lists.freedesktop.org/archives/dbus/2015-May/016671.html

Executive summary: Using systemd, for system bus services, it's possible to handle a subset of this problem domain, but not fully general DBus services with the userspace bus daemon.

kdbus has this built in though.

Comment 16 GitLab Migration User 2018-10-12 21:04:08 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/dbus/dbus/issues/5.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.