It seems that D-BUS cannot cope with a situation where several services need to be activated in a short time. This can be replicated every time with a client and three services. The services are not running and the client sends a message to each of the service, but D-BUS daemon fails to start all of the services (sometimes one, sometimes two). Note also how long it takes for the buggy client to execute; for some reason, checking for service existence in end of the program takes a lot of time. There's also a modified client no_bug_client that waits one second after each message. D-BUS works fine with the modified client. kihamala@mylly:~/dbus-test/multi-act-bug$ ps xau | grep bug kihamala 7750 0.0 0.0 1548 468 pts/2 R+ 19:37 0:00 grep bug kihamala@mylly:~/dbus-test/multi-act-bug$ time ./bug_client service bug.svc does not exist service bug.svca does not exist service bug.svcb does not exist message to bug.svc was put to the queue message to bug.svca was put to the queue message to bug.svcb was put to the queue Messages sent and flushed Sleeping for 5 secs.. going to the main loop going to the main loop Error when connecting to the session bus: No reply within specified time service bug.svc exists service bug.svca does not exist service bug.svcb exists I'm still connected real 0m25.553s user 0m0.002s sys 0m0.002s kihamala@mylly:~/dbus-test/multi-act-bug$ ps xau | grep bug kihamala 7755 0.0 0.0 2560 836 pts/2 S 19:38 0:00 /home/kihamala/ dbus-test/multi-act-bug/bug_svc kihamala 7759 0.0 0.0 2560 836 pts/2 S 19:38 0:00 /home/kihamala/ dbus-test/multi-act-bug/bug_svcb kihamala 7761 0.0 0.0 1548 468 pts/2 R+ 19:38 0:00 grep bug kihamala@mylly:~/dbus-test/multi-act-bug$ kill 7755 7759 kihamala@mylly:~/dbus-test/multi-act-bug$ time ./no_bug_client service bug.svc does not exist service bug.svca does not exist service bug.svcb does not exist message to bug.svc was put to the queue Sleeping a second going to the main loop message to bug.svca was put to the queue Sleeping a second going to the main loop message to bug.svcb was put to the queue Sleeping a second going to the main loop Messages sent and flushed Sleeping for 5 secs.. service bug.svc exists service bug.svca exists service bug.svcb exists I'm still connected real 0m8.021s user 0m0.001s sys 0m0.001s kihamala@mylly:~/dbus-test/multi-act-bug$ ps xau | grep bug kihamala 7764 0.0 0.0 2560 836 pts/2 S 19:39 0:00 /home/kihamala/ dbus-test/multi-act-bug/bug_svc kihamala 7766 0.0 0.0 2560 836 pts/2 S 19:39 0:00 /home/kihamala/ dbus-test/multi-act-bug/bug_svca kihamala 7768 0.0 0.0 2560 836 pts/2 S 19:39 0:00 /home/kihamala/ dbus-test/multi-act-bug/bug_svcb kihamala 7770 0.0 0.0 1548 468 pts/2 R+ 19:39 0:00 grep bug kihamala@mylly:~/dbus-test/multi-act-bug$ diff bug_client.c no_bug_client.c 62a63,64 > printf("Sleeping a second\n"); > sleep(1); kihamala@mylly:~/dbus-test/multi-act-bug$ dpkg -l | grep dbus ii dbus-1 0.23.2-3 simple interprocess messaging system ii dbus-1-dev 0.23.2-3 simple interprocess messaging system (develo ii dbus-1-utils 0.23.2-3 simple interprocess messaging system (utilit ii dbus-glib-1 0.23.2-3 simple interprocess messaging system (GLib-b ii dbus-glib-1-de 0.23.2-3 simple interprocess messaging system (GLib i
Created attachment 2214 [details] all code, .service files and session.conf used
I've looked into this and have some comments. First, I've reproduced the problem on both the 0.23 branch and HEAD. Second, it doesn't happen always so there seems to be a race condition that causes this. It's actually worse than it first looks, since the bus blocks and no other communication with it is possible after this happens (so I'm raising the severity). The problem seems to be that the babysitter doesn't always get a HUP on the socket to the bus. This causes the waitpit in the bus to block indefinitely (in _dbus_babysitter_unref in dbus-spawn.c). That code first closes the socket from the bus side, then waits for the sitter to exit. The sitter in turn polls the socket and exits on HUP or ERR but that never happens in this case for some reason, most of the time.
(I don't quite get what is the purpose of all that babysitter stuff, would it be simpler if the daemon gets SIGCHLD directly?) I wonder what is the purpose of the write() in babysit_signal_handler(); if it's meant for raising SIGPIPE, why not do it directly with raise(3)?
The babysitter's purpose is to handle the life-cycle of the child before the connection has been set up between the bus and the child. It's needed since the child is double-forked to be detached from the bus.
Interesting, the _dbus_spawn_async_with_babysitter's comment says that sitter_p can be NULL. But it cannot be NULL, because it's dereferenced right in the beginning of the function (*sitter_p = NULL) without checking its value... But if it jams to waitpid() in _dbus_babysitter_unref(), it could mean that the reference counting fails at one point. I.e. the bus thinks that the child is exiting, but it's not, so waitpid() blocks?
My debugging showed that the sitter was still running (and not blocking), but not getting a HUP event on the socket to the bus. The bus closes the socket right before doing waitpid so I don't think it has anything to do with the sitter refcounting.
I think a "stuff to read" flag from poll(), and then a read() that returns 0, should also be treated as a HUP (I haven't looked at the code but maybe it fails to do this)
I actually tried adding a read since I saw that sometimes a read event comes in and isn't handled. It doesn't help this particular case though since the sitter never returns from the poll.
It would be good to get this failure into dbus-spawn.c:_dbus_spawn_test() (both for a simpler debug case, and to avoid reintroducing it later once it's solved) I don't see how a poll() can hang forever if the other end of one of the polled fd's has been closed? That doesn't make sense to me unless we failed to ask for the right events in the poll, or left the fd out of the poll, or something like that... maybe it's just a bug in my lame-ass main loop that the daemon uses.
I'll try to write a test that can be put in the test code. I wonder if the problem is that one of the unused ends of the sockets aren't closed under some circumstances, which should cause the HUP event not to be fired, I think. There is also a chance that there is a race with closing/opening fds, like not using close_and_invalidate in all the places.
I tried to write a test that triggered this but I was not able to do that. The test code does not end up exactly like the real case, which might explain it, especially if the reason is a race condition like I suspect. My theory about not closing the unsed socket ends or a race with closing/opening fds does not seem to be the reason either. I'm not sure how to proceed with this, any ideas?
We're seeing this, too. This should probably be a 1.0 blocker.
Has anyone tried DBUS_VERBOSE=1 DBUS_DEBUG_OUTPUT=1 on the daemon? (you would have to start the daemon from a terminal for this and redirect stderr to a logfile)
Hi guys, So I'm just scanning through the code and here's some (unhelpful) things that I see: - The code does treat EOF as HUP as mentioned in comment 7. It basically follows the algorithm "if there is stuff to read on the socket then try to read it. If all that results from the read is eof or error then close the socket. if there wasn't anything to read in the first place and (so) there was a hangup or error then close the socket" - There is a little ickyness in the unref function--it's checking WIFEXITED and WIFSIGNALED on sitter->status instead of status. That shouldn't matter though because it's only for debug spew.
Created attachment 2889 [details] buttons idea This is what I get from the bus with DBUS_VERBOSE and DBUS_DEBUG_OUTPUT set, but I can't spot anything particular in the output. There are two services, bug.ServiceA, and bug.ServiceB. A message "foobar" is sent to both, but only the first one receives it. The second service gets stuck in dbus_bus_get and when it finally times out and exits, the sitter returns from its poll and exits.
For what it's worth, this seems like a refcounting bug. This patch works works around the problem: --- bus/activation.c 18 Jan 2005 20:42:15 -0000 1.38 +++ bus/activation.c 23 Jun 2005 00:47:25 -0000 @@ -1498,20 +1498,21 @@ bus_activation_activate_service (BusActi if (!_dbus_hash_table_insert_string (activation->pending_activations, pending_activation->service_name, pending_activation)) { _dbus_verbose ("Failed to put pending activation in hash table\n"); BUS_SET_OOM (error); bus_pending_activation_unref (pending_activation); return FALSE; } + bus_pending_activation_ref (pending_activation); } if (!add_cancel_pending_to_transaction (transaction, pending_activation)) { _dbus_verbose ("Failed to add pending activation cancel hook to transaction\n"); BUS_SET_OOM (error); _dbus_hash_table_remove_string (activation->pending_activations, pending_activation->service_name); return FALSE; } Not sure what the correct fix is yet, though. I'm going to investigate more tomorrow.
Created attachment 2951 [details] [review] check-for-glproto.patch So Matthias and I looked at this some more. It appears that there is just some confusion about what the lifecycle of things should be. The babysitter process isn't supposed to hang around for the life time of the activated process, but instead just hang around until the process has registered itself with the bus. The solution is to just kill the babysitter process when no one needs it anymore (i.e. when the ref count of the babysitter process reaches 0).
Comment on attachment 2951 [details] [review] check-for-glproto.patch Yay, nice work. Patch looks reasonable to me.
Did you ever figure out why it matters that we're activating multiple services in a short time?
Hi, So I think it has to do with how quickly you fork(). Basically, if you close the last open file descriptor to the socket to the babysitter then the babysitter will wake up from poll (with a hangup) and quit. This works fine if you only fork once before you get to waitpid. The problem is when you fork more than once. Each fork will get a copy of all open file descriptors from the parent (including open sockets to other babysitters). That means the close on the socket in babysitter_unref won't drop the reference count on the open socket to 0 and so the hangup message will never be delivered. The easiest thing to do is just kill the child.
Hi, so this is commited now: 2005-06-28 Ray Strode <rstrode@redhat.com> * dbus/dbus-spawn.c (_dbus_babysitter_unref): kill babysitter helper process on last unref, bug #2813.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.