Bug 2813 - D-BUS v0.23.2: cannot activate several services in short time
Summary: D-BUS v0.23.2: cannot activate several services in short time
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: unspecified
Hardware: x86 (IA32) Linux (All)
: high critical
Assignee: Havoc Pennington
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 5772
Blocks:
  Show dependency treegraph
 
Reported: 2005-03-25 10:02 UTC by Kimmo Hämäläinen
Modified: 2006-08-10 14:11 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:


Attachments
all code, .service files and session.conf used (16.54 KB, application/x-gtar)
2005-03-25 10:06 UTC, Kimmo Hämäläinen
Details

Description Kimmo Hämäläinen 2005-03-25 10:02:04 UTC
It seems that D-BUS cannot cope with a situation where several services need to
be activated in a short time. This can be replicated every time with a client
and three services. The services are not running and the client sends a message
to each of the service, but D-BUS daemon fails to start all of the services
(sometimes one, sometimes two). Note also how long it takes for the buggy client
to execute; for some reason, checking for service existence in end of the
program takes a lot of time. There's also a modified client no_bug_client that
waits one second after each message. D-BUS works fine with the modified client.

kihamala@mylly:~/dbus-test/multi-act-bug$ ps xau | grep bug
kihamala  7750  0.0  0.0   1548   468 pts/2    R+   19:37   0:00 grep bug
kihamala@mylly:~/dbus-test/multi-act-bug$ time ./bug_client 
service bug.svc does not exist
service bug.svca does not exist
service bug.svcb does not exist
message to bug.svc was put to the queue
message to bug.svca was put to the queue
message to bug.svcb was put to the queue
Messages sent and flushed
Sleeping for 5 secs..
going to the main loop
going to the main loop
Error when connecting to the session bus: No reply within specified time
service bug.svc exists
service bug.svca does not exist
service bug.svcb exists
I'm still connected

real    0m25.553s
user    0m0.002s
sys     0m0.002s
kihamala@mylly:~/dbus-test/multi-act-bug$ ps xau | grep bug
kihamala  7755  0.0  0.0   2560   836 pts/2    S    19:38   0:00 /home/kihamala/
dbus-test/multi-act-bug/bug_svc
kihamala  7759  0.0  0.0   2560   836 pts/2    S    19:38   0:00 /home/kihamala/
dbus-test/multi-act-bug/bug_svcb
kihamala  7761  0.0  0.0   1548   468 pts/2    R+   19:38   0:00 grep bug
kihamala@mylly:~/dbus-test/multi-act-bug$ kill 7755 7759
kihamala@mylly:~/dbus-test/multi-act-bug$ time ./no_bug_client
service bug.svc does not exist
service bug.svca does not exist
service bug.svcb does not exist
message to bug.svc was put to the queue
Sleeping a second
going to the main loop
message to bug.svca was put to the queue
Sleeping a second
going to the main loop
message to bug.svcb was put to the queue
Sleeping a second
going to the main loop
Messages sent and flushed
Sleeping for 5 secs..
service bug.svc exists
service bug.svca exists
service bug.svcb exists
I'm still connected

real    0m8.021s
user    0m0.001s
sys     0m0.001s
kihamala@mylly:~/dbus-test/multi-act-bug$ ps xau | grep bug
kihamala  7764  0.0  0.0   2560   836 pts/2    S    19:39   0:00 /home/kihamala/
dbus-test/multi-act-bug/bug_svc
kihamala  7766  0.0  0.0   2560   836 pts/2    S    19:39   0:00 /home/kihamala/
dbus-test/multi-act-bug/bug_svca
kihamala  7768  0.0  0.0   2560   836 pts/2    S    19:39   0:00 /home/kihamala/
dbus-test/multi-act-bug/bug_svcb
kihamala  7770  0.0  0.0   1548   468 pts/2    R+   19:39   0:00 grep bug
kihamala@mylly:~/dbus-test/multi-act-bug$ diff bug_client.c no_bug_client.c 
62a63,64
>     printf("Sleeping a second\n");
>     sleep(1);
kihamala@mylly:~/dbus-test/multi-act-bug$ dpkg -l | grep dbus
ii  dbus-1         0.23.2-3       simple interprocess messaging system
ii  dbus-1-dev     0.23.2-3       simple interprocess messaging system (develo
ii  dbus-1-utils   0.23.2-3       simple interprocess messaging system (utilit
ii  dbus-glib-1    0.23.2-3       simple interprocess messaging system (GLib-b
ii  dbus-glib-1-de 0.23.2-3       simple interprocess messaging system (GLib i
Comment 1 Kimmo Hämäläinen 2005-03-25 10:06:20 UTC
Created attachment 2214 [details]
all code, .service files and session.conf used
Comment 2 Richard Hult 2005-04-27 13:27:27 UTC
I've looked into this and have some comments. First, I've reproduced the problem
on both the 0.23 branch and HEAD. Second, it doesn't happen always so there
seems to be a race condition that causes this.

It's actually worse than it first looks, since the bus blocks and no other
communication with it is possible after this happens (so I'm raising the severity).

The problem seems to be that the babysitter doesn't always get a HUP on the
socket to the bus. This causes the waitpit in the bus to block indefinitely (in
_dbus_babysitter_unref in dbus-spawn.c). That code first closes the socket from
the bus side, then waits for the sitter to exit. The sitter in turn polls the
socket and exits on HUP or ERR but that never happens in this case for some
reason, most of the time.
Comment 3 Kimmo Hämäläinen 2005-04-28 01:23:30 UTC
(I don't quite get what is the purpose of all that babysitter stuff, would it be
simpler if the daemon gets SIGCHLD directly?)

I wonder what is the purpose of the write() in babysit_signal_handler(); if it's
meant for raising SIGPIPE, why not do it directly with raise(3)?
Comment 4 Richard Hult 2005-04-28 01:38:41 UTC
The babysitter's purpose is to handle the life-cycle of the child before the
connection has been set up between the bus and the child. It's needed since the
child is double-forked to be detached from the bus.
Comment 5 Kimmo Hämäläinen 2005-04-28 05:04:36 UTC
Interesting, the _dbus_spawn_async_with_babysitter's comment says that sitter_p
can be NULL. But it cannot be NULL, because it's dereferenced right in the
beginning of the function (*sitter_p = NULL) without checking its value...

But if it jams to waitpid() in _dbus_babysitter_unref(), it could mean that the
reference counting fails at one point. I.e. the bus thinks that the child is
exiting, but it's not, so waitpid() blocks?
Comment 6 Richard Hult 2005-04-28 08:22:50 UTC
My debugging showed that the sitter was still running (and not blocking), but
not getting a HUP event on the socket to the bus. The bus closes the socket
right before doing waitpid so I don't think it has anything to do with the
sitter refcounting.
Comment 7 Havoc Pennington 2005-04-28 10:20:43 UTC
I think a "stuff to read" flag from poll(), and then a read() that returns 0,
should also be treated as a HUP (I haven't looked at the code but maybe it fails
to do this)
Comment 8 Richard Hult 2005-04-29 00:29:13 UTC
I actually tried adding a read since I saw that sometimes a read event comes in
and isn't handled.  It doesn't help this particular case though since the sitter
never returns from the poll.
Comment 9 Havoc Pennington 2005-04-29 06:16:00 UTC
It would be good to get this failure into dbus-spawn.c:_dbus_spawn_test() (both
for a simpler debug case, and to avoid reintroducing it later once it's solved)

I don't see how a poll() can hang forever if the other end of one of the polled
fd's has been closed? That doesn't make sense to me unless we failed to ask for
the right events in the poll, or left the fd out of the poll, or something like
that... maybe it's just a bug in my lame-ass main loop that the daemon uses.
Comment 10 Richard Hult 2005-05-04 02:57:41 UTC
I'll try to write a test that can be put in the test code.

I wonder if the problem is that one of the unused ends of the sockets aren't
closed under some circumstances, which should cause the HUP event not to be
fired, I think. There is also a chance that there is a race with closing/opening
fds, like not using close_and_invalidate in all the places.
Comment 11 Richard Hult 2005-05-24 03:03:10 UTC
I tried to write a test that triggered this but I was not able to do that. The
test code does not end up exactly like the real case, which might explain it,
especially if the reason is a race condition like I suspect.

My theory about not closing the unsed socket ends or a race with closing/opening
fds does not seem to be the reason either.

I'm not sure how to proceed with this, any ideas?
Comment 12 Matthias Clasen 2005-05-24 13:36:43 UTC
We're seeing this, too. This should probably be a 1.0 blocker.
Comment 13 Havoc Pennington 2005-05-24 16:18:32 UTC
Has anyone tried DBUS_VERBOSE=1 DBUS_DEBUG_OUTPUT=1 on the daemon? (you would
have to start the daemon from a terminal for this and redirect stderr to a logfile)
Comment 14 Ray Strode [halfline] 2005-05-25 23:52:43 UTC
Hi guys,

So I'm just scanning through the code and here's some (unhelpful) things that I see:

- The code does treat EOF as HUP as mentioned in comment 7.  It basically
follows the algorithm "if there is stuff to read on the socket then try to read
it.  If all that results from the read is eof or error then close the socket. 
if there wasn't anything to read in the first place and (so) there was a hangup
or error then close the socket"

- There is a little ickyness in the unref function--it's checking WIFEXITED and
WIFSIGNALED on sitter->status instead of status.  That shouldn't matter though
because it's only for debug spew.
Comment 15 Richard Hult 2005-06-15 02:42:34 UTC
Created attachment 2889 [details]
buttons idea

This is what I get from the bus with DBUS_VERBOSE and DBUS_DEBUG_OUTPUT set,
but I can't spot anything particular in the output. There are two services,
bug.ServiceA, and bug.ServiceB. A message "foobar" is sent to both, but only
the first one receives it.

The second service gets stuck in dbus_bus_get and when it finally times out and
exits, the sitter returns from its poll and exits.
Comment 16 Ray Strode [halfline] 2005-06-22 17:48:51 UTC
For what it's worth, this seems like a refcounting bug.  This patch works works
around the problem:

--- bus/activation.c    18 Jan 2005 20:42:15 -0000      1.38
+++ bus/activation.c    23 Jun 2005 00:47:25 -0000
@@ -1498,20 +1498,21 @@ bus_activation_activate_service (BusActi
       if (!_dbus_hash_table_insert_string (activation->pending_activations,
                                           pending_activation->service_name,
                                            pending_activation))
        {
           _dbus_verbose ("Failed to put pending activation in hash table\n");

          BUS_SET_OOM (error);
          bus_pending_activation_unref (pending_activation);
          return FALSE;
        }
+      bus_pending_activation_ref (pending_activation);
     }

   if (!add_cancel_pending_to_transaction (transaction, pending_activation))
     {
       _dbus_verbose ("Failed to add pending activation cancel hook to
transaction\n");
       BUS_SET_OOM (error);
       _dbus_hash_table_remove_string (activation->pending_activations,
                                      pending_activation->service_name);
       return FALSE;
     }

Not sure what the correct fix is yet, though. I'm going to investigate more
tomorrow.
Comment 17 Ray Strode [halfline] 2005-06-24 13:23:04 UTC
Created attachment 2951 [details] [review]
check-for-glproto.patch

So Matthias and I looked at this some more.

It appears that there is just some confusion about what the lifecycle of things
should be.   The babysitter process isn't supposed to hang around for the life
time of the activated process, but instead just hang around until the process
has registered itself with the bus.  The solution is to just kill the
babysitter process when no one needs it anymore (i.e. when the ref count of the
babysitter process reaches 0).
Comment 18 Havoc Pennington 2005-06-24 19:27:35 UTC
Comment on attachment 2951 [details] [review]
check-for-glproto.patch

Yay, nice work.

Patch looks reasonable to me.
Comment 19 Havoc Pennington 2005-06-24 19:28:14 UTC
Did you ever figure out why it matters that we're activating multiple services
in a short time?
Comment 20 Ray Strode [halfline] 2005-06-27 07:05:08 UTC
Hi,

So I think it has to do with how quickly you fork().

Basically, if you close the last open file descriptor to the socket to the
babysitter then the babysitter will wake up from poll (with a hangup) and quit.
 This works fine if you only fork once before you get to waitpid.  The problem
is when you fork more than once. Each fork will get a copy of all open file
descriptors from the parent (including open sockets to other babysitters).  That
means the close on the socket in babysitter_unref won't drop the reference count
on the open socket to 0 and so the hangup message will never be delivered.

The easiest thing to do is just kill the child.
Comment 21 Ray Strode [halfline] 2005-06-28 08:28:33 UTC
Hi, so this is commited now:

2005-06-28  Ray Strode  <rstrode@redhat.com>

        *  dbus/dbus-spawn.c (_dbus_babysitter_unref): kill
        babysitter helper process on last unref, bug #2813.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.