Bug 896

Summary: D-BUS messages lost when auto-activating
Product: dbus Reporter: Kimmo Hämäläinen <kimmo.hamalainen>
Component: coreAssignee: D-Bus Maintainers <dbus>
Status: RESOLVED MOVED QA Contact: D-Bus Maintainers <dbus>
Severity: normal    
Priority: medium CC: michael.meeks, milo000, msniko14, oholy, oliv__a, scottt.tw, walters, will
Version: 1.5   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 52372    
Attachments: Test server and client + .service and .conf files used
test server and client + .service file for activating the server
fix
cleaner fix
new test programs
Implement-org.freedesktop.DBus.Peer-for-th.patch
Have-dbus-send-always-wait-until-message-i.patch
this version has the new test
Ignore-EPIPE-on-writes.patch
ignore sigpipe
updated version of Michael's patch
failing trace with revents=POLLIN|POLLERR|POLLHUP
0001-Bug-896-Avoid-race-conditions-reading-message-from.patch

Description Kimmo Hämäläinen 2004-07-20 02:57:25 UTC
I made a simple D-BUS server which writes a line into a file when it receives a
message and a client which sends 100 messages to the server with the
auto-activate flag up in every one of them. The server is not initially running
(it's supposed to be activated by the first message). The D-BUS system bus is
used and proper .service and .conf files are installed for the server.

When the client is run, one of the two things happen:
1) D-BUS system bus crashes (the server is not activated).
2) Another(!) D-BUS system bus appears to ps command's listing and the server is
activated, but the server does not receive any of the 100 messages.

In case of 2) happened and the client is run again, the server receives all the
100 messages (but there's still two D-BUS system buses running, according to ps).
Comment 1 Kimmo Hämäläinen 2004-07-20 02:59:04 UTC
Created attachment 497 [details]
Test server and client + .service and .conf files used
Comment 2 Olivier Andrieu 2004-07-20 16:50:38 UTC
Hmm, works fine for me, using current CVS version and testing with the session
bus instead of the system bus.
Comment 3 Kimmo Hämäläinen 2004-07-21 00:02:26 UTC
I tested it again with v0.21 session bus and the same bug is there, so it must
be fixed in the CVS version.
Comment 4 Kimmo Hämäläinen 2004-07-22 00:23:21 UTC
Tested this with D-BUS CVS version (dated 21 Jul 04) and strange things happened
(server not up, session bus running, only 5 messages with auto-activation flag set):

kihamala@mylly:~/dbus-test/act-test-progs$ ./bug_client 
service bug.svc does not exist
message 1 was put into the queue
message 2 was put into the queue
message 3 was put into the queue
message 4 was put into the queue
message 5 was put into the queue
Message sent
kihamala@mylly:~/dbus-test/act-test-progs$ Fd 4 did not have the close-on-exec
flag set!
Fd 5 did not have the close-on-exec flag set!
22758: assertion failed "sitter != NULL" file "dbus-spawn.c" line 244
Error when connecting to the session bus: Disconnected prior to receiving a reply

[1]+  Aborted                 ./dbus-daemon-1 --session --print-address  (wd:
~/dbus-prefix/bin)
(wd now: ~/dbus-test/act-test-progs)

I.e. the session bus crashed. Strange thing is that the bus does not crash if I
comment out line 43 from the client, where I set the auto-activation flag (but
of course does not launch the server either).

I'll attach the source code soon.
Comment 5 Kimmo Hämäläinen 2004-07-22 00:28:28 UTC
Created attachment 511 [details]
test server and client + .service file for activating the server


client resided in:
/home/kihamala/dbus-test/act-test-progs
server resided in (as specified in .service file):
/home/kihamala/dbus-test
D-BUS daemon (CVS version 21th Jul 04) resided in:
/home/kihamala/dbus-prefix/bin

Ldd output for bug_client:
kihamala@mylly:~/dbus-test/act-test-progs$ ldd bug_client
	libdbus-1.so.0 => /home/kihamala/dbus-prefix/lib/libdbus-1.so.0
(0x40018000)
	libdbus-glib-1.so.0 =>
/home/kihamala/dbus-prefix/lib/libdbus-glib-1.so.0 (0x4007c000)
	libc.so.6 => /lib/libc.so.6 (0x40092000)
	libnsl.so.1 => /lib/libnsl.so.1 (0x401c6000)
	libgobject-2.0.so.0 => /usr/lib/libgobject-2.0.so.0 (0x401db000)
	libglib-2.0.so.0 => /usr/lib/libglib-2.0.so.0 (0x40214000)
	/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
Comment 6 Olivier Andrieu 2004-07-22 08:26:56 UTC
OK, there are several issues. 

First, we have the same problem as bug #894 , ie the daemon loses the last
messages when the client exits. 

As for the assertion failure, here comes a patch.

Then, about the message loss, what's happenning is that, by the time the server
has spawn, registered with bus and acquired its service, the client has exited.
The problem is that in the case of auto-activation message, policy checking is
delayed until the server is up. So when the daemon wants to send the delayed
messages with the auto-activation flag, the client connection is dead, so it
cannot check the policy and it just bins these messages.
Comment 7 Olivier Andrieu 2004-07-22 08:31:23 UTC
Created attachment 512 [details] [review]
fix

first hunk is a cheap optimization, second hunk fixes the bug
Comment 8 Olivier Andrieu 2004-07-22 19:07:50 UTC
Created attachment 518 [details] [review]
cleaner fix

Basically, ignore the HANGUP condition
Comment 9 Olivier Andrieu 2004-07-22 19:14:16 UTC
#$%!£ I posted in the wrong bug. Sorry, that's for bug #894
Comment 10 Kimmo Hämäläinen 2004-07-27 05:53:58 UTC
Comment on attachment 518 [details] [review]
cleaner fix

marked obsolete to make clearer that this is not a fix to this bug...
Comment 11 Kimmo Hämäläinen 2004-07-27 06:32:08 UTC
Tried out the patch (dated 2004-07-22 08:31) with CVS D-BUS (dated 21 Jul 04,
except for patched with the "cleaner fix" for Bug #894) to replicate the test of
Comment #4.

Now the bus didn't crash but the server does not receive any messages on first
run of the client; though, the server gets activated. The server (which is
already running) receives all messages when running the client again.

Here is my test session (code from the second attachment):

kihamala@mylly:~/dbus-test/act-test-progs$ /home/kihamala/dbus-patched-prefix/bi
n/dbus-daemon-1 --session --print-address &
[6] 17565
unix:abstract=/tmp/dbus-pVqTHdAMW7
kihamala@mylly:~/dbus-test/act-test-progs$ export DBUS_SESSION_BUS_ADDRESS=unix:
abstract=/tmp/dbus-pVqTHdAMW7
kihamala@mylly:~/dbus-test/act-test-progs$ ps xau | grep dbus
message    218  0.0  0.0  2064  968 ?        Ss   Jul23   0:00 /usr/bin/dbus-dae
mon-1 --system
kihamala 17565  0.0  0.0  2200  788 pts/0    S    16:26   0:00 /home/kihamala/db
us-patched-prefix/bin/dbus-daemon-1 --session --print-address
kihamala@mylly:~/dbus-test/act-test-progs$ cat /tmp/bug_svc.out
cat: /tmp/bug_svc.out: No such file or directory
kihamala@mylly:~/dbus-test/act-test-progs$ ./bug_client 
service bug.svc does not exist
message 1 was put into the queue
message 2 was put into the queue
message 3 was put into the queue
message 4 was put into the queue
message 5 was put into the queue
Message sent
kihamala@mylly:~/dbus-test/act-test-progs$ Fd 4 did not have the close-on-exec f
lag set!
Fd 5 did not have the close-on-exec flag set!
going to the main loop

kihamala@mylly:~/dbus-test/act-test-progs$ cat /tmp/bug_svc.out
cat: /tmp/bug_svc.out: No such file or directory
kihamala@mylly:~/dbus-test/act-test-progs$ ps xau | grep dbus
message    218  0.0  0.0  2064  968 ?        Ss   Jul23   0:00 /usr/bin/dbus-dae
mon-1 --system
kihamala 17565  0.0  0.0  2304 1128 pts/0    S    16:26   0:00 /home/kihamala/db
us-patched-prefix/bin/dbus-daemon-1 --session --print-address
kihamala 17571  0.0  0.0  2748  936 pts/0    S    16:27   0:00 /home/kihamala/db
us-test/bug_svc
kihamala@mylly:~/dbus-test/act-test-progs$ ./bug_client 
service bug.svc exists
message 1 was put into the queue
message 2 was put into the queue
message 3 was put into the queue
message 4 was put into the queue
message 5 was put into the queue
Message sent
kihamala@mylly:~/dbus-test/act-test-progs$ cat /tmp/bug_svc.out
the string argument: 'hello world 1'
the string argument: 'hello world 2'
the string argument: 'hello world 3'
the string argument: 'hello world 4'
the string argument: 'hello world 5'
kihamala@mylly:~/dbus-test/act-test-progs$ ps xau | grep dbus
message    218  0.0  0.0  2064  968 ?        Ss   Jul23   0:00 /usr/bin/dbus-dae
mon-1 --system
kihamala 17565  0.0  0.0  2304 1128 pts/0    S    16:26   0:00 /home/kihamala/db
us-patched-prefix/bin/dbus-daemon-1 --session --print-address
kihamala 17571  0.0  0.0  2748  948 pts/0    S    16:27   0:00 /home/kihamala/db
us-test/bug_svc
kihamala@mylly:~/dbus-test/act-test-progs$ 
Comment 12 Olivier Andrieu 2004-07-27 07:17:41 UTC
Yes, that's because of the policy checking: the client exits before the server
is up. 

It would work fine if your client were listening for replies of the method calls
or if it waited for the ServiceCreated signal before exiting.
Comment 13 Olivier Andrieu 2004-07-29 01:15:41 UTC
committed the fix for the assertion failure.

Ensuring messages do not get lost is a bit harder, maybe we could split
bus_context_check_security_policy in two ?
Comment 14 Havoc Pennington 2004-07-29 23:14:14 UTC
Comment on attachment 512 [details] [review]
fix

this is obsolete right? marking it so
Comment 15 Havoc Pennington 2004-07-29 23:19:15 UTC
I'm not sure honestly how to fix this one. You have to fully dispatch the
messages from a connection before you let that connection be shut down.

We could try just popping each message off the connection and dispatching it 
at the top of bus_connection_disconnected(). Seems like it could create some
pretty scary reentrancy issues though.

It'd be good to have test suite coverage for "connection disconnecting while all
its messages have not been dispatched"
Comment 16 Colin Walters 2004-10-29 12:04:21 UTC
Hm, if the problem is that we can't check the message policy due to the sender
having gone away - might a solution be to cache connection credentials until all
messages sent from that connection are destroyed?  
Comment 17 John (J5) Palmieri 2005-08-24 21:50:56 UTC
I think this has been fixed when we fixed activation.  Please confirm.
Comment 18 Kimmo Hämäläinen 2005-08-24 22:57:49 UTC
I think this is still unsolved (see Havoc's last comment). However, this is easy
to workaround by making sure that the client does not quit too quickly after
sending the messages, but e.g. sleeps few seconds before exiting.
Comment 19 Kimmo Hämäläinen 2006-03-17 20:41:20 UTC
Finally I took the time to test this in (slightly modified) v0.61 DBus. The
messages are still getting lost, but as the workaround (sleeping) is quite
simple and unproblematic, this bug could be marked as WONTFIX.

This shows that when the service was not yet running (and was activated), it did
not receive any of the sent messages. When it was already running, it received
all five messages.

----
Nokia770-10:~# ps xau | grep bug_svc
  943 root        432 S   grep bug_svc 
Nokia770-10:~# ./bug_client
sending message 0
sending message 1
sending message 2
sending message 3
sending message 4
Nokia770-10:~# going to the main loop
iface: org.freedesktop.DBus, member: NameAcquired, path: /org/freedesktop/DBus
iface: org.freedesktop.DBus, member: NameAcquired, path: /org/freedesktop/DBus

Nokia770-10:~# ps xau | grep bug_svc
  946 messageb    864 S   /root/bug_svc 
  948 root        432 S   grep bug_svc 
Nokia770-10:~# ./bug_client
sending message 0
iface: foo.bar, member: foobar, path: /bug/svc
sending message 1
iface: foo.bar, member: foobar, path: /bug/svc
sending message 2
iface: foo.bar, member: foobar, path: /bug/svc
sending message 3
iface: foo.bar, member: foobar, path: /bug/svc
sending message 4
iface: foo.bar, member: foobar, path: /bug/svc
Nokia770-10:~# 
Comment 20 Kimmo Hämäläinen 2006-03-17 20:44:02 UTC
Created attachment 4976 [details]
new test programs

I used these test programs.
Comment 21 Kimmo Hämäläinen 2006-09-05 06:19:42 UTC
provided the info, so setting another state
Comment 22 Havoc Pennington 2006-09-05 07:23:22 UTC
this is a pretty bad race really, potentially breaks any dbus-send - john you 
might consider it for your blocker bug
Comment 23 Kimmo Hämäläinen 2006-12-19 03:55:11 UTC
(In reply to comment #22)
> this is a pretty bad race really, potentially breaks any dbus-send - john you 
> might consider it for your blocker bug

Yes, indeed this makes dbus-send unreliable unless it has the '--print-reply'
option. I just spent two days to realise this.
Comment 24 Michael Meeks 2008-03-27 07:56:21 UTC
So - this is badly hurting NetworkManager reliability for us too - we have rather a number of duplicates of this underlying issue from using dbus-send.

The option of 'sleeping a while' before closing the connection is never the right solution :-)

I'm working on a patch - also, cf. my description of the problem here:

http://lists.freedesktop.org/archives/dbus/2008-March/009526.html
Comment 25 Colin Walters 2009-03-18 15:30:40 UTC
Created attachment 24013 [details] [review]
Implement-org.freedesktop.DBus.Peer-for-th.patch

We want the bus to reply to the Peer interface for obvious
reasons.  To implement, refactor the code in dbus-connection
so that we have a (internal public) method to process a message
and create the reply.

The actual sending function differs from bus-internal versus
from a DBusConnection, so we can't just call the peer filter
directly.

Create a test too.
Comment 26 Colin Walters 2009-03-18 15:31:24 UTC
Created attachment 24014 [details] [review]
Have-dbus-send-always-wait-until-message-i.patch

For method calls, we simply wait for the reply, even if we don't
print it.  For signals, we send an org.freedesktop.DBus.Peer.Ping
message to the destination.

Note as a side effect of this, the dbus-send exit code will now
be nonzero if the remote side replies with an error, even if
it's not printed.  This is a desirable change, but could in theory
break a script.  I am of the opinion the script is buggy if it's
broken by this.
Comment 27 Colin Walters 2009-03-18 15:43:17 UTC
Created attachment 24016 [details] [review]
this version has the new test

Forgot to git add the test.
Comment 28 Colin Walters 2009-03-18 16:32:01 UTC
Created attachment 24019 [details] [review]
Ignore-EPIPE-on-writes.patch

From 7a4fd0fb71e698f3c8a04a245df9bf0187c45945 Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Wed, 18 Mar 2009 19:26:33 -0400
Subject: [PATCH] Bug 896 - Ignore EPIPE on writes

http://lists.freedesktop.org/archives/dbus/2008-March/009526.html

We can get an EPIPE if the remote end has closed the connection
on a write.  However, it's possible we still have messages to
read (and the socket can still be valid for reading).  This
is referred to as a "half open" connection.  It turns out that
things are correct if we simply ignore the EPIPE.  We actually
want to treat the connection as closed only when reading is
complete (i.e. we get a hangup/EOF).
Comment 29 Colin Walters 2009-03-18 16:44:17 UTC
Created attachment 24020 [details] [review]
ignore sigpipe

This patch is from http://lists.freedesktop.org/archives/dbus/2008-March/009531.html

it need updating to the latest code.
Comment 30 Colin Walters 2009-03-18 16:46:38 UTC
Actually Avery claims (in http://lists.freedesktop.org/archives/dbus/2008-March/009544.html ) that something closer to my smaller patch will work.

Opinions?  Needs tests I know...I'll look at that.
Comment 31 Michael Meeks 2009-03-19 06:47:50 UTC
Hi Colin, thanks for the CC.
Of course - I prefer my patch, that (at least attempts) to get the underlying bug fixed right [ no doubt band-aiding this one short-lived client will cause grief for others in the future ].

My second post, even had a hacked up regression test [ which needed some love ]:
http://lists.freedesktop.org/archives/dbus/2008-March/009543.html

We don't ship my patch, and it clearly needed testing - but hopefully shows what is needed I guess. Happy to help test out whatever ultimate fix you come up with.
Comment 32 Dafydd Harries 2009-04-30 05:18:14 UTC
Created attachment 25304 [details] [review]
updated version of Michael's patch

(In reply to comment #31)
> Hi Colin, thanks for the CC.
> Of course - I prefer my patch, that (at least attempts) to get the underlying
> bug fixed right [ no doubt band-aiding this one short-lived client will cause
> grief for others in the future ].

By his smaller patch, I think he meant the one that just ignored EPIPE on writes, not the dbus-send patch that makes an extra method call. (I gather that the portion to implement the Peer interface on the bus daemon has already been applied though.)

However, Colin's patch didn't work when I tested it; we also need the change you made to avoid closing the transport on hangup when the read watch is still active.

Here's a patch which updates yours to apply to the current code, but with Colin's _dbus_get_is_errno_epipe() since that seemed a bit nicer style-wise. I haven't looked at the regression test.
Comment 33 Colin Walters 2009-07-10 18:00:20 UTC
Created attachment 27579 [details]
failing trace with revents=POLLIN|POLLERR|POLLHUP

This patch doesn't look to be quite enough.  I'm able to reproduce this attached failing trace with it.  My setup is a "strace -s 4000 -o /tmp/dbus.strace" on the bus, and then a while loop running: 
while true; do sleep 0.1; dbus-send --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.Introspectable.Introspect; done

If the bus reads the message, I count that as success.  System is Fedora 11, Linux megatron 2.6.29.5-191.fc11.x86_64, libc-2.10.1-2.x86_64.

This is a different race though...
Comment 34 Colin Walters 2009-07-10 18:37:15 UTC
Created attachment 27585 [details] [review]
0001-Bug-896-Avoid-race-conditions-reading-message-from.patch

This patch looks like it closes both races for me.  Comments appreciated...let's get this fixed (sorry for being really slow on this bug).
Comment 35 Scott James Remnant 2009-07-13 09:14:03 UTC
Looks sane to me
Comment 36 Colin Walters 2009-07-13 10:27:52 UTC
Pushed the patch.
Comment 37 Michael Meeks 2009-07-14 02:49:02 UTC
Whoot - thanks Colin, you rock :-)
I guess we can bin the sillies in dbus-send now - I believe we added an unnecessary 'ping' after async methods there.

As a more advanced question - it was unclear to me quite why we needed two sockets per client, instead of one - more to poll and fiddle with; ORBit2 makes do with one, and that makes this case a simple ordering problem in the poll handler rather than something more complex like this. But anyway, glad to see it fixed.
Comment 38 Colin Walters 2009-07-14 12:08:24 UTC
The dbus-send patch was never committed AFAICS, only proposed.  

Two sockets?  I believe we only have one socket.  Or you mean the two poll records?  I agree that's unfortunate; in a quick glance I think it's an artifact of an attempt to handle threading issues.  Havoc would probably know for sure.
Comment 39 Michael Meeks 2009-07-15 02:14:47 UTC
> The dbus-send patch was never committed AFAICS, only proposed.  

Good 'oh.

> Two sockets?  I believe we only have one socket.  Or you mean the two poll
> records?  I agree that's unfortunate; in a quick glance I think it's an
> artifact of an attempt to handle threading issues.  Havoc would probably
> know for sure.

Hmm ;-) I don't believe there needs to be any advantage wrt. threading issues, and there is certainly some performance hit both in code complexity, cost in allocating two of these, and cost in polling the same fd twice (~halving the peak performance there).

It is not as if the two watches even have different callbacks:

static dbus_bool_t
socket_connection_set (DBusTransport *transport)
{
  DBusTransportSocket *socket_transport = (DBusTransportSocket*) transport;

  _dbus_watch_set_handler (socket_transport->write_watch,
                           _dbus_connection_handle_watch,
                           transport->connection, NULL);

  _dbus_watch_set_handler (socket_transport->read_watch,
                           _dbus_connection_handle_watch,
                           transport->connection, NULL);

Commit comments like:

    2004-11-25  Havoc Pennington  <hp@redhat.com>
    
            The primary change here is to always write() once before adding
        the write watch, which gives us about a 10% performance increase.
    
Make this sound like a performance issue too; in theory adding a write watch should involve no allocation, no heavy lifting, but just be twiddling a single bit in a bit-mask of an already existing read/write watch record (surely) ?

But anyhow ... ;-) perhaps something to investigate for those wanting to improve dbus performance.
Comment 40 Xavier Claessens 2010-03-26 09:37:20 UTC
I tested on Fremantle, using DBus 1.2.14-0maemo4+0m5, and I can still reproduce this bug:

When the application is not pre-started, using dbus-send does activate the app, but don't call the method, unless I give  --print-reply option.

Reopening as Will asked.
Comment 41 Jiří Klimeš 2010-10-08 01:15:29 UTC
Don't know exactly if its the same bug. But messages *are* still lost when the client exits early.

These commands don't work:
dbus-send --system --dest=org.freedesktop.NetworkManager \
/org/freedesktop/NetworkManager org.freedesktop.DBus.Properties.Set \
string:'org.freedesktop.NetworkManager' string:'WirelessEnabled' \
variant:boolean:false
dbus-send --system --dest=org.freedesktop.NetworkManager \
/org/freedesktop/NetworkManager org.freedesktop.NetworkManager.Sleep \
boolean:false

Adding '--print-reply' helps.

For the reference, the patch from comment #34 is committed here:
http://cgit.freedesktop.org/dbus/dbus/commit/?id=87ddff6b24d9b9d4bba225c33890db25022d8cbe
So it doesn't fix the bug.

$ rpm -qv dbus
dbus-1.2.24-1.fc13.x86_64
Comment 42 Havoc Pennington 2010-10-11 10:03:01 UTC
It seems like this bug may have ended up mixing up some different issues.

One bug (that the patches tried to fix) is failure to _read_ from a connection that has had a write failure. This is a problem too.

However the other issue (I don't see a patch for) is that once a connection's socket is closed, which now with the fixes means all messages have been _read_, the messages have not all been _dispatched_. When dispatching a message I guess the bus tries to get unix credentials, which are not available on closed connections. So maybe that's why people still see issues?
Comment 43 Ondrej Holy 2015-05-14 11:36:47 UTC
Is there any progress? It seems this bug causes that gvfs-open doesn't work correctly, see:
https://bugzilla.gnome.org/show_bug.cgi?id=746534

It seems g_dbus_connection_flush() after async call isn't enough.
Comment 44 GitLab Migration User 2018-10-12 21:03:42 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/dbus/dbus/issues/1.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.