Bug 894

Summary:	Messages lost in D-BUS 0.21
Product:	dbus	Reporter:	Kimmo Hämäläinen <kimmo.hamalainen>
Component:	core	Assignee:	Havoc Pennington <hp>
Status:	RESOLVED FIXED	QA Contact:
Severity:	major
Priority:	high	CC:	oliv__a
Version:	unspecified
Hardware:	x86 (IA32)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	Code of the test server and client. possible fix cleaner fix

Description Kimmo Hämäläinen 2004-07-20 01:10:53 UTC

Dbus seems to lose messages when sending lots of messages in a loop.

I made a simple server that writes a line into a file whenever a message is
received and a client that sends 3000 messages to the server in a loop. It seems
that five or so last messages are lost. Interestingly, about the same amount of
messages is lost when sending 10000 messages.

Comment 1 Kimmo Hämäläinen 2004-07-20 01:12:07 UTC

Created attachment 496 [details]
Code of the test server and client.

Comment 2 Havoc Pennington 2004-07-20 07:01:28 UTC

are you syncing everything? (flushing all dbus messages from the client,
ensuring you've read them all on the server, ensuring you've written out all the
text lines)

Comment 3 Olivier Andrieu 2004-07-20 10:52:41 UTC

The bus drops some messages but you do not see it since you're not watching for
replies to your messages. 

I modified your code to use dbus_connection_send_with_reply instead of
dbus_connection_send and I get 'org.freedesktop.DBus.Error.LimitsExceeded' errors.

Comment 4 Havoc Pennington 2004-07-20 13:15:07 UTC

Probably NOTABUG then, but we should probably understand why the limit is
exceeded (in particular I don't know why it would be the same with 3000 and
10000 messages?)

I still haven't looked at the code, so apologies if this is silly ;-)

Comment 5 Olivier Andrieu 2004-07-20 13:59:47 UTC

(In reply to comment #4)
> Probably NOTABUG then, but we should probably understand why the limit is
> exceeded (in particular I don't know why it would be the same with 3000 and
> 10000 messages?)

It's the max_replies_per_connection limit that is exceeded. By default it is 32,
indeed not very high. 

It works fine when the NO_REPLY flag is set on the client messages.

Comment 6 Havoc Pennington 2004-07-20 14:09:17 UTC

So the client isn't reading replies quickly enough and they get backlogged?

I'd be curious whether a higher max replies value helps or not.

Comment 7 Olivier Andrieu 2004-07-20 14:48:25 UTC

(In reply to comment #6)
> So the client isn't reading replies quickly enough and they get backlogged?

no, the _server_ isn't replying quickly enough, so the daemon stops accepting
method calls from the client. 

Actually the server code in the attachement doesn't reply at all. So the only
moment the backlog of pending replies empties is when the daemon timeouts them ... 

But even once this is fixed by having the server sending a reply right away, the
default limit of 32 triggers some rejection on my system (FC2). Maybe some
scheduling issue (the client sends the 32 messages before the server has even
time to run ??)

> I'd be curious whether a higher max replies value helps or not.

It sure does.

Comment 8 Kimmo Hämäläinen 2004-07-21 00:16:13 UTC

I tried with the NO_REPLY flag set on, but same message loss happened.

I did it like this, using the dbus_message_set_no_reply function:
...
    if (!succ) {
      printf("appending an arg. to message %d failed!\n", i);
      exit(1);
    }
    /* set the NO_REPLY flag */
    dbus_message_set_no_reply(msg, 1);

    succ = dbus_connection_send(conn, msg, &ret);
    if (!succ) {
      printf("sending of message %d failed\n", i);
      exit(1);
...

So, it seems that the server has to reply, or messages are lost...

Comment 9 Olivier Andrieu 2004-07-21 03:46:41 UTC

(In reply to comment #8)
> I tried with the NO_REPLY flag set on, but same message loss happened.

OK. Actually the max_replies_per_connection isn't enforced in DBUS 0.21, only in
the CVS. So that wasn't explaining your message losses.

(But I still think the default value of 32 is too low)

Comment 10 Olivier Andrieu 2004-07-21 11:46:39 UTC

OK, I think I understand what is going on: it's probably a bug in the watch
handler logic.

unix_handle_watch() in dbus-transport-unix calls do_reading() which reads data
from the socket. But do_reading() never reads more than
max_bytes_read_per_iteration bytes at a time. Here, the client sends a lot of
large messages and then exits; on the daemon side, the socket buffer contains
more than max_bytes_read_per_iteration bytes and poll() signals a HUP condition.
unix_handle_watch() reads part of the buffer via do_reading() and then
disconnects the transport (closes the fd) although there is still unread data.

A possible solution is to have do_reading() ignore the
max_bytes_read_per_iteration limit when the watch has the HUP condition. I'm not
sure that's the cleanest way to solve this bug though :).

Comment 11 Olivier Andrieu 2004-07-21 11:47:34 UTC

Created attachment 508 [details] [review]
possible fix

Comment 12 Havoc Pennington 2004-07-21 20:03:11 UTC

This looks like an unrelated fix, might go ahead and commit it so we don't lose
it in this conversation:
-      if (_dbus_transport_queue_messages (transport) == DBUS_DISPATCH_NEED_MEMORY)
+      if (!_dbus_transport_queue_messages (transport))

I'm not sure what the HUP semantics are, if we have a bunch of bytes in the
buffer and we repeatedly poll and read, do we get:

IN
IN
IN | HUP
IN | HUP
IN | HUP
HUP

i.e. can we just say "ignore HUP while there's an IN flag"?

Comment 13 Kimmo Hämäläinen 2004-07-21 23:49:56 UTC

Verified this bug with D-BUS CVS version (session bus), dating 21th July 2004.

Comment 14 Olivier Andrieu 2004-07-22 19:02:43 UTC

(In reply to comment #12)

> I'm not sure what the HUP semantics are, if we have a bunch of bytes in the
> buffer and we repeatedly poll and read, do we get:
> 
> IN
> IN
> IN | HUP
> IN | HUP
> IN | HUP
> HUP

Apparently the last step (HUP alone) does not occur, ie the IN flag is always
present (but I don't know if that's a portable behaviour of poll). What happens
is that do_reading eventually do a _dbus_read() which returns 0 and causes the
transport disconnection.

Comment 15 Olivier Andrieu 2004-07-22 19:10:59 UTC

Created attachment 519 [details] [review]
cleaner fix

Basically, ignore the HANGUP condition

Comment 16 Kimmo Hämäläinen 2004-07-27 05:44:21 UTC

Verified the "cleaner fix" by patching D-BUS CVS version (session bus), dating
21th July 2004.

Tried with 5, 3000 and 10000 messages, and no message loss was detected.

Maybe we get this fix to the next release...

Comment 17 Olivier Andrieu 2004-07-29 01:12:53 UTC

committed, with one difference: I moved the test for ERROR and HANGUP at the
beginning of unix_handle_watch (it was there originally)

Comment 18 Kimmo Hämäläinen 2004-09-10 04:05:59 UTC

I think this bug should be closed, seems to work in 0.22...

Comment 19 Olivier Andrieu 2004-09-13 02:18:14 UTC

Right, let's close this bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.