It's possible to make dbus-daemon stop responding at all to any requests, with some effort. It's caused by its buggy use of poll(2).
The manual for poll says:
EINVAL The nfds value exceeds the RLIMIT_NOFILE value.
The code in dbus-mainloop.c never checks for error conditions returning from poll() / _dbus_poll(), assuming that errors cannot happen. EBADF and EFAULT cannot happen in a properly constructed application, ENOMEM should be self-corrected by the kernel and EINTR is harmless.
However, EINVAL can only be recovered from by trying to poll less file descriptors. It can also be prevented by properly constructing the application, which D-Bus isn't doing.
The reason why nfds got over the limit is because the code is creating one entry for POLLIN and another for POLLOUT. So it's possible to cause the problem to occur by opening more than 512 connections to the bus and making it try to write to most of those.
In the case I ran into, nfds was 1047. That means EINVAL got returned and no file descriptors got acted on. By consequence, the daemon did not deliver any messages nor closed any sockets, so it reentered the same poll(2) call, which failed again. So the daemon was unrecoverable: no new connections were accepted, no messages delivered or accepted.
Applications with blocking calls started hanging and the desktop became unusable.
The suggested solution for this problem is to create one DBusPollFD per file descriptor, no more.
(In reply to comment #0)
> The reason why nfds got over the limit is because the code is creating one
> entry for POLLIN and another for POLLOUT. So it's possible to cause the problem
> to occur by opening more than 512 connections to the bus and making it try to
> write to most of those.
I always thought it was pretty weird that DBusConnection has two watches for one fd... unfortunately, it follows from these assumptions:
* DBusWatch's flags are immutable during the watch's lifetime
* we don't want to remove/add a DBusWatch every time whether we want to write changes
This is also irritating if we want to use epoll_wait(2) on Linux, which requires you to watch each fd at most once. Under that regime, we need to adjust the setup for each fd whenever the corresponding read-watch *or* write-watch is added, toggled or removed.
I agree that the poll() code here is busted, however I think there's also a larger issue, which is the desynchronization between the process ulimit for the bus, and its configuration.
Many Linux forks and embeds use 1024 as a per-process default I believe. However the default bus configuration we ship has:
parser->limits.max_completed_connections = 2048;
In addition to:
parser->limits.max_incomplete_connections = 64;
For completeness there is also:
parser->limits.max_connections_per_user = 256;
* Should the system bus instance e.g. ask for more via setrlimit() on startup?
We'd have to do this before we called setuid(), since on Linux we need CAP_SYS_RESOURCE.
* Should we change the hardcoded defaults here to be below 1024?
(In reply to comment #2)
> I agree that the poll() code here is busted, however I think there's also a
> larger issue, which is the desynchronization between the process ulimit for the
> bus, and its configuration.
Let's restrict this bug to the double-polling (for which I now have a patch), and clone it for the ulimit/configuration mismatch.
(In reply to comment #4)
> clone it for the ulimit/configuration mismatch
> The suggested solution for this problem is to create one DBusPollFD per file
> descriptor, no more.
Patch on the way; the branch in my gitweb is based on a merge of all my pending branches, but I think it probably "only" needs the branches from Bug #33342 and Bug #33336, plus Bug #33126 so it will pass the regression tests reliably.
I don't think this is suitable for 1.4 due to its dependencies, but it's my top priority for 1.5.
(Looking at the patch now, I could probably have done this without depending on *all* of Bug #33342, but it requires at least the commits "keep separate lists of watches and timeouts" and "factor out watch_flags_to_poll_events ...", and I think a minimal backport would still be too big for 1.4.)
Created attachment 42465 [details] [review]
DBusLoop: store watches in a hash table keyed by fd
This means we don't need to build up the watches_for_fds array, because
it's quick to find the watches based on the fd.
This also fixes fd.o #23194, because we only put each fd in the
array of DBusPollFD once, with the union of all its watches' flags; we
can no longer get more entries in the array than our number of file
descriptors, which is capped at our RLIMIT_NOFILE, so we can't get
EINVAL from poll(2).
Created attachment 42466 [details] [review]
bus-test: close fds beyond 2 on startup
If we're going to check for leaks by asserting that we have no open files
beyond fd 2, then we ought to make sure that's the case on startup, in
case our parent process gave us a fd we didn't want (gdb does this, for
(This is not required to fix this bug, but it was very useful during development, to make the test not fail too early under gdb.)
Review of attachment 42466 [details] [review]:
@@ +139,3 @@
Would prefer this lived in dbus-sysdeps.h.
@@ +182,3 @@
This needs a non-#ifdef __linux__ implementation (and should probably fall back if the open of /prof/self/fd fails for some reason too).
This code is definitely scattered around; look at e.g.
dbus-sysdeps-unix.c:3108 where we use sysconf (_SC_OPEN_MAX) and then a loop. I think it'd be worth at least refactoring that bit to use this common function.
Also I suggest a better name is:
_dbus_close_fds_from (3); ?
(In reply to comment #8)
> This needs a non-#ifdef __linux__ implementation (and should probably fall back
> if the open of /prof/self/fd fails for some reason too).
For this purpose, it doesn't need a fallback: it's meant to be an exact mirror of the check for leaked fds in the same file. Both are #ifdef __linux__ and only used in the regression tests.
> _dbus_close_nonstandard_fds ();
It'd be reasonable to put that function (and perhaps the leak check too) in sysdeps-util, though, and if placed in that location I agree it should be cross-platform.
Comment on attachment 42465 [details] [review]
DBusLoop: store watches in a hash table keyed by fd
>From 55e1ef617c6d1892dc5b390cf6cb8252ee002a89 Mon Sep 17 00:00:00 2001
>From: Simon McVittie <firstname.lastname@example.org>
>Date: Tue, 25 Jan 2011 15:38:06 +0000
>Subject: [PATCH 1/2] DBusLoop: store watches in a hash table keyed by fd
Reading over this it definitely feels right, but I am not very familiar with the DBusWatch code unfortunately. I just have a minor style comments:
>+ if (dbus_watch_get_enabled (link->data))
I've been bitten before when using "link->data" style direct dereferences with GList; basically you're trying to refactor some code, and you have to try to remember each time you see link->data what the actual type is.
Can you just hoist it out like you've done elsewhere with:
DBusWatch *watch = link->data;
if (dbus_watch_get_enabled (watch)) ...
Otherwise if no one else comments and this passes the test suite etc., I think it's probably good to go.
I can append a fix for the link->data pattern, sure.
Other things that block merging this, from the blocking bugs:
* Bug #33126 needs review: otherwise the tests will fail, because they detect
a previously undetected memory leak
* Bug #33336 needs review: it provides an invariant that we rely on here
* Bug #33342 needs review: this branch makes use of the refactoring there
Because of Bug #33342, I think this is probably a job for 1.5. I've branched 1.4 and will use that to commit small bugfixes.
*** Bug 33837 has been marked as a duplicate of this bug. ***
Comment on attachment 42466 [details] [review]
bus-test: close fds beyond 2 on startup
I've cloned Bug #35173 for the unimportant patch attached here.
Other than that, I owe Colin a stylistic patch for not using iter->data all over the place, and this bug is also blocked by re-review of Bug #33336 and review of Bug #33342.
Created attachment 45391 [details] [review]
dbus-mainloop: stylistic fix: don't use link->data directly
This amends Attachment #42465 [details] with Colin's suggested stylistic change. In the git branch whose gitweb is in the URL field, I've squashed it into the previous commit.
not for 1.4
Review of attachment 45391 [details] [review]:
Nothing particular, this just works.
(In reply to comment #16)
> Review of attachment 45391 [details] [review]:
> Nothing particular, this just works.
Thanks, any thoughts on the bugs that block this one?
Review of attachment 42465 [details] [review]:
Simple questions, the patch looks good.
@@ +257,3 @@
+ DBusList **watches;
+ _dbus_warn ("invalid request, socket fd %d not open\n", fd);
Why is this unconditional warning here?
@@ +268,3 @@
+ _dbus_hash_table_remove_int (loop->watches, fd);
Doesn't this need to free the memory for loop->watches too?
@@ +605,2 @@
fds = dbus_new0 (DBusPollFD, loop->watch_count);
Here we allocate memory for loop->watch_count elements, which may be higher than the number of fds.
@@ +844,3 @@
+ watches = _dbus_hash_table_lookup_int (loop->watches, fds[i].fd);
+ if (watches == NULL)
Is this condition possible? Is this a case for _dbus_assert?
(In reply to comment #18)
> Review of attachment 42465 [details] [review]:
> Simple questions, the patch looks good.
> ::: dbus/dbus-mainloop.c
> @@ +257,3 @@
> + DBusList **watches;
> + _dbus_warn ("invalid request, socket fd %d not open\n", fd);
> Why is this unconditional warning here?
As discussed on IRC, that function is only called to recover from something that shouldn't happen (it's the fix for Bug #32992, factored out into a function). I've renamed it to cull_watches_for_invalid_fd() as you suggested.
> @@ +268,3 @@
> + }
> + _dbus_hash_table_remove_int (loop->watches, fd);
> Doesn't this need to free the memory for loop->watches too?
If you mean "doesn't this need to free the value corresponding to fd?", then no, the hash table has a free-function (free_watch_table_entry) which does that automatically when the key is removed or when the hash table itself is destroyed.
If you mean "shouldn't this free the hash table?", then no, it shouldn't be freed until the last unref of the DBusLoop.
> @@ +605,2 @@
> _dbus_wait_for_memory ();
> fds = dbus_new0 (DBusPollFD, loop->watch_count);
> Here we allocate memory for loop->watch_count elements, which may be higher
> than the number of fds.
Right, the important thing is that it's at least the number of fds. You're right that this will typically be using a bit more memory than it needs to; I could use _dbus_hash_table_size() if you think this is significant enough.
Strictly speaking we only actually need enough memory for the number of *enabled* fds, but I don't think the extra complexity of keeping a count of those is justified. Bug #33337 switches to using epoll on modern Linux, at which point it's the kernel's problem (all other OSs get a slow-path using poll, like we do now, but there's infrastructure by which we could plug in a scalable epoll-like interface on other Oss if anyone cares enough).
> @@ +844,3 @@
> + watches = _dbus_hash_table_lookup_int (loop->watches, fds[i].fd);
> + if (watches == NULL)
> Is this condition possible? Is this a case for _dbus_assert?
It shouldn't be possible; I'm not sure I want to bet on that though.
Created attachment 47893 [details] [review]
 Rename cull_watches_for_fd() based on Thiago's review
To be squashed into Attachment #42465 [details] for commit, probably.
Fixed in git for 1.5.6. The refactoring that this required is too big for 1.4, I think.