Maemo's dbus-daemon apparently sometimes goes into a busy-loop because bad fds have been placed into the main loop. This shouldn't happen in correct code, but apparently it sometimes does (there must be another bug somewhere in the jungle of fd watches). The poll() call immediately returns, with the POLLNVAL condition set on the bad fds; we don't do anything about it, iterate again, and so on. This is vaguely related to Bug #23194.
Amusingly, when I hacked up _dbus_poll() to abort on bad file descriptors, one of the regression tests aborted. It doesn't seem to be 100% reproducible, though.
Created attachment 41912 [details] [review] [1/4] dbus-spawn: don't leave bad file descriptors being watched The code called from handle_watch() might close either or both of the sockets we're watching, without cleaning up the DBusWatch. This results in invalid file descriptors being passed to _dbus_poll(), which could end up busy-looping on a POLLNVAL condition until the babysitter loses its last ref (which automatically clears up both watches).
Created attachment 41914 [details] [review] [2/4] _dbus_loop_iterate: if a watch has been invalidated, skip it This shouldn't happen - other modules are responsible for cleaning up their watches - but the bug fixed in my last commit has been present for several years and I'm sure it's not the only one, so for robustness, let's refuse to watch obviously-wrong file descriptors.
Created attachment 41916 [details] [review] [3/4] _dbus_loop_iterate: cleanup: make more use of a temporary variable It was added by the previous commit.
Created attachment 41917 [details] [review] [4/4] _dbus_loop_iterate: if the kernel says a fd is bad, stop watching it Again, this shouldn't happen - modules are responsible for cleaning up their watches - but the failure mode here is really bad: if we leave an invalid fd in the set, every poll() call will instantly return, marking it as POLLNVAL. The result is that dbus-daemon busy-loops on poll() without responding to I/O, so the bad watch will probably never be cleared up. Bug: https://bugs.freedesktop.org/show_bug.cgi?id=32992 Bug-NB: NB#200248
I am not a dbus reviewer, but for my understanding of dbus they seem pretty fine
Stepping back a bit, does anyone have any idea what the heck is going on in dbus-spawn.c? Why is there an "intermediate child between the main process and the grandchild"?? Why aren't we just executing the child directly? The commit message doesn't say...
(In reply to comment #7) > Stepping back a bit, does anyone have any idea what the heck is going on in > dbus-spawn.c? We just don't know. To be honest I don't really want to risk draining that particular swamp in 1.4.x - I'm sure it'd be possible to make it much much nicer, but the current code "mostly" works, so any extensive refactoring seems like a job for 1.5. To be honest we should probably have branched 1.4 as soon as 1.4.0 was released? In Telepathy, we always maintain a stable branch in addition to master, and encourage people to develop all stable-suitable bugfixes against the stable branch and merge the stable branch to master periodically.
I forgot to mention, the busy-loop can be reproduced by adding a "service" that immediately exits 0: [D-BUS Service] Name=bin.true Exec=/bin/true and service-activating it. I think it's a race in the activation code, and when the service is as quick to exit as that, the code path in which the fd isn't cleared up nearly always wins the race.
May I at least apply patch 1 (Attachment #41912 [details])? I know dbus-spawn is full of bees and I haven't fixed that yet, but this is a reproducible busy-loop/DoS, so it seems like something to fix minimally in a stable branch. Patches 2 and 4 are defensive programming rather than fixes for a known bug, but again, they seem worthwhile for 1.4 to me. I've checked that they have the desired effect when patch 1 is reverted temporarily. FWIW, we've applied all of these in Maemo.
(In reply to comment #10) > May I at least apply patch 1 (Attachment #41912 [details])? I know dbus-spawn is full of > bees and I haven't fixed that yet, but this is a reproducible busy-loop/DoS, so > it seems like something to fix minimally in a stable branch. > > Patches 2 and 4 are defensive programming rather than fixes for a known bug, > but again, they seem worthwhile for 1.4 to me. I've checked that they have the > desired effect when patch 1 is reverted temporarily. The dbus-spawn code terrifies me, your change "feels" safe but a through analysis would require writing up some sort of diagram of the processes and their open file descriptors and state. Feel free to commit. Patches 2-4 look fine to me.
Thanks, all applied in git for 1.4.4. Bug #33336 has a couple of patches with related improvements; up to you whether they're candidates for 1.4 or 1.5.
(In reply to comment #7) > Stepping back a bit, does anyone have any idea what the heck is going on in > dbus-spawn.c? Why is there an "intermediate child between the main process and > the grandchild"?? Why aren't we just executing the child directly? Apparently this is a trick to avoid having the parent deal with SIGCHLD if it wasn't expecting it, see: http://compgroups.net/comp.unix.programmer/the-need-for-double-forking
(In reply to comment #13) > Apparently this is a trick to avoid having the parent deal with SIGCHLD Ah, well spotted! I've copied your comment to Bug #33335; any further discussion of why dbus-spawn.c is such a tangle there, please.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.