|Summary:||bad fds in _dbus_loop_iterate cause busylooping|
|Product:||dbus||Reporter:||Simon McVittie <smcv>|
|Component:||core||Assignee:||Simon McVittie <smcv>|
|Status:||RESOLVED FIXED||QA Contact:||John (J5) Palmieri <johnp>|
|Priority:||medium||CC:||cosimo.alfarano, hp, walters|
|i915 platform:||i915 features:|
[1/4] dbus-spawn: don't leave bad file descriptors being watched
[2/4] _dbus_loop_iterate: if a watch has been invalidated, skip it
[3/4] _dbus_loop_iterate: cleanup: make more use of a temporary variable
[4/4] _dbus_loop_iterate: if the kernel says a fd is bad, stop watching it
Description Simon McVittie 2011-01-11 04:27:04 UTC
Maemo's dbus-daemon apparently sometimes goes into a busy-loop because bad fds have been placed into the main loop. This shouldn't happen in correct code, but apparently it sometimes does (there must be another bug somewhere in the jungle of fd watches). The poll() call immediately returns, with the POLLNVAL condition set on the bad fds; we don't do anything about it, iterate again, and so on. This is vaguely related to Bug #23194.
Comment 1 Simon McVittie 2011-01-11 10:38:19 UTC
Amusingly, when I hacked up _dbus_poll() to abort on bad file descriptors, one of the regression tests aborted. It doesn't seem to be 100% reproducible, though.
Comment 2 Simon McVittie 2011-01-12 04:24:37 UTC
Created attachment 41912 [details] [review] [1/4] dbus-spawn: don't leave bad file descriptors being watched The code called from handle_watch() might close either or both of the sockets we're watching, without cleaning up the DBusWatch. This results in invalid file descriptors being passed to _dbus_poll(), which could end up busy-looping on a POLLNVAL condition until the babysitter loses its last ref (which automatically clears up both watches).
Comment 3 Simon McVittie 2011-01-12 04:25:08 UTC
Created attachment 41914 [details] [review] [2/4] _dbus_loop_iterate: if a watch has been invalidated, skip it This shouldn't happen - other modules are responsible for cleaning up their watches - but the bug fixed in my last commit has been present for several years and I'm sure it's not the only one, so for robustness, let's refuse to watch obviously-wrong file descriptors.
Comment 4 Simon McVittie 2011-01-12 04:25:42 UTC
Created attachment 41916 [details] [review] [3/4] _dbus_loop_iterate: cleanup: make more use of a temporary variable It was added by the previous commit.
Comment 5 Simon McVittie 2011-01-12 04:26:22 UTC
Created attachment 41917 [details] [review] [4/4] _dbus_loop_iterate: if the kernel says a fd is bad, stop watching it Again, this shouldn't happen - modules are responsible for cleaning up their watches - but the failure mode here is really bad: if we leave an invalid fd in the set, every poll() call will instantly return, marking it as POLLNVAL. The result is that dbus-daemon busy-loops on poll() without responding to I/O, so the bad watch will probably never be cleared up. Bug: https://bugs.freedesktop.org/show_bug.cgi?id=32992 Bug-NB: NB#200248
Comment 6 Cosimo Alfarano 2011-01-13 09:13:44 UTC
I am not a dbus reviewer, but for my understanding of dbus they seem pretty fine
Comment 7 Colin Walters 2011-01-13 14:20:29 UTC
Stepping back a bit, does anyone have any idea what the heck is going on in dbus-spawn.c? Why is there an "intermediate child between the main process and the grandchild"?? Why aren't we just executing the child directly? The commit message doesn't say...
Comment 8 Simon McVittie 2011-01-14 04:02:25 UTC
(In reply to comment #7) > Stepping back a bit, does anyone have any idea what the heck is going on in > dbus-spawn.c? We just don't know. To be honest I don't really want to risk draining that particular swamp in 1.4.x - I'm sure it'd be possible to make it much much nicer, but the current code "mostly" works, so any extensive refactoring seems like a job for 1.5. To be honest we should probably have branched 1.4 as soon as 1.4.0 was released? In Telepathy, we always maintain a stable branch in addition to master, and encourage people to develop all stable-suitable bugfixes against the stable branch and merge the stable branch to master periodically.
Comment 9 Simon McVittie 2011-01-14 04:07:25 UTC
I forgot to mention, the busy-loop can be reproduced by adding a "service" that immediately exits 0: [D-BUS Service] Name=bin.true Exec=/bin/true and service-activating it. I think it's a race in the activation code, and when the service is as quick to exit as that, the code path in which the fd isn't cleared up nearly always wins the race.
Comment 10 Simon McVittie 2011-01-18 09:08:29 UTC
May I at least apply patch 1 (Attachment #41912 [details])? I know dbus-spawn is full of bees and I haven't fixed that yet, but this is a reproducible busy-loop/DoS, so it seems like something to fix minimally in a stable branch. Patches 2 and 4 are defensive programming rather than fixes for a known bug, but again, they seem worthwhile for 1.4 to me. I've checked that they have the desired effect when patch 1 is reverted temporarily. FWIW, we've applied all of these in Maemo.
Comment 11 Colin Walters 2011-01-20 14:15:00 UTC
(In reply to comment #10) > May I at least apply patch 1 (Attachment #41912 [details])? I know dbus-spawn is full of > bees and I haven't fixed that yet, but this is a reproducible busy-loop/DoS, so > it seems like something to fix minimally in a stable branch. > > Patches 2 and 4 are defensive programming rather than fixes for a known bug, > but again, they seem worthwhile for 1.4 to me. I've checked that they have the > desired effect when patch 1 is reverted temporarily. The dbus-spawn code terrifies me, your change "feels" safe but a through analysis would require writing up some sort of diagram of the processes and their open file descriptors and state. Feel free to commit. Patches 2-4 look fine to me.
Comment 12 Simon McVittie 2011-01-24 07:41:05 UTC
Thanks, all applied in git for 1.4.4. Bug #33336 has a couple of patches with related improvements; up to you whether they're candidates for 1.4 or 1.5.
Comment 13 Colin Walters 2011-03-02 12:51:35 UTC
(In reply to comment #7) > Stepping back a bit, does anyone have any idea what the heck is going on in > dbus-spawn.c? Why is there an "intermediate child between the main process and > the grandchild"?? Why aren't we just executing the child directly? Apparently this is a trick to avoid having the parent deal with SIGCHLD if it wasn't expecting it, see: http://compgroups.net/comp.unix.programmer/the-need-for-double-forking