Bug 32992 - bad fds in _dbus_loop_iterate cause busylooping
Summary: bad fds in _dbus_loop_iterate cause busylooping
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: 1.4.x
Hardware: Other All
: medium normal
Assignee: Simon McVittie
QA Contact: John (J5) Palmieri
URL: http://git.collabora.co.uk/?p=user/sm...
Whiteboard: NB#200248
Keywords: patch
Depends on:
Blocks:
 
Reported: 2011-01-11 04:27 UTC by Simon McVittie
Modified: 2011-03-03 03:19 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:


Attachments
[1/4] dbus-spawn: don't leave bad file descriptors being watched (1.86 KB, patch)
2011-01-12 04:24 UTC, Simon McVittie
Details | Splinter Review
[2/4] _dbus_loop_iterate: if a watch has been invalidated, skip it (1.78 KB, patch)
2011-01-12 04:25 UTC, Simon McVittie
Details | Splinter Review
[3/4] _dbus_loop_iterate: cleanup: make more use of a temporary variable (1.93 KB, patch)
2011-01-12 04:25 UTC, Simon McVittie
Details | Splinter Review
[4/4] _dbus_loop_iterate: if the kernel says a fd is bad, stop watching it (1.49 KB, patch)
2011-01-12 04:26 UTC, Simon McVittie
Details | Splinter Review

Description Simon McVittie 2011-01-11 04:27:04 UTC
Maemo's dbus-daemon apparently sometimes goes into a busy-loop because bad fds have been placed into the main loop.

This shouldn't happen in correct code, but apparently it sometimes does (there must be another bug somewhere in the jungle of fd watches). The poll() call immediately returns, with the POLLNVAL condition set on the bad fds; we don't do anything about it, iterate again, and so on.

This is vaguely related to Bug #23194.
Comment 1 Simon McVittie 2011-01-11 10:38:19 UTC
Amusingly, when I hacked up _dbus_poll() to abort on bad file descriptors, one of the regression tests aborted. It doesn't seem to be 100% reproducible, though.
Comment 2 Simon McVittie 2011-01-12 04:24:37 UTC
Created attachment 41912 [details] [review]
[1/4] dbus-spawn: don't leave bad file descriptors being watched

The code called from handle_watch() might close either or both of the
sockets we're watching, without cleaning up the DBusWatch. This results
in invalid file descriptors being passed to _dbus_poll(), which could
end up busy-looping on a POLLNVAL condition until the babysitter loses
its last ref (which automatically clears up both watches).
Comment 3 Simon McVittie 2011-01-12 04:25:08 UTC
Created attachment 41914 [details] [review]
[2/4] _dbus_loop_iterate: if a watch has been invalidated, skip it

This shouldn't happen - other modules are responsible for cleaning up
their watches - but the bug fixed in my last commit has been present for
several years and I'm sure it's not the only one, so for robustness,
let's refuse to watch obviously-wrong file descriptors.
Comment 4 Simon McVittie 2011-01-12 04:25:42 UTC
Created attachment 41916 [details] [review]
[3/4] _dbus_loop_iterate: cleanup: make more use of a temporary variable

It was added by the previous commit.
Comment 5 Simon McVittie 2011-01-12 04:26:22 UTC
Created attachment 41917 [details] [review]
[4/4] _dbus_loop_iterate: if the kernel says a fd is bad, stop watching it

Again, this shouldn't happen - modules are responsible for cleaning up
their watches - but the failure mode here is really bad: if we leave an
invalid fd in the set, every poll() call will instantly return, marking
it as POLLNVAL. The result is that dbus-daemon busy-loops on poll()
without responding to I/O, so the bad watch will probably never be
cleared up.

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=32992
Bug-NB: NB#200248
Comment 6 Cosimo Alfarano 2011-01-13 09:13:44 UTC
I am not a dbus reviewer, but for my understanding of dbus they seem pretty fine
Comment 7 Colin Walters 2011-01-13 14:20:29 UTC
Stepping back a bit, does anyone have any idea what the heck is going on in dbus-spawn.c?  Why is there an "intermediate child between the main process and the grandchild"??  Why aren't we just executing the child directly?

The commit message doesn't say...
Comment 8 Simon McVittie 2011-01-14 04:02:25 UTC
(In reply to comment #7)
> Stepping back a bit, does anyone have any idea what the heck is going on in
> dbus-spawn.c?

We just don't know. To be honest I don't really want to risk draining that particular swamp in 1.4.x - I'm sure it'd be possible to make it much much nicer, but the current code "mostly" works, so any extensive refactoring seems like a job for 1.5.

To be honest we should probably have branched 1.4 as soon as 1.4.0 was released? In Telepathy, we always maintain a stable branch in addition to master, and encourage people to develop all stable-suitable bugfixes against the stable branch and merge the stable branch to master periodically.
Comment 9 Simon McVittie 2011-01-14 04:07:25 UTC
I forgot to mention, the busy-loop can be reproduced by adding a "service" that immediately exits 0:

[D-BUS Service]
Name=bin.true
Exec=/bin/true

and service-activating it. I think it's a race in the activation code, and when the service is as quick to exit as that, the code path in which the fd isn't cleared up nearly always wins the race.
Comment 10 Simon McVittie 2011-01-18 09:08:29 UTC
May I at least apply patch 1 (Attachment #41912 [details])? I know dbus-spawn is full of bees and I haven't fixed that yet, but this is a reproducible busy-loop/DoS, so it seems like something to fix minimally in a stable branch.

Patches 2 and 4 are defensive programming rather than fixes for a known bug, but again, they seem worthwhile for 1.4 to me. I've checked that they have the desired effect when patch 1 is reverted temporarily.

FWIW, we've applied all of these in Maemo.
Comment 11 Colin Walters 2011-01-20 14:15:00 UTC
(In reply to comment #10)
> May I at least apply patch 1 (Attachment #41912 [details])? I know dbus-spawn is full of
> bees and I haven't fixed that yet, but this is a reproducible busy-loop/DoS, so
> it seems like something to fix minimally in a stable branch.
> 
> Patches 2 and 4 are defensive programming rather than fixes for a known bug,
> but again, they seem worthwhile for 1.4 to me. I've checked that they have the
> desired effect when patch 1 is reverted temporarily.

The dbus-spawn code terrifies me, your change "feels" safe but a through analysis would require writing up some sort of diagram of the processes and their open file descriptors and state.  Feel free to commit.

Patches 2-4 look fine to me.
Comment 12 Simon McVittie 2011-01-24 07:41:05 UTC
Thanks, all applied in git for 1.4.4. Bug #33336 has a couple of patches with related improvements; up to you whether they're candidates for 1.4 or 1.5.
Comment 13 Colin Walters 2011-03-02 12:51:35 UTC
(In reply to comment #7)
> Stepping back a bit, does anyone have any idea what the heck is going on in
> dbus-spawn.c?  Why is there an "intermediate child between the main process and
> the grandchild"??  Why aren't we just executing the child directly?

Apparently this is a trick to avoid having the parent deal with SIGCHLD if it wasn't expecting it, see:

http://compgroups.net/comp.unix.programmer/the-need-for-double-forking
Comment 14 Simon McVittie 2011-03-03 03:19:42 UTC
(In reply to comment #13)
> Apparently this is a trick to avoid having the parent deal with SIGCHLD

Ah, well spotted! I've copied your comment to Bug #33335; any further discussion of why dbus-spawn.c is such a tangle there, please.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.