Bug 74698 - dbus-launch: has non-obvious critical sections in which an X error would cause process leaks
Summary: dbus-launch: has non-obvious critical sections in which an X error would caus...
Status: RESOLVED MOVED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: 1.5
Hardware: All All
: medium normal
Assignee: D-Bus Maintainers
QA Contact: D-Bus Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-07 20:38 UTC by Роман Донченко
Modified: 2018-10-12 21:17 UTC (History)
5 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Proposed patch (1.02 KB, patch)
2014-02-07 20:38 UTC, Роман Донченко
Details | Splinter Review
Proposed patch v2 (1.04 KB, patch)
2014-02-10 20:22 UTC, Роман Донченко
Details | Splinter Review
Kill bus if can't attach to session (427 bytes, patch)
2014-05-02 15:41 UTC, Роман Донченко
Details | Splinter Review

Description Роман Донченко 2014-02-07 20:38:07 UTC
Created attachment 93633 [details] [review]
Proposed patch

If an X error occurs before the main dbus-launch process transmits the daemon's PID to the babysitter, x_io_error_handler will try to kill the bus with bus_pid_to_kill still set to -1. This will result in killing every process that dbus-launch can get its hands on.

I can easily reproduce this by running xvfb-run sh -c 'dbus-launch --autolaunch=11111111111111111111111111111111 &' a bunch of times.

The attached patch fixes this by making sure bus_pid_to_kill doesn't try to kill -1. It also fixes a related problem: if the X error happens after the fork but before the babysitter gets the bus PID, the bus won't be killed, because bus_pid_to_kill isn't initialized yet.
Comment 1 Simon McVittie 2014-02-10 11:04:58 UTC
Comment on attachment 93633 [details] [review]
Proposed patch

Review of attachment 93633 [details] [review]:
-----------------------------------------------------------------

This seems a good change, yes. A couple of coding style points, and I'd prefer the check for not-really-a-pid things to be a bit more comprehensive:

::: tools/dbus-launch.c
@@ +406,4 @@
>  static void
>  kill_bus(void)
>  {
> +  if (bus_pid_to_kill == -1) return;

D-Bus coding style is approximately GNU style, the same as in GNOME:

  if (bus_pid_to_kill == -1)
    return;

Also, I'd prefer it if this checked for "is a real pid" vs. "is some sort of pseudo-pid":

  if (bus_pid_to_kill <= 0)
    return;

(The IDs of actual processes are strictly positive: pid < -1, pid == -1 and pid == 0 are all special values, see kill(2).)

@@ +1275,5 @@
>  	}
>  
> +      /* Have to initialize bus_pid_to_kill ASAP, so that the
> +         X error callback can kill it if an error happens. */
> +      bus_pid_to_kill = bus_pid = val;

We don't usually use chained assignments, so I'd prefer it like this:

  bus_pid = val;
  /* Have to initialize [...] error happens. */
  bus_pid_to_kill = val;
Comment 2 Simon McVittie 2014-02-10 11:09:45 UTC
(In reply to comment #0)
> If an X error occurs before the main dbus-launch process transmits the
> daemon's PID to the babysitter

Which X error can occur under normal circumstances?

> This will result in killing every
> process that dbus-launch can get its hands on.

That's obviously bad, but dbus-launch should do *something* in response to an X error: if it hasn't forked the child that will execute dbus-daemon yet, it should issue a warning to stderr (or possibly syslog) and exit with nonzero status. Does it?

I can't help thinking that dbus-launch ought to be simpler - I'm not sure why the babysitter is the "bus runner"'s sibling, rather than its parent...
Comment 3 Simon McVittie 2014-02-10 11:18:47 UTC
At a high level, what we want is: when something goes fatally wrong (X error, D-Bus error, ...), all the dbus-launch processes that have been started exit, and so does the dbus-daemon.

(In reply to comment #0)
> I can easily reproduce this by running xvfb-run sh -c 'dbus-launch
> --autolaunch=11111111111111111111111111111111 &' a bunch of times.

It looks as though what you're doing here is manufacturing a race condition:

* xvfb-run starts an X server
* "sh -c" runs dbus-launch in the background
* "sh -c" has nothing more to do, and exits
* RACE:
  * dbus-launch connects to the X server
  * xvfb-run detects that "sh -c" has exited, and tells the X server to exit

If dbus-launch wins the race then it connects to the X server before the X server exits, sets up its dbus-daemon, receives an X error, kills its dbus-daemon and exits. All good.

If xvfb-run wins the race then dbus-launch gets an X error before it has forked, and does the incorrect kill(-1).

It's bad that dbus-launch kills unrelated processes, but I don't see how this would happen "in real life"?

With your patch, are the child processes (bus-runner/dbus-daemon, babysitter, "intermediate parent") guaranteed to exit or be killed?
Comment 4 Роман Донченко 2014-02-10 20:19:44 UTC
Regarding the coding style issues - sure, I'll upload a reformatted version of the patch. I don't think bus_pid_to_kill can ever contain any special value other than -1 (unless dbus-daemon decides to sabotage it), but I suppose for extra safety it may as well be a more broad check.

(In reply to comment #2)
> (In reply to comment #0)
> > If an X error occurs before the main dbus-launch process transmits the
> > daemon's PID to the babysitter
> 
> Which X error can occur under normal circumstances?

An error is abnormal pretty much by definition; I don't know what you're asking here.

> That's obviously bad, but dbus-launch should do *something* in response to
> an X error: if it hasn't forked the child that will execute dbus-daemon yet,
> it should issue a warning to stderr (or possibly syslog) and exit with
> nonzero status. Does it?

No, not currently. I suppose x_io_error_handler can easily be amended to do that; although it can't distinguish between being in the main process and being in the babysitter, so the babysitter will have the same response to an X error.

> I can't help thinking that dbus-launch ought to be simpler - I'm not sure
> why the babysitter is the "bus runner"'s sibling, rather than its parent...

So that it doesn't become a zombie child of the main process.

(In reply to comment #3)
> It looks as though what you're doing here is manufacturing a race condition:

Exactly.

> If xvfb-run wins the race then dbus-launch gets an X error before it has
> forked, and does the incorrect kill(-1).

This can also happen if dbus-launch gets an X error after it has forked but before it has sent the bus's PID to the babysitter.

> It's bad that dbus-launch kills unrelated processes, but I don't see how
> this would happen "in real life"?

I did actually encounter this bug in real life - I was running a graphical program with xvfb-run that indirectly ran dbus-launch (not sure how exactly that worked, because I stopped debugging once I found the culprit - it had to do with Ubuntu's overlay scrollbars, GSettings and dconf), and occasionally it would do kill(-1).

> With your patch, are the child processes (bus-runner/dbus-daemon,
> babysitter, "intermediate parent") guaranteed to exit or be killed?

I think so. If the X error occurs before the fork, there won't be any child processes to begin with. If it occurs after the fork, then the bus runner and the intermediate parent will exit normally, the main process will kill the bus daemon and exit, and the babysitter will exit because bus_pid_to_babysitter_pipe will be closed.
Comment 5 Роман Донченко 2014-02-10 20:22:13 UTC
Created attachment 93802 [details] [review]
Proposed patch v2
Comment 6 Роман Донченко 2014-03-08 16:09:37 UTC
Ping? Anything else you want me to do?
Comment 7 Simon McVittie 2014-03-10 11:17:40 UTC
> (In reply to comment #2)
> > Which X error can occur under normal circumstances?
> 
> An error is abnormal pretty much by definition; I don't know what you're
> asking here.

I suppose what I was really asking is: what's (an example of) the real-world situation in which an X error can occur at a time that triggers this bug? Or is it purely theoretical? - and you answered that already.

> > That's obviously bad, but dbus-launch should do *something* in response to
> > an X error: if it hasn't forked the child that will execute dbus-daemon yet,
> > it should issue a warning to stderr (or possibly syslog) and exit with
> > nonzero status. Does it?
> 
> No, not currently. I suppose x_io_error_handler can easily be amended to do
> that; although it can't distinguish between being in the main process and
> being in the babysitter, so the babysitter will have the same response to an
> X error.

I would like to be sure that our response to an X error is never "oh well, carry on regardless"; dbus-launch must always respond to an X error by either killing the dbus-daemon, or making sure that it will not start the dbus-daemon.

If your current proposed patch is intended to do that, I'll try to find time to trace through the various code paths and make sure it does.

> > I can't help thinking that dbus-launch ought to be simpler - I'm not sure
> > why the babysitter is the "bus runner"'s sibling, rather than its parent...
> 
> So that it doesn't become a zombie child of the main process.

I can see that that's why it doesn't go like this:

   shell [*]
     \- main()               --exec--> myapp[*]
        \- babysitter[*]
           \- bus-runner     --exec--> dbus-daemon --fork
                                       \- final dbus-daemon[*]

   [*]: long-lived process

because if the babysitter gets an X error and exits, myapp might never reap it and so it'll become a zombie.

I don't see why not this, though:

   shell [*]
     \- main()               --exec--> myapp[*]
        \- "intermediate parent"
           \- babysitter[*]
              \- bus-runner     --exec--> dbus-daemon --fork
                                          \- final dbus-daemon[*]

or indeed this:

   shell [*]
     \- main()               --exec--> myapp[*]
        \- "intermediate parent"
           \- babysitter[*]
              \- bus-runner     --exec--> dbus-daemon --nofork

If one of those makes it easier to get this right, I'd consider that patch.
Comment 8 Роман Донченко 2014-03-10 13:29:45 UTC
> If your current proposed patch is intended to do that

Yep, that is the purpose of the 2nd and 3rd hunks. In fact, I did test that with just the 1st hunk, my original test case sometimes leaves dbus-daemon processes running, and with all 3 it doesn't.

I did find another suspicious spot just now, though: in kill_bus_when_session_ends, in the branch that prints "No terminal on standard input and no X display" the exit(1) should probably be a kill_bus_and_exit(1). Would you agree?

> I don't see why not this, though:

I see what you mean, and I have no idea. :-) I don't think this has any bearing on the applicability of this patch, anyway.
Comment 9 Simon McVittie 2014-04-28 15:43:23 UTC
(In reply to comment #8)
> I did find another suspicious spot just now, though: in
> kill_bus_when_session_ends, in the branch that prints "No terminal on
> standard input and no X display" the exit(1) should probably be a
> kill_bus_and_exit(1). Would you agree?

Yes. Please do/test that.

> [ensuring death of the dbus-daemon] is the purpose of the 2nd and 3rd hunks.
> In fact, I did test that
> with just the 1st hunk, my original test case sometimes leaves dbus-daemon
> processes running, and with all 3 it doesn't.

Your patch does appear to be correct, but the fact that it is correct is far too subtle (indeed, everything about dbus-launch is far too subtle). The timeline of the parent dbus-launch process goes something like this:

* do initial X calls
[1]
* do some other stuff
* fork
    (child process starts doing some other stuff)
* return "intermediate parent" pid from fork()
* obtain bus daemon pid from bus_pid_to_launcher_pipe
[2]
* do things that might include X11 calls or killing the dbus-daemon

Meanwhile, the babysitter goes like this:

* return 0 from fork()
[3]
* obtain bus daemon pid from parent process via bus_pid_to_babysitter_pipe
[4]
* do things that might include X11 calls or killing the bus daemon

AFAICS, the 1st hunk of your patch relies for its correctness on the fact that control does not pass into libX11 in any way between [1] and [2], or between [3] and [4], so there's no way we can receive an error indication and try to kill the dbus-daemon during that interval. If we did, we would previously have done a kill(-1) and blown up the user session, which is at least visible. Now, we do nothing.

The correct thing to do about an X error is:

* before [1] or [3]: don't kill anything. Current implementation wrongly kills -1, your implementation is right.

* after [2] or [4]: kill the dbus-daemon. Current implementation wrongly kills -1 in rare cases, your implementation is right.

* between [1] and [2], or [3] and [4]: there is no correct thing we can do so we'd better be really sure it can't happen. Current implementation would kill -1, your implementation would not kill anything, both are wrong.

This is pretty annoying. The only resolutions I can think of are:

* add a comment briefly explaining the situation (with a reference to this bug) and make really sure we never add any libX11 calls in those critical sections in future

* track whether we are in one of those critical sections (maybe via bus_pid_to_kill being 0, -1 or positive to mean "bus not started", "bus potentially started but pid unknown" or "here is its pid", or via a separate tristate variable), and if we are, defer exiting until we have tried to discover the right pid to kill

Any thoughts/ideas from the other maintainers?

This reinforces the opinion I have had for a while that dbus-launch should be specific to dbus-on-X11 and deprecated in favour of systemd user sessions and/or dbus-run-session, and that in the glorious and semi-hypothetical Wayland (and/or Mir and/or SurfaceFlinger) future, dbus-launch should just not exist.
Comment 10 Thiago Macieira 2014-04-28 20:28:32 UTC
Comment on attachment 93802 [details] [review]
Proposed patch v2

Review of attachment 93802 [details] [review]:
-----------------------------------------------------------------

Looks good.
Comment 11 Thiago Macieira 2014-04-28 20:32:13 UTC
The contents of the patch are ok by itself, since we definitely do not want to ever kill -1. I really don't remember what the convoluted babysitter code is doing. I also don't believe I fully understood it when I wrote the X launcher back in 2006.

This patch is better than nothing. More comments explaining what to do later or not do would be welcome.
Comment 12 Simon McVittie 2014-04-30 18:52:36 UTC
Fixed in git for 1.8.2, 1.9.0. Leaving this bug open with a new title.
Comment 13 Роман Донченко 2014-05-02 15:38:52 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I did find another suspicious spot just now, though: in
> > kill_bus_when_session_ends, in the branch that prints "No terminal on
> > standard input and no X display" the exit(1) should probably be a
> > kill_bus_and_exit(1). Would you agree?
> 
> Yes. Please do/test that.

OK, I'll attach the patch in a moment. I tested that running "dbus-launch --exit-with-session </dev/null" without the change leaves a dbus-daemon process running; with the change, that process is killed.
Comment 14 Роман Донченко 2014-05-02 15:41:37 UTC
Created attachment 98341 [details] [review]
Kill bus if can't attach to session
Comment 15 Simon McVittie 2014-05-02 16:08:52 UTC
Comment on attachment 98341 [details] [review]
Kill bus if can't attach to session

Review of attachment 98341 [details] [review]:
-----------------------------------------------------------------

Looks good, I'll apply it for 1.8.4
Comment 16 Simon McVittie 2014-06-11 10:34:54 UTC
(In reply to comment #15)
> Looks good, I'll apply it for 1.8.4

1.8.4 was a security release with only one change. I'll apply it for 1.8.6 instead.
Comment 17 Simon McVittie 2014-06-11 10:46:48 UTC
Comment on attachment 93802 [details] [review]
Proposed patch v2

was applied, marking obsolete
Comment 18 Simon McVittie 2014-06-11 10:57:11 UTC
Comment on attachment 98341 [details] [review]
Kill bus if can't attach to session

Fixed in git for 1.8.6, 1.9.0.

As a note for the future, it's helpful if you can provide patches in "git format-patch" format so that we can attribute them correctly.
Comment 19 GitLab Migration User 2018-10-12 21:17:38 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/dbus/dbus/issues/96.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.