Bug 100843 - libdbus calls malloc() after fork(), causing clients to deadlock
Summary: libdbus calls malloc() after fork(), causing clients to deadlock
Status: RESOLVED MOVED
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: git master
Hardware: Other Linux (All)
: medium normal
Assignee: D-Bus Maintainers
QA Contact: D-Bus Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-27 09:15 UTC by Primiano Tucci
Modified: 2018-10-12 21:30 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments

Description Primiano Tucci 2017-04-27 09:15:16 UTC
TL;DR
dbus-autolaunch causes random hangs, if the hosting process uses an allocator which doesn't have multithreads+fork protections.
This is because a malloc() happens after fork() in _dbus_close_all() via opendir().

Related chromium bugs:
https://crbug.com/695643
https://crbug.com/715658

We had a number of bug reports in chromium where chrome just hangs during startup on Linux.
I narrowed down the cause to libdbus.
The scenario is the following:
- somebody starts chrome in a session that doesn't have a dbus running. This is very common in test harness that run via xvfb. doing that triggers dbus-autolaunch.

- The following happens in the caller process (chrome, in our case):
_dbus_transport_open()
  _dbus_transport_open_autolaunch()
    _dbus_transport_new_for_autolaunch()
      _dbus_get_autolaunch_address()
        _read_subprocess_line_argv()
          fork()
            wait() (on the pipe)

- While in the child process:
 fork()
   _dbus_close_all()
     opendir("/proc/self/fd")
       malloc()


Calling malloc in a fork()ed process is bad: if the allocator has no atfork handler, doing that will cause random hangs. More details in crbug.com/695643 .

All this seems to come from an optimization in _dbus_close_all() for Linux [1]. You folks seem to have already a fallback there that is doing a:
  maxfds = sysconf (_SC_OPEN_MAX);
  for (i = 3; i < maxfds; i++)
    close (i);
Which is good and harmless.

The problem instead is when you opendir(/proc/self/fd). Sadly opendir() implies a malloc.
Doing a fork()+malloc() is just asking for troubles. Given that libdbus is a library and not an hemertic executable, IMHO it isn't sensible to expect that hosting process has an allocator with a post-fork handler which handles this corner case.

Can you just always use the fallback and get rid of that /proc/self/fd optimization ? Or achieve that with some other way that doesn't involve a malloc after fork?

For the chrome specific bug we'll be looking into disabling dbus autolunch and reducing our dependencies to that, but in general this feels to me could impact and surprise quite lot of other dbus users (fun fact, the root bug in chrome was caused by somebody calling gconf_client_get_bool).

Regards,
Primiano

[1] https://cgit.freedesktop.org/dbus/dbus/tree/dbus/dbus-sysdeps-unix.c?id=d42d167cc3ba0d20aa891a75aa1cd80ec5f490f1#n4368
Comment 1 Simon McVittie 2017-04-27 10:56:10 UTC
(In reply to Primiano Tucci from comment #0)
> - somebody starts chrome in a session that doesn't have a dbus running. This
> is very common in test harness that run via xvfb.

I recommend combining xvfb with dbus-run-session(1) if you want to run automated tests in a fake X11 session with a session bus.

> All this seems to come from an optimization in _dbus_close_all() for Linux
> [1].

The goal here is: if libdbus is linked into a process that foolishly does not set close-on-exec on all its file descriptors, we want to avoid leaking all those file descriptors to the child process, so that if the child process leaves them open, things assuming those fds will reach EOF do not deadlock.

(libdbus tries hard to set close-on-exec on file descriptors that we open, but we cannot make such guarantees about the hosting process.)

I think the intention may have been that the readdir() from /proc is for correctness, not for optimization: I don't think it's necessarily guaranteed that we don't have file descriptors open above _SC_OPEN_MAX, if someone opened a high file descriptor (perhaps with dup2) and then reduced the runtime limit below that. In the single-threaded process that we have after fork(), reading the list of fds from /proc is race-free, because we know that we don't have any other threads opening and closing fds behind our back.

Cc'ing the author of that code (from Bug #35230).

Unfortunately, I am not aware of any way to list file descriptors without using an async-signal-unsafe call like opendir() or readdir() (both of which are, strictly speaking, undefined behaviour between fork() and exec() on a random POSIX system). We shouldn't really be calling sysconf() after fork either, because that isn't on the list of async-signal-safe functions in POSIX, so in principle it's just as bad as opendir()/readdir() - but I don't see better ways to achieve the goal.

For the specific case of dbus-launch, we could have dbus-launch call _dbus_close_all() during its startup, instead of having libdbus call that function between fork and exec. However, the same code is used for the unixexec: transport, and we can't rely on arbitrary user-supplied subprocesses (typically a tunnel like ssh) behaving in that way.

I suppose we could have the parent process list the child's fds and write them into the write end of a pipe, and have the child block on the read end of that pipe, read all the fd numbers, close all those fds, and continue; but, ugh... implementing that seems fairly likely to introduce worse bugs than it solves.

GLib's glib/gspawn.c, which is more or less the state of the art in doing this sort of thing as far as I know, has the same bug (actually more bugs, because it respects the file descriptor rlimit, which can definitely lead to fds not being closed if the rlimit was adjusted downwards after the fd in question was opened).

> For the chrome specific bug we'll be looking into disabling dbus autolunch

That's an excellent typo. Now I want to implement automated lunch :-)

Do you use a bundled libdbus that you can patch, or a system copy?

I would be open to introducing a global function dbus_disable_autolaunch() or similar (that sets a mutex-protected global flag) to knock out that whole code path. This would only be in 1.11.x, but if you bundle a copy of libdbus you could backport that.

Alternatively, if DBUS_SESSION_BUS_ADDRESS is set in the environment and does not contain "autolaunch:", autolaunching will be suppressed. If you control the whole process-space, you could use this by setting DBUS_SESSION_BUS_ADDRESS to "disable:" or similar if it is unset.

> (fun fact, the root bug in
> chrome was caused by somebody calling gconf_client_get_bool)

GConf was deprecated in or before 2010 (reference: <https://bugzilla.gnome.org/show_bug.cgi?id=622558>). Can Chrome stop using it at some point, or at least use it from a subprocess or something?

If you want a D-Bus implementation for use by an application, and your reason for using it is not GConf, then libdbus is honestly not the implementation I would recommend. GLib's GDBus is better for high-level APIs or threaded processes (although I think its implementation of autolaunch uses glib/gspawn.c and so will have this same bug). systemd's sd-bus is probably better for low-level APIs in single-threaded processes.

(In reply to Primiano Tucci from comment #0)
> Given that libdbus is a
> library and not an hemertic executable, IMHO it isn't sensible to expect
> that hosting process has an allocator with a post-fork handler which handles
> this corner case.

If libdbus wasn't a library, we wouldn't have to cope with sharing our process space with code that fails to close its file descriptors... unfortunately, this is a necessary evil of not having full control over our execution environment.

Session bus autolaunching was absolutely necessary back in ~ 2005 when D-Bus was first designed, when a typical user might be running some 1990s Unix environment with individual GNOME or KDE apps under twm on a badly-integrated Linux-from-scratch or proprietary Unix system. This decade, D-Bus is essential "desktop plumbing" like X11, and we want to phase out autolaunching in favour of having systemd --user, a desktop environment starter like gnome-session, or distro infrastructure like Debian's Xsession.d start a session bus - but we can't just delete it, because at the moment people still rely on it.
Comment 2 Simon McVittie 2017-04-27 11:02:00 UTC
See also https://bugzilla.gnome.org/show_bug.cgi?id=746604 which is the analogous GLib bug. The resolution there seems to have been "please fix your malloc reimplementation to use pthread_atfork()".
Comment 3 Simon McVittie 2017-04-27 11:10:41 UTC
(In reply to Simon McVittie from comment #1)
> I would be open to introducing a global function dbus_disable_autolaunch()
> or similar (that sets a mutex-protected global flag) to knock out that whole
> code path.

Actually, I don't think it even needs to be mutex-protected, because nobody should be calling functions that alter global state from anywhere except the beginning of main() anyway.

Prior art: dbus_connection_set_change_sigpipe(). The observant will notice that, just like CLOEXEC, this is a workaround for strange historical Unixisms, because the sensible behaviour cannot be made the default for compatibility reasons.
Comment 4 Primiano Tucci 2017-04-27 12:04:33 UTC
Thanks a lot for your prompt reply Simon.

(In reply to Simon McVittie from comment #1)
> (In reply to Primiano Tucci from comment #0)
> > - somebody starts chrome in a session that doesn't have a dbus running. This
> > is very common in test harness that run via xvfb.
> 
> I recommend combining xvfb with dbus-run-session(1) if you want to run
> automated tests in a fake X11 session with a session bus.

Yup, I understand that doing that (or just setting DBUS_SESSION_BUS_ADDRESS to something) prevents the fork+exec.
It is something that we've been telling in the various bugs. Unfortunately there are so many projects using chrome as a test driver and we don't have control over them.

> 
> > All this seems to come from an optimization in _dbus_close_all() for Linux
> > [1].
> 
> The goal here is: if libdbus is linked into a process that foolishly does
> not set close-on-exec on all its file descriptors, 
You say foolishly does not set O_CLOEXEC. IMHO the sad reality is that IMHO the design of open() is just bad and file descriptors should be O_CLOEXEC by default.
From a more concrete viewpoint, Chrome depends on hundreds of third-party libraries and we cannot control the opens they do. (well we could but every time I did some doing symbol interpositions I fixed one bug and caused other 42 of them :) )

> we want to avoid leaking
> all those file descriptors to the child process, so that if the child
> process leaves them open, things assuming those fds will reach EOF do not
> deadlock.

I fully understand the need of closing fds after forking before exec. No objection against that.
Again, this is a sad *nix design, but this is out of scope and neither chrome or libdbus can change it.

> (libdbus tries hard to set close-on-exec on file descriptors that we open,
> but we cannot make such guarantees about the hosting process.)
> 
> I think the intention may have been that the readdir() from /proc is for
> correctness, not for optimization: I don't think it's necessarily guaranteed
> that we don't have file descriptors open above _SC_OPEN_MAX, if someone
> opened a high file descriptor (perhaps with dup2) and then reduced the
> runtime limit below that.
Ah I didn't think to that, good point.
Although, in that case the new process wouldn't be able to open() anything beyond the new rlimit, so the presence of other fds beyond that wouldn't cause a problem in the final executable, unless it tries to bump up the limit again (but this is now becoming a bit of an unrealistic scenario).
Agree that could still be bad in terms of leaking resources/capabilities.

> In the single-threaded process that we have after
> fork(), reading the list of fds from /proc is race-free, because we know
> that we don't have any other threads opening and closing fds behind our back.

Is very hard to keep everything under control. In chrome, one of the bugs was triggered by somebody calling gconf_client_get_bool. It's quite hard to predict that a funciton called *get_bool() does a fork+exec.

> Cc'ing the author of that code (from Bug #35230).
> 
> Unfortunately, I am not aware of any way to list file descriptors without
> using an async-signal-unsafe call like opendir() or readdir() (both of which
> are, strictly speaking, undefined behaviour between fork() and exec() on a
> random POSIX system). We shouldn't really be calling sysconf() after fork
> either, because that isn't on the list of async-signal-safe functions in
> POSIX, so in principle it's just as bad as opendir()/readdir()

Yup but realistically sysconf isn't a problem (yet) while opendir() is a real known problem.

> but I don't see better ways to achieve the goal.
How about using open() + getdents() instead?
or eventually read all the FDs before forking. That would be racy and risk to leak some fd opened by another thread before fork, but still IMHO better than causing deadlocks around malloc.


> For the specific case of dbus-launch, we could have dbus-launch call
> _dbus_close_all() during its startup, instead of having libdbus call that
> function between fork and exec. However, the same code is used for the
> unixexec: transport, and we can't rely on arbitrary user-supplied
> subprocesses (typically a tunnel like ssh) behaving in that way.
> 
> I suppose we could have the parent process list the child's fds and write
> them into the write end of a pipe, and have the child block on the read end
> of that pipe, read all the fd numbers, close all those fds, and continue;
> but, ugh... implementing that seems fairly likely to introduce worse bugs
> than it solves.
Definitely agree. That seems really over-engineering it.

 
> GLib's glib/gspawn.c, which is more or less the state of the art in doing
> this sort of thing as far as I know, has the same bug (actually more bugs,
> because it respects the file descriptor rlimit, which can definitely lead to
> fds not being closed if the rlimit was adjusted downwards after the fd in
> question was opened).

Yes, if you see the crbug linked above we are discussion gspawn.c separately in fact :)

> > For the chrome specific bug we'll be looking into disabling dbus autolunch
> 
> That's an excellent typo. Now I want to implement automated lunch :-)
Ehm, It's 12.30 here in BST, so I guess it was my subconscious speaking :)

> Do you use a bundled libdbus that you can patch, or a system copy?
System..

> I would be open to introducing a global function dbus_disable_autolaunch()
> or similar (that sets a mutex-protected global flag) to knock out that whole
> code path. This would only be in 1.11.x, but if you bundle a copy of libdbus
> you could backport that.
> 
> Alternatively, if DBUS_SESSION_BUS_ADDRESS is set in the environment and
> does not contain "autolaunch:", autolaunching will be suppressed. If you
> control the whole process-space, you could use this by setting
> DBUS_SESSION_BUS_ADDRESS to "disable:" or similar if it is unset.

Yeah we are evaluating this in our bug. even if somebody fixed the dbus bug today, realistically it will take some time before the fix shows up in the various releases (I can never keep up with the various release policies and timelines of the various distros).
So in parallel we are discussing in crubg.com/715658 temporary workarounds.

This might be a bit OT w.r.t. this bug, but since the right people are here...
I was considering doing something like:
if getenv(DBUS_SESSION_BUS_ADDRESS) == "" -> setenv(DBUS_SESSION_BUS_ADDRESS, "/dev/null")

One thing that worries me a bit, is that doing that will also disable the autolaunch path that tries first to connect to an existing session in ~/.dbus.
Now, from a deadlock viewpoint, that is similarly problematic because still involves a fork()+malloc()+exec() (so would be good to get rid of it).
From a end-user experience viewpoint, I worry that could cause some regressions. In other words: are there any distributions / desktop environments out there where DBUS_SESSION_BUS_ADDRESS is always empty and that hence rely on the autolaunch to connecting to the session? Or could we safely ignore that use-case and say that if you don't have DBUS_SESSION_BUS_ADDRESS we'll just forcefully bypass it.
I have no idea about this. Thoughts?

> 
> > (fun fact, the root bug in
> > chrome was caused by somebody calling gconf_client_get_bool)
> 
> GConf was deprecated in or before 2010 (reference:
> <https://bugzilla.gnome.org/show_bug.cgi?id=622558>). Can Chrome stop using
> it at some point, or at least use it from a subprocess or something?

Yeah, you make another good observation. this is another discussion I started a while ago, before discovering this bug:
https://groups.google.com/a/chromium.org/d/msg/chromium-dev/hRpUZr4mQ0k/g4qQ4mgFCQAJ
I guess the only reason is just "legacy" and "removing dependencies is hard".

> If you want a D-Bus implementation for use by an application, and your
> reason for using it is not GConf, then libdbus is honestly not the
> implementation I would recommend.
To be honest, I have no idea if and which other libraries indirectly can use libdbus.
I am a poor browser engineer, all these desktop environments and libraries scare and confuse me :)

> GLib's GDBus is better for high-level APIs
> or threaded processes (although I think its implementation of autolaunch
> uses glib/gspawn.c and so will have this same bug). systemd's sd-bus is
> probably better for low-level APIs in single-threaded processes.
> 
> (In reply to Primiano Tucci from comment #0)
> > Given that libdbus is a
> > library and not an hemertic executable, IMHO it isn't sensible to expect
> > that hosting process has an allocator with a post-fork handler which handles
> > this corner case.
> 
> If libdbus wasn't a library, we wouldn't have to cope with sharing our
> process space with code that fails to close its file descriptors...
> unfortunately, this is a necessary evil of not having full control over our
> execution environment.
> 
> Session bus autolaunching was absolutely necessary back in ~ 2005 when D-Bus
> was first designed, when a typical user might be running some 1990s Unix
> environment with individual GNOME or KDE apps under twm on a
> badly-integrated Linux-from-scratch or proprietary Unix system. This decade,
> D-Bus is essential "desktop plumbing" like X11, and we want to phase out
> autolaunching in favour of having systemd --user, a desktop environment
> starter like gnome-session, or distro infrastructure like Debian's
> Xsession.d start a session bus - but we can't just delete it, because at the
> moment people still rely on it.

Thanks for the background. Believe me, I really really understand how hard is to deal with legacy.
We still have workarounds for RedHat 9 which nobody knowns if are still necessary and I'm too scared to drop. :)
Comment 5 Simon McVittie 2017-04-27 12:43:36 UTC
(In reply to Primiano Tucci from comment #4)
> You say foolishly does not set O_CLOEXEC. IMHO the sad reality is that IMHO
> the design of open() is just bad and file descriptors should be O_CLOEXEC by
> default.

Right, I agree, but it's far too late to fix that (and that was already true before D-Bus was written).

It isn't just open(), every other ioctl that returns a fd has the same problem (socket() without SOCK_CLOEXEC, epoll_create(), epoll_create1() without EPOLL_CLOEXEC, and so on).

> From a more concrete viewpoint, Chrome depends on hundreds of third-party
> libraries and we cannot control the opens they do.

Sure, and if you replace "Chrome" with "whatever uses libdbus or GLib", that's why we do these crazy things to mitigate their bugs.

> Although, in that case the new process wouldn't be able to open() anything
> beyond the new rlimit, so the presence of other fds beyond that wouldn't
> cause a problem in the final executable, unless it tries to bump up the
> limit again (but this is now becoming a bit of an unrealistic scenario).
> Agree that could still be bad in terms of leaking resources/capabilities.

It's worse than that. Consider this situation:

* (Some other library in) Chrome opens a pipe to a subprocess, let's
  say myhelper, and wrongly does not make it CLOEXEC; let's say the
  write end is fd 1234
* Use of D-Bus causes a different subprocess (let's say dbus-launch)
  to be started; it inherits fd 1234
* Chrome finishes writing to its copy of fd 1234 and closes the write
  end of the pipe
* Expected result: myhelper receives EOF on the read end of the pipe
* Actual result: myhelper does not receive EOF because dbus-launch
  still has the write end open, and has no idea that it should close it
  (after all, it didn't open it...)

We can easily fix this for dbus-launch by inserting a _dbus_close_all(), but we can't fix it for unixexec: tunnels, which run an arbitrary subprocess.

> How about using open() + getdents() instead?

The syscall that is explicitly documented as "not the interfaces you are interested in"? I'd rather not.

I don't really want to copy-paste a definition of struct linux_dirent64, and it isn't in any installed header...

Is the getdents64 syscall considered to be async-signal-safe?

> This might be a bit OT w.r.t. this bug, but since the right people are
> here...
> I was considering doing something like:
> if getenv(DBUS_SESSION_BUS_ADDRESS) == "" ->
> setenv(DBUS_SESSION_BUS_ADDRESS, "/dev/null")

getenv() would return NULL, not "", in the problematic case.

DBUS_SESSION_BUS_ADDRESS is a D-Bus address string vaguely similar to a URL (or technically a list of comma-separated addresses), not a pathname, so please use an undefined transport name plus a colon, like "disable:" or something (the transport you'd typically use on Unix is "unix:"). I could reserve a transport name in the D-Bus Specification that will never be defined or implemented, if that would help.

One thing that this could easily break, though: recent versions of libdbus and other major D-Bus implementations (sd-bus, GDBus) will try "$XDG_RUNTIME_DIR/bus" (which we hope will become the dominant way to locate a session bus over the next few years) before falling back to X11 autolaunch. At the moment, we expect distributions that make use of this feature to set DBUS_SESSION_BUS_ADDRESS="unix:path=$XDG_RUNTIME_DIR/bus". However, eventually we would like to be able to stop setting that variable, which would break your proposed code.
Comment 6 Simon McVittie 2017-04-27 12:49:40 UTC
(In reply to Simon McVittie from comment #5)
> It isn't just open(), every other ioctl that returns a fd has the same
> problem

That should say "every other syscall that returns a fd".

One thing we could potentially do here, although it's pretty horrible, is:

* fork()
* exec() an "adverb" wrapper like
  {"/usr/libexec/dbus-exec-helper", "the-real-executable", "--arg1", "--arg2",
   NULL}
* the "adverb" wrapper calls _dbus_closeall() and then execvp()s
  {"the-real-executable", "--arg1", "--arg2", NULL}

But I really think it would be better for Chrome to either not interpose a malloc() reimplementation, or use pthread_atfork() to make sure its interposed malloc() reimplementation behaves in the same way as the one in glibc.

Arbitrary GNU/Linux libraries are going to assume that they run on GNU/Linux, with GNU libc behaviour. If you break that behaviour for a process that hosts arbitrary libraries, you are constantly going to be playing whack-a-mole with this sort of assumption.
Comment 7 Simon McVittie 2017-04-27 12:53:35 UTC
I would really prefer to follow GLib's lead here, because GLib has a larger fraction of a maintainer than libdbus does, while working in the same sort of space as the lower-level parts of libdbus (a "make portable C less painful" library). It's written in a similar style and targets similar platforms.

At the moment I'm afraid following GLib's lead would mean WONTFIX and asking Chrome to fix its malloc().

> GConf was deprecated in or before 2010 (reference:
> <https://bugzilla.gnome.org/show_bug.cgi?id=622558>). Can Chrome stop using
> it at some point, or at least use it from a subprocess or something?

In another life (as a Debian GNOME team member) I have been trying to eradicate GConf from a default Debian desktop installation, with the longer-term goal of removing it from Debian altogether.
Comment 8 Primiano Tucci 2017-04-27 14:10:31 UTC
(In reply to Simon McVittie from comment #5)
> (In reply to Primiano Tucci from comment #4)
> The syscall that is explicitly documented as "not the interfaces you are
> interested in"? 
Well, sadly the "one you are interested in" forces a malloc(). (but, yup, agreed, there is very little point discussing what we cannot change)

> Is the getdents64 syscall considered to be async-signal-safe?
The key here is not really async safe, rather "not make assumptions / mess up with any other userspace state", which AFAIK is generally true for syscalls.

> DBUS_SESSION_BUS_ADDRESS is a D-Bus address string vaguely similar to a URL
> (or technically a list of comma-separated addresses), not a pathname, so
> please use an undefined transport name plus a colon, like "disable:" or
> something (the transport you'd typically use on Unix is "unix:"). 
Ah right, sure thing. Any suggestions? like "unix:/dev/null" or "disable:" ?

> One thing that this could easily break, though: recent versions of libdbus
> and other major D-Bus implementations (sd-bus, GDBus) will try
> "$XDG_RUNTIME_DIR/bus" (which we hope will become the dominant way to locate
> a session bus over the next few years) before falling back to X11
> autolaunch. At the moment, we expect distributions that make use of this
> feature to set DBUS_SESSION_BUS_ADDRESS="unix:path=$XDG_RUNTIME_DIR/bus".
> However, eventually we would like to be able to stop setting that variable,
> which would break your proposed code.
Yeah this is precisely what I was fearing. Any idea if _today_ it does break anything?


(In reply to Simon McVittie from comment #5)
> It's worse than that. Consider this situation:
> 
> * (Some other library in) Chrome opens a pipe to a subprocess, let's
>   say myhelper, and wrongly does not make it CLOEXEC; let's say the
>   write end is fd 1234
> * Use of D-Bus causes a different subprocess (let's say dbus-launch)
>   to be started; it inherits fd 1234
> * Chrome finishes writing to its copy of fd 1234 and closes the write
>   end of the pipe
> * Expected result: myhelper receives EOF on the read end of the pipe
> * Actual result: myhelper does not receive EOF because dbus-launch
>   still has the write end open, and has no idea that it should close it
>   (after all, it didn't open it...)

Yup agree. this is what I meant when I said leaking *resources*/capabilities.

(In reply to Simon McVittie from comment #6)
> But I really think it would be better for Chrome to either not interpose a
> malloc() reimplementation

That is done for various security (heaps isolation + enforcing a bound on size to prevent int wrapping bugs) and reliability purposes (we suicide on malloc failure, as realistically any attempt of hanling that ends up just moving the bug somewhere else) and we intend to keep doing so.

> or use pthread_atfork() to make sure its
> interposed malloc() reimplementation behaves in the same way as the one in
> glibc.

Yeah this is not unreasonable, although realistically will be absolutely non trivial to backport that.

> Arbitrary GNU/Linux libraries are going to assume that they run on
> GNU/Linux, with GNU libc behaviour. If you break that behaviour for a
> process that hosts arbitrary libraries, you are constantly going to be
> playing whack-a-mole with this sort of assumption.

I feel the biggest mistake we made was assuming that libraries won't fork() without being very explicit about that.
if I have to make a change to chrome I'd just interpose fork() and turn that into a (if(!scoped_allow_fork) __builtin_trap() or similar to allow it only in controlled places, to avoid further surprises like this.
I'll give it a go, but I feel I might be just too late for this and trying to do it will just unveil other cases that are going to make me feel even more sad.

(In reply to Simon McVittie from comment #7)
> I would really prefer to follow GLib's lead here, because GLib has a larger
> fraction of a maintainer than libdbus does, while working in the same sort
> of space as the lower-level parts of libdbus (a "make portable C less
> painful" library). It's written in a similar style and targets similar
> platforms.
> 
> At the moment I'm afraid following GLib's lead would mean WONTFIX and asking
> Chrome to fix its malloc().

In all honesty I understand your position here as maintainer of a library to stick to whatever glib does. But I had to give it a go here. 

My personal view is that both GLib and libdbus are wrong in fork()+malloc()ing and in assuming that everybody uses glibc malloc.
I will try to revive that thread, but realistically speaking I might be just too late for that battle.
Honestly, I think really you should both be using getdents(). I am not sure if chrome or dbus will be there 10 years from now, but I'm quite confident that getdents() is going to.
That part of the kernel ABI is unrealistic to change any time soon, without causing the rest of the world to fall apart.
Comment 9 Simon McVittie 2017-04-27 14:32:06 UTC
(In reply to Primiano Tucci from comment #8)
> Honestly, I think really you should both be using getdents(). I am not sure
> if chrome or dbus will be there 10 years from now, but I'm quite confident
> that getdents() is going to.

It already isn't there, if your CPU architecture is one that has always had getdents64(). https://bugs.launchpad.net/apparmor/+bug/1674245
Comment 10 Simon McVittie 2017-04-27 14:55:38 UTC
(In reply to Primiano Tucci from comment #8)
>Any suggestions? like "unix:/dev/null" or "disable:" ?

If you go this route, use something like disable: to make it fail as cleanly as possible.

> I feel the biggest mistake we made was assuming that libraries won't fork()
> without being very explicit about that.

Yes, I think that's a flawed assumption. Put yourself in a library maintainer's place for a moment, and suppose you need to do things that can't be done without affecting process-global state, or can't be done reliably without making assumptions about the process you're in. How are you going to do that, other than spawning a helper that controls its entire process state?

The ideal answer a lot of the time is "use IPC to talk to some more complicated code in a service, which gets to run in a more predictable environment" - but that isn't an option if what you're implementing *is* the IPC mechanism, and D-Bus is one of the candidates for an IPC mechanism you can use to communicate with services like that.

(Don't get me wrong, I don't think it's great that libdbus autolaunching runs subprocesses either - but we're stuck with it, and we have been for a decade.)
Comment 11 Primiano Tucci 2017-04-27 15:25:57 UTC
(In reply to Simon McVittie from comment #10)
> (In reply to Primiano Tucci from comment #8)
> >Any suggestions? like "unix:/dev/null" or "disable:" ?
> 
> If you go this route, use something like disable: to make it fail as cleanly
> as possible.
> 
> > I feel the biggest mistake we made was assuming that libraries won't fork()
> > without being very explicit about that.
> 
> Yes, I think that's a flawed assumption. Put yourself in a library
> maintainer's place for a moment, and suppose you need to do things that
> can't be done without affecting process-global state, or can't be done
> reliably without making assumptions about the process you're in. How are you
> going to do that, other than spawning a helper that controls its entire
> process state?
> 
> The ideal answer a lot of the time is "use IPC to talk to some more
> complicated code in a service which gets to run in a more predictable "environment"

You stole my words :)
I was going to say, use a socket to talk to your service somewhere else. Or at very least fork()+exec() immediately your service without doing anything else (like opendirs/readdir/malloc) and talk to that.

> but that isn't an option if what you're implementing *is* the
> IPC mechanism, and D-Bus is one of the candidates for an IPC mechanism you
> can use to communicate with services like that.

This can become a longer discussion. I'll be really honest, I don't know too many of the details required to make give a sensible opinion here.
I just know that when I see a function call xxx_get_bool() ending up exec-ing a full service like d-bus, my feeling is that, I don't know precisely at which point, but the entire universe has gone wrong. 
Maybe if the IPC mechanism isn't there (read: if the dbus service isn't running) you should just give up and not trying to exec anything and say "in order for this to work, you must have started a dbus daemon". Which in turn would have caused dbus clients like gconf to have returned an error in  gconf_get_bool, and so on.
Really, I am not saying "this is clearly wrong, how this could have happened". I really understand how software evolves and how these situations slowly creep in over time.
My only advice, without too much context, is just: when in doubt, fail brutally, don't allow clients to make wrong assumptions. 

FWIW, this is the reason why in chromium the rule is that anything which is unexpected and we don't know how to handle just ends up in a __builtin_trap() equivalent that murders the full process. It might seem brutal, but at least avoids that some other client start making wrong assumptions.
But now I feel I am dragging this entire bug OT (But since you asked  "put yourself in this situation", this is what, without enough context, I would have done).

> (Don't get me wrong, I don't think it's great that libdbus autolaunching
> runs subprocesses either - but we're stuck with it, and we have been for a
> decade.)
Yup, I fully understand. And really thanks for the honest and very thorough explanation here.
I updated crbug.com/715658, I think that the most pragmatic is setting DBUS_SESSION_BUS_ADDRESS="disable:" when detecting that we are running in test mode (and that var is not set).
Comment 12 Simon McVittie 2017-04-27 18:44:37 UTC
(In reply to Primiano Tucci from comment #11)
> Maybe if the IPC mechanism isn't there (read: if the dbus service isn't
> running) you should just give up and not trying to exec anything and say "in
> order for this to work, you must have started a dbus daemon".

That would be a valid answer now, but it isn't clear that it would have been a valid answer in 2005, when D-Bus was a new thing that needed to solve the chicken/egg problem of "won't be started if nothing uses it; won't be used if it can't be relied on to have been started".

I think the long term answer is to phase out autolaunching, but that's likely to break enough things that it will have to be mostly a job for distributions - upstream can provide a switch, but distributions will have to be the ones to choose when they can flip that switch.

Anyway, for the use of _dbus_close_all() with dbus-launch, we can get rid of this by moving it from the parent into dbus-launch itself. That one is actually easier.

For the use of _dbus_close_all() for the unixexec: transport (in which the address is set to a command-line, typically something involving ssh, that is expected to connect its stdin/stdout to some local or remote dbus-daemon), we don't actually know what we're exec'ing (it's user-supplied), so that isn't an option. That feature is actually why _dbus_close_all() was added in the first place, and I already cc'd its author on this bug.

I think it would probably make sense to have a way to prevent libdbus from ever exec'ing anything - that's probably more useful than the dbus_disable_autolaunch() that I suggested above.
Comment 13 GitLab Migration User 2018-10-12 21:30:45 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/dbus/dbus/issues/173.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.