Bug 101898 - Containers (#100344): fd-passing-based creation
Summary: Containers (#100344): fd-passing-based creation
Status: ASSIGNED
Alias: None
Product: dbus
Classification: Unclassified
Component: core
Version: git master
Hardware: All All
Importance: medium enhancement
Assignee: Simon McVittie
QA Contact: D-Bus Maintainers
URL:
Whiteboard:
Keywords:
Depends on: 101354
Blocks: 100344
Reported: 2017-07-24 14:27 UTC by Simon McVittie
Modified: 2017-12-19 18:23 UTC (History)

Description Simon McVittie 2017-07-24 14:27:39 UTC
+++ This bug was initially created as a clone of Bug #100344 +++

Following on from Bug #101354, Allison wants to be able to arrange for container instances' servers to appear inside containers in a more elegant way than creating them outside and bind-mounting them in. I already intended to do this, but it is not part of the minimum viable product (Bug #101354).

Design sketch:

The named_parameters a{sv} argument may contain:

    ServerSocket: h
        A socket (fstat() must indicate format S_IFSOCK)
        with SO_DOMAIN = AF_UNIX and SO_TYPE = SOCK_STREAM. The
        container manager will arrange for bind() and listen() to be called
        on this socket so that it is made available inside the container.

        If ServerSocketReadyNotifier is not provided, the container manager
        must already have called bind() and listen() (the SO_ACCEPTCONN socket
        option is 1 and getsockname() returns an address), such that this
        socket is already ready for the message bus to call accept() on it.
        In this case the AddServer() method will return the socket's path
        and D-Bus address as usual.

        If ServerSocketReadyNotifier is provided, then the container manager
        may delay calling bind() and listen() until just before it makes the
        ServerSocketNotifierReadyNotifier poll readable. In this case the
        AddServer() method cannot determine the socket's address, so it
        will return an empty byte-array instead of the socket's absolute
        path, and an empty string instead of its D-Bus address.

    ServerSocketReadyNotifier: h
        The reading end of a pipe or FIFO (format S_IFIFO). The container
        manager will wait for this pipe to poll readable, then close it
        and begin to accept() on the ServerSocket.

        (The container manager should keep the write end of this pipe open
        until it has called bind() and listen() on the ServerSocket,
        then close the write end, resulting in the read end polling readable.)
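
A minimal sketch of the waiting side of this handshake, assuming the notifier is an ordinary pipe. wait_for_ready() is a hypothetical helper name used only for illustration, not an existing dbus-daemon function, and a real message bus would presumably register the fd with its main loop rather than block in poll():

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: returns 1 once the notifier polls readable or
 * hung up (the write end was closed), 0 on timeout, -1 on error. */
int wait_for_ready(int notifier_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notifier_fd, .events = POLLIN };
    int r;

    /* Retry if a signal interrupts the wait. */
    do {
        r = poll(&pfd, 1, timeout_ms);
    } while (r < 0 && errno == EINTR);

    if (r < 0)
        return -1;
    if (r == 0)
        return 0;
    /* Closing the last write end of a pipe wakes the reader with
     * POLLHUP (and POLLIN if unread data is still queued). */
    if (pfd.revents & (POLLIN | POLLHUP))
        return 1;
    return -1; /* POLLERR or POLLNVAL */
}
```

Only after this returns 1 would the caller start to accept() on the ServerSocket.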
Comment 1 Simon McVittie 2017-07-24 14:43:12 UTC
Implementation sketch:

We can use fstat() to check that the passed socket and pipe are of the types we expect, to avoid callers doing anything crazy with non-socket or non-pipe fds. It might well be a good idea to getsockopt() the socket for SO_DOMAIN and SO_TYPE too - the usual motto of "we can always make it more liberal later".

From socket(7) it looks like we can getsockopt() for SO_ACCEPTCONN (at least on Linux) to check that it's already listening when we want to accept() it. As a safety-catch we probably still want to make sure the dbus-daemon won't busy-loop on an invalid server fd, though.

Open question: can a FUSE filesystem intercept fstat() on an open fd and make it take arbitrarily long, resulting in a DoS on dbus-daemon when it tries to fstat() them? If it can, then this mode of container creation has to be a privileged operation (root or bus owner only).
Comment 2 Philip Withnall 2017-08-01 11:42:53 UTC
(In reply to Simon McVittie from comment #0)
>         If ServerSocketReadyNotifier is provided, then the container manager
>         may delay calling bind() and listen() until just before it makes the
>         ServerSocketNotifierReadyNotifier poll readable. In this case the
>         AddServer() method cannot determine the socket's address, so it
>         will return an empty byte-array instead of the socket's absolute
>         path, and an empty string instead of its D-Bus address.

Typo: s/ServerSocketNotifierReadyNotifier/ServerSocketReadyNotifier/?
Comment 3 David Herrmann 2017-12-19 10:52:54 UTC
(In reply to Simon McVittie from comment #1) 
> Open question: can a FUSE filesystem intercept fstat() on an open fd and
> make it take arbitrarily long, resulting in a DoS on dbus-daemon when it
> tries to fstat() them? If it can, then this mode of container creation has
> to be a privileged operation (root or bus owner only).

Don't use fstat(2).

Assuming you don't trust FUSE, this would be safe:

getsockopt(SOL_SOCKET, SO_DOMAIN) -> PF_UNIX
getsockopt(SOL_SOCKET, SO_TYPE) -> SOCK_STREAM
getsockopt(SOL_SOCKET, SO_ACCEPTCONN) -> 1

Once all these return the expected values, you know this is the socket you expect. Note that getsockopt() on SOL_SOCKET cannot be intercepted. It is always handled by the networking core (on Linux, at least).

While I would argue that FUSE should either be trusted or not be given access at all, this sequence is perfectly safe against FUSE as well. Unlike fstat(2), getsockopt(2) cannot be intercepted by FUSE.
Comment 4 David Herrmann 2017-12-19 11:03:40 UTC
(In reply to Simon McVittie from comment #0)
>     ServerSocketReadyNotifier: h
>         The reading end of a pipe or FIFO (format S_IFIFO). The container
>         manager will wait for this pipe to poll readable, then close it
>         and begin to accept() on the ServerSocket.
> 
>         (The container manager should keep the write end of this socket open
>         until it has called bind() and listen() on the ServerSocket,
>         then close the write end, resulting in the read end polling
> readable.)

Is there a particular reason to put the burden of this interface on dbus-daemon? Why would someone push the socket into dbus-daemon before it is ready to use?

Even in the bubblewrap example, why not put the burden of making this work on bubblewrap and its caller? For instance, if Flatpak spawns an app, it could create the listener socket *AND* ready-pipe itself and pass them to bubblewrap. Once bubblewrap closes the ready-pipe, Flatpak pushes the socket into dbus-daemon via this call.

This would avoid extending the dbus-daemon interface solely to accommodate the policy of not having D-Bus code in bubblewrap.

Lastly, please note that a trivial implementation of this is racy. Just because bubblewrap closed the ready-notifier socket does not mean dbus-daemon handled that event. On the contrary, dbus-daemon might handle any different event first, even if those fired after the ready-notifier. No priority-ordering can prevent this, since kernel events are not fine-grained enough on stream sockets to protect this kind of ordering.

Worst case, bubblewrap closes a socket, but a subsequent attempt to connect to the socket is still rejected, because dbus-daemon didn't handle the event, yet.
Even if you make dbus-daemon check the condition manually on connection attempts, you are still susceptible to ordering issues with any other operation on the bus, even though those might be unlikely to be triggered.

If we defer this to Flatpak and bubblewrap, they can decide for themselves how many barriers are needed for their use-case. Worst case, bubblewrap needs a synchronous interface to Flatpak to tell it to add the socket and block until that is done (make it two pipes, one for each direction).
Comment 5 Simon McVittie 2017-12-19 11:28:16 UTC
(In reply to David Herrmann from comment #4)
> Is there a particular reason to put the burden of this interface on
> dbus-daemon? Why would someone push the socket into dbus-daemon before it is
> ready to use?

Because in the design I discussed with Allison and Alex, Flatpak creates the socket() and communicates with dbus-daemon, but it's bubblewrap that makes the socket ready to use by calling bind() and listen(), because Allison felt strongly that it should be possible to avoid the socket ever being bound outside the container.

> Even in the bubblewrap example, why not put the burden of making this work
> on bubblewrap and its caller? For instance, if Flatpak spawns an app, it
> could create the listener socket *AND* ready-pipe itself, pass it to
> bubblewrap. Once bubblewrap closes the ready-pipe, it pushes it into
> dbus-daemon via this call.

By the time the socket becomes ready, the `flatpak run` process has already gone away (it execs bubblewrap, rather than doing a fork-and-exec, and having the parent wait for bubblewrap to exit then exit itself).

bubblewrap can't do non-trivial IPC to dbus-daemon, because bubblewrap is setuid root on some systems, so making it use non-trivial libraries is dangerous.

> Lastly, please note that a trivial implementation of this is racy. Just
> because bubblewrap closed the ready-notifier socket does not mean
> dbus-daemon handled that event. On the contrary, dbus-daemon might handle
> any different event first, even if those fired after the ready-notifier. No
> priority-ordering can prevent this, since kernel events are not fine-grained
> enough on stream sockets to protect this kind of ordering.

As long as dbus-daemon doesn't start trying to accept() until bubblewrap has called bind() and listen(), everything's fine? It doesn't matter if there's a small delay during which the sandboxed app can connect() to the listening socket, but will block until the dbus-daemon wakes up and starts accept()ing.

> Worst case, bubblewrap closes a socket, but a subsequent attempt to connect
> to the socket is still rejected, because dbus-daemon didn't handle the
> event, yet.

As long as bubblewrap calls listen() before it closes the ready-notifier, I don't see how this can happen? If the dbus-daemon isn't accept()ing just yet, won't the client just block?
Comment 6 Simon McVittie 2017-12-19 11:30:04 UTC
(In reply to Simon McVittie from comment #0)
>     ServerSocketReadyNotifier: h
>         The reading end of a pipe or FIFO (format S_IFIFO). The container
>         manager will wait for this pipe to poll readable, then close it
>         and begin to accept() on the ServerSocket.
> 
>         (The container manager should keep the write end of this socket open
>         until it has called bind() and listen() on the ServerSocket,
>         then close the write end, resulting in the read end polling
> readable.)

Sorry, that first paragraph should have said: The *message bus* will wait for...

The message bus is dbus-daemon or equivalent, the container manager is the combination of Flatpak and bwrap (or eventually Snap or Firejail, if they want to use this interface).
Comment 7 Simon McVittie 2017-12-19 12:47:25 UTC
(In reply to David Herrmann from comment #3)
> (In reply to Simon McVittie from comment #1) 
> > Open question: can a FUSE filesystem intercept fstat() on an open fd?
> 
> Don't use fstat(2).
> 
> Assuming you don't trust FUSE, this would be safe:
> 
> getsockopt(SOL_SOCKET, SO_DOMAIN) -> PF_UNIX
> getsockopt(SOL_SOCKET, SO_TYPE) -> SOCK_STREAM
> getsockopt(SOL_SOCKET, SO_ACCEPTCONN) -> 1

In that case we'd need to use a socket pair, not a pipe (or eventfd), but that seems fine. We could shutdown() one of the two directions so that it behaves more like a pipe.
Comment 8 David Herrmann 2017-12-19 18:09:46 UTC
(In reply to Simon McVittie from comment #7)
> (In reply to David Herrmann from comment #3)
> > (In reply to Simon McVittie from comment #1) 
> > > Open question: can a FUSE filesystem intercept fstat() on an open fd?
> > 
> > Don't use fstat(2).
> > 
> > Assuming you don't trust FUSE, this would be safe:
> > 
> > getsockopt(SOL_SOCKET, SO_DOMAIN) -> PF_UNIX
> > getsockopt(SOL_SOCKET, SO_TYPE) -> SOCK_STREAM
> > getsockopt(SOL_SOCKET, SO_ACCEPTCONN) -> 1
> 
> In that case we'd need to use a socket pair, not a pipe (or eventfd), but
> that seems fine. We could shutdown() one of the two directions so that it
> behaves more like a pipe.

This was meant regarding the listener-socket. For other types it really depends what is needed:

1) If you use a pipe as notifier and only wait for POLLHUP, you can straight out do that without verifying *anything*. close(2) is never synchronous on FUSE, neither is poll(2). So you can just take any FD without verification. Note that there is no guarantee that your FD is linked to anything, so you might be the only one holding a reference. As long as the FD is accounted and bound to a dbus-connection, this _should_ be safe as its lifetime is bound (but see below...).

2) If you want eventfd semantics and use its counter as notification mechanism, I am afraid there might be no nice way to check it. The only safe option is to read /proc/self/fdinfo/<num> and see whether it says `eventfd-count: <num>`
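
For option 2, the fdinfo check could be sketched like this. It assumes /proc is mounted and that the kernel reports the `eventfd-count:` line for eventfds; fd_is_eventfd() is a hypothetical name:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 1 if /proc/self/fdinfo/<fd> contains an
 * "eventfd-count:" line, 0 if it does not, -1 if it cannot be read. */
int fd_is_eventfd(int fd)
{
    static const char key[] = "eventfd-count:";
    char path[64];
    char line[256];
    FILE *f;
    int found = 0;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* fdinfo is a short "key: value" text file, one entry per line. */
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, key, sizeof(key) - 1) == 0) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```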

While at it: Whenever you accept file-descriptors without verifying their type, you must make sure they're considered inflight, just like message payloads. Because again, it might be an AF_UNIX socket that has another socket queued recursively, and you keep it alive by pinning it.

For AF_UNIX-*listener* sockets this is safe. For anything else... well... depends on how safe your accounting is, and what kind of FD you're dealing with.

I hope that helps. I briefly verified this all on the current kernel sources. If we settle on one technique, I can verify it properly again, just to be sure.
Comment 9 David Herrmann 2017-12-19 18:23:08 UTC
(In reply to Simon McVittie from comment #5)
> (In reply to David Herrmann from comment #4)
> > Is there a particular reason to put the burden of this interface on
> > dbus-daemon? Why would someone push the socket into dbus-daemon before it is
> > ready to use?
> 
> Because in the design I discussed with Allison and Alex, Flatpak creates the
> socket() and communicates with dbus-daemon, but it's bubblewrap that makes
> the socket ready to use by calling bind() and listen(), because Allison felt
> strongly that it should be possible to avoid the socket ever being bound
> outside the container.

Side-note: On Linux, you can pin an unlinked directory via an O_PATH fd, and then create AF_UNIX sockets in that directory by specifying /proc/self/fd/<O_PATH-fd>/<file-path-of-choice>. That is, this is kinda the directory equivalent of creating anonymous files via memfd_create().

With this in mind, you can simply bind the socket in the flatpak controller without it ever being visible to anyone but you.

> > Even in the bubblewrap example, why not put the burden of making this work
> > on bubblewrap and its caller? For instance, if Flatpak spawns an app, it
> > could create the listener socket *AND* ready-pipe itself, pass it to
> > bubblewrap. Once bubblewrap closes the ready-pipe, it pushes it into
> > dbus-daemon via this call.
> 
> By the time the socket becomes ready, the `flatpak run` process has already
> gone away (it execs bubblewrap, rather than doing a fork-and-exec, and
> having the parent wait for bubblewrap to exit then exit itself).
> 
> bubblewrap can't do non-trivial IPC to dbus-daemon, because bubblewrap is
> setuid root on some systems, so making it use non-trivial libraries is
> dangerous.

Why can't flatpak be changed to wait for bubblewrap to exit and then call into dbus-daemon? (Or wait for bubblewrap to signal something, rather than exit.)

I mean, we're designing spec extensions here based on implementation restrictions in one user.

Maybe I am misunderstanding something, but the two options are this:

1) Change flatpak to do a fork-and-exec on bubblewrap, wait until its setup is done, and then call into dbus-daemon.

2) Make dbus-daemon support inhibitors that delay activation of queued containers (like proposed here).

The second one requires extending the D-Bus spec and support in every message bus, while the former would be restricted to an implementation detail of flatpak.

This is not a major issue, so I don't want to block this. But for what it's worth, I would vote for option 1).

> > Lastly, please note that a trivial implementation of this is racy. Just
> > because bubblewrap closed the ready-notifier socket does not mean
> > dbus-daemon handled that event. On the contrary, dbus-daemon might handle
> > any different event first, even if those fired after the ready-notifier. No
> > priority-ordering can prevent this, since kernel events are not fine-grained
> > enough on stream sockets to protect this kind of ordering.
> 
> As long as dbus-daemon doesn't start trying to accept() until bubblewrap has
> called bind() and listen(), everything's fine? It doesn't matter if there's
> a small delay during which the sandboxed app can connect() to the listening
> socket, but will block until the dbus-daemon wakes up and starts accept()ing.

Right, I missed that part, sorry! This solves the connection-issue.

Also, as long as closing the notifier is solely interpreted as barrier for accept(2) to be allowed, I don't see any race.

Sorry, I was missing that the kernel queues your connection attempts just fine.
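
A small self-contained sketch of that queueing behaviour: once listen() has been called, a client can connect() and even send data before the server calls accept(). The function name is hypothetical, and the Linux-specific abstract-socket autobind is used here only so that the example needs no filesystem path:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns 0 if a client could connect() and send a byte before the
 * server ever called accept(), -1 otherwise. */
int connect_before_accept(void)
{
    struct sockaddr_un addr;
    socklen_t len = sizeof(addr);
    int srv = -1, cli = -1, conn = -1, ret = -1;
    char c = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;

    srv = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv < 0)
        goto out;
    /* Linux autobind: passing only the family picks an abstract name. */
    if (bind(srv, (struct sockaddr *) &addr, sizeof(sa_family_t)) < 0)
        goto out;
    if (listen(srv, 8) < 0)
        goto out;
    if (getsockname(srv, (struct sockaddr *) &addr, &len) < 0)
        goto out;

    /* The server has not called accept() yet; the kernel queues the
     * connection in the listen backlog and connect() still succeeds. */
    cli = socket(AF_UNIX, SOCK_STREAM, 0);
    if (cli < 0)
        goto out;
    if (connect(cli, (struct sockaddr *) &addr, len) < 0)
        goto out;
    if (write(cli, "x", 1) != 1)
        goto out;

    /* Only now does the server accept and read the queued byte. */
    conn = accept(srv, NULL, NULL);
    if (conn < 0)
        goto out;
    if (read(conn, &c, 1) == 1 && c == 'x')
        ret = 0;

out:
    if (conn >= 0)
        close(conn);
    if (cli >= 0)
        close(cli);
    if (srv >= 0)
        close(srv);
    return ret;
}
```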

