101898 – Containers (#100344): fd-passing-based creation

Bug 101898 - Containers (#100344): fd-passing-based creation

Summary: Containers (#100344): fd-passing-based creation

Status:	RESOLVED MOVED

Alias:	None

Product:	dbus
Classification:	Unclassified
Component:	core (show other bugs)
Version:	git master
Hardware:	All All

Importance:	medium enhancement
Assignee:	Simon McVittie
QA Contact:	D-Bus Maintainers

URL:
Whiteboard:
Keywords:

Depends on:	101354
Blocks:	100344
	Show dependency tree / graph

Reported:	2017-07-24 14:27 UTC by Simon McVittie
Modified:	2018-10-12 21:31 UTC (History)
CC List:	5 users (show)

See Also:
i915 platform:
i915 features:

Attachments

Description Simon McVittie 2017-07-24 14:27:39 UTC

+++ This bug was initially created as a clone of Bug #100344 +++

Following on from Bug #101354, Allison wants to be able to arrange for container instances' servers to appear inside containers in a more elegant way than creating them outside and bind-mounting them in. I already intended to do this, but it is not part of the minimum viable product (Bug #101354).

Design sketch:

The named_parameters a{sv} argument may contain:

    ServerSocket: h
        A socket (fstat() must indicate format S_IFSOCK)
        with SO_DOMAIN = AF_UNIX and SO_TYPE = SOCK_STREAM. The
        container manager will arrange for bind() and listen() to be called
        on this socket so that it is made available inside the container.

        If ServerSocketReadyNotifier is not provided, the container manager
        must already have called bind() and listen() (the SO_ACCEPTCONN socket
        option is 1 and getsockname() returns an address), such that this
        socket is already ready for the message bus to call accept() on it.
        In this case the AddServer() method will return the socket's path
        and D-Bus address as usual.

        If ServerSocketReadyNotifier is provided, then the container manager
        may delay calling bind() and listen() until just before it makes the
        ServerSocketNotifierReadyNotifier poll readable. In this case the
        AddServer() method cannot determine the socket's address, so it
        will return an empty byte-array instead of the socket's absolute
        path, and an empty string instead of its D-Bus address.

    ServerSocketReadyNotifier: h
        The reading end of a pipe or FIFO (format S_IFIFO). The container
        manager will wait for this pipe to poll readable, then close it
        and begin to accept() on the ServerSocket.

        (The container manager should keep the write end of this socket open
        until it has called bind() and listen() on the ServerSocket,
        then close the write end, resulting in the read end polling readable.)

Comment 1 Simon McVittie 2017-07-24 14:43:12 UTC

Implementation sketch:

We can use fstat() to check that the passed socket and pipe are of the types we expect, to avoid callers doing anything crazy with non-socket or non-pipe fds. It might well be a good idea to getsockopt() the socket for SO_DOMAIN and SO_TYPE too - the usual motto of "we can always make it more liberal later".

From socket(7) it looks like we can getsockopt() for SO_ACCEPTCONN (at least on Linux) to check that it's already listening when we want to accept() it. As a safety-catch we probably still want to make sure the dbus-daemon won't busy-loop on an invalid server fd, though.

Open question: can a FUSE filesystem intercept fstat() on an open fd and make it take arbitrarily long, resulting in a DoS on dbus-daemon when it tries to fstat() them? If it can, then this mode of container creation has to be a privileged operation (root or bus owner only).

Comment 2 Philip Withnall 2017-08-01 11:42:53 UTC

(In reply to Simon McVittie from comment #0)
>         If ServerSocketReadyNotifier is provided, then the container manager
>         may delay calling bind() and listen() until just before it makes the
>         ServerSocketNotifierReadyNotifier poll readable. In this case the
>         AddServer() method cannot determine the socket's address, so it
>         will return an empty byte-array instead of the socket's absolute
>         path, and an empty string instead of its D-Bus address.

Typo: s/ServerSocketNotifierReadyNotifier/ServerSocketReadyNotifier/?

Comment 3 David Herrmann 2017-12-19 10:52:54 UTC

(In reply to Simon McVittie from comment #1) 
> Open question: can a FUSE filesystem intercept fstat() on an open fd and
> make it take arbitrarily long, resulting in a DoS on dbus-daemon when it
> tries to fstat() them? If it can, then this mode of container creation has
> to be a privileged operation (root or bus owner only).

Don't use fstat(2).

Assuming you don't trust FUSE, this would be safe:

getsockopt(SOL_SOCKET, SO_DOMAIN) -> PF_UNIX
getsockopt(SOL_SOCKET, SO_TYPE) -> SOCK_STREAM
getsockopt(SOL_SOCKET, SO_ACCEPTCON) -> 1

Once all these return the expected values, you know this is the socket you expect. Note that getsockopt() on SOL_SOCKET cannot be intercepted. It is always handled by the networking-core (on linux..).

While I would argue FUSE should be trusted, or not given access to, this sequence is perfectly safe against FUSE as well. Unlike fstat(2), getsockopt(2) cannot be intercepted by FUSE.

Comment 4 David Herrmann 2017-12-19 11:03:40 UTC

(In reply to Simon McVittie from comment #0)
>     ServerSocketReadyNotifier: h
>         The reading end of a pipe or FIFO (format S_IFIFO). The container
>         manager will wait for this pipe to poll readable, then close it
>         and begin to accept() on the ServerSocket.
> 
>         (The container manager should keep the write end of this socket open
>         until it has called bind() and listen() on the ServerSocket,
>         then close the write end, resulting in the read end polling
> readable.)

Is there a particular reason to put the burden of this interface on dbus-daemon? Why would someone push the socket into dbus-daemon before it is ready to use?

Even in the bubblewrap example, why not put the burden of making this work on bubblewrap and its caller? For instance, if flatpack spawns an app, it could create the listener socket *AND* ready-pipe itself, pass it to bubblewrap. Once bubblewrap closes the ready-pipe, it pushes it into dbus-daemon via this call.

This would avoid extending the dbus-daemon interface, only for the policy to not have dbus in bubblewrap.

Lastly, please note that a trivial implementation of this is racy. Just because bubblewrap closed the ready-notifier socket does not mean dbus-daemon handled that event. On the contrary, dbus-daemon might handle any different event first, even if those fired after the ready-notifier. No priority-ordering can prevent this, since kernel events are not fine-grained enough on stream sockets to protect this kind of ordering.

Worst case, bubblewrap closes a socket, but a subsequent attempt to connect to the socket is still rejected, because dbus-daemon didn't handle the event, yet.
Even if you make dbus-daemon check the condition manually on connection attempts, you are still susceptible to ordering issues with any other operation on the bus, even though those might be unlikely to be triggered.

If we defer this to flatpack and bubblewrap, they can decide themselves how much barriers are needed for their use-case. Worst case, bubblewrap needs a synchronous interface to flatpack to tell it to add the socket and block until that is done (make it two pipes, one for each direction).

Comment 5 Simon McVittie 2017-12-19 11:28:16 UTC

(In reply to David Herrmann from comment #4)
> Is there a particular reason to put the burden of this interface on
> dbus-daemon? Why would someone push the socket into dbus-daemon before it is
> ready to use?

Because in the design I discussed with Allison and Alex, Flatpak creates the socket() and communicates with dbus-daemon, but it's bubblewrap that makes the socket ready to use by calling bind() and listen(), because Allison felt strongly that it should be possible to avoid the the socket ever being bound outside the container.

> Even in the bubblewrap example, why not put the burden of making this work
> on bubblewrap and its caller? For instance, if flatpack spawns an app, it
> could create the listener socket *AND* ready-pipe itself, pass it to
> bubblewrap. Once bubblewrap closes the ready-pipe, it pushes it into
> dbus-daemon via this call.

By the time the socket becomes ready, the `flatpak run` process has already gone away (it execs bubblewrap, rather than doing a fork-and-exec, and having the parent wait for bubblewrap to exit then exit itself).

bubblewrap can't do non-trivial IPC to dbus-daemon, because bubblewrap is setuid root on some systems, so making it use non-trivial libraries is dangerous.

> Lastly, please note that a trivial implementation of this is racy. Just
> because bubblewrap closed the ready-notifier socket does not mean
> dbus-daemon handled that event. On the contrary, dbus-daemon might handle
> any different event first, even if those fired after the ready-notifier. No
> priority-ordering can prevent this, since kernel events are not fine-grained
> enough on stream sockets to protect this kind of ordering.

As long as dbus-daemon doesn't start trying to accept() until bubblewrap has called bind() and listen(), everything's fine? It doesn't matter if there's a small delay during which the sandboxed app can connect() to the listening socket, but will block until the dbus-daemon wakes up and starts accept()ing.

> Worst case, bubblewrap closes a socket, but a subsequent attempt to connect
> to the socket is still rejected, because dbus-daemon didn't handle the
> event, yet.

As long as bubblewrap calls listen() before it closes the ready-notifier, I don't see how this can happen? If the dbus-daemon isn't accept()ing just yet, won't the client just block?

Comment 6 Simon McVittie 2017-12-19 11:30:04 UTC

(In reply to Simon McVittie from comment #0)
>     ServerSocketReadyNotifier: h
>         The reading end of a pipe or FIFO (format S_IFIFO). The container
>         manager will wait for this pipe to poll readable, then close it
>         and begin to accept() on the ServerSocket.
> 
>         (The container manager should keep the write end of this socket open
>         until it has called bind() and listen() on the ServerSocket,
>         then close the write end, resulting in the read end polling
> readable.)

Sorry, that first paragraph should have said: The *message bus* will wait for...

The message bus is dbus-daemon or equivalent, the container manager is the combination of Flatpak and bwrap (or eventually Snap or Firejail, if they want to use this interface).

Comment 7 Simon McVittie 2017-12-19 12:47:25 UTC

(In reply to David Herrmann from comment #3)
> (In reply to Simon McVittie from comment #1) 
> > Open question: can a FUSE filesystem intercept fstat() on an open fd?
> 
> Don't use fstat(2).
> 
> Assuming you don't trust FUSE, this would be safe:
> 
> getsockopt(SOL_SOCKET, SO_DOMAIN) -> PF_UNIX
> getsockopt(SOL_SOCKET, SO_TYPE) -> SOCK_STREAM
> getsockopt(SOL_SOCKET, SO_ACCEPTCON) -> 1

In that case we'd need to use a socket pair, not a pipe (or eventfd), but that seems fine. We could shutdown() one of the two directions so that it behaves more like a pipe.

Comment 8 David Herrmann 2017-12-19 18:09:46 UTC

(In reply to Simon McVittie from comment #7)
> (In reply to David Herrmann from comment #3)
> > (In reply to Simon McVittie from comment #1) 
> > > Open question: can a FUSE filesystem intercept fstat() on an open fd?
> > 
> > Don't use fstat(2).
> > 
> > Assuming you don't trust FUSE, this would be safe:
> > 
> > getsockopt(SOL_SOCKET, SO_DOMAIN) -> PF_UNIX
> > getsockopt(SOL_SOCKET, SO_TYPE) -> SOCK_STREAM
> > getsockopt(SOL_SOCKET, SO_ACCEPTCON) -> 1
> 
> In that case we'd need to use a socket pair, not a pipe (or eventfd), but
> that seems fine. We could shutdown() one of the two directions so that it
> behaves more like a pipe.

This was meant regarding the listener-socket. For other types it really depends what is needed:

1) If you use a pipe as notifier and only wait for POLLHUP, you can straight out do that without verifying *anything*. close(2) is never synchronous on FUSE, neither is poll(2). So you can just take any FD without verification. Note that there is no guarantee that your FD is linked to anything, so you might be the only one holding a reference. As long as the FD is accounted and bound to a dbus-connection, this _should_ be safe as its lifetime is bound (but see below...).

2) If you want eventfd semantics and use its counter as notification mechanism, I am afraid there might be no nice way to check it. The only safe option is to read /proc/self/fdinfo/<num> and see whether it says `eventfd-count: <num>`

While at it: Whenever you accept file-descriptors without verifying its type, you must make sure they're considered inflight, just like message payloads. Because again, it might be an AF_UNIX socket that has another socket queued recursively, and you keep it alive by pinning it.

For AF_UNIX-*listener* sockets this is safe. For anything else... well... depends on how safe your accounting is, and what kind of FD you're dealing with.

I hope that helps. I briefly verified this all on the current kernel sources. If we settle on one technique, I can verify it properly again, just to be sure.

Comment 9 David Herrmann 2017-12-19 18:23:08 UTC

(In reply to Simon McVittie from comment #5)
> (In reply to David Herrmann from comment #4)
> > Is there a particular reason to put the burden of this interface on
> > dbus-daemon? Why would someone push the socket into dbus-daemon before it is
> > ready to use?
> 
> Because in the design I discussed with Allison and Alex, Flatpak creates the
> socket() and communicates with dbus-daemon, but it's bubblewrap that makes
> the socket ready to use by calling bind() and listen(), because Allison felt
> strongly that it should be possible to avoid the the socket ever being bound
> outside the container.

Side-note: On linux, you can pin an unlinked directory via an O_PATH fd, and then create AF_UNIX sockets in that directory by specifying /proc/self/fd/<O_PATH-fd>/<file-path-of-choice>. That is, this is kinda the directory equivalent of creating anonymous files via memfd_create().

With this in mind, you can simply bind the socket in the flatpak controller without it ever being visible to anyone but you.

> > Even in the bubblewrap example, why not put the burden of making this work
> > on bubblewrap and its caller? For instance, if flatpack spawns an app, it
> > could create the listener socket *AND* ready-pipe itself, pass it to
> > bubblewrap. Once bubblewrap closes the ready-pipe, it pushes it into
> > dbus-daemon via this call.
> 
> By the time the socket becomes ready, the `flatpak run` process has already
> gone away (it execs bubblewrap, rather than doing a fork-and-exec, and
> having the parent wait for bubblewrap to exit then exit itself).
> 
> bubblewrap can't do non-trivial IPC to dbus-daemon, because bubblewrap is
> setuid root on some systems, so making it use non-trivial libraries is
> dangerous.

Why can't flatpak be changed to wait for bubblewrap to exit and then call into dbus-daemon? (Or wait for bubblewrap to signal something, rather than exit.)

I mean, we're designing spec extensions here based on implementation restrictions in one user.

Maybe I am misunderstanding something, but the two options are this:

1) Change flatpak to do a fork-and-exec on bubblewrap, wait until its setup is done, and then call into dbus-daemon.

2) Make dbus-daemon support inhibitors that delay activation of queued containers (like proposed here).

The second one requires extending the D-Bus spec and support in every message bus, while the former would be restricted to an implementation detail of flatpak.

This is not a major issue, so I don't want to block this. But for what it's worth, I would vote for option 1).

> > Lastly, please note that a trivial implementation of this is racy. Just
> > because bubblewrap closed the ready-notifier socket does not mean
> > dbus-daemon handled that event. On the contrary, dbus-daemon might handle
> > any different event first, even if those fired after the ready-notifier. No
> > priority-ordering can prevent this, since kernel events are not fine-grained
> > enough on stream sockets to protect this kind of ordering.
> 
> As long as dbus-daemon doesn't start trying to accept() until bubblewrap has
> called bind() and listen(), everything's fine? It doesn't matter if there's
> a small delay during which the sandboxed app can connect() to the listening
> socket, but will block until the dbus-daemon wakes up and starts accept()ing.

Right, I missed that part, sorry! This solves the connection-issue.

Also, as long as closing the notifier is solely interpreted as barrier for accept(2) to be allowed, I don't see any race.

Sorry, I was missing that the kernel queues your connection attempts just fine.

Comment 10 Allison Lortie (desrt) 2018-05-14 09:03:26 UTC

hi,

(In reply to David Herrmann from comment #9)
> Side-note: On linux, you can pin an unlinked directory via an O_PATH fd, and
> then create AF_UNIX sockets in that directory by specifying
> /proc/self/fd/<O_PATH-fd>/<file-path-of-choice>. That is, this is kinda the
> directory equivalent of creating anonymous files via memfd_create().
> 
> With this in mind, you can simply bind the socket in the flatpak controller
> without it ever being visible to anyone but you.

This was probably something like the original idea we had but it didn't work because the flow of fds was backwards: the way flatpak's invocation of bwrap currently works is that all fds have to flow in.  None can come out.  That prevents creating a directory in the new namespace and passing it back out, unless we pass in a unix socket for the sole purpose of passing out a directory.  That was considered too, and deemed too ugly.  It may be worth reconsidering this approach because it would give us the "lifetime" fd for free, but it is really a *very* strange API.

> I mean, we're designing spec extensions here based on implementation
> restrictions in one user.

This is exactly what is happening, indeed, and I share your concerns that we are making D-Bus excessively complicated to deal with something that should be better dealt with in flatpak/bwrap.  Finding a solution to this problem there may also enable other useful "pass back" applications other than for D-Bus.

I've never been a fan of the separate fd to signal readiness.

I had a quick chat with Alex today and the upshot of it is that he's going to try to implement a feature in bwrap to pass in a unix socket that can be used for "output" fds.  Then we could easily add bwrap features to (for example) create a directory and output its fd, or create/bind/listen a socket and output its fd.  With the latter, our D-Bus call could be quite simple: here's a socket, please start poll()/accept() on it.

Comment 11 Simon McVittie 2018-05-16 14:20:24 UTC

(In reply to Simon McVittie from comment #0)
> Following on from Bug #101354, Allison wants to be able to arrange for
> container instances' servers to appear inside containers in a more elegant
> way than creating them outside and bind-mounting them in

Taking a step back from the implementation details of how this works: Allison, please could you clarify why you consider this to be important?

The current implementation, both in flatpak-dbus-proxy and in dbus-daemon, is that the socket is created and bound in the host system's mount namespace, then bind-mounted into the container's mount namespace. What's wrong with this? Is it just that you consider the current implementation to be inelegant, or is there a functional problem?

Our privilege model is (I think) that anything with unrestricted filesystem access to the host system is trusted, because it could equally well connect to the real (unrestricted) listening dbus-daemon socket, which among other things lets it create its own container sockets with metadata of its choice; so I don't think there's any security problem with what we do at the moment.

I'm concerned that we might be inventing new complexity (either in Flatpak, dbus-daemon or bwrap, depending on design decisions made here) for a relatively marginal benefit.

Comment 12 Alexander Larsson 2018-05-16 14:57:23 UTC

Ok, so in flatpak I managed to avoid the contortions with ServerSocketReadyNotifier, etc. In this commit:

 https://github.com/flatpak/flatpak/pull/1682

Basically it creates the socket in flatpak run, passes it to bwrap which binds and listens the socket, then it signals back to flatpak and that then starts the dbus-proxy helper with the sockets.

The same could be done but instead of spawning the dbus-proxy, spawn something that sets up the dbus container.

The setup is a bit weird, because the main flatpak pid execs bwrap, so we have to pick up the "sockets are ready" feedback in a fork:ed child. This means the child has to be very careful and only do signal-async-safe code, but exec:ing a helper is quite doable.

This means that the dbus api can be a lot simpler. Just pass in a socket fd that is already bind()+listen(), and just accept clients on it.

Comment 13 Alexander Larsson 2018-05-18 15:35:22 UTC

Given the contortions required for the little gain, i've decided to drop pursuing fd-passing-based proxy creation in flatpak. I think we should similarly drop this.

Comment 14 GitLab Migration User 2018-10-12 21:31:35 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/dbus/dbus/issues/184.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.