Bug 89847 - dbus.service stops too early during shutdown, breaking all running D-Bus clients
Summary: dbus.service stops too early during shutdown, breaking all running D-Bus clients
Status: RESOLVED NOTOURBUG
Alias: None
Product: dbus
Classification: Unclassified
Component: core (show other bugs)
Version: git master
Hardware: Other All
: medium normal
Assignee: D-Bus Maintainers
QA Contact: D-Bus Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-31 16:44 UTC by Martin Pitt
Modified: 2016-10-14 11:49 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
system bus: do not allow stopping the system dbus-daemon (1.95 KB, patch)
2015-04-02 13:04 UTC, Simon McVittie
Details | Splinter Review

Description Martin Pitt 2015-03-31 16:44:49 UTC
Downstream in Ubuntu we have gotten several reports of shutdown hangs due to remote file systems trying to unmount when the wifi is already down, and simlar issues. In journals [1][2][3] you see things like


systemd[1]: Stopping D-Bus System Message Bus...
avahi-daemon[694]: Disconnected from D-Bus, exiting.
whoopsie[685]: g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.
NetworkManager[689]: g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.
wpa_supplicant[1246]: wlan0: CTRL-EVENT-DISCONNECTED bssid=00:0f:66:57:07:7c reason=3 locally_generated=1
NetworkManager[689]: <info> (wlan0): roamed from BSSID 00:0F:66:57:07:7C (RedKite) to (none) ((none))
kernel: CIFS VFS: Server 192.168.1.93 has not responded in 120 seconds. Reconnecting...
systemd[1]: Unmounted <some remote file system path> → this is usually timing out as the network is not reachable any more
Stopped target Remote File Systems.
systemd[1]: Stopping Remote File Systems.
systemd[1]: Stopped target Remote File Systems (Pre).
systemd[1]: Stopping Remote File Systems (Pre).
systemd[1]: Stopping Network Manager...
systemd[1]: Stopped Network Manager.
systemd[1]: Stopping D-Bus System Message Bus Socket.

The main problem is that there is nothing that prevents D-Bus from stopping very early, way earlier than all of the Type=dbus services. There is an attempt to prevent that as systemd implies "After=dbus.socket" for Type=dbus units, but that doesn't save us: you can't re-start D-Bus after shutting it down and expect things to just work, as it loses all of its state. And even if that would work, it's rather pointless (and slow) to try and shut it down early just to have it reactivate several times during shutdown. So in short, I think dbus.socket doesn't really help us with anything except not having to specify precise startup requirements; but we still need them for shutdown.

So what we really need is some way to say "Don't stop dbus.service before any D-Bus client" (i. e. *.service of Type=dbus). We could make dbus.service start before basic.target and stop after basic.target and add DefaultDependencies=no, so that it stops after most stuff; or change the implied After=dbus.socket in systemd to After=dbus.service. But that would also be prone to causing cyclic dependencies, if you e. g. have /var on NFS and need it to start dbus. (I didn't test that, it's just a gut feeling as remote file systems that you need to boot are notoriously susceptible to dependency loops.)

Michael Biebl came up with a crazy, but neat idea for a workaround:

  ExecStop=/bin/true
  KillMode=none

i. e. don't actually stop dbus through the unit, and just let it get killed at the bitter end. But on second thought it's actually quite tempting: Stopping/restarting D-Bus during runtime has never been supported, and always caused your desktop session and half of your services to get killed, so in the distro we went out of our way to prevent maintainer scripts etc. from doing that. So it's at least a nice initial hack which isn't introducing dependency loops. But I figure we should at least discuss a cleaner solution.

Of course this will all be obsolete with kdbus, as then dbus always works. (Incidentally, not ever stopping it gets us rather close in that regard :-) ).

[1] https://launchpadlibrarian.net/201680859/journal.txt
[2] http://paste.ubuntu.com/10711795/
[3] https://launchpadlibrarian.net/200211036/journal.txt
Comment 1 Michael Biebl 2015-03-31 21:08:39 UTC
ExecStop=/bin/true was my first idea, but Martin rightfully pointed out, that this doesn't influence KillMode, i.e. we need KillMode=none.
With that KillMode setting, I think we can actually drop ExecStop=/bin/true, so what remains is

KillMode=none

systemd has a final killing spree, before it unmounts the file systems and I don't think dbus needs to do something special on stop?

We might have a problem, if /usr is on NFS and (at least on Debian) dbus-daemon being installed in /usr/bin, which would keep the FS busy.
Comment 2 Martin Pitt 2015-04-01 09:05:32 UTC
(In reply to Michael Biebl from comment #1)
> ExecStop=/bin/true was my first idea, but Martin rightfully pointed out,
> that this doesn't influence KillMode, i.e. we need KillMode=none.
> With that KillMode setting, I think we can actually drop ExecStop=/bin/true,
> so what remains is
> 
> KillMode=none

This doesn't actually work in my tests, the ExecStop= is needed (tested with systemd 219).

I was quite surprised that this fixed a lot more than just the hangs on unmounting network file systems. I also got reports that other vague shutdown hangs now disappeared (https://launchpad.net/bugs/1427672) and that shutdown hangs with libvirt's virbr0-nic interfaces are now gone as well.
Comment 3 Simon McVittie 2015-04-01 11:37:27 UTC
(In reply to Michael Biebl from comment #1)
> We might have a problem, if /usr is on NFS and (at least on Debian)
> dbus-daemon being installed in /usr/bin, which would keep the FS busy.

If dbus-daemon really badly needs to be moved to the rootfs, then it can be... but in Debian, some libraries need to move first before that becomes useful (libcap-ng0, libapparmor1).

If this is something we support upstream then I'd like to do it by introducing a ${rootbindir} like systemd has, rather than moving binaries around and patching systemd units etc. to compensate for it like Ubuntu has traditionally done. But I would prefer not to have to support it tbh.

(Also, is there any reason to prefer NFS /usr that doesn't apply equally to preferring NFS rootfs?)

(In reply to Martin Pitt from comment #0)
> So what we really need is some way to say "Don't stop dbus.service before
> any D-Bus client" (i. e. *.service of Type=dbus). We could make dbus.service
> start before basic.target and stop after basic.target and add
> DefaultDependencies=no, so that it stops after most stuff; or change the
> implied After=dbus.socket in systemd to After=dbus.service. But that would
> also be prone to causing cyclic dependencies, if you e. g. have /var on NFS
> and need it to start dbus. (I didn't test that, it's just a gut feeling as
> remote file systems that you need to boot are notoriously susceptible to
> dependency loops.)

The fewer constraints we have to apply to startup, the better, to make it more likely that dependency loops can be avoided in arbitrarily weird situations (NFS /usr and/or /var, or networking infrastructure that needs D-Bus - you can have one or the other but you cannot have both). So I would be OK with giving dbus.service DefaultDependencies=no, but I would not necessarily encourage requiring it to start before basic.target.

If we give it DefaultDependencies=no and RequiresMountsFor=/usr /var /proc, but do not order it either before or after basic.target, would that help? Then we'd get:

startup: /usr and /var must be mounted before the system dbus-daemon starts

shutdown: dbus-daemon must stop before /usr and /var are unmounted

Would not having the implicit Conflicts on shutdown.target mean that dbus-daemon would survive the initial shutdown transaction and only be killed when systemd got as far as unmounting the filesystems, or does it schedule everything required for shutdown, including the unmounting, as a single large transaction?

(In reply to Michael Biebl from comment #1)
> systemd has a final killing spree, before it unmounts the file systems and I
> don't think dbus needs to do something special on stop?

It does not need to do anything special on stop. Is this killing spree suffiently early that it happens before attempting to unmount /usr?
Comment 4 Martin Pitt 2015-04-02 08:38:10 UTC
Hello Simon,

(In reply to Simon McVittie from comment #3)
> (Also, is there any reason to prefer NFS /usr that doesn't apply equally to
> preferring NFS rootfs?)

I'm not sure if root on NFS was ever attempted/supported. You'd basically need half an OS in your initramfs then? :-)

> The fewer constraints we have to apply to startup, the better, to make it
> more likely that dependency loops can be avoided in arbitrarily weird
> situations (NFS /usr and/or /var, or networking infrastructure that needs
> D-Bus - you can have one or the other but you cannot have both). So I would
> be OK with giving dbus.service DefaultDependencies=no, but I would not
> necessarily encourage requiring it to start before basic.target.

Why not before basic.target? This would make it a basic system service which everyone could rely on. It's already kind of that through dbus.socket? It could lead to a more serialized boot of course, as non-dbus services would have to wait for dbus.service then.

> If we give it DefaultDependencies=no and RequiresMountsFor=/usr /var /proc,
> but do not order it either before or after basic.target, would that help?

We don't need /proc, that's always available. "/usr /var" sounds like a good idea. I'd additionally have After=local-fs.target to ensure that you have a r/w root partition; or RequiresMountsFor=/ /usr /var.

But any kind of dependency that we add here (and we have to) will lead to dbus.service being stopped at some time before that dependency/mount gets shut down, and then we are back to the situation that d-bus stops too early. In my experiments it again got stopped before any Type=dbus dependency, just because nothing else was directly depending on it. I have to use Before=basic.target so that it survives long enough during shutdown for all multi-user.target-like services to go down.

> Then we'd get:
> 
> startup: /usr and /var must be mounted before the system dbus-daemon starts
> 
> shutdown: dbus-daemon must stop before /usr and /var are unmounted

That's required indeed, but not sufficient. We also need to tell it to stop after consumers of dbus.service (which is really the whole point of this exercise).

> Would not having the implicit Conflicts on shutdown.target mean that
> dbus-daemon would survive the initial shutdown transaction and only be
> killed when systemd got as far as unmounting the filesystems

Unfortunately not, see above. This only works if you don't specify *any* dependency, but that's obviously not going to work for boot.

> (In reply to Michael Biebl from comment #1)
> > systemd has a final killing spree, before it unmounts the file systems and I
> > don't think dbus needs to do something special on stop?
> 
> It does not need to do anything special on stop. Is this killing spree
> suffiently early that it happens before attempting to unmount /usr?

It should better be, otherwise every random running process that the user and system left behind would block the unmounting. But I don't see how we can actually get to the point where dbus.service is never actually explicitly stopped; and we don't really need to, as long as stopping it happens sufficiently late.

In my initial experiments this seems to work well:

DefaultDependencies=no
After=local-fs.target
RequiresMountsFor=/usr /var
Before=basic.target network.target

Note that I didn't test this yet with a remote /var (or /usr).
Comment 5 Simon McVittie 2015-04-02 10:32:00 UTC
(In reply to Martin Pitt from comment #4)
> I'm not sure if root on NFS was ever attempted/supported. You'd basically
> need half an OS in your initramfs then? :-)

Yes it is/was, with or without an initramfs:

https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt
https://anonscm.debian.org/cgit/kernel/initramfs-tools.git/tree/scripts/nfs

> > The fewer constraints we have to apply to startup, the better [...]
> > So I would
> > be OK with giving dbus.service DefaultDependencies=no, but I would not
> > necessarily encourage requiring it to start before basic.target.
> 
> Why not before basic.target? This would make it a basic system service which
> everyone could rely on. It's already kind of that through dbus.socket? It
> could lead to a more serialized boot of course, as non-dbus services would
> have to wait for dbus.service then.

If dbus.service *on a particular system configuration* is After something that is After basic.target, and basic.target is After dbus.service, then we have a cyclic dependency on that system configuration. (This is regardless of whether it works without cyclic dependencies for the relevant distribution's dbus maintainers, who presumably know enough to avoid bizarre boot orderings.)

Notable examples of things that dbus might need:

* NIS and other sources of users
* networking and other non-trivial infrastructure to get /usr

Debian and Ubuntu have traditionally run dbus as a rc2.d service, so there is nothing to stop it having such dependencies.
Comment 6 Simon McVittie 2015-04-02 13:04:15 UTC
Created attachment 114829 [details] [review]
system bus: do not allow stopping the system dbus-daemon

There is nothing that prevents D-Bus from stopping very early,
way earlier than all of the Type=dbus services. There is an
attempt to prevent that as systemd implies "After=dbus.socket"
for Type=dbus units, but that doesn't save us: you can't re-start
D-Bus after shutting it down and expect things to just work, as
it loses all of its state.

Putting dbus.service Before basic.target was considered and rejected,
because it's a recipe for cyclic dependencies: as soon as
dbus.service wants to be After anything that is After basic.target
(e.g. NIS and other user databases, or remote filesystems) you
get a cycle. dbus-daemon does not need to start early, it only
needs to stop late.

systemd has a final killing spree before it unmounts the
file systems, which should be sufficient to avoid dbus-daemon
preventing a separate /usr from being unmounted; this does not
consider KillMode. dbus-daemon doesn't need to do anything special
during shutdown, so it's OK that it survives until then.

Based on a suggestion from Michael Biebl and Martin Pitt.

---

How's this? I made the stop command be "echo" instead of "true" so that it leaves a hint in systemctl status if someone tries to stop it by hand.

Unfortunately, "systemctl restart dbus" (which was never supported either) will now start a second dbus-daemon in parallel with the first, and in my testing, the second one will get all new connections. Any ideas for how to avoid that? Perhaps it would be better to make the stop command exit nonzero? ... but then we'd log scary messages during a normal shutdown, which is no better really.
Comment 7 Simon McVittie 2015-04-02 14:09:23 UTC
(In reply to Simon McVittie from comment #6)
> Perhaps it would be better to make the stop command exit
> nonzero?

Straw man:

ExecStop=/bin/sh -c "echo Stopping the system dbus-daemon is not supported. Reboot the system instead.; exit 1"

... which does work, but logs "Unit dbus.service entered failed state" during shutdown. It seems better to avoid "crying wolf" if possible, because it can hide real issues.
Comment 8 Martin Pitt 2015-04-02 14:22:04 UTC
(In reply to Simon McVittie from comment #6)
> How's this? I made the stop command be "echo" instead of "true" so that it
> leaves a hint in systemctl status if someone tries to stop it by hand.

Ah, good one.
 
> Unfortunately, "systemctl restart dbus" (which was never supported either)
> will now start a second dbus-daemon in parallel with the first, and in my
> testing, the second one will get all new connections. Any ideas for how to
> avoid that? Perhaps it would be better to make the stop command exit
> nonzero? ... but then we'd log scary messages during a normal shutdown,
> which is no better really.

Right, I don't like making it fail on stop. I don't see anything explicit which would declare "cannot restart"; I haven't tested this (travelling/no real computer), but would something like 

    ConditionPathExists=!/run/dbus/system_bus_socket

prevent further starts/restarts? I'm not sure if conditions are evaluated on start only, or on restart too -- my gut feeling is the former, i. e. that won't work.

Thanks!

BTW, this is being discussed on the upstream ML too:

http://lists.freedesktop.org/archives/systemd-devel/2015-April/030070.html

(I haven't caught up with that yet)
Comment 9 Simon McVittie 2015-04-02 15:25:29 UTC
(In reply to Martin Pitt from comment #8)
> I don't see anything explicit
> which would declare "cannot restart"; I haven't tested this (travelling/no
> real computer), but would something like 
> 
>     ConditionPathExists=!/run/dbus/system_bus_socket
> 
> prevent further starts/restarts?

Good idea, I'll try it. systemd guarantees that /run is a tmpfs, so the "what if I have stale misc lying around in /var/run?" question that we would have had under sysvinit and possibly Upstart does not exist.

> BTW, this is being discussed on the upstream ML too:
> 
> http://lists.freedesktop.org/archives/systemd-devel/2015-April/030070.html

I am aware, and have replied there.
Comment 10 Simon McVittie 2015-04-02 15:32:13 UTC
(In reply to Martin Pitt from comment #8)
>     ConditionPathExists=!/run/dbus/system_bus_socket

That can't be suitable, because dbus.socket creates that filesystem object, so dbus-daemon would never start.

Removing --nopidfile and adding ConditionPathExists=!/run/dbus/pid in addition to the patch I posted above (for upstream consumption that would be @DBUS_SYSTEM_PID_FILE@) does work.
Comment 11 Michael Biebl 2015-04-02 19:33:36 UTC
(In reply to Simon McVittie from comment #7)
> (In reply to Simon McVittie from comment #6)
> > Perhaps it would be better to make the stop command exit
> > nonzero?
> 
> Straw man:
> 
> ExecStop=/bin/sh -c "echo Stopping the system dbus-daemon is not supported.
> Reboot the system instead.; exit 1"
> 
> ... which does work, but logs "Unit dbus.service entered failed state"
> during shutdown. It seems better to avoid "crying wolf" if possible, because
> it can hide real issues.


Spawning a shell (which will burn more cpu cycles then a simple /bin/true), this will also trigger a failure on every shutdown/reboot.
Do we really want this?
Comment 12 Lennart Poettering 2015-04-03 12:17:12 UTC
Please do not apply that ExecStop= thing. You really shouldn't block that. Think about people who boot their system with "emergency" on the kernel cmdline, and thus get a PID 1 plus a shell and nothing else, they should be able to start dbus and stop it as they wish... 

If at all, use RefuseManualStop=yes on the unit, but I don't like that much either. It's mostly relevant for things like audit where the kernel might stop if the service is shut down.

The best way is to fix the few services that really need dbus unconditionally to be around to add After=dbus.service. And all is good.
Comment 13 Martin Pitt 2015-04-04 06:43:02 UTC
(In reply to Lennart Poettering from comment #12)
> The best way is to fix the few services that really need dbus
> unconditionally to be around to add After=dbus.service. And all is good.

If we go with that approach, then it would IMHO be cleaner to change the implied "After=dbus.socket" for Type=dbus units to "After=dbus.service", for the non-kdbus case at least.
Comment 14 Lennart Poettering 2015-04-07 15:02:04 UTC
(In reply to Martin Pitt from comment #13)
> (In reply to Lennart Poettering from comment #12)
> > The best way is to fix the few services that really need dbus
> > unconditionally to be around to add After=dbus.service. And all is good.
> 
> If we go with that approach, then it would IMHO be cleaner to change the
> implied "After=dbus.socket" for Type=dbus units to "After=dbus.service", for
> the non-kdbus case at least.

Nah, I am pretty sure this should be fixed in the few services that need this, rather than adding these otherwise unnecessary synchronization points to all services, including the ones that don't need it.
Comment 15 Simon McVittie 2015-05-13 14:10:15 UTC
(In reply to Simon McVittie from comment #6)
> Unfortunately, "systemctl restart dbus" (which was never supported either)
> will now start a second dbus-daemon in parallel with the first

I think that's unacceptable.

(In reply to Lennart Poettering from comment #12)
> If at all, use RefuseManualStop=yes on the unit, but I don't like that much
> either.

I don't think this would satisfy the original requirement unless the ExecStop=/bin/true was also included, and you've said that that's undesired.

> Please do not apply that ExecStop= thing. You really shouldn't block that.

Fair enough. NOTOURBUG then.

> The best way is to fix the few services that really need dbus
> unconditionally to be around to add After=dbus.service. And all is good.

OK. In the case of the original bug report, I think this means wpasupplicant and maybe NetworkManager would have to either turn off exit-on-disconnect (for libdbus and/or GDBus, depending which they use), or be After=dbus.service.
Comment 16 Martin Pitt 2016-10-14 11:49:02 UTC
Note that bug 98254 alleviates this a fair bit -- with that, dbus.service can stop later.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.