Bug 68217

Summary: systemctl times out under load
Product: systemd
Reporter: Petr Ročkai <me>
Component: general
Assignee: systemd-bugs
Status: RESOLVED WONTFIX
QA Contact: systemd-bugs
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: All   
OS: Linux (All)   

Description Petr Ročkai 2013-08-17 13:16:32 UTC
On a system under load (possibly due to a runaway service, in which case this bug becomes more or less critical), systemctl will fail with various errors:

Failed to issue method call: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Failed to get D-Bus connection: Failed to authenticate in time.

(the first message sometimes appears even though the call did actually succeed)
Comment 1 Lennart Poettering 2013-09-12 17:13:38 UTC
Well, if the machine is busy things time out. That's hardly surprising?
Comment 2 Petr Ročkai 2013-09-12 17:54:01 UTC
It is quite surprising. Normally, tasks will finish under comparable load. In fact, almost everything (other than systemctl) works fine, albeit very slowly. I don't see a reason why systemctl should stop working under load, considering that it's a relatively important system management command.
Comment 3 Lennart Poettering 2013-09-12 18:42:17 UTC
Hmm, so is this the same case as https://bugs.freedesktop.org/show_bug.cgi?id=68232 ?

If so, then this is not surprising. Basically, while PID 1 is cleaning up the private tmp dir, it no longer runs the event loop and so no longer dispatches bus requests, which means they time out.

This is really bad design on systemd's side. We really shouldn't do possibly unbounded IO from PID 1 I guess, blocking execution otherwise...
Comment 4 Zbigniew Jedrzejewski-Szmek 2013-09-13 00:18:58 UTC

*** This bug has been marked as a duplicate of bug 68232 ***
Comment 5 Petr Ročkai 2013-09-13 09:38:22 UTC
No, it's not the same as 68232. Any load will make it impossible to talk to systemd. Everything still works, you can also create sessions, but anything "systemctl" will die with a DBUS timeout. This is clearly a separate issue.
Comment 6 Petr Ročkai 2013-09-13 09:43:12 UTC
In fact, it will sometimes work, but it will usually fail, depending on how processes get scheduled. But that means that you need to watch it and re-start commands that fail. Really boring.
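The manual "watch it and re-start" routine described above could be scripted roughly like this. It is only a sketch of a workaround, not a fix for the timeout: the function name retry and its two numeric parameters are illustrative and not part of systemd. Since the "Did not receive a reply" message sometimes appears even though the call succeeded, blind retries are only reasonable for commands that are safe to run twice.

```shell
#!/bin/sh
# Sketch: re-run a command until it succeeds or the attempts run out.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# "retry" and its parameters are hypothetical helpers, not a systemd feature.
retry() {
    max="$1"; delay="$2"; shift 2
    attempt=1
    rc=1
    while [ "$attempt" -le "$max" ]; do
        rc=0
        "$@" || rc=$?              # run the command, remember its exit status
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt/$max failed (exit $rc)" >&2
        attempt=$((attempt + 1))
        # Sleep between attempts, but not after the final one.
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    return "$rc"                   # exit status of the last failed attempt
}
```

For example, `retry 5 10 systemctl restart example.service` would try up to five times with ten seconds between attempts (example.service is a placeholder unit name).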
Comment 7 Zbigniew Jedrzejewski-Szmek 2013-09-17 15:40:03 UTC
So if it's not related to systemd getting stuck in tmpdir removal, then it's a bit unexpected. How many units do you have?

Output of perf report could help us understand what's going on. Can you start 'perf record -g -p 1' before the systemctl call is started, and afterwards ^C perf, and attach the output of 'perf report --sort symbol -g fractal,5'. Please make sure that you have the debugging symbols available.
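The profiling steps requested above could be wrapped in a small script along these lines. This is a sketch under assumptions: the function name profile_pid1, the OUT variable and its /tmp/pid1-perf default are made up for illustration. The perf record and perf report options are the ones quoted above, plus perf's -o and -i options for naming the data file. It needs root, and the SIGINT sent to perf stands in for pressing ^C.

```shell
#!/bin/sh
# Sketch: record a profile of PID 1 while one command runs, then write the report.
# Usage: profile_pid1 COMMAND [ARGS...]
# profile_pid1 and OUT are hypothetical names; run as root with perf installed.
profile_pid1() {
    out="${OUT:-/tmp/pid1-perf}"
    # Start recording before the systemctl call is issued.
    perf record -g -p 1 -o "$out.data" &
    perf_pid=$!
    sleep 1                        # give perf a moment to attach
    rc=0
    "$@" || rc=$?                  # the call under test, e.g. systemctl status
    kill -INT "$perf_pid"          # stands in for ^C
    wait "$perf_pid"
    perf report -i "$out.data" --sort symbol -g fractal,5 > "$out.report.txt"
    return "$rc"
}
```

Calling `profile_pid1 systemctl status` would leave the report in /tmp/pid1-perf.report.txt, ready to attach to the bug.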
Comment 8 Petr Ročkai 2013-09-17 15:58:38 UTC
Sigh. I already said it twice, but maybe it will work the third time? The problem is that there is a timeout. Neither systemd nor systemctl is doing any significant work. If the system is under load, it may take a long while for the right processes to get to the CPU in the right order. That's all. You can't fix that; it's completely external. All I am asking is that there be no timeout. If not having a timeout is not an option (say, because of a dbus restriction), then making the timeout unrealistically long might be sufficient (say, 8h). As things are, I can hit a timeout under realistic conditions.
Comment 9 Zbigniew Jedrzejewski-Szmek 2013-09-17 16:33:09 UTC
(In reply to comment #8)

The timeout is quite large already (30s ?). And it is used in various conditions, e.g. during shutdown when dbus is already down. Removing the timeout would mean that things block instead of failing more gracefully. How loaded is your machine when systemctl dies with a timeout?
Comment 10 Petr Ročkai 2013-09-17 17:59:52 UTC
Load average is between say 20 and 100 when things start to go awry, not sure when exactly, I don't normally trigger high loads willfully. But these things do happen. I suspect it may be more a function of (un)available physical memory, though. Can the timeout for interactive systemctl be different from other (shutdown-time) timeouts? (And when I say interactive, I also mean running from scripts, where this is even more of a problem, i.e. not based on a controlling tty). It'd be less of a problem if there was another way (more direct than dbus) to reach systemd, but I didn't find anything.
Comment 11 Zbigniew Jedrzejewski-Szmek 2013-11-06 15:14:24 UTC
http://cgit.freedesktop.org/systemd/systemd/commit/?id=1f19a5 made the timeout configurable.
Comment 12 Petr Ročkai 2013-11-06 18:34:17 UTC
I am pretty sure these are not the timeouts I am seeing. My concern is the timeout when systemctl is talking to dbus, not anything directly related to units. Also, the change to man/systemd.mount.xml in the patch you refer to has a small mistake in it: the past participle of "set" is "set", not "setted".
Comment 13 Zbigniew Jedrzejewski-Szmek 2018-03-09 08:00:44 UTC
Closing all stale bugs with NEEDINFO. Please open a new bug at https://github.com/systemd/systemd/issues if the problem still occurs.
