Bug 101858

Summary: recent libexpat versions can make dbus-daemon hang waiting for entropy
Product: dbus Reporter: Simon McVittie <smcv>
Component: coreAssignee: Simon McVittie <smcv>
Status: RESOLVED FIXED QA Contact: D-Bus Maintainers <dbus>
Severity: normal    
Priority: medium CC: hewitt
Version: git masterKeywords: patch
Hardware: Other   
OS: All   
URL: https://github.com/smcv/dbus/commit/101858-expat-entropy
Whiteboard: review+
i915 platform: i915 features:
Attachments: [1.10] config-loader-expat: Tell Expat not to defend against hash collisions
[master] config-loader-expat: Tell Expat not to defend against hash collisions

Description Simon McVittie 2017-07-20 20:18:16 UTC
@hewittc on GitHub reported a boot sequence issue on an ARM device running Arch Linux, apparently triggered by a libexpat upgrade: with libexpat >= 2.2.1, system services like systemd-logind and systemd process 1 would frequently fail to connect to the dbus-daemon. This caused subsequent uses of systemd-logind (in particular system logins) to fail, making the system rather broken.

Enabling verbose logging for the dbus-daemon indicated that the dbus-daemon starts 7.2 seconds after boot, stops logging anything at 7.3 seconds, then wakes up again and resumes logging things at 38.1 seconds. That's shortly after the kernel reports this:

[   37.789094] ngsa-2 kernel: random: nonblocking pool is initialized

which I think is the time at which getrandom() stops blocking. So I suspect that the dbus-daemon is blocking in libexpat for approximately 30 seconds while waiting for entropy. 30ish seconds is too long for the timeouts used in systemd, systemd-logind and systemd-networkd - by the time the dbus-daemon has woken up, those system services have already timed out.

The motivation for using high-quality entropy seems to be to avoid hash collision complexity attacks involving crafted XML files that collide. dbus-daemon does not actually need this: the only XML that we parse is completely trusted (it's the dbus-daemon's configuration). So this failure mode could be avoided by asking libexpat to not wait for high-quality entropy.

XML_SetHashSalt() with some arbitrary nonzero hash_salt parameter is an undocumented way to do this, which a libexpat developer has confirmed should work. As a short term fix we can add a call to this. It might become a no-op one day, but it can't hurt.

Longer term, libexpat might gain a function like XML_UseWeakRandomness() or something, which we could also call.
Comment 1 Simon McVittie 2017-07-20 20:29:46 UTC
https://github.com/libexpat/libexpat/issues/91 is the Expat side of this.
Comment 2 Simon McVittie 2017-07-21 09:46:34 UTC
See also https://github.com/systemd/systemd/issues/6418.
Comment 3 Simon McVittie 2017-07-21 10:06:50 UTC
Created attachment 132813 [details] [review]
[1.10] config-loader-expat: Tell Expat not to defend against hash  collisions

By default, Expat uses cryptographic-quality random numbers as a salt for
its hash algorithm, and since 2.2.1 it gets them from the getrandom
syscall on Linux. That syscall refuses to return any entropy until the
kernel's CSPRNG (random pool) has been initialized. Unfortunately, this
can take as long as 40 seconds on embedded devices with few entropy
sources, which is too long: if the system dbus-daemon blocks for that
length of time, important D-Bus clients like systemd and systemd-logind
time out and fail to connect to it.

We're parsing small configuration files here, and we trust them
completely, so we don't need to defend against hash collisions: nobody
is going to be crafting them to cause pathological performance.

---

This is for 1.10. A slightly different version is needed for master because we changed the build around a bit.

This also doesn't solve anything for CMake builds, but I'm not sure we care. The Autotools build system is recommended for anything other than:

(a) MSVC on Windows
(b) building on Linux to check that the CMake build system isn't completely broken
Comment 4 Simon McVittie 2017-07-21 10:08:00 UTC
Created attachment 132814 [details] [review]
[master] config-loader-expat: Tell Expat not to defend against hash  collisions

---

Same thing but for master, where XML_CFLAGS and XML_LIBS have become EXPAT_* because we stopped pretending to support any other XML parser.
Comment 5 Simon McVittie 2017-07-21 10:08:59 UTC
Testing on the failing hardware, with the unpatched Expat >= 2.2.1, would be appreciated.
Comment 6 Christopher Hewitt 2017-07-21 14:40:06 UTC
I reverted any local modifications to expat and dbus, verified that the boot behavior was broken, then applied the patch for dbus 1.10 and built a new dbus package. Indeed the system now boots as expected once again.
Comment 7 Simon McVittie 2017-07-21 16:47:34 UTC
Thanks, added your Tested-by in my local version of those commits. Hopefully a reviewer can take a look at these soon.

(Also at <https://github.com/smcv/dbus/commit/101858-expat-entropy> and <https://github.com/smcv/dbus/commit/101858-expat-entropy-1.10> for CI testing)
Comment 8 Philip Withnall 2017-07-28 10:01:03 UTC
Comment on attachment 132813 [details] [review]
[1.10] config-loader-expat: Tell Expat not to defend against hash  collisions

Review of attachment 132813 [details] [review]:
-----------------------------------------------------------------

r+
Comment 9 Philip Withnall 2017-07-28 10:01:23 UTC
Comment on attachment 132814 [details] [review]
[master] config-loader-expat: Tell Expat not to defend against hash  collisions

Review of attachment 132814 [details] [review]:
-----------------------------------------------------------------

r+
Comment 10 Simon McVittie 2017-07-28 10:26:43 UTC
Thanks, fixed in git for 1.10.24 and 1.11.18.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.