Bug 95263 - unix fd in-flight counting broken
Summary: unix fd in-flight counting broken
Status: RESOLVED FIXED
Alias: None
Product: dbus
Classification: Unclassified
Component: core
Version: unspecified
Hardware: Other
OS: All
Importance: medium normal
Assignee: Simon McVittie
QA Contact: D-Bus Maintainers
URL:
Whiteboard:
Keywords: patch
Depends on: 86442
Blocks:
 
Reported: 2016-05-04 15:56 UTC by Lennart Poettering
Modified: 2017-04-05 15:19 UTC
CC List: 7 users

See Also:


Attachments
a patch (2.36 KB, patch)
2016-05-05 11:23 UTC, Lennart Poettering
second try (4.57 KB, patch)
2016-05-05 22:50 UTC, Lennart Poettering
third try (4.57 KB, patch)
2016-05-05 22:51 UTC, Lennart Poettering
new version (4.87 KB, patch)
2016-05-09 12:25 UTC, Lennart Poettering
dbus-daemon test: attempt to reproduce #95263 (3.98 KB, patch)
2016-07-01 15:27 UTC, Simon McVittie
WiP: only read one message at a time if there are fds pending (10.92 KB, patch)
2016-07-01 15:30 UTC, Simon McVittie
WiP: make uid 0 immune to pending_fd_timeout (1.92 KB, patch)
2016-07-01 15:37 UTC, Simon McVittie
Only read one message at a time if there are fds pending (12.38 KB, patch)
2017-01-17 17:00 UTC, Simon McVittie

Description Lennart Poettering 2016-05-04 15:56:50 UTC
So, here's an issue I found with the unix fd counting introduced in e0c9d31be3b9eea9ee2a3a255bc2cf9aad713642:

systemd-logind provides a method call OpenSession() that is invoked each time a user session is created. It returns some metadata including one FD that is used to track the session's runtime. When the system creates a myriad of sessions in a tight loop, dbus-daemon will eventually kick logind off the bus: logind sends enough fds that the connection's fd counter never hits 0, so the timer introduced by the commit above fires.

I figure the timer should really be restarted each time the counter is decreased, not just when it goes from 0 → 1. But I am not sure I understand the algorithm fully. Alban?

The original issue is tracked here in systemd upstream:

https://github.com/systemd/systemd/issues/1961

And I figure this is actually the same issue:

https://github.com/systemd/systemd/issues/2925
Comment 1 Alban Crequy 2016-05-04 17:24:26 UTC
> I figure the timer should really be restarted each time the counter
> is decreased,

I think you are correct.
Comment 2 Lennart Poettering 2016-05-05 11:23:28 UTC
Created attachment 123493 [details] [review]
a patch

Here's a patch to fix this. With this in place, I cannot reproduce the original issue with systemd-logind anymore.
Comment 3 Alban Crequy 2016-05-05 17:19:42 UTC
       /* Start timer each time an fd is added or removed, i.e. the list of fds isn't stale or stuck */
       _dbus_timeout_set_interval (d->pending_unix_fds_timeout,
               bus_context_get_pending_fd_timeout (d->connections->context));
       _dbus_timeout_set_enabled (d->pending_unix_fds_timeout, TRUE);


The timeout might already be enabled, and here it is enabled again. The documentation for _dbus_timeout_set_interval() says that "you have to disable and re-enable the timeout using the timeout toggle function". Is that true?
Comment 4 Lennart Poettering 2016-05-05 21:32:53 UTC
I am a bit puzzled, since the old code didn't call this either, but really should AFAICS: only when the timeout is toggled, or when it elapses, is the new timestamp (to which the "interval" parameter is considered relative) read into the timeout object.

In other words, as far as I can see the old code never actually worked correctly either: it used the clock at the time the connection was allocated as the base for the interval, and never restarted it.

So I figure this patch only worked for me because the timeout was now turned off more quickly than it could elapse again...

Anyway, I will prepare a fix that toggles the timer each time we turn it on here.
Comment 5 Lennart Poettering 2016-05-05 22:49:41 UTC
So, yeah, adding and removing the timeout would work, and was already missing previously. However, adding/removing the timeout sucks, because it involves memory allocation.

Instead I decided to simply provide a new call, _dbus_loop_restart_timeout(), that directly takes the new timestamp for the timeout, without requiring a free/malloc cycle.

The bus code always goes directly to the various _dbus_loop_xyz() calls, hence I figured it's better to just add this new function than to accept that the memory allocation might fail.

I tested this new patch in a tight loop for 30 minutes, and all appears good.
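
For illustration, a minimal, self-contained sketch of the "restart in place" idea: refresh the timeout's absolute deadline from the monotonic clock instead of freeing and re-adding the timeout entry. The types and names (ExampleTimeout, example_restart_timeout) are hypothetical, not the actual dbus-mainloop implementation:

    /* Hypothetical sketch, not the real DBusLoop code: "restarting" a
     * timeout just means recomputing its absolute deadline, with no
     * allocation involved. */
    #include <stdio.h>
    #include <time.h>

    typedef struct {
      int interval_ms;        /* relative interval, e.g. 150000 for 150 s */
      long long deadline_ms;  /* absolute expiry on the monotonic clock   */
      int enabled;
    } ExampleTimeout;

    static long long
    now_monotonic_ms (void)
    {
      struct timespec ts;
      clock_gettime (CLOCK_MONOTONIC, &ts);
      return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    /* Analogous in spirit to a _dbus_loop_restart_timeout() helper. */
    static void
    example_restart_timeout (ExampleTimeout *t)
    {
      t->deadline_ms = now_monotonic_ms () + t->interval_ms;
      t->enabled = 1;
    }

    int
    main (void)
    {
      ExampleTimeout pending_fd_timeout = { 150000, 0, 0 };

      example_restart_timeout (&pending_fd_timeout);
      printf ("fires at t=%lld ms on the monotonic clock\n",
              pending_fd_timeout.deadline_ms);
      return 0;
    }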
Comment 6 Lennart Poettering 2016-05-05 22:50:07 UTC
Created attachment 123506 [details] [review]
second try
Comment 7 Lennart Poettering 2016-05-05 22:51:14 UTC
Created attachment 123507 [details] [review]
third try

This time, with the correct commit msg.
Comment 8 Simon McVittie 2016-05-06 17:42:48 UTC
Comment on attachment 123507 [details] [review]
third try

Review of attachment 123507 [details] [review]:
-----------------------------------------------------------------

> This also adds a log message whenever the fd-in-flight timer hits, since
> the timer elapsing should usually be a very exceptional condition, and a
> clear indication of a bug or attack somewhere.

Yes! We need more of this sort of thing.

(Implementation not yet reviewed.)
Comment 9 Simon McVittie 2016-05-06 17:47:15 UTC
(In reply to Lennart Poettering from comment #5)
> The bus code always goes directly to the various _dbus_loop_xyz() calls,
> hence I figured it's better to just add this new function than accepting the
> fact that the memory allocation might fail.

Yes, that's fine. DBusLoop is a simple single-threaded main loop for use in the dbus-daemon and regression tests; it's OK for dbus-daemon to assume it.
Comment 10 Simon McVittie 2016-05-06 17:51:06 UTC
Comment on attachment 123507 [details] [review]
third try

Review of attachment 123507 [details] [review]:
-----------------------------------------------------------------

::: bus/connection.c
@@ +657,5 @@
> +      _dbus_timeout_set_enabled (d->pending_unix_fds_timeout, FALSE);
> +    }
> +  else if (n_pending_unix_fds_old != n_pending_unix_fds_new)
> +    {
> +      /* Start timer each time an fd is added or removed, i.e. the list of fds isn't stale or stuck */

Restarting the timer every time a fd is removed, sure, I can understand that. logind (or whatever) has made forward progress, so it's OK to give it some time to deal with the next one.

Restarting the timer every time a fd is *added* seems odd though... that doesn't prove anything about whether logind (or whatever) is actually reading from its socket?

On one hand, it can't keep increasing forever, because eventually it'll hit the per-connection limit; but on the other hand, we ideally want a stuck or malicious (slow loris) service to be detected *before* we hit the per-connection limit, because the per-connection limit is uncomfortably few orders of magnitude down from the limit for our entire process.
Comment 11 Simon McVittie 2016-05-06 17:56:38 UTC
Comment on attachment 123507 [details] [review]
third try

Review of attachment 123507 [details] [review]:
-----------------------------------------------------------------

::: bus/connection.c
@@ +675,4 @@
>  pending_unix_fds_timeout_cb (void *data)
>  {
>    DBusConnection *connection = data;
> +  _dbus_warn ("Client sent too many file descriptors, disconnecting.");

This must not be merged. _dbus_warn() is fatal by default, and will normally abort() dbus-daemon (on anything that isn't a Debian derivative) - it's like a more trigger-happy version of g_critical(). Perhaps we should rename it to _dbus_critical() or _dbus_bug() or _dbus_should_not_happen().

(Debian derivatives have a patch altering this behaviour, see Bug #92546. _dbus_warn() is still the wrong function though.)

Use bus_context_log() with severity WARNING instead.
Comment 12 Lennart Poettering 2016-05-09 12:17:07 UTC
> Restarting the timer every time a fd is *added* seems odd though... that doesn't prove anything about whether logind (or whatever) is actually reading from its socket?

Well, we have to initially start the timer when the first fd is enqueued. Starting it only when fds are removed would defeat the whole purpose of the concept, no? I mean, that would allow a client to enqueue one fd, then do nothing more, and the connection would be stuck forever.

Hence, we need to at least restart the timer when the first fd is enqueued, and when any fd is enqueued.
Comment 13 Lennart Poettering 2016-05-09 12:25:06 UTC
Created attachment 123568 [details] [review]
new version

Added a new patch that doesn't reset the timer anymore if there's already an fd queued and another one is added.

Fixes the warning issue.
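
For illustration, a minimal sketch of the transition logic as described here: start the countdown when the first fd arrives, restart it whenever an fd is removed (forward progress), leave it running when more fds are merely added, and stop it when the count drops back to zero. The names (ExampleTimer, example_update_pending_fd_timer) are hypothetical, not the actual bus/connection.c code:

    /* Hypothetical sketch of the pending-fd timer transitions discussed
     * above; not the actual patch. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
      bool enabled;
      int restarts;   /* counted only to make the example observable */
    } ExampleTimer;

    static void
    example_update_pending_fd_timer (ExampleTimer *timer,
                                     int old_fds,
                                     int new_fds)
    {
      if (new_fds == 0)
        {
          timer->enabled = false;         /* all fds drained: stop */
        }
      else if (old_fds == 0 || new_fds < old_fds)
        {
          timer->enabled = true;          /* first fd, or progress made: (re)start */
          timer->restarts++;
        }
      /* else: fds were only added while some were already pending;
       * keep the countdown that is already running */
    }

    int
    main (void)
    {
      ExampleTimer t = { false, 0 };

      example_update_pending_fd_timer (&t, 0, 1);  /* start      */
      example_update_pending_fd_timer (&t, 1, 3);  /* no restart */
      example_update_pending_fd_timer (&t, 3, 2);  /* restart    */
      example_update_pending_fd_timer (&t, 2, 0);  /* stop       */
      printf ("enabled=%d restarts=%d\n", t.enabled, t.restarts);
      return 0;
    }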
Comment 14 Michal Koutný 2016-05-24 09:54:50 UTC
I'm not sure whether this is stuck on the fd-counting part, so I filed a separate bug 95619 dedicated to the missing timeout restarts.

As for the counting -- the theoretically correct approach would be to have a timeout for each pending file descriptor, though that's a bit of an overkill. I ran into this issue when sshd/systemd-logind/systemd communicated heavily via dbus-daemon. When monitoring via gdb, I never observed n_pending_unix_fds exceed 1 -- so the primary cause was the timeouts not being reset, rather than the timer not being enabled for the second and subsequent pending fds.
Comment 15 Michal Koutný 2016-06-15 13:02:51 UTC
Hi Simon (this may be a complete shot in the dark; if so, excuse the needinfo),
I just wanted to make sure the bug is on somebody's radar (I understand it is low priority) and not lost. As I summed up in comment 14, there are actually two related issues here that can be solved separately.

Correctly resetting the timeout would AFAICS also solve the previously referenced issue [1] with systemd-logind.

What is required to move this forward? Thanks for any information.

[1] https://github.com/systemd/systemd/issues/2925
Comment 16 Simon McVittie 2016-06-20 09:17:28 UTC
(In reply to Michal Koutný from comment #15)
> I just wanted to make sure the bug is on somebody's radar (I understand low
> priority) and not lost.

Everything in the D-Bus bug tracking system is on our radar, but we don't always have time to work on D-Bus: it isn't anyone's primary job.

> What is required to move this forward?

A maintainer (other than Lennart) having time to review it.

I've just got back from a hackfest and had work deadlines immediately before that, so I'm hoping to be able to spend some time on D-Bus when I've caught up on what I missed while at the hackfest.

Marking bugs as NEEDINFO when waiting for maintainer input is usually counterproductive: we will tend to ignore NEEDINFO bugs, because that state is usually used to indicate bugs that need more information from the submitter and are not ready for a maintainer to work on. Marking the bug as "review?" rather than "review-" is useful, though, and I have now done that.

If you have reviewed Lennart's patch yourself and you believe it to be correct, or if you have tested it and found that it solves a reproducible issue, then that would be useful input. (Similarly, if you think there's an error in Lennart's patch, please say so.)
Comment 17 Simon McVittie 2016-06-30 16:07:40 UTC
(In reply to Simon McVittie from comment #11)
> This must not be merged. _dbus_warn() is fatal by default, and will normally
> abort() dbus-daemon (on anything that isn't a Debian derivative) - it's like
> a more trigger-happy version of g_critical().

This was actually incorrect; I was thinking of _dbus_warn_check_failed().

> Use bus_context_log() with severity WARNING instead.

This is still the better one to use, though.
Comment 18 Simon McVittie 2016-06-30 16:40:20 UTC
Comment on attachment 123568 [details] [review]
new version

Review of attachment 123568 [details] [review]:
-----------------------------------------------------------------

This seems correct. I think.

I'd be happier with it if we had a regression test, at least a manual one.

I definitely like the addition of a log message: as you say, this should be a really weird occurrence.

::: bus/connection.c
@@ +682,5 @@
> +  d = BUS_CONNECTION_DATA (connection);
> +  _dbus_assert(d != NULL);
> +
> +  bus_context_log (d->connections->context, DBUS_SYSTEM_LOG_WARNING,
> +                   "Client sent too many file descriptors, disconnecting.");

I'd like this to mention *which* client, but that can be a follow-up commit.

::: dbus/dbus-mainloop.c
@@ +419,5 @@
>  }
>  
> +void
> +_dbus_loop_restart_timeout (DBusLoop            *loop,
> +                            DBusTimeout         *timeout) {

{ should move to next line

@@ +427,5 @@
> +  link = _dbus_list_get_first_link (&loop->timeouts);
> +  while (link != NULL)
> +    {
> +      DBusList *next = _dbus_list_get_next_link (&loop->timeouts, link);
> +      TimeoutCallback *this = link->data;

I can't say I'm delighted about an O(n) iteration through all the timeouts looking for the right one, but it'll do.

I'm tempted to see whether we can do better by stashing the TimeoutCallback, or the list link, in the timeout object.
Comment 19 Simon McVittie 2016-06-30 17:20:49 UTC
(In reply to Michal Koutný from comment #14)
> As for the counting -- theoretically correct approach would be to have
> timeout for each pending file descriptor, that's a bit overkill though.

Right. The change proposed here changes the meaning of the pending_fd_timeout, and I need to go through what now happens and check that it's OK.

The usual diagram applies:

    process boundary
        |             dispatch            |
 sender--> receiving ----------> sending --> recipient
        |    loader               queue   |

Previously, the intention (from Bug #80559) was that after a sender puts a file descriptor into the receiving DBusMessageLoader, it has a grace period of 2.5 minutes to get the rest of the message into the receiving loader. When the whole message has arrived, we build a DBusMessage and move it out of the loader into the sending queue, and we no longer blame the sender for it (whether we can make progress is no longer dependent on the sender).

In the pedantically correct approach that you describe, if the sender gave us *another* fd, it would have 2.5 minutes (with a different start time) to get *that* fd out of the loader, independently.

With Lennart's patch, after a sender puts a fd into the receiving loader, it still has 2.5 minutes to get the rest of the message into the receiving loader. It can stuff many additional fds into the loader, up to the default limit of 4 messages' worth of fds (64), without hurting itself; if we assume the sender is aiming for a DoS, we can assume it will submit as many as it possibly can. As long as the sender submits enough message content that a message is built and queued for sending at least every 2.5 minutes, it can keep those fds alive almost indefinitely. The worst case is that the sender gives us 64 of the harmful fds that Alban described. As long as it lets one fd get dequeued every 2.5 minutes, it will not get kicked out; so it can keep the harmful one in our queue for up to 160 minutes (2h40).

This seems... not great either. In effect, we have just as much complexity as in Alban's patchset (or in fact a little bit more), for less DoS protection.

One thing that occurs to me is that logind, the major use-case for this, happens to run as uid 0. If we were to exempt uid 0 from this limit, that would circumvent this counting issue, and in particular completely fix logind.
Comment 20 Simon McVittie 2016-06-30 17:48:52 UTC
What we ideally want to happen is: after a fd has entered the DBusMessageLoader, we do not receive any more fds until the message that contained the fd has been fully received and can be turned into a DBusMessage. That way, the timeout is very simple: it starts at the transition from 0 to >0 fds in the loader, and stops at the transition back to 0.

We cannot just stop receiving fds, unless we are careful to avoid starting to read a second or subsequent message, because that would mean we could accidentally discard the fds attached to that second or subsequent message.

Let's assume that the sender correctly attached their fds to the first byte of a message, and that we don't mind breaking senders who didn't do that. In general, a read that contains fds will contain /n/ complete messages (n >= 0) plus /m/ bytes of a partial message (m >= 0, m < size of last message).

We can process the /n/ complete messages immediately and get them out of the way.

If m is 0, there's nothing more to do and we resume normal operation.

If m is at least DBUS_MINIMUM_HEADER_SIZE, then we can work out exactly how many more bytes to read to get the rest of this message, and avoid reading beyond that many until we have completed that partial message. This will likely mean we slow down, but maybe that's fine. Then we resume normal operation (with fds).

If m is less than DBUS_MINIMUM_HEADER_SIZE, then we could slow right down to read the remaining 1-15 bytes of the fixed-size header, then we'll know how much data we need to complete the partial message. Then we can complete the partial message as above, then resume normal operation.
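
For illustration, a rough, self-contained sketch of that case analysis: given m bytes of a partial message already buffered while fds are pending, decide how many more bytes the next read may request. The constant and helper name are illustrative (the 16-byte fixed-size header is implied by the "1-15 bytes" above); this is not the real dbus-daemon code:

    /* Hypothetical sketch of the read limit described above, not the
     * actual DBusMessageLoader code.  While fds are pending we only
     * read enough bytes to finish the current partial message. */
    #include <stddef.h>
    #include <stdio.h>

    #define EXAMPLE_MINIMUM_HEADER_SIZE 16  /* fixed-size part of a D-Bus header */

    /* partial_bytes is "m" above; total_message_size is only known once
     * the fixed-size header has been read (pass 0 while it is unknown). */
    static size_t
    example_max_read_with_fds_pending (size_t partial_bytes,
                                       size_t total_message_size)
    {
      if (partial_bytes == 0)
        return 0;   /* m == 0: nothing partial, stop reading until fds drain */

      if (partial_bytes < EXAMPLE_MINIMUM_HEADER_SIZE)
        /* finish the fixed-size header so we learn the message length */
        return EXAMPLE_MINIMUM_HEADER_SIZE - partial_bytes;

      /* we know the full length: read exactly the rest of this message */
      return total_message_size - partial_bytes;
    }

    int
    main (void)
    {
      printf ("%zu\n", example_max_read_with_fds_pending (0, 0));     /* 0  */
      printf ("%zu\n", example_max_read_with_fds_pending (7, 0));     /* 9  */
      printf ("%zu\n", example_max_read_with_fds_pending (40, 128));  /* 88 */
      return 0;
    }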

Does that sound feasible? Then we wouldn't need Lennart's change at all, because Alban's original fd-counting logic would become correct?

(I'm also starting to think we should never have added fd-passing...)
Comment 21 Simon McVittie 2016-07-01 12:36:31 UTC
> > This also adds a log message whenever the fd-in-flight timer hits, since
> > the timer elapsing should usually be a very exceptional condition, and a
> > clear indication of a bug or attack somewhere.
> 
> Yes! We need more of this sort of thing.

In fact I already implemented basically this message 2.5 years ago, along with syslog messages when a lot of other limits are exceeded; see Bug #80559. Reviews very welcome.

I've rebased those messages, and added regression tests for most of the affected limits, which indirectly exercise the logging (in the tests it just goes to stderr).

I think the next step here is to have a regression test for the fd-in-flight timer, with the timeout cut to 1 second or something. It might have to do some fairly horrible hacks somehow to trick libdbus into not sending the entire message.
Comment 22 Simon McVittie 2016-07-01 14:40:18 UTC
(In reply to Simon McVittie from comment #21)
> In fact I already implemented basically this message 2.5 years ago, along
> with syslog messages when a lot of other limits are exceeded; see Bug
> #80559. Reviews very welcome.

Sorry, that should have said Bug #86442.
Comment 23 Simon McVittie 2016-07-01 15:27:18 UTC
Created attachment 124840 [details] [review]
dbus-daemon test: attempt to reproduce #95263

---
Based on the patches from Bug #86442.

I was unable to reproduce the failure mode you describe, even by reducing the timeout to 1ms and increasing the number of fds allowed in-flight to 4000 (with a correspondingly higher ulimit).

I was also unable to reproduce that failure mode by allowing up to 1000000 messages in-flight, with a fd attached to every 1000th message.
Comment 24 Simon McVittie 2016-07-01 15:30:37 UTC
Created attachment 124841 [details] [review]
WiP: only read one message at a time if there are fds pending

---
Here is an implementation of my idea from a few comments ago.

I couldn't reproduce the reported failure mode, so I can't usefully test this. However, if you have a reproducible test-case based on logind, please try it?
Comment 25 Simon McVittie 2016-07-01 15:31:27 UTC
(In reply to Simon McVittie from comment #21)
> I think the next step here is to have a regression test for the fd-in-flight
> timer, with the timeout cut to 1 second or something. It might have to do
> some fairly horrible hacks somehow to trick libdbus into not sending the
> entire message.

See Bug #86442 for that.
Comment 26 Simon McVittie 2016-07-01 15:37:12 UTC
Created attachment 124842 [details] [review]
WiP: make uid 0 immune to pending_fd_timeout

---
Here is the suggested workaround of making uid 0 immune to this limit, again based on what's in Bug #86442.
Comment 27 Michal Koutný 2016-07-11 12:34:23 UTC
(In reply to Simon McVittie from comment #23)
> dbus-daemon test: attempt to reproduce #95263
I'd like to try out the test; however, I can't apply this patch on top of current master. Is it based on something different?

(FTR, my testing was just many SSH connections to the server opened/closed at once; 150 seconds after the first pending fd was added, systemd-logind was disconnected by dbus.)
Comment 28 Simon McVittie 2016-07-11 14:40:40 UTC
(In reply to Michal Koutný from comment #27)
> I'd like to try out the test, however, I can't apply this patch on top of
> current master. Is it based on something different? 

It's based on master + the patches from Bug #86442.
Comment 29 Mike Pontillo 2016-08-16 13:38:18 UTC
We have had multiple confirmations that this set of patches fixes serious issues with SSH login behavior in Ubuntu 16.04.[1]

For what it's worth, I have tested the patch in a staging environment and a simple container-based test bed I used to reproduce the issue, and the GitLab team is now using it successfully in production.

I hope this can be merged into dbus soon; I think it's an important fix.

[1]:
https://launchpad.net/bugs/1591411
Comment 30 Simon McVittie 2016-08-16 15:01:16 UTC
(In reply to Mike Pontillo from comment #29)
> We have had multiple confirmations that this set of patches fixes serious
> issues with SSH login behavior in Ubuntu 16.04.[1]

*Which* set of patches? The contents of this bug are not a unified set of patches - they are several attempts at (aspects of) a fix for the same symptoms.

As far as I can tell, Attachment #123568 [details] reintroduces CVE-2014-3637 (Bug #80559), a denial of service security vulnerability. This makes me very reluctant to apply it: I don't want to release known security regressions.

Does a version of dbus with Attachment #124841 [details] applied, but Attachment #123568 [details] not applied, also resolve this failure mode in a test environment that can reliably produce those SSH login symptoms?

Does a version of dbus with Attachment #124842 [details] applied, but Attachment #123568 [details] not applied, resolve the same failure mode in the same environment?

Are you able to get this failure mode in a more minimal environment using Attachment #124840 [details] or a modified version of it?
Comment 31 Łukasz 'sil2100' Zemczak 2016-10-13 10:43:48 UTC
(In reply to Simon McVittie from comment #30)
> (In reply to Mike Pontillo from comment #29)
> > We have had multiple confirmations that this set of patches fixes serious
> > issues with SSH login behavior in Ubuntu 16.04.[1]
> 
> *Which* set of patches? The contents of this bug are not a unified set of
> patches - they are several attempts at (aspects of) a fix for the same
> symptoms.
> 
> As far as I can tell, Attachment #123568 [details] reintroduces
> CVE-2014-3637 (Bug #80559), a denial of service security vulnerability. This
> makes me very reluctant to apply it: I don't want to release known security
> regressions.
> 
> Does a version of dbus with Attachment #124841 [details] applied, but
> Attachment #123568 [details] not applied, also resolve this failure mode in
> a test environment that can reliably produce those SSH login symptoms?
> 
> Does a version of dbus with Attachment #124842 [details] applied, but
> Attachment #123568 [details] not applied, resolve the same failure mode in
> the same environment?
> 
> Are you able to get this failure mode in a more minimal environment using
> Attachment #124840 [details] or a modified version of it?

Hello Simon!

I took the two attached patches and prepared packages with each of them (in a PPA) and asked our community to try to reproduce the SSH session bug [1] on their Ubuntu machines - the results are as follows:

Attachment #124841 [details] applied - this did not seem to fix the problem and the bug was reproducible in the same way as before.

Attachment #124842 [details] applied - this patch does seem to fix the issue! So far two of our users have reported that dbus 1.10.6-1ubuntu3 (+ the fix) and also 1.10.10 (+ the fix) seem to fix the issue for them, and systemd-logind no longer falls over as before. It seems to be a valid candidate for review and release - at least from our perspective.

Could someone review and, possibly, accept this patch in this case? We will continue to ask people for testing but so far it seems really promising. Once the fix is accepted here upstream I'll cherry-pick it to Ubuntu ASAP.

Thanks!

[1] https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1591411
Comment 32 Simon McVittie 2016-10-21 18:40:15 UTC
(In reply to Łukasz 'sil2100' Zemczak from comment #31)
> Attachment #124841 [details] applied - this did not seem to fix the problem
> and the bug was reproducible in the same way as before.

That's unfortunate; I can't think of any other way to fix the root cause without re-introducing a denial of service security vulnerability. Unless someone else can find a bug in that patch that made it not work, without invalidating the general approach?

> Attachment #124842 [details] applied - this patch does seem to fix the
> issue

That's essentially a workaround for the root cause, but certainly an improvement.

> Could someone review and, possibly, accept this patch in this case?

That "someone" could be you. Does the fact that you have been testing it in Ubuntu imply that you have reviewed it and consider it to be a valid and correct change?
Comment 33 Łukasz 'sil2100' Zemczak 2016-10-24 14:00:15 UTC
(In reply to Simon McVittie from comment #32)
> (In reply to Łukasz 'sil2100' Zemczak from comment #31)
> > Attachment #124841 [details] applied - this did not seem to fix the problem
> > and the bug was reproducible in the same way as before.
> 
> That's unfortunate; I can't think of any other way to fix the root cause
> without re-introducing a denial of service security vulnerability. Unless
> someone else can find a bug in that patch that made it not work, without
> invalidating the general approach?
> 

I guess we could try digging into that a bit further. 

> > Attachment #124842 [details] applied - this patch does seem to fix the
> > issue
> 
> That's essentially a workaround for the root cause, but certainly an
> improvement.
> 
> > Could someone review and, possibly, accept this patch in this case?
> 
> That "someone" could be you. Does the fact that you have been testing it in
> Ubuntu imply that you have reviewed it and consider it to be a valid and
> correct change?

Generally yes; we are aware that this is not the final solution we'd like to settle on, but it does fix all the issues for all users testing it so far. I passed it on to the Ubuntu security team for a quick review and they also did not find any issues with the fix - besides, of course, noting that it's only a workaround (but once they saw your name as the author, they became a bit less worried about that).

I am not an expert in the dbus code-base, but I also couldn't find any issues with this change. It's certainly much better than having so many users resort to hacky workarounds to get their environments working.

So, if I have the power to do so, I would say: yes, the patch is reviewed and good to go IMO.

Not sure whether this bug should be marked as resolved, as we would probably like to have it fixed completely? Maybe another bug should be filed, or can we still use this one?

All in all: big thanks for looking into this.
Comment 34 Simon McVittie 2016-10-27 13:55:36 UTC
(In reply to Łukasz 'sil2100' Zemczak from comment #33)
> Not sure if this bug should be marked as resolved, as we would probably like
> to have it fixed completely? Maybe there should be another bug filled, or
> can we still use this one?

We can apply a workaround that fixes the issue in practice, but leave the bug open for someone to look for a better solution to the root cause.

I would really appreciate someone writing a test case that does not require logind and an entire OS (particularly since the workaround will make that test-case unusable). Unfortunately my own attempt at this (Comment #23) was not successful.
Comment 35 Simon McVittie 2016-11-11 19:11:29 UTC
(In reply to Michal Koutný from comment #27)
> (FTR, my testing was just many SSH connections to server opened/closed at
> once and 150 seconds after the first pending fd was added, systemd-logind
> was disconnected by dbus.)

I haven't been able to reproduce this failure mode (on Ubuntu xenial, with a dbus-daemon whose changelog indicates that this has not been worked around, or on Debian jessie or sid, with a dbus-daemon that definitely doesn't have this workaround). I tried reducing pending_fd_timeout to make it easier to hit the failure case, and couldn't reproduce it that way either.

Is there anything special that needs to be installed or configured to make this happen?
Comment 36 Simon McVittie 2016-11-11 19:41:37 UTC
(In reply to Łukasz 'sil2100' Zemczak from comment #31)
> Attachment #124841 [details] applied - this did not seem to fix the problem
> and the bug was reproducible in the same way as before.

I think I might have spotted the flaw in the approach this patch took. It was intended to make sure dbus-daemon does not accept any more fds while there are fds pending; but perhaps it was incomplete and didn't ensure that a higher layer wouldn't just read some more?

For people who are able to reproduce this (which unfortunately does not include me), here is another thing to try. It might not be a correct solution, but telling me whether it works is information that could lead me to a correct solution.

* Use an expendable machine where you can reproduce this bug
* If you have applied Attachment #124842 [details], revert it (make the reverse change)
* Install dbus-daemon, reboot, check that you can still reproduce the bug
* Create /etc/dbus-1/system-local.conf with this content:
  <busconfig>
    <limit name="max_incoming_unix_fds">1</limit>
  </busconfig>
* Apply Attachment #124841 [details] and recompile
* Install the new dbus-daemon, reboot, try to reproduce the bug again

If I'm reading the code correctly, max_incoming_unix_fds=1 does not actually prevent messages with more than one fd from being received - we are willing to exceed all incoming message limits by up to one more recvmsg() or one more complete D-Bus message, whichever is greater. So I don't think this prevents services from sending messages with up to max_message_unix_fds fds attached (the default for the system bus is 16).

The combination of that lower limit and Attachment #124841 [details] should mean that after the dbus-daemon has received a fd from logind, it pauses receiving anything else from logind until it has handed off that fd to its recipient. At that point the number of fds that logind is considered to be responsible for drops from 1 to 0, so we stop running the timer that would otherwise disconnect logind when it ran out. I hope.
Comment 37 Simon McVittie 2016-11-11 19:51:05 UTC
(In reply to Łukasz 'sil2100' Zemczak from comment #33)
> Generally yes, we are aware that this is not the final solution we'd
> possibly like to settle on, but it does fix all the issues for all users
> testing it so far.

I've pushed this patch to git master for 1.11.8. Leaving the bug open as discussed.

I'll probably backport both this and commit 05cb619f to the dbus-1.10 stable branch (for 1.10.14) after releasing 1.11.8, but I've run out of time to do that right now.
Comment 38 Simon McVittie 2016-11-11 20:22:50 UTC
Looking back through this saga again, I think the issue might be that we have two related bugs:

* Timeouts' intervals not resetting correctly: half of the purpose of
  Lennart's Attachment #123568 [details] here is fixing this; Bug #95619 contains
  another approach. Let's use Bug #95619 to track that.

* What happens when we continously have fds in-flight for a long time:
  the other half of the purpose of Lennart's Attachment #123568 [details] here is to
  change this (but the specific change reintroduces CVE-2014-3637, so we
  don't want it). My Attachment #124841 [details] is a different change that might
  resolve the symptoms without reintroducing CVE-2014-3637,
  *but it still relies on fixing Bug #95619* - so the fact that this bug
  isn't resolved by Attachment #124841 [details] alone might be caused by Bug #95619
  still being present.
Comment 39 Simon McVittie 2016-11-11 20:58:41 UTC
(In reply to Simon McVittie from comment #38)
> Looking back through this saga again, I think the issue might be that we
> have two related bugs

To test this theory, someone who can reproduce the symptoms of this bug could:

* Use an expendable machine where you can reproduce this bug
* If you have applied Attachment #124842 [details], revert it
* Install dbus-daemon, reboot, check that you can still reproduce the bug
* Apply Attachment #124841 [details] *and also the patch from Bug #95619*, and recompile
* Install the new dbus-daemon, reboot, try to reproduce the bug again

The combination of those two patches hopefully gives us an equivalent of Attachment #123568 [details] that does not reintroduce CVE-2014-3637?

If that doesn't work, the next thing to try would be combining those two patches with the <busconfig> snippet from Comment #36.
Comment 40 Łukasz 'sil2100' Zemczak 2016-11-16 11:03:54 UTC
(In reply to Simon McVittie from comment #39)
> (In reply to Simon McVittie from comment #38)
> > Looking back through this saga again, I think the issue might be that we
> > have two related bugs
> 
> To test this theory, someone who can reproduce the symptoms of this bug
> could:
> 
> * Use an expendable machine where you can reproduce this bug
> * If you have applied Attachment #124842 [details], revert it
> * Install dbus-daemon, reboot, check that you can still reproduce the bug
> * Apply Attachment #124841 [details] *and also the patch from Bug #95619*,
> and recompile
> * Install the new dbus-daemon, reboot, try to reproduce the bug again
> 
> The combination of those two patches hopefully gives us an equivalent of
> Attachment #123568 [details] that does not reintroduce CVE-2014-3637?
> 
> If that doesn't work, the next thing to try would be combining those two
> patches with the <busconfig> snippet from Comment #36.

Hey Simon!

I have prepared some dbus test packages for Ubuntu with the requested changes (and the workaround reverted) for people to test in our Ubuntu downstream bug. People were willing to give those a spin. I'll send an update here once I get some test results from various community people.
Comment 41 Łukasz 'sil2100' Zemczak 2016-12-19 12:14:33 UTC
(In reply to Łukasz 'sil2100' Zemczak from comment #40)
> Hey Simon!
> 
> I have prepared some dbus test packages for Ubuntu with the requested
> changes (and the workaround reverted) for people to test in our Ubuntu
> downstream bug. People were willing to give those a spin. I'll send an
> update here once I get some test results from various community people.

Apologies that this took so long! It seems that once the workaround was in place, it was not so easy to find people who could test a real fix. But so far we have one person reporting that the real fix (composed from the two patches you mentioned, with the workaround reverted) fixes the issue as well. I would like a few more people to submit their test results, but so far it seems this might be 'it'.
Comment 42 Simon McVittie 2017-01-17 15:41:03 UTC
I think what I'm going to do here is:

* For 1.11.x (master): apply the patches you used, and revert the
  "uid 0" workaround.

* For 1.10.x: stick with the "uid 0" workaround, because that workaround
  is enough to address this for logind, which is the most important impact.
  We can consider additionally backporting the patches from Bug #95619
  and this bug later, once they have had more testing in master; but
  if we do, we will also keep the workaround.
Comment 43 Simon McVittie 2017-01-17 17:00:22 UTC
Created attachment 129002 [details] [review]
Only read one message at a time if there are fds pending

systemd-logind's OpenSession() API call returns a fd. If there is a
flood of new sessions, it is possible that by the time we finish reading
message 1, message 2 will already be in our incoming buffer and so on.
This results in systemd-logind consistently having one or more fds enqueued
for an extended period, which we interpret as a denial of service
attack, and handle by kicking it off the bus (at least until we worked
around the resulting logind failure by making uid 0 immune to that
particular anti-DoS mechanism, but that workaround doesn't work for
other uids).

To avoid this without the complexity of tracking multiple countdowns
per connection (one for each message with fds), we can avoid reading
any additional messages while we already have a message with a fd
attached pending processing. To avoid stalling, we have to read the rest
of any partial message we might have, but we stop after that.
Assuming we are able to get rid of the pending fds within a reasonable
time, we'll eventually drain the incoming queue to a level of 0 bytes
and 0 fds, at which point the countdown stops.

To make this actually work, we need fd.o #95619 to be fixed first, so
that when we receive more fds and restart the countdown, it restarts
with its correct time remaining.

---

Here is a hopefully-final patch for current master.

Changes from Attachment #124842 [details]:
* commit message
* wrap fd-dependent code in #ifdef HAVE_UNIX_FD_PASSING so we continue
  to be able to do Windows builds (thanks, Travis-CI)

Also requires Attachment #127921 [details] from Bug #95619.

I'll apply both of those patches to master soon, and revert the uid 0 workaround, unless there are objections.
Comment 44 Simon McVittie 2017-04-05 15:19:09 UTC
Fixed in 1.11.10

