94629 – PulseAudio gets reliably killed upon a big number of client connections

Status:	RESOLVED MOVED

Alias:	None

Product:	PulseAudio
Classification:	Unclassified
Component:	core
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	pulseaudio-bugs
QA Contact:	pulseaudio-bugs

Reported:	2016-03-19 19:44 UTC by Ahmed S. Darwish
Modified:	2018-07-30 10:29 UTC
CC List:	2 users

Description Ahmed S. Darwish 2016-03-19 19:44:09 UTC

On master branch [1], connections from a large number of clients
_reliably_ kill the PulseAudio daemon.

Here is a minimal script that triggers the bug:

LARGE_WAVE_FILE=...
for i in {1..60}; do
    echo "Client #$i"
    ./src/pacat "$LARGE_WAVE_FILE" &
    sleep 1
done

Here are some important factors:

  1. The bug is always triggered in a regular -O2 build with the
     ALSA sink. This usually happens after client #45. [2] [3]

  2. Disabling SHM (--disable-shm) triggers the bug much sooner:
     already after the connection from client #27.

  3. Compiling at -O0 also triggers the bug much sooner: already
     after the connection from client #28.

  4. The bug _disappears_ when choosing the NULL sink as the
     default. This is the case even after leaving the pacat
     clients running for an hour.

  5. In the scenario of point #4 above, setting the ALSA sink
     back as the default sink triggers the bug again.

  6. This bug affects all versions of PulseAudio back to v5.0!
     I could not test older versions (v4.0, v3.0, ..) as they
     always fail at runtime with my current toolchain. [4] [5]

Any thoughts on how to track this issue further?

Thanks,


==> footnotes:

[1] As of 19 March 2016, 4731690a21edc59acfd0bd27f810d5c895ac7629

[2] No logs are produced from the Linux kernel in this case.
    Check Arun's work at http://goo.gl/0mq3ym for context.

[3] This is an old "AMD Athlon(tm) II X2 260" dual-core desktop
    machine

[4] Arch Linux, GCC 5.3.0 with target x86_64-unknown-linux-gnu,
    glibc version 2.23

[5] Runtime failure for v4.0 is:
    hashmap.c: Assertion 'h->iterate_list_head' failed at
    pulsecore/hashmap.c:151, function pa_hashmap_put(). Aborting.

Comment 1 Alexander E. Patrakov 2016-03-20 05:48:28 UTC

Thanks for the report. The kills may happen because of exceeding the allowed real-time budget. To verify this, please modify /etc/pulse/daemon.conf:

realtime-scheduling=no

Comment 2 Ahmed S. Darwish 2016-03-20 10:20:07 UTC

Hi Alex,

On Sun, Mar 20, 2016 at 05:48:28AM +0000, bugzilla-daemon@freedesktop.org wrote:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=94629
> 
> --- Comment #1 from Alexander E. Patrakov <patrakov@gmail.com> ---
> Thanks for the report. The kills may happen because of exceeding the allowed
> real-time budget. To verify this, please modify /etc/pulse/daemon.conf:
>
> realtime-scheduling=no
>

Thanks! This indeed confirms that PA reliably exceeds its RT budget.
Adding `realtime-scheduling=no' makes the bug completely vanish.
Reset the option back to yes, and PA gets reliably killed again.

After applying Arun's kernel rlimits verbosity patch, here is what
I see in dmesg when PA gets unsolicitedly killed:

  [23891.784010] CPU Watchdog Timeout (hard): alsa-sink-VT170[7131]

So this also adds another confirmation and shows why the bug does
not appear by using NULL sinks..

[ Hmm, .. thinking about solutions ]

Unfortunately lowering the maximum number of clients is not a
solution: on Tanu's machine, sometimes even just 18 clients did
the trick :-(

One of the reasons for the memfd + making pools per-client patches
is to make PA a little bit safer as a system daemon (a fuzzer may
also be built down the road). If clients can trigger a reliable
PulseAudio kill, we have a reliable DoS attack on our hands..

So now the question is, how to modify PA to make sure it does not
exceed its RT budget?

Regards,
Darwish

Comment 3 Alexander E. Patrakov 2016-03-20 10:26:33 UTC

One possibility is to make sure that it doesn't attempt to mix too much ahead. Please reenable realtime scheduling, and try this in default.pa:

load-module module-udev-detect tsched_buffer_size=50000

But note that this will add up to 0.7W of extra power consumption in the single-client case, so cannot be made the default.
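
To put the suggested value in perspective, here is a minimal sketch that converts a tsched_buffer_size value to milliseconds of audio. It assumes that the parameter is given in bytes and that the sink runs at 44100 Hz, 16-bit, stereo. Both are assumptions not stated in this report, and the helper name is made up for illustration.

```python
def buffer_bytes_to_ms(nbytes, rate_hz=44100, channels=2, bytes_per_sample=2):
    """Return how many milliseconds of PCM audio fit in nbytes.

    The default sample spec is an assumption; adjust it to the real sink.
    """
    bytes_per_second = rate_hz * channels * bytes_per_sample
    return nbytes * 1000 / bytes_per_second


if __name__ == "__main__":
    # Value suggested in the comment above, under the assumed sample spec.
    print(f"{buffer_bytes_to_ms(50000):.1f} ms")
    # Byte count that corresponds to a 2000 ms buffer at the same spec.
    print(f"{buffer_bytes_to_ms(352800):.1f} ms")
```

Under these assumptions, 50000 bytes is roughly 283 ms of audio, about a seventh of a 2000 ms buffer, so far less audio is mixed ahead per wakeup.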

Comment 4 Ahmed S. Darwish 2016-03-20 10:46:40 UTC

On Sun, Mar 20, 2016 at 10:26:33AM +0000, bugzilla-daemon@freedesktop.org wrote:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=94629
> 
> --- Comment #3 from Alexander E. Patrakov <patrakov@gmail.com> ---
> One possibility is to make sure that it doesn't attempt to mix too much ahead.
> Please reenable realtime scheduling, and try this in default.pa:
> 
> load-module module-udev-detect tsched_buffer_size=50000
>

Yup, I confirm this has made the bug appear much less often.
Originally, PA got killed after only 30 client connections.

After adding the parameter above, the kill got triggered only
after 60 client connections + waiting for a full minute..

> But note that this will add up to 0.7W of extra power consumption in the
> single-client case, so cannot be made the default.
> 

Sorry, my knowledge in this area is limited, but can we program
PA to dynamically stop accepting more clients when it's close to
approaching its realtime CPU limit?

Maybe program PA upon approaching its soft limit and getting a
SIGXCPU from the kernel not to accept more clients?

Thanks,

Comment 5 Alexander E. Patrakov 2016-03-20 10:59:18 UTC

(In reply to Ahmed S. Darwish from comment #4)
> Sorry, my knowledge in this area is limited, but can we program
> PA to dynamically stop accepting more clients when it's close to
> approaching its realtime CPU limit?

No. The problem is not only with the number of clients. In steady state, PulseAudio may well support, say, 45 clients. However, if a client triggers rewinds often (either by changing its own volume frequently or by providing non-zero as the last two parameters of pa_stream_write()), that's much more work for PulseAudio. In other words, this is not only about not accepting new clients, but about stopping processing on behalf of already-connected clients that started behaving abusively.

I encourage you to try to write such an abusive client, just to see whether you can kill PulseAudio on your hardware using just two clients :)

Comment 6 Ahmed S. Darwish 2016-03-20 11:39:18 UTC

(In reply to Alexander E. Patrakov from comment #5)
> (In reply to Ahmed S. Darwish from comment #4)
> > Sorry, my knowledge in this area is limited, but can we program
> > PA to dynamically stop accepting more clients when it's close to
> > approaching its realtime CPU limit?
> 
> No. The problem is not only with the number of clients. In steady state,
> PulseAudio may well support, say, 45 clients. However, if a client triggers
> rewinds often (either by changing its own volume frequently or by providing
> non-zero as the last two parameters of pa_stream_write()), that's much more
> work for PulseAudio. In other words, this is not only about not accepting
> new clients, but about stopping processing on behalf of already-connected
> clients that started behaving abusively.
> 
> I encourage you to try to write such abusive client, just to see whether you
> can kill PulseAudio on your hardware using just two clients :)

Excellent. So besides a policy patchset, we also need a "handling
abusive clients" one .. there's still a very long way to go for
proper containers support ;-)

Comment 7 Ahmed S. Darwish 2016-03-20 19:26:01 UTC

After a lot of enlightening discussions with Alex, it seems this
is a well-known problem in Pulse.

For completeness of this bug report, here are the basic points:

1. Linux Audio Conference 2015, "Timing issues in desktop audio
   playback infrastructure", by Alexander
   slides: http://lac.linuxaudio.org/2015/download/rewind-slides.pdf

The issue of unsolicited kills is _clearly_ summarized in slide #13
above: "to process (resample, mix, encode) 2000 ms of sound under
the limited budget of 200ms of real-time. Not easy: on a weak CPU,
a cpufreq-governed CPU, with software DTS encoder, under valgrined,
etc ... Result: KILLED"
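
The arithmetic behind that slide can be sketched as follows. The 2000 ms and 200 ms figures come from the slide; the per-client mixing cost is a made-up number for illustration only, since real costs depend on the CPU, the resampler and the sample spec.

```python
def max_clients(buffer_ms, budget_us, cost_us_per_audio_ms):
    """Number of clients whose mixing for one full buffer fits in the RT budget.

    cost_us_per_audio_ms is the CPU time (in microseconds) spent per client
    to process one millisecond of audio. It is a hypothetical figure that
    would have to be measured on real hardware.
    """
    cost_per_client_us = buffer_ms * cost_us_per_audio_ms
    return budget_us // cost_per_client_us


if __name__ == "__main__":
    budget_us = 200_000  # 200 ms of real-time budget, as on the slide
    print(max_clients(2000, budget_us, 2))  # 2000 ms buffer          -> 50
    print(max_clients(2000, budget_us, 4))  # same, CPU at half speed -> 25
    print(max_clients(283, budget_us, 2))   # much smaller buffer     -> 353
```

With these illustrative numbers, halving the effective CPU speed halves the number of clients that fit in the budget, which is the cpufreq concern raised on the slide.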

Even more details are in the video conference and paper of the same
topic here: http://lac.linuxaudio.org/2015/video.php?id=8


2. A second suggestion is to let PA appropriately program its
   realtime soft limit and install the appropriate SIGXCPU handlers
   in PA. This way, we can be almost sure that the kills are due
   to exceeding our budget.

[ This is also the view favored by kernel developers as they don't
  want to pollute the kernel logs much.
  http://www.gossamer-threads.com/lists/linux/kernel/1513490#1513490 ]
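
As a rough illustration of this suggestion, the sketch below sets a soft RLIMIT_RTTIME limit and installs a SIGXCPU handler that only records the event. It is written in Python for brevity rather than in PulseAudio's C and is Linux-only. The 100 ms soft limit, the function names and the state dictionary are assumptions for illustration, not PulseAudio code.

```python
import resource
import signal

# Shared state; a real daemon would check this in its mixing loop.
rt_state = {"budget_exceeded": False}


def on_sigxcpu(signum, frame):
    """Record that the kernel reported the RT CPU-time soft limit was hit."""
    rt_state["budget_exceeded"] = True


def install_rt_soft_limit(soft_usec):
    """Set the RLIMIT_RTTIME soft limit and catch SIGXCPU.

    The soft limit is clamped to the current hard limit, which is left
    unchanged. Returns the soft limit actually applied, in microseconds.
    """
    signal.signal(signal.SIGXCPU, on_sigxcpu)
    _, hard = resource.getrlimit(resource.RLIMIT_RTTIME)
    if hard != resource.RLIM_INFINITY:
        soft_usec = min(soft_usec, hard)
    resource.setrlimit(resource.RLIMIT_RTTIME, (soft_usec, hard))
    return soft_usec


if __name__ == "__main__":
    applied = install_rt_soft_limit(100_000)  # hypothetical 100 ms soft limit
    print("soft RT limit (usec):", applied)
    signal.raise_signal(signal.SIGXCPU)  # simulate the kernel's warning
    print("budget exceeded:", rt_state["budget_exceeded"])
```

A daemon following this pattern could at least log a message when the flag is set, so that a bug report shows whether a kill was preceded by exceeding the RT budget.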


3. A third and final suggestion is to write some abusive clients
   to demonstrate how common the issue is, and that it's not only
   related to the number of connected clients, but to the issue of
   excessive rewinds and abusive clients in general

"You could write a client that does a lot of rewinds, calls
pa_stream_write with bad timing (e.g. rewinds 990 ms and writes 1s
every 10 ms) and see whether it explodes :) .. I don't expect it to
explode with one client, but two may be enough in your case"
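
To get a feel for the numbers in that quote, here is a small sketch that computes what such a client would pass to pa_stream_write() on each iteration. It only does the arithmetic and does not talk to PulseAudio. The 44100 Hz, 16-bit, stereo sample spec and the helper names are assumptions, and the offset is meant for a relative seek (presumably PA_SEEK_RELATIVE).

```python
def bytes_for_ms(ms, rate_hz=44100, channels=2, bytes_per_sample=2):
    """Frame-aligned byte count for ms milliseconds of PCM audio."""
    frames = rate_hz * ms // 1000
    return frames * channels * bytes_per_sample


def abusive_write_plan(rewind_ms=990, write_ms=1000, period_ms=10):
    """Describe one iteration of the rewind-and-rewrite pattern quoted above."""
    return {
        # Negative relative offset: seek back before writing.
        "offset_bytes": -bytes_for_ms(rewind_ms),
        # Amount of audio written after the seek.
        "write_bytes": bytes_for_ms(write_ms),
        # Net progress of the write pointer per iteration.
        "advance_ms": write_ms - rewind_ms,
        # Milliseconds of audio submitted per millisecond of wall-clock time.
        "amplification": write_ms // period_ms,
    }


if __name__ == "__main__":
    for key, value in abusive_write_plan().items():
        print(f"{key}: {value}")
```

Under these assumptions the client advances by only 10 ms per iteration but hands the server 100 times more audio than plays in real time, all of which has to be re-mixed.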


==> Raw discussion log:

<patrakov> darwish: hello. the "realtime budget" problem that you
           reported is actually a known issue for my DTS encoder.
           There, even one stream is enough on typical hardware if
           PulseAudio is left with its default of mixing 2 seconds
           ahead
<darwish>  patrakov, hi :-) .. oh, I see
<darwish>  patrakov, seems it'll need some deep surgery to solve
           this while keeping interrupts low
<patrakov> indeed
<patrakov> and in fact I am on the fence whether to remove the
           low-interrupts feature, as it never worked correctly with
           processing such as resampling
<patrakov> i.e. it may be that we just have to accept the 0.7w hit
<darwish>  hmm
<patrakov> please see http://lac.linuxaudio.org/2015/video.php?id=8
           (slides are enough)
<darwish>  patrakov, slide #13 summarizes everything really nicely :D
<patrakov> I also encourage you to take a look at CRAS source code -
           it has some efficient client-to-server communication method,
           so that the overhead from going down to 28 ms latency is
           only 0.2w, which is IMHO very tolerable and makes rewinds
           (which, together with speculative mixing ahead, are
           responsible for eating the realtime budget in your case)
           unneeded
<darwish>  hmm
<patrakov> basically the current 2000 ms default for the tsched buffer
           is based on the assumption that mixing is cheap, and that
           mixing 2000 ms of audio should eat no more than 200 ms anyway
<darwish>  patrakov, unless a high amount of clients connect, leading to
           excessive rewinds ..
<patrakov> which is false if the CPU is slowed down by the cpufreq
           framework - it just doesn't see enough load to bump the
           frequency
<darwish>  patrakov, btw thanks a lot! I finally understood the concept
           of rewinding from your slides :D
[...]
<darwish>  hmmm .. "CRAS doesn’t have any of the discussed workarounds"
<patrakov> what was meant is: "CRAS doesn't have any of the discussed
           workarounds and still works fine on hardware found in
           Chromebooks"
<patrakov> no rewinds = no need to guess how much it is possible to
           rewind, no need to deal with non-rewindable ALSA plugins, no
           need to write a rewindable resampler, no correctness issues,
           at the cost of 0.2w of extra power consumed (and if we assume
           that Chrome is the only possible client, then that's 0.0w)
<patrakov> because Chrome never actually uses high latency
<darwish>  just found some slides by the CRAS folks here .. they also
           compare themselves with PA: http://goo.gl/zdmNu4
<patrakov> they indeed share a lot of ideas
[...]
<darwish>  for completeness I'll add excerpts from the discussion above
           to the bug report + links your slides and video conference
<patrakov> basically, I want you to actually write a client that does a
           lot of rewinds, calls pa_stream_write with bad timing (e.g.
           rewinds 990 ms and writes 1s every 10 ms) and see whether it
           explodes :)
<patrakov> I don't expect it to explode with one client, but two may be
           enough in your case
<darwish>  that client would be a nice discussion entry point :-)
<darwish>  I'm now working on some patches for the kernel to inform us
           when it kills PA.. will develop that client, and hopefully
           see how to fix this, afterwards
<patrakov> why do we need those patches?
<patrakov> doesn't the kernel already send SIGXCPU when the soft-limit
           is exceeded?
<patrakov> shouldn't we just set the soft limit correctly in PulseAudio?
<darwish>  it does .. that was the argument too from tglx
<darwish>  patrakov, http://www.serverphorums.com/read.php?12,450582
<patrakov> oh, ok
<darwish>  patrakov, but yeah .. I've asked myself too if it's better to
           just appropriately handle SIGXCPU
<darwish>  so I'm not sure if the kernel devs will accept the patch,
           honestly
<patrakov> on the other hand, can we handle SIGXCPU properly in the case
           when the CPU hog is a DTS encoder? It is not really
           "actionable" upon, other than logging a message.
<patrakov> you can set a flag that says "stop further mixing", but it is
           useless if we are DTS-encoding, not mixing
<darwish>  I can at least log a message in PulseAudio .. so when a user
           submits a bug report with PA killed, and see that message, we
           are 99% sure we've just exceeded our limits
<patrakov> Fair enough
<darwish>  and in that case we won't need the kernel patch I guess ..
[...]
<darwish>  OK I'll go and have some lunch now (and watch the linux audio
           conference video in the process ;-)) .. thanks a lot for this
           discussion, I've learned a lot :-)

Comment 8 Tanu Kaskinen 2016-03-21 07:47:35 UTC

Here's another idea for making a difficult client: use the maximum number of channels and high sample rate. We recently bumped the maximum sample rate to 384 kHz, so either use that, or something close to it. 384 kHz is a multiple of 48 kHz, which might help the resampler (that's just wild speculation - I know very little about resampler algorithms). 383987 is the nearest prime number, maybe that would be pessimal. I don't know if the sample format has much effect, but I'd imagine 24-bit non-native-endian samples would be the most difficult choice.
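
For a sense of scale, this sketch compares the raw PCM data rate of such a stream with an ordinary one. The 384 kHz rate and 24-bit samples are from the paragraph above; the maximum of 32 channels and the 44100 Hz, 16-bit, stereo baseline are assumptions made for the comparison.

```python
def pcm_bytes_per_second(rate_hz, channels, bytes_per_sample):
    """Raw PCM data rate of a stream, in bytes per second."""
    return rate_hz * channels * bytes_per_sample


def load_ratio(heavy, typical):
    """How many times more data the heavy stream carries than the typical one."""
    return heavy / typical


if __name__ == "__main__":
    heavy = pcm_bytes_per_second(384_000, 32, 3)  # assumed 32-channel maximum
    typical = pcm_bytes_per_second(44_100, 2, 2)  # assumed everyday stream
    print(heavy, typical, round(load_ratio(heavy, typical)))
```

With these assumptions a single such stream carries about 209 times the data of an ordinary stereo stream, before any resampling cost is counted.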

A general comment on the topic: if someone decides to try to fix this bug, it would be good to be clear whether the goal is to fully prevent the DoS attack, or if the goal is to just reduce the chances that we accidentally die during normal use. To me it seems very hard to reach a state where we can guarantee that DoS attacks are impossible.

First, there is the problem of how to know when we are reaching the limits of the cpu. If SIGXCPU can happen at relatively low load due to cpu scaling, then we can't easily use that, assuming that we don't want to limit ourselves to the lowest cpu frequency.

Another problem is how to mitigate the high cpu use without causing a different DoS scenario. If we randomly start killing streams, that's DoS too. I guess it would be possible to figure out what streams are from the "currently active user" when running as a system daemon and kill all other streams, or what streams are from untrusted applications in a sandboxing scenario and kill those.
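
One way to picture such a policy is the sketch below. Everything in it is hypothetical: the Stream record, its fields and the order of preference (sandboxed streams first, then streams of other users) are illustrative assumptions, not an existing PulseAudio interface.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    owner_uid: int
    sandboxed: bool  # True if the stream comes from an untrusted application


def streams_to_kill(streams, active_uid):
    """Pick the streams to drop when the RT budget is about to be exceeded.

    Untrusted (sandboxed) streams are dropped first. If there are none,
    streams that do not belong to the currently active user are dropped.
    Streams of the active user from trusted applications are never chosen.
    """
    untrusted = [s for s in streams if s.sandboxed]
    if untrusted:
        return untrusted
    return [s for s in streams if s.owner_uid != active_uid]


if __name__ == "__main__":
    streams = [
        Stream("game", 1000, True),
        Stream("music", 1000, False),
        Stream("other-user", 1001, False),
    ]
    print([s.name for s in streams_to_kill(streams, active_uid=1000)])
```

Such a policy never picks streams at random, but it still leaves the first problem open: deciding when the limit is actually close.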

Comment 9 Tanu Kaskinen 2016-03-21 08:07:03 UTC

(In reply to Ahmed S. Darwish from comment #0)
>   2. Disabling SHM (--disable-shm) makes triggering the bug much
>      quicker. Only after connection from client #27

I wouldn't have expected that. Clients send the audio to pulseaudio's main thread, which is not running with RT enabled. The amount of work done in the alsa sink thread shouldn't be affected by how the audio got from the client to the server.

Comment 10 Tanu Kaskinen 2016-03-21 08:45:43 UTC

One more comment...

(In reply to Ahmed S. Darwish from comment #6)
> Excellent. So beside a policy patchset, we also need a "handling
> abusive clients" one .. there's still a very long way to go for
> proper containers support ;-)

I'm not sure what you mean by "proper", but having the possibility to cause a DoS on the audio system doesn't seem hugely problematic to me. The goal of hostile applications is rarely to only make the life of the user miserable. Of course it would be nice to be able to prevent that, but even if we never fix this, I don't expect there to be much trouble from a security point of view. A bigger problem is that legitimate use cases may bring the audio server down (high cpu use can also be caused by a single stream with heavy processing).

Comment 11 GitLab Migration User 2018-07-30 10:29:30 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/pulseaudio/pulseaudio/issues/460.
