Summary: | PulseAudio gets reliably killed upon a big number of client connections | ||
---|---|---|---|
Product: | PulseAudio | Reporter: | Ahmed S. Darwish <darwish.07> |
Component: | core | Assignee: | pulseaudio-bugs |
Status: | RESOLVED MOVED | QA Contact: | pulseaudio-bugs |
Severity: | major | ||
Priority: | medium | CC: | lennart, patrakov |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Description
Ahmed S. Darwish
2016-03-19 19:44:09 UTC
Thanks for the report. The kills may happen because of exceeding the allowed real-time budget. To verify this, please modify /etc/pulse/daemon.conf:

realtime-scheduling=no

Hi Alex,

On Sun, Mar 20, 2016 at 05:48:28AM +0000, bugzilla-daemon@freedesktop.org wrote:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=94629
>
> --- Comment #1 from Alexander E. Patrakov <patrakov@gmail.com> ---
> Thanks for the report. The kills may happen because of exceeding the allowed
> real-time budget. To verify this, please modify /etc/pulse/daemon.conf:
>
> realtime-scheduling=no
>

Thanks! This indeed confirms that PA reliably exceeds its RT budget. Adding `realtime-scheduling=no' makes the bug vanish completely. Reset the option back to yes, and PA gets reliably killed again.

After applying Arun's kernel rlimits verbosity patch, here is what I see in dmesg when PA gets killed unsolicited:

[23891.784010] CPU Watchdog Timeout (hard): alsa-sink-VT170[7131]

So this adds another confirmation, and shows why the bug does not appear when using NULL sinks..

[ Hmm, .. thinking about solutions ]

Unfortunately, lowering the maximum number of clients is not a solution: on Tanu's machine, sometimes even just 18 clients did the trick :-(

One of the reasons for the memfd + per-client pools patches is to make PA a little bit safer as a system daemon (a fuzzer may also be built down the road). If clients can trigger a reliable PulseAudio kill, we have a reliable DoS attack on our hands..

So now the question is, how to modify PA to make sure it does not exceed its RT budget?

Regards,
Darwish

One possibility is to make sure that it doesn't attempt to mix too much ahead. Please reenable realtime scheduling, and try this in default.pa:

load-module module-udev-detect tsched_buffer_size=50000

But note that this will add up to 0.7W of extra power consumption in the single-client case, so cannot be made the default.
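For readers unfamiliar with the real-time budget discussed in the comments above: on Linux it is enforced through the RLIMIT_RTTIME resource limit, expressed in microseconds. When a real-time thread reaches the soft limit the kernel sends SIGXCPU, and when it reaches the hard limit the kernel sends SIGKILL, which matches the "(hard)" watchdog line in the dmesg output. The following is only a minimal sketch of how a process could lower its soft limit and log a warning before the hard limit is hit. It is not PulseAudio's actual code; the function name `install_rt_budget_warning`, the flag `rt_budget_exceeded`, and the 150000 microsecond value used in the usage note are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

/* Set by the handler; a real daemon could poll this flag in its main loop.
 * (Hypothetical name, chosen for this sketch.) */
static volatile sig_atomic_t rt_budget_exceeded = 0;

/* SIGXCPU handler: only async-signal-safe calls are allowed here,
 * so the warning is emitted with write() instead of printf(). */
static void on_sigxcpu(int signo) {
    static const char msg[] =
        "warning: real-time CPU budget (soft limit) exceeded\n";
    ssize_t ignored;

    (void) signo;
    rt_budget_exceeded = 1;
    ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void) ignored;
}

/* Lower the RLIMIT_RTTIME soft limit to soft_usec microseconds (clamped to
 * the current hard limit) and install a SIGXCPU handler.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int install_rt_budget_warning(rlim_t soft_usec) {
    struct rlimit rl;
    struct sigaction sa;

    if (getrlimit(RLIMIT_RTTIME, &rl) < 0)
        return -1;

    /* The soft limit may never exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && soft_usec > rl.rlim_max)
        soft_usec = rl.rlim_max;

    rl.rlim_cur = soft_usec;
    if (setrlimit(RLIMIT_RTTIME, &rl) < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigxcpu;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGXCPU, &sa, NULL);
}
```

A caller would invoke something like `install_rt_budget_warning(150000)` before switching a thread to a real-time scheduling policy. The limit only counts CPU time consumed under a real-time policy without making a blocking system call, so the handler would not fire with `realtime-scheduling=no`.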
On Sun, Mar 20, 2016 at 10:26:33AM +0000, bugzilla-daemon@freedesktop.org wrote:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=94629
>
> --- Comment #3 from Alexander E. Patrakov <patrakov@gmail.com> ---
> One possibility is to make sure that it doesn't attempt to mix too much ahead.
> Please reenable realtime scheduling, and try this in default.pa:
>
> load-module module-udev-detect tsched_buffer_size=50000
>

Yup, I confirm this has made the bug appear much less often. Originally, PA got killed after only 30 client connections. After adding the parameter above, the kill was triggered only after 60 client connections plus waiting for a full minute..

> But note that this will add up to 0.7W of extra power consumption in the
> single-client case, so cannot be made the default.
>

Sorry, my knowledge in this area is limited, but can we program PA to dynamically stop accepting more clients when it's close to approaching its realtime CPU limit?

Maybe program PA, upon approaching its soft limit and getting a SIGXCPU from the kernel, not to accept more clients?

Thanks,

(In reply to Ahmed S. Darwish from comment #4)
> Sorry, my knowledge in this area is limited, but can we program
> PA to dynamically stop accepting more clients when it's close to
> approaching its realtime CPU limit?

No. The problem is not only with the number of clients. In steady state, PulseAudio may well support, say, 45 clients. However, if a client triggers rewinds often (either by changing its own volume frequently or by providing non-zero as the last two parameters of pa_stream_write()), that's much more work for PulseAudio. In other words, this is not only about not accepting new clients, but about stopping processing on behalf of already-connected clients that started behaving abusively.

I encourage you to try to write such an abusive client, just to see whether you can kill PulseAudio on your hardware using just two clients :)

(In reply to Alexander E. Patrakov from comment #5)
> (In reply to Ahmed S. Darwish from comment #4)
> > Sorry, my knowledge in this area is limited, but can we program
> > PA to dynamically stop accepting more clients when it's close to
> > approaching its realtime CPU limit?
>
> No. The problem is not only with the number of clients. In steady state,
> PulseAudio may well support, say, 45 clients. However, if a client triggers
> rewinds often (either by changing its own volume frequently or by providing
> non-zero as the last two parameters of pa_stream_write()), that's much more
> work for PulseAudio. In other words, this is not only about not accepting
> new clients, but about stopping processing on behalf of already-connected
> clients that started behaving abusively.
>
> I encourage you to try to write such abusive client, just to see whether you
> can kill PulseAudio on your hardware using just two clients :)

Excellent. So besides a policy patchset, we also need a "handling abusive clients" one .. there's still a very long way to go for proper containers support ;-)

After a lot of enlightening discussions with Alex, it seems this is a well-known problem in Pulse. For completeness of this bug report, here are the basic points:

1. Linux Audio Conference 2015, "Timing issues in desktop audio playback infrastructure", by Alexander

   slides: http://lac.linuxaudio.org/2015/download/rewind-slides.pdf

   The issue of unsolicited kills is _clearly_ summarized in slide #13 above:

   "to process (resample, mix, encode) 2000 ms of sound under the limited budget of 200ms of real-time. Not easy: on a weak CPU, a cpufreq-governed CPU, with software DTS encoder, under valgrind, etc ... Result: KILLED"

   Even more details are in the conference video and paper on the same topic here:

   http://lac.linuxaudio.org/2015/video.php?id=8

2. A second suggestion is to let PA appropriately program its realtime soft limit and install the appropriate SIGXCPU handlers in PA. This way, we can be almost sure that the kills are due to exceeding our budget.
   [ This is also the view favored by kernel developers, as they don't want to pollute the kernel logs much. http://www.gossamer-threads.com/lists/linux/kernel/1513490#1513490 ]

3. A third and final suggestion is to write some abusive clients to demonstrate how common the issue is, and that it's not only related to the number of connected clients, but to the issue of excessive rewinds and abusive clients in general

   "You could write a client that does a lot of rewinds, calls pa_stream_write with bad timing (e.g. rewinds 990 ms and writes 1s every 10 ms) and see whether it explodes :) .. I don't expect it to explode with one client, but two may be enough in your case"

==> Raw discussion log:

<patrakov> darwish: hello. the "realtime budget" problem that you reported is actually a known issue for my DTS encoder. There, even one stream is enough on typical hardware if PulseAudio is left with its default of mixing 2 seconds ahead
<darwish> patrakov, hi :-) .. oh, I see
<darwish> patrakov, seems it'll need some deep surgery to solve this while keeping interrupts low
<patrakov> indeed
<patrakov> and in fact I am on the fence whether to remove the low-interrupts feature, as it never worked correctly with processing such as resampling
<patrakov> i.e. it may be that we just have to accept the 0.7w hit
<darwish> hmm
<patrakov> please see http://lac.linuxaudio.org/2015/video.php?id=8 (slides are enough)
<darwish> patrakov, slide #13 summarizes everything really nicely :D
<patrakov> I also encourage you to take a look at CRAS source code - it has some efficient client-to-server communication method, so that the overhead from going down to 28 ms latency is only 0.2w, which is IMHO very tolerable and makes rewinds (which, together with speculative mixing ahead, are responsible for eating the realtime budget in your case) unneeded
<darwish> hmm
<patrakov> basically the current 2000 ms default for the tsched buffer is based on the assumption that mixing is cheap, and that mixing 2000 ms of audio should eat no more than 200 ms anyway
<darwish> patrakov, unless a high amount of clients connect, leading to excessive rewinds ..
<patrakov> which is false if the CPU is slowed down by the cpufreq framework - it just doesn't see enough load to bump the frequency
<darwish> patrakov, btw thanks a lot! I finally understood the concept of rewinding from your slides :D
[...]
<darwish> hmmm .. "CRAS doesn’t have any of the discussed workarounds"
<patrakov> what was meant is: "CRAS doesn't have any of the discussed workarounds and still works fine on hardware found in Chromebooks"
<patrakov> no rewinds = no need to guess how much it is possible to rewind, no need to deal with non-rewindable ALSA plugins, no need to write a rewindable resampler, no correctness issues, at the cost of 0.2w of extra power consumed (and if we assume that Chrome is the only possible client, then that's 0.0w)
<patrakov> because Chrome never actually uses high latency
<darwish> just found some slides by the CRAS folks here .. they also compare themselves with PA: http://goo.gl/zdmNu4
<patrakov> they indeed share a lot of ideas
[...]
<darwish> for completeness I'll add excerpts from the discussion above to the bug report + links to your slides and video conference
<patrakov> basically, I want you to actually write a client that does a lot of rewinds, calls pa_stream_write with bad timing (e.g. rewinds 990 ms and writes 1s every 10 ms) and see whether it explodes :)
<patrakov> I don't expect it to explode with one client, but two may be enough in your case
<darwish> that client would be a nice discussion entry point :-)
<darwish> I'm now working on some patches for the kernel to inform us when it kills PA.. will develop that client, and hopefully see how to fix this, afterwards
<patrakov> why do we need those patches?
<patrakov> doesn't the kernel already send SIGXCPU when the soft-limit is exceeded?
<patrakov> shouldn't we just set the soft limit correctly in PulseAudio?
<darwish> it does .. that was the argument too from tglx
<darwish> patrakov, http://www.serverphorums.com/read.php?12,450582
<patrakov> oh, ok
<darwish> patrakov, but yeah .. I've asked myself too if it's better to just appropriately handle SIGXCPU
<darwish> so I'm not sure if the kernel devs will accept the patch, honestly
<patrakov> on the other hand, can we handle SIGXCPU properly in the case when the CPU hog is a DTS encoder? It is not really "actionable" upon, other than logging a message.
<patrakov> you can set a flag that says "stop further mixing", but it is useless if we are DTS-encoding, not mixing
<darwish> I can at least log a message in PulseAudio .. so when a user submits a bug report with PA killed, and see that message, we are 99% sure we've just exceeded our limits
<patrakov> Fair enough
<darwish> and in that case we won't need the kernel patch I guess ..
[...]
<darwish> OK I'll go and have some lunch now (and watch the linux audio conference video in the process ;-)) .. thanks a lot for this discussion, I've learned a lot :-)

Here's another idea for making a difficult client: use the maximum number of channels and a high sample rate. We recently bumped the maximum sample rate to 384 kHz, so either use that, or something close to it. 384 kHz is a multiple of 48 kHz, which might help the resampler (that's just wild speculation - I know very little about resampler algorithms). 383987 is the nearest prime number, maybe that would be pessimal. I don't know if the sample format has much effect, but I'd imagine 24-bit non-native-endian samples would be the most difficult choice.

A general comment on the topic: if someone decides to try to fix this bug, it would be good to be clear whether the goal is to fully prevent the DoS attack, or just to reduce the chances that we accidentally die during normal use. To me it seems very hard to reach a state where we can guarantee that DoS attacks are impossible. First, there is the problem of how we know when we are reaching the limits of the CPU. If SIGXCPU can happen at relatively low load due to CPU frequency scaling, then we can't easily use that, assuming that we don't want to limit ourselves to the lowest CPU frequency. Another problem is how to mitigate the high CPU use without causing a different DoS scenario. If we randomly start killing streams, that's DoS too. I guess it would be possible to figure out which streams are from the "currently active user" when running as a system daemon and kill all other streams, or which streams are from untrusted applications in a sandboxing scenario and kill those.

(In reply to Ahmed S. Darwish from comment #0)
> 2. Disabling SHM (--disable-shm) makes triggering the bug much
> quicker. Only after connection from client #27

I wouldn't have expected that. Clients send the audio to pulseaudio's main thread, which is not running with RT enabled.
The amount of work done in the alsa sink thread shouldn't be affected by how the audio got from the client to the server.

One more comment...

(In reply to Ahmed S. Darwish from comment #6)
> Excellent. So beside a policy patchset, we also need a "handling
> abusive clients" one .. there's still a very long way to go for
> proper containers support ;-)

I'm not sure what you mean by "proper", but having the possibility to cause a DoS on the audio system doesn't seem hugely problematic to me. The goal of hostile applications is rarely to only make the life of the user miserable. Of course it would be nice to be able to prevent that, but even if we never fix this, I don't expect there to be much trouble from a security point of view. A bigger problem is that legitimate use cases may bring the audio server down (high cpu use can also be caused by a single stream with heavy processing).

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/pulseaudio/pulseaudio/issues/460.