76418 – systemd unusable after segfault, zombies everywhere, unable to shutdown

Status:	RESOLVED WONTFIX

Product:	systemd
Classification:	Unclassified
Component:	general
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	systemd-bugs
QA Contact:	systemd-bugs

Reported:	2014-03-20 22:13 UTC by Peter Wu
Modified:	2014-05-24 08:55 UTC

Description Peter Wu 2014-03-20 22:13:54 UTC

When systemd (init!) 210-2 (Arch Linux x86_64) segfaulted some days ago, I was brought to tty1. This happened out of nowhere; I was just reading and not touching anything. After that segfault, systemd was totally unusable:

 - systemctl <stop | start | status> <anything> timed out
 - NetworkManager dispatcher services also timed out
 - unable to suspend (not by lid close, not by systemctl suspend, not by suspend key)
 - unable to shut down (systemctl poweroff; shutdown -h now)
 - Ignores the documented SIGRTMIN+4 signal to shut down the machine.
 - Ignores SIGTERM, SIGKILL (ok), but after sending another SIGSEGV, I got a kernel panic.
 - Zombies everywhere. By the time I was about to shut down (panic), there were 3.2k zombie processes.
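
A zombie count like the one in the last item can be taken by scanning /proc for processes in state Z. The following is a minimal Python sketch under that assumption, not necessarily how the number in this report was obtained; `parse_state` and `count_zombies` are hypothetical helper names.

```python
import os


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_zombies(proc_root: str = "/proc") -> int:
    """Count processes whose state is 'Z' (zombie) under proc_root."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                if parse_state(f.read()) == "Z":
                    count += 1
        except OSError:
            continue  # process vanished or is unreadable; skip it
    return count
```

Calling `count_zombies()` with no arguments scans the live /proc; passing another directory is only useful for testing the parser against saved stat files.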

I still have a tiny core dump, but without debugging symbols it is quite useless. This report is not about that specific crash, but more about handling segfaults in general.

init is supposed to be unkillable, right? It ignores SIGTERM and SIGKILL... but sending SIGSEGV twice results in a kernel panic because it killed itself. sysvinit on Debian does not have this issue: when it receives a segfault, it sleeps for 30 seconds, ignoring any signals. Due to its different architecture, services can still be started and stopped.

What is the expected behavior:
systemd should handle SIGSEGV gracefully, especially since it can be triggered by any root program. It should not let zombies walk over /proc/. It should not make it impossible to start/stop/query services.

Reproduced with QEMU:

 0. Install Arch in QEMU, edit /etc/systemd/journald.conf, log to /dev/ttyS1[1]. Edit /etc/default/grub, add `console=ttyS0 loglevel=7` to cmdline.
 1. qemu-system-x86_64 -enable-kvm -hda arch.qcow2 -m 1G -serial file:dmesg.txt -serial file:journal.txt
 2. tailf journal.txt
 3. kill -SEGV 1
 4. Observe the following in out.txt:

[  218.557179] systemd[1]: Caught <SEGV>, dumped core as pid 289.
[  218.558909] systemd[1]: Freezing execution.
[  218.567627] systemd-coredump[290]: Process 289 (systemd) dumped core.

 5. kill -SEGV 1
 6. Observe a kernel panic (VM frozen, journal.txt possibly partially written). dmesg.txt contains:

[  229.817252] systemd[1]: segfault at 7fff77d31e68 ip 00007ffb412e4fd0 sp 00007fff77d31e70 error 6 in libc-2.19.so[7ffb4129d000+19e000]
[  229.833598] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[  229.833598] 
[  229.834798] CPU: 0 PID: 1 Comm: systemd Not tainted 3.13.6-1-ARCH #1
[  229.835608] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  229.836353]  ffff880037a6a760 ffff880119a4fc90 ffffffff81513274 ffffffff81700080
[  229.836658]  ffff880119a4fd08 ffffffff8150fe3a ffff880100000010 ffff880119a4fd18
[  229.836658]  ffff880119a4fcb8 0000000000000282 000000000000008b ffff880119a683c0
[  229.836658] Call Trace:
[  229.836658]  [<ffffffff81513274>] dump_stack+0x4d/0x6f
[  229.836658]  [<ffffffff8150fe3a>] panic+0xc8/0x1d7
[  229.836658]  [<ffffffff81064628>] do_exit+0xa78/0xa80
[  229.836658]  [<ffffffff810646af>] do_group_exit+0x3f/0xa0
[  229.836658]  [<ffffffff81073255>] get_signal_to_deliver+0x295/0x5f0
[  229.836658]  [<ffffffff81014498>] do_signal+0x48/0x950
[  229.836658]  [<ffffffff8140d0d0>] ? sockfd_lookup_light+0x20/0x80
[  229.836658]  [<ffffffff81014e08>] do_notify_resume+0x68/0xa0
[  229.836658]  [<ffffffff8151a37c>] retint_signal+0x48/0x8c
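
The exitcode=0x0000008b in the panic line can be read like a wait-style status: the low 7 bits hold the terminating signal and bit 0x80 is the core-dump flag. A small Python sketch of that arithmetic follows; the variable names are illustrative, and the bit layout is the conventional Linux encoding, assumed here to apply to this field.

```python
import signal

exitcode = 0x0000008B              # value printed in the panic line

term_signal = exitcode & 0x7F      # low 7 bits: terminating signal number
core_flag = bool(exitcode & 0x80)  # bit 7: core-dump flag
signal_name = signal.Signals(term_signal).name

print(term_signal, signal_name, core_flag)
```

This decodes to signal 11 (SIGSEGV) with the core-dump flag set, which matches the second `kill -SEGV 1` in step 5.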

Reproducibility: 100%


Distro: Arch Linux x86_64
Kernel: Linux 3.14-rc5 (reproduced in 3.13.6-1-ARCH)
systemd: 211-2

 [1]: https://wiki.archlinux.org/index.php/systemd#Forward_journald_to_.2Fdev.2Ftty12

Comment 1 Peter Wu 2014-03-20 22:25:35 UTC

Besides SIGSEGV (11), the following signals also exhibit the issue:

 - SIGQUIT (3)
 - SIGILL (4)
 - SIGABRT (6)
 - SIGBUS (7)
 - SIGFPE (8)
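
The numbers in parentheses are the x86-64 Linux values; some of them (SIGBUS, for example) differ on other platforms. A short Python sketch to look them up on the machine at hand, with `CRASH_SIGNALS` being just an illustrative name:

```python
import signal

# Signals named in this report; the list name is illustrative.
CRASH_SIGNALS = ["SIGQUIT", "SIGILL", "SIGABRT", "SIGBUS", "SIGFPE", "SIGSEGV"]

# Map each name to the number used on the running platform.
numbers = {name: int(getattr(signal, name)) for name in CRASH_SIGNALS}

for name, number in numbers.items():
    print(f"{name} ({number})")
```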

Comment 2 Gerardo Exequiel Pozzi 2014-03-21 16:16:21 UTC

See http://lists.freedesktop.org/archives/systemd-devel/2012-September/006457.html

Comment 3 Lennart Poettering 2014-05-24 08:55:47 UTC

There's no reason really to second guess the admin. If the admin sends SIGSEGV to PID 1 that's hardly any better than let's say connecting /dev/urandom with /dev/mem... The admin has myriads of ways to bring the system to a standstill, anyway, SIGSEGV is just one of them.

Sending SIGSEGV as admin doesn't happen by accident. From systemd's perspective, a SIGSEGV means PID 1 actually crashed, and in that case things are fucked really; we cannot recover from that, simply because the kernel doesn't allow us to "eat up" SIGSEGV: it will always cause a kernel oops, either immediately or when we try to exec() a new binary.

Note that sysvinit is only marginally better at that. Sure, you can still invoke a service or two, since that's independent of the init process, but you cannot shut down or do anything else anymore.

The simple summary is that the system is fucked, and we should try hard to fix the reasons in systemd/PID 1 so that this never really happens.

In a way, this isn't any different from making the kernel oops. Thankfully though systemd is vastly simpler than the kernel and knows no loadable extensions, so things are much easier for us.
