Bug 61191

Summary:	systemd-journald eats 100% CPU
Product:	systemd	Reporter:	Maxym Synytsky <synytsky>
Component:	general	Assignee:	systemd-bugs
Status:	RESOLVED FIXED	QA Contact:	systemd-bugs
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	x86 (IA32)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:

Description Maxym Synytsky 2013-02-20 21:37:28 UTC

Upon an unknown condition systemd-journald starts to eat 100% of CPU.
strace shows the following:

read(8, 0xbfd1238b, 8192)               = -1 EPIPE (Broken pipe)
clock_gettime(CLOCK_REALTIME, {1361395609, 750800535}) = 0
clock_gettime(CLOCK_REALTIME, {1361395609, 750835989}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[19796]: ", 9}, {"Sleeping for 514709548 ms", 25}, {"\n", 1}], 5) = 55
epoll_wait(7, {?} 0xbfd145dc, 1, 514709548) = 1
read(8, 0xbfd1238b, 8192)               = -1 EPIPE (Broken pipe)
clock_gettime(CLOCK_REALTIME, {1361395609, 750990449}) = 0
clock_gettime(CLOCK_REALTIME, {1361395609, 751025885}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[19796]: ", 9}, {"Sleeping for 514709547 ms", 25}, {"\n", 1}], 5) = 55
epoll_wait(7, {?} 0xbfd145dc, 1, 514709547) = 1
read(8, 0xbfd1238b, 8192)               = -1 EPIPE (Broken pipe)
clock_gettime(CLOCK_REALTIME, {1361395609, 751180501}) = 0
clock_gettime(CLOCK_REALTIME, {1361395609, 751215977}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[19796]: ", 9}, {"Sleeping for 514709547 ms", 25}, {"\n", 1}], 5) = 55
epoll_wait(7, {?} 0xbfd145dc, 1, 514709547) = 1
read(8, 0xbfd1238b, 8192)               = -1 EPIPE (Broken pipe)

And so on.
Distro is Arch Linux x86-32 running on x86-64 system if this matters, systemd version is 197.
If I restart systemd-journald, the CPU is not freed.

My case look very similar to the situation discussed here - https://bbs.archlinux.org/viewtopic.php?pid=1201128

Comment 1 Eric Griffith 2013-02-27 20:25:59 UTC

Confirmed, Arch Linux x86-64, systemd 197-4. I wasn't able to get an strace but in my case it wasn't a sustained problem. My fan kicked up I wondered what was causing it, pulled up htop and the journal was at 99% CPU on Core 0. Stayed like that for maybe 20seconds, then dropped back to normal. 

All I was doing was web browsing with Chromium and Rekonq. 

I find it odd that it hit 100% then eventually went back to normal, is there anything the journal does that should peg the CPU that hard? Or does systemd monitor the journal for hangs and restart it? (Did it hang, hence 100% cpu, systemd detected it, killed the journal and automatically restarted it?)

Comment 2 Eric Griffith 2013-02-27 20:26:58 UTC

Since i know about what time the process hit 100%, would any of the journals logs be applicable as additional debug info? Or can you even sort the journals logs by time, at all?

Comment 3 Lennart Poettering 2013-03-06 16:20:02 UTC

Hmm, in that strace it appears that fd 8 seems to be causing EPIPE continiusly. By any chance, could you figure out what fd that is?

I.e. what does /proc/$PID/fd/8 and /proc/$PID/fdinfo/8 point to/contain?

Comment 4 Maxym Synytsky 2013-03-07 07:03:41 UTC

There were a couple of reboots since then, so things changed.
I cannot see any more EPIPE error messages. As for the rest, the issue is still reproducible.
Currently strace shows the following:

clock_gettime(CLOCK_REALTIME, {1362639311, 615760612}) = 0
clock_gettime(CLOCK_REALTIME, {1362639311, 615791280}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[21669]: ", 9}, {"Sleeping for 17492137 ms", 24}, {"\n", 1}], 5) = 54
epoll_wait(7, {?} 0xbfcf939c, 1, 17492137) = 1
read(8, "46,5081212822,316432051764,-;sys"..., 8192) = 79
clock_gettime(CLOCK_REALTIME, {1362639311, 615929585}) = 0
clock_gettime(CLOCK_REALTIME, {1362639311, 615959991}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[21669]: ", 9}, {"Sleeping for 17492137 ms", 24}, {"\n", 1}], 5) = 54
epoll_wait(7, {?} 0xbfcf939c, 1, 17492137) = 1
read(8, "46,5081212823,316432051770,-;sys"..., 8192) = 79
clock_gettime(CLOCK_REALTIME, {1362639311, 616098321}) = 0
clock_gettime(CLOCK_REALTIME, {1362639311, 616128761}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[21669]: ", 9}, {"Sleeping for 17492137 ms", 24}, {"\n", 1}], 5) = 54
epoll_wait(7, {?} 0xbfcf939c, 1, 17492137) = 1
read(8, "46,5081212824,316432051776,-;sys"..., 8192) = 79
clock_gettime(CLOCK_REALTIME, {1362639311, 616266463}) = 0
clock_gettime(CLOCK_REALTIME, {1362639311, 616296821}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[21669]: ", 9}, {"Sleeping for 17492137 ms", 24}, {"\n", 1}], 5) = 54
epoll_wait(7, {?} 0xbfcf939c, 1, 17492137) = 1
read(8, "46,5081212825,316432051782,-;sys"..., 8192) = 79
clock_gettime(CLOCK_REALTIME, {1362639311, 616434289}) = 0
clock_gettime(CLOCK_REALTIME, {1362639311, 616464762}) = 0
writev(6, [{"<46>", 4}, {"systemd-journald", 16}, {"[21669]: ", 9}, {"Sleeping for 17492137 ms", 24}, {"\n", 1}], 5) = 54
epoll_wait(7, {?} 0xbfcf939c, 1, 17492137) = 1
read(8, "46,5081212826,316432051788,-;sys"..., 8192) = 79

Descriptors 6, 7 and 8 are as follows:

# ls -l /proc/21669/fd/6
l-wx------ 1 root root 64 Mar  7 00:29 /proc/21669/fd/6 -> /dev/kmsg
# ls -l /proc/21669/fd/7
lrwx------ 1 root root 64 Mar  7 00:29 /proc/21669/fd/7 -> anon_inode:[eventpoll]
# ls -l /proc/21669/fd/8
lrwx------ 1 root root 64 Mar  7 00:29 /proc/21669/fd/8 -> /dev/kmsg
#

BTW, when I set Storage=none the issue is not reproducible so I think this might be something with writing the logs.
So at the moment I can control and trigger this behaviour - set Storage back to the default value and restart journald.
If you need strace just after the journald restart I can try to get it.

Comment 5 Lennart Poettering 2013-03-07 15:12:42 UTC

Hmm, this seems to be a different loop, as this doesn't genreate EPIPE anymore.

This kinda appears to be a feedback loop: journald writes something to kmsg, and then reads it back, which it then writes a second time, and so on. Normally we should protect ourselves against that, but apparently that's not effective.

Do you have any special journal configuration set in journald.conf or on the kernel cmdline?

Comment 6 Maxym Synytsky 2013-03-08 20:54:30 UTC

At the time when I discovered this issue, I had nothing custom set in the config.
Now it has:
SystemMaxUse=100M
MaxRetentionSec=1week

Comment 7 Ieeiv Mabcer 2013-03-11 06:14:36 UTC

Same issue here (arch linux x86_64). This only happens if I set MaxRetentionSec in journald.conf. (archlinux package installs /var/log/journal by default so logs are not kept in ram)

See #45 post at https://bbs.archlinux.org/viewtopic.php?pid=1200373 , it can describe the issue better than I can.

Comment 8 Maxym Synytsky 2013-03-28 12:32:10 UTC

I feel I was mistaken when I wrote that I discovered this issue with vanilla config. Anyway, I commented out the MaxRetentionSec for now and can confirm that the problem vanished.

Comment 9 Kay Sievers 2013-03-28 14:08:51 UTC

This should be a fixed issue:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=7964dfca1b6c4b37d069b3713f9f1b11219c3474

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.