Bug 66664

Summary: journalctl boot reporting inconsistent
Product: systemd Reporter: hortplumber
Component: generalAssignee: systemd-bugs
Status: RESOLVED FIXED QA Contact: systemd-bugs
Severity: major    
Priority: medium CC: hortplumber
Version: unspecified   
Hardware: Other   
OS: Linux (All)   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=963620
Whiteboard:
i915 platform: i915 features:
Attachments: Annotated output of journalctl (suitably filtered)
Diagnostic output from log analysis tool, 20130712
journalctl diagnostic script
report output

Description hortplumber 2013-07-07 12:24:18 UTC
Boot-time info reported by journalctl (systemd 204-1) is often inconsistent between essentially identical boot cycles. The inconsistencies take the form of randomly missing messages from the journalctl output. These lacunas can make it very challenging to understand exactly what is going during the boot sequence.

This problem is documented in detail in the following posting on the Arch forum:

    https://bbs.archlinux.org/viewtopic.php?id=163633

The posting contains information on how to reproduce the problem.

If you prefer, I can transcribe that posting here. Just let me know and I'll be happy to do so.

As an aside, I suspect -- but have no hard evidence -- that this issue may be related to the oft-reported journalctl corruption issue, e.g. as reported here:

    https://bbs.archlinux.org/viewtopic.php?pid=1239783
Comment 1 Zbigniew Jedrzejewski-Szmek 2013-07-07 12:39:16 UTC
It's unlikely to be related to journal corruptions, IMHO. I'd venture that it's more likely caused by journald simply not keeping up with the log output. Can you see what happens when you
a) Add Nice=-10 to /etc/systemd/system/systemd-journald.service.d/nice.conf
b) Add RateLimitInterval=0 to /etc/systemd/journald.conf
c) do a) and b)

Also:
d) Use journalctl from git to look at the suspected corrupt journal files. There have been some commits to make the verification less zealous and more verbose. (You can run journalctl directly from the build directory, without installation if that is easier.)

Zbyszek
Comment 2 hortplumber 2013-07-07 13:39:36 UTC
Your observational requests will take some time, since each requires doing multiple boot cycles and then careful hand-examination of the journalctl output. Can you specify among a,b,c the one that you'd most like to see and I'll do that one first. Thx.
Comment 3 Zbigniew Jedrzejewski-Szmek 2013-07-07 13:43:34 UTC
Try c). If it doesn't help, then the solution must lie somewhere else.
Comment 4 hortplumber 2013-07-07 21:24:38 UTC
Regarding (a): As presently configured, my system has no directory 

     /etc/systemd/system/systemd-journald.service.d

I also looked for any file anywhere under the / filesystem by name nice.conf. No joy.
Comment 5 hortplumber 2013-07-07 21:44:35 UTC
Regarding (a): As presently configured, my system has no directory 

     /etc/systemd/system/systemd-journald.service.d

I also looked for any file anywhere under the / filesystem by name nice.conf. No joy.
Comment 6 Zbigniew Jedrzejewski-Szmek 2013-07-08 00:09:58 UTC
(In reply to comment #5)
> Regarding (a): As presently configured, my system has no directory 
> 
>      /etc/systemd/system/systemd-journald.service.d
You have to create this directory.

> I also looked for any file anywhere under the / filesystem by name
> nice.conf. No joy.
The name is arbitrary, it just should end in .conf.
Comment 7 hortplumber 2013-07-08 01:49:58 UTC
Created attachment 82155 [details]
Annotated output of journalctl (suitably filtered)

This file is a hand-annotated version of journalctl output, prepared as

    journalctl -a --since '2013-07-07 18:00:00' | grep Start

The only changes made were to label the boot instances and separate them with some vertical whitespace.
Comment 8 hortplumber 2013-07-08 01:59:22 UTC
(In reply to comment #6)
> 
> You have to create this directory.
> 

OK. You originally said "Add Nice=-10..." which implies the file already existed; just wanted to be sure about it to avoid potentially adding confusion.

Anyway, I created nice.conf, put in Nice=-10, and uncommented the RateLimitInterval in journald.conf and set it to 0. Then ran seven boot cycles, doing nothing in between (i.e. as soon as the login prompt appears, log in as root and run "shutdown -r now", that's it.)

The results are posted in the attachment to Comment #7. Pretty much in line with what I observed earlier in the Arch forum posting. Below are some notes pertaining to that attachment, calling out some (but certainly not all) of the
lost-message differences.


Boot #1:
        * 93 lines
        * 4 Starting Getty 
        * 4 Started Getty 
        * 1 Starting Virtual console mouse server
        * 0 Started  Virtual console mouse server


Boot #2:
        * 63 lines
        * 0 Starting Getty 
        * 0 Started Getty 
        * 0 Starting Virtual console mouse server
        * 0 Started  Virtual console mouse server
        * Also missing several other messages present in most of the other
          boot instances, including "Manage Sound Card State", "Restore
          Sound Card State", "hack for mimicking rc.local"


Boot #3:
        * 95 lines
        * 3 Starting Getty
        * 3 Started Getty
        * 1 Starting Virtual console mouse server
        * 1 Started  Virtual console mouse server


Boot #4:
        * 93 lines
        * 4 Starting Getty
        * 3 Started Getty
        * 1 Starting Virtual console mouse server
        * 1 Started  Virtual console mouse server


Boot #5:
        * 91 lines
        * 3 Starting Getty
        * 2 Started Getty
        * 1 Starting Virtual console mouse server
        * 1 Started  Virtual console mouse server


Boot #6:
        * 94 lines
        * 4 Starting Getty
        * 4 Started Getty
        * 1 Starting Virtual console mouse server
        * 0 Started  Virtual console mouse server


Boot #7:
        * 93 lines
        * 3 Starting Getty
        * 3 Started Getty
        * 1 Starting Virtual console mouse server
        * 1 Started  Virtual console mouse server
Comment 9 hortplumber 2013-07-10 16:41:08 UTC
Hi Zbyszek,

In looking over the unfiltered journalctl output in more detail, it seems that systemd is complaining about the nice.conf file created at your behest:

    [/etc/systemd/system/systemd-journald.service.d/nice.conf:4] Assignment outside of section. Ignoring.

The file contains the line "Nice=-10" as you instructed. It contains no "section" info, which -- in view of the above error message -- it evidently should.

Given that the Nice parameter was ignored, all of the results that I prepared and posted earlier (Comment #7 and Comment #8) are not really representative of the experiment that you wanted me to perform.

Please provide the exact and complete desired content of the nice.conf file, so we can avoid further confusion and missteps in diagnosing the issue. I will then run the same experiment again, and generate a new set of reports.

Thx.
Comment 10 hortplumber 2013-07-13 14:18:54 UTC
Created attachment 82390 [details]
Diagnostic output from log analysis tool, 20130712

Not having heard anything further on this since last week, I tried 

        [service]
        Nice=-10

in the nice.conf file. This seems to avoid the "Assignment outside of section"
complaint.  Not sure if "Service" was your intented section, but no other
choice seemed to make much sense.

The results are shown in the attachment. Summary: No observable qualitative
difference from the earlier results posted in Comments #7 and #8 and the
original posting on the Arch forum: The distribution of the number of dropped
messages is still roughly bimodal, with the "better" cases dropping roughly
5% of messages from every boot sequence, and the "worse" cases dropping
around 30%.

I also wrote a tool to do the journalctl log analysis, i.e. identify missing
messages among a set of boot cycles, to avoid the tedious work of hand-
comparsion.  So if there are further experiments with process prioritization
or buffering (or whatever else) you might like to try, the results of those
experiments can now be obtained fairly quickly.

That tool was used to generate the results in the attachment. In that output,
"MOI's" means messages of interest, presently comprising only the "Starting"
and "Started" messages generated by systemd (because they are syntactically
consistent across boot sequences).  Even though this is just a small subset
of the total messages generated during a boot sequence, it seems sufficient
to demonstrate the problem, i.e. that journald presently has some serious
reliability issues. Needless to say, diagnosis of boot problems is made
significantly more complicated when boot-time messages are frequently and
silently being dropped from the logs.

I have some past experience dealing with issues of this sort, i.e. diagnostic
logging processes competing for resources (realtime and buffers) to keep up
with bursty message generation. If you're interested perhaps we can discuss,
either here or on the systemd ML.

Let me know if you need any more info or example results.
Comment 11 hortplumber 2013-07-13 15:49:47 UTC
FYI: Status of this bug had been NEEDINFO. Once I provided the requested info, I changed the status to ASSIGNED (since it seemed the most sensible of all available options). But now looking at the list of other open systemd bugs, I see that no others are in the ASSIGNED category, so it was probably a mistake for me to have changed it to that value. Apologies. Reverting it now to NEW, since that seems to be the most popular choice, even though I don't understand the semantics.

Btw, clicking on the "help" link for the "Status" dropdown yields a page of essentially useless info, since it explains semantics only for tags CONFIRMED, UNCONFIRMED, and INPROGRESS, none of which are actually used on this site. Would be nice to have explanations for semantics of tags actually in use here, i.e. NEW, ASSIGNED, RESOLVED, and NEEDINFO.
Comment 12 Zbigniew Jedrzejewski-Szmek 2013-07-14 03:18:48 UTC
(In reply to comment #10)
Sorry for not replying earlier... I got caught up with other things.

> Not having heard anything further on this since last week, I tried 
> 
>         [service]
>         Nice=-10
> 
> in the nice.conf file. This seems to avoid the "Assignment outside of
> section"
> complaint.  Not sure if "Service" was your intented section, but no other
> choice seemed to make much sense.
Yes, that is the way that this was intended to look.

> The results are shown in the attachment. Summary: No observable qualitative
> difference from the earlier results posted in Comments #7 and #8 and the
> original posting on the Arch forum: The distribution of the number of dropped
> messages is still roughly bimodal, with the "better" cases dropping roughly
> 5% of messages from every boot sequence, and the "worse" cases dropping
> around 30%.
Thank you for the extensive testing. In principle, there's no mechanism to ensure that journald is able to process messages in time. But it usually works
fine. Since you removed the burst limits, and are still observing dropped messages, then we can guess that journald is simply not able to process  messages fast enough. Apparently changing the nice level is not enough.
There's one last thing that you might try, see below.

> I also wrote a tool to do the journalctl log analysis, i.e. identify missing
> messages among a set of boot cycles, to avoid the tedious work of hand-
> comparsion.  So if there are further experiments with process prioritization
> or buffering (or whatever else) you might like to try, the results of those
> experiments can now be obtained fairly quickly.
> 
> That tool was used to generate the results in the attachment. In that output,
> "MOI's" means messages of interest, presently comprising only the "Starting"
> and "Started" messages generated by systemd (because they are syntactically
> consistent across boot sequences).  Even though this is just a small subset
> of the total messages generated during a boot sequence, it seems sufficient
> to demonstrate the problem, i.e. that journald presently has some serious
> reliability issues. Needless to say, diagnosis of boot problems is made
> significantly more complicated when boot-time messages are frequently and
> silently being dropped from the logs.
> 
> I have some past experience dealing with issues of this sort, i.e. diagnostic
> logging processes competing for resources (realtime and buffers) to keep up
> with bursty message generation. If you're interested perhaps we can discuss,
> either here or on the systemd ML.
> 
> Let me know if you need any more info or example results.
The buffer for message is (by default on most distros), only 10.
The value is set in /proc/sys/net/unix/max_dgram_qlen. Increasing it would be beneficial for journald, but this is a global setting, and we're waiting for the superior solution of adding a sockopt to modify this for a specific socket. But you can increase it for debugging. It has to be increased before the journal socket is created. sysctl is usually run after journald has been started, so it must be set in a different way. Adding

ExecStartPre=/usr/sbin/sysctl net/unix/max_dgram_qlen=1000

to /etc/systemd/system/systemd-journald.socket.d/config.conf file
should work (note that this is for .socket, not .service, since the socket is created first).
If you're using an initramfs with systemd, I think you'd need to put the file also in the initramfs.
Comment 13 Zbigniew Jedrzejewski-Szmek 2013-07-14 03:54:47 UTC
(In reply to comment #11)
> FYI: Status of this bug had been NEEDINFO. Once I provided the requested
> info, I changed the status to ASSIGNED (since it seemed the most sensible of
> all available options). But now looking at the list of other open systemd
> bugs, I see that no others are in the ASSIGNED category, so it was probably
> a mistake for me to have changed it to that value. Apologies. Reverting it
> now to NEW, since that seems to be the most popular choice, even though I
> don't understand the semantics.
We use ASSIGNED to signal that we're working on something, i.e. for communication between developers to avoid duplicate work. So the bug should stay NEW until somebody actually starts working on the solution.

> Btw, clicking on the "help" link for the "Status" dropdown yields a page of
> essentially useless info, since it explains semantics only for tags
> CONFIRMED, UNCONFIRMED, and INPROGRESS, none of which are actually used on
> this site. Would be nice to have explanations for semantics of tags actually
> in use here, i.e. NEW, ASSIGNED, RESOLVED, and NEEDINFO.
https://bugs.freedesktop.org/show_bug.cgi?id=66887
Comment 14 hortplumber 2013-07-14 13:01:32 UTC
(In reply to comment #12)

>
> Sorry for not replying earlier... I got caught up with other things.
> 

No problem, didn't mean to rush you, just letting you know what I had been fiddling around with.

>
> Adding
> 
> ExecStartPre=/usr/sbin/sysctl net/unix/max_dgram_qlen=1000
> 
> to /etc/systemd/system/systemd-journald.socket.d/config.conf file
> should work (note that this is for .socket, not .service, since the socket
> is created first).

Tried this, with the value you suggested (1000) and it does clean up the observed message drops. Now, over 10 bootseq's, each one reports 103/103 of the MOIs ("Starting"/"Started"), whereas had been seeing roughly 90/95 (in most cases) and occasionally 60/95.  

A few observations/comments:

 1. Obvious: When the per-socket sockopt capability is added, you'll probably  want to set the qlen to a value larger than 10. :) There's nothing especially unusual about my system, so I would have to guess that this problem is happening with qlen=10 for a fairly large fraction of users, but simply has not been noticed. (I didn't notice it myself until I began looking for a specific boot message of interest and noticed that it sometimes appeared and sometimes did not.)

2.  I fully understand (as you point out) that here is no guarantee that journald can always keep up with message generation rate with probability = 1 in all situations, regardless of buffer sizes and queue length choices. It's not possible to close the source->sink loop without incurring a priority inversion between source and sink when the source rate becomes very high. I appreciate that and am all too familiar with it. But, I would opine, that it ought to be an explicit design requirement on journald going forward that it report with "very high" probability when messages have been dropped, rather than dropping them silently. That ought to be possible using mutexes without getting into priority inversion issues. IMO as an ordinary user, silently dropping log messages should be considered grossly unacceptable except under the most extreme message loading conditions (e.g. DoS attack :)). Silent drops should have a vanishingly small probability of occurrence during ordinary events such as boot-up.

IMO, if you don't adopt this as a basic design principle, you're shooting yourself in the foot, PR-wise, because sooner or later people are going to start whining about journald logging reliability vis a vis "good old sysV". 

>
> If you're using an initramfs with systemd, I think you'd need to put the
> file also in the initramfs.
>

I am using an initramfs, but tried it first without putting the new config file into it, and it worked. Curious as to why, if it "shouldn't have", but not going to look it in the mouth. :)


One request, if you have time: Could you possibly post some example output from "journalctl -a" from some (any) systems that you might have access to? I'd like to run them thru my little analysis tool and see whether the kinds of drops I was seeing (with qlen=10) are occurring. This will also have benefit for you as developers to perhaps get an idea of how often this sort of stuff is happening.

Thanks for your time and assistance.
Comment 15 Zbigniew Jedrzejewski-Szmek 2013-07-14 15:16:01 UTC
(In reply to comment #14)
> (In reply to comment #12)
> A few observations/comments:
> 
>  1. Obvious: When the per-socket sockopt capability is added, you'll
> probably  want to set the qlen to a value larger than 10. :) There's nothing
> especially unusual about my system, so I would have to guess that this
> problem is happening with qlen=10 for a fairly large fraction of users, but
> simply has not been noticed. (I didn't notice it myself until I began
> looking for a specific boot message of interest and noticed that it
> sometimes appeared and sometimes did not.)
> 
> 2.  I fully understand (as you point out) that here is no guarantee that
> journald can always keep up with message generation rate with probability =
> 1 in all situations, regardless of buffer sizes and queue length choices.
> It's not possible to close the source->sink loop without incurring a
> priority inversion between source and sink when the source rate becomes very
> high. I appreciate that and am all too familiar with it. But, I would opine,
> that it ought to be an explicit design requirement on journald going forward
> that it report with "very high" probability when messages have been dropped,
> rather than dropping them silently. That ought to be possible using mutexes
> without getting into priority inversion issues. IMO as an ordinary user,
> silently dropping log messages should be considered grossly unacceptable
> except under the most extreme message loading conditions (e.g. DoS attack
> :)). Silent drops should have a vanishingly small probability of occurrence
> during ordinary events such as boot-up.
> 
> IMO, if you don't adopt this as a basic design principle, you're shooting
> yourself in the foot, PR-wise, because sooner or later people are going to
> start whining about journald logging reliability vis a vis "good old sysV". 
A major source of slowdown are various /proc lookups. There are some patches on the ML to reduce their number [1]... but also some patches to remove the need for them [2]. We can't really allow PID 1 to pause waiting for journald, so making journald more efficient is the way to go.

[1] http://lists.freedesktop.org/archives/systemd-devel/2013-June/011591.html
[2] http://people.apache.org/~jkaluza/patches/linux/

> > If you're using an initramfs with systemd, I think you'd need to put the
> > file also in the initramfs.
> >
> 
> I am using an initramfs, but tried it first without putting the new config
> file into it, and it worked. Curious as to why, if it "shouldn't have", but
> not going to look it in the mouth. :)
Thinking about it a bit more, I come to the conclusion that systemd must reopen the sockets when systemd is executed, so please ignore the part about initramfs.

> One request, if you have time: Could you possibly post some example output
> from "journalctl -a" from some (any) systems that you might have access to?
> I'd like to run them thru my little analysis tool and see whether the kinds
> of drops I was seeing (with qlen=10) are occurring. This will also have
> benefit for you as developers to perhaps get an idea of how often this sort
> of stuff is happening.
> 
> Thanks for your time and assistance.
Maybe you could post the tool. I'm not too keen on posting my logs publicly, since like everybody I type my password instead of the login every once in a while... I can post some logs from my machine at work later on.
Comment 16 hortplumber 2013-07-15 15:27:58 UTC
(In reply to comment #15)
>
> Maybe you could post the tool. I'm not too keen on posting my logs publicly,
> since like everybody I type my password instead of the login every once in a
> while... I can post some logs from my machine at work later on.
>

I'll be happy to post the tool, but would prefer to test it a bit first with
some logfile data other than my own, so it's more likely to work for other people who might be using different systemd version. (For example, there might be message formatting assumptions embedded in the diagnostic code that are specific to my system and/or systemd version.)

But I understand completely your reticence to post logs publically. How about
this as a compromise: Post some logs that have been sanitized like this:

    $ patt='Starting|Started|finished in|running in'
    $ journalctl -a --since [whenever]  | egrep "$patt" 

That filtering has a high probability of eliminating any messages containing private data, yet will still provide my tool with all the messages it needs to do the diagnostic analysis.

If you (or someone you know :) ) could post a few logs like this, obtained after doing say 5-6 identical boot cycles, it would be very helpful in testing the tool.  Then I can make any necessary fixes and post the tool here, and it will have a higher probability of actually working correctly for others who might be interested in using it. 

Just a polite request. Obviously if you don't have time, no big deal.

Thanks.
Comment 17 Kay Sievers 2013-07-15 15:36:34 UTC
Sent you a log file per private email. Thanks!
Comment 18 hortplumber 2013-07-15 15:57:34 UTC
(In reply to comment #17)
>
> Sent you a log file per private email. Thanks!
>

Thanks, Kay! Got it. I certainly won't post it anywhere at all, public or not.

Is it ok if I post the diagnostic results here? The diagnostic output contains absolutely nothing of a private nature or even any identifying information about the source of the logs.

(If you want, I can privately send you the diagnostic output and you can have a look for yourself to be assured it's as I've said above. Just let me know.)
Comment 19 hortplumber 2013-07-20 18:06:19 UTC
Created attachment 82739 [details]
journalctl diagnostic script

See comment #19 for more details.
Comment 20 hortplumber 2013-07-20 18:08:52 UTC
I just posted the journalctl diagnostic tool as an attachment.

It should run standalone on most systems with perl ~5.6 or later.  All my
local library dependencies were interpolated into the script so hopefully
there should not be any dangling dependencies.

The source contains a reasonably complete documentation header up top. If you
want to understand what the tool does and how it performs the drop-rate
analysis, you'll have to read it. Sorry.

If you simply want to run it in appliance-operator mode and post your results
here to the bug tracker for others to look over then you don't need to
understand much about what it's doing, but you should still read the paragraph
beginning with "jcdiag expects..." which describes the assumptions regarding
the logfile contents.  Otherwise the results you post may be meaningless.

To run in appliance-operator mode, just

   $ jcdiag -v yourlogfile 2> yourlogfile.err

should do the trick.  This will create a report file named yourlogfile.rpt,
and write verbose stderr contents to yourlogfile.err.  Post both of those
files to the bugtracker.

Neither file should contain any private information, but check it to be sure
of course. Anything posted to this bugtracker is world-visible.
Comment 21 Kay Sievers 2013-07-20 20:12:56 UTC
Created attachment 82744 [details]
report output

Deleted all journal data, rebooted a couple of times in a row, exported
the journal. Attached is the report.
Comment 22 hortplumber 2013-07-22 22:50:33 UTC
(In reply to comment #21)
>
> Deleted all journal data, rebooted a couple of times in a row, exported
> the journal. Attached is the report.
>

Thanks, Kay. Appreciate your time in doing this.

Hopefully some others will also post some results and we can get a better idea how prevalent this issue is.   So far, nothing in your results or mine suggests it is unusual. Suspect this is happening for a large fraction of users.
Comment 24 hortplumber 2013-07-24 19:01:46 UTC
(In reply to comment #23)
>
> Fixed in git:
> 

Lennart, did you happen to run any A-B comparisons using the tool that was posted in Comment #19? If so, would be interested to see results. 

(The only reason I'm asking instead of doing it myself is just because there's no -git pkg available in the Arch repo. Otherwise I'd try it and run my own comparisons.)

Thx.
Comment 25 hortplumber 2013-10-20 19:45:14 UTC
Tip o' the hat on the fix for this, thanks.

Just to close the loop, I just tried exercising a recent version of systemd (208-1) using the analysis utility used for characterizing the earlier problems. That tool checks only for "Starting" and "Started" messages, so obviously doesn't have 100% fault coverage, but it found zero dropped "Start*" messages in 5 boot cycles, whereas with 204-1, had been seeing something like 10-15% on average.  

So, nice job on the fix for a nasty problem. 

Cheers.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.