Bug 1378779

Summary: systemd-journald fails to start due to corrupted journal
Product: Red Hat Enterprise Linux 7 Reporter: Denis Silakov <dsilakov>
Component: systemdAssignee: systemd-maint
Status: CLOSED CURRENTRELEASE QA Contact: qe-baseos-daemons
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.2CC: msekleta, systemd-maint-list
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-04 06:00:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
systemd-journald start failure none

Description Denis Silakov 2016-09-23 09:02:16 UTC
Created attachment 1204044 [details]
systemd-journald start failure

Description of problem:

If journald is interrupted during journal rotation during boot, it can leave a corrupted journal preventing systemd-journald from further launching. More particular, we can see that in the running system journald fails to start because the current journal, /run/log/journal/.../system.journal is corrupt: it has size=0 in its header.
This seems to be due to the fact that there are a lot of log messages during the initial boot phase, and the initial journald instance decided that it needs to do log rotation, but was interrupted while doing so:

[   14.301981] systemd[1]: systemd-journald.service changed dead -> running
[   14.632710] systemd-journald[98]: Data hash table of /run/log/journal/78458568871841dfb74e3f29f463788c/system.journal has a fill level at 75.0 (15703 of 20935 items, 12058624 file size, 767 bytes per hash table item), suggesting rotation.
[   14.638730] systemd-journald[98]: /run/log/journal/78458568871841dfb74e3f29f463788c/system.journal: Journal header limits reached or header out-of-date, rotating.
[   14.642956] systemd-journald[98]: Rotating...
[   14.648489] systemd-journald[98]: Failed to create new runtime journal: Interrupted system call
[   14.651202] systemd-journald[98]: Vacuuming...
[   14.652769] systemd-journald[98]: Received SIGTERM from PID 1 (n/a).
[   14.654517] systemd-journald[98]: systemd-journald stopped as pid 98


I'm not certain what stopped it, I guess that was normal and would cause no problem if there was no rotation.
However no new instances of the journal could start after that (see attachment for log).


Version-Release number of selected component (if applicable):
systemd-219-19.el7

How reproducible:
Rarely. One should produce a huge number of messages during boot but even in this case journald fails occasionally.

Steps to Reproduce:
1. Setup a system to send a lot of messages to journal during boot
2. Reboot OS
3.

Actual results:
systemd-journald service is dead after boot

Expected results:
systemd-journald service is running after boot

Additional info:
There is an upstream fix (https://github.com/systemd/systemd/pull/695) for "journals that only contain a valid header, but nothing else". Not sure if this is our case...

Comment 2 Lukáš Nykrýn 2016-09-23 10:02:58 UTC
Before we start debugging would you will be willing to try our sneak preview for systemd for 7.3? It contains tons of journal-related fixes.

https://copr.fedorainfracloud.org/coprs/lnykryn/systemd-rhel-staging/
Just please be aware that this is unofficial, unsupported, test, ... build. (Although if you find some bug there feel free to contact me).

Comment 3 Denis Silakov 2016-09-23 10:04:41 UTC
Yes, makes sense. Though the problem occurs rarely, so it will take some time.

Comment 4 Michal Sekletar 2016-09-28 06:02:05 UTC
We have backported pull/695 already so I think that the issue is most likely fixed now.

Comment 5 Denis Silakov 2016-09-28 07:51:38 UTC
Thanks. I gave systemd-219-30.el7 a couple of tries and didn't get the failure, but I proceed with it - even with old systemd, it happened rarely.

Comment 6 Michal Sekletar 2016-10-04 06:00:03 UTC
Feel free to reopen in case you again hit the issue with systemd-219-30.el7 or newer.