Bug 1378779 - systemd-journald fails to start due to corrupted journal
Summary: systemd-journald fails to start due to corrupted journal
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.2
Hardware: x86_64
OS: Unspecified
unspecified
medium
Target Milestone: rc
: ---
Assignee: systemd-maint
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-09-23 09:02 UTC by Denis Silakov
Modified: 2016-10-04 06:00 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-04 06:00:03 UTC
Target Upstream Version:


Attachments (Terms of Use)
systemd-journald start failure (3.19 KB, text/plain)
2016-09-23 09:02 UTC, Denis Silakov
no flags Details

Description Denis Silakov 2016-09-23 09:02:16 UTC
Created attachment 1204044 [details]
systemd-journald start failure

Description of problem:

If journald is interrupted during journal rotation during boot, it can leave a corrupted journal preventing systemd-journald from further launching. More particular, we can see that in the running system journald fails to start because the current journal, /run/log/journal/.../system.journal is corrupt: it has size=0 in its header.
This seems to be due to the fact that there are a lot of log messages during the initial boot phase, and the initial journald instance decided that it needs to do log rotation, but was interrupted while doing so:

[   14.301981] systemd[1]: systemd-journald.service changed dead -> running
[   14.632710] systemd-journald[98]: Data hash table of /run/log/journal/78458568871841dfb74e3f29f463788c/system.journal has a fill level at 75.0 (15703 of 20935 items, 12058624 file size, 767 bytes per hash table item), suggesting rotation.
[   14.638730] systemd-journald[98]: /run/log/journal/78458568871841dfb74e3f29f463788c/system.journal: Journal header limits reached or header out-of-date, rotating.
[   14.642956] systemd-journald[98]: Rotating...
[   14.648489] systemd-journald[98]: Failed to create new runtime journal: Interrupted system call
[   14.651202] systemd-journald[98]: Vacuuming...
[   14.652769] systemd-journald[98]: Received SIGTERM from PID 1 (n/a).
[   14.654517] systemd-journald[98]: systemd-journald stopped as pid 98


I'm not certain what stopped it, I guess that was normal and would cause no problem if there was no rotation.
However no new instances of the journal could start after that (see attachment for log).


Version-Release number of selected component (if applicable):
systemd-219-19.el7

How reproducible:
Rarely. One should produce a huge number of messages during boot but even in this case journald fails occasionally.

Steps to Reproduce:
1. Setup a system to send a lot of messages to journal during boot
2. Reboot OS
3.

Actual results:
systemd-journald service is dead after boot

Expected results:
systemd-journald service is running after boot

Additional info:
There is an upstream fix (https://github.com/systemd/systemd/pull/695) for "journals that only contain a valid header, but nothing else". Not sure if this is our case...

Comment 2 Lukáš Nykrýn 2016-09-23 10:02:58 UTC
Before we start debugging would you will be willing to try our sneak preview for systemd for 7.3? It contains tons of journal-related fixes.

https://copr.fedorainfracloud.org/coprs/lnykryn/systemd-rhel-staging/
Just please be aware that this is unofficial, unsupported, test, ... build. (Although if you find some bug there feel free to contact me).

Comment 3 Denis Silakov 2016-09-23 10:04:41 UTC
Yes, makes sense. Though the problem occurs rarely, so it will take some time.

Comment 4 Michal Sekletar 2016-09-28 06:02:05 UTC
We have backported pull/695 already so I think that the issue is most likely fixed now.

Comment 5 Denis Silakov 2016-09-28 07:51:38 UTC
Thanks. I gave systemd-219-30.el7 a couple of tries and didn't get the failure, but I proceed with it - even with old systemd, it happened rarely.

Comment 6 Michal Sekletar 2016-10-04 06:00:03 UTC
Feel free to reopen in case you again hit the issue with systemd-219-30.el7 or newer.


Note You need to log in before you can comment on or make changes to this bug.