Bug 1599658

Summary: JournalD write blocks system with SystemMaxFileSize=20G
Product: Red Hat Enterprise Linux 7
Reporter: Martin Häcker <spamfaenger>
Component: systemd
Assignee: systemd-maint
Status: CLOSED WONTFIX
QA Contact: qe-baseos-daemons
Severity: high
Priority: unspecified
Version: 7.5
CC: rbu, redhat-bugzilla, systemd-maint-list
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Type: Bug
Last Closed: 2021-02-15 07:40:21 UTC
Attachments: dstat log

Description Martin Häcker 2018-07-10 10:33:40 UTC
Created attachment 1457763 [details]
dstat log

Description of problem:
We wanted longer log retention on a server that logs quite a bit. This led to a bug where `systemctl status` was very slow, because systemd always opens all retained log files. See https://github.com/systemd/systemd/issues/7963

To work around this we increased SystemMaxFileSize to 20G, but this led to another problem: some time after the setting is introduced, journald starts blocking the system by consuming all available I/O bandwidth.
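For reference, the number and total size of retained journal files can be inspected with standard journalctl/ls commands; this is a quick way to confirm how many files `systemctl status` has to open (the path below assumes a typical persistent-journal layout):

-- snip --
# journalctl --disk-usage
# ls -lh /var/log/journal/$(cat /etc/machine-id)/
-- snap --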

Version-Release number of selected component (if applicable):
root@worker1 /etc/systemd # journalctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
root@worker1 /etc/systemd # cat /etc/*release*
CentOS Linux release 7.5.1804 (Core) 
Derived from Red Hat Enterprise Linux 7.5 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.5.1804 (Core) 
CentOS Linux release 7.5.1804 (Core) 
cpe:/o:centos:centos:7


How reproducible:


Steps to Reproduce:
1. Introduce journald logging config
  [Journal]
  Storage=persistent
  SystemMaxFileSize=20G
  SystemMaxUse=200G
  RateLimitInterval=0

2. Feed the system some log data (one way to do both steps is sketched below)
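A minimal sketch of how the reproduction might be set up; the drop-in path and the `logger` loop are illustrative assumptions, not the exact setup used here (the settings can equally go straight into /etc/systemd/journald.conf):

-- snip --
# mkdir -p /etc/systemd/journald.conf.d
# cat > /etc/systemd/journald.conf.d/retention.conf <<'EOF'
[Journal]
Storage=persistent
SystemMaxFileSize=20G
SystemMaxUse=200G
RateLimitInterval=0
EOF
# systemctl restart systemd-journald.service
# while true; do logger "synthetic log line $(date +%s%N)"; done
-- snap --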

Actual results:
The system falls over after some hours (depending on the rate of log intake), and the way it does so is peculiar. At first the write load increases slowly but steadily to some hundreds of kB/s, then spikes of several MB/s become visible, and finally a breaking point is reached where the read and write load suddenly jump to the full read/write bandwidth of the disk and stay there. After disabling the setting in question the write load returns to roughly normal, but the read load stays very high.

I've monitored the read/write load using `dstat -d --noheaders`. I've included dstat output from different periods to illustrate the point.
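To attribute the I/O to journald specifically (rather than the whole disk), something like the following could complement dstat; it assumes root access and a single systemd-journald process, and is only a sketch:

-- snip --
# pid=$(pidof systemd-journald)
# while sleep 1; do grep -E '^(read|write)_bytes' /proc/$pid/io; done
-- snap --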

After introducing those settings everything looks about normal, i.e. what the base load should be:

-- snip --
 # dstat -d  --noheaders
-dsk/total-
 read  writ
 565k 1917k
   0  7958k
   0  9836k
   0  5704k
   0  2592k
   0    60k
   0    50k
   0  2930k
   0   176k
   0   882k
   0    52k
   0    72k
   0   546k
   0   212k
   0  1754k
   0    42k
   0    77k
   0   905k
   0   204k
   0  4022k
   0  4512k
   0   716k
   0  1778k
   0    50k
   0    52k
   0    86k
   0  9604k
   0  5388k
   0   386k
   0    46k
   0    60k
   0    42k
   0   876k
   0  3332k
   0  3564k
-- snap --

At some point the write load suddenly explodes:
-- snip --
   0    60k
   0  1432k
   0    68k
   0    59k
   0  4765k
   0  4004k
   0  2072k
   0   186k
   0    50k
   0    84k
   0  2618k
   0    11M
   0  2452k
   0  4444k
   0   362k
   0  1520k
  44k 1126k
  48k 6556k
  12k 4136k
1832k 6380k
 100k 1568k
   0  2588k
  88k   33M
   0    56M
  32k  161M
4096B  131M
  20k  134M
 184k  128M
 140k  181M
 332k  139M
8192B  184M
  56k  131M
  52k  127M
8192B  166M
  28k  132M
 304k  172M
-- snap --

Then the read load increases as well:
-- snip --
  39M  656k
  17M  558k
  12M  108M
6252k  147M
3772k  123M
3080k  207M
  14M  189M
3264k  188M
2236k  223M
8808k  202M
5568k  200M
7172k  183M
  10M  205M
4952k  208M
9076k  192M
  17M  161M
  30M  144M
  24M  140M
  24M  128M
  21M  140M
  44M  169M
  46M  193M
-- snap --

Then after disabling the setting the write load decreases back to something approaching normal:
-- snip --
  18M  197M
  26M  215M
  25M  181M
  43M  182M
  26M  776k
   0  1040k
  86M  289k
 146M  210k
 172M  217k
 166M   17k
 138M  226k
 198M    0 
 137M  274k
 163M  233k
 144M   26k
 149M  209k
 170M  257k
 169M   10k
 115M 1409k
  55M 1168k
 110M  583k
 108M  635k
 143M  673k
 126M  361k
-- snap --

However the read load remains high:
-- snip --
8272k 1753k
  32M  230k
  21M  966k
  54M  324k
  48M 5120B
  46M   93k
  29M  678k
  42M   18k
  38M  601k
  44M  146k
  77M 1024B
  34M  418k
  16M 1341k
  23M  880k
  20M 1224k
  20M 1200k
  61M   69k
  80M    0 
  55M  559k
  28M  324k
  41M   44k
  31M  744k
  18M 1236k
  46M 1244k
  39M  456k
  53M 9216B
  81M   17k
  75M   41k
  57M  191k
  32M  985k
  36M 1030k
  23M   93k
  13M 2036k
  47M   67k
-- snap --

Expected results:
journald should be able to cope with log file sizes of multiple gigabytes without trashing the system.

Additional info:
I've attached the full log of the dstat run to help with analysis.

Comment 2 Martin Häcker 2018-07-10 10:51:06 UTC
To get the read load back under control, I needed to intermix various invocations of `journalctl --vacuum-size=2G` and `systemctl restart systemd-journald.service`.

During this time I also experienced this issue, which might be related:

-- snip --
# systemctl restart systemd-journald.service 

** (pkttyagent:2980): WARNING **: Unable to register authentication agent: Timeout was reached
Error registering authentication agent: Timeout was reached (g-io-error-quark, 24)
Failed to restart systemd-journald.service: Connection timed out
See system logs and 'systemctl status systemd-journald.service' for details.
-- snap --

Killing journald manually followed by more aggressive vacuuming solved that though.
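For completeness, the recovery amounted to roughly the following sequence (the vacuum sizes and the exact order varied; this is a reconstruction, not a verbatim session):

-- snip --
# journalctl --vacuum-size=2G
# kill -9 $(pidof systemd-journald)
# journalctl --vacuum-size=500M
# systemctl restart systemd-journald.service
-- snap --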

Comment 3 Martin Häcker 2018-07-16 07:50:23 UTC
I would like to add that during all of this the amount of logging hasn't changed, i.e. it is on the order of 100-500 kB/s.
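One rough way to estimate that intake rate from the journal itself (an approximation based on output volume over the last minute, not the measurement used above):

-- snip --
# journalctl --since "1 minute ago" -o cat | wc -c
-- snap --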

Comment 6 RHEL Program Management 2021-02-15 07:40:21 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.