The rate limiter for the docker journald log driver should be per container, not global: a global rate limiter means one malicious user affects all users. We want rate limiting to be container specific. cc'ing the Kubernetes node team, who are also working on how logs on the node evolve as part of local IO management. We want a solution that primarily serves end users (they can get all their container logs), takes operators into account (the node needs to be rate limited and to properly account logging cost to each user), and allows logs to be taken off the node efficiently.
FWIW, journald can be configured to turn off rate limiting entirely. We have to deal with RHEL defaults, which use a tmpfs-based /run/log/journal for the default output; high log rates will let journald use up to 4 GB of files there. It is likely we'll want to use /var/log/journal instead, with the journal on a dedicated, performant disk. Additionally, we need to be aware that journald / systemd itself can become a bottleneck: there are reports of page faults or fsync calls in those daemons causing other subsystems to back up waiting. This can be mitigated in part by the underlying storage.
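To turn off journald's rate limiting and move the journal to persistent storage, a minimal journald.conf sketch might look like the following (the values are illustrative, not a recommended production default):

```
# /etc/systemd/journald.conf -- illustrative sketch
[Journal]
Storage=persistent      # use /var/log/journal instead of tmpfs-backed /run/log/journal
RateLimitInterval=0     # 0 disables journald rate limiting entirely
RateLimitBurst=0
SystemMaxUse=4G         # cap on-disk journal usage
```

After editing, journald must be restarted (systemctl restart systemd-journald) for the settings to take effect.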
cc'ing Seth Jennings
Fixing this likely involves changes to journald in RHEL 7.4 (?), plus work in OCP to take advantage of those fixes.
We also have to consider that journald splits messages at a 2 KB buffer boundary, versus docker's 16 KB. See https://bugzilla.redhat.com/show_bug.cgi?id=1442262
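The effect of the smaller buffer can be illustrated with a quick shell check: a single 16 KB docker payload crosses journald's 2 KB boundary 8 times, so a consumer would see 8 fragments to reassemble (the sizes are the ones discussed in this bug; the splitting itself is only simulated here with fold, not journald's actual code path):

```shell
# Simulate a 16 KB log line being split into 2 KB chunks.
printf 'x%.0s' $(seq 1 16384) | fold -b -w 2048 | grep -c x   # prints 8
```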
Filed RHEL 7.5 RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1450626
I also filed this RFE, https://bugzilla.redhat.com/show_bug.cgi?id=1450627, which talks about gracefully handling rate-limiting with tiers and the ability to stall stdout/stderr as an option.
(In reply to Peter Portante from comment #8)
> I also filed this RFE, https://bugzilla.redhat.com/show_bug.cgi?id=1450627, which talks about gracefully handling rate-limiting with tiers and the ability to stall stdout/stderr as an option.

Hi Peter, may I add this question to bug/RFE 1450627?

I set these two configuration parameters in /etc/systemd/journald.conf:

RateLimitInterval=0
RateLimitBurst=0

But lots of messages were still dropped (in my testing, about 93% are missing). Is there any way to guarantee that all messages are kept in journald, as they are in /var/log/messages?

Note: the version of systemd is systemd-219-39.el7.x86_64 on RHEL 7.4 beta. Thanks.
(In reply to Noriko Hosoi from comment #9)
> But still lots of messages were dropped (in my testing, about 93% are missing).

How are you determining that messages were dropped?
(In reply to Peter Portante from comment #10)
> How are you determining that messages were dropped?

Thanks, Peter. It seems my test needs some improvements...

I have this script to generate logs: https://github.com/nhosoi/scripts/blob/master/README-logging.md

The script generates 1000 tags with the logger command; one tag contains 1000 log messages. After running the script (and waiting long enough; e.g., some logs took about 67 hours to be processed and stored in Elasticsearch), I counted the logs in journald, /var/log/messages, and the common logging Elasticsearch. I found that although /var/log/messages contained 100% of the logs, the count from ES was much smaller. (Note: the common logging is configured with journald as an input.)

Unfortunately, since I did not configure journald.conf to increase the file size, when I ran journalctl (even with the --since option), no test logs were returned. I assumed logs were dropped by journald, but that assumption was wrong, I think: if /var/log/messages has 100% of the log messages, journald also had them all at one time, and they were then rotated and discarded?

Now I have another question: when journald is flooded and the common logging takes that long to process logs, is it guaranteed that the journald data will not be rotated/discarded until the journald.pos used by fluentd passes that point?

Apart from that, I'm going to rerun the test with large SystemMaxUse and RuntimeMaxUse to prove journald is not dropping the logs...
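For reference, the stress pattern described above can be sketched as a small shell loop (scaled down here to 3 tags x 4 messages, and echoed rather than executed, so the logger invocations are only printed; the real script uses 1000 x 1000):

```shell
# Scaled-down sketch of the log-generation pattern.
# Drop the leading "echo" to actually send the messages via logger.
for t in $(seq 1 3); do
  for m in $(seq 1 4); do
    echo logger -i -p local6.info -t "stress_tag_${t}" "short_message_${t}_${m}"
  done
done | wc -l    # 12 generated commands
```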
> Now, I have another question... When journald is flooded and the common logging takes that looooong time to process logs, is it guaranteed for the journald data not to be rotated/discarded unless the journald.pos used by fluentd passes the point?

No. It is possible that journald could trim log messages before fluentd has a chance to read them.
Thanks, Rich. I've restarted journald with SystemMaxUse and RuntimeMaxUse set to 4G, and MaxFileSec set to 1month, to see if it has any impact on the test results...
I reran the test with these config parameters in journald.conf:

RateLimitInterval=0
RateLimitBurst=0
SystemMaxUse=4G
RuntimeMaxUse=4G
MaxFileSec=1month

and the common logging did not drop the log messages. So it's verified that what I reported in #c9 was the result of the journald database being trimmed: so many log messages were issued that they exceeded the journald max use size before the common logging processed them. Sorry about the noise.

Now I have the opposite issue: ES stores duplicated logs. E.g., this one logging request:

logger -i -p local6.info -t stress_tag_M16_0 "short_message_M16_0_32"

is stored 83 times in Elasticsearch:

# oc exec $EPOD -- curl -s -k --cert $SECRETPATH/admin-cert --key $SECRETPATH/admin-key "https://localhost:9200/.operations.2017.05.16/_search?size=9999&q=message:short_message_M16_0_32" | python -mjson.tool | egrep message | wc -l
83

This is observed under heavy stress. Do we care about this symptom? If yes, shall I open another bug?
Duplicate records are a known issue. Fluentd sends records to Elasticsearch using a bulk index request (https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html), using a "create" without an "_id" field. One thing I have seen is that when Elasticsearch is under a huge load, it can take a very long time to respond to a request. The fluentd elasticsearch plugin will time out and retry the request, even though Elasticsearch will complete the original one. When the retry succeeds, there are duplicate records in Elasticsearch. We should be able to mitigate this by increasing the request_timeout parameter in the fluentd elasticsearch plugin.
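The mitigation described above would be a fluentd output setting along these lines (a sketch assuming fluent-plugin-elasticsearch's request_timeout parameter; the host, port, and the 600s value are illustrative, not the values used in this deployment):

```
<match **>
  @type elasticsearch
  host localhost
  port 9200
  # Raise the timeout so slow bulk requests are not retried (and thereby
  # duplicated) while Elasticsearch is still completing them.
  request_timeout 600s
</match>
```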
Ok. Thanks, Rich! Let me run some more tests and see what we can do...
Eric, or anyone else on the CC of this bz: What can we do for this bz for OCP 3.6 logging stack? I'm not aware of any changes we could make to fluentd etc. to fix or mitigate this situation. The only thing we can do is fall back to using docker --log-driver=json-file instead of journald on a case-by-case basis. Anyone? Please speak now or I will close this bz in favor of the journald bzs filed above.
Rich, the docker logging driver is not silently dropping messages. The docker journal driver sends all messages to journald. It is journald that is dropping messages, and it does not do so silently: it posts a message to report that it is happening.

For RHEL (7.x), the journald configuration is not adequate at all. It limits each service to 1,000 messages in a 30 second period [1], which means dockerd can only send 1,000 messages in a 30 second period before being rate-limited.

The best thing to do is bump the journald rate limit to something more reasonable, like 10,000 messages in a 1 second period, or whatever a user feels comfortable with.

Note that other journald defaults in RHEL need to be considered:

* It uses /run/log/journal by default, which can use up precious physical memory as logs grow big on that tmpfs volume
* If used in conjunction with rsyslog, the imjournal module needs to have its notion of rate-limiting updated to match the combined throughput of log messages flowing to the journal

[1] https://www.freedesktop.org/software/systemd/man/journald.conf.html#RateLimitIntervalSec=
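The arithmetic behind that comparison can be checked in shell: the RHEL default sustains roughly 33 messages per second, while the proposed setting sustains 10,000 (the numbers are the ones from the comment above):

```shell
# Sustained throughput implied by the two rate-limit settings:
echo "default  (1000 msgs / 30 s): $((1000 / 30)) msgs/sec"    # 33
echo "proposed (10000 msgs / 1 s): $((10000 / 1)) msgs/sec"    # 10000
```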
(In reply to Peter Portante from comment #18)
> Rich, the docker logging driver is not silently dropping messages. The docker journal driver is sending all messages to journald. It is journald that is dropping messages, and it does not do it silently, it posts a message to report that it is happening.
>
> For RHEL (7.x), the journald configuration is not adequate at all. It limits at 1,000 messages per service in a 30 second time period [1], which means dockerd can only send 1,000 messages in a 30 second period before they are rate-limited.
>
> The best thing to do is bump the journald rate-limiting to something more reasonable, like 10,000 in a 1 second period, or whatever a user feels more comfortable with.
>
> Note that other defaults for journald in RHEL need to be considered:
>
> * It uses /run/log/journal by default which can use up precious physical memory and logs grow big on that tmpfs volume
> * If used in conjunction with rsyslog, the imjournal module needs to have its notion of rate-limiting update to match the combined throughput of log messages flowing to journal.
>
> [1] https://www.freedesktop.org/software/systemd/man/journald.conf.html#RateLimitIntervalSec=

Why don't we put a configuration like this in place at install / update time (with 3.6) and define a default for the OCP product that we expect? We could base it on how many system resources the node has, or how much reserve the customer defines for system/kube resources.
Ok, understood, I just don't think there is anything we can do in OpenShift logging. Perhaps we need to move this bug to OpenShift core, or OpenShift ansible, or some other component where this problem can be addressed? If we can't solve this bug in logging, I would like to move it off of the logging bug list.
Does this PR address this issue? https://github.com/openshift/openshift-ansible/pull/3753
(In reply to Jeff Cantrill from comment #22)
> Does this PR address this issue? https://github.com/openshift/openshift-ansible/pull/3753

Yes, we can close this issue as there is nothing for aggregated logging to do beyond this, other than possibly documenting the changes in PR 3753.
Regarding the reference to the "request_timeout" in comment 15, see BZ https://bugzilla.redhat.com/show_bug.cgi?id=1497836.
Not sure if there is anything that can be done in the container driver - if not, then pass along to the systemd journald team.
It looks like https://github.com/openshift/openshift-ansible/pull/3753 addressed this as much as it can be addressed without big changes to logging. As there is no bz tracking that upstream issue/PR, I'm using this one.
https://github.com/openshift/openshift-ansible/pull/5796 to get this during upgrades
Fix should be in openshift-ansible-3.7.0-0.196.0
Verified and passed on openshift-ansible-3.7.0-0.196.0. The default journal logs are in /var/log/journal after a fresh OCP installation and an OCP upgrade. journald.conf had been changed as follows:

[Journal]
Storage=persistent
Compress=True
#Seal=yes
#SplitMode=uid
SyncIntervalSec=1s
RateLimitInterval=1s
RateLimitBurst=10000
SystemMaxUse=8G
SystemMaxFileSize=10M
#RuntimeKeepFree=
#RuntimeMaxFileSize=
MaxRetentionSec=1month
ForwardToSyslog=False
#ForwardToKMsg=no
#ForwardToConsole=no
ForwardToWall=False
#TTYPath=/dev/console
#MaxLevelStore=debug
#MaxLevelSyslog=debug
#MaxLevelKMsg=notice
#MaxLevelConsole=info
#MaxLevelWall=emerg
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188