The rate limiter for the docker journald log driver should be per container, not global: a global rate limiter means one malicious user affects all users. We want rate limiting to be container specific. cc'ing the Kubernetes node team, who are also working on how logs on the node evolve as part of local IO management. We want a solution that primarily serves end users (they can get all their container logs), takes operators into account (the node needs to be rate limited and to properly account logging cost to each user), and allows logs to be taken off the node efficiently.
FWIW, journald can be configured to turn off rate limiting entirely. We have to deal with RHEL defaults, which use a tmpfs-based /run/log/journal for the default output; high log rates will let journald use up to 4 GB of files there. It is likely we'll want to use /var/log/journal instead, with the journal on a dedicated, performant disk. Additionally, we need to be aware that journald / systemd itself can become a bottleneck: there are reports of page faults or fsync calls in those daemons causing other subsystems to back up waiting. This can be mitigated in part by the underlying storage.
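To turn off journald's rate limiting and move the journal to persistent storage, a minimal journald.conf sketch might look like the following (the values are illustrative, not a recommended production default):

```
# /etc/systemd/journald.conf -- illustrative sketch
[Journal]
Storage=persistent      # use /var/log/journal instead of tmpfs-backed /run/log/journal
RateLimitInterval=0     # 0 disables journald rate limiting entirely
RateLimitBurst=0
SystemMaxUse=4G         # cap on-disk journal usage
```

After editing, journald must be restarted (systemctl restart systemd-journald) for the settings to take effect.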
cc'ing Seth Jennings
Fixing this likely involves changes to journald in RHEL 7.4 (?), plus work in OCP to take advantage of those fixes.
We also have to consider that journald splits messages at a 2 KB buffer boundary, versus docker's 16 KB. See https://bugzilla.redhat.com/show_bug.cgi?id=1442262
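The effect of the smaller buffer can be illustrated with a quick shell check: a single 16 KB docker payload crosses journald's 2 KB boundary 8 times, so a consumer would see 8 fragments to reassemble (the sizes are the ones discussed in this bug; the splitting itself is only simulated here with fold, not journald's actual code path):

```shell
# Simulate a 16 KB log line being split into 2 KB chunks.
printf 'x%.0s' $(seq 1 16384) | fold -b -w 2048 | grep -c x   # prints 8
```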
Filed RHEL 7.5 RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1450626
I also filed this RFE, https://bugzilla.redhat.com/show_bug.cgi?id=1450627, which talks about gracefully handling rate-limiting with tiers and the ability to stall stdout/stderr as an option.
(In reply to Peter Portante from comment #8)
> I also filed this RFE, https://bugzilla.redhat.com/show_bug.cgi?id=1450627, which talks about gracefully handling rate-limiting with tiers and the ability to stall stdout/stderr as an option.

Hi Peter, may I add this question to bug/RFE 1450627?

I set these two configuration parameters in /etc/systemd/journald.conf:

RateLimitInterval=0
RateLimitBurst=0

But lots of messages were still dropped (in my testing, about 93% are missing). Is there any way to guarantee that all messages are kept in journald, as they are in /var/log/messages?

Note: the version of systemd is systemd-219-39.el7.x86_64 on RHEL 7.4 beta. Thanks.
(In reply to Noriko Hosoi from comment #9)
> But still lots of messages were dropped (in my testing, about 93% are missing).

How are you determining that messages were dropped?
(In reply to Peter Portante from comment #10)
> How are you determining that messages were dropped?

Thanks, Peter. It seems my test needs some improvements...

I have this script to generate logs: https://github.com/nhosoi/scripts/blob/master/README-logging.md

The script generates 1000 tags with the logger command; one tag contains 1000 log messages. After running the script (and waiting long enough; e.g., some logs took about 67 hours to be processed and stored in Elasticsearch), I counted the logs in journald, /var/log/messages, and the common logging Elasticsearch. I found that although /var/log/messages contained 100% of the logs, the count from ES was much smaller. (Note: the common logging is configured with journald as an input.)

Unfortunately, since I did not configure journald.conf to increase the file size, when I ran journalctl (even with the --since option), no test logs were returned. I assumed logs were dropped by journald, but that assumption was wrong, I think: if /var/log/messages has 100% of the log messages, journald also had them all at one time, and they were then rotated and discarded?

Now I have another question: when journald is flooded and the common logging takes that long to process logs, is it guaranteed that the journald data will not be rotated/discarded until the journald.pos used by fluentd passes that point?

Apart from that, I'm going to rerun the test with large SystemMaxUse and RuntimeMaxUse to prove journald is not dropping the logs...
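For reference, the stress pattern described above can be sketched as a small shell loop (scaled down here to 3 tags x 4 messages, and echoed rather than executed, so the logger invocations are only printed; the real script uses 1000 x 1000):

```shell
# Scaled-down sketch of the log-generation pattern.
# Drop the leading "echo" to actually send the messages via logger.
for t in $(seq 1 3); do
  for m in $(seq 1 4); do
    echo logger -i -p local6.info -t "stress_tag_${t}" "short_message_${t}_${m}"
  done
done | wc -l    # 12 generated commands
```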
> Now, I have another question... When journald is flooded and the common logging takes that looooong time to process logs, is it guaranteed for the journald data not to be rotated/discarded unless the journald.pos used by fluentd passes the point?

No. It is possible that journald could trim log messages before fluentd has a chance to read them.
Thanks, Rich. I've restarted journald with SystemMaxUse and RuntimeMaxUse set to 4G, and MaxFileSec set to 1month, to see if it has any impact on the test results...
I reran the test with these config parameters in journald.conf:

RateLimitInterval=0
RateLimitBurst=0
SystemMaxUse=4G
RuntimeMaxUse=4G
MaxFileSec=1month

and the common logging did not drop the log messages. So it's verified that what I reported in #c9 was the result of the journald database being trimmed: so many log messages were issued that they exceeded the journald max use size before the common logging processed them. Sorry about the noise.

Now I have the opposite issue: ES stores duplicated logs. E.g., this one logging request:

logger -i -p local6.info -t stress_tag_M16_0 "short_message_M16_0_32"

is stored 83 times in Elasticsearch:

# oc exec $EPOD -- curl -s -k --cert $SECRETPATH/admin-cert --key $SECRETPATH/admin-key "https://localhost:9200/.operations.2017.05.16/_search?size=9999&q=message:short_message_M16_0_32" | python -mjson.tool | egrep message | wc -l
83

This is observed under heavy stress. Do we care about this symptom? If yes, shall I open another bug?
Duplicate records are a known issue. Fluentd sends records to Elasticsearch using a bulk index request (https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html), using a "create" without an "_id" field. One thing I have seen is that when Elasticsearch is under a huge load, it can take a very long time to respond to a request. The fluentd elasticsearch plugin will time out and retry the request, even though Elasticsearch will complete the original one. When the retry succeeds, there are duplicate records in Elasticsearch. We should be able to mitigate this by increasing the request_timeout parameter in the fluentd elasticsearch plugin.
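The mitigation described above would be a fluentd output setting along these lines (a sketch assuming fluent-plugin-elasticsearch's request_timeout parameter; the host, port, and the 600s value are illustrative, not the values used in this deployment):

```
<match **>
  @type elasticsearch
  host localhost
  port 9200
  # Raise the timeout so slow bulk requests are not retried (and thereby
  # duplicated) while Elasticsearch is still completing them.
  request_timeout 600s
</match>
```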
Ok. Thanks, Rich! Let me run some more tests and see what we can do...
Eric, or anyone else on the CC of this bz: What can we do for this bz for OCP 3.6 logging stack? I'm not aware of any changes we could make to fluentd etc. to fix or mitigate this situation. The only thing we can do is fall back to using docker --log-driver=json-file instead of journald on a case-by-case basis. Anyone? Please speak now or I will close this bz in favor of the journald bzs filed above.
Rich, the docker logging driver is not silently dropping messages. The docker journal driver sends all messages to journald. It is journald that is dropping messages, and it does not do so silently: it posts a message to report that it is happening.

For RHEL (7.x), the journald configuration is not adequate at all. It limits each service to 1,000 messages in a 30 second period [1], which means dockerd can only send 1,000 messages in a 30 second period before being rate-limited.

The best thing to do is bump the journald rate limit to something more reasonable, like 10,000 messages in a 1 second period, or whatever a user feels comfortable with.

Note that other journald defaults in RHEL need to be considered:

* It uses /run/log/journal by default, which can use up precious physical memory as logs grow big on that tmpfs volume
* If used in conjunction with rsyslog, the imjournal module needs to have its notion of rate-limiting updated to match the combined throughput of log messages flowing to the journal

[1] https://www.freedesktop.org/software/systemd/man/journald.conf.html#RateLimitIntervalSec=
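The arithmetic behind that comparison can be checked in shell: the RHEL default sustains roughly 33 messages per second, while the proposed setting sustains 10,000 (the numbers are the ones from the comment above):

```shell
# Sustained throughput implied by the two rate-limit settings:
echo "default  (1000 msgs / 30 s): $((1000 / 30)) msgs/sec"    # 33
echo "proposed (10000 msgs / 1 s): $((10000 / 1)) msgs/sec"    # 10000
```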
(In reply to Peter Portante from comment #18)
> Rich, the docker logging driver is not silently dropping messages. The docker journal driver is sending all messages to journald. It is journald that is dropping messages, and it does not do it silently, it posts a message to report that it is happening.
>
> For RHEL (7.x), the journald configuration is not adequate at all. It limits at 1,000 messages per service in a 30 second time period [1], which means dockerd can only send 1,000 messages in a 30 second period before they are rate-limited.
>
> The best thing to do is bump the journald rate-limiting to something more reasonable, like 10,000 in a 1 second period, or whatever a user feels more comfortable with.
>
> Note that other defaults for journald in RHEL need to be considered:
>
> * It uses /run/log/journal by default which can use up precious physical memory and logs grow big on that tmpfs volume
> * If used in conjunction with rsyslog, the imjournal module needs to have its notion of rate-limiting update to match the combined throughput of log messages flowing to journal.
>
> [1] https://www.freedesktop.org/software/systemd/man/journald.conf.html#RateLimitIntervalSec=

Why don't we put a configuration like this in place at install / update time (with 3.6) and define a default for the OCP product that we expect? We could base it on how many system resources the node has, or how much reserve the customer defines for system/kube resources.
Ok, understood, I just don't think there is anything we can do in OpenShift logging. Perhaps we need to move this bug to OpenShift core, or OpenShift ansible, or some other component where this problem can be addressed? If we can't solve this bug in logging, I would like to move it off of the logging bug list.
Does this PR address this issue? https://github.com/openshift/openshift-ansible/pull/3753
(In reply to Jeff Cantrill from comment #22)
> Does this PR address this issue? https://github.com/openshift/openshift-ansible/pull/3753

Yes, we can close this issue as there is nothing for aggregated logging to do beyond this, other than possibly documenting the changes in PR 3753.
Regarding the reference to the "request_timeout" in comment 15, see BZ https://bugzilla.redhat.com/show_bug.cgi?id=1497836.
Not sure if there is anything that can be done in the container driver - if not, then pass along to the systemd journald team.
It looks like https://github.com/openshift/openshift-ansible/pull/3753 addressed this as much as it can be addressed without big changes to logging. As there is no bz tracking that upstream issue/PR, I'm using this one.
https://github.com/openshift/openshift-ansible/pull/5796 to get this during upgrades
Fix should be in openshift-ansible-3.7.0-0.196.0
Verified and passed on openshift-ansible-3.7.0-0.196.0. The default journal logs are in /var/log/journal after a fresh OCP installation and an OCP upgrade. journald.conf had been changed as follows:

[Journal]
Storage=persistent
Compress=True
#Seal=yes
#SplitMode=uid
SyncIntervalSec=1s
RateLimitInterval=1s
RateLimitBurst=10000
SystemMaxUse=8G
SystemMaxFileSize=10M
#RuntimeKeepFree=
#RuntimeMaxFileSize=
MaxRetentionSec=1month
ForwardToSyslog=False
#ForwardToKMsg=no
#ForwardToConsole=no
ForwardToWall=False
#TTYPath=/dev/console
#MaxLevelStore=debug
#MaxLevelSyslog=debug
#MaxLevelKMsg=notice
#MaxLevelConsole=info
#MaxLevelWall=emerg
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188