Description of problem:

The OOTB containerLogMaxSize is only 10MB. This is causing fluentd and rsyslog to miss pod messages at high message rates (see bug 1741663 for an example). The problem is alleviated by increasing containerLogMaxSize in a custom KubeletConfig. On modern systems we can afford a default pod log size greater than 10MB; I recommend increasing the default to 50MB.

Version-Release number of selected component (if applicable):

4.2

How reproducible:

Always

Steps to Reproduce:
1. Create a pod that logs at 1000 messages/second
2. Use the OpenShift logging solution to index the pod logs
3. Observe that messages are missing from the index in consecutive chunks, indicating the logs are wrapping at a rate fluentd cannot keep up with

The fluentd side of this is being pursued separately, but this seems like a reasonable change to our OOTB defaults.

Actual results:

Consecutive chunks of messages are missing from the ES indices.

Expected results:

No messages are lost by OpenShift logging.

Additional info:
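For anyone hitting this before the default changes, a minimal sketch of the custom KubeletConfig workaround (the resource name and the custom-kubelet label are illustrative; they assume a machine config pool labeled to match):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-container-log-max-size
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-container-log-max-size
  kubeletConfig:
    containerLogMaxSize: 50Mi

The MCO renders this into the kubelet configuration for the selected pool and rolls it out node by node.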
I suspect that increasing the size to 50Mi only delays the issue. It is not a fix. We need to figure out the root cause of the dropped messages on the rotation boundary.
Deferring to 4.3. There is more at play here than just bumping the tunable to cover up symptoms. Need to work with the logging team in 4.3 to figure out the right solution. Since any 4.1/4.2 customer that encounters this can change the tunable on their own, we have a workaround. Changing this tunable potentially increases disk usage due to logs by 5x. For 100 pods on a node, that is potentially 5Gi vs 1Gi now.
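Spelling out the disk math (my arithmetic, assuming one active log file per pod as the figures above do): 100 pods x 10Mi ≈ 1Gi now vs 100 pods x 50Mi ≈ 5Gi after the bump. With --container-log-max-files=5 rotated files retained per container, the worst case is 5x those numbers again.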
(In reply to Seth Jennings from comment #2)
> Deferring to 4.3. There is more at play here than just bumping the tunable
> to cover up symptoms. Need to work with the logging team in 4.3 to figure
> out the right solution.

The 10MB limit - is this the same limit that was used in OCP 3.x? If not, why was it changed? If it is the same, then I'm not sure why we are seeing problems with EFK logging in 4.x that we did not see in 3.x, unless it also has something to do with cri-o logging and the cri-o log file format, which is different from docker json-file.

What is the number 10MB based on? Is it a number that was designed to work with log scrapers such as fluentd, rsyslog, and loki promtail? Or is it designed to optimize the disk space used by log files?

> Since any 4.1/4.2 customer that encounters this can change the tunable on
> their own, we have a workaround.
>
> Changing this tunable potentially increases disk usage due to logs by 5x.
> For 100 pods on a node, that is potentially 5Gi vs 1Gi now.
CRI Log Rotation backstory

PR in 1.10:
https://github.com/kubernetes/kubernetes/pull/59898

Backstory:
https://github.com/kubernetes/kubernetes/issues/58823
https://github.com/kubernetes/enhancements/issues/552
https://docs.google.com/document/d/1oQe8dFiLln7cGyrRdholMsgogliOtpAzq6-K3068Ncg/edit#

CRIContainerLogRotation feature gate:
alpha (disabled by default) in 1.10
beta (enabled by default) in 1.11

kubelet flags:
--container-log-max-size (default 10Mi)
--container-log-max-files (default 5)

The default size was probably cribbed from the json-file example in the docker documentation:
https://docs.docker.com/config/containers/logging/json-file/

We started setting log-opts max-size=50m for docker in 3.11:
https://github.com/openshift/openshift-ansible/commit/5e57addcb1bc88d36015e6f06c209985d1e0dbc7

That might be justification for changing our default to 50Mi in 4.x.
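For comparison, the 3.11 docker setting would look roughly like this in daemon.json (a sketch assuming the daemon.json route; openshift-ansible may have passed the equivalent --log-opt flags instead, and only max-size=50m is confirmed by the commit above):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m"
  }
}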
PR: https://github.com/openshift/machine-config-operator/pull/1091
Can we target this bug for 4.2.0?
Done. Retargeted for 4.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922