Description of problem:

The OOTB containerLogMaxSize is only 10MB. This is causing fluentd and rsyslog to miss pod messages at high message rates (see bug 1741663 for an example). The problem is alleviated by increasing containerLogMaxSize in a custom KubeletConfig. On modern systems we can afford a default pod log size greater than 10MB; I recommend increasing the default to 50MB.

Version-Release number of selected component (if applicable):

4.2

How reproducible:

Always

Steps to Reproduce:
1. Create a pod that logs at 1000 messages/second
2. Use the OpenShift logging solution to index the pod logs
3. Observe that messages are missing from the index in consecutive chunks, indicating the logs are wrapping at a rate fluentd cannot keep up with

The fluentd side of this is being pursued separately, but this seems like a reasonable change to our OOTB defaults.

Actual results:

Consecutive chunks of messages are missing from the ES indices.

Expected results:

No messages are lost by OpenShift logging.

Additional info:
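For anyone hitting this before the default changes, a minimal sketch of the custom KubeletConfig workaround (the resource name and the custom-kubelet label are illustrative; they assume a machine config pool labeled to match):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-container-log-max-size
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-container-log-max-size
  kubeletConfig:
    containerLogMaxSize: 50Mi

The MCO renders this into the kubelet configuration for the selected pool and rolls it out node by node.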
I suspect that increasing the size to 50Mi only delays the issue. It is not a fix. We need to figure out the root cause of the dropped messages on the rotation boundary.
Deferring to 4.3. There is more at play here than just bumping the tunable to cover up symptoms. Need to work with the logging team in 4.3 to figure out the right solution. Since any 4.1/4.2 customer that encounters this can change the tunable on their own, we have a workaround. Changing this tunable potentially increases disk usage due to logs by 5x. For 100 pods on a node, that is potentially 5Gi vs 1Gi now.
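Spelling out the disk math (my arithmetic, assuming one active log file per pod as the figures above do): 100 pods x 10Mi ≈ 1Gi now vs 100 pods x 50Mi ≈ 5Gi after the bump. With --container-log-max-files=5 rotated files retained per container, the worst case is 5x those numbers again.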
(In reply to Seth Jennings from comment #2)
> Deferring to 4.3. There is more at play here than just bumping the tunable
> to cover up symptoms. Need to work with the logging team in 4.3 to figure
> out the right solution.

The 10MB limit - is this the same limit that was used in OCP 3.x? If not, why was it changed? If it is the same, then I'm not sure why we are seeing problems with EFK logging in 4.x that we did not see in 3.x, unless it also has something to do with cri-o logging and the cri-o log file format, which is different from docker json-file.

What is the number 10MB based on? Is it a number that was designed to work with log scrapers such as fluentd, rsyslog, and loki promtail? Or is it designed to optimize the disk space used by log files?

> Since any 4.1/4.2 customer that encounters this can change the tunable on
> their own, we have a workaround.
>
> Changing this tunable potentially increases disk usage due to logs by 5x.
> For 100 pods on a node, that is potentially 5Gi vs 1Gi now.
CRI Log Rotation backstory

PR in 1.10:
https://github.com/kubernetes/kubernetes/pull/59898

Backstory:
https://github.com/kubernetes/kubernetes/issues/58823
https://github.com/kubernetes/enhancements/issues/552
https://docs.google.com/document/d/1oQe8dFiLln7cGyrRdholMsgogliOtpAzq6-K3068Ncg/edit#

CRIContainerLogRotation feature gate:
alpha (disabled by default) in 1.10
beta (enabled by default) in 1.11

kubelet flags:
--container-log-max-size (default 10Mi)
--container-log-max-files (default 5)

The default size was probably cribbed from the json-file example in the docker documentation:
https://docs.docker.com/config/containers/logging/json-file/

We started setting log-opts max-size=50m for docker in 3.11:
https://github.com/openshift/openshift-ansible/commit/5e57addcb1bc88d36015e6f06c209985d1e0dbc7

That might be justification for changing our default to 50Mi in 4.x.
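For comparison, the 3.11 docker setting would look roughly like this in daemon.json (a sketch assuming the daemon.json route; openshift-ansible may have passed the equivalent --log-opt flags instead, and only max-size=50m is confirmed by the commit above):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m"
  }
}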
PR: https://github.com/openshift/machine-config-operator/pull/1091
Can we target this bug for 4.2.0?
Done. Retargeted for 4.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922