Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2090019

Summary: CloudWatch forwarding rejecting large log events, fills tmpfs
Product: OpenShift Container Platform
Reporter: Karthik Perumal <kramraja>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED DEFERRED
QA Contact: Anping Li <anli>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.10
CC: bmontgom, travi
Target Milestone: ---
Keywords: ServiceDeliveryBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-26 12:25:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Karthik Perumal 2022-05-25 00:54:58 UTC
Description of problem:

On several managed clusters configured to perform CloudWatch forwarding, the following condition has been observed in collector containers:

Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError) 
The rejected logs are written to tmpfs on the node running the collector pod:

2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log 
Over a sustained period of time, these logs eventually fill the available tmpfs space on the nodes, leading to memory exhaustion. If this occurs on control plane nodes, it eventually destabilizes the cluster.
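For reference, the 262144-byte threshold in the warning above is CloudWatch Logs' 256 KiB per-event limit. A minimal sketch of the size check (names are illustrative, not the plugin's actual code):

```python
# 256 KiB: the CloudWatch Logs per-event size limit, matching the
# "exceeds limit of 262144" figure in the collector warning above.
MAX_EVENT_SIZE = 262_144  # bytes


def is_too_large(event_bytes: int) -> bool:
    """Return True if CloudWatch would reject an event of this size."""
    return event_bytes > MAX_EVENT_SIZE


# The event from the warning above is rejected:
print(is_too_large(301_486))  # True: 301486 > 262144
# An event at exactly the limit still fits:
print(is_too_large(262_144))  # False
```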

For the two clusters we've observed this on, it was cluster audit logs that triggered the 'too large' warning.

Version-Release number of selected component (if applicable):
We observed this issue on clusters running CLO version '5.4.1-24' or '5.4.0-138' that forward logs to CloudWatch.
We have other clusters that forward to CloudWatch on an older CLO version, '5.2.1', that are NOT affected by this bug.


Actual results:
tmpfs fills up, leading to master node instability.


Expected results:
tmpfs does not fill up when large chunks fail to forward; alternatively, CLO should split events into smaller chunks where appropriate.


Additional info:

Comment 4 Jeff Cantrill 2022-05-25 13:45:35 UTC
Does the following tuning mitigate the issue:

  forwarder:
    fluentd:
      buffer:
        chunkLimitSize: 72m

My initial reaction is that tuning the chunk size to a value below the threshold would resolve the issue.
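For anyone trying this, the buffer stanza above sits under spec in the ClusterLogging custom resource; a sketch, assuming the usual instance name and namespace for the logging deployment:

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance                 # conventional name for the CLO instance
  namespace: openshift-logging   # assumed default namespace
spec:
  forwarder:
    fluentd:
      buffer:
        chunkLimitSize: 72m      # value suggested in this comment
```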