Description of problem:

On several managed clusters configured to perform CloudWatch forwarding, the following condition has been observed in the collector containers:

Log event in xxxxxx is discarded because it is too large: 301486 bytes exceeds limit of 262144 (Fluent::Plugin::CloudwatchLogsOutput::TooLargeEventError)

The rejected logs are written to tmpfs on the node running the collector pod:

2022-05-16 22:46:15 +0000 [warn]: bad chunk is moved to /tmp/fluent/backup/worker0/object_3fe9caf3da38/5df28c737506490be5e3e7426bc2648f.log

Over a sustained period, these backup files fill the available tmpfs space on the node, leading to memory exhaustion. When this happens on control plane nodes, it eventually destabilizes the cluster. On both clusters where we have observed this, it was the cluster audit logs that triggered the 'too large' error.

Version-Release number of selected component (if applicable):

We have observed this issue on clusters running CLO version 5.4.1-24 or 5.4.0-138 that forward logs to CloudWatch. Other clusters that forward to CloudWatch on an older CLO version, 5.2.1, are NOT affected by this bug.

Actual results:

tmpfs fills up, leading to master node instability.

Expected results:

tmpfs does not fill up when oversized chunks cannot be forwarded successfully; alternatively, CLO should be able to send smaller chunks if appropriate.

Additional info:
Does the following tuning mitigate the issue?

forwarder:
  fluentd:
    buffer:
      chunkLimitSize: 72m

My initial reaction is that tuning the chunk size to a value below the threshold would resolve the issue.
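For reference, a minimal sketch of how that tuning would sit in a full ClusterLogging CR, assuming the default instance name and the openshift-logging namespace; the size values below are illustrative placeholders, not recommendations:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  forwarder:
    fluentd:
      buffer:
        # Caps the size of each Fluentd buffer chunk (illustrative value)
        chunkLimitSize: 8m
        # Caps the total buffer size for an output (illustrative value)
        totalLimitSize: 32m

One caveat worth checking: the 262144-byte figure in the error message is reported for a single log event ("Log event in xxxxxx is discarded because it is too large"), so it may need verifying whether chunk-level tuning alone changes the behaviour for individual oversized audit events.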