Bug 1826861

Summary:	DiskPressure due to 80 GB /var/lib/fluentd
Product:	OpenShift Container Platform	Reporter:	Periklis Tsirakidis <periklis>
Component:	Logging	Assignee:	Periklis Tsirakidis <periklis>
Status:	CLOSED ERRATA	QA Contact:	Anping Li <anli>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	aelganzo, aos-bugs, kgarriso, scuppett
Target Milestone:	---
Target Release:	4.4.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: At high incoming log rates, Fluentd could flood the node's filesystem because the buffer queues were not limited. Consequence: Disk pressure could eventually crash the node, causing its applications to be rescheduled. Fix: The Fluentd buffer queue per output is limited to a fixed number of chunks (default 32). Result: This fix should prevent node disk pressure caused by Fluentd buffers.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-05-18 13:35:02 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1780698
Bug Blocks:	1833226

Description Periklis Tsirakidis 2020-04-22 16:27:41 UTC

This bug was initially created as a copy of Bug #1780698

I am copying this bug because: 



Description of problem:
I have a cluster that was on 4.1.23 (upgraded continuously from about 4.1.4).

The upgrade to either 4.1.24 or 4.1.25 fails with a download error:

info: An upgrade is in progress. Unable to apply 4.1.25: could not download the update

Updates:

VERSION IMAGE
4.1.25  quay.io/openshift-release-dev/ocp-release@sha256:5f824fa3b3c44c6a78a5fc6a77a82edc47cf2b495bb6b2b31e3e0a4d3d77684b
4.1.24  quay.io/openshift-release-dev/ocp-release@sha256:6f87fb66dfa907db03981e69474ea3069deab66358c18d965f6331bd727ff23f

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.23    True        True          44h     Unable to apply 4.1.25: could not download the update

All Cluster Operators show on 4.1.23.

oc adm must-gather is at https://drive.google.com/open?id=18mqD6BpEwAQbApb1_5MD9j-cMBckBIA6

I can provide a kubeconfig as well to poke around there.

Comment 1 Kirsten Garrison 2020-04-22 21:55:59 UTC

For some reason the bot didn't link the PR to this BZ:

https://github.com/openshift/cluster-logging-operator/pull/491

Comment 6 Anping Li 2020-05-08 11:01:55 UTC

Verified in clusterlogging.4.4.0-202005072005. 
With ES turned off, the directory size stopped increasing once it reached 257M. After ES was turned back on, the directory size decreased.

56M	/var/lib/fluentd/
56M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
121M	/var/lib/fluentd/
121M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es

257M	/var/lib/fluentd/
257M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es


Fri May  8 06:46:21 EDT 2020
257M	/var/lib/fluentd/
257M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Fri May  8 06:51:23 EDT 2020
16M	/var/lib/fluentd/
16M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
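
For reference, a rough sanity check of the 257M plateau, as a sketch only. The queue limit of 32 chunks per output comes from the Doc Text above. The 8 MiB chunk size is an assumption and is not stated anywhere in this report. Under that assumption the buffer directory for one output would top out at about 256 MiB, which is close to the observed 257M. The small difference could be du rounding or buffer metadata files.

```python
# Rough upper bound on the size of one Fluentd output buffer directory.
# QUEUE_LIMIT is taken from the Doc Text (default 32 chunks per output).
# CHUNK_LIMIT_MIB is an ASSUMPTION (8 MiB per chunk); this report does not state it.

QUEUE_LIMIT = 32
CHUNK_LIMIT_MIB = 8  # assumed value, not confirmed by this report


def max_buffer_mib(queue_limit: int, chunk_limit_mib: int) -> int:
    """Return the largest size, in MiB, that a queue of full chunks can reach."""
    return queue_limit * chunk_limit_mib


if __name__ == "__main__":
    # Compare the result with the du output for clo_default_output_es.
    print(f"{max_buffer_mib(QUEUE_LIMIT, CHUNK_LIMIT_MIB)} MiB")
```

If the cluster uses a different chunk size, substitute it for CHUNK_LIMIT_MIB. The point is only that the cap scales as queue limit times chunk size per output.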

Comment 8 errata-xmlrpc 2020-05-18 13:35:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2133