Bug 1824427

Summary: Fluentd storage -buffer-output-es-config- doesn't stop growing
Product: OpenShift Container Platform
Reporter: Oscar Casal Sanchez <ocasalsa>
Component: Logging
Assignee: Periklis Tsirakidis <periklis>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: aos-bugs, scuppett
Target Milestone: ---
Keywords: Reopened
Target Release: 4.3.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: At high incoming log rates, Fluentd could flood the node's filesystem because the buffer queues were not limited.
Consequence: Disk pressure could eventually crash the node, and the applications running on it would be rescheduled.
Fix: The Fluentd buffer queue per output is limited to a fixed number of chunks (default 32).
Result: Node disk pressure due to Fluentd buffers should be avoided by this fix.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-27 17:00:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1833226
Bug Blocks:

Description Oscar Casal Sanchez 2020-04-16 08:33:51 UTC
[Description of problem]

In OCP 4.3.10, Fluentd with the default configuration keeps growing its persistent storage without limit if Elasticsearch is down or cannot consume the logs as fast as Fluentd sends them. This can fill the filesystem.

[Version-Release number of selected component (if applicable)]
4.3.10


[How reproducible]
Always


[Steps to Reproduce]

1. Deploy logging stack following the OCP 4.3 documentation [1]
2. Stop Elasticsearch, or generate more logs in Fluentd than Elasticsearch is able to consume


[Actual results]

## SSH to the node where fluentd is running
$ du -shc /sysroot/ostree/deploy/rhcos/var/lib/fluentd
45G	buffer-output-es-config
0	es-retry
45G	total
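
A single du reading shows the size at one moment only. To watch the buffer directory over time, a loop along the following lines could be used. This is a minimal sketch and not part of the original report: the function name sample_buffer_usage, the 180-second default interval, and the sample count are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): print a timestamp followed by
# the disk usage of a directory and its immediate subdirectories, repeated a
# fixed number of times.
# Usage: sample_buffer_usage DIR [INTERVAL_SECONDS] [SAMPLES]
sample_buffer_usage() {
    dir=$1
    interval=${2:-180}   # assumed default: one sample every 3 minutes
    samples=${3:-1}      # assumed default: a single sample
    i=0
    while [ "$i" -lt "$samples" ]; do
        date
        # -d 1 limits the report to the directory and its direct children.
        du -h -d 1 "$dir"
        i=$((i + 1))
        # Sleep only between samples, not after the last one.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_buffer_usage /var/lib/fluentd 180 20` would take 20 samples over roughly an hour.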


[Expected results]

Fluentd is expected to stop writing data to persistent storage once it reaches a limit. The documentation states: "The permanent volume size must be larger than FILE_BUFFER_LIMIT multiplied by the output."
shift.com/container-platform/4.3/logging/cluster-logging.html

Comment 1 Stephen Cuppett 2020-04-16 13:18:14 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 2 Jeff Cantrill 2020-04-17 16:14:22 UTC
Closing as a duplicate because it is the same issue. A fix will be forthcoming, with the intention of backporting it to 4.3.

*** This bug has been marked as a duplicate of bug 1780698 ***

Comment 3 Periklis Tsirakidis 2020-05-08 17:39:25 UTC
PR in: https://bugzilla.redhat.com/show_bug.cgi?id=1833226

Comment 4 Periklis Tsirakidis 2020-05-14 06:47:53 UTC
Manually moved to MODIFIED because it is the same fix as in https://bugzilla.redhat.com/show_bug.cgi?id=1833226

Comment 7 Anping Li 2020-05-15 04:28:46 UTC
Verified on clusterlogging.4.3.20-202005141057
1) Stop the ES pods
2) The Fluentd disk usage keeps growing until the size is about 257M
3) Recover ES
4) The size decreases

Thu May 14 23:10:35 EDT 2020
186M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:13:36 EDT 2020
209M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:16:38 EDT 2020
254M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:19:39 EDT 2020
257M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:22:40 EDT 2020
261M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:25:41 EDT 2020
260M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
 <---snip ---->
Thu May 14 23:55:57 EDT 2020
257M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:58:58 EDT 2020
262M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Fri May 15 00:01:59 EDT 2020
568K	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
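
The plateau at roughly 257M to 262M is consistent with a bounded queue. As a rough check, assuming each chunk is capped at 8 MiB (an assumption; the chunk size is not stated in this bug), the default of 32 chunks per output from the Doc Text gives the bound computed below. The function name buffer_cap_mib is hypothetical.

```shell
#!/bin/sh
# Upper bound on buffered data for one output, in MiB:
# (maximum number of queued chunks) x (maximum size of one chunk in MiB).
buffer_cap_mib() {
    queue_limit=$1   # 32 per the Doc Text of this bug
    chunk_mib=$2     # ASSUMPTION: 8 MiB per chunk, not stated in this bug
    echo $((queue_limit * chunk_mib))
}

buffer_cap_mib 32 8   # prints 256
```

The observed sizes sit slightly above 256M, which may be explained by du also counting chunk metadata files and filesystem block rounding.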

Comment 9 errata-xmlrpc 2020-05-27 17:00:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2184