1824427 – Fluentd storage -buffer-output-es-config- doesn't stop of growing

Summary: Fluentd storage -buffer-output-es-config- doesn't stop of growing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.3.z
Assignee:	Periklis Tsirakidis
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:
Depends On:	1833226
Blocks:

Reported:	2020-04-16 08:33 UTC by Oscar Casal Sanchez
Modified:	2023-10-06 19:41 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: At high incoming log rates, Fluentd could flood the node's filesystem because the buffer queues were not limited. Consequence: Disk pressure could eventually crash the node, and its applications would then be rescheduled. Fix: The fluentd buffer queue per output is limited to a fixed number of chunks (default 32). Result: Node disk pressure due to fluentd buffers should be avoided by this fix.
Clone Of:
Environment:
Last Closed:	2020-05-27 17:00:45 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2184	0	None	None	None	2020-05-27 17:00:58 UTC

Description Oscar Casal Sanchez 2020-04-16 08:33:51 UTC

[Description of problem]

In OCP 4.3.10, fluentd with the default configuration keeps growing its persistent storage without limit if Elasticsearch is down or cannot consume the logs at the rate Fluentd sends them, which can lead to a full filesystem.

[Version-Release number of selected component (if applicable)]
4.3.10


[How reproducible]
Always


[Steps to Reproduce]

1. Deploy logging stack following the OCP 4.3 documentation [1]
2. Stop Elasticsearch, or generate more logs in Fluentd than ES is able to consume


[Actual results]

## SSH to the node where fluentd is running
$ du -shc /sysroot/ostree/deploy/rhcos/var/lib/fluentd
45G	buffer-output-es-config
0	es-retry
45G	total


[Expected results]

Fluentd is expected to stop writing data to the persistent storage once it reaches a limit. The documentation states: "The permanent volume size must be larger than FILE_BUFFER_LIMIT multiplied by the output."
shift.com/container-platform/4.3/logging/cluster-logging.html
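
As a rough illustration of the sizing rule quoted above, the following sketch computes the minimum volume size and compares it with an observed buffer size. It reads "multiplied by the output" as "multiplied by the number of outputs", which is an interpretation, not something the documentation quote confirms. The helper names and the 256Mi value are hypothetical; only the 45G figure comes from this report.

```python
import re

# Binary multipliers for size suffixes (assumed notation, e.g. "256Mi", "45G").
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string such as '256Mi' or '45G' to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)i?B?\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def min_volume_bytes(file_buffer_limit: str, output_count: int) -> int:
    """Lower bound on the volume size: FILE_BUFFER_LIMIT times the output count."""
    if output_count < 1:
        raise ValueError("output_count must be at least 1")
    return parse_size(file_buffer_limit) * output_count


def exceeds_limit(observed: str, file_buffer_limit: str, output_count: int) -> bool:
    """True if the observed buffer size is larger than the documented bound."""
    return parse_size(observed) > min_volume_bytes(file_buffer_limit, output_count)


if __name__ == "__main__":
    # 45G is the size reported above; 256Mi and one output are illustrative only.
    print(exceeds_limit("45G", "256Mi", 1))
```

With any limit in the hundreds of megabytes, the 45G buffer directory is far beyond the documented bound, which is the behaviour this report describes.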

Comment 1 Stephen Cuppett 2020-04-16 13:18:14 UTC

Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 2 Jeff Cantrill 2020-04-17 16:14:22 UTC

Closing as a duplicate, as it's the same issue. A fix will be forthcoming, with the intention of backporting it to 4.3.

*** This bug has been marked as a duplicate of bug 1780698 ***

Comment 3 Periklis Tsirakidis 2020-05-08 17:39:25 UTC

PR in: https://bugzilla.redhat.com/show_bug.cgi?id=1833226

Comment 4 Periklis Tsirakidis 2020-05-14 06:47:53 UTC

Manually moved to MODIFIED because the fix is the same as in https://bugzilla.redhat.com/show_bug.cgi?id=1833226

Comment 7 Anping Li 2020-05-15 04:28:46 UTC

Verified on clusterlogging.4.3.20-202005141057
1) stop ES pods
2) The fluentd disk usage keeps growing until the size is about 257M
3) Recover ES
4) The size decreases

Thu May 14 23:10:35 EDT 2020
186M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:13:36 EDT 2020
209M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:16:38 EDT 2020
254M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:19:39 EDT 2020
257M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:22:40 EDT 2020
261M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:25:41 EDT 2020
260M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
 <---snip ---->
Thu May 14 23:55:57 EDT 2020
257M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Thu May 14 23:58:58 EDT 2020
262M	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
Fri May 15 00:01:59 EDT 2020
568K	/var/lib/fluentd/clo_default_output_es
0	/var/lib/fluentd/retry_clo_default_output_es
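
The plateau in the samples above can be checked against the chunk limit with a small sketch like the one below. The limit of 32 chunks comes from the Doc Text of this bug. The 8 MiB chunk size and the 5% slack (to absorb `du` rounding and buffer metadata files) are assumptions chosen for illustration and are not stated anywhere in this report. The function names are hypothetical.

```python
# Sizes of /var/lib/fluentd/clo_default_output_es copied from the samples above.
SAMPLES = ["186M", "209M", "254M", "257M", "261M", "260M", "257M", "262M", "568K"]

_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def du_to_bytes(value: str) -> int:
    """Convert a `du -h` style value such as '257M' or '0' to bytes."""
    if value[-1].isdigit():
        return int(value)
    return int(float(value[:-1]) * _UNITS[value[-1].upper()])


def summarize(samples, chunk_limit=32, chunk_bytes=8 * 1024**2, slack=0.05):
    """Compare the peak buffer size with chunk_limit * chunk_bytes.

    chunk_bytes and slack are assumed values, not taken from the bug report.
    """
    sizes = [du_to_bytes(s) for s in samples]
    bound = chunk_limit * chunk_bytes
    peak = max(sizes)
    return {
        "bound": bound,
        "peak": peak,
        "final": sizes[-1],
        "within_bound": peak <= bound * (1 + slack),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLES).items():
        print(f"{key}: {value}")
```

Under these assumptions the bound is 256 MiB, so the observed peak of 262M sits within the slack, and the final 568K sample shows the buffer draining once ES recovered.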

Comment 9 errata-xmlrpc 2020-05-27 17:00:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2184
