1506854 – default fluentd elasticsearch plugin request timeout too short by default, leads to potential log loss and stalled log flow

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.4.1
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	3.4.z
Assignee:	Rich Megginson
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1497836
Blocks:

Reported:	2017-10-27 01:55 UTC by Xiaoli Tian
Modified:	2020-12-14 10:40 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When the logging system is under heavy load, Elasticsearch may take longer than the 5-second request timeout to respond, or it may respond with an error indicating that Fluentd needs to back off. Consequence: In the former case, Fluentd retries sending the records, which can lead to duplicate records. In the latter case, if Fluentd is unable to retry, it drops records, leading to data loss. Fix: For the former case, request_timeout is set to 10 minutes, so Fluentd waits up to 10 minutes for the reply from Elasticsearch before retrying the request. For the latter case, Fluentd blocks reading more input until the output queues and buffers have enough room to write more data. Result: The chance of duplicate data is greatly reduced (but not entirely eliminated), and no data is lost due to backpressure.
Clone Of:	1497836
Environment:
Last Closed:	2017-12-07 07:14:18 UTC
Target Upstream Version:
Embargoed:
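
The two failure modes in the Doc Text can be illustrated with a toy model. This is only a sketch and not Fluentd's or the Elasticsearch plugin's implementation: the function names, the retry count, and the assumption that Elasticsearch finishes indexing a request even after the client has timed out are all hypothetical and chosen for illustration.

```python
from collections import deque


def copies_indexed(response_time_s, request_timeout_s, max_retries=3):
    """Count how many copies of one chunk end up indexed.

    Assumption (illustrative): every request that reaches Elasticsearch is
    eventually indexed, even if the client stopped waiting for the reply.
    The client resends whenever the reply takes longer than its timeout.
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        if response_time_s <= request_timeout_s:
            break  # reply arrived in time, so no retry is sent
    return attempts


def feed(records, queue_limit, on_full):
    """Push records into a bounded buffer that is not drained in this model.

    on_full="drop"  -> records that do not fit are discarded (data loss).
    on_full="block" -> input reading stops; unread records stay at the source.
    Returns (queued, dropped, left_unread).
    """
    queue = deque()
    dropped = 0
    left_unread = 0
    for index, record in enumerate(records):
        if len(queue) < queue_limit:
            queue.append(record)
        elif on_full == "drop":
            dropped += 1
        else:
            # Blocking: stop consuming input until there is room again.
            left_unread = len(records) - index
            break
    return len(queue), dropped, left_unread


if __name__ == "__main__":
    # A reply that takes 8 s against a 5 s timeout is retried every time.
    print(copies_indexed(8, 5))      # 4 copies: 1 original + 3 duplicates
    print(copies_indexed(8, 600))    # 1 copy: the 10-minute timeout covers it
    print(feed(list(range(10)), 4, "drop"))   # (4, 6, 0)
    print(feed(list(range(10)), 4, "block"))  # (4, 0, 6)
```

In this model the longer timeout removes duplicates only while Elasticsearch answers within 10 minutes, which matches the "not entirely eliminated" caveat, and blocking trades stalled input for zero dropped records.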

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift origin-aggregated-logging pull 719	None	None	None	2017-10-27 02:02:48 UTC
Github	openshift origin-aggregated-logging pull 698	None	closed	Raise ES output plugin request timeout to 600s	2020-12-16 15:22:49 UTC
Github	openshift origin-aggregated-logging pull 706	None	closed	es-copy Raise ES output plugin request timeout to 600s	2020-12-16 15:22:49 UTC
Red Hat Knowledge Base (Solution)	3214991	None	None	None	2017-10-27 01:55:00 UTC
Red Hat Product Errata	RHSA-2017:3389	normal	SHIPPED_LIVE	Moderate: Red Hat OpenShift Enterprise security, bug fix, and enhancement update	2017-12-07 12:09:10 UTC

Comment 2 Anping Li 2017-10-27 11:39:18 UTC

QE tested logging-fluentd:3.4.1-30 on OCP 3.4.1.44.26.

The fix is in logging-fluentd:3.4.1-30. Fluentd successfully sent about 440M logs in 3 hours. No 'Connection opened to Elasticsearch cluster' messages or exceptions were reported, and the logs can be retrieved in Kibana.

# openshift version
openshift v3.4.1.44.26
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0


Images:
logging-elasticsearch:3.4.1-45
logging-kibana:3.4.1-36
logging-fluentd:3.4.1-30
logging-auth-proxy:3.4.1-33
logging-curator:v3.4.1.44.26

Comment 5 errata-xmlrpc 2017-12-07 07:14:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389
