Bug 1497836 - default fluentd elasticsearch plugin request timeout too short by default, leads to potential log loss and stalled log flow
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging (Show other bugs)
Version: 3.6.1
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assigned To: Rich Megginson
QA Contact: Anping Li
Keywords: OpsBlocker, Reopened
Depends On:
Blocks: 1501948 1506854
 
Reported: 2017-10-02 16:02 EDT by Peter Portante
Modified: 2017-12-07 02:12 EST
CC List: 8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If the logging system is under heavy load, it may take longer than the 5 second timeout for Elasticsearch to respond, or Elasticsearch may respond with an error indicating that Fluentd needs to back off.
Consequence: In the former case, Fluentd retries sending the records, which can lead to duplicate records. In the latter case, if Fluentd is unable to retry, it drops records, leading to data loss.
Fix: For the former case, the request_timeout is set to 10 minutes, so that Fluentd waits up to 10 minutes for the reply from Elasticsearch before retrying the request. For the latter case, Fluentd blocks attempts to read more input until the output queues and buffers have enough room to write more data.
Result: Greatly reduced chance of duplicate data (but not entirely eliminated), and no data loss due to backpressure.
Story Points: ---
Clone Of:
Cloned To: 1501948 1506854
Environment:
Last Closed: 2017-12-07 02:12:13 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: (none)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3214991 None None None 2017-10-13 10:34 EDT
Github openshift/origin-aggregated-logging/pull/698 None None None 2017-10-04 22:29 EDT
Github openshift/origin-aggregated-logging/pull/706 None None None 2017-10-10 13:18 EDT
Red Hat Product Errata RHSA-2017:3389 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Enterprise security, bug fix, and enhancement update 2017-12-07 07:09:10 EST

Description Peter Portante 2017-10-02 16:02:58 EDT
At times, logging from a particular namespace seems to have been stopped from the Kibana view of logs.

This can occur when a namespace's pods running on one or more nodes in the cluster are not being indexed into Elasticsearch by that node's fluentd process.

The fluentd process can get into this state when every attempt to write logs to an Elasticsearch instance takes longer than 5 seconds to complete.  Until writes consistently complete in under 5 seconds, logging essentially stops working.

This can lead to log loss when containers come and go: fluentd fills up its internal queue and is then unable to read logs from the new containers.

The fix requires changing the "request_timeout" parameter of the elasticsearch output plugin to a very high value (600 seconds, or 10 minutes, should be sufficient for most purposes), and setting "buffer_queue_full_action" to "block" to prevent further records from being read from log files and then dropped on the floor.

Further, using a "flush_interval" of 1 second ensures writes flow to Elasticsearch with minimal delay, keeps writes from each node in the cluster smaller, and allows more overlap between log file reading and Elasticsearch writes.
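
As a rough sketch only (the match pattern, host, and port below are placeholders for illustration, not the values shipped in the OpenShift logging configuration), the relevant elasticsearch output section of the fluentd config would look something like:

  <match **>
    @type elasticsearch
    host logging-es                    # placeholder Elasticsearch service name
    port 9200
    # wait up to 10 minutes for Elasticsearch to reply before retrying the request
    request_timeout 600s
    # block further reads (apply backpressure) instead of dropping records
    # when the output buffer queue is full
    buffer_queue_full_action block
    # flush small batches every second so writes from each node stay small
    flush_interval 1s
  </match>

With buffer_queue_full_action set to block, fluentd stops reading new log lines until the output buffers drain, trading throughput for no data loss; the short flush_interval keeps individual bulk requests small, so Elasticsearch is less likely to hit the timeout in the first place.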
Comment 1 Peter Portante 2017-10-03 19:34:36 EDT
See also PR https://github.com/openshift/origin-aggregated-logging/pull/698 for a proposed set of changes upstream.
Comment 2 Jeff Cantrill 2017-10-06 14:20:35 EDT
Closing in favor of the referenced trello card
Comment 3 Peter Portante 2017-10-09 10:07:40 EDT
Why close this bug when we need a BZ to file this against 3.6.z, no?
Comment 7 Junqi Zhao 2017-10-27 06:04:16 EDT
Created 6 projects to populate logs and let them run for 3 hours; no exception was found, and project logs could be found in the Kibana UI.

# openshift version
openshift v3.6.173.0.49
kubernetes v1.6.1+5115d708d7
etcd 3.2.1


Images:
logging-auth-proxy-v3.6.173.0.49-4
logging-curator-v3.6.173.0.49-4
logging-elasticsearch-v3.6.173.0.49-5
logging-fluentd-v3.6.173.0.49-4
logging-kibana-v3.6.173.0.49-5


# rpm -qa | grep openshift-ansible
openshift-ansible-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-roles-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-docs-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-lookup-plugins-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-callback-plugins-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-playbooks-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-filter-plugins-3.6.173.0.48-1.git.0.1609d30.el7.noarch
Comment 8 Anping Li 2017-10-31 21:35:23 EDT
Performance testing also passed, with no log loss.
Comment 9 Anping Li 2017-10-31 22:25:45 EDT
QE was able to run 750 - 1000 1K messages/sec through fluentd with no message loss using the default settings.   
750 - 1000 1K messages per second per fluentd is the current maximum we've found in testing.
Comment 12 errata-xmlrpc 2017-12-07 02:12:13 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3389
