Bug 1565372

Summary:	The ES fluentd plugin appears to direct the elasticsearch-api to retry a failed request 5 times before bubbling up an error causing ES bulk queue overflows
Product:	OpenShift Container Platform	Reporter:	Peter Portante <pportant>
Component:	Logging	Assignee:	Jeff Cantrill <jcantril>
Status:	CLOSED ERRATA	QA Contact:	Anping Li <anli>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.9.0	CC:	aos-bugs, juzhao, pportant, rmeggins
Target Milestone:	---	Keywords:	OpsBlocker
Target Release:	3.9.z
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:	undefined	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-06-06 15:46:20 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Peter Portante 2018-04-09 23:45:51 UTC

The ES fluentd plugin appears to direct the elasticsearch-api to retry a failed request 5 times before bubbling up an error causing ES bulk queue overflows.

We have a cluster with 105 nodes in it (3 masters, 3 infra, 99 compute), where the Elasticsearch bulk queue size parameter has been bumped to 400.

The fluentd collectors are configured to use two elasticsearch output plugin instances, one for app logs and one for ops logs.  The fluentd plugin only makes ES bulk API calls one at a time, so there should never be more than 2 outstanding bulk API calls made to Elasticsearch.  The request timeout parameter has been bumped to infinity (effectively).

However, we have seen bulk queue sizes well over 400 (where ES is in the process of rejecting the requests).

The most we should see is 105 * 2 = 210 bulk requests.

We suspect that https://github.com/uken/fluent-plugin-elasticsearch/blob/v0.12/lib/fluent/plugin/out_elasticsearch.rb#L179 is the root cause, where it is directing the lower layers of the API handling to retry 5 times on error.  There does not appear to be a way to configure this parameter in that version, though.

Regardless, we need to get control over the back-pressure behavior in fluentd so that the clients don't issue too many API requests when ES fills its bulk queues.

Comment 1 Peter Portante 2018-04-10 00:00:31 UTC

It appears this retry 5 times was added with this commit: https://github.com/uken/fluent-plugin-elasticsearch/commit/212c426400cee494c74bd479f55a1d005e94c610

Comment 2 Peter Portante 2018-04-10 00:02:11 UTC

FWIW, the elasticsearch-ruby API default to 'false" for the retry option: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/client.rb#L57

Comment 3 Peter Portante 2018-04-10 01:03:19 UTC

Here is an example of a bulk request failure from a put mapping that took longer than 30s:

2018-04-09 14:56:10 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-09 14:53:10 +0000 error_class="Fluent::ElasticsearchErrorHandler::UnrecognizedElasticsearchError" error="Unrecognized elasticsearch errors returned, retrying  {\"took\"=>49096, \"errors\"=>true, \"items\"=>[ ... {\"index\"=>{\"_index\"=>\"project.slacklab.c47fc056-91aa-11e7-9936-06579ed29230.2018.04.09\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"MjdhODRmZjQtNDNiYi00OWU4LWI1NzMtN2RhZmVhYWRlZGUx\", \"status\"=>503, \"error\"=>{\"type\"=>\"process_cluster_event_timeout_exception\", \"reason\"=>\"failed to process cluster event (put-mapping [com.redhat.viaq.common]) within 30s\"}}}, ... ]

Comment 4 Peter Portante 2018-04-10 01:07:02 UTC

Filed upstream PR to fix it: https://github.com/uken/fluent-plugin-elasticsearch/pull/395

Comment 5 Rich Megginson 2018-04-10 01:14:48 UTC

I don't think retry_on_failure will cause more than 2*(number of nodes) bulk index requests to be in flight at once.

This parameter gets used here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L55

    @max_retries = options[:retry_on_failure].is_a?(Integer) ? options[:retry_on_failure] : DEFAULT_MAX_RETRIES

so in our case, @max_retries is 5.  @max_retries gets used in 2 places.  One is here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L291

    rescue *host_unreachable_exceptions => e
      if tries <= max_retries
        retry

This only happens during connection, when the host is unreachable, so this isn't the case we're seeing.  The other place @max_retries is used is here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L269

    rescue Elasticsearch::Transport::Transport::ServerError => e
      if @retry_on_status.include?(response.status)
        if tries <= max_retries
          retry

so we only retry if the response.status is included in @retry_on_status.  The default value is [], so by default this is not used.  This isn't set in the fluent-plugin-elasticsearch code:

# grep retry_on_status fluent-plugin-elasticsearch-1.13.3/lib/fluent/plugin/*
#

And it isn't set in our code.

Comment 6 Jeff Cantrill 2018-04-18 20:34:03 UTC

I believe this will be resolved by https://github.com/uken/fluent-plugin-elasticsearch/pull/405

Comment 7 Jeff Cantrill 2018-05-09 17:41:22 UTC

And also: https://github.com/openshift/origin-aggregated-logging/pull/1151

Comment 9 Junqi Zhao 2018-06-01 13:05:16 UTC

Peter,

Could you share more details on how to verify it?
Thanks.

Comment 10 Anping Li 2018-06-04 10:27:31 UTC

Verified in openshift3/logging-fluentd/images/v3.9.30-2. 

1. The fix have been merged.
2. The resend can be watched when Elastic search decrease the bulk size in Elalsticsearch.

2018-06-04 06:21:17 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 14 es_rejected_execution_exception, 28 successes
2018-06-04 06:21:19 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 68 es_rejected_execution_exception, 84 successes
2018-06-04 06:21:24 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 102 successes, 104 es_rejected_execution_exception
2018-06-04 06:21:29 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 18 es_rejected_execution_exception, 53 successes
2018-06-04 06:21:34 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 43 successes, 44 es_rejected_execution_exception
2018-06-04 06:21:41 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 164 successes, 172 es_rejected_execution_exception
2018-06-04 06:21:45 -0400 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 3, pod_cache_size: 5, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss:


The test was running against small cluster by tune the thread pool size. no performance testing are executed.

Comment 12 errata-xmlrpc 2018-06-06 15:46:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1796