The fluentd Elasticsearch plugin appears to direct the elasticsearch-ruby API to retry a failed request 5 times before bubbling up an error, which causes ES bulk queue overflows.

We have a cluster with 105 nodes (3 masters, 3 infra, 99 compute), where the Elasticsearch bulk queue size parameter has been bumped to 400. The fluentd collectors are configured with two elasticsearch output plugin instances, one for app logs and one for ops logs. The fluentd plugin only makes ES bulk API calls one at a time, so each collector should never have more than 2 outstanding bulk API calls against Elasticsearch, and the most we should ever see cluster-wide is 105 * 2 = 210 bulk requests. The request timeout parameter has been bumped to (effectively) infinity. However, we have seen bulk queue sizes well over 400 (at which point ES starts rejecting requests).

We suspect the root cause is https://github.com/uken/fluent-plugin-elasticsearch/blob/v0.12/lib/fluent/plugin/out_elasticsearch.rb#L179, which directs the lower layers of the API handling to retry 5 times on error. There does not appear to be a way to configure this parameter in that version. Regardless, we need to get control over the back-pressure behavior in fluentd so that the clients don't issue too many API requests when ES fills its bulk queues.
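For reference, what the plugin does at the linked line boils down to something like the following (an abridged, self-contained sketch, not a verbatim quote; the URL and timeout values are illustrative):

require 'elasticsearch'

# Abridged sketch of the client setup in out_elasticsearch.rb v0.12
# (see the linked line for the real code). The key point: retry_on_failure
# is hardcoded to 5 and no plugin parameter exposes it.
client = Elasticsearch::Client.new(
  url: 'http://localhost:9200',   # illustrative; the plugin builds hosts from its config
  reload_connections: true,
  reload_on_failure: false,
  retry_on_failure: 5,            # hardcoded in the plugin; the suspect setting
  transport_options: {
    request: { timeout: 600 }     # illustrative request_timeout
  }
)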
It appears this retry-5-times behavior was added by this commit: https://github.com/uken/fluent-plugin-elasticsearch/commit/212c426400cee494c74bd479f55a1d005e94c610
FWIW, the elasticsearch-ruby API defaults to false for the retry_on_failure option: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/client.rb#L57
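A minimal sketch of that default, assuming elasticsearch-ruby 2.x (the URL is illustrative):

require 'elasticsearch'

# Default: retry_on_failure is false, so no retries at all.
Elasticsearch::Client.new url: 'http://localhost:9200'

# Passing true retries up to DEFAULT_MAX_RETRIES (3); passing an Integer,
# as the plugin hardcodes with 5, sets the cap explicitly.
Elasticsearch::Client.new url: 'http://localhost:9200', retry_on_failure: true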
Here is an example of a bulk request failure from a put-mapping operation that took longer than 30s:

2018-04-09 14:56:10 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-09 14:53:10 +0000 error_class="Fluent::ElasticsearchErrorHandler::UnrecognizedElasticsearchError" error="Unrecognized elasticsearch errors returned, retrying {\"took\"=>49096, \"errors\"=>true, \"items\"=>[ ... {\"index\"=>{\"_index\"=>\"project.slacklab.c47fc056-91aa-11e7-9936-06579ed29230.2018.04.09\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"MjdhODRmZjQtNDNiYi00OWU4LWI1NzMtN2RhZmVhYWRlZGUx\", \"status\"=>503, \"error\"=>{\"type\"=>\"process_cluster_event_timeout_exception\", \"reason\"=>\"failed to process cluster event (put-mapping [com.redhat.viaq.common]) within 30s\"}}}, ... ]
Filed upstream PR to fix it: https://github.com/uken/fluent-plugin-elasticsearch/pull/395
I don't think retry_on_failure will cause more than 2 * (number of nodes) bulk index requests to be in flight at once. This parameter gets used here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L55

  @max_retries = options[:retry_on_failure].is_a?(Integer) ? options[:retry_on_failure] : DEFAULT_MAX_RETRIES

so in our case @max_retries is 5. @max_retries gets used in 2 places. One is here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L291

  rescue *host_unreachable_exceptions => e
    if tries <= max_retries
      retry

This only happens during connection, when the host is unreachable, so this isn't the case we're seeing. The other place @max_retries is used is here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L269

  rescue Elasticsearch::Transport::Transport::ServerError => e
    if @retry_on_status.include?(response.status)
      if tries <= max_retries
        retry

so we only retry if response.status is included in @retry_on_status. The default value is [], so by default this is not used. It isn't set in the fluent-plugin-elasticsearch code:

  # grep retry_on_status fluent-plugin-elasticsearch-1.13.3/lib/fluent/plugin/*
  #

and it isn't set in our code either.
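For completeness, opting in to status-based retries would look something like this (a hypothetical sketch against elasticsearch-ruby 2.x; nothing in our stack sets these options, which is the point):

require 'elasticsearch'

# retry_on_status is [] by default, so a 503 es_rejected_execution_exception
# response is raised immediately rather than retried at the transport level.
client = Elasticsearch::Client.new(
  url: 'http://localhost:9200',   # illustrative
  retry_on_failure: 5,            # caps retries for both retry paths
  retry_on_status: [502, 503]     # statuses to retry; NOT set by the plugin or us
)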
I believe this will be resolved by https://github.com/uken/fluent-plugin-elasticsearch/pull/405
And also: https://github.com/openshift/origin-aggregated-logging/pull/1151
Peter, could you share more details on how to verify this? Thanks.
Verified in openshift3/logging-fluentd/images/v3.9.30-2.

1. The fix has been merged.
2. The resends can be observed when the Elasticsearch bulk queue size is decreased:

2018-06-04 06:21:17 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 14 es_rejected_execution_exception, 28 successes
2018-06-04 06:21:19 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 68 es_rejected_execution_exception, 84 successes
2018-06-04 06:21:24 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 102 successes, 104 es_rejected_execution_exception
2018-06-04 06:21:29 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 18 es_rejected_execution_exception, 53 successes
2018-06-04 06:21:34 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 43 successes, 44 es_rejected_execution_exception
2018-06-04 06:21:41 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 164 successes, 172 es_rejected_execution_exception
2018-06-04 06:21:45 -0400 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 3, pod_cache_size: 5, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss:

The test was run against a small cluster with the thread pool size tuned down; no performance testing was executed.
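For the record, "tuning the thread pool size" on ES 2.x can be done with settings along these lines in elasticsearch.yml (illustrative values, not the exact ones used in this test; ES 5.x renames the prefix to thread_pool):

# elasticsearch.yml - shrink the bulk thread pool so that
# es_rejected_execution_exception is easy to trigger under load
threadpool.bulk.size: 1
threadpool.bulk.queue_size: 10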
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1796