Bug 1565372
Summary: | The ES fluentd plugin appears to direct the elasticsearch-api to retry a failed request 5 times before bubbling up an error causing ES bulk queue overflows | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Peter Portante <pportant> |
Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.9.0 | CC: | aos-bugs, juzhao, pportant, rmeggins |
Target Milestone: | --- | Keywords: | OpsBlocker |
Target Release: | 3.9.z | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: |
undefined
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-06-06 15:46:20 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Peter Portante
2018-04-09 23:45:51 UTC
It appears this retry 5 times was added with this commit: https://github.com/uken/fluent-plugin-elasticsearch/commit/212c426400cee494c74bd479f55a1d005e94c610 FWIW, the elasticsearch-ruby API default to 'false" for the retry option: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/client.rb#L57 Here is an example of a bulk request failure from a put mapping that took longer than 30s: 2018-04-09 14:56:10 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-09 14:53:10 +0000 error_class="Fluent::ElasticsearchErrorHandler::UnrecognizedElasticsearchError" error="Unrecognized elasticsearch errors returned, retrying {\"took\"=>49096, \"errors\"=>true, \"items\"=>[ ... {\"index\"=>{\"_index\"=>\"project.slacklab.c47fc056-91aa-11e7-9936-06579ed29230.2018.04.09\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"MjdhODRmZjQtNDNiYi00OWU4LWI1NzMtN2RhZmVhYWRlZGUx\", \"status\"=>503, \"error\"=>{\"type\"=>\"process_cluster_event_timeout_exception\", \"reason\"=>\"failed to process cluster event (put-mapping [com.redhat.viaq.common]) within 30s\"}}}, ... ] Filed upstream PR to fix it: https://github.com/uken/fluent-plugin-elasticsearch/pull/395 I don't think retry_on_failure will cause more than 2*(number of nodes) bulk index requests to be in flight at once. This parameter gets used here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L55 @max_retries = options[:retry_on_failure].is_a?(Integer) ? options[:retry_on_failure] : DEFAULT_MAX_RETRIES so in our case, @max_retries is 5. @max_retries gets used in 2 places. One is here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L291 rescue *host_unreachable_exceptions => e if tries <= max_retries retry This only happens during connection, when the host is unreachable, so this isn't the case we're seeing. The other place @max_retries is used is here: https://github.com/elastic/elasticsearch-ruby/blob/2.x/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L269 rescue Elasticsearch::Transport::Transport::ServerError => e if @retry_on_status.include?(response.status) if tries <= max_retries retry so we only retry if the response.status is included in @retry_on_status. The default value is [], so by default this is not used. This isn't set in the fluent-plugin-elasticsearch code: # grep retry_on_status fluent-plugin-elasticsearch-1.13.3/lib/fluent/plugin/* # And it isn't set in our code. I believe this will be resolved by https://github.com/uken/fluent-plugin-elasticsearch/pull/405 Peter, Could you share more details on how to verify it? Thanks. Verified in openshift3/logging-fluentd/images/v3.9.30-2. 1. The fix have been merged. 2. The resend can be watched when Elastic search decrease the bulk size in Elalsticsearch. 2018-06-04 06:21:17 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 14 es_rejected_execution_exception, 28 successes 2018-06-04 06:21:19 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 68 es_rejected_execution_exception, 84 successes 2018-06-04 06:21:24 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 102 successes, 104 es_rejected_execution_exception 2018-06-04 06:21:29 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 18 es_rejected_execution_exception, 53 successes 2018-06-04 06:21:34 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 43 successes, 44 es_rejected_execution_exception 2018-06-04 06:21:41 -0400 [debug]: plugin/elasticsearch_error_handler.rb:83:block in handle_error: Indexed (op = create), 164 successes, 172 es_rejected_execution_exception 2018-06-04 06:21:45 -0400 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 3, pod_cache_size: 5, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss: The test was running against small cluster by tune the thread pool size. no performance testing are executed. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1796 |