Bug 1541429
| Summary: | logging-fluentd gets wedged in a state where retries do not succeed after bulk index queue full error | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Logging | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | aos-bugs, dapark, jcantril, jmalde, mifiedle, pportant, rmeggins, smunilla, xtian |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 3.9.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-39 | | |
| Fixed In Version: | logging-fluentd-container-v3.9.30-1 | Doc Type: | Bug Fix |

Doc Text:
Cause: When fluentd submits a bulk index request to Elasticsearch and it is rejected, fluentd will wait, then resubmit the entire bulk request again, even if some of the individual operations in the bulk request succeeded. If the request is rejected again, fluentd will perform an exponential backoff, until it retries every 5 minutes.
Consequence: If Elasticsearch is very busy, the user will notice that records do not show up.
Fix: fluentd will go through the bulk index error response and process each individual response. If the response was successful, or a duplicate, fluentd will discard the record. If the response was a "hard" error, fluentd will store the record in a "dead letter queue" - a file that the user will need to examine and determine what can be done about the "bad" records. If the response was a "soft" error, fluentd will resubmit the record to be sent to Elasticsearch.
Result: fluentd does not overwhelm Elasticsearch with retries of operations which have already succeeded, reducing the processing load. fluentd does not keep retrying operations which will never succeed, thus keeping the other records flowing into Elasticsearch.
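A rough sketch of the per-item classification the Doc Text describes (this is not the plugin's code; it assumes jq is available, that a bulk response has been saved to response.json, and the status-code mapping shown is only illustrative):

# classify each item of an Elasticsearch bulk response the way the fix describes
# (illustrative mapping: 2xx success, 409 duplicate, 429 queue full, other 4xx hard error)
jq -c '.items[] | .[].status' response.json | while read -r status ; do
    case "$status" in
        2??) echo "status $status: success - discard the record" ;;
        409) echo "status $status: duplicate - discard the record" ;;
        429) echo "status $status: soft error (queue full) - resubmit the record" ;;
        4??) echo "status $status: hard error - write the record to the dead letter queue" ;;
        *)   echo "status $status: soft error - resubmit the record" ;;
    esac
done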
| Story Points: | --- | Clone Of: | |
|---|---|---|---|
| : | 1596716 (view as bug list) | Environment: | |
| Last Closed: | 2018-06-29 14:00:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Bug Depends On: | |
| Bug Blocks: | 1596716 | Attachments: | |
Description
Mike Fiedler
2018-02-02 14:35:50 UTC
Created attachment 1390200 [details]
buffer files
/var/lib/fluentd contents from the wedged logging-fluentd
Please advise what additional trace/log/documentation is required.

I switched fluent-plugin-elasticsearch to the upstream released 1.x version, which looked like it included the bulk index rejection handling code created by Peter in our "private" 1.9.5.1 release. But it may be that the upstream 1.x code is slightly different and does not work, in which case we may have to switch back to 1.9.5.1 until we can get the proper fix into upstream.

I have a simple test which forces fluentd to handle Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull. It starts a few curl processes in parallel that hammer Elasticsearch with bulk index requests, and sets the Elasticsearch bulk queue sizes artificially low. Here is an example of the stats during the run; there are 30 completed requests and 17 rejected.
curl_es $espod /_cat/thread_pool?v\&h=bc,br,ba,bq,bs
bc br ba bq bs
30 17 0 0 1
The test writes 50 container logs (json-file) and 50 system logs (journald) while fluentd is in this state.
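A minimal sketch of that kind of load generator (not the actual test script; it assumes the curl_es helper used elsewhere in this bug passes extra arguments through to curl, and bulk-payload.json is a hypothetical file of newline-delimited bulk actions):

# start a few parallel loops that hammer Elasticsearch with bulk index requests
for n in 1 2 3 4 ; do
    ( while true ; do
          curl_es $espod /_bulk -XPOST --data-binary @bulk-payload.json > /dev/null
      done ) &
done
# generate the container/journal logs, watch /_cat/thread_pool, then kill the background jobs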
So far, I have not been able to reproduce this bz. I'm using fluent-plugin-elasticsearch 1.13.1, but there's hardly any difference between 1.13.0 and 1.13.1.
It could be a problem that takes a while to show up. From the fluentd log:
2018-02-02 12:45:37 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:45:37 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
2018-02-02 12:45:37 +0000 [warn]: retry succeeded. plugin_id="object:3fc8fe2e77e4"
This retry succeeded almost immediately. This is typically what I see in my testing.
2018-02-02 12:50:43 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:44 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
2018-02-02 12:50:44 +0000 [warn]: retry succeeded. plugin_id="object:3fc8fe2e77e4"
It took 5 minutes for the condition to happen again, and again it succeeded almost immediately.
2018-02-02 12:50:45 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:46 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
...
2018-02-02 13:49:04 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 13:54:04 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
Then the problem happened again almost immediately after the retry succeeded, then it was stuck.
I don't think it has anything to do with the contents of the buffer, unless someone is reporting the wrong error message (es or fluent-plugin-es). To me, it looks like it keeps retrying, and Elasticsearch keeps replying that the bulk index queue is full, which could be a case of bad timing.
The other fluentd log is truncated here:
2018-02-02 12:38:34 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
Is that really the last line?
The elasticsearch pod log isn't very useful - need the log file from inside the pod at /elasticsearch/logging-es/logs/logging-es.log.
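One way to pull that file out of the pod for attaching to the bug (assuming the pod name is in the espod variable, as in the script below):

oc exec -c elasticsearch $espod -- cat /elasticsearch/logging-es/logs/logging-es.log > logging-es.log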
Also, monitor the bulk index queue stats e.g.
# get_es_pod / get_es_ops_pod are helpers that print the Elasticsearch pod name(s)
espod=$( get_es_pod )
# esopspod=$( get_es_ops_pod )   # if using an ops cluster
ii=0
# sample the bulk thread pool stats roughly once a second for an hour
while [ $ii -lt 3600 ] ; do
    date
    oc exec -c elasticsearch $espod -- es_util --query=/_cat/thread_pool?v\&h=host,bc,br,ba,bq,bs
    # oc exec -c elasticsearch $esopspod -- es_util --query=/_cat/thread_pool?v\&h=host,bc,br,ba,bq,bs
    sleep 1
    ii=$( expr $ii + 1 )
done
If the rejection rate isn't going up, but fluentd is still reporting the error, then there is a bug in the code.
This is what I'm working on: https://github.com/richm/origin-aggregated-logging/tree/bulk-rejection-test

Specifically: https://github.com/richm/origin-aggregated-logging/blob/bulk-rejection-test/test/bulk_rejection.sh and https://github.com/openshift/origin-aggregated-logging/pull/933

NOTE: This isn't a fix for the bug, but an attempt to create a CI test to force fluentd to handle bulk index rejection, retry, and eventual success in sending messages.

Through some experiments, I believe this is related to the number of namespaces involved. More narrowing down has to be done, but I've bisected a particular scenario fairly well.

0. Configuration is 1 master, 1 infra node, 2 compute nodes. ES is on the infra node, logging-fluentd on the computes.
1. Pick an arbitrary message rate logging-fluentd is known to handle comfortably from past release testing. I used 50 1K messages/second/node. In the tests below, each of the 2 compute nodes had pods logging at this aggregate rate.
2. Demonstrate that 50 mps/node works for 1 pod per node (each pod logging at 50 mps). All messages are received in ES and no buffers are stuck in /var/lib/fluentd.
3. Repeat, increasing the number of projects and pods per node while decreasing the per-pod message rate to keep the overall rate at 50 mps/node. For example: 4 namespaces with 4 pods = 2 pods per node logging at 25 mps each = 50 mps/node; 8 namespaces with 8 pods = 4 pods per node logging at 12.5 mps each = 50 mps/node.
4. Keep going until bulk index rejection messages are seen in:
   a. oc exec -c elasticsearch $esopspod -- es_util --query=/_cat/thread_pool?v\&h=host,bc,br,ba,bq,bs
   b. the logging-fluentd pod logs

Results (fluentd logs attached, let me know what other data is required):

At 106 namespaces with 106 pods (53 pods/node logging at 0.95 messages per second, 50 mps/node) we see bulk index rejections, but in the logging-fluentd logs the retries always succeed (running for 30 minutes). The buffers in /var/lib/fluentd fluctuate between 1 and 2 files but never grow beyond that.

At 108 namespaces with 108 pods (54 pods/node logging at 0.93333 messages per second, 50 mps/node) we again see bulk rejections which are initially retried successfully but eventually are not. The buffers in /var/lib/fluentd start growing and eventually logging-fluentd is wedged with no new messages reaching ES; the bulk rejections in ES grow every 5 minutes as the logging-fluentd pods retry sending the queued data.

A final experiment was to use the problematic number of pods in a single namespace: 1 namespace with 108 pods logging at the rate above (0.9333 mps). This resulted in all messages logged successfully and no bulk rejections whatsoever. This leads me to believe it is correlated with namespaces. The next experiment is to try a different message rate and see if the number of pods and namespaces is a "magic number" for this configuration. Logs will be attached.

It sounds like we may have finally hit the threshold at which Elasticsearch is the bottleneck. It doesn't seem to be a fluentd problem per se; that is, if Elasticsearch were able to process the bulk index requests faster, the fluentd retries would eventually succeed. But if fluentd cannot send records because Elasticsearch is never able to process them, then this isn't a fluentd problem, and we need to figure out how to scale Elasticsearch to handle the load.

My question re: comment 9 is why retries would stop succeeding?
Long after all pod logging activity is stopped, logging-fluentd continues to retry sending what is in the buffer and elasticsearch continues to reject them. In my current run, this is what is stuck in fluentd (no current pod logging going on - cluster is idle):

-rw-r--r--. 1 root root 8387939 Feb 8 16:46 buffer-output-es-config.output_tag.q564b62216838d714.log
-rw-r--r--. 1 root root 8387607 Feb 8 16:47 buffer-output-es-config.output_tag.q564b6270197954ea.log
-rw-r--r--. 1 root root 8387357 Feb 8 16:49 buffer-output-es-config.output_tag.q564b62bf01d22736.log
-rw-r--r--. 1 root root 8388198 Feb 8 16:50 buffer-output-es-config.output_tag.q564b630da2cfe393.log
-rw-r--r--. 1 root root 8387692 Feb 8 16:51 buffer-output-es-config.output_tag.q564b635c9e8c65d7.log
-rw-r--r--. 1 root root 8388540 Feb 8 16:53 buffer-output-es-config.output_tag.q564b63ab50011e8f.log
-rw-r--r--. 1 root root 8388457 Feb 8 16:54 buffer-output-es-config.output_tag.q564b63fa0e8974c5.log
-rw-r--r--. 1 root root 8387497 Feb 8 16:56 buffer-output-es-config.output_tag.q564b6448f153ae0c.log
-rw-r--r--. 1 root root 8387812 Feb 8 17:14 buffer-output-es-config.output_tag.q564b649a385490fd.log
-rw-r--r--. 1 root root 1319787 Feb 8 17:29 buffer-output-es-config.output_tag.b564b68d31fa41437.log

Why can't the retries clear these?

Created attachment 1393285 [details]
logging-fluentd logs just below and at problem levels
Followup to comment 10. Here are the bulk queue stats while in this state:

host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
172.21.0.97 1301919        6367          0          0           50
172.21.0.97 1301955        6367          0          0           50
172.21.0.97 1302039        6369          0          0           50
172.21.0.97 1302073        6369          0          0           50
172.21.0.97 1302102        6369          0          0           50
172.21.0.97 1302139        6369          0          0           50
172.21.0.97 1302167        6369          0          0           50
172.21.0.97 1302179        6369          0          0           50
172.21.0.97 1302257        6372          0          0           50

The rejected count goes up by 2 or 3 every 5 minutes, corresponding with the logging-fluentd retries. No bulk.queue entries though.

Created attachment 1393325 [details]
/elasticsearch/logging-es/logs contents
> Why can't the retries clear these?
Why does Elasticsearch keep returning BulkIndexRejection? Fluentd cannot clear these if every time it tries to submit an operation Elasticsearch returns with BulkIndexQueueFull.
So either this is a legitimate problem with Elasticsearch (for some reason it keeps rejecting requests from this fluentd), or there is some bug in the fluent-plugin-elasticsearch error handling which causes it to think Elasticsearch is returning an error when it is not.
So far, the evidence points to the former - fluentd is working normally, and it really is the case that Elasticsearch keeps rejecting the operations. The evidence is that the bulk rejections count keeps going up - Elasticsearch really is rejecting these bulk index requests.
It should be easy to rule out the case where fluentd says the bulk index request was rejected, but in actuality it really went through. Just do a search of Elasticsearch and look for a record which fluentd says did not go through.
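For example, something along these lines (a sketch, not from the thread; the project.* index pattern and the query string are placeholders for whatever record fluentd reported as rejected):

# count documents matching a unique string from a record fluentd claims was rejected;
# a non-zero count means the record actually made it into Elasticsearch
oc exec -c elasticsearch $espod -- es_util --query="project.*/_count?q=message:\"some-unique-test-string\"&pretty"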
There is a third option - fluentd tries to coalesce the retries into larger and fewer bulk index requests, and eventually hits a case where they are just too large for Elasticsearch to ever handle. fluentd shouldn't be doing this, but again perhaps there is a bug.
Please add this to the env vars in each DC:

name: ES_JAVA_OPTS
value: -Djavax.net.debug=all

I would like to see if we can resolve the issue you are seeing with the null certs.

Opened bug 1543684 to track the SSLHandshakeException. We can dup one of these if they turn out to be related. A log with javax.net.debug=all is attached to that bz.

I have been able to reproduce this problem in CI, during https://github.com/openshift/origin-aggregated-logging/pull/933 - fluentd (mux) keeps reporting

2018-02-10 01:59:44 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-10 02:04:44 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3f8c680bf200"

and the elasticsearch br count goes up when this happens:

# oc exec -c elasticsearch $espod -- es_util --query=/_cat/thread_pool?v\&h=bc,br,bs,bq
bc br bs bq
2118 116 1 0

Going to try running the CI tests with es.logger.level: DEBUG to see if I can get more information from ES when this happens.

Marking TestBlocker - cannot log across 1000 projects for 3.9 scalability testing.

See comment 21. 1000 projects with 1000 pods on 200 compute nodes sending logs at a rate of 20 messages/second/pod, which is an aggregate rate of 100 messages/second/node. 1000 pods in 200 projects sending messages at the same rate works fine.

Sorry, did not fully answer the questions in comment 23. Here goes again:

> 1000 projects with how many pods? pods per namespace?

See comment 21. 1000 projects with 1000 pods on 200 compute nodes sending logs at a rate of 20 messages/second/pod, which is an aggregate rate of 100 messages/second/node. 1000 pods in 200 projects sending messages at the same rate works fine.

> Are you saying that the latest result with 1000 projects is that you are getting many BulkIndexQueueFull messages and retries are not successful, or entries are otherwise being dropped?

Retries are successful and there are 20% extra duplicate entries in the indices due to https://bugzilla.redhat.com/show_bug.cgi?id=1491401. The duplicates are not present (or very few) for 1000 pods in 200 namespaces.

> Can you try this: <snip>

Will do.

Created attachment 1397070 [details]
logging-fluentd logs with debug set in <system> config
Created attachment 1398436 [details]
new logs with debug set and problem of stuck/wedged buffers reproduced

Attempting to channel this bug back towards the original problem report - logging-fluentd getting stuck with files in the /var/lib/fluentd buffer which never clear and have to be removed manually. The issue of increasing duplicates in the indices correlated with an increasing number of namespaces (comment 20 through comment 25) can be addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1491401

Captured logs with log_level debug as requested for 2 scenarios:

1. Fresh start of logging-fluentd while pods are logging. Aggregate rate per node is 50 1K messages/second. Eventually both logging-fluentd pods experience the issue and become "stuck" with buffers in /var/lib/fluentd which are never cleared.
2. Restart of a logging-fluentd pod with buffers stuck in /var/lib/fluentd. No pod logging occurs. Files are never cleared.

In both scenarios ES does show increasing bulk.rejected, but bulk.queue never shows a value. Sample:

host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
172.21.0.8 62256          314           0          0           50
172.21.0.8 63728          352           0          0           50
172.21.0.8 65204          358           0          0           50
172.21.0.8 65847          422           0          0           50
172.21.0.8 65901          436           0          0           50
172.21.0.8 66015          450           0          0           50

Files in the attachment in comment 27:

Correspond to scenario 1 in comment 27: logging-fluentd-4nq5z_startup_and_then_stuck.log, logging-fluentd-gqz4r_startup_and_then_stuck.log
Corresponds to scenario 2 in comment 27: logging-fluentd-w2vv4_buffers_stuck_restart.log

Try this:

oc set env ds/logging-fluentd ES_RETRY_WAIT=5

That will tell fluentd to retry at most every 5 seconds, rather than the default 300 (5 minutes). In my testing, a couple of times I've been able to get fluentd to back off all the way to 5 minutes, and it seems as if it is then never able to get its bulk request through to Elasticsearch on a heavily loaded server. Having to wait 5 minutes is probably a good thing for connection failures, but not for bulk index rejections; we may need different retry times for different error conditions. With a 5 second retry, I see more successful retry attempts.

This only seems to happen in an ES cluster with size=1. I cannot reproduce this with size=3.

@jcantrill - will your work on fluent-plugin-elasticsearch fix this problem?

Yes. Moving to a later version where we discard corrupt messages from the chunk should resolve this issue. We still need to:
* fix the 'success' case you found with success + duplicates
* configure the DLQ by adding an output handler for the @ERROR label, otherwise the messages are just dropped

Mike, can you confirm whether the latest version of the 3.9.z fluentd resolves this issue. We modified fluent-plugin-elasticsearch to move records which cannot be processed from the chunk to a DLQ.

@jcantril what should I look for to tell if the fix is present in the fluentd pod? I can still reproduce this on 3.9.27 (built on Apr 26):

registry.reg-aws.openshift.com:443/openshift3/logging-fluentd   v3.9   b4920dc470e0   6 hours ago   286 MB

The behavior is the same as originally reported: repeated retry messages without ever succeeding, even long after all pod traffic has ceased.
The "good pattern" is to see "retry succeeded message" interspersed with the "retrying" messages. 2018-04-27 01:05:25 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:25 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/elasticsearch_error_handler.rb:73:in `block in handle_error' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/elasticsearch_error_handler.rb:65:in `each_key' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/elasticsearch_error_handler.rb:65:in `handle_error' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/out_elasticsearch.rb:415:in `send_bulk' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/out_elasticsearch.rb:332:in `write_objects' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush' 2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run' 2018-04-27 01:05:26 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:27 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:05:26 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:05:27 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:31 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned errors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:05:27 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:05:31 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:39 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned errors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:05:31 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:05:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:57 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned errors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:05:40 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:05:58 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:06:33 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. 
Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:05:58 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:06:33 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:07:41 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:06:33 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:07:41 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:09:45 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:07:41 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:09:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:13:55 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:09:46 +0000 [warn]: suppressed same stacktrace 2018-04-27 01:13:56 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:18:55 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120" 2018-04-27 01:13:56 +0000 [warn]: suppressed same stacktrace You might be running into the case fixed in 1.15.1 by this: https://github.com/uken/fluent-plugin-elasticsearch/commit/d5862bee958c8c7926d310d18a65873e7239ae38 What happens is that during a retry, it may be the case that an error is returned by Elasticsearch when there is a mix of successful retries and duplicates. In that case, the entire operation was successful, but the plugin would still return Fluent::ElasticsearchErrorHandler::ElasticsearchError. I think we need to upgrade to 1.15.1 asap, or 1.15.2 which includes another minor fix. But I think the real solution to the retries will be in https://github.com/uken/fluent-plugin-elasticsearch/pull/405 and https://github.com/jcantrill/fluent-plugin-elasticsearch/pull/1 Moving to ASSIGNED based on comment 38 and comment 41. A new build of fluent is required. This will include 1.16.1 of the fluentd-elasticsearch-plugin which further provides changes for: * DLQ records that will never be able to be processed * Retry logic for bulk insert failures for un-indexed records. This issue was fixed in 3.9.z and should remain CLOSED. Please clone this issue or create a separate bz for an 3.7 issues. Thank you |
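One way to answer the earlier question about whether the fix is present in a given fluentd pod (a sketch; fluentd-pod is a placeholder, and it assumes either the gem command is available inside the image or the plugin is delivered as an RPM):

# show the installed fluent-plugin-elasticsearch version inside the fluentd pod
oc exec <fluentd-pod> -- gem list fluent-plugin-elasticsearch
# or, if the plugin is packaged as an RPM:
oc exec <fluentd-pod> -- rpm -qa | grep fluent-plugin-elasticsearch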