Bug 1541429 - logging-fluentd gets wedged in a state where retries do not succeed after bulk index queue full error
Summary: logging-fluentd gets wedged in a state where retries do not succeed after bulk index queue full error
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Rich Megginson
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-39
Depends On:
Blocks: 1596716
 
Reported: 2018-02-02 14:35 UTC by Mike Fiedler
Modified: 2021-12-10 15:38 UTC
CC List: 9 users

Fixed In Version: logging-fluentd-container-v3.9.30-1
Doc Type: Bug Fix
Doc Text:
Cause: When fluentd submits a bulk index request to Elasticsearch and the request is rejected, fluentd waits and then resubmits the entire bulk request, even if some of the individual operations in it succeeded. If the request is rejected again, fluentd backs off exponentially until it is retrying only once every 5 minutes.
Consequence: If Elasticsearch is very busy, the user notices that records do not show up.
Fix: fluentd now walks the bulk index error response and processes each individual operation result. If the operation succeeded, or was a duplicate, fluentd discards the record. If the operation failed with a "hard" error, fluentd stores the record in a "dead letter queue" - a file the user must examine to determine what can be done about the "bad" records. If the operation failed with a "soft" error, fluentd resubmits the record to Elasticsearch.
Result: fluentd no longer overwhelms Elasticsearch by retrying operations that have already succeeded, which reduces the processing load, and it no longer keeps retrying operations that will never succeed, which keeps the other records flowing into Elasticsearch.
Clone Of:
: 1596716 (view as bug list)
Environment:
Last Closed: 2018-06-29 14:00:46 UTC
Target Upstream Version:
Embargoed:


Attachments
logging-fluentd and elasticsearch logs (10.95 KB, application/x-gzip) - 2018-02-02 14:35 UTC, Mike Fiedler
buffer files (17.74 MB, application/x-gzip) - 2018-02-02 14:36 UTC, Mike Fiedler
logging-fluentd logs just below and at problem levels (6.88 KB, application/x-gzip) - 2018-02-08 17:33 UTC, Mike Fiedler
/elasticsearch/logging-es/logs contents (166.14 KB, application/x-gzip) - 2018-02-08 18:00 UTC, Mike Fiedler
logging-fluentd logs with debug set in <system> config (292.71 KB, application/x-gzip) - 2018-02-16 15:12 UTC, Mike Fiedler
new logs with debug set and problem of stuck/wedged buffers reproduced (39.86 KB, application/x-gzip) - 2018-02-20 20:32 UTC, Mike Fiedler

Description Mike Fiedler 2018-02-02 14:35:50 UTC
Created attachment 1390199 [details]
logging-fluentd and elasticsearch logs

Description of problem:

logging-fluentd in 3.9.0-0.34.0 is getting into a state where it has queued buffers in /var/lib/fluentd that it can never send to Elasticsearch.  The pattern in the logging-fluentd logs is:

2018-02-02 12:45:37 +0000 [warn]: retry succeeded. plugin_id="object:3fc8fe2e77e4"                       
2018-02-02 12:50:43 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:44 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"          
  2018-02-02 12:50:43 +0000 [warn]: suppressed same stacktrace                                           
2018-02-02 12:50:44 +0000 [warn]: retry succeeded. plugin_id="object:3fc8fe2e77e4"                       
2018-02-02 12:50:45 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:46 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"          
  2018-02-02 12:50:45 +0000 [warn]: suppressed same stacktrace                                           
2018-02-02 12:50:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:48 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"          
  2018-02-02 12:50:46 +0000 [warn]: suppressed same stacktrace                                           
2018-02-02 12:50:48 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:52 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"          
  2018-02-02 12:50:48 +0000 [warn]: suppressed same stacktrace                                           
2018-02-02 12:50:52 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:51:00 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"          

Once logging-fluentd gets into this state, it never clears without manually removing the queued buffers in /var/lib/fluentd.  My guess is that something in the content of the queued buffers is being rejected by Elasticsearch, since the Elasticsearch bulk index statistics do not show any entries in the queues.

This occurs even at very low message rates.  In the case of the attached logs and buffer files, there were 100 pods on the node logging at an aggregate rate of 25 messages per second (0.25 messages/second/pod x 100 pods).  Each message is 1 KB.

Clearing the queued buffers and restarting logging-fluentd allows messages to flow again.  There does not appear to be anything wedged on the Elasticsearch side - in the case of the attached logs, logging-fluentd-f6gn4 is wedged but logging-fluentd-hjjvz is not; it is still sending logs to ES successfully.


Version-Release number of selected component (if applicable): Logging v3.9.0-0.34.0


How reproducible: Very


Steps to Reproduce:
1.  Install logging with an inventory similar to the one below.  logging-fluentd is sending directly to ES (no mux).  There is 1 master, 1 infra node with ES and Kibana, and 2 compute nodes running logging-fluentd.
2.  Use the svt logging test harness to create 200 projects, each with 1 pod logging at 15 messages per minute (0.25/second).  The svt test harness is here:  https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md

The configuration is:  

projects:
  - num: 200
    basename: logtest
    ifexists: delete
    tuning: default
    templates:
      - num: 1
        file: ./content/logtest/logtest-rc.json
        parameters:
         - REPLICAS: "1"
         - INITIAL_FLAGS: "--num-lines 250  --line-length 1024 --word-length 9 --rate 15 --fixed-line\n"

If it doesn't happen at a rate of 15, it will happen for sure at a rate of 30.

3. Monitor /var/lib/fluentd on the compute nodes (see the sketch below)
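
A minimal way to watch the buffer directory, assuming shell access to the compute nodes (the interval and iteration count are arbitrary):

# Watch the fluentd buffer directory on a compute node.  Wedged buffers show up
# as buffer-output-es-config.*.log files that keep accumulating and never drain.
for i in $(seq 1 120); do
    date
    ls -l /var/lib/fluentd/
    sleep 30
done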

Actual results:

Eventually the buffer files will start to grow and never clear.  Monitoring the indices on Elasticsearch, eventually there will be no new entries.

Start and stop logging-fluentd without clearing /var/lib/fluentd.  The buffer files in /var/lib/fluentd will not be cleared and fluentd will still be wedged.

Clear the files and restart fluentd, and fluentd resumes normal operation.
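
For reference, the manual recovery described above looks roughly like this (a sketch; <node-name> is a placeholder, it assumes the standard logging-infra-fluentd node label used by the fluentd daemonset, and it throws away the queued log data):

# stop the fluentd pod on the affected node, discard the stuck buffer chunks, restart fluentd
oc label node <node-name> logging-infra-fluentd=false --overwrite
rm -f /var/lib/fluentd/buffer-output-es-config.*      # run on the node itself; deletes the queued chunks
oc label node <node-name> logging-infra-fluentd=true --overwrite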


Expected results:

logging-fluentd can operate normally for 100 pods across 100 namespaces at rates much higher than 25 messages/second.

Additional info:

Logging inventory:

[OSEv3:children]                             
masters                                      
etcd                                         

[masters]                                    
ip-172-31-6-38

[etcd]                                       
ip-172-31-6-38

[OSEv3:vars]                                 
deployment_type=openshift-enterprise         

openshift_deployment_type=openshift-enterprise                                             
openshift_release=v3.9                       
openshift_docker_additional_registries=registry.reg-aws.openshift.com                      


openshift_logging_install_logging=true       
openshift_logging_master_url=https://ec2-54-68-197-206.us-west-2.compute.amazonaws.com:8443
openshift_logging_master_public_url=https://ec2-54-68-197-206.us-west-2.compute.amazonaws.com:8443
openshift_logging_kibana_hostname=kibana.apps.0131-zyo.qe.rhcloud.com                      
openshift_logging_namespace=logging          
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/              
openshift_logging_image_version=v3.9         
openshift_logging_es_cluster_size=1          
openshift_logging_es_pvc_dynamic=true        
openshift_logging_es_pvc_size=50Gi           
openshift_logging_es_pvc_storage_class_name=gp2                                            
openshift_logging_fluentd_read_from_head=false                                             
openshift_logging_use_mux=false

Comment 1 Mike Fiedler 2018-02-02 14:36:52 UTC
Created attachment 1390200 [details]
buffer files

/var/lib/fluentd contents from the wedged logging-fluentd

Comment 2 Mike Fiedler 2018-02-02 14:38:03 UTC
Please advise  what additional trace/log/documentation is required.

Comment 3 Rich Megginson 2018-02-02 15:37:55 UTC
I switched fluent-plugin-elasticsearch to the upstream released 1.x version which looked like it included the bulk index rejection handling code created by Peter in our "private" 1.9.5.1 release.  But it may be that the upstream 1.x used slightly different code that doesn't work, in which case we may have to switch back to 1.9.5.1 until we can get the proper fix into upstream.

Comment 4 Rich Megginson 2018-02-03 23:19:54 UTC
I have a simple test which causes fluentd to handle Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull.  It starts a few curl processes in parallel that hammer Elasticsearch with bulk index requests, and sets the Elasticsearch bulk queue sizes artificially low.  Here is an example of the stats during the run.  There are 30 completed requests and 17 rejected.

curl_es $espod /_cat/thread_pool?v\&h=bc,br,ba,bq,bs
bc br ba bq bs
30 17  0  0  1

The test writes 50 container logs (json-file) and 50 system logs (journald) while fluentd is in this state.

So far, I have not been able to reproduce this bz.  I'm using fluent-plugin-elasticsearch 1.13.1, but there's hardly any difference between 1.13.0 and 1.13.1.

It could be a problem that takes a while to show up.  From the fluentd log:

2018-02-02 12:45:37 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:45:37 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
2018-02-02 12:45:37 +0000 [warn]: retry succeeded. plugin_id="object:3fc8fe2e77e4"

This retry succeeded almost immediately.  This is typically what I see in my testing.

2018-02-02 12:50:43 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:44 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
2018-02-02 12:50:44 +0000 [warn]: retry succeeded. plugin_id="object:3fc8fe2e77e4"

It took 5 minutes for the condition to happen again, and again it succeeded almost immediately.

2018-02-02 12:50:45 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 12:50:46 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"
...
2018-02-02 13:49:04 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-02 13:54:04 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3fc8fe2e77e4"

Then the problem happened again almost immediately after the retry succeeded, and then it was stuck.

I don't think it has anything to do with the contents of the buffer, unless someone is reporting the wrong error message (es or fluent-plugin-es).  To me, it looks like it keeps retrying, and Elasticsearch keeps replying that the bulk index queue is full, which could be a case of bad timing.

The other fluentd log is truncated here:

2018-02-02 12:38:34 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not

Is that really the last line?

The Elasticsearch pod log isn't very useful - we need the log file from inside the pod at /elasticsearch/logging-es/logs/logging-es.log.

Also, monitor the bulk index queue stats, e.g.:

espod=$(get_es_pod)            # name of the Elasticsearch pod
# esopspod=$(get_es_ops_pod)   # if using an ops cluster
ii=0
while [ $ii -lt 3600 ] ; do
    date
    oc exec -c elasticsearch $espod -- es_util --query=/_cat/thread_pool?v\&h=host,bc,br,ba,bq,bs
    #oc exec -c elasticsearch $esopspod -- es_util --query=/_cat/thread_pool?v\&h=host,bc,br,ba,bq,bs
    sleep 1
    ii=$((ii+1))               # without this increment the loop never terminates
done

If the rejection rate isn't going up, but fluentd is still reporting the error, then there is a bug in the code.

Comment 7 Rich Megginson 2018-02-08 14:52:35 UTC
https://github.com/openshift/origin-aggregated-logging/pull/933
NOTE: This isn't a fix for the bug, but an attempt to create a CI test to force fluentd to handle bulk index rejection, retry, and eventual success in sending messages.

Comment 8 Mike Fiedler 2018-02-08 15:58:55 UTC
Through some experiments, I believe this is related to the number of namespaces involved.   More narrowing down has to be done, but I've bisected a particular scenario fairly well.

0.  Configuration is 1 master, 1 infra, 2 computes.  ES is on the infra, logging-fluentd on the computes

1. Pick an arbitrary message rate logging-fluentd is known to be able to handle comfortably from past release testing.  I used 50 1 KB messages/second/node.  In the tests below, each of the 2 compute nodes had pods logging at this aggregate rate.

2. Demonstrate that 50 mps/node works for 1 pod per node (each pod logging at 50 mps).  All messages are received in ES, and no buffers are stuck in /var/lib/fluentd.

3. Repeat, increasing the number of projects and pods per node while decreasing the message rate to keep the overall rate at 50 mps per node.  Example:

   4 namespaces with 4 pods = 2 pods per node logging at 25 mps each = 50 mps/node
   8 namespaces with 8 pods = 4 pods per node logging at 12.5 mps each = 50 mps/node

4. Keep going until bulk index rejection messages are seen in
   a. oc exec -c elasticsearch $esopspod -- es_util --query=/_cat/thread_pool?v\&h=host,bc,br,ba,bq,bs
   b. the logging-fluentd pod logs (see the check sketched below)
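
For step 4b, a quick check along these lines can be used (a sketch; the component=fluentd label is assumed from a standard openshift_logging install):

# count bulk index rejections reported by each logging-fluentd pod
for pod in $(oc get pods -n logging -l component=fluentd -o jsonpath='{.items[*].metadata.name}'); do
    echo -n "$pod: "
    oc logs -n logging $pod | grep -c 'BulkIndexQueueFull'
done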

=====================================
Results (fluentd logs attached, let me know what other data is required)

At 106 namespaces with 106 pods: 53 pods/node logging at 0.95 messages per second (50 mps/node), we see bulk index rejections, but in the logging-fluentd logs the retries always succeed (running for 30 minutes).  The buffers in /var/lib/fluentd fluctuate between 1 and 2 but never grow beyond that.

At 108 namespaces with 108 pods: 54 pods/node logging at 0.93333 messages per second (50 mps/node), we again see bulk rejections, which are initially retried successfully, but eventually the retries stop succeeding.

The buffers in /var/lib/fluentd start growing and eventually logging-fluentd is wedged with no new messages reaching ES - the bulk rejections in ES grow every 5 minutes as the logging-fluentd pods retry sending the queued data.

===========

A final experiment was to use the problematic number of pods in a single namespace: 1 namespace with 108 pods logging at the rate above (0.9333 mps).  This resulted in all messages being logged successfully and no bulk rejections whatsoever.  This leads me to believe the problem is correlated with the number of namespaces.

The next experiment is to try a different message rate and see if the number of pods and namespaces is a "magic number" for this configuration.

Logs will be attached.

Comment 9 Rich Megginson 2018-02-08 17:26:26 UTC
It sounds like we may have finally hit the threshold at which Elasticsearch is the bottleneck.  It doesn't seem that it is a fluentd problem per se - that is, if Elasticsearch were able to process the bulk index requests faster, the fluentd retries would eventually succeed.  But if fluentd cannot send records because Elasticsearch is never able to process them, then this isn't a fluentd problem, and we need to figure out how to scale Elasticsearch to be able to handle the load.

Comment 10 Mike Fiedler 2018-02-08 17:30:33 UTC
My question re: comment 9 is why the retries stop succeeding.  Long after all pod logging activity has stopped, logging-fluentd continues to retry sending what is in the buffer, and Elasticsearch continues to reject it.  In my current run, this is what is stuck in fluentd (no pod logging going on - the cluster is idle):

-rw-r--r--. 1 root root 8387939 Feb  8 16:46 buffer-output-es-config.output_tag.q564b62216838d714.log                                   
-rw-r--r--. 1 root root 8387607 Feb  8 16:47 buffer-output-es-config.output_tag.q564b6270197954ea.log                                   
-rw-r--r--. 1 root root 8387357 Feb  8 16:49 buffer-output-es-config.output_tag.q564b62bf01d22736.log                                   
-rw-r--r--. 1 root root 8388198 Feb  8 16:50 buffer-output-es-config.output_tag.q564b630da2cfe393.log                                   
-rw-r--r--. 1 root root 8387692 Feb  8 16:51 buffer-output-es-config.output_tag.q564b635c9e8c65d7.log                                   
-rw-r--r--. 1 root root 8388540 Feb  8 16:53 buffer-output-es-config.output_tag.q564b63ab50011e8f.log                                   
-rw-r--r--. 1 root root 8388457 Feb  8 16:54 buffer-output-es-config.output_tag.q564b63fa0e8974c5.log                                   
-rw-r--r--. 1 root root 8387497 Feb  8 16:56 buffer-output-es-config.output_tag.q564b6448f153ae0c.log                                   
-rw-r--r--. 1 root root 8387812 Feb  8 17:14 buffer-output-es-config.output_tag.q564b649a385490fd.log                                   
-rw-r--r--. 1 root root 1319787 Feb  8 17:29 buffer-output-es-config.output_tag.b564b68d31fa41437.log 

Why can't the retries clear these?

Comment 11 Mike Fiedler 2018-02-08 17:33:56 UTC
Created attachment 1393285 [details]
logging-fluentd logs just below and at problem levels

Comment 12 Mike Fiedler 2018-02-08 17:44:07 UTC
Followup to comment 10.   Here are the bulk queue stats while in this state:

host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1301919          6367          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1301955          6367          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302039          6369          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302073          6369          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302102          6369          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302139          6369          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302167          6369          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302179          6369          0           0             50                                                           
host        bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                           
172.21.0.97        1302257          6372          0           0             50 

The bulk.rejected count goes up by 2 or 3 every 5 minutes, corresponding to the logging-fluentd retries.  No bulk.queue entries, though.

Comment 13 Mike Fiedler 2018-02-08 18:00:11 UTC
Created attachment 1393325 [details]
/elasticsearch/logging-es/logs contents

Comment 14 Rich Megginson 2018-02-08 18:17:27 UTC
> Why can't the retries clear these?

Why does Elasticsearch keep returning BulkIndexRejection?  Fluentd cannot clear these if, every time it tries to submit an operation, Elasticsearch returns BulkIndexQueueFull.

So either this is a legitimate problem with Elasticsearch - for some reason it keeps rejecting requests from this fluentd - or there is some bug in the fluent-plugin-elasticsearch error handling which causes it to think Elasticsearch is returning an error when it is not.

So far, the evidence points to the former: fluentd is working normally, and Elasticsearch really is rejecting these bulk index requests - the bulk rejections count keeps going up.

It should be easy to rule out the case where fluentd says the bulk index request was rejected, but in actuality it really went through.  Just do a search of Elasticsearch and look for a record which fluentd says did not go through.
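
For example, a spot check along these lines (a sketch; the index pattern and search token are illustrative, and espod holds the Elasticsearch pod name as in the earlier monitoring loop):

# look for a unique token from a record that fluentd claims was rejected;
# a nonzero count means the record actually made it into Elasticsearch
oc exec -c elasticsearch $espod -- es_util --query='project.logtest0.*/_count?q=message:sometoken123'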

There is a third option - fluentd tries to coalesce the retries into larger and fewer bulk index requests, and eventually hits a case where they are just too large for Elasticsearch to ever handle.  fluentd shouldn't be doing this, but again perhaps there is a bug.

Comment 15 Jeff Cantrill 2018-02-08 21:45:47 UTC
Please add this to the env vars in each DC:

name: ES_JAVA_OPTS
value: -Djavax.net.debug=all

I would like to see if we can resolve the issue you are seeing with the null certs.
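
Something along these lines should set it (a sketch; the DC name is illustrative - use whatever oc get dc -n logging shows for the ES deployment - and if ES_JAVA_OPTS is already set on the DC, the existing value may need to be preserved):

# add the SSL debug flag to the Elasticsearch deployment config; this triggers a redeploy
oc set env -n logging dc/logging-es-data-master-abc123 ES_JAVA_OPTS='-Djavax.net.debug=all'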

Comment 16 Mike Fiedler 2018-02-09 01:42:16 UTC
Opened bug 1543684 to track the SSLHandshakeException.   We can dup one of these if they turn out to be related. Log with javax.net.debug=all is attached to that bz.

Comment 19 Rich Megginson 2018-02-10 02:07:30 UTC
I have been able to reproduce this problem in CI, during https://github.com/openshift/origin-aggregated-logging/pull/933

Fluentd (mux) keeps reporting 

2018-02-10 01:59:44 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-02-10 02:04:44 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:3f8c680bf200"

and the Elasticsearch br (bulk.rejected) count goes up when this happens:

# oc exec -c elasticsearch $espod -- es_util --query=/_cat/thread_pool?v\&h=bc,br,bs,bq
  bc  br bs bq 
2118 116  1  0 

Going to try running the CI tests with es.logger.level: DEBUG to see if I can get more information from ES when this happens.

Comment 22 Mike Fiedler 2018-02-16 02:42:47 UTC
Marking TestBlocker - cannot log across 1000 projects for 3.9 scalability testing.

Comment 24 Mike Fiedler 2018-02-16 13:56:27 UTC
See comment 21.  1000 projects with 1000 pods on 200 compute nodes, sending logs at a rate of 20 messages/second/pod, which is an aggregate rate of 100 messages/second/node.

1000 pods in 200 projects sending messages at the same rate works fine.

Comment 25 Mike Fiedler 2018-02-16 14:27:13 UTC
Sorry, did not fully answer the questions in comment 23.  Here goes again:

> 1000 projects with how many pods?  pods per namespace?

See comment 21.  1000 projects with 1000 pods on 200 compute nodes, sending logs at a rate of 20 messages/second/pod, which is an aggregate rate of 100 messages/second/node.

1000 pods in 200 projects sending messages at the same rate works fine.

> Are you saying that the latest result with 1000 projects is that you are getting
> many BulkIndexQueueFull messages and retries are not successful, or entries are
> otherwise being dropped?

Retries are successful and there are 20% extra duplicate entries in the indices due to https://bugzilla.redhat.com/show_bug.cgi?id=1491401.   The duplicates are not present (or very few) for 1000 pods in 200 namespaces.

> Can you try this:

<snip>

Will do.

Comment 26 Mike Fiedler 2018-02-16 15:12:29 UTC
Created attachment 1397070 [details]
logging-fluentd logs with debug set in <system> config

Comment 27 Mike Fiedler 2018-02-20 20:32:01 UTC
Created attachment 1398436 [details]
new logs with debug set and problem of stuck/wedged buffers reproduced

Attempting to channel this bug back towards the original problem report - logging-fluentd getting stuck with files in the /var/lib/fluentd buffer directory which never clear and have to be removed manually.  The issue of increasing duplicates in the indices, correlated with an increasing number of namespaces (comment 20 through comment 25), can be addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1491401

Captured logs with log_level debug as requested for 2 scenarios.

1.  Fresh start of logging-fluentd while pods are logging.  Aggregate rate per node is 50 1K messages/second.  Eventually both logging-fluentd pods experience the issue and become "stuck" with buffers in /var/lib/fluentd which are never cleared.

2.  Restart logging-fluentd pod with buffers stuck in /var/lib/fluentd.  No pod logging occurs.   Files are never cleared.

In both scenarios ES does show increasing bulk.rejected, but bulk.queue stays at 0.  Sample:

host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                                
172.21.0.8          62256           314          0           0             50                                                                
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                                
172.21.0.8          63728           352          0           0             50                                                                
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                                
172.21.0.8          65204           358          0           0             50                                                                
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                                
172.21.0.8          65847           422          0           0             50                                                                
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                                
172.21.0.8          65901           436          0           0             50                                                                
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize                                                                
172.21.0.8          66015           450          0           0             50

Comment 28 Mike Fiedler 2018-02-20 20:33:41 UTC
Files in the attachment in comment 27

These correspond to scenario 1 in comment 27:

logging-fluentd-4nq5z_startup_and_then_stuck.log                      
logging-fluentd-gqz4r_startup_and_then_stuck.log 

This corresponds to scenario 2 in comment 27:
                    
logging-fluentd-w2vv4_buffers_stuck_restart.log

Comment 30 Rich Megginson 2018-02-21 01:12:36 UTC
Try this: 

oc set env ds/logging-fluentd ES_RETRY_WAIT=5

That will tell fluentd to retry at most every 5 seconds, rather than the default 300 (5 minutes).

In my testing, a couple of times I've been able to get fluentd to back off all the way to 5 minutes, and it seems as if it is never able to get its bulk request through to Elasticsearch on a heavily loaded server.  Waiting up to 5 minutes between retries is probably a good thing for connection failures, but not for bulk index rejections.  We may need different retry times for different error conditions.

Comment 31 Rich Megginson 2018-02-21 01:13:10 UTC
With a 5 second retry, I see more successful retry attempts.

Comment 34 Mike Fiedler 2018-03-12 19:56:06 UTC
This only seems to happen in an ES cluster with size=1.  I cannot reproduce this with size=3.

Comment 35 Rich Megginson 2018-04-18 21:57:42 UTC
@jcantrill - will your work on fluent-plugin-elasticsearch fix this problem?

Comment 36 Jeff Cantrill 2018-04-20 18:42:30 UTC
Yes.  Moving to a later version where we discard corrupt messages from the chunk should resolve this issue.  We still need to:

* fix the 'success' case you found with success + duplicates
* configure the DLQ by adding an output handler for the @ERROR label, otherwise the messages are just dropped

Comment 37 Jeff Cantrill 2018-04-25 19:22:13 UTC
Mike, can you confirm whether the latest version of the 3.9.z fluentd resolves this issue?  We modified fluent-plugin-elasticsearch to move records that cannot be processed from the chunk to a DLQ.

Comment 38 Mike Fiedler 2018-04-27 01:27:29 UTC
@jcantril what should I look for to tell if the fix is present in the fluentd pod?  I can still reproduce this on 3.9.27 (built on Apr 26):

registry.reg-aws.openshift.com:443/openshift3/logging-fluentd             v3.9                b4920dc470e0        6 hours ago         286 MB
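
One way to check which plugin version is in the image (a sketch; the pod name is illustrative, and the gem path is the one that appears in the stack traces below):

# list the installed fluent-plugin-elasticsearch version inside a running fluentd pod
oc exec -n logging logging-fluentd-xxxxx -- ls /usr/share/gems/gems/ | grep fluent-plugin-elasticsearch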

Comment 40 Mike Fiedler 2018-04-27 02:06:19 UTC
The behavior is the same as originally reported: repeated retry messages without ever succeeding, even long after all pod traffic has ceased.  The "good" pattern is to see "retry succeeded" messages interspersed with the "retrying" messages.

2018-04-27 01:05:25 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:25 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"                                                                                                    
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/elasticsearch_error_handler.rb:73:in `block in handle_error'                                         
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/elasticsearch_error_handler.rb:65:in `each_key'                                                      
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/elasticsearch_error_handler.rb:65:in `handle_error'                                                  
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/out_elasticsearch.rb:415:in `send_bulk'                                                              
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.15.0/lib/fluent/plugin/out_elasticsearch.rb:332:in `write_objects'                                                          
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'                                                                                                       
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'                                                                                                 
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'                                                                                                         
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'                                                                                                   
  2018-04-27 01:05:25 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'                                                                                                         
2018-04-27 01:05:26 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:27 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"                                                                                                    
  2018-04-27 01:05:26 +0000 [warn]: suppressed same stacktrace                                                                                                                                                     
2018-04-27 01:05:27 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:31 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned errors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"                                                                                                    
  2018-04-27 01:05:27 +0000 [warn]: suppressed same stacktrace                                           
2018-04-27 01:05:31 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:39 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned errors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"                                                                                                    
  2018-04-27 01:05:31 +0000 [warn]: suppressed same stacktrace                                           
2018-04-27 01:05:40 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:05:57 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned errors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"
  2018-04-27 01:05:40 +0000 [warn]: suppressed same stacktrace
2018-04-27 01:05:58 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:06:33 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"
  2018-04-27 01:05:58 +0000 [warn]: suppressed same stacktrace
2018-04-27 01:06:33 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:07:41 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"
  2018-04-27 01:06:33 +0000 [warn]: suppressed same stacktrace
2018-04-27 01:07:41 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:09:45 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"
  2018-04-27 01:07:41 +0000 [warn]: suppressed same stacktrace
2018-04-27 01:09:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:13:55 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"
  2018-04-27 01:09:46 +0000 [warn]: suppressed same stacktrace
2018-04-27 01:13:56 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 01:18:55 +0000 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="Elasticsearch returned er
rors, retrying. Add '@log_level debug' to your config to see the full response" plugin_id="object:3fdd21d75120"
  2018-04-27 01:13:56 +0000 [warn]: suppressed same stacktrace

Comment 41 Rich Megginson 2018-04-27 02:35:24 UTC
You might be running into the case fixed in 1.15.1 by this: https://github.com/uken/fluent-plugin-elasticsearch/commit/d5862bee958c8c7926d310d18a65873e7239ae38

What happens is that during a retry, it may be the case that an error is returned by Elasticsearch when there is a mix of successful retries and duplicates.  In that case, the entire operation was successful, but the plugin would still return Fluent::ElasticsearchErrorHandler::ElasticsearchError.

I think we need to upgrade to 1.15.1 asap, or 1.15.2 which includes another minor fix.

But I think the real solution to the retries will be in https://github.com/uken/fluent-plugin-elasticsearch/pull/405 and https://github.com/jcantrill/fluent-plugin-elasticsearch/pull/1

Comment 42 Mike Fiedler 2018-04-27 14:29:44 UTC
Moving to ASSIGNED based on comment 38 and comment 41.

Comment 43 Jeff Cantrill 2018-05-11 14:40:09 UTC
A new build of fluentd is required.  This will include 1.16.1 of fluent-plugin-elasticsearch, which further provides changes for:

* DLQ records that will never be able to be processed
* Retry logic for bulk insert failures for un-indexed records.

Comment 49 Mike Fiedler 2018-06-29 14:00:46 UTC
This issue was fixed in 3.9.z and should remain CLOSED.  Please clone this issue or create a separate bz for any 3.7 issues.  Thank you.

