Bug 1408633

Summary: [RFE] fluentd temporarily failed to flush the buffer
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.0
Status: CLOSED NOTABUG
Severity: low
Priority: low
Reporter: Xia Zhao <xiazhao>
Assignee: Peter Portante <pportant>
QA Contact: Xia Zhao <xiazhao>
CC: aos-bugs, jcantril, juzhao, nhosoi, pportant, rmeggins, xiazhao
Keywords: FutureFeature, RFE
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Clones: 1479651 (view as bug list)
Bug Blocks: 1479651, 1479679, 1539994
Last Closed: 2017-10-09 21:25:17 UTC
Attachments:
fluentd_log
es, fluentd pod log
fluentd log, ocp 3.3.1.20
fluentd log, logging-fluentd:3.3.1-21
additional warning info in fluentd pod log

Description Xia Zhao 2016-12-26 06:35:29 UTC
Created attachment 1235301 [details]
fluentd_log

Description of problem:
During 3.4.0 testing on GCE we occasionally see these error stacks in the fluentd logs, but we are still not sure how to reproduce them:

$ oc logs -f logging-fluentd-3sby6
2016-12-26 00:33:20 -0500 [info]: reading config file path="/etc/fluent/fluent.conf"
2016-12-26 00:34:00 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-12-26 00:33:42 -0500 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1ab41e4"
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:66:in `client'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:202:in `rescue in send_bulk'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:200:in `send_bulk'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:193:in `block in write_objects'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `each'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `write_objects'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:480:in `write'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:354:in `write_chunk'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:333:in `pop'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:338:in `try_flush'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:149:in `run'
2016-12-26 00:34:00 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-12-26 00:33:41 -0500 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1a6c72c"
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:66:in `client'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:202:in `rescue in send_bulk'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:200:in `send_bulk'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:193:in `block in write_objects'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `each'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `write_objects'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:480:in `write'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:354:in `write_chunk'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:333:in `pop'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:338:in `try_flush'
  2016-12-26 00:34:00 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:149:in `run'
2016-12-26 00:34:01 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-12-26 00:33:44 -0500 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1ab41e4"
  2016-12-26 00:34:01 -0500 [warn]: suppressed same stacktrace
2016-12-26 00:34:02 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-12-26 00:33:44 -0500 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1a6c72c"
  2016-12-26 00:34:02 -0500 [warn]: suppressed same stacktrace
2016-12-26 00:34:03 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2016-12-26 00:33:47 -0500 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1ab41e4"
  2016-12-26 00:34:03 -0500 [warn]: suppressed same stacktrace

At last, the retry succeeded:

2016-12-26 00:34:46 -0500 [warn]: retry succeeded. plugin_id="object:1ab41e4"
2016-12-26 00:34:51 -0500 [warn]: retry succeeded. plugin_id="object:1a6c72c"

Severity is set to low since we didn't notice any impact on logging functionality.


Version-Release number of selected component (if applicable):
(ops registry):
openshift3/logging-auth-proxy    e96b37a99960
openshift3/logging-kibana    27f978fc2946
openshift3/logging-fluentd    c493f8b4553b
openshift3/logging-elasticsearch    3ca95a8a9433
openshift3/logging-curator    e39988877cd9
openshift3/logging-deployer    1033ccb0557b

openshift version:
openshift v3.4.0.38
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
3-4 times on GCE platform

Steps to Reproduce:
1. Please refer to "Description of problem"

Actual results:
Sometimes fluentd temporarily fails to flush the buffer.

Expected results:
Fluentd does not need to emit full error stacks when it temporarily fails to flush the buffer and then recovers.

Additional info:
full log of fluentd attached

Comment 1 Rich Megginson 2017-01-03 18:35:05 UTC
Is elasticsearch being restarted at these times?  Or perhaps it has not been started yet?  Is the problem here that you do not want to see a full stacktrace when this condition occurs?

Comment 2 Xia Zhao 2017-01-05 08:07:05 UTC
(In reply to Rich Megginson from comment #1)
> Is elasticsearch being restarted at these times?  Or perhaps it has not been
> started yet?  Is the problem here that you do not want to see a full
> stacktrace when this condition occurs?

Hi Rich,

I encountered this while configuring the fluentd secure-forward settings according to the steps in [1]. Fluentd was restarted during my operations, but the embedded logging ES was up and was not restarted.
I reported it since I saw similar error stacks in [2], and I would prefer not to see a full stacktrace that appears to indicate a real problem when in fact there isn't one.

[1]https://github.com/openshift/origin-aggregated-logging/blob/master/hack/testing/test-fluentd-forward.sh
[2]https://bugzilla.redhat.com/show_bug.cgi?id=1395448

Thanks,
Xia

Comment 4 Xia Zhao 2017-03-27 09:06:13 UTC
Issue reproduced on the latest 3.5.0 env:

# openshift version
openshift v3.5.5
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Image tested with:
openshift3/logging-fluentd         3.5.0               32a4ac0a3e18

# oc logs logging-fluentd-l5102
2017-03-27 08:08:47 +0000 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-03-27 08:08:56 +0000 [warn]: /var/log/messages not found. Continuing without tailing it.
2017-03-27 08:09:02 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:09:02 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:58:in `client'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:194:in `rescue in send_bulk'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `send_bulk'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:185:in `block in write_objects'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:184:in `each'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:184:in `write_objects'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:490:in `write'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:354:in `write_chunk'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:333:in `pop'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:342:in `try_flush'
  2017-03-27 08:09:02 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:149:in `run'
2017-03-27 08:09:03 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:09:04 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"
  2017-03-27 08:09:03 +0000 [warn]: suppressed same stacktrace
2017-03-27 08:09:04 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:09:08 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"
  2017-03-27 08:09:04 +0000 [warn]: suppressed same stacktrace
2017-03-27 08:09:08 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:09:17 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"
  2017-03-27 08:09:08 +0000 [warn]: suppressed same stacktrace
2017-03-27 08:09:17 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:09:35 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"
  2017-03-27 08:09:17 +0000 [warn]: suppressed same stacktrace
2017-03-27 08:09:35 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:10:05 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"
  2017-03-27 08:09:35 +0000 [warn]: suppressed same stacktrace
2017-03-27 08:10:08 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:11:08 +0000 error_class="Elasticsearch::Transport::Transport::Errors::ServiceUnavailable" error="[503] " plugin_id="object:aaaa00"
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:201:in `__raise_transport_error'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:312:in `perform_request'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/ping.rb:20:in `ping'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:58:in `client'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:194:in `rescue in send_bulk'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `send_bulk'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:185:in `block in write_objects'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:184:in `each'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch_dynamic.rb:184:in `write_objects'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:490:in `write'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:354:in `write_chunk'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/buffer.rb:333:in `pop'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:342:in `try_flush'
  2017-03-27 08:10:08 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.31/lib/fluent/output.rb:149:in `run'
2017-03-27 08:11:11 +0000 [warn]: retry succeeded. plugin_id="object:aaaa00"

Comment 5 Rich Megginson 2017-03-27 17:28:11 UTC
At this time:

2017-03-27 08:09:02 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-03-27 08:09:02 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:aaaa00"

Was elasticsearch up and running?

Comment 7 Xia Zhao 2017-04-06 06:28:39 UTC
Issue reproduced in 3.6.0:

# openshift version
openshift v3.6.16
kubernetes v1.5.2+43a9be4
etcd 3.1.0

images on ops registry:
openshift3/logging-fluentd                                 3.6.0               a55545358479 

logs attached in https://bugzilla.redhat.com/attachment.cgi?id=1269169

Comment 8 Rich Megginson 2017-04-07 21:13:58 UTC
Can you provide the logs of the elasticsearch pods logging-es-* and logging-es-ops-* that show what elasticsearch was doing between 
2017-04-05 23:22:07 -0400 [warn]: temporarily failed to flush the buffer. next_retry=2017-04-05 23:22:05 -0400 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1cf3d24"
and
2017-04-06 02:20:44 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:1ba92c0" elapsed_time=28.30003044 slow_flush_log_threshold=20.0
?

Comment 9 Junqi Zhao 2017-04-19 05:53:56 UTC
Tested with logging 3.6 and set DEBUG mode for the ES pod, but there was no useful info; see the attached log file. This error may be transient: the connection to ES is lost, but the retry finally succeeds.

Comment 10 Junqi Zhao 2017-04-19 05:54:57 UTC
Created attachment 1272478 [details]
es, fluentd pod log

Comment 11 Noriko Hosoi 2017-04-19 22:23:12 UTC
I also sometimes see the fluentd error.
2017-04-19 17:47:07 -0400 [warn]: temporarily failed to flush the buffer. next_retry=2017-04-19 17:47:13 -0400 error_class="Elasticsearch::Transport::Transport::Errors::ServiceUnavailable" error="[503] " plugin_id="object:1028a2c"

I'm certain it is transient.
2017-04-19 17:47:31 -0400 [warn]: retry succeeded. plugin_id="object:1028a2c"

At the moment [2017-04-19 17:47:07], elasticsearch had not started yet.
[2017-04-19 21:47:07,468][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized (you may need to run sgadmin)
[2017-04-19 21:47:09,333][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized (you may need to run sgadmin)

It turned GREEN 10 seconds later in my test environment.
[2017-04-19 21:47:17,825][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized (you may need to run sgadmin)
Clustername: logging-es
Clusterstate: GREEN

(Please note that the fluentd log and the ES log are configured with different timezones. :)

Is it OK to close this bug as NOTABUG?

Comment 12 Rich Megginson 2017-04-19 22:32:46 UTC
(In reply to Noriko Hosoi from comment #11)
> I also sometimes see the fluentd error.
> 2017-04-19 17:47:07 -0400 [warn]: temporarily failed to flush the buffer.
> next_retry=2017-04-19 17:47:13 -0400
> error_class="Elasticsearch::Transport::Transport::Errors::
> ServiceUnavailable" error="[503] " plugin_id="object:1028a2c"
> 
> I'm certain it is transient.
> 2017-04-19 17:47:31 -0400 [warn]: retry succeeded. plugin_id="object:1028a2c"
> 
> At the moment [2017-04-19 17:47:07], elasticsearch has not started yet.
> [2017-04-19
> 21:47:07,468][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet
> initialized (you may need to run sgadmin)
> [2017-04-19
> 21:47:09,333][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet
> initialized (you may need to run sgadmin)
> 
> It turned to be GREEN 10 seconds later in my test environment.
> [2017-04-19
> 21:47:17,825][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet
> initialized (you may need to run sgadmin)
> Clustername: logging-es
> Clusterstate: GREEN
> 
> (Please note that the fluentd log and the ES log are configured with the
> different timezone. :)
> 
> Is it ok to close this bug with NOTABUG?

In the case where fluentd reports "unable to flush buffer" because Elasticsearch is not running, then yes, this is not a bug.

I think we might want to reduce the verbosity of the fluentd logs though - seeing this particular error, and seeing it frequently at startup, is going to be distressing to users.  We should perhaps change this to an RFE - "Suppress Fluentd connection errors" - we could have a threshold e.g. if you get this error N times in M seconds, then throw an error every X seconds.
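
For illustration only, a minimal Ruby sketch of that "N errors in M seconds" idea could look like the following; the class name, thresholds, and interface are made up for this sketch and are not part of fluentd or the elasticsearch plugin:

    # Hypothetical helper: log the first max_errors connection errors seen in a
    # sliding window, then emit at most one summary line every summary_interval
    # seconds until the errors stop.
    class ConnectionErrorSuppressor
      def initialize(max_errors: 5, window: 60, summary_interval: 300)
        @max_errors = max_errors              # "N": errors logged per window
        @window = window                      # "M": window length in seconds
        @summary_interval = summary_interval  # "X": seconds between summaries
        @logged_at = []                       # timestamps of errors we logged
        @suppressed = 0                       # errors swallowed since last summary
        @last_summary = nil
      end

      # Returns the message to log, or nil if it should be suppressed.
      def filter(message, now = Time.now)
        @logged_at.reject! { |t| now - t > @window }
        if @logged_at.size < @max_errors
          @logged_at << now
          return message
        end
        @suppressed += 1
        if @last_summary.nil? || now - @last_summary >= @summary_interval
          @last_summary = now
          count, @suppressed = @suppressed, 0
          return "#{message} (#{count} similar errors suppressed)"
        end
        nil
      end
    end

A change along these lines would most likely have to live in fluentd itself or in the elasticsearch output plugin, as noted later in this bug.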

Comment 13 Junqi Zhao 2017-04-20 00:11:19 UTC
This defect does not seem to affect overall logging functionality, and it did not happen prior to logging 3.4.0. However, it is not user friendly when it occurs, and customers have complained about this issue.

Comment 14 Noriko Hosoi 2017-04-20 00:30:06 UTC
(In reply to Junqi Zhao from comment #13)
> It seems this defect does not affect the whole logging functions, it did not
> happen prior to logging 3.4.0,  and it is not user friendly if this defect
> happens again and customers had been complained this issue.

I agree it causes unnecessary concern for customers.

I'm not sure, but prior to 3.4.0 elasticsearch may have started much more quickly?

I'm trying to launch fluentd with the option --suppress-repeated-stacktrace, but it just suppresses repeated stacktraces.  So, it logs the error at least once...

Comment 15 Rich Megginson 2017-04-20 00:37:05 UTC
@Noriko - yes, with 3.4 we switched to ES 2.4.1 from ES 1.5, and changed the way it starts up too.  Another problem is that there is no orchestration - that is, we don't have a way to prevent the other services that use ES from starting until ES is really up and running and ready to accept client operations.

I don't think there is anything in fluentd currently that can help.  That's why I suggested an RFE because we will likely need to change fluentd itself or perhaps the elasticsearch plugin.
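
As a rough illustration of the orchestration that is missing, a dependent service would need something like the following gate before it starts sending data; the host, port, health path, and the omission of authentication and certificate verification are assumptions made only to keep the sketch short:

    # Hypothetical startup gate: poll Elasticsearch until it answers, or give up
    # after a timeout. Real clients would also need the fluentd credentials and
    # CA certificate; both are skipped here purely for illustration.
    require 'net/http'
    require 'openssl'

    def wait_for_elasticsearch(host: 'logging-es', port: 9200, timeout: 300)
      deadline = Time.now + timeout
      while Time.now < deadline
        begin
          http = Net::HTTP.new(host, port)
          http.use_ssl = true
          http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # illustration only
          response = http.get('/_cluster/health')
          return true if response.code.to_i == 200      # ES is up and answering
        rescue StandardError
          # connection refused or DNS failure while ES is still starting up
        end
        sleep 5
      end
      false  # gave up; caller decides whether to start anyway
    end

Comment 24 below mentions that a readiness probe for elasticsearch was introduced in 3.5, which covers part of this at the Kubernetes level.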

Comment 16 Noriko Hosoi 2017-04-20 01:15:01 UTC
Thank you, Rich.  It makes sense.  I was searching for a way to set dependencies among containers/pods, but I could not find any good doc...

To mark this as an RFE, is choosing RFE in the keywords box good enough?

Comment 17 Junqi Zhao 2017-04-25 06:27:54 UTC
(In reply to Rich Megginson from comment #15)
> @Noriko - yes, with 3.4 we switched to ES 2.4.1 from ES 1.5, and changed the
> way it starts up too.  Another problem is that there is no orchestration -
> that is, we don't have a way to prevent the other services that use ES from
> starting until ES is really up and running and ready to accept client
> operations.
> 
> I don't think there is anything in fluentd currently that can help.  That's
> why I suggested an RFE because we will likely need to change fluentd itself
> or perhaps the elasticsearch plugin.

Tested on OCP v3.3.1.20: this error did not happen a few months ago, but it appears now. ES version is 1.5.2, fluentd version is 0.12.23; see the attached file.

Comment 18 Junqi Zhao 2017-04-25 06:29:36 UTC
Created attachment 1273837 [details]
fluentd log, ocp 3.3.1.20

Comment 19 Xia Zhao 2017-05-09 06:01:58 UTC
The related info was provided in comment 17.

Comment 20 Junqi Zhao 2017-08-08 08:17:43 UTC
Tested with logging-fluentd:3.3.1-21. Besides the "fluentd temporarily failed to flush the buffer" warning, we get the following warning:
**************************************************************************
2017-08-08 03:51:00 -0400 [warn]: /var/log/containers/logging-es-ops-a7gbxh1s-1-deploy_logging_POD-e7a36ba4c2316f04d045871bc630c74e2e1e17ab44dea856977df2caddb7a33d.log not found. Continuing without tailing it.

**************************************************************************

Checked on the node: these files do not exist. They are used for the deployer pod logs; maybe they were deleted while fluentd was reading from them.

Comment 21 Junqi Zhao 2017-08-08 08:18:28 UTC
Created attachment 1310475 [details]
fluentd log, logging-fluentd:3.3.1-21

Comment 22 Noriko Hosoi 2017-08-08 22:49:50 UTC
Hi @Junqi, since this bug was opened for 3.4, can we stick to just 3.4?

I.e., could you open a separate bug for the 3.3.1 issue?  In the new bug for 3.3.1, could you put the steps you took to reproduce the problem?

In terms of the "fluentd temporarily failed to flush the buffer" warning issue, we often see these warnings when the elasticsearch pod does not have enough resources.  To check whether what you saw was similar, could you try restarting elasticsearch with a higher CPU limit?
        resources:
          limits:
            cpu: 1000m

In #c17, you made the following observation.  Were the machine specs (including CPU and memory size) identical between the 3.3.1.20 test case and the 3.4.x one?  It would be helpful to record such information in the bug as well.

> Tested on OCP v3.3.1.20, this error did not happen a few months ago, but appeared now.

Comment 23 Junqi Zhao 2017-08-09 07:39:55 UTC
(In reply to Noriko Hosoi from comment #22)
> Hi @Junqi, since this bug was opened for 3.4, can we stick to just 3.4?
> 
> I.e., could you open a separate bug for the 3.3.1 issue?  In the new bug for
> 3.3.1, could you put the steps you took to reproduce the problem?

Ok, opened defects for v3.5 and 3.3.1
v3.5  https://bugzilla.redhat.com/show_bug.cgi?id=1479651
3.3.1 https://bugzilla.redhat.com/show_bug.cgi?id=1479679
> 
> In terms of "fluentd temporarily failed to flush the buffer" warn issue, we
> often see the warnings when the elasticsearch pod does not have enough
> resource.  To check if what you saw was similar to it or not, could you try
> restarting the elasticsearch with higher CPU limit?
>         resources:
>           limits:
>             cpu: 1000m
There is no ES error in the 3.4, 3.3.1 and 3.5 fluentd pod logs now, and it is not caused by the elasticsearch CPU limit. When the fluentd pods start up, they throw the "fluentd temporarily failed to flush the buffer" warning; see the files attached to the defect.

Comment 24 Noriko Hosoi 2017-08-09 23:24:35 UTC
(In reply to Junqi Zhao from comment #23)
> There is no es error in 3.4, 3.3.1 and 3.5 fluentd pod logs now, and it is
> not caused by elasticsearch CPU limit, when the fluented pods started up, it
> will throw out "fluentd temporarily failed to flush the buffer" warn info,
> see the attached files for the defect

Thanks for the clarification, @Junqi.  So the warning messages are issued just at startup and disappear once "retry succeeded." is logged.  That is, elasticsearch was slower to become ready to accept incoming data than fluentd was to send it.

The readiness probe for elasticsearch was introduced in 3.5.  I guess you don't see the warnings on 3.5 and newer, do you?

For 3.4, since the severity of this bug is low, could we close it as WONTFIX?

Comment 25 Junqi Zhao 2017-08-10 00:07:03 UTC
(In reply to Noriko Hosoi from comment #24)

> The readiness probe for the elasticsearch was introduced since 3.5.  I guess
> you don't see the warnings on 3.5 and newer, do you?

3.5 still has this issue; 3.6 does not.
Please see the attachment in the v3.5 defect
https://bugzilla.redhat.com/show_bug.cgi?id=1479651
"temporarily failed to flush the buffer" is still there.

> For 3.4, since the severity of this bug is low, we could close this bug as
> WONTFIX?

Although it's a low-priority defect, if we know how to fix it I think we should fix it, since the warning logs are noisy and customers have complained about these warning messages.

Comment 27 Junqi Zhao 2017-08-18 01:53:20 UTC
@Noriko,

One more warning I'd like to raise: in the fluentd pod logs there is this additional warning:

2017-08-17 21:37:02 -0400 [warn]: /var/log/containers/logging-curator-ops-1-deploy_logging_POD-af9d1fae33516e492cba52ffa97bc03f4a0cabf51a66720d6967b656d34fad53.log not found. Continuing without tailing it.


Checked on the node: these files do not exist. They are used for the deployer pod logs; maybe they were deleted while fluentd was reading from them.

For more info, please see my attached file.

Comment 28 Junqi Zhao 2017-08-18 01:56:31 UTC
Created attachment 1314985 [details]
additional warning info in fluentd pod log

Comment 29 Rich Megginson 2017-10-09 21:25:17 UTC
I'm going to close this bug about the connection issue as NOTABUG because it is transient and expected.

Please open a new bz about the 

2017-08-17 21:37:02 -0400 [warn]: /var/log/containers/logging-curator-ops-1-deploy_logging_POD-af9d1fae33516e492cba52ffa97bc03f4a0cabf51a66720d6967b656d34fad53.log not found. Continuing without tailing it.

issue.