Bug 1572508

Summary: fluentd unable to connect keeps retrying every 3 minutes
Product: [oVirt] ovirt-engine-metrics
Reporter: Lukas Svaty <lsvaty>
Component: Generic
Assignee: Shirly Radco <sradco>
Status: CLOSED CURRENTRELEASE
QA Contact: Lukas Svaty <lsvaty>
Severity: high
Docs Contact:
Priority: high
Version: 1.1.4.2
CC: bugs, lsvaty, rmeggins, ykaul, ylavi
Target Milestone: ovirt-4.2.4
Flags: rule-engine: ovirt-4.2+
       ykaul: exception+
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: ovirt-engine-metrics-1.1.5
Doc Type: Bug Fix
Doc Text:
Cause: Logs are flooded with fluentd errors when the remote metrics store connection is lost. Consequence: Too many fluentd log messages. Fix: Updated fluentd to alert once an hour while the connection is still lost. Result: Fewer fluentd errors about the lost connection.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-26 08:45:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Metrics
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1581842    

Description Lukas Svaty 2018-04-27 07:51:20 UTC
Description of problem:
When the metrics store is down, fluentd on the engine and hosts keeps retrying to connect to it every 3 minutes and ignores the exceeded retry limit (see the journal excerpt below).

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Connect engine to metrics store
2. Power off metrics store VM


Actual results:
Fluentd keeps reconnecting and spamming the journal, which causes hosts to run out of space and crash.

Expected results:
After fluentd exceeds its retry limit, it should stop (and also stop collectd).

Additional info:
Apr 27 09:33:37 brq-setup fluentd: 2018-04-27 09:33:37 +0200 [warn]: suppressed same stacktrace
Apr 27 09:33:37 brq-setup fluentd: 2018-04-27 09:33:37 +0200 fluent.warn: {"next_retry":"2018-04-27 09:38:31 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:38:31 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}
Apr 27 09:35:34 brq-setup fluentd: 2018-04-27 09:35:34 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:40:27 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc463c93c"
Apr 27 09:35:34 brq-setup fluentd: 2018-04-27 09:35:34 +0200 [warn]: suppressed same stacktrace
Apr 27 09:35:34 brq-setup fluentd: 2018-04-27 09:35:34 +0200 fluent.warn: {"next_retry":"2018-04-27 09:40:27 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc463c93c","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:40:27 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc463c93c\""}
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [warn]: failed to flush the buffer. error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc4685600"

vvvvvvvvvvvvvvv
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [warn]: retry count exceededs limit.
^^^^^^^^^^^^^^^^
this is ignored


Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [warn]: suppressed same stacktrace
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [error]: throwing away old logs.
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 fluent.warn: {"error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"failed to flush the buffer. error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 fluent.warn: {"message":"retry count exceededs limit."}
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 fluent.error: {"message":"throwing away old logs."}
Apr 27 09:40:01 brq-setup systemd: Started Session 2135 of user root.
Apr 27 09:40:01 brq-setup systemd: Starting Session 2135 of user root.
Apr 27 09:40:34 brq-setup fluentd: 2018-04-27 09:40:34 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:45:27 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc463c93c"
Apr 27 09:40:34 brq-setup fluentd: 2018-04-27 09:40:34 +0200 [warn]: suppressed same stacktrace
Apr 27 09:40:34 brq-setup fluentd: 2018-04-27 09:40:34 +0200 fluent.warn: {"next_retry":"2018-04-27 09:45:27 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc463c93c","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:45:27 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc463c93c\""}
Apr 27 09:43:21 brq-setup systemd-logind: New session 2136 of user root.
Apr 27 09:43:21 brq-setup systemd: Started Session 2136 of user root.
Apr 27 09:43:21 brq-setup systemd: Starting Session 2136 of user root.
Apr 27 09:43:37 brq-setup fluentd: 2018-04-27 09:43:37 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:32 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc4685600"
Apr 27 09:43:37 brq-setup fluentd: 2018-04-27 09:43:37 +0200 [warn]: suppressed same stacktrace
Apr 27 09:43:37 brq-setup fluentd: 2018-04-27 09:43:37 +0200 fluent.warn: {"next_retry":"2018-04-27 09:43:32 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:32 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}
Apr 27 09:43:49 brq-setup fluentd: 2018-04-27 09:43:49 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:34 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc4685600"
Apr 27 09:43:49 brq-setup fluentd: 2018-04-27 09:43:49 +0200 [warn]: suppressed same stacktrace
Apr 27 09:43:49 brq-setup fluentd: 2018-04-27 09:43:49 +0200 fluent.warn: {"next_retry":"2018-04-27 09:43:34 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:34 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}

Comment 1 Shirly Radco 2018-04-29 07:40:28 UTC
This is the time set for the buffer configuration:

fluentd_max_retry_wait_metrics: 300s

fluentd_max_retry_wait_logs: 300s

User can update it to a higher value. Is that sufficient?
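
These variables presumably map to the max_retry_wait setting of the buffered Elasticsearch outputs in the generated fluentd configuration. A minimal, illustrative sketch (the tag pattern is a placeholder and the host is taken from the journal excerpt above; this is not the exact generated file):

   # placeholder tag pattern; the generated configuration may differ
   <match ovirt.metrics.**>
     @type elasticsearch
     host brq-metrics.example.com
     port 9200
     scheme https
     # longest pause between retries; once the backoff saturates, the
     # "temporarily failed to flush the buffer" warning recurs at this interval
     max_retry_wait 300s
   </match>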

Comment 2 Yaniv Kaul 2018-04-29 08:51:43 UTC
(In reply to Shirly Radco from comment #1)
> This is the time set for the buffer configuration:
> 
> fluentd_max_retry_wait_metrics: 300s
> 
> fluentd_max_retry_wait_logs: 300s
> 
> User can update it to a higher value. Is that sufficient?

We need to promptly fix the issue of spamming /var/log/messages.
Do not bother to ask the user to set some value they have no idea what they are, what to set them to, and when to set them.

Comment 3 Rich Megginson 2018-04-30 14:26:29 UTC
(In reply to Yaniv Kaul from comment #2)
> (In reply to Shirly Radco from comment #1)
> > This is the time set for the buffer configuration:
> > 
> > fluentd_max_retry_wait_metrics: 300s
> > 
> > fluentd_max_retry_wait_logs: 300s
> > 
> > User can update it to a higher value. Is that sufficient?
> 
> We need to promptly fix the issue of spamming /var/log/messages.
> Do not bother to ask the user to set some value they have no idea what they
> are, what to set them to, and when to set them.

In https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/output-es-config.conf the two key parameters are:

   max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"

If seeing a log message every 5 minutes is too much, then set max_retry_wait to a higher value e.g. 600 for every 10 minutes.

   request_timeout 2147483648

That is, wait "forever" for the response from Elasticsearch.
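
Put together, a sketch of how those two settings could sit in the Elasticsearch output block, loosely modeled on the linked OpenShift config (the tag pattern and the 600 value are illustrative, not shipped defaults):

   # placeholder tag pattern
   <match ovirt.**>
     @type elasticsearch
     host brq-metrics.example.com
     port 9200
     scheme https
     # e.g. one retry (and one warning) every 10 minutes at most
     max_retry_wait 600s
     # effectively wait "forever" for the Elasticsearch response
     request_timeout 2147483648
   </match>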

But - it may make a difference why fluentd cannot connect to Elasticsearch - that is, is there some intervention required on the part of the fluentd (client side) and/or Elasticsearch (server side) in order to fix the problem?

Comment 4 Shirly Radco 2018-05-10 10:24:51 UTC
(In reply to Rich Megginson from comment #3)
> (In reply to Yaniv Kaul from comment #2)
> > (In reply to Shirly Radco from comment #1)
> > > This is the time set for the buffer configuration:
> > > 
> > > fluentd_max_retry_wait_metrics: 300s
> > > 
> > > fluentd_max_retry_wait_logs: 300s
> > > 
> > > User can update it to a higher value. Is that sufficient?
> > 
> > We need to promptly fix the issue of spamming /var/log/messages.
> > Do not bother to ask the user to set some value they have no idea what they
> > are, what to set them to, and when to set them.
> 
> In
> https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/
> configs.d/openshift/output-es-config.conf the two key parameters are:
> 
>    max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
> 
> If seeing a log message every 5 minutes is too much, then set max_retry_wait
> to a higher value e.g. 600 for every 10 minutes.
> 
>    request_timeout 2147483648
> 
> That is, wait "forever" for the response from Elasticsearch.
> 
> But - it may make a difference why fluentd cannot connect to Elasticsearch -
> that is, is there some intervention required on the part of the fluentd
> (client side) and/or Elasticsearch (server side) in order to fix the problem?


1. For logs, should I use "disable_retry_limit" true, so we will not lose log data?

2. For metrics and logs, I believe we can set "max_retry_wait" to 3600. So the logs will be logged once an hour. This does not mean the retry will not continue in the background according to the "retry_limit" and "retry_wait". Right? 

3. As far as I can see, the log message only appears when we hit the "max_retry_wait". Right?

Comment 5 Rich Megginson 2018-05-10 14:50:00 UTC
(In reply to Shirly Radco from comment #4)
> (In reply to Rich Megginson from comment #3)
> > (In reply to Yaniv Kaul from comment #2)
> > > (In reply to Shirly Radco from comment #1)
> > > > This is the time set for the buffer configuration:
> > > > 
> > > > fluentd_max_retry_wait_metrics: 300s
> > > > 
> > > > fluentd_max_retry_wait_logs: 300s
> > > > 
> > > > User can update it to a higher value. Is that sufficient?
> > > 
> > > We need to promptly fix the issue of spamming /var/log/messages.
> > > Do not bother to ask the user to set some value they have no idea what they
> > > are, what to set them to, and when to set them.
> > 
> > In
> > https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/
> > configs.d/openshift/output-es-config.conf the two key parameters are:
> > 
> >    max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
> > 
> > If seeing a log message every 5 minutes is too much, then set max_retry_wait
> > to a higher value e.g. 600 for every 10 minutes.
> > 
> >    request_timeout 2147483648
> > 
> > That is, wait "forever" for the response from Elasticsearch.
> > 
> > But - it may make a difference why fluentd cannot connect to Elasticsearch -
> > that is, is there some intervention required on the part of the fluentd
> > (client side) and/or Elasticsearch (server side) in order to fix the problem?
> 
> 
> 1. For logs, should I use "disable_retry_limit" true, so we will not lose
> log data?

Yes

> 
> 2. For metrics and logs, I believe we can set "max_retry_wait" to 3600. So
> the logs will be logged once an hour.

It means that fluentd will attempt a retry every hour.  Once the retry is successful, fluentd will send logs immediately from then on (until another error is hit and it has to retry again, every 1 second, then every 2 seconds, etc. until it hits every 3600 seconds).

>This does not mean the retry will not
> continue in the background according to the "retry_limit" and "retry_wait".
> Right?

If you set `disable_retry_limit true` then it should ignore "retry_limit"

> 
> 3. As far as I can see, the log message only appears when we hit the
> "max_retry_wait". Right?

?
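
To summarize the combination discussed above, a sketch of the buffer settings for the logs output (fluentd 0.12 style; the tag pattern is a placeholder and the values are the ones proposed in this thread, not necessarily what was shipped):

   # placeholder tag pattern for the logs output
   <match ovirt.logs.**>
     @type elasticsearch
     host brq-metrics.example.com
     port 9200
     scheme https
     # never give up retrying, so buffered log data is not thrown away
     disable_retry_limit true
     # first retry after 1 second, interval grows exponentially on each failure
     retry_wait 1s
     # back off to at most one retry (and one warning) per hour
     max_retry_wait 3600s
   </match>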

Comment 6 Shirly Radco 2018-05-10 16:02:36 UTC
I updated fluentd configurations to log the errors and warnings once an hour.

Comment 7 Shirly Radco 2018-05-14 07:05:17 UTC
I updated fluentd to alert once an hour.
The retries now start at 1 minute and the interval is multiplied on each retry, up to the max_retry_wait of 1 hour.
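
A sketch of buffer settings matching the described behaviour, i.e. retries starting at one minute with the interval doubling up to one hour (the exact values and tag pattern shipped in ovirt-engine-metrics-1.1.5 may differ):

   # placeholder tag pattern
   <match ovirt.metrics.**>
     @type elasticsearch
     host brq-metrics.example.com
     port 9200
     scheme https
     # first retry after 1 minute
     retry_wait 60s
     # intervals roughly double: 1m, 2m, 4m, 8m, 16m, 32m, then capped at 1h
     max_retry_wait 3600s
   </match>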

Comment 8 Lukas Svaty 2018-06-11 11:39:36 UTC
verified in ovirt-engine-metrics-1.1.5.1-1.el7ev.noarch

Comment 9 Sandro Bonazzola 2018-06-26 08:45:05 UTC
This bugzilla is included in oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.