Bug 1572508 - fluentd unable to connect keeps retrying every 3 minutes
Summary: fluentd unable to connect keeps retrying every 3 minutes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-metrics
Classification: oVirt
Component: Generic
Version: 1.1.4.2
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ovirt-4.2.4
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks: 1581842
 
Reported: 2018-04-27 07:51 UTC by Lukas Svaty
Modified: 2018-06-26 08:45 UTC
5 users

Fixed In Version: ovirt-engine-metrics-1.1.5
Clone Of:
Environment:
Last Closed: 2018-06-26 08:45:05 UTC
oVirt Team: Metrics
Embargoed:
rule-engine: ovirt-4.2+
ykaul: exception+


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 91153 0 master MERGED Update fluentd buffer output settings 2021-02-08 18:12:35 UTC

Description Lukas Svaty 2018-04-27 07:51:20 UTC
Description of problem:
When the metrics store is down, fluentd on the engine and hosts keeps retrying to connect to it every 3 minutes and ignores the retry-count limit (see the log excerpt below).

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Connect engine to metrics store
2. Power off metrics store VM


Actual results:
Fluentd keeps reconnecting and spamming the journal; this causes hosts to run out of space and crash.

Expected results:
After fluentd exceeds its connection retry limit, it should stop (and stop collectd as well).

Additional info:
Apr 27 09:33:37 brq-setup fluentd: 2018-04-27 09:33:37 +0200 [warn]: suppressed same stacktrace
Apr 27 09:33:37 brq-setup fluentd: 2018-04-27 09:33:37 +0200 fluent.warn: {"next_retry":"2018-04-27 09:38:31 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:38:31 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}
Apr 27 09:35:34 brq-setup fluentd: 2018-04-27 09:35:34 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:40:27 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc463c93c"
Apr 27 09:35:34 brq-setup fluentd: 2018-04-27 09:35:34 +0200 [warn]: suppressed same stacktrace
Apr 27 09:35:34 brq-setup fluentd: 2018-04-27 09:35:34 +0200 fluent.warn: {"next_retry":"2018-04-27 09:40:27 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc463c93c","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:40:27 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc463c93c\""}
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [warn]: failed to flush the buffer. error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc4685600"

vvvvvvvvvvvvvvv
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [warn]: retry count exceededs limit.
^^^^^^^^^^^^^^^^
this is ignored


Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [warn]: suppressed same stacktrace
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 [error]: throwing away old logs.
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 fluent.warn: {"error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"failed to flush the buffer. error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 fluent.warn: {"message":"retry count exceededs limit."}
Apr 27 09:38:31 brq-setup fluentd: 2018-04-27 09:38:31 +0200 fluent.error: {"message":"throwing away old logs."}
Apr 27 09:40:01 brq-setup systemd: Started Session 2135 of user root.
Apr 27 09:40:01 brq-setup systemd: Starting Session 2135 of user root.
Apr 27 09:40:34 brq-setup fluentd: 2018-04-27 09:40:34 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:45:27 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc463c93c"
Apr 27 09:40:34 brq-setup fluentd: 2018-04-27 09:40:34 +0200 [warn]: suppressed same stacktrace
Apr 27 09:40:34 brq-setup fluentd: 2018-04-27 09:40:34 +0200 fluent.warn: {"next_retry":"2018-04-27 09:45:27 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc463c93c","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:45:27 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc463c93c\""}
Apr 27 09:43:21 brq-setup systemd-logind: New session 2136 of user root.
Apr 27 09:43:21 brq-setup systemd: Started Session 2136 of user root.
Apr 27 09:43:21 brq-setup systemd: Starting Session 2136 of user root.
Apr 27 09:43:37 brq-setup fluentd: 2018-04-27 09:43:37 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:32 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc4685600"
Apr 27 09:43:37 brq-setup fluentd: 2018-04-27 09:43:37 +0200 [warn]: suppressed same stacktrace
Apr 27 09:43:37 brq-setup fluentd: 2018-04-27 09:43:37 +0200 fluent.warn: {"next_retry":"2018-04-27 09:43:32 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:32 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}
Apr 27 09:43:49 brq-setup fluentd: 2018-04-27 09:43:49 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:34 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3f8dc4685600"
Apr 27 09:43:49 brq-setup fluentd: 2018-04-27 09:43:49 +0200 [warn]: suppressed same stacktrace
Apr 27 09:43:49 brq-setup fluentd: 2018-04-27 09:43:49 +0200 fluent.warn: {"next_retry":"2018-04-27 09:43:34 +0200","error_class":"Fluent::ElasticsearchOutput::ConnectionFailure","error":"Can not reach Elasticsearch cluster ({:host=>\"brq-metrics.example.com\", :port=>9200, :scheme=>\"https\"})!","plugin_id":"object:3f8dc4685600","message":"temporarily failed to flush the buffer. next_retry=2018-04-27 09:43:34 +0200 error_class=\"Fluent::ElasticsearchOutput::ConnectionFailure\" error=\"Can not reach Elasticsearch cluster ({:host=>\\\"brq-metrics.example.com\\\", :port=>9200, :scheme=>\\\"https\\\"})!\" plugin_id=\"object:3f8dc4685600\""}

Comment 1 Shirly Radco 2018-04-29 07:40:28 UTC
This is the time set for the buffer configuration:

fluentd_max_retry_wait_metrics: 300s

fluentd_max_retry_wait_logs: 300s

User can update it to a higher value. Is that sufficient?
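For context, these are Ansible variables consumed by ovirt-engine-metrics, so a user would override them in the metrics configuration file and re-run the configuration playbook. A minimal sketch, assuming the file lives at /etc/ovirt-engine-metrics/config.yml (the path and the 600s value are illustrative, not a recommendation):

   # Raise the maximum wait between fluentd retry warnings from 5 to 10 minutes
   fluentd_max_retry_wait_metrics: 600s
   fluentd_max_retry_wait_logs: 600s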

Comment 2 Yaniv Kaul 2018-04-29 08:51:43 UTC
(In reply to Shirly Radco from comment #1)
> This is the time set for the buffer configuration:
> 
> fluentd_max_retry_wait_metrics: 300s
> 
> fluentd_max_retry_wait_logs: 300s
> 
> User can update it to a higher value. Is that sufficient?

We need to promptly fix the issue of spamming /var/log/messages.
Do not bother to ask the user to set some value they have no idea what they are, what to set them to, and when to set them.

Comment 3 Rich Megginson 2018-04-30 14:26:29 UTC
(In reply to Yaniv Kaul from comment #2)
> (In reply to Shirly Radco from comment #1)
> > This is the time set for the buffer configuration:
> > 
> > fluentd_max_retry_wait_metrics: 300s
> > 
> > fluentd_max_retry_wait_logs: 300s
> > 
> > User can update it to a higher value. Is that sufficient?
> 
> We need to promptly fix the issue of spamming /var/log/messages.
> Do not bother to ask the user to set some value they have no idea what they
> are, what to set them to, and when to set them.

In https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/output-es-config.conf the two key parameters are:

   max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"

If seeing a log message every 5 minutes is too much, then set max_retry_wait to a higher value e.g. 600 for every 10 minutes.

   request_timeout 2147483648

That is, wait "forever" for the response from Elasticsearch.

But - it may make a difference why fluentd cannot connect to Elasticsearch - that is, is there some intervention required on the part of the fluentd (client side) and/or Elasticsearch (server side) in order to fix the problem?
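To make that concrete, a minimal sketch of a fluentd (v0.12-style) Elasticsearch output section using those two parameters; the match pattern and destination below are placeholders, not the actual oVirt configuration:

   <match ovirt.**>
     @type elasticsearch
     # placeholder destination; the real host/port/TLS settings come from the deployment
     host brq-metrics.example.com
     port 9200
     scheme https
     # log the retry warning at most every 10 minutes instead of every 5
     max_retry_wait 600
     # wait "forever" for the response from Elasticsearch
     request_timeout 2147483648
   </match>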

Comment 4 Shirly Radco 2018-05-10 10:24:51 UTC
(In reply to Rich Megginson from comment #3)
> (In reply to Yaniv Kaul from comment #2)
> > (In reply to Shirly Radco from comment #1)
> > > This is the time set for the buffer configuration:
> > > 
> > > fluentd_max_retry_wait_metrics: 300s
> > > 
> > > fluentd_max_retry_wait_logs: 300s
> > > 
> > > User can update it to a higher value. Is that sufficient?
> > 
> > We need to promptly fix the issue of spamming /var/log/messages.
> > Do not bother to ask the user to set some value they have no idea what they
> > are, what to set them to, and when to set them.
> 
> In
> https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/
> configs.d/openshift/output-es-config.conf the two key parameters are:
> 
>    max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
> 
> If seeing a log message every 5 minutes is too much, then set max_retry_wait
> to a higher value e.g. 600 for every 10 minutes.
> 
>    request_timeout 2147483648
> 
> That is, wait "forever" for the response from Elasticsearch.
> 
> But - it may make a difference why fluentd cannot connect to Elasticsearch -
> that is, is there some intervention required on the part of the fluentd
> (client side) and/or Elasticsearch (server side) in order to fix the problem?


1. For logs, should I use "disable_retry_limit" true, so we will not lose log data?

2. For metrics and logs, I believe we can set "max_retry_wait" to 3600, so the warnings will be logged once an hour. This does not mean that retries stop; they still continue in the background according to "retry_limit" and "retry_wait". Right?

3. As far as I can see, the log message appears only when we hit "max_retry_wait". Right?

Comment 5 Rich Megginson 2018-05-10 14:50:00 UTC
(In reply to Shirly Radco from comment #4)
> (In reply to Rich Megginson from comment #3)
> > (In reply to Yaniv Kaul from comment #2)
> > > (In reply to Shirly Radco from comment #1)
> > > > This is the time set for the buffer configuration:
> > > > 
> > > > fluentd_max_retry_wait_metrics: 300s
> > > > 
> > > > fluentd_max_retry_wait_logs: 300s
> > > > 
> > > > User can update it to a higher value. Is that sufficient?
> > > 
> > > We need to promptly fix the issue of spamming /var/log/messages.
> > > Do not bother to ask the user to set some value they have no idea what they
> > > are, what to set them to, and when to set them.
> > 
> > In
> > https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/
> > configs.d/openshift/output-es-config.conf the two key parameters are:
> > 
> >    max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"
> > 
> > If seeing a log message every 5 minutes is too much, then set max_retry_wait
> > to a higher value e.g. 600 for every 10 minutes.
> > 
> >    request_timeout 2147483648
> > 
> > That is, wait "forever" for the response from Elasticsearch.
> > 
> > But - it may make a difference why fluentd cannot connect to Elasticsearch -
> > that is, is there some intervention required on the part of the fluentd
> > (client side) and/or Elasticsearch (server side) in order to fix the problem?
> 
> 
> 1. For logs, should I use "disable_retry_limit" true, so we will not lose
> log data?

Yes

> 
> 2. For metrics and logs, I believe we can set "max_retry_wait" to 3600, so
> the warnings will be logged once an hour.

It means that fluentd will attempt a retry every hour.  Once the retry is successful, fluentd will send logs immediately from then on (until another error is hit and it has to retry again, every 1 second, then every 2 seconds, etc. until it hits every 3600 seconds).
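(To put numbers on that backoff: assuming the wait starts at 1 second and doubles after each failure, the intervals run 1s, 2s, 4s, ..., 1024s, 2048s, and are then capped at max_retry_wait = 3600s, so the cap is reached after roughly a dozen failed attempts.)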

> This does not mean that retries stop; they still continue in the background
> according to "retry_limit" and "retry_wait".
> Right?

If you set `disable_retry_limit true`, then it should ignore "retry_limit".

> 
> 3. As far as I can see, the log message appears only when we hit "max_retry_wait".
> Right?

?

Comment 6 Shirly Radco 2018-05-10 16:02:36 UTC
I updated the fluentd configuration to log the errors and warnings once an hour.

Comment 7 Shirly Radco 2018-05-14 07:05:17 UTC
I updated fluentd to alert once an hour.
Also, the retry interval now starts at 1 minute and is multiplied on each retry, up to the max_retry_wait of 1 hour.
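A minimal sketch of fluentd output buffer settings matching the behaviour described above (illustrative only, not the actual merged patch; the match pattern is a placeholder):

   <match ovirt.**>
     @type elasticsearch
     # first retry after 1 minute; the wait then grows on each failure (doubling by default)...
     retry_wait 60
     # ...up to at most 1 hour between attempts (and between warning messages)
     max_retry_wait 3600
     # never stop retrying, so buffered events are not thrown away
     disable_retry_limit true
   </match>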

Comment 8 Lukas Svaty 2018-06-11 11:39:36 UTC
verified in ovirt-engine-metrics-1.1.5.1-1.el7ev.noarch

Comment 9 Sandro Bonazzola 2018-06-26 08:45:05 UTC
This bug is included in the oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

