Bug 1572508
Summary: | fluentd unable to connect keeps retrying every 3 minutes | ||
---|---|---|---|
Product: | [oVirt] ovirt-engine-metrics | Reporter: | Lukas Svaty <lsvaty> |
Component: | Generic | Assignee: | Shirly Radco <sradco> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Lukas Svaty <lsvaty> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 1.1.4.2 | CC: | bugs, lsvaty, rmeggins, ykaul, ylavi |
Target Milestone: | ovirt-4.2.4 | Flags: | rule-engine:
ovirt-4.2+
ykaul: exception+ |
Target Release: | --- | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | ovirt-engine-metrics-1.1.5 | Doc Type: | Bug Fix |
Doc Text: |
Cause:
Logs are flooded with fluentd errors when the remote metrics store connection is lost.
Consequence:
Too many fluentd log messages.
Fix:
Updated fluentd to alert once an hour if the connection is still lost.
Result:
Fewer fluentd errors about the lost connection.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-06-26 08:45:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1581842 |
Description
Lukas Svaty
2018-04-27 07:51:20 UTC
Comment 1, Shirly Radco:

This is the time set for the buffer configuration:

fluentd_max_retry_wait_metrics: 300s
fluentd_max_retry_wait_logs: 300s

The user can update it to a higher value. Is that sufficient?

Comment 2, Yaniv Kaul (in reply to Shirly Radco from comment #1):

We need to promptly fix the issue of spamming /var/log/messages. Do not bother asking users to set values when they have no idea what the values are, what to set them to, or when to set them.

Comment 3, Rich Megginson (in reply to Yaniv Kaul from comment #2):

In https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/output-es-config.conf the two key parameters are:

max_retry_wait "#{ENV['ES_RETRY_WAIT'] || '300'}"

If seeing a log message every 5 minutes is too much, set max_retry_wait to a higher value, e.g. 600 for a message every 10 minutes.

request_timeout 2147483648

That is, wait "forever" for the response from Elasticsearch.

But it may make a difference why fluentd cannot connect to Elasticsearch: is some intervention required on the fluentd (client) side and/or the Elasticsearch (server) side in order to fix the problem?
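The parameters discussed above can be combined into a single fluentd output section. A minimal sketch, assuming a fluent-plugin-elasticsearch output in the flat v0.12-style syntax used by the referenced config file; the host and port values are hypothetical placeholders, and the defaults shown are illustrative:

```
<match **>
  @type elasticsearch
  host metrics-store.example.com   # hypothetical Elasticsearch endpoint
  port 9200
  # Wait "forever" (2^31 seconds) for the Elasticsearch response
  request_timeout 2147483648
  # Back off between retries, capped at one hour, so the connection
  # warning is logged at most once per hour once the cap is reached
  retry_wait 1s
  max_retry_wait 3600
  # Keep retrying indefinitely so buffered log data is not dropped
  disable_retry_limit true
</match>
```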
Comment 4, Shirly Radco (in reply to Rich Megginson from comment #3):

1. For logs, should I use "disable_retry_limit true", so that we do not lose log data?

2. For metrics and logs, I believe we can set "max_retry_wait" to 3600, so the errors will be logged once an hour. This does not mean the retry will not continue in the background according to "retry_limit" and "retry_wait". Right?

3. As far as I can see, the message is only logged when we hit "max_retry_wait". Right?
Comment 5, Rich Megginson (in reply to Shirly Radco from comment #4):

> 1. For logs, should I use "disable_retry_limit true", so that we do not
> lose log data?

Yes.

> 2. For metrics and logs, I believe we can set "max_retry_wait" to 3600, so
> the errors will be logged once an hour.

It means that fluentd will attempt a retry every hour. Once a retry succeeds, fluentd sends logs immediately from then on (until another error is hit and it has to retry again: every 1 second, then every 2 seconds, and so on, until it reaches every 3600 seconds).

> This does not mean the retry will not continue in the background according
> to "retry_limit" and "retry_wait". Right?
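The doubling-then-capped retry schedule described above can be sketched numerically. A minimal illustration (not fluentd's actual implementation), assuming the wait doubles on each attempt and is capped at max_retry_wait:

```python
def retry_waits(retry_wait=1.0, max_retry_wait=3600.0, attempts=14):
    """Seconds to wait before each successive retry attempt."""
    waits, wait = [], retry_wait
    for _ in range(attempts):
        waits.append(min(wait, max_retry_wait))
        wait *= 2  # exponential backoff: double the wait each time
    return waits

# Waits grow 1, 2, 4, 8, ... seconds and are capped at max_retry_wait
# (3600 s), so after roughly a dozen attempts every further retry
# (and its warning) happens once per hour.
print(retry_waits())
```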
If you set `disable_retry_limit true`, then it should ignore "retry_limit".

> 3. As far as I can see, the message is only logged when we hit
> "max_retry_wait". Right?

?

Comment 6, Shirly Radco:

I updated the fluentd configuration to log the errors and warnings once an hour, so fluentd now alerts once an hour. The retries now start at 1 minute and the wait is multiplied on each retry, up to the max_retry_wait of 1 hour.

Comment 7, Lukas Svaty:

Verified in ovirt-engine-metrics-1.1.5.1-1.el7ev.noarch.

Closing note:

This bugzilla is included in the oVirt 4.2.4 release, published on June 26th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.4, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.