Bug 1585666
| Summary: | Some hosts stop reporting data to elasticsearch after a few minutes | | |
| --- | --- | --- | --- |
| Product: | [oVirt] ovirt-engine-metrics | Reporter: | Shirly Radco <sradco> |
| Component: | Generic | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Lukas Svaty <lsvaty> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.1.4.2 | CC: | bugs, rmeggins, sradco |
| Target Milestone: | ovirt-4.2.5 | Flags: | rule-engine: ovirt-4.2+ |
| Target Release: | 1.1.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-metrics-1.1.6 | Doc Type: | Bug Fix |
Doc Text:

Cause: On hosts running many VMs, a single fluentd thread for sending metrics was not enough; the fluentd buffer queue grew and eventually collectd's queue filled up as well.

Consequence: fluentd stopped sending metrics.

Fix: Increased the number of fluentd buffer threads to 5 (see the configuration sketch after the summary table below).

Result: fluentd manages to keep sending metrics on hosts that run many VMs (tested on hosts with 70-120 VMs).
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-31 15:28:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1583573, 1594753 | | |
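To illustrate the kind of change the Doc Text describes, here is a minimal sketch of a fluentd (0.12-style) Elasticsearch output section with several flush threads and retries effectively disabled. The match tag, host, and exact parameter values are assumptions for illustration only, not the configuration actually generated by ovirt-engine-metrics.

    # Hypothetical sketch only -- the real configuration is generated by ovirt-engine-metrics.
    <match collectd.**>                  # assumed tag for collectd metrics
      @type elasticsearch
      host metrics-store.example.com     # hypothetical metrics store address
      port 9200
      flush_interval 10s                 # send once per collection interval
      num_threads 5                      # several flush threads, so one slow or
                                         # high-latency request does not back up the queue
      retry_limit 0                      # do not retry old metric chunks; fresh
                                         # values arrive on the next interval anyway
      buffer_type memory
      buffer_queue_limit 32              # bound the queue so it cannot grow until
                                         # collectd's own queue fills up as well
    </match>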
Description

Shirly Radco 2018-06-04 11:32:42 UTC

Rich, did you encounter a situation where fluentd just stopped sending logs to Elasticsearch due to a high number/rate of logs?

(In reply to Shirly Radco from comment #1)
> Rich, did you encounter a situation where fluentd just stopped sending
> logs to Elasticsearch due to a high number/rate of logs?

Yes. It could be for several reasons.

Any errors in the fluentd logs?

Any errors in the Elasticsearch logs?

Is Elasticsearch overloaded? If so, it may be reporting bulk index rejections - see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_available_thread_pools

1. Determine the name of the project used by the logging components:

       oc get projects | grep logging

   If "openshift-logging" is in that list, use it; otherwise, use "logging". I will refer to this project name as $PROJECT below.

2. Get the name of an Elasticsearch pod:

       oc -n $PROJECT get pods -l component=es

   I will refer to the name of an es pod returned from this command as $espod below.

3. Use:

       oc -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

   to see if there are bulk rejections.

(In reply to Rich Megginson from comment #2)
> [...]
> 3. use
>
> oc -n $PROJECT -c elasticsearch $espod -- es_util
> --query=_cat/thread_pool?v\&h=br

I ran:

    oc -n logging -c elasticsearch logging-es-data-master-99my9lh4-2-tmd7l -- es_util --query=_cat/thread_pool?v\&h=br

but I get an error for this command:

    Error: unknown command "logging-es-data-master-99my9lh4-2-tmd7l" for "oc"
    Run 'oc --help' for usage.

Maybe you were missing exec?

    oc exec -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

If yes, then the result I got is 0.

Yes, I was missing exec, and 0 means no bulk index rejections. So it must be due to something else.

Hi, can you provide reproduction steps?

The fix here was disabling retries for metrics and adding additional threads for sending metrics. The issue was found with a metrics store located in TLV while the hosts were in Brno and had around 90 VMs running on them. The latency and the number of VMs caused the retries to pile up until the fluentd queue was full. Not sure how to reproduce; perhaps only in a scale environment.

For verification it is sufficient to break the connection to the metrics store: we no longer keep old values in fluentd's buffer and we disabled the retry send interval, so instead of storing the data we simply retry on the next interval (10s). This should be tested with connection problems to the metrics store, and the fluentd buffer should not fill up. This is not the case for the logs.
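Putting the three checks from the thread above together, here is a small shell sketch; the project auto-detection, the single-pod assumption, and the variable handling are mine, not taken from the bug report.

    # Hypothetical helper, assuming the cluster-admin context used in the comments above.
    PROJECT=logging
    oc get projects | grep -q openshift-logging && PROJECT=openshift-logging

    # Take the first Elasticsearch pod (assumes at least one pod labelled component=es).
    espod=$(oc -n "$PROJECT" get pods -l component=es -o name | head -n 1)
    espod=${espod#pod/}

    # "br" is the bulk.rejected counter; a value above 0 means Elasticsearch is
    # rejecting bulk index requests, i.e. it is overloaded.
    oc exec -n "$PROJECT" -c elasticsearch "$espod" -- es_util --query=_cat/thread_pool?v\&h=br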
SanityOnly - ovirt-engine-metrics-1.1.6.1-1.el7ev.noarch

This bugzilla is included in oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be resolved in oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.