Description of problem:
Hosts stop reporting data to Elasticsearch after a few minutes. This seems similar to what I saw in the scale testing that we did in the past. The hosts that have this issue have a relatively high number of VMs. In the case that I saw, the host has 116 VMs. Fluentd is taking very high resources, over

Version-Release number of selected component (if applicable):
1.1.4.2

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Rich, Did you encounter a situation where the fluentd just stopped sending logs to Elasticsearch due to high number/rate of logs?
(In reply to Shirly Radco from comment #1)
> Rich, Did you encounter a situation where the fluentd just stopped sending
> logs to Elasticsearch due to high number/rate of logs?

Yes. Could be for several reasons.

Any errors in the fluentd logs?

Any errors in the Elasticsearch logs?

Is Elasticsearch overloaded? If so, it may be reporting bulk index rejections - see
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_available_thread_pools

1. Determine the name of the project used by the logging components:

   oc get projects | grep logging

   If "openshift-logging" is in that list, use it; otherwise, use "logging". I will refer to this project name as $PROJECT below.

2. Get the name of an Elasticsearch pod:

   oc -n $PROJECT get pods -l component=es

   I will refer to the name of an es pod returned from this command as $espod below.

3. Use:

   oc -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

   to see if there are bulk rejections.
(In reply to Rich Megginson from comment #2)
> (In reply to Shirly Radco from comment #1)
> > Rich, Did you encounter a situation where the fluentd just stopped sending
> > logs to Elasticsearch due to high number/rate of logs?
>
> Yes. Could be for several reasons.
>
> Any errors in the fluentd logs?
>
> Any errors in the Elasticsearch logs?
>
> Is Elasticsearch overloaded? If so, it may be reporting bulk index rejections - see
> https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_available_thread_pools
>
> 1. Determine the name of the project used by the logging components:
>
>    oc get projects | grep logging
>
>    If "openshift-logging" is in that list, use it; otherwise, use "logging". I will refer to this project name as $PROJECT below.
>
> 2. Get the name of an Elasticsearch pod:
>
>    oc -n $PROJECT get pods -l component=es
>
>    I will refer to the name of an es pod returned from this command as $espod below.
>
> 3. Use:
>
>    oc -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br
>
>    to see if there are bulk rejections.

I ran:

   oc -n logging -c elasticsearch logging-es-data-master-99my9lh4-2-tmd7l -- es_util --query=_cat/thread_pool?v\&h=br

but I got an error for this command:

   Error: unknown command "logging-es-data-master-99my9lh4-2-tmd7l" for "oc"
   Run 'oc --help' for usage.
Maybe you were missing exec?

   oc exec -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

If yes, then the result I got is 0.
Yes, I was missing exec, and 0 means no bulk index rejections. So it must be due to something else.
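For completeness, here is a minimal sketch of how the same check could be run against every Elasticsearch pod in the project in one go. The loop itself is illustrative and not from this thread; only the oc exec invocation and the es_util _cat/thread_pool query come from the comments above.

   # Illustrative helper: check bulk rejections on all ES pods.
   # Assumes $PROJECT is set as described in comment #2.
   for espod in $(oc -n $PROJECT get pods -l component=es -o jsonpath='{.items[*].metadata.name}'); do
       echo "== $espod =="
       oc exec -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br
   done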
Hi, can you provide reproduction steps?
The fix here was to disable retries for metrics and to add additional threads for sending metrics. The issue was found with a metrics store located in TLV while the hosts were in Brno, with around 90 VMs running on them. The latency and the number of VMs caused the retries to grow until the fluentd queue was full. Not sure how to reproduce. Perhaps only in a scale env.
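As an illustration only, a minimal sketch of what such a change could look like in a fluentd (v0.12-style) buffered output section. retry_limit and num_threads are standard fluentd buffered-output parameters; the output plugin type, match tag, host, paths, and values below are assumptions, and the actual configuration shipped by ovirt-engine-metrics may differ.

   <match ovirt.metrics.**>
     # Output plugin type is an assumption; ovirt-engine-metrics may use a different output.
     @type elasticsearch
     host metrics-store.example.com      # hypothetical metrics store host
     port 9200
     # Do not keep retrying failed sends; drop them and let the next interval send fresh data.
     retry_limit 0
     # Use several flush threads so a slow, high-latency store does not block the queue.
     num_threads 4
     buffer_type file
     buffer_path /var/lib/fluentd/metrics.buffer   # hypothetical buffer path
     flush_interval 10s
   </match>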
For verification it is sufficient to disable the connection to the metrics store: we no longer store old values inside the fluentd buffer, and we disabled the retry send time. With this, we won't store the data, but rather retry on the next interval (10s). This should be tested with connection problems on the metrics store, and the fluentd buffer should not fill up. This is not the case for the logs.
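A rough sketch of one way to exercise this on a host. The buffer path, port, and the iptables approach are assumptions for illustration, not taken from this thread:

   # Block traffic to the metrics store to simulate connection problems (port is an assumption).
   iptables -A OUTPUT -p tcp --dport 9200 -j DROP

   # Watch the fluentd buffer directory; with retries disabled it should stay roughly flat
   # instead of growing until the queue is full.
   watch -n 10 'du -sh /var/lib/fluentd/'

   # Restore connectivity when done.
   iptables -D OUTPUT -p tcp --dport 9200 -j DROP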
SanityOnly - ovirt-engine-metrics-1.1.6.1-1.el7ev.noarch
This bugzilla is included in oVirt 4.2.5 release, published on July 30th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.