Bug 1585666 - Some hosts stop reporting data to elasticsearch after a few minutes
Summary: Some hosts stop reporting data to elasticsearch after a few minutes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-metrics
Classification: oVirt
Component: Generic
Version: 1.1.4.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.5
Target Release: 1.1.6
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks: 1583573 ovirt-engine-metrics-1.1.6
 
Reported: 2018-06-04 11:32 UTC by Shirly Radco
Modified: 2018-07-31 15:28 UTC (History)
3 users

Fixed In Version: ovirt-engine-metrics-1.1.6
Doc Type: Bug Fix
Doc Text:
Cause: On hosts running many VMs, a single fluentd thread for sending metrics was not enough, causing the fluentd queue to grow and eventually collectd's queue as well. Consequence: fluentd stopped sending metrics. Fix: Increased the number of fluentd buffer threads to 5. Result: fluentd manages to send metrics on hosts that run many VMs (tested on hosts with 70-120 VMs).
Clone Of:
Environment:
Last Closed: 2018-07-31 15:28:29 UTC
oVirt Team: Metrics
Embargoed:
rule-engine: ovirt-4.2+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 92195 0 master MERGED metrics: update fluentd metrics buffer 2018-06-14 11:36:10 UTC
oVirt gerrit 92257 0 ovirt-engine-metrics-4.2 MERGED metrics: update fluentd metrics buffer 2018-06-14 14:56:35 UTC

Description Shirly Radco 2018-06-04 11:32:42 UTC
Description of problem:
Hosts stop reporting data to elasticsearch after a few minutes.

This seems similar to what I saw in the scale testing we did in the past.

The hosts that hit this issue run a relatively high number of VMs.
In the case I saw, the host had 116 VMs.

Fluentd is taking very high resources, over 

Version-Release number of selected component (if applicable):
1.1.4.2

How reproducible:


Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Shirly Radco 2018-06-04 11:59:04 UTC
Rich, Did you encounter a situation where the fluentd just stopped sending logs to Elasticsearch due to high number/rate of logs?

Comment 2 Rich Megginson 2018-06-04 15:34:18 UTC
(In reply to Shirly Radco from comment #1)
> Rich, Did you encounter a situation where the fluentd just stopped sending
> logs to Elasticsearch due to high number/rate of logs?

Yes.  Could be for several reasons.

Any errors in the fluentd logs?

Any errors in the Elasticsearch logs?

Is Elasticsearch overloaded?  If so, it may be reporting bulk index rejections - see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_available_thread_pools

1. determine the name of the project used by the logging components

oc get projects | grep logging

If "openshift-logging" is in that list, use it, otherwise, use "logging".  I will refer to this project name as $PROJECT below.

2. get the name of an elasticsearch pod

oc -n $PROJECT get pods -l component=es

I will refer to the name of an es pod returned from this command as $espod below.

3. use 

oc -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

To see if there are bulk rejections
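For reference, that query returns one row per Elasticsearch node containing only the bulk rejected counter (br); any non-zero value means bulk index requests are being rejected. The sample output below is illustrative:

br
 0
 0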

Comment 3 Shirly Radco 2018-06-05 08:54:58 UTC
(In reply to Rich Megginson from comment #2)
> (In reply to Shirly Radco from comment #1)
> > Rich, Did you encounter a situation where the fluentd just stopped sending
> > logs to Elasticsearch due to high number/rate of logs?
> 
> Yes.  Could be for several reasons.
> 
> Any errors in the fluentd logs?
> 
> Any errors in the Elasticsearch logs?
> 
> Is Elasticsearch overloaded?  If so, it may be reporting bulk index
> rejections - see
> https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.
> html#_available_thread_pools
> 
> 1. determine the name of the project used by the logging components
> 
> oc get projects | grep logging
> 
> If "openshift-logging" is in that list, use it, otherwise, use "logging".  I
> will refer to this project name as $PROJECT below.
> 
> 2. get the name of an elasticsearch pod
> 
> oc -n $PROJECT get pods -l component=es
> 
> I will refer to the name of an es pod returned from this command as $espod
> below.
> 
> 3. use 
> 
> oc -n $PROJECT -c elasticsearch $espod -- es_util
> --query=_cat/thread_pool?v\&h=br

I ran:
oc -n logging -c elasticsearch logging-es-data-master-99my9lh4-2-tmd7l -- es_util --query=_cat/thread_pool?v\&h=br

But I get an error for this command:

Error: unknown command "logging-es-data-master-99my9lh4-2-tmd7l" for "oc"
Run 'oc --help' for usage.
> 
> To see if there are bulk rejections

Comment 4 Shirly Radco 2018-06-05 08:58:38 UTC
Maybe you were missing exec?
 
oc exec -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br


If yes, the result I got is 0.

Comment 5 Rich Megginson 2018-06-05 15:30:22 UTC
Yes, I was missing exec, and 0 means no bulk index rejections,
so it must be due to something else.

Comment 6 Lukas Svaty 2018-07-20 16:30:07 UTC
Hi, can you provide reproduction steps?

Comment 7 Shirly Radco 2018-07-22 07:46:32 UTC
The fix here was to disable retries for metrics and add additional threads for sending metrics.

The issue was found with a metrics store located in TLV while the hosts were in Brno and had around 90 VMs running on them.
The latency and the number of VMs caused the retries to pile up until the fluentd queue was full.
Not sure how to reproduce; perhaps only in a scale environment.
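For illustration, a minimal sketch of the kind of fluentd buffered-output settings this fix refers to. The match block, tag, host, and exact values are assumptions for illustration only, not the shipped ovirt-engine-metrics configuration; num_threads and retry_limit are standard fluentd v0.12 buffered-output parameters:

<match collectd.**>
  @type elasticsearch
  # hypothetical metrics store address and port
  host metrics-store.example.com
  port 9200
  # flush buffered chunks with several parallel threads instead of the default single thread
  num_threads 5
  # do not keep retrying failed metric chunks; fresher samples will follow shortly
  retry_limit 0
  flush_interval 10s
</match>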

Comment 8 Lukas Svaty 2018-07-24 07:25:22 UTC
For verification it is sufficient to disable the connection to the metrics store: we no longer store old values inside the fluentd buffer and have disabled the retry send time.

In this setup we do not store the data, but rather retry in the next interval (10s). This should be tested with connection problems to the metrics store, and the fluentd buffer should not fill up.

This is not the case for the logs.
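A minimal sketch of one way to exercise the check described in comment 8 on a host, assuming the metrics store is reached on the default Elasticsearch port 9200 and that fluentd keeps its file buffer under /var/lib/fluentd (both the port and the path are assumptions; adjust them to the actual deployment):

# simulate connection problems to the metrics store (port 9200 is an assumption)
iptables -A OUTPUT -p tcp --dport 9200 -j DROP

# with retries disabled, the fluentd buffer directory should stay roughly constant
# in size instead of filling up (the buffer path is an assumption)
watch -n 10 'du -sh /var/lib/fluentd'

# restore connectivity when done
iptables -D OUTPUT -p tcp --dport 9200 -j DROP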

Comment 9 Lukas Svaty 2018-07-30 19:12:59 UTC
SanityOnly - ovirt-engine-metrics-1.1.6.1-1.el7ev.noarch

Comment 10 Sandro Bonazzola 2018-07-31 15:28:29 UTC
This bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

