Bug 1585666
| Summary: | Some hosts stop reporting data to elasticsearch after a few minutes | | |
| --- | --- | --- | --- |
| Product: | [oVirt] ovirt-engine-metrics | Reporter: | Shirly Radco <sradco> |
| Component: | Generic | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Lukas Svaty <lsvaty> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.1.4.2 | CC: | bugs, rmeggins, sradco |
| Target Milestone: | ovirt-4.2.5 | Flags: | rule-engine: ovirt-4.2+ |
| Target Release: | 1.1.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-metrics-1.1.6 | Doc Type: | Bug Fix |
Doc Text:

Cause: On hosts running many VMs, a single fluentd thread for sending metrics was not enough; the fluentd buffer queue grew and eventually collectd's queue filled up as well.

Consequence: fluentd stopped sending metrics.

Fix: Increased the number of fluentd buffer threads to 5 (see the configuration sketch after the summary table below).

Result: fluentd manages to keep sending metrics on hosts that run many VMs (tested on hosts with 70-120 VMs).
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-31 15:28:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1583573, 1594753 | | |
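To illustrate the kind of change the Doc Text describes, here is a minimal sketch of a fluentd (0.12-style) Elasticsearch output section with several flush threads and retries effectively disabled. The match tag, host, and exact parameter values are assumptions for illustration only, not the configuration actually generated by ovirt-engine-metrics.

    # Hypothetical sketch only -- the real configuration is generated by ovirt-engine-metrics.
    <match collectd.**>                  # assumed tag for collectd metrics
      @type elasticsearch
      host metrics-store.example.com     # hypothetical metrics store address
      port 9200
      flush_interval 10s                 # send once per collection interval
      num_threads 5                      # several flush threads, so one slow or
                                         # high-latency request does not back up the queue
      retry_limit 0                      # do not retry old metric chunks; fresh
                                         # values arrive on the next interval anyway
      buffer_type memory
      buffer_queue_limit 32              # bound the queue so it cannot grow until
                                         # collectd's own queue fills up as well
    </match>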
Description

Shirly Radco 2018-06-04 11:32:42 UTC

Rich, did you encounter a situation where fluentd just stopped sending logs to Elasticsearch due to a high number/rate of logs?

(In reply to Shirly Radco from comment #1)
> Rich, did you encounter a situation where fluentd just stopped sending
> logs to Elasticsearch due to a high number/rate of logs?

Yes. It could be for several reasons.

Any errors in the fluentd logs?

Any errors in the Elasticsearch logs?

Is Elasticsearch overloaded? If so, it may be reporting bulk index rejections - see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_available_thread_pools

1. Determine the name of the project used by the logging components:

       oc get projects | grep logging

   If "openshift-logging" is in that list, use it; otherwise, use "logging". I will refer to this project name as $PROJECT below.

2. Get the name of an Elasticsearch pod:

       oc -n $PROJECT get pods -l component=es

   I will refer to the name of an es pod returned from this command as $espod below.

3. Use:

       oc -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

   to see if there are bulk rejections.

(In reply to Rich Megginson from comment #2)
> [...]
> 3. use
>
> oc -n $PROJECT -c elasticsearch $espod -- es_util
> --query=_cat/thread_pool?v\&h=br

I ran:

    oc -n logging -c elasticsearch logging-es-data-master-99my9lh4-2-tmd7l -- es_util --query=_cat/thread_pool?v\&h=br

but I get an error for this command:

    Error: unknown command "logging-es-data-master-99my9lh4-2-tmd7l" for "oc"
    Run 'oc --help' for usage.

Maybe you were missing exec?

    oc exec -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

If yes, then the result I got is 0.

Yes, I was missing exec, and 0 means no bulk index rejections. So it must be due to something else.

Hi, can you provide reproduction steps?

The fix here was disabling retries for metrics and adding additional threads for sending metrics. The issue was found with a metrics store located in TLV while the hosts were in Brno and had around 90 VMs running on them. The latency and the number of VMs caused the retries to pile up until the fluentd queue was full. Not sure how to reproduce; perhaps only in a scale environment.

For verification it is sufficient to break the connection to the metrics store: we no longer keep old values in fluentd's buffer and we disabled the retry send interval, so instead of storing the data we simply retry on the next interval (10s). This should be tested with connection problems to the metrics store, and the fluentd buffer should not fill up. This is not the case for the logs.
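Putting the three checks from the thread above together, here is a small shell sketch; the project auto-detection, the single-pod assumption, and the variable handling are mine, not taken from the bug report.

    # Hypothetical helper, assuming the cluster-admin context used in the comments above.
    PROJECT=logging
    oc get projects | grep -q openshift-logging && PROJECT=openshift-logging

    # Take the first Elasticsearch pod (assumes at least one pod labelled component=es).
    espod=$(oc -n "$PROJECT" get pods -l component=es -o name | head -n 1)
    espod=${espod#pod/}

    # "br" is the bulk.rejected counter; a value above 0 means Elasticsearch is
    # rejecting bulk index requests, i.e. it is overloaded.
    oc exec -n "$PROJECT" -c elasticsearch "$espod" -- es_util --query=_cat/thread_pool?v\&h=br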
SanityOnly - ovirt-engine-metrics-1.1.6.1-1.el7ev.noarch

This bugzilla is included in oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be resolved in oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.