Bug 1585666 - Some hosts stop reporting data to elasticsearch after a few minutes
Summary: Some hosts stop reporting data to elasticsearch after a few minutes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-metrics
Classification: oVirt
Component: Generic
Version: 1.1.4.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.5
Target Release: 1.1.6
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks: 1583573 ovirt-engine-metrics-1.1.6
 
Reported: 2018-06-04 11:32 UTC by Shirly Radco
Modified: 2018-07-31 15:28 UTC (History)
3 users

Fixed In Version: ovirt-engine-metrics-1.1.6
Doc Type: Bug Fix
Doc Text:
Cause: On hosts running many VMs, a single fluentd thread for sending metrics was not enough, causing the fluentd queue to grow and eventually collectd's queue as well. Consequence: fluentd stopped sending metrics. Fix: Increased the number of fluentd buffer threads to 5. Result: fluentd manages to send metrics on hosts that run many VMs (tested on hosts with 70-120 VMs).
Clone Of:
Environment:
Last Closed: 2018-07-31 15:28:29 UTC
oVirt Team: Metrics
Embargoed:
rule-engine: ovirt-4.2+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 92195 0 master MERGED metrics: update fluentd metrics buffer 2018-06-14 11:36:10 UTC
oVirt gerrit 92257 0 ovirt-engine-metrics-4.2 MERGED metrics: update fluentd metrics buffer 2018-06-14 14:56:35 UTC

Description Shirly Radco 2018-06-04 11:32:42 UTC
Description of problem:
Hosts stop reporting data to elasticsearch after a few minutes.

This seems similar to what I saw in the scale testing we did in the past.

The hosts that hit this issue run a relatively high number of VMs.
In the case I saw, the host had 116 VMs.

Fluentd is taking very high resources, over 

Version-Release number of selected component (if applicable):
1.1.4.2

How reproducible:


Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Shirly Radco 2018-06-04 11:59:04 UTC
Rich, Did you encounter a situation where the fluentd just stopped sending logs to Elasticsearch due to high number/rate of logs?

Comment 2 Rich Megginson 2018-06-04 15:34:18 UTC
(In reply to Shirly Radco from comment #1)
> Rich, Did you encounter a situation where the fluentd just stopped sending
> logs to Elasticsearch due to high number/rate of logs?

Yes.  Could be for several reasons.

Any errors in the fluentd logs?

Any errors in the Elasticsearch logs?

Is Elasticsearch overloaded?  If so, it may be reporting bulk index rejections - see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html#_available_thread_pools

1. determine the name of the project used by the logging components

oc get projects | grep logging

If "openshift-logging" is in that list, use it, otherwise, use "logging".  I will refer to this project name as $PROJECT below.

2. get the name of an elasticsearch pod

oc -n $PROJECT get pods -l component=es

I will refer to the name of an es pod returned from this command as $espod below.

3. use 

oc -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br

To see if there are bulk rejections
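For reference, that query returns one row per Elasticsearch node containing only the bulk rejected counter (br); any non-zero value means bulk index requests are being rejected. The sample output below is illustrative:

br
 0
 0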

Comment 3 Shirly Radco 2018-06-05 08:54:58 UTC
(In reply to Rich Megginson from comment #2)
> (In reply to Shirly Radco from comment #1)
> > Rich, Did you encounter a situation where the fluentd just stopped sending
> > logs to Elasticsearch due to high number/rate of logs?
> 
> Yes.  Could be for several reasons.
> 
> Any errors in the fluentd logs?
> 
> Any errors in the Elasticsearch logs?
> 
> Is Elasticsearch overloaded?  If so, it may be reporting bulk index
> rejections - see
> https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.
> html#_available_thread_pools
> 
> 1. determine the name of the project used by the logging components
> 
> oc get projects | grep logging
> 
> If "openshift-logging" is in that list, use it, otherwise, use "logging".  I
> will refer to this project name as $PROJECT below.
> 
> 2. get the name of an elasticsearch pod
> 
> oc -n $PROJECT get pods -l component=es
> 
> I will refer to the name of an es pod returned from this command as $espod
> below.
> 
> 3. use 
> 
> oc -n $PROJECT -c elasticsearch $espod -- es_util
> --query=_cat/thread_pool?v\&h=br

I ran:
oc -n logging -c elasticsearch logging-es-data-master-99my9lh4-2-tmd7l -- es_util --query=_cat/thread_pool?v\&h=br

But I get an error for this command:

Error: unknown command "logging-es-data-master-99my9lh4-2-tmd7l" for "oc"
Run 'oc --help' for usage.
> 
> To see if there are bulk rejections

Comment 4 Shirly Radco 2018-06-05 08:58:38 UTC
Maybe you were missing exec?
 
oc exec -n $PROJECT -c elasticsearch $espod -- es_util --query=_cat/thread_pool?v\&h=br


If yes, the result I got is 0.

Comment 5 Rich Megginson 2018-06-05 15:30:22 UTC
Yes, I was missing exec, and 0 means no bulk index rejections,
so it must be due to something else.

Comment 6 Lukas Svaty 2018-07-20 16:30:07 UTC
Hi, can you provide reproduction steps?

Comment 7 Shirly Radco 2018-07-22 07:46:32 UTC
The fix here was to disable retries for metrics and add additional threads for sending metrics.

The issue was found with a metrics store located in TLV while the hosts were in Brno and had around 90 VMs running on them.
The latency and the number of VMs caused the retries to pile up until the fluentd queue was full.
Not sure how to reproduce; perhaps only in a scale environment.
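For illustration, a minimal sketch of the kind of fluentd buffered-output settings this fix refers to. The match block, tag, host, and exact values are assumptions for illustration only, not the shipped ovirt-engine-metrics configuration; num_threads and retry_limit are standard fluentd v0.12 buffered-output parameters:

<match collectd.**>
  @type elasticsearch
  # hypothetical metrics store address and port
  host metrics-store.example.com
  port 9200
  # flush buffered chunks with several parallel threads instead of the default single thread
  num_threads 5
  # do not keep retrying failed metric chunks; fresher samples will follow shortly
  retry_limit 0
  flush_interval 10s
</match>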

Comment 8 Lukas Svaty 2018-07-24 07:25:22 UTC
For verification it is sufficient to disable the connection to the metrics store: we no longer store old values inside the fluentd buffer and have disabled the retry send time.

In this setup we do not store the data, but rather retry in the next interval (10s). This should be tested with connection problems to the metrics store, and the fluentd buffer should not fill up.

This is not the case for the logs.
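A minimal sketch of one way to exercise the check described in comment 8 on a host, assuming the metrics store is reached on the default Elasticsearch port 9200 and that fluentd keeps its file buffer under /var/lib/fluentd (both the port and the path are assumptions; adjust them to the actual deployment):

# simulate connection problems to the metrics store (port 9200 is an assumption)
iptables -A OUTPUT -p tcp --dport 9200 -j DROP

# with retries disabled, the fluentd buffer directory should stay roughly constant
# in size instead of filling up (the buffer path is an assumption)
watch -n 10 'du -sh /var/lib/fluentd'

# restore connectivity when done
iptables -D OUTPUT -p tcp --dport 9200 -j DROP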

Comment 9 Lukas Svaty 2018-07-30 19:12:59 UTC
SanityOnly - ovirt-engine-metrics-1.1.6.1-1.el7ev.noarch

Comment 10 Sandro Bonazzola 2018-07-31 15:28:29 UTC
This bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

