Bug 1683736
| Summary: | Metrics store downtime causing flood of /var/log/messages logs in engine and hosts | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine-metrics | Reporter: | Ivana Saranova <isaranov> |
| Component: | Generic | Assignee: | Sandro Bonazzola <sbonazzo> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ivana Saranova <isaranov> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | unspecified | CC: | aoconnor, bugs, jzmeskal, lleistne, mtessun, nhosoi, rmeggins, sbonazzo, sradco |
| Target Milestone: | ovirt-4.2.8-4 | Flags: | sradco: ovirt-4.2?; sbonazzo: blocker?; mtessun: planning_ack+; sbonazzo: devel_ack+; lleistne: testing_ack+ |
| Target Release: | 1.2.2.2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rubygem-fluent-plugin-elasticsearch-1.17.2-1.el7, ovirt-engine-metrics-1.2.2.2-1.el7 | Doc Type: | Bug Fix |
| Doc Text: | A previous version of rubygem-fluent-plugin-elasticsearch had a bug that caused /var/log to fill up when the metrics store machine was down. A new build of rubygem-fluent-plugin-elasticsearch, including the upstream fix for this issue, has been made. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-03 07:55:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1692256 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
I believe the summary and description of this bug are not completely correct. It is true that you will experience this situation when metrics are down. But the primary reason for metrics being down in this case is that they ran out of storage. If the file system where metrics store their data fills up, metrics go down and then everything that has been described happens. The root cause, however, is the metrics file system running out of capacity.

Rich, Noriko, what can prevent this in an Rsyslog setup?

(In reply to Shirly Radco from comment #3)
> Rich, Noriko, What can prevent this in Rsyslog setup?

Shirly, I think this is likely a fluentd-specific issue. Rich, please correct me if I'm wrong.

The lengthy error message is printed in this section [1] as "error: e.to_s", which is the accumulated error messages returned from Elasticsearch. [2] is the first part of the error value in the "Jan 27 03:19:38 engine3 fluentd: 2019-01-27 03:19:38 +0100" warning, with newlines added by me to make it easier to read. As seen in [1], fluentd piles each error message up in an error object and dumps the whole object every time it retries. In this snippet [2], the three "create" hashes are almost identical except for the _id suffixes "--f[aYZ]".
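The pile-up just described can be modeled with a short illustrative Ruby sketch (this is not fluentd's actual code; the function name and byte counts are hypothetical):

```ruby
# Illustrative model only: each retry appends the full error text to a
# cumulative error object, and the whole pile is logged again on the
# next attempt, so log volume grows quadratically with retries.
def dump_size(per_retry_bytes, retries)
  piled = 0   # bytes accumulated in the error object so far
  logged = 0  # total bytes written to /var/log/messages
  retries.times do
    piled += per_retry_bytes  # this attempt's bulk-error text is appended
    logged += piled           # the entire pile is dumped on every retry
  end
  logged
end

# A hypothetical 100 KB bulk-error response retried 10 times logs
# 100 KB * (1 + 2 + ... + 10), i.e. 5.5 MB rather than 1 MB.
puts dump_size(100_000, 10)  # prints 5500000
```

Because the cumulative pile is re-logged on every attempt, the logged volume grows quadratically with the number of retries, which is consistent with the log files reaching dozens of GBs.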
[1] - https://github.com/fluent/fluentd/blob/v0.12/lib/fluent/output.rb#L380-L410

[2]

```
error="Unrecognized elasticsearch errors returned, retrying {\"took\"=>61601, \"errors\"=>true, \"items\"=>[
{\"create\"=>{\"_index\"=>\"project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWiNGfpDXmMM53QX--fY\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27] containing [1298] requests]\"}}},
{\"create\"=>{\"_index\"=>\"project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWiNGfpDXmMM53QX--fZ\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27] containing [1298] requests]\"}}},
{\"create\"=>{\"_index\"=>\"project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWiNGfpDXmMM53QX--fa\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27] containing [1298] requests]\"}}},
<<snip>>
```

The omelasticsearch module in rsyslog has a smarter implementation.
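As a sketch of such an rsyslog setup, an omelasticsearch action using bulk indexing and failure retries might look like the following (the server name is a placeholder; `bulkmode`, `retryfailures`, and `errorfile` are documented omelasticsearch parameters — check your rsyslog version's docs before relying on them):

```
module(load="omelasticsearch")

# Placeholder host; point this at the real metrics store.
action(type="omelasticsearch"
       server="metrics-store.example.com"
       serverport="9200"
       bulkmode="on"                 # send records as bulk requests
       retryfailures="on"            # retry items Elasticsearch rejected
       errorfile="/var/log/omelasticsearch.error.log")  # dump of failed records
```

With this setup, per-item Elasticsearch errors are surfaced as counters rather than as re-logged response bodies, which keeps the local log volume bounded.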
(Note: we also use bulkmode: on and retryfailures: on, which work in a similar way to fluentd.) When the retry attempt finally fails, rsyslog dumps the failed records (see [3]). That output has some size, but it is not as large as fluentd's repeated error messages.

[3] - https://github.com/linux-system-roles/logging/blob/master/roles/rsyslog/roles/output_roles/elasticsearch/defaults/main.yaml#L75-L77

Plus, rsyslog is more terse than fluentd. When retryfailures is "on", the Elasticsearch error info is provided to the user in a statistical manner: a set of counters is available. See [4] and search for, e.g., response.other. That said, disk usage in a similar situation should be much lower if rsyslog is used.

[4] - https://www.rsyslog.com/doc/v8-stable/configuration/modules/omelasticsearch.html

We have implemented exponential backoff in fluentd. I'm not sure why this does not work in this case. Can you please advise? How is this handled in OpenShift?

(In reply to Shirly Radco from comment #5)

First, what version of fluent-plugin-elasticsearch are you using? This error was generated in v1.15 but was removed in v1.16. The latest 1.x version is 1.17.2; that is the version in rhlog-1.0-rhel-7.

> We have implemented exponential backoff in fluentd.

Not sure what you mean. If you mean https://docs.fluentd.org/v0.12/articles/out_file#retry_wait,-max_retry_wait then that doesn't really apply to fluent-plugin-elasticsearch. Or rather, it only applies to connection errors.

> I'm not sure why this does not work in this case.
> Can you please advise? How is this handled in OpenShift?

https://github.com/richm/docs/releases/tag/20180510180102

I think the reason we don't see this in OpenShift is that we are using 1.17.2.

Sandro, what version is tagged for the RHV release? We probably need to upgrade fluentd.

(In reply to Shirly Radco from comment #7)
> Sandro, What version is tagged for RHV release?
rubygem-fluent-plugin-elasticsearch-1.9.5.1-1.el7

Opened bug #1692256 for rebasing.

CentOS build is available here: https://cbs.centos.org/koji/buildinfo?buildID=25562

Steps to Reproduce:
1. The metrics-store machine is out of space.
2. Wait for some time.
3. Check the engine's and hosts' /var/log/messages.

Actual results:
Error messages are generated in /var/log/messages, but they are fairly small and are generated slowly enough.

Verified in:
ovirt-engine-4.2.8.6-0.1.el7ev.noarch
rubygem-fluent-plugin-elasticsearch-1.17.2-1.el7.noarch
Created attachment 1539202 [details]
The error message generated on the engine and hosts

Description of problem:
When the metrics-store machine is down, all connected engines and their hosts generate very long error messages in /var/log/messages, making the log file huge (dozens of GBs) and causing the engine and hosts to run out of space very quickly. You have to manually delete the messages file, restart all the machines, and get the metrics-store machine working again to stop this from happening.

Version-Release number of selected component (if applicable):
ovirt-engine-metrics-1.1.8-1.el7ev.noarch
ovirt-engine-4.2.8.2-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. The metrics-store machine is down.
2. Wait for some time.
3. Check the engine's and hosts' disk space.

Actual results:
Engines and hosts are out of space.

Additional info:
The error message that is generated in /var/log/messages is in the attachments.