Bug 1683736
| Summary: | Metrics store downtime causing flood of /var/log/messages logs in engine and hosts | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine-metrics | Reporter: | Ivana Saranova <isaranov> |
| Component: | Generic | Assignee: | Sandro Bonazzola <sbonazzo> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ivana Saranova <isaranov> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | unspecified | CC: | aoconnor, bugs, jzmeskal, lleistne, mtessun, nhosoi, rmeggins, sbonazzo, sradco |
| Target Milestone: | ovirt-4.2.8-4 | Flags: | sradco: ovirt-4.2?; sbonazzo: blocker?; mtessun: planning_ack+; sbonazzo: devel_ack+; lleistne: testing_ack+ |
| Target Release: | 1.2.2.2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rubygem-fluent-plugin-elasticsearch-1.17.2-1.el7, ovirt-engine-metrics-1.2.2.2-1.el7 | Doc Type: | Bug Fix |
| Doc Text: | A previous version of rubygem-fluent-plugin-elasticsearch had a bug that caused /var/log to fill up when the metrics store machine was down. A new build of rubygem-fluent-plugin-elasticsearch, including the upstream fix for this issue, has been made. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-03 07:55:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1692256 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
I believe the summary and description of this bug are not completely correct. It is true that you will experience this situation when metrics are down. But the primary reason for metrics being down in this case is that they ran out of storage. If the file system where metrics store their data fills up, metrics go down and then everything that has been described happens. The root cause, however, is the metrics file system running out of capacity.

Rich, Noriko, what can prevent this in an Rsyslog setup?

(In reply to Shirly Radco from comment #3)
> Rich, Noriko, What can prevent this in Rsyslog setup?

Shirly, I think this is likely a fluentd-specific issue. Rich, please correct me if I'm wrong.

The lengthy error message is printed in this section [1] as "error: e.to_s", which is the accumulated error messages returned from Elasticsearch. [2] is the first part of the error value in the "Jan 27 03:19:38 engine3 fluentd: 2019-01-27 03:19:38 +0100" warning, with newlines added by me to make it easier to read. As seen in [1], fluentd piles each error message up in an error object and dumps the whole object every time it retries. In this snippet [2], the three "create" hashes are almost identical except for the _id suffixes "--f[aYZ]".
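The pile-up just described can be modeled with a short illustrative Ruby sketch (this is not fluentd's actual code; the function name and byte counts are hypothetical):

```ruby
# Illustrative model only: each retry appends the full error text to a
# cumulative error object, and the whole pile is logged again on the
# next attempt, so log volume grows quadratically with retries.
def dump_size(per_retry_bytes, retries)
  piled = 0   # bytes accumulated in the error object so far
  logged = 0  # total bytes written to /var/log/messages
  retries.times do
    piled += per_retry_bytes  # this attempt's bulk-error text is appended
    logged += piled           # the entire pile is dumped on every retry
  end
  logged
end

# A hypothetical 100 KB bulk-error response retried 10 times logs
# 100 KB * (1 + 2 + ... + 10), i.e. 5.5 MB rather than 1 MB.
puts dump_size(100_000, 10)  # prints 5500000
```

Because the cumulative pile is re-logged on every attempt, the logged volume grows quadratically with the number of retries, which is consistent with the log files reaching dozens of GBs.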
[1] - https://github.com/fluent/fluentd/blob/v0.12/lib/fluent/output.rb#L380-L410

[2]

```
error="Unrecognized elasticsearch errors returned, retrying {\"took\"=>61601, \"errors\"=>true, \"items\"=>[
{\"create\"=>{\"_index\"=>\"project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWiNGfpDXmMM53QX--fY\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27] containing [1298] requests]\"}}},
{\"create\"=>{\"_index\"=>\"project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWiNGfpDXmMM53QX--fZ\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27] containing [1298] requests]\"}}},
{\"create\"=>{\"_index\"=>\"project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27\", \"_type\"=>\"com.redhat.viaq.common\", \"_id\"=>\"AWiNGfpDXmMM53QX--fa\", \"status\"=>503, \"error\"=>{\"type\"=>\"unavailable_shards_exception\", \"reason\"=>\"[project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [project.ovirt-metrics-lleistne-engine3.b9f028bd-18e0-11e9-8359-001a4a013f6b.2019.01.27] containing [1298] requests]\"}}},
<<snip>>
```

The omelasticsearch module in rsyslog has a smarter implementation.
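As a sketch of such an rsyslog setup, an omelasticsearch action using bulk indexing and failure retries might look like the following (the server name is a placeholder; `bulkmode`, `retryfailures`, and `errorfile` are documented omelasticsearch parameters — check your rsyslog version's docs before relying on them):

```
module(load="omelasticsearch")

# Placeholder host; point this at the real metrics store.
action(type="omelasticsearch"
       server="metrics-store.example.com"
       serverport="9200"
       bulkmode="on"                 # send records as bulk requests
       retryfailures="on"            # retry items Elasticsearch rejected
       errorfile="/var/log/omelasticsearch.error.log")  # dump of failed records
```

With this setup, per-item Elasticsearch errors are surfaced as counters rather than as re-logged response bodies, which keeps the local log volume bounded.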
(Note: we also use bulkmode: on and retryfailures: on, which work in a similar way to fluentd.) When the retry attempt finally fails, rsyslog dumps the failed records (see [3]). That output has some size, but it is not as large as fluentd's repeated error messages.

[3] - https://github.com/linux-system-roles/logging/blob/master/roles/rsyslog/roles/output_roles/elasticsearch/defaults/main.yaml#L75-L77

Plus, rsyslog is more terse than fluentd. When retryfailures is "on", the Elasticsearch error info is provided to the user in a statistical manner: a set of counters is available. See [4] and search for, e.g., response.other. That said, disk usage in a similar situation should be much lower if rsyslog is used.

[4] - https://www.rsyslog.com/doc/v8-stable/configuration/modules/omelasticsearch.html

We have implemented exponential backoff in fluentd. I'm not sure why this does not work in this case. Can you please advise? How is this handled in OpenShift?

(In reply to Shirly Radco from comment #5)

First, what version of fluent-plugin-elasticsearch are you using? This error was generated in v1.15 but was removed in v1.16. The latest 1.x version is 1.17.2; that is the version in rhlog-1.0-rhel-7.

> We have implemented exponential backoff in fluentd.

Not sure what you mean. If you mean https://docs.fluentd.org/v0.12/articles/out_file#retry_wait,-max_retry_wait then that doesn't really apply to fluent-plugin-elasticsearch. Or rather, it only applies to connection errors.

> I'm not sure why this does not work in this case.
> Can you please advise? How is this handled in OpenShift?

https://github.com/richm/docs/releases/tag/20180510180102

I think the reason we don't see this in OpenShift is that we are using 1.17.2.

Sandro, what version is tagged for the RHV release? We probably need to upgrade fluentd.

(In reply to Shirly Radco from comment #7)
> Sandro, What version is tagged for RHV release?
rubygem-fluent-plugin-elasticsearch-1.9.5.1-1.el7

Opened bug #1692256 for rebasing.

CentOS build is available here: https://cbs.centos.org/koji/buildinfo?buildID=25562

Steps to Reproduce:
1. The metrics-store machine is out of space.
2. Wait for some time.
3. Check the engine's and hosts' /var/log/messages.

Actual results:
Error messages are generated in /var/log/messages, but they are fairly small and are generated slowly enough.

Verified in:
ovirt-engine-4.2.8.6-0.1.el7ev.noarch
rubygem-fluent-plugin-elasticsearch-1.17.2-1.el7.noarch
Created attachment 1539202 [details]
The error message generated on the engine and hosts

Description of problem:
When the metrics-store machine is down, all connected engines and their hosts generate very long error messages in /var/log/messages, making the log file huge (dozens of GBs) and causing the engine and hosts to run out of space very quickly. You have to manually delete the messages file, restart all the machines, and get the metrics-store machine working again to stop this from happening.

Version-Release number of selected component (if applicable):
ovirt-engine-metrics-1.1.8-1.el7ev.noarch
ovirt-engine-4.2.8.2-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. The metrics-store machine is down.
2. Wait for some time.
3. Check the engine's and hosts' disk space.

Actual results:
Engines and hosts are out of space.

Additional info:
The error message that is generated in /var/log/messages is in the attachments.