Bug 1893992
Summary: | Elasticsearch rollover pods failed with resource_already_exists_exception | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | naygupta |
Component: | Logging | Assignee: | ewolinet |
Status: | CLOSED ERRATA | QA Contact: | Qiaoling Tang <qitang> |
Severity: | high | Docs Contact: | Rolfe Dlugy-Hegwer <rdlugyhe> |
Priority: | low | ||
Version: | 4.5 | CC: | anli, aos-bugs, cruhm, ewolinet, hgomes, periklis, qitang, rdlugyhe |
Target Milestone: | --- | ||
Target Release: | 4.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | logging-exploration | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
* Previously, the Elasticsearch rollover pods failed with `resource_already_exists_exception` because, within the Elasticsearch rollover API, when the next index was created, the `*-write` alias was not updated to point to it. As a result, the next time the rollover API endpoint was triggered for that particular index, it received an error that the resource already existed. The current release fixes this issue: when performing a rollover in the index management cron jobs, if a new index has been created, the job verifies that the alias points to the new index, which prevents the error from recurring on the next execution. If the cluster is already receiving this error, a cron job remediates the issue so that subsequent runs work as expected. Now, performing rollovers no longer produces the exception.
(link:https://bugzilla.redhat.com/show_bug.cgi?id=1893992[*BZ#1893992*])
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-02-24 11:21:19 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1916475 |
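The Doc Text above describes the fix: after a rollover creates the next index, the cron job verifies that the `*-write` alias points to it and repoints it if not. A minimal sketch of that alias repair, assuming illustrative index names and using the request body shape of the Elasticsearch `_aliases` API (this is not the operator's actual code):

```python
import json

# Illustrative names only; the real cron job derives these from the
# managed index set (app, infra, or audit).
alias = "infra-write"
old_index = "infra-000003"
new_index = "infra-000004"

# Repoint the write alias at the newly created index. With the
# Elasticsearch _aliases API both actions run atomically in one request:
# POST /_aliases
fix_alias_body = {
    "actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
    ]
}

print(json.dumps(fix_alias_body, indent=2))
```

Because both actions are applied in a single atomic request, there is no window in which the alias points at no index at all.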
Description
naygupta
2020-11-03 09:22:43 UTC
The one part of this issue in the delete script is solved by the PR in: https://bugzilla.redhat.com/show_bug.cgi?id=1899905. I am continuing investigation for the rollover script.

@naygupta After carefully considering the must-gather contents and investigating the internals of the Elasticsearch Rollover API a little bit more, I can conclude the following:

- The nodes of this cluster look to be under pressure, especially the master node. In addition, GC takes a lot of time according to the logs.

> ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
> 10.128.4.13 32           99          30  19.01   18.50   18.00    mdi       *      elasticsearch-cdm-pve9r608-2
> 10.130.2.5  47           73          14  17.91   15.25   14.62    mdi       -      elasticsearch-cdm-pve9r608-1

- Index rollover via an alias is a multi-step task and asynchronous by nature. This means that when the rollover fails even though a new index was created, the alias update likely failed because your cluster was under pressure. This is a bug that has been reported to Elastic for a long time now [1]. However, it implies some manual intervention for now if you hit this case. For example, take the infra logs rollover job, where the alias `infra-write` currently points to `infra-000003` and fails to roll over to `infra-000004` because of cluster pressure. You should:

- First remove the empty index `infra-000004`
- Adapt the ES cluster CPU/Memory resources

[1] https://github.com/elastic/elasticsearch/issues/30340

Tested with elasticsearch-operator.5.0.0-18; unable to reproduce this issue. Moving to Verified. If you hit this issue, please feel free to reopen it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:0652
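The manual intervention described in the comments above (removing the empty next index so a later rollover can recreate it) can be sketched as the following requests. Index names and paths are illustrative; in practice these would be issued as curl calls against the cluster's Elasticsearch service:

```python
# The index created by the failed rollover (illustrative name, taken
# from the example in the comment above).
empty_index = "infra-000004"

# 1. Confirm the index is really empty before deleting it:
#    GET /<index>/_count
count_request = ("GET", f"/{empty_index}/_count")

# 2. Delete the empty index; the next successful rollover recreates it:
#    DELETE /<index>
delete_request = ("DELETE", f"/{empty_index}")

print(count_request, delete_request)
```

After deleting the index, the comment also recommends raising the Elasticsearch cluster's CPU/memory resources, since the root cause here was an alias update timing out on an overloaded master node.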