Bug 1893992 - Elasticsearch rollover pods failed with resource_already_exists_exception
Summary: Elasticsearch rollover pods failed with resource_already_exists_exception
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: ewolinet
QA Contact: Qiaoling Tang
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard: logging-exploration
Depends On:
Blocks: 1916475
 
Reported: 2020-11-03 09:22 UTC by naygupta
Modified: 2024-03-25 16:53 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the Elasticsearch rollover pods failed with `resource_already_exists_exception` because, when the Elasticsearch rollover API created the next index, the `*-write` alias was not updated to point to it. As a result, the next time the rollover API endpoint was triggered for that particular index, it received an error that the resource already existed. The current release fixes this issue: when the indexmanagement cronjobs perform a rollover and a new index has been created, they verify that the alias points to the new index, which prevents the error on the next execution. If the cluster is already receiving this error, a cronjob fixes the issue so that subsequent runs work as expected. As a result, performing rollovers no longer produces the exception. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1893992[*BZ#1893992*])
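For illustration only, a minimal Python sketch of the alias check described above. This is not the operator's actual indexmanagement script; the endpoint, index, and alias names are hypothetical examples, and the real cronjobs reach Elasticsearch over TLS inside the cluster.

import json
import urllib.request

ES = "http://localhost:9200"                   # hypothetical, unauthenticated endpoint
ALIAS, NEW_INDEX = "app-write", "app-000004"   # example names matching the logs below

def es(path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(ES + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # GET without a body, POST with one
        return json.load(resp)

# Indices that currently carry the write alias.
current = set(es("/_alias/" + ALIAS))

# If an earlier rollover already created NEW_INDEX but never moved the alias,
# re-point the alias so the next rollover does not try to recreate the index
# and fail with resource_already_exists_exception.
if NEW_INDEX not in current:
    actions = [{"remove": {"index": i, "alias": ALIAS}} for i in current]
    actions.append({"add": {"index": NEW_INDEX, "alias": ALIAS, "is_write_index": True}})
    es("/_aliases", {"actions": actions})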
Clone Of:
Environment:
Last Closed: 2021-02-24 11:21:19 UTC
Target Upstream Version:
Embargoed:




Links:
- Github openshift/elasticsearch-operator pull 618 (closed): Bug 1893992: Improve elasticsearch indexmanagement rollover script (last updated 2021-02-15 19:19:59 UTC)
- Red Hat Product Errata RHBA-2021:0652 (last updated 2021-02-24 11:22:12 UTC)

Description naygupta 2020-11-03 09:22:43 UTC
Description of problem:
Elasticsearch rollover pods failed with resource_already_exists_exception

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.5

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Elasticsearch rollover pod failures occur constantly.

$ parallel oc -n openshift-logging logs --prefix {} ::: elasticsearch-delete-app-1603918800-jn8xk elasticsearch-delete-audit-1603918800-sdhns elasticsearch-delete-infra-1603918800-bwf9h elasticsearch-rollover-app-1603964700-zkmmg elasticsearch-rollover-audit-1603964700-nx8pq elasticsearch-rollover-infra-1603964700-jghrf
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement] Traceback (most recent call last):
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]   File "<string>", line 2, in <module>
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]   File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]     **kw)
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]     return _default_decoder.decode(s)
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement]     raise ValueError("No JSON object could be decoded")
[pod/elasticsearch-delete-app-1603918800-jn8xk/indexmanagement] ValueError: No JSON object could be decoded
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement] Traceback (most recent call last):
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]   File "<string>", line 2, in <module>
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]   File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]     **kw)
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]     return _default_decoder.decode(s)
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement]     raise ValueError("No JSON object could be decoded")
[pod/elasticsearch-delete-audit-1603918800-sdhns/indexmanagement] ValueError: No JSON object could be decoded
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement] Traceback (most recent call last):
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]   File "<string>", line 2, in <module>
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]   File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]     **kw)
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]     return _default_decoder.decode(s)
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement]     raise ValueError("No JSON object could be decoded")
[pod/elasticsearch-delete-infra-1603918800-bwf9h/indexmanagement] ValueError: No JSON object could be decoded
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement] {
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]   "error" : {
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]     "root_cause" : [
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]       {
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]         "type" : "resource_already_exists_exception",
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]         "reason" : "index [app-000004/36SzdIGFS0aQMqN3dIOxxQ] already exists",
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]         "index_uuid" : "36SzdIGFS0aQMqN3dIOxxQ",
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]         "index" : "app-000004"
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]       }
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]     ],
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]     "type" : "resource_already_exists_exception",
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]     "reason" : "index [app-000004/36SzdIGFS0aQMqN3dIOxxQ] already exists",
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]     "index_uuid" : "36SzdIGFS0aQMqN3dIOxxQ",
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]     "index" : "app-000004"
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]   },
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement]   "status" : 400
[pod/elasticsearch-rollover-app-1603964700-zkmmg/indexmanagement] }
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement] {
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]   "error" : {
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]     "root_cause" : [
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]       {
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]         "type" : "resource_already_exists_exception",
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]         "reason" : "index [audit-000002/jt_t-wDtQ2-X8h_qO_1cDw] already exists",
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]         "index_uuid" : "jt_t-wDtQ2-X8h_qO_1cDw",
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]         "index" : "audit-000002"
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]       }
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]     ],
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]     "type" : "resource_already_exists_exception",
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]     "reason" : "index [audit-000002/jt_t-wDtQ2-X8h_qO_1cDw] already exists",
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]     "index_uuid" : "jt_t-wDtQ2-X8h_qO_1cDw",
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]     "index" : "audit-000002"
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]   },
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement]   "status" : 400
[pod/elasticsearch-rollover-audit-1603964700-nx8pq/indexmanagement] }
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement] {
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]   "error" : {
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]     "root_cause" : [
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]       {
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]         "type" : "resource_already_exists_exception",
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]         "reason" : "index [infra-000004/VFr9HBz9QD6fWWq69HDlNA] already exists",
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]         "index_uuid" : "VFr9HBz9QD6fWWq69HDlNA",
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]         "index" : "infra-000004"
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]       }
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]     ],
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]     "type" : "resource_already_exists_exception",
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]     "reason" : "index [infra-000004/VFr9HBz9QD6fWWq69HDlNA] already exists",
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]     "index_uuid" : "VFr9HBz9QD6fWWq69HDlNA",
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]     "index" : "infra-000004"
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]   },
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement]   "status" : 400
[pod/elasticsearch-rollover-infra-1603964700-jghrf/indexmanagement] }
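The delete-pod tracebacks above appear to come from an inline Python helper (note the `File "<string>"` frames) that JSON-decodes a response from Elasticsearch; when the body is empty or not valid JSON, for example because the underlying request failed, json.load raises ValueError. A hedged sketch of a more defensive decode, not the shipped helper:

import json
import sys

def load_response(stream):
    # Decode an Elasticsearch JSON response read from a stream.
    # Return None instead of raising when the body is empty or not JSON,
    # e.g. when the HTTP request failed before any JSON was written.
    raw = stream.read().strip()
    if not raw:
        return None
    try:
        return json.loads(raw)
    except ValueError:  # "No JSON object could be decoded" on Python 2
        return None

response = load_response(sys.stdin)
if response is None:
    sys.exit("empty or non-JSON response from Elasticsearch")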

The conditions causing these failures are intermittent; often, when a job fails, a subsequent job completes successfully:

$ oc -n openshift-logging get pod | grep elasticsearch-
elasticsearch-cdm-pve9r608-1-76b7c8b6f7-s8x7x   2/2     Running             1          14h
elasticsearch-cdm-pve9r608-2-67d7787cb4-8wzxn   2/2     Running             2          14h
elasticsearch-cdm-pve9r608-3-669fb8b647-9fkmt   0/2     ContainerCreating   0          12h
elasticsearch-delete-app-1603918800-jn8xk       0/1     Error               0          13h
elasticsearch-delete-app-1603965600-pvf2q       0/1     Completed           0          105s
elasticsearch-delete-audit-1603918800-sdhns     0/1     Error               0          13h
elasticsearch-delete-audit-1603965600-x77tg     0/1     Completed           0          105s
elasticsearch-delete-infra-1603918800-bwf9h     0/1     Error               0          13h
elasticsearch-delete-infra-1603965600-r9lqq     0/1     Completed           0          105s
elasticsearch-rollover-app-1603965600-m5w7q     0/1     Error               0          105s
elasticsearch-rollover-audit-1603965600-85wbr   0/1     Error               0          105s
elasticsearch-rollover-infra-1603965600-wpqvx   0/1     Error               0          104s

Comment 3 Periklis Tsirakidis 2020-11-25 07:59:53 UTC
The part of this issue that affects the delete script is solved by the PR tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1899905

I am continuing to investigate the rollover script.

Comment 4 Periklis Tsirakidis 2020-11-25 14:56:55 UTC
@naygupta

After carefully reviewing the must-gather contents and investigating the behavior of the Elasticsearch Rollover API a bit further, I can conclude the following:

- The nodes of this cluster appear to be under pressure, especially the master node. In addition, GC takes a lot of time according to the logs.

> ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
> 10.128.4.13           32          99  30   19.01   18.50    18.00 mdi       *      elasticsearch-cdm-pve9r608-2
> 10.130.2.5            47          73  14   17.91   15.25    14.62 mdi       -      elasticsearch-cdm-pve9r608-1

- The index rollover via an alias is a multi-step task and asynchronous by nature. This means that when the rollover fails even though a new index was created, the alias update most likely failed because your cluster was under pressure. This bug has been reported to Elastic for a long time now [1]. For the moment, however, it requires some manual intervention if you hit this case:

For example, take the infra logs rollover job, where the `infra-write` alias currently points to `infra-000003` and the rollover to `infra-000004` fails because of cluster pressure. In that case you should:
- First remove the empty index `infra-000004` (see the sketch below)
- Then adapt the ES cluster CPU/memory resources

[1] https://github.com/elastic/elasticsearch/issues/30340
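For the first manual step above (removing the empty index), a rough Python sketch with the same hypothetical-endpoint caveats as earlier; it checks that the stray index is really empty before deleting it:

import json
import urllib.request

ES = "http://localhost:9200"    # hypothetical; real access goes through the secured ES service
EMPTY_INDEX = "infra-000004"    # the index left behind by the failed rollover

def es(path, method="GET"):
    req = urllib.request.Request(ES + path, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only delete the index if it holds no documents.
count = es("/%s/_count" % EMPTY_INDEX)["count"]
if count == 0:
    es("/" + EMPTY_INDEX, method="DELETE")
    print("deleted empty index %s" % EMPTY_INDEX)
else:
    print("refusing to delete %s: it contains %d documents" % (EMPTY_INDEX, count))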

Comment 9 Qiaoling Tang 2021-01-22 08:57:28 UTC
Tested with elasticsearch-operator.5.0.0-18; unable to reproduce this issue. Moving to VERIFIED. If you hit this issue, please feel free to reopen it.

Comment 13 errata-xmlrpc 2021-02-24 11:21:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

