Bug 1893992
Summary: | Elasticsearch rollover pods failed with resource_already_exists_exception | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | naygupta |
Component: | Logging | Assignee: | ewolinet |
Status: | CLOSED ERRATA | QA Contact: | Qiaoling Tang <qitang> |
Severity: | high | Docs Contact: | Rolfe Dlugy-Hegwer <rdlugyhe> |
Priority: | low | ||
Version: | 4.5 | CC: | anli, aos-bugs, cruhm, ewolinet, hgomes, periklis, qitang, rdlugyhe |
Target Milestone: | --- | ||
Target Release: | 4.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | logging-exploration | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
* Previously, the Elasticsearch rollover pods failed with `resource_already_exists_exception` because, within the Elasticsearch rollover API, when the next index was created, the `*-write` alias was not updated to point to it. As a result, the next time the rollover API endpoint was triggered for that particular index, it received an error that the resource already existed. The current release fixes this issue: when performing a rollover in the index management cron jobs, if a new index has been created, the job verifies that the alias points to the new index, which prevents the error from recurring on the next execution. If the cluster is already receiving this error, a cron job remediates the issue so that subsequent runs work as expected. Now, performing rollovers no longer produces the exception.
(link:https://bugzilla.redhat.com/show_bug.cgi?id=1893992[*BZ#1893992*])
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-02-24 11:21:19 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1916475 |
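The Doc Text above describes the fix: after a rollover creates the next index, the cron job verifies that the `*-write` alias points to it and repoints it if not. A minimal sketch of that alias repair, assuming illustrative index names and using the request body shape of the Elasticsearch `_aliases` API (this is not the operator's actual code):

```python
import json

# Illustrative names only; the real cron job derives these from the
# managed index set (app, infra, or audit).
alias = "infra-write"
old_index = "infra-000003"
new_index = "infra-000004"

# Repoint the write alias at the newly created index. With the
# Elasticsearch _aliases API both actions run atomically in one request:
# POST /_aliases
fix_alias_body = {
    "actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
    ]
}

print(json.dumps(fix_alias_body, indent=2))
```

Because both actions are applied in a single atomic request, there is no window in which the alias points at no index at all.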
Description
naygupta
2020-11-03 09:22:43 UTC
The one part of this issue in the delete script is solved by the PR in: https://bugzilla.redhat.com/show_bug.cgi?id=1899905. I am continuing investigation for the rollover script.

@naygupta After carefully considering the must-gather contents and investigating the internals of the Elasticsearch Rollover API a little bit more, I can conclude the following:

- The nodes of this cluster look to be under pressure, especially the master node. In addition, GC takes a lot of time according to the logs.

> ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
> 10.128.4.13 32           99          30  19.01   18.50   18.00    mdi       *      elasticsearch-cdm-pve9r608-2
> 10.130.2.5  47           73          14  17.91   15.25   14.62    mdi       -      elasticsearch-cdm-pve9r608-1

- Index rollover via an alias is a multi-step task and asynchronous by nature. This means that when the rollover fails even though a new index was created, the alias update likely failed because your cluster was under pressure. This is a bug that has been reported to Elastic for a long time now [1]. However, it implies some manual intervention for now if you hit this case. For example, take the infra logs rollover job, where the alias `infra-write` currently points to `infra-000003` and fails to roll over to `infra-000004` because of cluster pressure. You should:

- First remove the empty index `infra-000004`
- Adapt the ES cluster CPU/Memory resources

[1] https://github.com/elastic/elasticsearch/issues/30340

Tested with elasticsearch-operator.5.0.0-18; unable to reproduce this issue. Moving to Verified. If you hit this issue, please feel free to reopen it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:0652
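The manual intervention described in the comments above (removing the empty next index so a later rollover can recreate it) can be sketched as the following requests. Index names and paths are illustrative; in practice these would be issued as curl calls against the cluster's Elasticsearch service:

```python
# The index created by the failed rollover (illustrative name, taken
# from the example in the comment above).
empty_index = "infra-000004"

# 1. Confirm the index is really empty before deleting it:
#    GET /<index>/_count
count_request = ("GET", f"/{empty_index}/_count")

# 2. Delete the empty index; the next successful rollover recreates it:
#    DELETE /<index>
delete_request = ("DELETE", f"/{empty_index}")

print(count_request, delete_request)
```

After deleting the index, the comment also recommends raising the Elasticsearch cluster's CPU/memory resources, since the root cause here was an alias update timing out on an overloaded master node.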