Bug 1929688

Summary: Sometimes the elasticsearch-delete-xxx job failed at "Unexpected exception indices:admin/aliases/get" - OCP 4.6.16
Product: OpenShift Container Platform
Reporter: Victor Hernando <vhernand>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Qiaoling Tang <qitang>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.6
CC: anli, aos-bugs, dahernan, ewolinet, ocasalsa, qitang, shishika, vjaypurk
Target Milestone: ---
Target Release: 4.6.z
Hardware: x86_64
OS: Linux
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, while under load, Elasticsearch responded to some requests with an HTTP 500 error, even though there was nothing wrong with the cluster. Retrying the request was successful. This release fixes the issue by updating the cron jobs to be more resilient when encountering temporary HTTP 500 errors. Now, they retry a request multiple times before failing. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1929688[*BZ#1929688*])
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-30 16:54:58 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
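
The Doc Text above describes the fix as retrying a request several times before failing when Elasticsearch returns a temporary HTTP 500. The following is a minimal sketch of that retry pattern, not the actual code shipped in the cron jobs. The function name `get_with_retry`, the `send` callable, and the default attempt count and delay are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Tuple


def get_with_retry(
    send: Callable[[], Tuple[int, str]],
    attempts: int = 5,          # hypothetical default, not taken from the fix
    delay_seconds: float = 1.0,  # hypothetical default, not taken from the fix
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-500 status or attempts run out.

    send is any callable that performs one request and returns
    (http_status, response_body). The last response is returned either way,
    so the caller can still log the error body after the final failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    status, body = 0, ""
    for attempt in range(1, attempts + 1):
        status, body = send()
        if status != 500:
            # Success or a non-transient error: stop retrying.
            return status, body
        if attempt < attempts:
            # Transient HTTP 500: wait briefly before the next try.
            sleep(delay_seconds)
    return status, body
```

With a wrapper like this, a request that fails with `"status":500` once or twice and then succeeds would no longer fail the whole job; only a request that returns 500 on every attempt would.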

Comment 7 David Hernández Fernández 2021-02-24 09:47:35 UTC
Same here; let us know if you need anything else. This is on OCP 4.6.16 with the latest logging CSV.
{"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
Error while attemping to determine the active write alias: {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
{"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
Error while attemping to determine the active write alias: {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}

Comment 16 Qiaoling Tang 2021-03-25 08:46:18 UTC
Tested with elasticsearch-operator.4.6.0-202103202154.p0. I set the index management cronjobs to run every 3 minutes, and the ES cluster has been running for about 29 hours with no job failures.

$ oc get pod
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-6f66778f94-7zpmh       1/1     Running            0          29h
elasticsearch-cdm-kbvuvj7o-1-5989bcf7c4-vkxrc   2/2     Running            0          29h
elasticsearch-cdm-kbvuvj7o-2-57468594c7-5n8kf   2/2     Running            0          29h
elasticsearch-cdm-kbvuvj7o-3-5df4bc888d-5dx8h   2/2     Running            0          29h
elasticsearch-im-app-1616659740-dx989           0/1     Completed          0          79s
elasticsearch-im-audit-1616659740-p26qw         0/1     Completed          0          79s
elasticsearch-im-infra-1616659740-swdt7         0/1     Completed          0          79s
fluentd-bsjzw                                   1/1     Running            0          29h
fluentd-fsl9g                                   1/1     Running            0          29h
fluentd-pjqzd                                   1/1     Running            0          29h
fluentd-rdfkt                                   1/1     Running            0          29h
fluentd-tv9hh                                   1/1     Running            0          29h
fluentd-v6w9f                                   1/1     Running            0          29h
kibana-8685fbf674-c9fct                         2/2     Running            0          29h

Moving this BZ to VERIFIED.

Comment 18 errata-xmlrpc 2021-03-30 16:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0954