Bug 1896578

Summary: elasticsearch-rollover and elasticsearch-delete pods in Error states [4.5 backport]
Product: OpenShift Container Platform
Reporter: Courtney Ruhm <cruhm>
Component: Logging
Assignee: Periklis Tsirakidis <periklis>
Status: CLOSED DUPLICATE
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: high
Version: 4.5
CC: akhaire, alchan, anisal, aos-bugs, benjamin.hunt, hgomes, jcantril, john.johansson, mrobson, naoto30, periklis, ssadhale
Target Milestone: ---
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-30 17:24:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1866019
Bug Blocks:

Description Courtney Ruhm 2020-11-10 23:26:13 UTC
Description of problem:

This is a confirmed bug (1866019) that was fixed in 4.6.1. IBM has customers requesting a backport to 4.5 and is wondering if this is possible.

Version-Release number of selected component (if applicable):

4.5

How reproducible:

The bug has already been confirmed and is reproducible in 4.5.

Comment 7 Jeff Cantrill 2020-11-18 21:14:33 UTC
(In reply to hgomes from comment #2)
> I have a customer with the following symptom:
> 
> - The jobs fail every 5h, and when they cannot clean up the indexes, Kibana
> stops working, so we are not able to see the dashboards and logs.

Is there evidence this legitimately fails every 5 hours, like clockwork?

> ## Jobs
> oc get pods
> NAME                                            READY   STATUS      RESTARTS   AGE
> 39h
> elasticsearch-delete-app-1604575800-kfhd8       0/1     Completed   0          56s
> elasticsearch-delete-audit-1604557800-m6r8z     0/1     Error       0          5h
> elasticsearch-delete-audit-1604575800-cq988     0/1     Completed   0          56s
> elasticsearch-delete-infra-1604575800-ln85q     0/1     Completed   0          56s
> elasticsearch-rollover-app-1604575800-fctn5     0/1     Completed   0          56s
> elasticsearch-rollover-audit-1604561400-tcjpt   0/1     Error       0          4h
> elasticsearch-rollover-audit-1604575800-8wq2m   0/1     Completed   0          56s
> elasticsearch-rollover-infra-1604575800-rvqnz   0/1     Completed   0          55s

These jobs run by default every 15 minutes. As evidenced here, at least one subsequent run succeeded. Is there any reason to think this is not transient in nature? I have seen jobs fail during cluster upgrades when they are unable to contact the cluster; it is reasonable to expect failures in that scenario. High load on the cluster may contribute to long response times and to some failures of these jobs. High load or heavy ES indexing and/or curation activity may also lead to slow response times when using Kibana. Per #c5, a number of fixes were already backported into 4.5 to provide better error handling and make the reasons for failure more obvious. I would encourage you to work with the customer to determine whether they are on the latest 4.5 version of the operator and work from there.
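For reference, the schedule can be confirmed directly from the cron jobs; a minimal check, assuming the default openshift-logging namespace used by cluster logging:

  # The SCHEDULE column shows how often the delete/rollover jobs run (every 15 minutes by default)
  oc get cronjobs -n openshift-logging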

Comment 9 Jeff Cantrill 2020-11-20 18:49:55 UTC
(In reply to Saurabh Sadhale from comment #8)

> However they are mentioning that the errors are observed in the cluster on a
> daily basis.

Can you please attach the logs from the failed jobs?
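One way to collect them, as a sketch assuming the default openshift-logging namespace:

  # List only the job pods that ended in error
  oc get pods -n openshift-logging --field-selector=status.phase=Failed

  # Capture the log of a failed pod, e.g. the rollover-audit pod quoted above
  oc logs elasticsearch-rollover-audit-1604561400-tcjpt -n openshift-logging > rollover-audit.log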

Comment 10 Jeff Cantrill 2020-11-20 18:57:01 UTC
Can you also please paste the operator image version so we can be certain they have the fixes? I'm wondering whether, even if they do have the fixes, they are seeing failures related to load on Elasticsearch.
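A sketch of how to pull those versions, assuming the operators were installed via OLM into the usual namespaces (openshift-logging for CLO, openshift-operators-redhat for EO):

  # Installed operator versions as seen by OLM
  oc get csv -n openshift-logging
  oc get csv -n openshift-operators-redhat

  # Exact operator images in use
  oc get deployment cluster-logging-operator -n openshift-logging -o jsonpath='{.spec.template.spec.containers[0].image}'
  oc get deployment elasticsearch-operator -n openshift-operators-redhat -o jsonpath='{.spec.template.spec.containers[0].image}'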

Comment 12 Periklis Tsirakidis 2020-11-27 08:15:29 UTC
@mrobson 

Looking at the errors provided, both seem to be addressed by the following BZs:

> ValueError: No JSON object could be decoded

By https://bugzilla.redhat.com/show_bug.cgi?id=1899905

> {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
> Error while attemping to determine the active write alias: {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}

By https://bugzilla.redhat.com/show_bug.cgi?id=1890838


Both are assigned and someone is working on them. If you don't have any further objections, I will close this as a duplicate of one of those two BZs.

Comment 13 Periklis Tsirakidis 2020-11-27 08:20:34 UTC
@mrobson 

Please note that I have already asked both BZ owners to consider backporting this to 4.5.z:

Here https://github.com/openshift/elasticsearch-operator/pull/588#issuecomment-734706309

and here https://bugzilla.redhat.com/show_bug.cgi?id=1890838#c8

Comment 15 Periklis Tsirakidis 2020-11-30 08:37:40 UTC
@anisal 

Could you take a look at whether the BZs linked in [1] are the same as yours here? Both of them are in progress and/or have PRs attached.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1896578#c12

Comment 16 Periklis Tsirakidis 2020-11-30 17:24:58 UTC
Marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1890838

*** This bug has been marked as a duplicate of bug 1890838 ***

Comment 18 bhunt 2020-12-15 17:14:54 UTC
I think this bug has been wrongly closed as a duplicate of 1890838.
As stated in https://access.redhat.com/solutions/5410091 and in the title, this bug is intended to track the backport of the fix.
Can we re-open, or won't this be backported?

Comment 19 bhunt 2020-12-15 17:22:53 UTC
As far as I know, there also isn't a stable upgrade path from 4.5 to 4.6 yet.

Comment 20 Matthew Robson 2020-12-15 17:34:43 UTC
Benjamin, the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1866019 have already been backported and shipped in 4.5.15+: https://bugzilla.redhat.com/show_bug.cgi?id=1868675

Here is the PR:
https://github.com/openshift/elasticsearch-operator/pull/488

With those fixes in place, we found a few more use cases that present the same type of issue on 4.5.15+ with the latest EO / CLO.

This BZ was closed because the backports for the original BZ have shipped. It was closed as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1890838 because that BZ tracks one of the follow-up fixes for the additional use cases.

There is a second bug in that pool of fixes as well: https://bugzilla.redhat.com/show_bug.cgi?id=1899905

BZ1890838 is merged into master and open for 4.6 backport:
https://github.com/openshift/origin-aggregated-logging/pull/2023
 

BZ1899905 is merged into master and open for 4.6 backport:
https://github.com/openshift/elasticsearch-operator/pull/588

Once they get backported to 4.6, they can be backported to 4.5.

Matt