Bug 1896578 - elasticsearch-rollover and elasticsearch-delete pods in Error states [4.5 backport]
Summary: elasticsearch-rollover and elasticsearch-delete pods in Error states [4.5 backport]
Keywords:
Status: CLOSED DUPLICATE of bug 1890838
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Periklis Tsirakidis
QA Contact: Anping Li
URL:
Whiteboard: logging-exploration
Depends On: 1866019
Blocks:
 
Reported: 2020-11-10 23:26 UTC by Courtney Ruhm
Modified: 2024-06-13 23:23 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-30 17:24:58 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Knowledge Base (Solution) 5410091 (last updated 2020-11-17 19:02:43 UTC)

Description Courtney Ruhm 2020-11-10 23:26:13 UTC
Description of problem:

This is a confirmed bug (1866019) that has been fixed in 4.6.1. IBM has customers requesting a backport to 4.5 and is asking whether this is possible.

Version-Release number of selected component (if applicable):

4.5

How reproducible:

The bug has already been confirmed and is reproducible in 4.5.

Comment 7 Jeff Cantrill 2020-11-18 21:14:33 UTC
(In reply to hgomes from comment #2)
> I have a customer with the following symptom:
> 
> - The jobs fail every 5h, and when they cannot clean up the indices, Kibana
> stops working, so we are consequently unable to see the dashboards and
> logs.

Is there evidence this legitimately fails every 5 hours like clockwork?

> ## Jobs
> oc get pods
> NAME                                            READY   STATUS      RESTARTS   AGE
> 39h
> elasticsearch-delete-app-1604575800-kfhd8       0/1     Completed   0          56s
> elasticsearch-delete-audit-1604557800-m6r8z     0/1     Error       0          5h
> elasticsearch-delete-audit-1604575800-cq988     0/1     Completed   0          56s
> elasticsearch-delete-infra-1604575800-ln85q     0/1     Completed   0          56s
> elasticsearch-rollover-app-1604575800-fctn5     0/1     Completed   0          56s
> elasticsearch-rollover-audit-1604561400-tcjpt   0/1     Error       0          4h
> elasticsearch-rollover-audit-1604575800-8wq2m   0/1     Completed   0          56s
> elasticsearch-rollover-infra-1604575800-rvqnz   0/1     Completed   0          55s

These jobs run by default every 15 minutes. As evidenced here, at least one subsequent run succeeded. Is there any reason to think this is not transient in nature? I have seen jobs fail during cluster upgrades when they are unable to contact the cluster, and it is reasonable to expect failures in that scenario. High load on the cluster may contribute to long response times and occasional failures of these jobs. Heavy ES indexing and/or data curation activity may also lead to slow response times when using Kibana. Per #c5, a number of fixes were already backported into 4.5 to provide better error handling and make the reasons for failure more obvious. I would encourage you to work with the customer to determine whether they are on the latest 4.5 version of the operator and work from there.
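For reference, a minimal way to confirm the 15-minute schedule and check whether the failures are transient, assuming a default install in the openshift-logging namespace:

  # Show the rollover/delete cron jobs and their schedules
  oc get cronjobs -n openshift-logging

  # List recent job pods; occasional Error pods followed by Completed runs suggest transient failures
  oc get pods -n openshift-logging | grep -E 'elasticsearch-(rollover|delete)'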

Comment 9 Jeff Cantrill 2020-11-20 18:49:55 UTC
(In reply to Saurabh Sadhale from comment #8)

> However they are mentioning that the errors are observed in the cluster on a
> daily basis.

Can you please attach the logs from the failed jobs?
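A minimal sketch of how those logs could be collected, using the pod names quoted in comment 7 and assuming the default openshift-logging namespace:

  # Identify the failed rollover/delete job pods
  oc get pods -n openshift-logging | grep Error

  # Save the log of one failed pod, e.g. the delete-audit run quoted earlier
  oc logs elasticsearch-delete-audit-1604557800-m6r8z -n openshift-logging > delete-audit.log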

Comment 10 Jeff Cantrill 2020-11-20 18:57:01 UTC
Can you also please paste the operator image version so we can be certain they have the fixes? I'm wondering whether, even if they do have the fixes, they are seeing failures related to the load on Elasticsearch.
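One way to collect that information, assuming the default namespaces and deployment name used by the operators (adjust if the install differs):

  # Installed operator versions (CSVs)
  oc get csv -n openshift-logging
  oc get csv -n openshift-operators-redhat

  # Image actually running for the elasticsearch-operator deployment
  oc get deployment elasticsearch-operator -n openshift-operators-redhat \
    -o jsonpath='{.spec.template.spec.containers[0].image}'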

Comment 12 Periklis Tsirakidis 2020-11-27 08:15:29 UTC
@mrobson 

Looking at the errors provided, both seem to be addressed by the following BZs

> ValueError: No JSON object could be decoded

By https://bugzilla.redhat.com/show_bug.cgi?id=1899905

> {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
> Error while attemping to determine the active write alias: {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}

By https://bugzilla.redhat.com/show_bug.cgi?id=1890838


Both are assigned and someone is working on them. If you don't have any further objections, I will close this as a duplicate of one of those two BZs.

Comment 13 Periklis Tsirakidis 2020-11-27 08:20:34 UTC
@mrobson 

Please note that I have already asked both BZ owners to consider backporting this to 4.5.z:

Here https://github.com/openshift/elasticsearch-operator/pull/588#issuecomment-734706309

and here https://bugzilla.redhat.com/show_bug.cgi?id=1890838#c8

Comment 15 Periklis Tsirakidis 2020-11-30 08:37:40 UTC
@anisal 

Could you take a look at whether the BZs linked in [1] are the same as yours here? Both of them are in progress and/or have PRs attached.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1896578#c12

Comment 16 Periklis Tsirakidis 2020-11-30 17:24:58 UTC
Marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1890838

*** This bug has been marked as a duplicate of bug 1890838 ***

Comment 18 bhunt 2020-12-15 17:14:54 UTC
I think this bug has been wrongly closed as a duplicate of 1890838.
As stated in https://access.redhat.com/solutions/5410091 and in the title, this bug is intended for the backport of the fix.
Can we re-open it, or won't this be backported?

Comment 19 bhunt 2020-12-15 17:22:53 UTC
As far as I know, there also isn't a stable upgrade path from 4.5 to 4.6 yet.

Comment 20 Matthew Robson 2020-12-15 17:34:43 UTC
Benjamin, the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1866019 have already been backported and shipped in 4.5.15+: https://bugzilla.redhat.com/show_bug.cgi?id=1868675

Here is the PR:
https://github.com/openshift/elasticsearch-operator/pull/488

With those fixes, we found a few more use cases that present the same type of issue on 4.5.15+ and the latest EO / CLO.

This BZ was closed because the backports for the original BZ have shipped. It was closed as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1890838 because that bug tracks one of the follow-up fixes for the additional use cases.

There is a second bug in that pool of fixes as well: https://bugzilla.redhat.com/show_bug.cgi?id=1899905

The fix for BZ1890838 is merged into master and is open for 4.6 backport:
https://github.com/openshift/origin-aggregated-logging/pull/2023
 

The fix for BZ1899905 is merged into master and is open for 4.6 backport:
https://github.com/openshift/elasticsearch-operator/pull/588

Once they get backported to 4.6, they can be backported to 4.5.

Matt

