Description of problem:

This is a confirmed and fixed bug (1866019) in 4.6.1. IBM has customers requesting a backport to 4.5 and wondering if this is possible.

Version-Release number of selected component (if applicable):

4.5

How reproducible:

The bug has already been confirmed and is reproducible in 4.5.
(In reply to hgomes from comment #2)
> I have a customer with the following symptom:
>
> - The jobs fail every 5h, and when they cannot clean up the indexes, Kibana
>   stops working, so we are not able to see the dashboards and logs.

Is there evidence this legitimately fails every 5 hours like clockwork?

> ## Jobs
> oc get pods
> NAME                                            READY   STATUS      RESTARTS   AGE
> 39h
> elasticsearch-delete-app-1604575800-kfhd8       0/1     Completed   0          56s
> elasticsearch-delete-audit-1604557800-m6r8z     0/1     Error       0          5h
> elasticsearch-delete-audit-1604575800-cq988     0/1     Completed   0          56s
> elasticsearch-delete-infra-1604575800-ln85q     0/1     Completed   0          56s
> elasticsearch-rollover-app-1604575800-fctn5     0/1     Completed   0          56s
> elasticsearch-rollover-audit-1604561400-tcjpt   0/1     Error       0          4h
> elasticsearch-rollover-audit-1604575800-8wq2m   0/1     Completed   0          56s
> elasticsearch-rollover-infra-1604575800-rvqnz   0/1     Completed   0          55s

These jobs run by default every 15 minutes. As evidenced here, at least one subsequent run succeeded. Is there any reason to think this is not transient in nature? I have seen jobs fail during cluster upgrades when they are unable to contact the cluster; it is reasonable to expect failures in that scenario. High load on the cluster may contribute to long response times and some failures of these jobs. High load or heavy Elasticsearch indexing and curation activity may also lead to slow response times when using Kibana.

Per #c5, a number of fixes were already backported into 4.5 to provide better error handling and make the reasons for failure more obvious. I would encourage you to work with the customer to confirm they are on the latest 4.5 version of the operator and work from there.
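For reference, a quick way to confirm the cron schedule and spot whether failures are more than transient; this is a minimal sketch, assuming the default openshift-logging namespace used by cluster logging:

  # List the index-management cronjobs and their schedules (default is every 15 minutes)
  oc get cronjobs -n openshift-logging

  # Show only job pods that ended in a failed state, along with their ages
  oc get pods -n openshift-logging --field-selector=status.phase=Failed

If the failed pods cluster around upgrade or high-load windows rather than recurring on a fixed cadence, that would point at a transient cause.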
(In reply to Saurabh Sadhale from comment #8)
> However, they are mentioning that the errors are observed in the cluster on a
> daily basis.

Can you please attach the logs from the failed jobs?
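If it helps, the logs can be pulled from the failed job pods before they are cleaned up; a rough example using the failed pod names from the output above, again assuming the default openshift-logging namespace:

  # Capture the output of the failed delete and rollover jobs
  oc logs elasticsearch-delete-audit-1604557800-m6r8z -n openshift-logging > delete-audit-failed.log
  oc logs elasticsearch-rollover-audit-1604561400-tcjpt -n openshift-logging > rollover-audit-failed.log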
Can you also please paste the operator image version so we can be certain they have the fixes? Even if they do have the fixes, I'm wondering whether they are seeing failures related to the load on Elasticsearch.
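For reference, something like the following should show both the installed operator version and the exact image; this is a sketch assuming the elasticsearch-operator runs in the default openshift-operators-redhat namespace:

  # Installed ClusterServiceVersion (operator version)
  oc get csv -n openshift-operators-redhat

  # Exact image the elasticsearch-operator deployment is running
  oc get deployment elasticsearch-operator -n openshift-operators-redhat \
    -o jsonpath='{.spec.template.spec.containers[0].image}'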
@mrobson Looking at the errors provided, both seem to be addressed by the following BZs:

> ValueError: No JSON object could be decoded

Addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1899905

> {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
> Error while attemping to determine the active write alias: {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}

Addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1890838

Both are assigned and someone is working on them. If you don't have any further objections, I will close this as a duplicate of one of those two BZs.
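For anyone who wants to confirm the second error independently of the cronjobs, a rough way to exercise the same aliases call from inside an Elasticsearch pod; this assumes the es_util helper shipped in the logging ES image, and the pod name elasticsearch-cdm-example is hypothetical:

  # Query the aliases endpoint directly; a security_exception here reproduces
  # the "indices:admin/aliases/get" failure seen by the rollover/delete jobs
  oc exec -c elasticsearch elasticsearch-cdm-example -n openshift-logging -- \
    es_util --query=_cat/aliases?v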
@mrobson Please note that I have already asked both BZ owners to consider backporting this to 4.5.z: here https://github.com/openshift/elasticsearch-operator/pull/588#issuecomment-734706309 and here https://bugzilla.redhat.com/show_bug.cgi?id=1890838#c8
@anisal Could you take a look at whether the BZs linked in [1] above are the same as yours here? Both of them are in progress and/or have PRs attached.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1896578#c12
Marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1890838 *** This bug has been marked as a duplicate of bug 1890838 ***
I think this bug has been incorrectly closed as a duplicate of 1890838. As stated by https://access.redhat.com/solutions/5410091 and the title, this bug is intended to track the backport of the fix. Can we re-open, or won't this be backported?
As far as I know, there also isn't a stable upgrade path from 4.5 to 4.6 yet.
Benjamin, the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1866019 have already been backported and shipped in 4.5.15+: https://bugzilla.redhat.com/show_bug.cgi?id=1868675

Here is the PR: https://github.com/openshift/elasticsearch-operator/pull/488

With those fixes in place, we found a few more use cases that present the same type of issue on 4.5.15+ and the latest EO / CLO. This BZ was closed because the backports for the original BZ have shipped. It was closed as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1890838 because that BZ tracks one of the follow-up fixes for the additional use cases. There is a second bug in that pool of fixes as well: https://bugzilla.redhat.com/show_bug.cgi?id=1899905

BZ1890838 is merged into master and open for 4.6 backport: https://github.com/openshift/origin-aggregated-logging/pull/2023
BZ1899905 is merged into master and open for 4.6 backport: https://github.com/openshift/elasticsearch-operator/pull/588

Once they get backported to 4.6, they can be backported to 4.5.

Matt