Bug 1928772 - Sometimes the elasticsearch-delete-xxx job fails after the OCP 4.6.16 patch
Summary: Sometimes the elasticsearch-delete-xxx job fails after the OCP 4.6.16 patch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: All
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: ewolinet
QA Contact: Qiaoling Tang
URL:
Whiteboard: logging-exploration
Duplicates: 1937216
Depends On: 1890838 1916910
Blocks: 1919075
 
Reported: 2021-02-15 14:30 UTC by David Hernández Fernández
Modified: 2022-10-14 03:51 UTC
CC List: 51 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, while under load, Elasticsearch responded to some requests with an HTTP 500 error, even though there was nothing wrong with the cluster. Retrying the request was successful. This release fixes the issue by updating the cron jobs to be more resilient when encountering temporary HTTP 500 errors. Now, they will retry a request multiple times first before failing. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1928772[*BZ#1928772*])
Clone Of: 1916910
Environment:
Last Closed: 2021-03-30 16:54:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift elasticsearch-operator pull 678 0 None open [release-4.6] Bug 1928772: Adding safeguards to cronjob to be more resilient to es 500 responses 2021-03-15 18:17:27 UTC
Red Hat Knowledge Base (Solution) 5410091 0 None None None 2021-03-12 10:04:08 UTC
Red Hat Product Errata RHBA-2021:0954 0 None None None 2021-03-30 16:55:09 UTC
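
The linked pull request and the Doc Text above describe making the index-management cron job scripts retry when Elasticsearch returns a transient HTTP 500 under load. As a hedged sketch of that kind of retry loop (illustrative only; the function name, retry count, and curl flags are assumptions, not taken from the actual scripts):

~~~
#!/bin/bash
# Illustrative retry helper: re-issue a request when Elasticsearch answers
# with a 5xx status that may be transient under load. Not the operator's
# actual script; names and limits are assumptions. ES_SERVICE comes from the
# job's environment (https://elasticsearch:9200 in the pod spec below).
retry_es_request() {
  local path="$1" max_tries=5 try code
  for try in $(seq 1 "$max_tries"); do
    # Save the body for later inspection and capture only the HTTP status here.
    code=$(curl -sk -o /tmp/response.txt -w '%{http_code}' "${ES_SERVICE}/${path}")
    case "$code" in
      2??) return 0 ;;   # success
      5??) sleep 5 ;;    # possibly transient (e.g. the 500 security_exception seen below); retry
      *)   break ;;      # treat anything else as a hard failure
    esac
  done
  [ -f /tmp/response.txt ] && cat /tmp/response.txt >&2
  return 1
}
~~~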

Description David Hernández Fernández 2021-02-15 14:30:54 UTC
Created attachment 1757103 [details]
dump_logging

+++ This bug was initially created as a clone of Bug #1916910 +++
+++ This bug was initially created as a clone of Bug #1890838 +++

Description of problem: After applying OCP 4.6.16 from https://access.redhat.com/errata/RHSA-2021:0310 and the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1916910, the logs changed but the issue remains.

Actual results:
Leaving the cluster running for 3 days ends with 2 new failed Elasticsearch index-management jobs:

oc get pod|grep Err
elasticsearch-im-audit-1613178900-q7t29         0/1     Error       0          2d11h
elasticsearch-im-infra-1613355300-fb7xv         0/1     Error       0          10h

oc logs elasticsearch-im-infra-1613355300-fb7xv
deleting indices: infra-000321
cat: /tmp/response.txt: No such file or directory
Current write index for infra-write: infra-000341
Checking results from _rollover call
Next write index for infra-write: infra-000341
Checking if infra-000341 exists
Checking if infra-000341 is the write index for infra-write
Done!

oc logs elasticsearch-im-audit-1613178900-q7t29
deleting indices: audit-000336
cat: /tmp/response.txt: No such file or directory
Current write index for audit-write: audit-000356
Checking results from _rollover call
Next write index for audit-write: audit-000356
Checking if audit-000356 exists
Checking if audit-000356 is the write index for audit-write
Done!

The error messages are new, but the result is the same.
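
For context, "cat: /tmp/response.txt: No such file or directory" is simply what `cat` prints when the response file it is asked to read was never created, which can happen when the preceding request to Elasticsearch fails without producing a body. A minimal, hypothetical guard (not the actual index-management script) that would report this more clearly:

~~~
# Hypothetical guard around reading the captured Elasticsearch response.
if [ -f /tmp/response.txt ]; then
  cat /tmp/response.txt
else
  echo "No response captured from Elasticsearch -- the request may have failed" >&2
fi
~~~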


Expected results: These errors are not expected; the index management jobs should complete successfully.


Additional info: Attached dump.

NAME                                            READY   STATUS      RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
cluster-logging-operator-597576d48-8vww9        1/1     Running     0          4d21h   10.222.4.8     aposa108t   <none>           <none>
curator-1613359800-6vb2k                        0/1     Completed   0          10h     10.221.3.124   aposa103t   <none>           <none>
elasticsearch-cdm-bxzkbe5l-1-b6656c9f8-zhf9w    2/2     Running     0          4d21h   10.220.0.12    aposa101t   <none>           <none>
elasticsearch-cdm-bxzkbe5l-2-846b7499df-rc8rt   2/2     Running     0          4d21h   10.220.0.11    aposa101t   <none>           <none>
elasticsearch-cdm-bxzkbe5l-3-9bbd4b7cb-tmp7f    2/2     Running     0          4d21h   10.220.2.17    aposa102t   <none>           <none>
elasticsearch-im-app-1613396700-4cpnn           0/1     Completed   0          13m     10.220.2.59    aposa102t   <none>           <none>
elasticsearch-im-audit-1613178900-q7t29         0/1     Error       0          2d12h   10.221.2.176   aposa103t   <none>           <none>
elasticsearch-im-audit-1613396700-xtnn5         0/1     Completed   0          13m     10.220.2.60    aposa102t   <none>           <none>
elasticsearch-im-infra-1613355300-fb7xv         0/1     Error       0          11h     10.220.3.205   aposa102t   <none>           <none>
elasticsearch-im-infra-1613396700-c4vrl         0/1     Completed   0          13m     10.221.3.168   aposa103t   <none>           <none>
fluentd-25kqz                                   1/1     Running     0          6d22h   10.221.2.5     aposa103t   <none>           <none>
fluentd-49xgk                                   1/1     Running     0          6d22h   10.220.2.4     aposa102t   <none>           <none>
fluentd-6qlm6                                   1/1     Running     0          6d22h   10.222.0.9     aposm101t   <none>           <none>
fluentd-8fs6n                                   1/1     Running     0          6d22h   10.221.4.5     aposa107t   <none>           <none>
fluentd-9bpjw                                   1/1     Running     0          6d22h   10.221.0.5     aposm103t   <none>           <none>
fluentd-kswdk                                   1/1     Running     0          6d22h   10.223.4.2     aposa109t   <none>           <none>
fluentd-lzsq2                                   1/1     Running     0          6d22h   10.220.4.3     aposa106t   <none>           <none>
fluentd-m6lvf                                   1/1     Running     0          6d22h   10.222.4.4     aposa108t   <none>           <none>
fluentd-mnqvl                                   1/1     Running     0          6d22h   10.220.0.3     aposa101t   <none>           <none>
fluentd-n7ncz                                   1/1     Running     0          6d22h   10.223.0.6     aposm102t   <none>           <none>
fluentd-tdf6x                                   1/1     Running     0          6d22h   10.223.2.2     aposa105t   <none>           <none>
fluentd-zbp9q                                   1/1     Running     0          6d22h   10.222.2.3     aposa104t   <none>           <none>


Example:
300-fb7xv/elasticsearch-im-infra-1613355300-fb7xv.yaml
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.220.3.205"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.220.3.205"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: "2021-02-15T02:15:09Z"
  generateName: elasticsearch-im-infra-1613355300-
  labels:
    component: indexManagement
    controller-uid: 845b956a-56ea-41fa-a938-e051a6828220
    job-name: elasticsearch-im-infra-1613355300
    logging-infra: indexManagement
    provider: openshift
  name: elasticsearch-im-infra-1613355300-fb7xv
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: elasticsearch-im-infra-1613355300
    uid: 845b956a-56ea-41fa-a938-e051a6828220
  resourceVersion: "722171623"
  selfLink: /api/v1/namespaces/openshift-logging/pods/elasticsearch-im-infra-1613355300-fb7xv
  uid: ab3643fb-642e-4cee-a378-a7e277417a4e
spec:
  containers:
  - args:
    - -c
    - ./delete;delete_rc=$?;./rollover;rollover_rc=$?;$(exit $delete_rc&&exit $rollover_rc)
    command:
    - bash
    env:
    - name: POLICY_MAPPING
      value: infra
    - name: MIN_AGE
      value: "604800000"
    - name: PAYLOAD
      value: eyJjb25kaXRpb25zIjp7Im1heF9hZ2UiOiI4aCIsIm1heF9kb2NzIjoxMjI4ODAwMDAsIm1heF9zaXplIjoiMTIwZ2IifX0=
    - name: POLICY_MAPPING
      value: infra
    - name: ES_SERVICE
      value: https://elasticsearch:9200
    image: registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:3f3ade6146eaab2a4c014c16cb2025ac9203de700b593c6ef893f6e59fa54616
    imagePullPolicy: IfNotPresent
    name: indexmanagement
    resources:
      requests:
        cpu: 100m
        memory: 32Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000600000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/indexmanagement/keys
      name: certs
      readOnly: true
    - mountPath: /tmp/scripts
      name: scripts
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: elasticsearch-token-6jqpc
      readOnly: true
    workingDir: /tmp/scripts
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: elasticsearch-dockercfg-rtwhz
  nodeName: aposa102t
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/infra: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000600000
    seLinuxOptions:
      level: s0:c25,c0
  serviceAccount: elasticsearch
  serviceAccountName: elasticsearch
  terminationGracePeriodSeconds: 300
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: certs
    secret:
      defaultMode: 420
      secretName: elasticsearch
  - configMap:
      defaultMode: 511
      name: indexmanagement-scripts
    name: scripts
  - name: elasticsearch-token-6jqpc
    secret:
      defaultMode: 420
      secretName: elasticsearch-token-6jqpc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:19Z"
    message: 'containers with unready status: [indexmanagement]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:19Z"
    message: 'containers with unready status: [indexmanagement]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://4830ad4cd6ca7235e6d3091ffcd731ce6715e89a1a758b2fa73daf12eda8fe84
    image: registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:3f3ade6146eaab2a4c014c16cb2025ac9203de700b593c6ef893f6e59fa54616
    imageID: registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:3f3ade6146eaab2a4c014c16cb2025ac9203de700b593c6ef893f6e59fa54616
    lastState: {}
    name: indexmanagement
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://4830ad4cd6ca7235e6d3091ffcd731ce6715e89a1a758b2fa73daf12eda8fe84
        exitCode: 1
        finishedAt: "2021-02-15T02:15:19Z"
        reason: Error
        startedAt: "2021-02-15T02:15:12Z"
  hostIP: 10.67.0.133
  phase: Failed
  podIP: 10.220.3.205
  podIPs:
  - ip: 10.220.3.205
  qosClass: Burstable
  startTime: "2021-02-15T02:15:09Z"
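
One detail worth noting in the pod spec above is the container command, `./delete;delete_rc=$?;./rollover;rollover_rc=$?;$(exit $delete_rc&&exit $rollover_rc)`. In the final command substitution, `exit $delete_rc` terminates the subshell immediately, so the `&&` branch never runs and the pod's exit code effectively reflects the delete script's status; that is consistent with these pods terminating with `exitCode: 1` even though the rollover step logs "Done!". A small standalone illustration of that shell behavior (not the operator's script):

~~~
# Standalone illustration of the exit-code handling in the job's args.
delete_rc=1; rollover_rc=0
$(exit $delete_rc && exit $rollover_rc)
echo $?   # prints 1: the subshell exits with delete_rc before '&&' is evaluated

delete_rc=0; rollover_rc=1
$(exit $delete_rc && exit $rollover_rc)
echo $?   # prints 0: rollover_rc is never reached, so its failure is dropped
~~~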

Comment 7 Oscar Casal Sanchez 2021-02-19 09:26:38 UTC
Hello,

I can see exactly the same issue as in BZ#1916910, which was fixed in OCP 4.6.16.

~~~
# oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-7c87c4dfbd-vppsv       1/1     Running     0          18h
curator-1613619000-k8mbd                        0/1     Completed   0          12h
elasticsearch-cdm-qarn9lmr-1-5fb9cc7bf5-dhlth   2/2     Running     0          18h
elasticsearch-cdm-qarn9lmr-2-645bf4cd96-gxbl4   2/2     Running     0          18h
elasticsearch-cdm-qarn9lmr-3-8d4c95d8f-9cg6g    2/2     Running     0          18h
elasticsearch-im-app-1613664000-xdr8j           0/1     Completed   0          8m33s
elasticsearch-im-audit-1613664000-d672l         0/1     Completed   0          8m33s
elasticsearch-im-infra-1613655900-kdcdc         0/1     Error       0          143m
elasticsearch-im-infra-1613664000-j2zsh         0/1     Completed   0          8m33s
...


# oc logs pod/elasticsearch-im-infra-1613655900-kdcdc

deleting indices: infra-000719
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "Unexpected exception indices:admin/delete"
      }
    ],
    "type" : "security_exception",
    "reason" : "Unexpected exception indices:admin/delete"
  },
  "status" : 500
}
Current write index for infra-write: infra-000740
Checking results from _rollover call
Next write index for infra-write: infra-000740
Checking if infra-000740 exists
Checking if infra-000740 is the write index for infra-write
Done!
~~~
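
As a hedged troubleshooting step, the failing delete can be re-issued by hand from one of the Elasticsearch pods to see whether the 500 is transient. This assumes the `es_util` query helper shipped in the Elasticsearch container and the `component=elasticsearch` pod label of a standard openshift-logging deployment; the index name is taken from the log above:

~~~
# Assumed helpers: es_util inside the ES container, component=elasticsearch label.
pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o name | head -1)
oc -n openshift-logging exec -c elasticsearch "$pod" -- es_util --query=_cluster/health?pretty
oc -n openshift-logging exec -c elasticsearch "$pod" -- es_util --query=infra-000719 -XDELETE
~~~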

Then, as Víctor has asked before, should we create a separate Bugzilla for that issue (the same one BZ#1916910 tried to fix) and leave this bug only for the error below?

  "cat: /tmp/response.txt: No such file or directory"


Best regards,
Oscar

Comment 10 ewolinet 2021-02-22 21:26:22 UTC
> # oc logs pod/elasticsearch-im-infra-1613655900-kdcdc
> 
> deleting indices: infra-000719
> {
>   "error" : {
>     "root_cause" : [
>       {
>         "type" : "security_exception",
>         "reason" : "Unexpected exception indices:admin/delete"
>       }
>     ],
>     "type" : "security_exception",
>     "reason" : "Unexpected exception indices:admin/delete"
>   },
>   "status" : 500
> }

This looks like a security issue with permissions to delete -- do you see the same errors in the ES pod logs as Victor posted above?
If so, could you provide a must-gather for the cluster? I had requested one for this specific issue and have not yet received one.
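
For reference, a must-gather scoped to the logging stack can be collected with roughly the following (the documented pattern resolves the image from the running cluster-logging-operator deployment; treat the exact flags as an assumption and adjust for the cluster):

~~~
# Collect logging-specific must-gather data; the image is resolved from the
# cluster-logging-operator deployment in openshift-logging.
oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
  -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')
~~~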

Comment 39 Brett Jones 2021-03-15 20:44:12 UTC
*** Bug 1937216 has been marked as a duplicate of this bug. ***

Comment 44 Qiaoling Tang 2021-03-25 08:44:39 UTC
Testing with elasticsearch-operator.4.6.0-202103202154.p0: I set the index management cron jobs to run every 3 minutes, and with the ES cluster running for about 29 hours, no job has failed.
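
For anyone reproducing this, one way to shorten the schedule for testing is sketched below; the cronjob names are taken from the job names in this report, and since the elasticsearch-operator owns these cronjobs it may reconcile a direct edit back, so treat this as a test-only shortcut:

~~~
# Test-only sketch: run the index-management cronjobs every 3 minutes.
for cj in elasticsearch-im-app elasticsearch-im-infra elasticsearch-im-audit; do
  oc -n openshift-logging patch cronjob "$cj" --type=merge \
    -p '{"spec":{"schedule":"*/3 * * * *"}}'
done
~~~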

$ oc get pod
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-6f66778f94-7zpmh       1/1     Running            0          29h
elasticsearch-cdm-kbvuvj7o-1-5989bcf7c4-vkxrc   2/2     Running            0          29h
elasticsearch-cdm-kbvuvj7o-2-57468594c7-5n8kf   2/2     Running            0          29h
elasticsearch-cdm-kbvuvj7o-3-5df4bc888d-5dx8h   2/2     Running            0          29h
elasticsearch-im-app-1616659740-dx989           0/1     Completed          0          79s
elasticsearch-im-audit-1616659740-p26qw         0/1     Completed          0          79s
elasticsearch-im-infra-1616659740-swdt7         0/1     Completed          0          79s
fluentd-bsjzw                                   1/1     Running            0          29h
fluentd-fsl9g                                   1/1     Running            0          29h
fluentd-pjqzd                                   1/1     Running            0          29h
fluentd-rdfkt                                   1/1     Running            0          29h
fluentd-tv9hh                                   1/1     Running            0          29h
fluentd-v6w9f                                   1/1     Running            0          29h
kibana-8685fbf674-c9fct                         2/2     Running            0          29h

Moving this BZ to VERIFIED.

Comment 45 David Hernández Fernández 2021-03-27 08:10:41 UTC
I think 29 hours is not enough; could you verify it a bit further, please (e.g. 1 week)? This bug has been "verified" in the past as well in other situations, but the issue came back after a few days.

Comment 47 Qiaoling Tang 2021-03-29 08:48:31 UTC
(In reply to David Hernández Fernández from comment #45)
> I think 29 hours is not enough; could you verify it a bit further, please
> (e.g. 1 week)? This bug has been "verified" in the past as well in other
> situations, but the issue came back after a few days.

Hi David,

I tested with an earlier version (elasticsearch-operator.4.6.0-202103010126.p0) which doesn't have the fix; the issue reproduced after the ES cluster had been running for 7 hours. Compared to the result in comment 44, this is a significant improvement, so I verified this BZ.

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-79c8f9b6d5-ldgg6       1/1     Running     0          8h
elasticsearch-cdm-s6xfukoa-1-6989d86b54-gpjkm   2/2     Running     0          7h58m
elasticsearch-cdm-s6xfukoa-2-85f7f5bc7c-7tj6x   2/2     Running     0          7h57m
elasticsearch-cdm-s6xfukoa-3-5d7ccfbb96-ssthc   2/2     Running     0          7h56m
elasticsearch-im-app-1617007320-8q74b           0/1     Completed   0          96s
elasticsearch-im-audit-1617007320-r787v         0/1     Completed   0          96s
elasticsearch-im-infra-1617003900-5lzqd         0/1     Error       0          58m
elasticsearch-im-infra-1617007320-wm9r6         0/1     Completed   0          96s

Anyway, I'll follow your advice and keep another cluster running for 1 week to see if the issue can be reproduced with the fix.

Comment 48 ewolinet 2021-03-29 16:20:53 UTC
(In reply to Qiaoling Tang from comment #47)
> (In reply to David Hernández Fernández from comment #45)
> > I think 29 hours is not enough; could you verify it a bit further, please
> > (e.g. 1 week)? This bug has been "verified" in the past as well in other
> > situations, but the issue came back after a few days.
> 
> Hi David,
> 
> I tested with an earlier version (elasticsearch-operator.4.6.0-202103010126.p0)
> which doesn't have the fix; the issue reproduced after the ES cluster had been
> running for 7 hours. Compared to the result in comment 44, this is a significant
> improvement, so I verified this BZ.
> 
> $ oc get pod
> NAME                                            READY   STATUS      RESTARTS   AGE
> cluster-logging-operator-79c8f9b6d5-ldgg6       1/1     Running     0          8h
> elasticsearch-cdm-s6xfukoa-1-6989d86b54-gpjkm   2/2     Running     0          7h58m
> elasticsearch-cdm-s6xfukoa-2-85f7f5bc7c-7tj6x   2/2     Running     0          7h57m
> elasticsearch-cdm-s6xfukoa-3-5d7ccfbb96-ssthc   2/2     Running     0          7h56m
> elasticsearch-im-app-1617007320-8q74b           0/1     Completed   0          96s
> elasticsearch-im-audit-1617007320-r787v         0/1     Completed   0          96s
> elasticsearch-im-infra-1617003900-5lzqd         0/1     Error       0          58m
> elasticsearch-im-infra-1617007320-wm9r6         0/1     Completed   0          96s
> 
> Anyway, I'll follow your advice and keep another cluster running for 1 week to
> see if the issue can be reproduced with the fix.

To expand on this: previously we didn't have a method to consistently reproduce this issue.
This was something I was able to make happen consistently, multiple times an hour, and I had conveyed the method to recreate it locally to QE.

Comment 51 errata-xmlrpc 2021-03-30 16:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0954

Comment 56 Aleksey Usov 2021-09-07 21:34:19 UTC
Still seeing it in 4.7.21, but with a different error in the logs:

# oc logs elasticsearch-im-audit-1631027700-8gqhd
========================
Index management delete process starting for audit

No indices to delete
========================
Index management delete process starting for {"error":{"root_cause":[{"type":"security_exception","reason":"_opendistro_security_dls_query

Received an empty response from elasticsearch -- server may not be ready
========================
Index management rollover process starting for audit

Current write index for audit-write: audit-000011
Checking results from _rollover call
Next write index for audit-write: audit-000011
Checking if audit-000011 exists
Checking if audit-000011 is the write index for audit-write
Done!
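
When a run logs "Received an empty response from elasticsearch -- server may not be ready", checking cluster health and pending tasks around the time the job fires can show whether Elasticsearch was briefly unready. A hedged sketch, with the same assumptions as above about the `es_util` helper and the `component=elasticsearch` label:

~~~
# Hedged check of Elasticsearch readiness around a failed index-management run.
pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o name | head -1)
oc -n openshift-logging exec -c elasticsearch "$pod" -- es_util --query=_cluster/health?pretty
oc -n openshift-logging exec -c elasticsearch "$pod" -- es_util --query=_cluster/pending_tasks?pretty
~~~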

