Created attachment 1757103 [details]
dump_logging

+++ This bug was initially created as a clone of Bug #1916910 +++
+++ This bug was initially created as a clone of Bug #1890838 +++

Description of problem:
After updating to OCP 4.6.16 (https://access.redhat.com/errata/RHSA-2021:0310, which includes the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1916910), the log messages changed but the issue remains.

Actual results:
Leaving the cluster running for 3 days ends with 2 new failed elastic jobs:

$ oc get pod | grep Err
elasticsearch-im-audit-1613178900-q7t29   0/1   Error   0   2d11h
elasticsearch-im-infra-1613355300-fb7xv   0/1   Error   0   10h

$ oc logs elasticsearch-im-infra-1613355300-fb7xv
deleting indices: infra-000321
cat: /tmp/response.txt: No such file or directory
Current write index for infra-write: infra-000341
Checking results from _rollover call
Next write index for infra-write: infra-000341
Checking if infra-000341 exists
Checking if infra-000341 is the write index for infra-write
Done!

$ oc logs elasticsearch-im-audit-1613178900-q7t29
deleting indices: audit-000336
cat: /tmp/response.txt: No such file or directory
Current write index for audit-write: audit-000356
Checking results from _rollover call
Next write index for audit-write: audit-000356
Checking if audit-000356 exists
Checking if audit-000356 is the write index for audit-write
Done!

The error messages are new, but the result is the same.

Expected results:
These errors are not expected; the index management jobs should complete cleanly.

Additional info:
Attached dump.

NAME                                            READY   STATUS      RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
cluster-logging-operator-597576d48-8vww9        1/1     Running     0          4d21h   10.222.4.8     aposa108t   <none>           <none>
curator-1613359800-6vb2k                        0/1     Completed   0          10h     10.221.3.124   aposa103t   <none>           <none>
elasticsearch-cdm-bxzkbe5l-1-b6656c9f8-zhf9w    2/2     Running     0          4d21h   10.220.0.12    aposa101t   <none>           <none>
elasticsearch-cdm-bxzkbe5l-2-846b7499df-rc8rt   2/2     Running     0          4d21h   10.220.0.11    aposa101t   <none>           <none>
elasticsearch-cdm-bxzkbe5l-3-9bbd4b7cb-tmp7f    2/2     Running     0          4d21h   10.220.2.17    aposa102t   <none>           <none>
elasticsearch-im-app-1613396700-4cpnn           0/1     Completed   0          13m     10.220.2.59    aposa102t   <none>           <none>
elasticsearch-im-audit-1613178900-q7t29         0/1     Error       0          2d12h   10.221.2.176   aposa103t   <none>           <none>
elasticsearch-im-audit-1613396700-xtnn5         0/1     Completed   0          13m     10.220.2.60    aposa102t   <none>           <none>
elasticsearch-im-infra-1613355300-fb7xv         0/1     Error       0          11h     10.220.3.205   aposa102t   <none>           <none>
elasticsearch-im-infra-1613396700-c4vrl         0/1     Completed   0          13m     10.221.3.168   aposa103t   <none>           <none>
fluentd-25kqz                                   1/1     Running     0          6d22h   10.221.2.5     aposa103t   <none>           <none>
fluentd-49xgk                                   1/1     Running     0          6d22h   10.220.2.4     aposa102t   <none>           <none>
fluentd-6qlm6                                   1/1     Running     0          6d22h   10.222.0.9     aposm101t   <none>           <none>
fluentd-8fs6n                                   1/1     Running     0          6d22h   10.221.4.5     aposa107t   <none>           <none>
fluentd-9bpjw                                   1/1     Running     0          6d22h   10.221.0.5     aposm103t   <none>           <none>
fluentd-kswdk                                   1/1     Running     0          6d22h   10.223.4.2     aposa109t   <none>           <none>
fluentd-lzsq2                                   1/1     Running     0          6d22h   10.220.4.3     aposa106t   <none>           <none>
fluentd-m6lvf                                   1/1     Running     0          6d22h   10.222.4.4     aposa108t   <none>           <none>
fluentd-mnqvl                                   1/1     Running     0          6d22h   10.220.0.3     aposa101t   <none>           <none>
fluentd-n7ncz                                   1/1     Running     0          6d22h   10.223.0.6     aposm102t   <none>           <none>
fluentd-tdf6x                                   1/1     Running     0          6d22h   10.223.2.2     aposa105t   <none>           <none>
fluentd-zbp9q                                   1/1     Running     0          6d22h   10.222.2.3     aposa104t   <none>           <none>

Example: 300-fb7xv/elasticsearch-im-infra-1613355300-fb7xv.yaml
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.220.3.205"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.220.3.205"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: "2021-02-15T02:15:09Z"
  generateName: elasticsearch-im-infra-1613355300-
  labels:
    component: indexManagement
    controller-uid: 845b956a-56ea-41fa-a938-e051a6828220
    job-name: elasticsearch-im-infra-1613355300
    logging-infra: indexManagement
    provider: openshift
  name: elasticsearch-im-infra-1613355300-fb7xv
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: elasticsearch-im-infra-1613355300
    uid: 845b956a-56ea-41fa-a938-e051a6828220
  resourceVersion: "722171623"
  selfLink: /api/v1/namespaces/openshift-logging/pods/elasticsearch-im-infra-1613355300-fb7xv
  uid: ab3643fb-642e-4cee-a378-a7e277417a4e
spec:
  containers:
  - args:
    - -c
    - ./delete;delete_rc=$?;./rollover;rollover_rc=$?;$(exit $delete_rc&&exit $rollover_rc)
    command:
    - bash
    env:
    - name: POLICY_MAPPING
      value: infra
    - name: MIN_AGE
      value: "604800000"
    - name: PAYLOAD
      value: eyJjb25kaXRpb25zIjp7Im1heF9hZ2UiOiI4aCIsIm1heF9kb2NzIjoxMjI4ODAwMDAsIm1heF9zaXplIjoiMTIwZ2IifX0=
    - name: POLICY_MAPPING
      value: infra
    - name: ES_SERVICE
      value: https://elasticsearch:9200
    image: registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:3f3ade6146eaab2a4c014c16cb2025ac9203de700b593c6ef893f6e59fa54616
    imagePullPolicy: IfNotPresent
    name: indexmanagement
    resources:
      requests:
        cpu: 100m
        memory: 32Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000600000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/indexmanagement/keys
      name: certs
      readOnly: true
    - mountPath: /tmp/scripts
      name: scripts
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: elasticsearch-token-6jqpc
      readOnly: true
    workingDir: /tmp/scripts
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: elasticsearch-dockercfg-rtwhz
  nodeName: aposa102t
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/infra: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000600000
    seLinuxOptions:
      level: s0:c25,c0
  serviceAccount: elasticsearch
  serviceAccountName: elasticsearch
  terminationGracePeriodSeconds: 300
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: certs
    secret:
      defaultMode: 420
      secretName: elasticsearch
  - configMap:
      defaultMode: 511
      name: indexmanagement-scripts
    name: scripts
  - name: elasticsearch-token-6jqpc
    secret:
      defaultMode: 420
      secretName: elasticsearch-token-6jqpc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:19Z"
    message: 'containers with unready status: [indexmanagement]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:19Z"
    message: 'containers with unready status: [indexmanagement]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-02-15T02:15:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://4830ad4cd6ca7235e6d3091ffcd731ce6715e89a1a758b2fa73daf12eda8fe84
    image: registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:3f3ade6146eaab2a4c014c16cb2025ac9203de700b593c6ef893f6e59fa54616
    imageID: registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:3f3ade6146eaab2a4c014c16cb2025ac9203de700b593c6ef893f6e59fa54616
    lastState: {}
    name: indexmanagement
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://4830ad4cd6ca7235e6d3091ffcd731ce6715e89a1a758b2fa73daf12eda8fe84
        exitCode: 1
        finishedAt: "2021-02-15T02:15:19Z"
        reason: Error
        startedAt: "2021-02-15T02:15:12Z"
  hostIP: 10.67.0.133
  phase: Failed
  podIP: 10.220.3.205
  podIPs:
  - ip: 10.220.3.205
  qosClass: Burstable
  startTime: "2021-02-15T02:15:09Z"
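For context, the failing container just runs the `delete` and `rollover` scripts from the `indexmanagement-scripts` configmap and combines their exit codes, with the rollover conditions passed base64-encoded in the PAYLOAD env var. A minimal sketch of how to inspect both from a workstation (the decoded JSON below is simply the result of base64-decoding the PAYLOAD value shown in the pod spec above; nothing else is assumed):

~~~
# Decode the rollover conditions carried in the PAYLOAD env var:
$ echo 'eyJjb25kaXRpb25zIjp7Im1heF9hZ2UiOiI4aCIsIm1heF9kb2NzIjoxMjI4ODAwMDAsIm1heF9zaXplIjoiMTIwZ2IifX0=' | base64 -d
{"conditions":{"max_age":"8h","max_docs":122880000,"max_size":"120gb"}}

# The delete/rollover scripts are mounted from a configmap at /tmp/scripts,
# so they can be inspected without exec'ing into a job pod:
$ oc -n openshift-logging get configmap indexmanagement-scripts -o yaml
~~~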
Hello,

I can see exactly the same issue as in BZ#1916910, which was supposedly fixed in OCP 4.6.16.

~~~
# oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-7c87c4dfbd-vppsv       1/1     Running     0          18h
curator-1613619000-k8mbd                        0/1     Completed   0          12h
elasticsearch-cdm-qarn9lmr-1-5fb9cc7bf5-dhlth   2/2     Running     0          18h
elasticsearch-cdm-qarn9lmr-2-645bf4cd96-gxbl4   2/2     Running     0          18h
elasticsearch-cdm-qarn9lmr-3-8d4c95d8f-9cg6g    2/2     Running     0          18h
elasticsearch-im-app-1613664000-xdr8j           0/1     Completed   0          8m33s
elasticsearch-im-audit-1613664000-d672l         0/1     Completed   0          8m33s
elasticsearch-im-infra-1613655900-kdcdc         0/1     Error       0          143m
elasticsearch-im-infra-1613664000-j2zsh         0/1     Completed   0          8m33s
...

# oc logs pod/elasticsearch-im-infra-1613655900-kdcdc
deleting indices: infra-000719
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "Unexpected exception indices:admin/delete"
      }
    ],
    "type" : "security_exception",
    "reason" : "Unexpected exception indices:admin/delete"
  },
  "status" : 500
}
Current write index for infra-write: infra-000740
Checking results from _rollover call
Next write index for infra-write: infra-000740
Checking if infra-000740 exists
Checking if infra-000740 is the write index for infra-write
Done!
~~~

Then, as Víctor asked before: should we create a separate bugzilla for this failure, since it is the same one that BZ#1916910 tried to fix, and leave this bug only for the "cat: /tmp/response.txt: No such file or directory" error?

Best regards,
Oscar
> # oc logs pod/elasticsearch-im-infra-1613655900-kdcdc
>
> deleting indices: infra-000719
> {
>   "error" : {
>     "root_cause" : [
>       {
>         "type" : "security_exception",
>         "reason" : "Unexpected exception indices:admin/delete"
>       }
>     ],
>     "type" : "security_exception",
>     "reason" : "Unexpected exception indices:admin/delete"
>   },
>   "status" : 500
> }

This looks like a security issue with having permissions to delete -- do you see the same errors in the ES pod logs as Victor posted above? If so, could you provide a must-gather for the cluster? I had requested one for this specific issue and have not yet received one.
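For anyone hitting this, a rough sketch of the two checks being asked for here: grepping the Elasticsearch container logs for the failed delete, and collecting a logging must-gather with the cluster-logging-operator image currently deployed. The pod name below is taken from Oscar's listing and is only an example; adjust names to your cluster.

~~~
# Look for the security_exception on the ES side (repeat for each ES pod):
$ oc -n openshift-logging logs elasticsearch-cdm-qarn9lmr-1-5fb9cc7bf5-dhlth -c elasticsearch \
    | grep -i "indices:admin/delete"

# Collect a logging must-gather using the CLO image running in the cluster:
$ oc adm must-gather \
    --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
      -o jsonpath='{.spec.template.spec.containers[0].image}')
~~~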
*** Bug 1937216 has been marked as a duplicate of this bug. ***
Testing with elasticsearch-operator.4.6.0-202103202154.p0: I set the index management cronjobs to run every 3 minutes, and the ES cluster has been running for about 29 hours with no failed jobs.

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-6f66778f94-7zpmh       1/1     Running     0          29h
elasticsearch-cdm-kbvuvj7o-1-5989bcf7c4-vkxrc   2/2     Running     0          29h
elasticsearch-cdm-kbvuvj7o-2-57468594c7-5n8kf   2/2     Running     0          29h
elasticsearch-cdm-kbvuvj7o-3-5df4bc888d-5dx8h   2/2     Running     0          29h
elasticsearch-im-app-1616659740-dx989           0/1     Completed   0          79s
elasticsearch-im-audit-1616659740-p26qw         0/1     Completed   0          79s
elasticsearch-im-infra-1616659740-swdt7         0/1     Completed   0          79s
fluentd-bsjzw                                   1/1     Running     0          29h
fluentd-fsl9g                                   1/1     Running     0          29h
fluentd-pjqzd                                   1/1     Running     0          29h
fluentd-rdfkt                                   1/1     Running     0          29h
fluentd-tv9hh                                   1/1     Running     0          29h
fluentd-v6w9f                                   1/1     Running     0          29h
kibana-8685fbf674-c9fct                         2/2     Running     0          29h

Moving this bz to verified.
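The comment doesn't show how the cronjob schedule was shortened; one way to do it for a short-lived test is to patch the CronJob objects directly. This is only a sketch: the elasticsearch-operator owns these objects and may reconcile the schedule back to its default, so the change is only useful for quick reproduction attempts.

~~~
# List the index management cronjobs:
$ oc -n openshift-logging get cronjob
# Patch one of them to fire every 3 minutes (standard cron syntax):
$ oc -n openshift-logging patch cronjob elasticsearch-im-infra \
    --type=merge -p '{"spec":{"schedule":"*/3 * * * *"}}'
~~~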
I think 29 hours is not enough, could you verify it a bit further, please? (i.e. 1 week) This bug has been "verified" in the past as well in other situations, but the issue reappeared after a few days.
(In reply to David Hernández Fernández from comment #45)
> I think 29 hours is not enough, could you verify it a bit further, please?
> (i.e. 1 week) This bug has been "verified" in the past as well in other
> situations, but the issue reappeared after a few days.

Hi David,

I tested with an earlier version (elasticsearch-operator.4.6.0-202103010126.p0) which doesn't have the fix, and the issue reproduced after the ES cluster had been running for 7 hours. Compared to the result in comment 44, I think the behaviour has improved significantly, so I verified this bz.

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-79c8f9b6d5-ldgg6       1/1     Running     0          8h
elasticsearch-cdm-s6xfukoa-1-6989d86b54-gpjkm   2/2     Running     0          7h58m
elasticsearch-cdm-s6xfukoa-2-85f7f5bc7c-7tj6x   2/2     Running     0          7h57m
elasticsearch-cdm-s6xfukoa-3-5d7ccfbb96-ssthc   2/2     Running     0          7h56m
elasticsearch-im-app-1617007320-8q74b           0/1     Completed   0          96s
elasticsearch-im-audit-1617007320-r787v         0/1     Completed   0          96s
elasticsearch-im-infra-1617003900-5lzqd         0/1     Error       0          58m
elasticsearch-im-infra-1617007320-wm9r6         0/1     Completed   0          96s

Anyway, I'll follow your advice and keep my other cluster running for 1 week to see if the issue can be reproduced with the fix.
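For the week-long soak test, a simple way to record any failed index management jobs as they appear is a periodic check on the pods; the `logging-infra=indexManagement` label is taken from the pod spec attached in the description, the rest is a sketch that any equivalent watch could replace.

~~~
# Log failed index management pods once an hour during the soak test:
$ while true; do
    date
    oc -n openshift-logging get pods -l logging-infra=indexManagement \
      --field-selector=status.phase=Failed
    sleep 3600
  done | tee im-failures.log
~~~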
(In reply to Qiaoling Tang from comment #47)
> (In reply to David Hernández Fernández from comment #45)
> > I think 29 hours is not enough, could you verify it a bit further, please?
> > (i.e. 1 week) This bug has been "verified" in the past as well in other
> > situations, but the issue reappeared after a few days.
>
> Hi David,
>
> I tested with an earlier version (elasticsearch-operator.4.6.0-202103010126.p0)
> which doesn't have the fix, and the issue reproduced after the ES cluster had
> been running for 7 hours. Compared to the result in comment 44, I think the
> behaviour has improved significantly, so I verified this bz.
>
> $ oc get pod
> NAME                                            READY   STATUS      RESTARTS   AGE
> cluster-logging-operator-79c8f9b6d5-ldgg6       1/1     Running     0          8h
> elasticsearch-cdm-s6xfukoa-1-6989d86b54-gpjkm   2/2     Running     0          7h58m
> elasticsearch-cdm-s6xfukoa-2-85f7f5bc7c-7tj6x   2/2     Running     0          7h57m
> elasticsearch-cdm-s6xfukoa-3-5d7ccfbb96-ssthc   2/2     Running     0          7h56m
> elasticsearch-im-app-1617007320-8q74b           0/1     Completed   0          96s
> elasticsearch-im-audit-1617007320-r787v         0/1     Completed   0          96s
> elasticsearch-im-infra-1617003900-5lzqd         0/1     Error       0          58m
> elasticsearch-im-infra-1617007320-wm9r6         0/1     Completed   0          96s
>
> Anyway, I'll follow your advice and keep my other cluster running for 1 week
> to see if the issue can be reproduced with the fix.

To expand on this: previously we didn't have a method to consistently reproduce this issue. This was something I could make happen multiple times an hour, and I had conveyed the method for recreating it locally to QE.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0954
Still seeing this on 4.7.21, but with a different error in the logs:

# oc logs elasticsearch-im-audit-1631027700-8gqhd
========================
Index management delete process starting for audit
No indices to delete
========================
Index management delete process starting for {"error":{"root_cause":[{"type":"security_exception","reason":"_opendistro_security_dls_query
Received an empty response from elasticsearch -- server may not be ready
========================
Index management rollover process starting for audit
Current write index for audit-write: audit-000011
Checking results from _rollover call
Next write index for audit-write: audit-000011
Checking if audit-000011 exists
Checking if audit-000011 is the write index for audit-write
Done!
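The truncated security_exception in the middle of the delete phase, together with "server may not be ready", suggests the job ran while Elasticsearch (or its security plugin) was not answering properly. A quick way to check cluster health and the audit-write alias at that moment is to query from inside an ES pod with the es_util helper bundled in the OpenShift logging Elasticsearch image; this is only a sketch and assumes the usual component=elasticsearch label on the ES pods.

~~~
# Pick one Elasticsearch pod and query health and the alias the job rolls over:
$ ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -1)
$ oc -n openshift-logging exec ${ES_POD} -c elasticsearch -- es_util --query="_cluster/health?pretty"
$ oc -n openshift-logging exec ${ES_POD} -c elasticsearch -- es_util --query="_alias/audit-write?pretty"
~~~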