Created attachment 1737865 [details]
must-gather

Description of problem:
Deploy logging 4.7 on a 4.7 cluster, then upgrade logging to a newer 4.7 version. The ES cluster gets stuck in yellow status after 1 or 2 ES pods are upgraded, and the upgrade cannot proceed.

cl/instance:
  spec:
    collection:
      logs:
        fluentd: {}
        type: fluentd
    logStore:
      elasticsearch:
        nodeCount: 3
        redundancyPolicy: SingleRedundancy
        resources:
          requests:
            memory: 2Gi
        storage:
          size: 20Gi
          storageClassName: standard
      retentionPolicy:
        application:
          maxAge: 60h
        audit:
          maxAge: 1d
        infra:
          maxAge: 3h
      type: elasticsearch
    managementState: Managed
    visualization:
      kibana:
        replicas: 1
      type: kibana

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-7b8fb444cc-59sw6       1/1     Running     0          63m
elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks    2/2     Running     0          62m
elasticsearch-cdm-vmehqabi-2-678465b47b-wvwwx   2/2     Running     0          4h1m
elasticsearch-cdm-vmehqabi-3-d79796b57-ctkfw    2/2     Running     0          4h1m
elasticsearch-delete-app-1607504400-7tp98       0/1     Completed   0          63m
elasticsearch-delete-app-1607508000-4h4mq       0/1     Error       0          3m57s
elasticsearch-delete-audit-1607504400-g4sck     0/1     Completed   0          63m
elasticsearch-delete-audit-1607508000-7h2md     0/1     Error       0          3m57s
elasticsearch-delete-infra-1607504400-sqxgj     0/1     Completed   0          63m
elasticsearch-delete-infra-1607508000-86f2b     0/1     Error       0          3m57s
elasticsearch-rollover-app-1607504400-hs4x6     0/1     Completed   0          63m
elasticsearch-rollover-app-1607508000-677dd     0/1     Error       0          3m57s
elasticsearch-rollover-audit-1607504400-ffnqs   0/1     Completed   0          63m
elasticsearch-rollover-audit-1607508000-87vsj   0/1     Error       0          3m57s
elasticsearch-rollover-infra-1607504400-gnkz6   0/1     Completed   0          63m
elasticsearch-rollover-infra-1607508000-pjk2j   0/1     Error       0          3m57s
fluentd-b8nhc                                   1/1     Running     0          4h1m
fluentd-fkm2q                                   1/1     Running     0          4h1m
fluentd-qs25n                                   1/1     Running     0          4h1m
fluentd-sp4m4                                   1/1     Running     0          4h1m
fluentd-sr78d                                   1/1     Running     0          4h1m
fluentd-szq6d                                   1/1     Running     0          4h1m
kibana-74554f87c6-ln88p                         2/2     Running     0          61m

$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- shards
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
infra-000004 1 p STARTED
infra-000004 1 r UNASSIGNED NODE_LEFT
infra-000004 2 p STARTED
infra-000004 2 r UNASSIGNED NODE_LEFT
infra-000004 0 p STARTED
infra-000004 0 r STARTED
infra-000010 1 r STARTED
infra-000010 1 p STARTED
infra-000010 2 p STARTED
infra-000010 2 r UNASSIGNED NODE_LEFT
infra-000010 0 p STARTED
infra-000010 0 r UNASSIGNED NODE_LEFT
infra-000009 1 p STARTED
infra-000009 1 r UNASSIGNED NODE_LEFT
infra-000009 2 p STARTED
infra-000009 2 r UNASSIGNED NODE_LEFT

$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cat/nodes?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.131.0.49  64           100         15  1.13    0.97    1.26     mdi       *      elasticsearch-cdm-vmehqabi-2
10.129.2.172 13           66          10  0.35    0.69    0.79     mdi       -      elasticsearch-cdm-vmehqabi-1
10.128.2.24  59           100         15  0.75    0.85    0.72     mdi       -      elasticsearch-cdm-vmehqabi-3

The EO keeps repeating the following error messages:
"error":{"cluster":"elasticsearch","msg":"failed to create index template","namespace":"openshift-logging","response_body":null,
"error":{"cluster":"elasticsearch","msg":"failed to get list of index templates","namespace":"openshift-logging","response_body":null,

More details are in the must-gather.

Version-Release number of selected component (if applicable):
upgrade from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0

How reproducible:
100%

Steps to Reproduce:
1. deploy logging 4.7
2. upgrade logging to a new 4.7 version
3.

Actual results:

Expected results:

Additional info:
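As a triage aid, the `_cat/shards` output above can be tallied mechanically to see how many replicas are stuck in UNASSIGNED/NODE_LEFT. A minimal sketch; the `tally_shards` helper is hypothetical (not part of the logging stack) and just parses the plain-text columns as they appear in the output quoted above:

```python
from collections import Counter

def tally_shards(cat_shards_output: str) -> Counter:
    """Count shards per (state, unassigned-reason) from `_cat/shards` text."""
    counts = Counter()
    for line in cat_shards_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        state = fields[3]  # e.g. STARTED or UNASSIGNED
        # The reason column (e.g. NODE_LEFT) is only present for unassigned shards.
        reason = fields[4] if state == "UNASSIGNED" and len(fields) > 4 else ""
        counts[(state, reason)] += 1
    return counts

sample = """\
infra-000004 1 p STARTED
infra-000004 1 r UNASSIGNED NODE_LEFT
infra-000004 0 p STARTED
infra-000004 0 r STARTED
"""
print(tally_shards(sample))
```

On the full output above, this shows every primary STARTED but most replicas UNASSIGNED with reason NODE_LEFT, which is consistent with the yellow cluster status.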
Hit this issue too when upgrading the OCP cluster.
Step 1: deploy logging 4.7 on OCP 4.6.
Step 2: upgrade OCP 4.6 -> 4.7.
Result: hit this issue.
Tested several upgrade paths; here are the details:

Path 1: upgrade from clusterlogging.4.7.0-202101070834.p0 to clusterlogging.4.7.0-202101092121.p0: succeeded.
Path 2: upgrade from clusterlogging.4.6.0-202101090741.p0 (latest 4.6, but not released yet) to clusterlogging.4.7.0-202101092121.p0: succeeded.
Path 3: upgrade from clusterlogging.4.6.0-202011221454.p0 (latest released 4.6) to clusterlogging.4.7.0-202101092121.p0: failed, same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1906641.

Since we have https://bugzilla.redhat.com/show_bug.cgi?id=1906641 to track the issue in path 3, I am moving this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652