Elasticsearch cluster can't be fully upgraded after upgrading from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0
* Previously, because of a bug, the software did not find some certificates and regenerated them. Normally, this triggers the Elasticsearch operator to perform a full cluster restart of the Elasticsearch cluster. However, if this happened while the operator was already performing a rolling restart, it could cause mismatched certificates. The current release fixes this issue: the operator now consistently reads and writes certificates to the same working directory and only regenerates them if needed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1905910[*BZ#1905910*])
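A quick way to confirm whether the certificates were regenerated across the upgrade is to compare their fingerprints before and after the operator reconciles (a minimal sketch, assuming the default "elasticsearch" secret in the openshift-logging namespace with an admin-cert key; adjust the names if your deployment differs):
$ oc -n openshift-logging get secret elasticsearch -o jsonpath='{.data.admin-cert}' | base64 -d | openssl x509 -noout -fingerprint -dates
A fingerprint that changes while a rolling restart is still in progress would match the mismatched-certificate condition described above.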
Created attachment 1737865 [details]
must-gather
Description of problem:
Deploy logging 4.7 on a 4.7 cluster, then upgrade logging to a new 4.7 version. The ES cluster gets stuck in yellow status after 1 or 2 ES pods have been upgraded, and the upgrade cannot proceed.
cl/instance:
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 60h
      audit:
        maxAge: 1d
      infra:
        maxAge: 3h
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 1
    type: kibana
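(For reference, the spec above is the ClusterLogging custom resource; it can be dumped with the command below, assuming the default instance name and namespace.)
$ oc -n openshift-logging get clusterlogging instance -o yaml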
$ oc get pod
NAME READY STATUS RESTARTS AGE
cluster-logging-operator-7b8fb444cc-59sw6 1/1 Running 0 63m
elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks 2/2 Running 0 62m
elasticsearch-cdm-vmehqabi-2-678465b47b-wvwwx 2/2 Running 0 4h1m
elasticsearch-cdm-vmehqabi-3-d79796b57-ctkfw 2/2 Running 0 4h1m
elasticsearch-delete-app-1607504400-7tp98 0/1 Completed 0 63m
elasticsearch-delete-app-1607508000-4h4mq 0/1 Error 0 3m57s
elasticsearch-delete-audit-1607504400-g4sck 0/1 Completed 0 63m
elasticsearch-delete-audit-1607508000-7h2md 0/1 Error 0 3m57s
elasticsearch-delete-infra-1607504400-sqxgj 0/1 Completed 0 63m
elasticsearch-delete-infra-1607508000-86f2b 0/1 Error 0 3m57s
elasticsearch-rollover-app-1607504400-hs4x6 0/1 Completed 0 63m
elasticsearch-rollover-app-1607508000-677dd 0/1 Error 0 3m57s
elasticsearch-rollover-audit-1607504400-ffnqs 0/1 Completed 0 63m
elasticsearch-rollover-audit-1607508000-87vsj 0/1 Error 0 3m57s
elasticsearch-rollover-infra-1607504400-gnkz6 0/1 Completed 0 63m
elasticsearch-rollover-infra-1607508000-pjk2j 0/1 Error 0 3m57s
fluentd-b8nhc 1/1 Running 0 4h1m
fluentd-fkm2q 1/1 Running 0 4h1m
fluentd-qs25n 1/1 Running 0 4h1m
fluentd-sp4m4 1/1 Running 0 4h1m
fluentd-sr78d 1/1 Running 0 4h1m
fluentd-szq6d 1/1 Running 0 4h1m
kibana-74554f87c6-ln88p 2/2 Running 0 61m
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- shards
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
infra-000004 1 p STARTED
infra-000004 1 r UNASSIGNED NODE_LEFT
infra-000004 2 p STARTED
infra-000004 2 r UNASSIGNED NODE_LEFT
infra-000004 0 p STARTED
infra-000004 0 r STARTED
infra-000010 1 r STARTED
infra-000010 1 p STARTED
infra-000010 2 p STARTED
infra-000010 2 r UNASSIGNED NODE_LEFT
infra-000010 0 p STARTED
infra-000010 0 r UNASSIGNED NODE_LEFT
infra-000009 1 p STARTED
infra-000009 1 r UNASSIGNED NODE_LEFT
infra-000009 2 p STARTED
infra-000009 2 r UNASSIGNED NODE_LEFT
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cat/nodes?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.131.0.49 64 100 15 1.13 0.97 1.26 mdi * elasticsearch-cdm-vmehqabi-2
10.129.2.172 13 66 10 0.35 0.69 0.79 mdi - elasticsearch-cdm-vmehqabi-1
10.128.2.24 59 100 15 0.75 0.85 0.72 mdi - elasticsearch-cdm-vmehqabi-3
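All three nodes are listed in the cluster, yet the replica shards stay unassigned. The reason for the yellow status can be queried directly with the same es_util helper (a diagnostic sketch; with no request body, _cluster/allocation/explain reports on the first unassigned shard it finds):
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cluster/health?pretty
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cluster/allocation/explain?pretty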
The EO keeps repeating the following error messages:
"error":{"cluster":"elasticsearch","msg":"failed to create index template","namespace":"openshift-logging","response_body":null,
"error":{"cluster":"elasticsearch","msg":"failed to get list of index templates","namespace":"openshift-logging","response_body":null,
More details are in the must-gather.
Version-Release number of selected component (if applicable):
upgrade from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0
How reproducible:
100%
Steps to Reproduce:
1. deploy logging 4.7
2. upgrade logging to a new 4.7 version
3.
Actual results:
Expected results:
Additional info:
Tested several upgrade paths; here are the details:
path 1: upgrade from clusterlogging.4.7.0-202101070834.p0 to clusterlogging.4.7.0-202101092121.p0: succeeded
path 2: upgrade from clusterlogging.4.6.0-202101090741.p0 (latest 4.6, but not released yet) to clusterlogging.4.7.0-202101092121.p0: succeeded
path 3: upgrade from clusterlogging.4.6.0-202011221454.p0 (latest released 4.6) to clusterlogging.4.7.0-202101092121.p0: failed, same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1906641.
Since we have https://bugzilla.redhat.com/show_bug.cgi?id=1906641 to track the issue in path 3, I am moving this BZ to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Errata Advisory for OpenShift Logging 5.0.0), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:0652