Description of problem:
The ES cluster couldn't become ready until I deleted all ES pods.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy clusterlogging 4.4
2. Upgrade EO to 4.5
3. Apply the workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3
4. Check the clusterlogging status
5. Upgrade CLO
6. Check the cluster logging status
7. Delete all ES pods (see the command sketch after this list)
8. Check the ES status

Actual results:
See the attachment.
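A minimal sketch of the commands used for the status checks and pod deletion, assuming the default openshift-logging namespace, a ClusterLogging CR named "instance", and the component=elasticsearch pod label applied by the operator:

  # Check the overall logging stack status (steps 4 and 6)
  oc get clusterlogging instance -n openshift-logging -o yaml

  # Delete all ES pods so the deployments recreate them (step 7)
  oc delete pods -l component=elasticsearch -n openshift-logging

  # Watch the ES pods come back and check readiness (step 8)
  oc get pods -l component=elasticsearch -n openshift-logging -w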
Created attachment 1695231 [details] Upgrade steps or logs
Created attachment 1695233 [details] elasticsearch pod log
[anli@preserve-docker-slave 96583]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-568599f687-8prlw       1/1     Running     0          18m
curator-1591363200-t8jrs                        0/1     Completed   0          15m
curator-1591363800-fshbz                        1/1     Running     0          5m1s
elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk    1/2     Running     0          6m48s
elasticsearch-cdm-dkx6l77h-2-589999f69f-bpwtf   1/2     Running     0          5m35s
elasticsearch-cdm-dkx6l77h-3-846df5674d-4rgl7   1/2     Running     0          5m

oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/settings?pretty'
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "primaries"
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}

{"level":"info","ts":1591363697.9201612,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
time="2020-06-05T13:28:19Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 3 shards in preparation for cluster restart"
time="2020-06-05T13:28:22Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:28:53Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-1 to rejoin cluster elasticsearch"
time="2020-06-05T13:29:30Z" level=info msg="Completed full cluster restart for cert redeploy on elasticsearch"
time="2020-06-05T13:29:34Z" level=info msg="Beginning full cluster restart on elasticsearch"
time="2020-06-05T13:30:06Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:30:37Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-2 to rejoin cluster elasticsearch"
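For reference, cluster health and shard allocation at this point can be inspected with the same es_util helper shown above (pod name taken from the listing; _cluster/health and _cat/shards are standard Elasticsearch 6.x endpoints):

  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/health?pretty'
  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cat/shards?v'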
Why do we have the same setting at both the transient and persistent levels? Are we aware of https://www.elastic.co/guide/en/elasticsearch/reference/6.8/cluster-update-settings.html#_order_of_precedence ? Transient settings take precedence over persistent ones, which makes the persistent "cluster.routing.allocation.enable" : "primaries" effectively a no-op.
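A minimal sketch of clearing the transient override so the persistent "primaries" value would actually take effect, assuming es_util forwards extra arguments to the underlying curl call (per the cluster-update-settings API, setting a transient key to null removes it):

  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- \
    es_util '--query=_cluster/settings' -XPUT -d '{"transient":{"cluster.routing.allocation.enable":null}}'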
Verified:
clusterlogging.4.4.0-202006061254 -> clusterlogging.v4.6.0
elasticsearch-operator.4.4.0-202006061254 -> elasticsearch-operator.v4.6.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196