Bug 1844097

Summary: The ES pods could not become READY during upgrade.
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: aos-bugs, cruhm, lvlcek
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: backport:4.5
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:05:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1845118

Attachments:
Description              Flags
Upgrade steps or logs    none
elasticsearch pod log    none

Description Anping Li 2020-06-04 15:39:25 UTC
Description of problem:
The ES cluster couldn't become ready until I deleted all ES pods.
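
For reference, a minimal sketch of the "delete all ES pods" workaround mentioned above, assuming the default openshift-logging namespace and the component=elasticsearch pod label used by cluster logging:

oc delete pods -l component=elasticsearch -n openshift-logging
oc get pods -n openshift-logging -w    # watch for the recreated ES pods to reach 2/2 READY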


Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy clusterlogging 4.4
2. Upgrade EO to 4.5
3. Apply the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3
4. Check the clusterlogging status (see the status-check sketch after this list)
5. Upgrade CLO
6. Check the cluster logging status
7. Delete all ES pods
8. Check the ES status
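
A minimal sketch of the status checks in steps 4 and 6, assuming the clusterlogging CR is named "instance" in the openshift-logging namespace and using the es_util helper shown in comment 3:

oc get clusterlogging instance -n openshift-logging -o yaml    # inspect .status for the Elasticsearch cluster state
oc get pods -n openshift-logging                               # every ES pod should eventually report 2/2 READY
oc exec -c elasticsearch <any-es-pod> -n openshift-logging -- es_util '--query=_cluster/health?pretty'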

Actual results:

See the attachment.

Comment 1 Anping Li 2020-06-04 15:40:45 UTC
Created attachment 1695231 [details]
Upgrade steps or logs

Comment 2 Anping Li 2020-06-04 15:42:11 UTC
Created attachment 1695233 [details]
elasticsearch pod log

Comment 3 Anping Li 2020-06-05 13:35:52 UTC
[anli@preserve-docker-slave 96583]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-568599f687-8prlw       1/1     Running     0          18m
curator-1591363200-t8jrs                        0/1     Completed   0          15m
curator-1591363800-fshbz                        1/1     Running     0          5m1s
elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk    1/2     Running     0          6m48s
elasticsearch-cdm-dkx6l77h-2-589999f69f-bpwtf   1/2     Running     0          5m35s
elasticsearch-cdm-dkx6l77h-3-846df5674d-4rgl7   1/2     Running     0          5m

 oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/settings?pretty'
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "primaries"
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}


{"level":"info","ts":1591363697.9201612,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
time="2020-06-05T13:28:19Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 3 shards in preparation for cluster restart"
time="2020-06-05T13:28:22Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:28:53Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-1 to rejoin cluster elasticsearch"
time="2020-06-05T13:29:30Z" level=info msg="Completed full cluster restart for cert redeploy on elasticsearch"
time="2020-06-05T13:29:34Z" level=info msg="Beginning full cluster restart on elasticsearch"
time="2020-06-05T13:30:06Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:30:37Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-2 to rejoin cluster elasticsearch"

Comment 4 Lukas Vlcek 2020-06-05 14:22:07 UTC
Why do we have the same setting set at both the transient and persistent levels?
Are we aware of https://www.elastic.co/guide/en/elasticsearch/reference/6.8/cluster-update-settings.html#_order_of_precedence ?

The transient setting takes precedence over the persistent one, making "cluster.routing.allocation.enable" : "primaries" effectively a no-op.
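
If that is the cause, one way to let the persistent "primaries" value take effect is to clear the transient override by setting it to null; a minimal sketch, assuming es_util forwards extra arguments to curl:

oc exec -c elasticsearch <any-es-pod> -n openshift-logging -- es_util '--query=_cluster/settings' -X PUT -d '{"transient":{"cluster.routing.allocation.enable":null}}'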

Comment 7 Anping Li 2020-06-08 07:50:16 UTC
Verified
 clusterlogging.4.4.0-202006061254 -> clusterlogging.v4.6.0 
 elasticsearch-operator.4.4.0-202006061254 -> elasticsearch-operator.v4.6.0

Comment 9 errata-xmlrpc 2020-10-27 16:05:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196