Bug 1844097

Summary: The ES pods could not become READY during upgrade.
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: aos-bugs, cruhm, lvlcek
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: backport:4.5
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:05:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1845118

Attachments:
Description              Flags
Upgrade steps or logs    none
elasticsearch pod log    none

Description Anping Li 2020-06-04 15:39:25 UTC
Description of problem:
The ES cluster couldn't become ready until I deleted all ES pods.
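
For reference, a minimal sketch of the "delete all ES pods" workaround mentioned above, assuming the default openshift-logging namespace and the component=elasticsearch pod label used by cluster logging:

oc delete pods -l component=elasticsearch -n openshift-logging
oc get pods -n openshift-logging -w    # watch for the recreated ES pods to reach 2/2 READY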


Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy clusterlogging 4.4
2. Upgrade EO to 4.5
3. Apply the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3
4. Check the clusterlogging status (see the status-check sketch after this list)
5. Upgrade CLO
6. Check the cluster logging status
7. Delete all ES pods
8. Check the ES status
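
A minimal sketch of the status checks in steps 4 and 6, assuming the clusterlogging CR is named "instance" in the openshift-logging namespace and using the es_util helper shown in comment 3:

oc get clusterlogging instance -n openshift-logging -o yaml    # inspect .status for the Elasticsearch cluster state
oc get pods -n openshift-logging                               # every ES pod should eventually report 2/2 READY
oc exec -c elasticsearch <any-es-pod> -n openshift-logging -- es_util '--query=_cluster/health?pretty'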

Actual results:

See the attachment.

Comment 1 Anping Li 2020-06-04 15:40:45 UTC
Created attachment 1695231 [details]
Upgrade steps or logs

Comment 2 Anping Li 2020-06-04 15:42:11 UTC
Created attachment 1695233 [details]
elasticsearch pod log

Comment 3 Anping Li 2020-06-05 13:35:52 UTC
[anli@preserve-docker-slave 96583]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-568599f687-8prlw       1/1     Running     0          18m
curator-1591363200-t8jrs                        0/1     Completed   0          15m
curator-1591363800-fshbz                        1/1     Running     0          5m1s
elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk    1/2     Running     0          6m48s
elasticsearch-cdm-dkx6l77h-2-589999f69f-bpwtf   1/2     Running     0          5m35s
elasticsearch-cdm-dkx6l77h-3-846df5674d-4rgl7   1/2     Running     0          5m

 oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/settings?pretty'
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "primaries"
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}


{"level":"info","ts":1591363697.9201612,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
time="2020-06-05T13:28:19Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 3 shards in preparation for cluster restart"
time="2020-06-05T13:28:22Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:28:53Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-1 to rejoin cluster elasticsearch"
time="2020-06-05T13:29:30Z" level=info msg="Completed full cluster restart for cert redeploy on elasticsearch"
time="2020-06-05T13:29:34Z" level=info msg="Beginning full cluster restart on elasticsearch"
time="2020-06-05T13:30:06Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:30:37Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-2 to rejoin cluster elasticsearch"

Comment 4 Lukas Vlcek 2020-06-05 14:22:07 UTC
Why do we have the same setting set at both the transient and persistent levels?
Are we aware of https://www.elastic.co/guide/en/elasticsearch/reference/6.8/cluster-update-settings.html#_order_of_precedence ?

The transient setting takes precedence over the persistent one, making "cluster.routing.allocation.enable" : "primaries" effectively a no-op.
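
If that is the cause, one way to let the persistent "primaries" value take effect is to clear the transient override by setting it to null; a minimal sketch, assuming es_util forwards extra arguments to curl:

oc exec -c elasticsearch <any-es-pod> -n openshift-logging -- es_util '--query=_cluster/settings' -X PUT -d '{"transient":{"cluster.routing.allocation.enable":null}}'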

Comment 7 Anping Li 2020-06-08 07:50:16 UTC
Verified
 clusterlogging.4.4.0-202006061254 -> clusterlogging.v4.6.0 
 elasticsearch-operator.4.4.0-202006061254 -> elasticsearch-operator.v4.6.0

Comment 9 errata-xmlrpc 2020-10-27 16:05:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196