Description of problem:
The ES cluster couldn't become ready until I deleted all ES pods.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy clusterlogging 4.4
2. Upgrade EO to 4.5
3. Apply the workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3
4. Check the clusterlogging status
5. Upgrade CLO
6. Check the cluster logging status
7. Delete all ES pods (see the command sketch after this list)
8. Check the ES status

Actual results:
See the attachment.
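A minimal sketch of the commands used for the status checks and pod deletion, assuming the default openshift-logging namespace, a ClusterLogging CR named "instance", and the component=elasticsearch pod label applied by the operator:

  # Check the overall logging stack status (steps 4 and 6)
  oc get clusterlogging instance -n openshift-logging -o yaml

  # Delete all ES pods so the deployments recreate them (step 7)
  oc delete pods -l component=elasticsearch -n openshift-logging

  # Watch the ES pods come back and check readiness (step 8)
  oc get pods -l component=elasticsearch -n openshift-logging -w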
Created attachment 1695231 [details] Upgrade steps or logs
Created attachment 1695233 [details] elasticsearch pod log
[anli@preserve-docker-slave 96583]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-568599f687-8prlw       1/1     Running     0          18m
curator-1591363200-t8jrs                        0/1     Completed   0          15m
curator-1591363800-fshbz                        1/1     Running     0          5m1s
elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk    1/2     Running     0          6m48s
elasticsearch-cdm-dkx6l77h-2-589999f69f-bpwtf   1/2     Running     0          5m35s
elasticsearch-cdm-dkx6l77h-3-846df5674d-4rgl7   1/2     Running     0          5m

oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/settings?pretty'
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "primaries"
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}

{"level":"info","ts":1591363697.9201612,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
time="2020-06-05T13:28:19Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 3 shards in preparation for cluster restart"
time="2020-06-05T13:28:22Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:28:53Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-1 to rejoin cluster elasticsearch"
time="2020-06-05T13:29:30Z" level=info msg="Completed full cluster restart for cert redeploy on elasticsearch"
time="2020-06-05T13:29:34Z" level=info msg="Beginning full cluster restart on elasticsearch"
time="2020-06-05T13:30:06Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:30:37Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-2 to rejoin cluster elasticsearch"
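For reference, cluster health and shard allocation at this point can be inspected with the same es_util helper shown above (pod name taken from the listing; _cluster/health and _cat/shards are standard Elasticsearch 6.x endpoints):

  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/health?pretty'
  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cat/shards?v'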
Why do we have the same setting at both the transient and persistent levels? Are we aware of https://www.elastic.co/guide/en/elasticsearch/reference/6.8/cluster-update-settings.html#_order_of_precedence ? Transient settings take precedence over persistent ones, which makes the persistent "cluster.routing.allocation.enable" : "primaries" effectively a no-op.
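A minimal sketch of clearing the transient override so the persistent "primaries" value would actually take effect, assuming es_util forwards extra arguments to the underlying curl call (per the cluster-update-settings API, setting a transient key to null removes it):

  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- \
    es_util '--query=_cluster/settings' -XPUT -d '{"transient":{"cluster.routing.allocation.enable":null}}'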
Verified:
clusterlogging.4.4.0-202006061254 -> clusterlogging.v4.6.0
elasticsearch-operator.4.4.0-202006061254 -> elasticsearch-operator.v4.6.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196