Bug 1844097 - The ES pods could not become READY during upgrade.
Summary: The ES pods could not become READY during upgrade.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard: backport:4.5
Depends On:
Blocks: 1845118
 
Reported: 2020-06-04 15:39 UTC by Anping Li
Modified: 2024-03-25 16:00 UTC
CC: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:05:23 UTC
Target Upstream Version:
Embargoed:


Attachments
Upgrade steps or logs (9.01 KB, text/plain)
2020-06-04 15:40 UTC, Anping Li
elasticsearch pod log (69.17 KB, text/plain)
2020-06-04 15:42 UTC, Anping Li


Links:
  Github openshift origin-aggregated-logging pull 1921 (closed): Bug 1844097: Removing check that keeps only elected master seeding (last updated 2020-12-10 09:09:02 UTC)
  Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:05:49 UTC)

Description Anping Li 2020-06-04 15:39:25 UTC
Description of problem:
The ES cluster could not become ready until I deleted all ES pods.


Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Deploy clusterlogging 4.4.
2. Upgrade EO to 4.5.
3. Apply the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3.
4. Check the clusterlogging status.
5. Upgrade CLO.
6. Check the cluster logging status.
7. Delete all ES pods.
8. Check the ES status (see the sketch after this list).
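For reference, a minimal sketch of how the status checks in steps 6-8 can be run. The clusterlogging CR name "instance" is the usual default and the ES pod name is taken from the output further below, so both are assumptions to be substituted with values from your own cluster:

  # Step 6: inspect the clusterlogging CR status (CR name "instance" assumed)
  oc get clusterlogging instance -n openshift-logging -o yaml
  # Step 8: check pod readiness and query ES cluster health from a running ES pod
  oc get pods -n openshift-logging
  oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/health?pretty'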

Actual results:

See the attachment.

Comment 1 Anping Li 2020-06-04 15:40:45 UTC
Created attachment 1695231 [details]
Upgrade steps or logs

Comment 2 Anping Li 2020-06-04 15:42:11 UTC
Created attachment 1695233 [details]
elasticsearch pod log

Comment 3 Anping Li 2020-06-05 13:35:52 UTC
[anli@preserve-docker-slave 96583]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-568599f687-8prlw       1/1     Running     0          18m
curator-1591363200-t8jrs                        0/1     Completed   0          15m
curator-1591363800-fshbz                        1/1     Running     0          5m1s
elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk    1/2     Running     0          6m48s
elasticsearch-cdm-dkx6l77h-2-589999f69f-bpwtf   1/2     Running     0          5m35s
elasticsearch-cdm-dkx6l77h-3-846df5674d-4rgl7   1/2     Running     0          5m

 oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/settings?pretty'
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "primaries"
        }
      }
    },
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}


{"level":"info","ts":1591363697.9201612,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
time="2020-06-05T13:28:19Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 3 shards in preparation for cluster restart"
time="2020-06-05T13:28:22Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:28:53Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-1 to rejoin cluster elasticsearch"
time="2020-06-05T13:29:30Z" level=info msg="Completed full cluster restart for cert redeploy on elasticsearch"
time="2020-06-05T13:29:34Z" level=info msg="Beginning full cluster restart on elasticsearch"
time="2020-06-05T13:30:06Z" level=info msg="Waiting for all nodes to rejoin cluster \"elasticsearch\" in namespace \"openshift-logging\""
time="2020-06-05T13:30:37Z" level=warning msg="when trying to perform full cluster restart: Timed out waiting for elasticsearch-cdm-dkx6l77h-2 to rejoin cluster elasticsearch"

Comment 4 Lukas Vlcek 2020-06-05 14:22:07 UTC
Why do we have the same setting at both the transient and persistent levels?
Are we aware of https://www.elastic.co/guide/en/elasticsearch/reference/6.8/cluster-update-settings.html#_order_of_precedence ?

Transient settings take precedence over persistent ones, making the persistent "cluster.routing.allocation.enable": "primaries" setting basically a no-op.
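To illustrate the precedence point, a minimal sketch (not the fix that was ultimately merged) of how the conflicting values could be collapsed into one consistent setting: in ES 6.8 setting a transient key to null removes it, and the -XPUT/-d pass-through to es_util mirrors the usual OpenShift logging troubleshooting commands, so treat the exact invocation as an assumption:

  # re-enable allocation at the persistent level and drop the transient override,
  # so a single, consistent value remains
  oc exec -c elasticsearch elasticsearch-cdm-dkx6l77h-1-5bfc78ffd-r5psk -- es_util '--query=_cluster/settings' -XPUT -d '{"persistent":{"cluster.routing.allocation.enable":"all"},"transient":{"cluster.routing.allocation.enable":null}}'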

Comment 7 Anping Li 2020-06-08 07:50:16 UTC
Verified
 clusterlogging.4.4.0-202006061254 -> clusterlogging.v4.6.0 
 elasticsearch-operator.4.4.0-202006061254 -> elasticsearch-operator.v4.6.0

Comment 9 errata-xmlrpc 2020-10-27 16:05:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

