Bug 1835046
| Summary: | Logging 4.4 shardAllocationEnabled is `none` after upgrade OCP cluster from 4.4 to 4.5. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5 | CC: | aos-bugs, ewolinet, periklis, ssadhale |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:38:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Qiaoling Tang, 2020-05-13 01:46:04 UTC)
@Periklis, I think that is different. For example, the status is Yellow even when cluster.routing.allocation.enable=all (https://bugzilla.redhat.com/show_bug.cgi?id=1838153):

    # oc rsh elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4
    sh-4.2$ es_cluster_health
    {
      "cluster_name" : "elasticsearch",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 16,
      "active_shards" : 16,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 10,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 61.53846153846154
    }

    sh-4.2$ es_util --query=_cluster/settings
    {"persistent":{"discovery":{"zen":{"minimum_master_nodes":"1"}}},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

    sh-4.2$ es_util --query=_cat/shards
    .kibana 0 p STARTED 1 3.2kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    project.logjsonx.9b993d19-0818-4232-8940-cf06a750e965.2020.05.23 0 p STARTED 746 485.2kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 1 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 1 r UNASSIGNED
    infra-write 4 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 4 r UNASSIGNED
    infra-write 2 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 2 r UNASSIGNED
    infra-write 3 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 3 r UNASSIGNED
    infra-write 0 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 0 r UNASSIGNED
    app-write 1 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 1 r UNASSIGNED
    app-write 4 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 4 r UNASSIGNED
    app-write 2 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 2 r UNASSIGNED
    app-write 3 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 3 r UNASSIGNED
    app-write 0 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 0 r UNASSIGNED
    project.logflatx.b1c056f0-4405-45bd-8cea-76338862d9ed.2020.05.23 0 p STARTED 746 542.9kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    .searchguard 0 p STARTED 5 145.3kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac 0 p STARTED 4 64.1kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    .operations.2020.05.23 0 p STARTED 31341 30.9mb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
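As a side note on the diagnostics above: the transient allocation flag can be inspected, and reset if a cluster is left stuck at `none`, through the same `es_util` wrapper. This is only a sketch, assuming `es_util` forwards extra arguments to curl as in the 4.x images; the pod name is simply reused from the output above.

```sh
# Inspect the current cluster-level allocation setting.
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/settings?pretty

# Re-enable allocation for all shards if the transient setting is stuck at "none".
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/settings -X PUT \
  -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
```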
Based on your EO logs, it seems the pods were originally unable to be deployed (by the scheduler), so two of them bypassed the normal upgrade path:

    level=info msg="Requested to update node 'elasticsearch-cdm-m2j2lxw9-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
    level=info msg="Requested to update node 'elasticsearch-cdm-m2j2lxw9-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"

Then we see we timed out waiting on the second one to roll out, but it eventually succeeded and we moved on to a normal upgrade of the last node:

    level=info msg="Timed out waiting for node elasticsearch-cdm-m2j2lxw9-2 to rollout"
    level=warning msg="Failed to progress update of unschedulable node 'elasticsearch-cdm-m2j2lxw9-2': timed out waiting for the condition"
    level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-m2j2lxw9-1: red / green"

The odd thing is that the bypassing logic does not change the shard allocation for the cluster, so it is possible the setting was left over from a prior upgrade? Also, looking at the elasticsearch CR, only one of the nodes is noted as upgraded in the status. I'll see if I can recreate this.

Per my understanding, the pods were originally unable to be deployed because the upgrade from 4.4 to 4.5 upgrades every node, and the nodes are marked unschedulable while they are being upgraded. I had checked the `shardAllocationEnabled` status before the cluster was upgraded; it was `all`, and everything seemed to work well.

Can you retest this? We have since updated the way we do our upgrades to not use the shard allocation of "none", per https://github.com/openshift/elasticsearch-operator/pull/355

It seems transient.cluster.routing.allocation.enable is `none` by default. If there are new indices, the CLO will change it to `all` momentarily. Moving to verified: Logging can be upgraded even when transient.cluster.routing.allocation.enable is `none`, and on 4.5 transient.cluster.routing.allocation.enable is `all`.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
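For anyone re-verifying this on 4.5, a check along the lines QA describes might look like the following sketch. It assumes the status field is exposed as `.status.shardAllocationEnabled` on the `elasticsearch` CR in `openshift-logging` (as the comments above suggest) and reuses the pod name from the earlier output.

```sh
# Check what the Elasticsearch CR reports for shard allocation after the upgrade.
oc get elasticsearch elasticsearch -n openshift-logging \
  -o jsonpath='{.status.shardAllocationEnabled}{"\n"}'

# Cross-check the live transient setting and the cluster health inside the cluster.
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/settings?pretty
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/health?pretty
```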