Bug 1578605 - [free-int] timeout waiting for elastic search pods to be !red
Summary: [free-int] timeout waiting for elastic search pods to be !red
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.10.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1581058
Blocks:
 
Reported: 2018-05-16 00:45 UTC by Justin Pierce
Modified: 2018-12-20 21:46 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: As part of a change in how our handlers restart clusters, the handlers now always check that the pods are running and ready. Consequence: When scaling up a cluster that requires more than one node, a pod cannot become ready because it is still waiting for the other cluster members to join. Fix: When scaling up, we no longer wait for each pod to become ready individually, so the cluster can form with the correct number of members. Result: The pods are able to become ready.
Clone Of:
Environment:
Last Closed: 2018-12-20 21:12:46 UTC
Target Upstream Version:


Attachments
The ansible logs for logging upgrade (463.31 KB, application/x-gzip)
2018-05-29 10:27 UTC, Anping Li

Description Justin Pierce 2018-05-16 00:45:51 UTC
Description of problem:
During an upgrade of free-int to v3.10.0-0.37.0, the logging upgrade playbooks timed out:

RUNNING HANDLER [openshift_logging_elasticsearch : command] ********************
Friday 11 May 2018  17:06:49 +0000 (0:00:00.473)       0:09:18.097 ************ 
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (40 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (39 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (38 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (37 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (36 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (35 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (34 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (33 retries left).
....
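
For context, the check being retried is essentially a query of the Elasticsearch cluster health API, requiring a status of 'green' or 'yellow'. A minimal sketch, assuming it is run from inside one of the ES pods with the usual logging admin certs; the pod name is a placeholder:

  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cluster/health?pretty'
  # The handler keeps retrying while "status" is "red"; "green" or "yellow" lets it proceed.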

Comment 3 Anping Li 2018-05-17 01:52:18 UTC
@ewolinet, thanks. The ES cluster is in red status. Did the ES pod restart set shard allocation to enabled on one ES node while it was still disabled on the other nodes?

Comment 4 ewolinet 2018-05-17 14:30:16 UTC
@Anping,

One of the changes we are making is to disable shard allocation before the rollout of a node and re-enable it after the node is available, but prior to waiting for the cluster to return to 'green'.

The issue we are seeing is that when a new index is created while shard allocation is set to 'none', none of the shards for that index can be placed, which automatically puts the cluster into a 'red' state. This change should allow the cluster to return to 'green' between restarts.
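
For reference, a minimal sketch of that toggle using the Elasticsearch cluster settings API; the pod name is a placeholder and the container name and cert paths assume the usual openshift logging layout:

  # Disable shard allocation before rolling a node
  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s -XPUT --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

  # Re-enable allocation once the node is back up, before waiting for 'green'
  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s -XPUT --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'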

Comment 6 Anping Li 2018-05-18 01:59:19 UTC
Shall we use a persistent setting? I think the transient setting may be lost during an ES restart.
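
For illustration, the difference is only which section of the cluster settings body is used; transient settings are cleared on a full cluster restart, while persistent settings are kept (cert flags and <es-pod> placeholder as in the earlier examples):

  # transient: cleared on a full cluster restart
  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s -XPUT --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'

  # persistent: retained across a full cluster restart
  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s -XPUT --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cluster/settings' \
      -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'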

Comment 11 Anping Li 2018-05-28 14:14:07 UTC
Upgraded from v3.9 to v3.10 via openshift-ansible-3.10.0-0.53.0. The ES cluster was not restarted.

The playbook report:
Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart.

Comment 12 Anping Li 2018-05-29 10:27:43 UTC
Created attachment 1445323 [details]
The ansible logs for logging upgrade

cluster_pods.stdout_lines is 1; it should be 3. All ansible logs are attached.

RUNNING HANDLER [openshift_logging_elasticsearch : debug] *********************************************************************************************************************************************************
ok: [qe-anli310master-etcd-1.0529-l0l.qe.rhcloud.com] => {
    "msg": "Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart."
}

RUNNING HANDLER [openshift_logging_elasticsearch : debug] *********************************************************************************************************************************************************
ok: [qe-anli310master-etcd-1.0529-l0l.qe.rhcloud.com] => {
    "msg": "pod status is green, number_of_nodes is 3, cluster_pods.stdout_lines is 1"
}
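
For reference, a rough way to cross-check the pod count the handler derived against the cluster's own view; the label selector is an assumption based on the usual logging labels, and <es-pod> is a placeholder:

  # Number of running ES pods according to OpenShift
  oc get pods -n logging -l component=es -o name | wc -l

  # number_of_nodes according to Elasticsearch
  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cluster/health?pretty' | grep number_of_nodes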

Comment 15 Anping Li 2018-05-31 09:05:59 UTC
I observed red indices during v3.9 testing today. While I was redeploying logging, an automation script was creating and deleting projects; some project indices became red, and the .operations and .orphaned indices also became red. Not sure if that is the same issue, just leaving a note here.
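
For anyone reproducing this, a quick way to list which indices are red, again assuming the standard logging certs and an <es-pod> placeholder:

  oc exec -n logging <es-pod> -c elasticsearch -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/indices?v' | grep '^red '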

Comment 19 Anping Li 2018-06-07 02:58:16 UTC
The upgrade works well with 3.10.0-0.60.0.

