Bug 1578605

Summary: [free-int] timeout waiting for elastic search pods to be !red
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Logging
Assignee: ewolinet
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, jcantril, pruan, rmeggins, xtian
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: As part of a change to how our handlers restart clusters, they now always check that the pods are running. Consequence: When scaling up, a cluster that requires more than one node never becomes ready, because each pod is waiting for the other members to join. Fix: When scaling up, we no longer wait for each individual pod to be ready, so the cluster can find the correct number of members. Result: The pods are able to become ready. (A sketch of this behavior follows the attachment list below.)
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-20 21:12:46 UTC
Type: Bug
Bug Depends On: 1581058    
Bug Blocks:    
Attachments: The ansible logs for logging upgrade
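
A minimal sketch of the behavior described in the Doc Text above, assuming illustrative task, variable, and flag names (this is not the actual openshift_logging_elasticsearch handler): the per-pod readiness wait is skipped while scaling up, so new members can discover each other, and only a cluster-level health check gates the run.

# Hypothetical sketch of the fix: skip the per-pod readiness wait while
# scaling up so new members can join; names and values are illustrative.
- name: Wait for the Elasticsearch pod to become ready
  command: >
    oc get pod {{ _es_pod }} -n {{ openshift_logging_namespace | default('logging') }}
    -o jsonpath='{.status.containerStatuses[0].ready}'
  register: _pod_ready
  until: _pod_ready.stdout == "true"
  retries: 30
  delay: 10
  # Hypothetical flag set while scaling up, so the cluster can first reach
  # the expected number of members.
  when: not _skip_pod_wait | default(false)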

Description Justin Pierce 2018-05-16 00:45:51 UTC
Description of problem:
During an upgrade of free-int to v3.10.0-0.37.0, the logging upgrade playbooks timed out:

RUNNING HANDLER [openshift_logging_elasticsearch : command] ********************
Friday 11 May 2018  17:06:49 +0000 (0:00:00.473)       0:09:18.097 ************ 
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (40 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (39 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (38 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (37 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (36 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (35 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (34 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (33 retries left).
....
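
For reference, the handler that is timing out above is essentially a retried health check of this shape (a sketch only, assuming oc exec with the admin certs; pod name, namespace, cert paths, and delays are illustrative, not the verbatim role task):

# Sketch of the health wait seen in the log: poll _cluster/health on the
# node until it reports green or yellow, up to 40 retries.
- name: "Waiting for ES node {{ _es_node }} health to be in ['green', 'yellow']"
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/health
  register: _cluster_health
  until: (_cluster_health.stdout | from_json).status in ['green', 'yellow']
  retries: 40
  delay: 30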

Comment 3 Anping Li 2018-05-17 01:52:18 UTC
@ewolinet, thanks. The ES cluster status is red! Did the ES pod restarts re-enable shard allocation on one ES node while it was still disabled on the other nodes?

Comment 4 ewolinet 2018-05-17 14:30:16 UTC
@Anping,

One of the changes we are making is to disable shard allocation before the rollout of a node and re-enable shard allocation after the node is available again, but prior to waiting for the cluster to return to 'green'.

The issue we are seeing is that if a new index is created while shard allocation is set to 'none', none of the shards for that index can be placed, which automatically puts the cluster into a 'red' state. This change should allow the cluster to return to 'green' between restarts.
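
Roughly, the rollout brackets each node restart with cluster-settings calls like the following (a sketch using a transient setting; pod name, namespace, and cert paths are illustrative):

# Sketch of the restart sequence described above: disable shard allocation,
# roll the node, then re-enable allocation once the node is back, before
# waiting for the cluster to return to green.
- name: Disable shard allocation before rolling the node
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s -XPUT
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/settings
    -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

- name: Re-enable shard allocation once the node is back up
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s -XPUT
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/settings
    -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'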

Comment 6 Anping Li 2018-05-18 01:59:19 UTC
Shall we use a persistent setting? I think the transient setting may be reset during an ES restart.
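
For comparison, a persistent setting is applied the same way but under the "persistent" key, so it survives a full-cluster restart, whereas transient settings are dropped when the whole cluster restarts. A sketch with the same illustrative names as above:

# Sketch: the same call with "persistent" instead of "transient", so the
# value survives a full-cluster restart.
- name: Persistently re-enable shard allocation
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s -XPUT
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/settings
    -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'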

Comment 11 Anping Li 2018-05-28 14:14:07 UTC
Upgraded from v3.9 to v3.10 via openshift-ansible-3.10.0-0.53.0. The ES cluster was not restarted.

The playbook reported:
Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart.

Comment 12 Anping Li 2018-05-29 10:27:43 UTC
Created attachment 1445323 [details]
The ansible logs for logging upgrade

The cluster_pods.stdout_lines count is 1; it should be 3. All Ansible logs are attached.

RUNNING HANDLER [openshift_logging_elasticsearch : debug] *********************************************************************************************************************************************************
ok: [qe-anli310master-etcd-1.0529-l0l.qe.rhcloud.com] => {
    "msg": "Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart."
}

RUNNING HANDLER [openshift_logging_elasticsearch : debug] *********************************************************************************************************************************************************
ok: [qe-anli310master-etcd-1.0529-l0l.qe.rhcloud.com] => {
    "msg": "pod status is green, number_of_nodes is 3, cluster_pods.stdout_lines is 1"
}
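
The message above comes from a comparison roughly like this (a sketch, not the actual handler; variable names are illustrative): the cluster is only rolled automatically when health is green and the number of nodes ES reports matches the number of ES pods the playbook found.

# Sketch of the guard that produced the message above. _cluster_health is
# assumed to hold the parsed _cluster/health response, cluster_pods the
# registered output of an 'oc get pods' listing the ES pods. Here
# number_of_nodes is 3 but cluster_pods found only 1 pod.
- name: Decide whether the cluster can be restarted automatically
  set_fact:
    _cluster_restartable: >-
      {{ _cluster_health.status == 'green'
         and (_cluster_health.number_of_nodes | int) == (cluster_pods.stdout_lines | count) }}

- debug:
    msg: "Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart."
  when: not _cluster_restartable | bool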

Comment 15 Anping Li 2018-05-31 09:05:59 UTC
I observed red indices in v3.9 testing today. When I redeployed logging, an automation script was creating and deleting projects, and some project indices became red; the .operations and .orphaned indices also became red. Not sure if it is the same issue; just leaving a note here.

Comment 19 Anping Li 2018-06-07 02:58:16 UTC
The upgrade works well with 3.10.0-0.60.0.