Bug 1578605

Summary: [free-int] timeout waiting for elastic search pods to be !red
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Logging
Assignee: ewolinet
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, jcantril, pruan, rmeggins, xtian
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: As part of a change to how our handlers restart clusters, they now always check that the pods are running. Consequence: When scaling up, a cluster that requires more than one node never becomes ready, because each pod is waiting for the other members to join. Fix: When scaling up, we no longer wait for each individual pod to be ready, so the cluster can find the correct number of members. Result: The pods are able to become ready. (A sketch of this behavior follows the attachment list below.)
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-20 21:12:46 UTC
Type: Bug
Bug Depends On: 1581058    
Bug Blocks:    
Attachments: The ansible logs for logging upgrade
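
A minimal sketch of the behavior described in the Doc Text above, assuming illustrative task, variable, and flag names (this is not the actual openshift_logging_elasticsearch handler): the per-pod readiness wait is skipped while scaling up, so new members can discover each other, and only a cluster-level health check gates the run.

# Hypothetical sketch of the fix: skip the per-pod readiness wait while
# scaling up so new members can join; names and values are illustrative.
- name: Wait for the Elasticsearch pod to become ready
  command: >
    oc get pod {{ _es_pod }} -n {{ openshift_logging_namespace | default('logging') }}
    -o jsonpath='{.status.containerStatuses[0].ready}'
  register: _pod_ready
  until: _pod_ready.stdout == "true"
  retries: 30
  delay: 10
  # Hypothetical flag set while scaling up, so the cluster can first reach
  # the expected number of members.
  when: not _skip_pod_wait | default(false)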

Description Justin Pierce 2018-05-16 00:45:51 UTC
Description of problem:
During an upgrade of free-int to v3.10.0-0.37.0, the logging upgrade playbooks timed out:

RUNNING HANDLER [openshift_logging_elasticsearch : command] ********************
Friday 11 May 2018  17:06:49 +0000 (0:00:00.473)       0:09:18.097 ************ 
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (40 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (39 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (38 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (37 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (36 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (35 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (34 retries left).
FAILED - RETRYING: Waiting for ES node logging-es-data-master-t7rrl3te health to be in ['green', 'yellow'] (33 retries left).
....
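
For reference, the handler that is timing out above is essentially a retried health check of this shape (a sketch only, assuming oc exec with the admin certs; pod name, namespace, cert paths, and delays are illustrative, not the verbatim role task):

# Sketch of the health wait seen in the log: poll _cluster/health on the
# node until it reports green or yellow, up to 40 retries.
- name: "Waiting for ES node {{ _es_node }} health to be in ['green', 'yellow']"
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/health
  register: _cluster_health
  until: (_cluster_health.stdout | from_json).status in ['green', 'yellow']
  retries: 40
  delay: 30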

Comment 3 Anping Li 2018-05-17 01:52:18 UTC
@ewolinet, thanks. The ES cluster status is red! Did the ES pod restarts re-enable shard allocation on one ES node while it was still disabled on the other nodes?

Comment 4 ewolinet 2018-05-17 14:30:16 UTC
@Anping,

One of the changes we are making is to disable shard allocation before the rollout of a node and re-enable shard allocation after the node is available again, but prior to waiting for the cluster to return to 'green'.

The issue we are seeing is that if a new index is created while shard allocation is set to 'none', none of the shards for that index can be placed, which automatically puts the cluster into a 'red' state. This change should allow the cluster to return to 'green' between restarts.
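
Roughly, the rollout brackets each node restart with cluster-settings calls like the following (a sketch using a transient setting; pod name, namespace, and cert paths are illustrative):

# Sketch of the restart sequence described above: disable shard allocation,
# roll the node, then re-enable allocation once the node is back, before
# waiting for the cluster to return to green.
- name: Disable shard allocation before rolling the node
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s -XPUT
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/settings
    -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

- name: Re-enable shard allocation once the node is back up
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s -XPUT
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/settings
    -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'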

Comment 6 Anping Li 2018-05-18 01:59:19 UTC
Shall we use a persistent setting? I think the transient setting may be reset during an ES restart.
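
For comparison, a persistent setting is applied the same way but under the "persistent" key, so it survives a full-cluster restart, whereas transient settings are dropped when the whole cluster restarts. A sketch with the same illustrative names as above:

# Sketch: the same call with "persistent" instead of "transient", so the
# value survives a full-cluster restart.
- name: Persistently re-enable shard allocation
  command: >
    oc exec {{ _es_pod }} -n logging -c elasticsearch -- curl -s -XPUT
    --cacert /etc/elasticsearch/secret/admin-ca
    --cert /etc/elasticsearch/secret/admin-cert
    --key /etc/elasticsearch/secret/admin-key
    https://localhost:9200/_cluster/settings
    -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'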

Comment 11 Anping Li 2018-05-28 14:14:07 UTC
Upgraded from v3.9 to v3.10 via openshift-ansible-3.10.0-0.53.0. The ES cluster was not restarted.

The playbook reported:
Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart.

Comment 12 Anping Li 2018-05-29 10:27:43 UTC
Created attachment 1445323 [details]
The ansible logs for logging upgrade

The cluster_pods.stdout_lines count is 1; it should be 3. All Ansible logs are attached.

RUNNING HANDLER [openshift_logging_elasticsearch : debug] *********************************************************************************************************************************************************
ok: [qe-anli310master-etcd-1.0529-l0l.qe.rhcloud.com] => {
    "msg": "Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart."
}

RUNNING HANDLER [openshift_logging_elasticsearch : debug] *********************************************************************************************************************************************************
ok: [qe-anli310master-etcd-1.0529-l0l.qe.rhcloud.com] => {
    "msg": "pod status is green, number_of_nodes is 3, cluster_pods.stdout_lines is 1"
}
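
The message above comes from a comparison roughly like this (a sketch, not the actual handler; variable names are illustrative): the cluster is only rolled automatically when health is green and the number of nodes ES reports matches the number of ES pods the playbook found.

# Sketch of the guard that produced the message above. _cluster_health is
# assumed to hold the parsed _cluster/health response, cluster_pods the
# registered output of an 'oc get pods' listing the ES pods. Here
# number_of_nodes is 3 but cluster_pods found only 1 pod.
- name: Decide whether the cluster can be restarted automatically
  set_fact:
    _cluster_restartable: >-
      {{ _cluster_health.status == 'green'
         and (_cluster_health.number_of_nodes | int) == (cluster_pods.stdout_lines | count) }}

- debug:
    msg: "Cluster logging-es was not in an optimal state and will not be automatically restarted. Please see documentation regarding doing a rolling cluster restart."
  when: not _cluster_restartable | bool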

Comment 15 Anping Li 2018-05-31 09:05:59 UTC
I observed red indices in v3.9 testing today. When I redeployed logging, an automation script was creating and deleting projects, and some project indices became red; the .operations and .orphaned indices also became red. Not sure if it is the same issue; just leaving a note here.

Comment 19 Anping Li 2018-06-07 02:58:16 UTC
The upgrade works well with 3.10.0-0.60.0.