Bug 1571517

Summary: ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)
Product: OpenShift Container Platform
Reporter: Rajnikant <rkant>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED WORKSFORME
QA Contact: Anping Li <anli>
Severity: unspecified
Priority: unspecified
Version: 3.7.0
CC: aos-bugs, ewolinet, rkant, rmeggins
Target Milestone: ---
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-05-01 19:31:28 UTC

Description Rajnikant 2018-04-25 03:39:14 UTC
Description of problem:
ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.7

How reproducible:

The ansible code for installing the logging modules (ES, Kibana, Fluentd) does not work when deploying an ES cluster (minimum 3 nodes). This happens only on a fresh installation.

The file is located at roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml.
The playbook always fails because the ES pod never becomes ready (it is waiting for quorum while the ES cluster is being deployed).

<snip from restart_es_node.yml>
  command: >
    oc rollout latest {{ _es_node }} -n {{ openshift_logging_elasticsearch_namespace }}

- name: "Waiting for {{ _es_node }} to finish scaling up"
  oc_obj:
    state: list
    name: "{{ _es_node }}"
    namespace: "{{ openshift_logging_elasticsearch_namespace }}"
    kind: dc
  register: _dc_output
  until:
    - _dc_output.results.results[0].status is defined
    - _dc_output.results.results[0].status.readyReplicas is defined
    - _dc_output.results.results[0].status.readyReplicas > 0
    - _dc_output.results.results[0].status.updatedReplicas is defined
    - _dc_output.results.results[0].status.updatedReplicas > 0
  retries: 60
  delay: 30
</snip>
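
For context: the loop above waits on the DC status of a single node while that node's readiness probe is itself blocked waiting for quorum, which cannot form until the remaining ES nodes are also rolled out. A minimal sketch of the alternative ordering (hypothetical, assuming a variable such as _es_dc_names that lists all ES DeploymentConfig names) would trigger every rollout before waiting on any of them:

<snip hypothetical sketch, not from the playbook>
# Trigger rollouts for every ES DeploymentConfig first, then wait for readiness,
# so the nodes can reach quorum together instead of each rollout blocking alone.
- name: "Rolling out {{ item }}"
  command: >
    oc rollout latest {{ item }} -n {{ openshift_logging_elasticsearch_namespace }}
  with_items: "{{ _es_dc_names }}"

- name: "Waiting for {{ item }} to finish scaling up"
  oc_obj:
    state: list
    name: "{{ item }}"
    namespace: "{{ openshift_logging_elasticsearch_namespace }}"
    kind: dc
  register: _dc_output
  until:
    - _dc_output.results.results[0].status.readyReplicas is defined
    - _dc_output.results.results[0].status.readyReplicas > 0
  retries: 60
  delay: 30
  with_items: "{{ _es_dc_names }}"
</snip>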


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Jeff Cantrill 2018-04-25 14:28:11 UTC
I have reservations that simply adding a 30-second wait will consistently resolve this issue. What if the cluster takes a long time to pull the new images? What if Elasticsearch does not have enough memory and takes longer than 30 seconds to initialize?
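
A rough sketch of the alternative, assuming the ES image ships the es_util helper and that a hypothetical _es_pod_name variable resolves to the node's pod, would poll cluster health rather than sleep for a fixed interval:

<snip hypothetical sketch>
# Poll cluster health from inside the pod instead of relying on a fixed delay;
# the task keeps retrying until the health output reports green or yellow.
- name: "Waiting for {{ _es_node }} cluster health"
  command: >
    oc exec {{ _es_pod_name }} -n {{ openshift_logging_elasticsearch_namespace }}
    -c elasticsearch -- es_util --query=_cluster/health?pretty
  register: _health_output
  failed_when: false
  changed_when: false
  until: "'green' in _health_output.stdout or 'yellow' in _health_output.stdout"
  retries: 60
  delay: 30
</snip>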

Comment 5 ewolinet 2018-04-25 19:33:14 UTC
What version of openshift-ansible is being run against this? This should have been resolved already by [1], which prevents doing a health check against each ES node when we are scaling up [2].

Looking at the snippet you pasted, it seems you do not have the latest fixes.



[1] https://github.com/openshift/openshift-ansible/commit/15933df93f37e6fa3e70c2f724504c97ed109e3b

[2] https://github.com/openshift/openshift-ansible/blob/15933df93f37e6fa3e70c2f724504c97ed109e3b/roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml#L6
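
The shape of that change, as described above (the variable and task names here are hypothetical, not taken from the commit), is a guard that skips the per-node health check while the cluster is still scaling up:

<snip hypothetical sketch>
# Skip the per-node health check during scale-up, since quorum cannot be
# reached until the remaining ES nodes have been deployed.
- name: "Checking health of {{ _es_node }}"
  command: >
    oc exec {{ _es_pod_name }} -n {{ openshift_logging_elasticsearch_namespace }}
    -c elasticsearch -- es_util --query=_cluster/health?pretty
  when: not (__es_scaling_up | default(false) | bool)
  register: _cluster_health
  failed_when: false
  changed_when: false
</snip>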