Bug 1571517

Summary: ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)
Product: OpenShift Container Platform
Reporter: Rajnikant <rkant>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED WORKSFORME
QA Contact: Anping Li <anli>
Severity: unspecified
Priority: unspecified
Version: 3.7.0
CC: aos-bugs, ewolinet, rkant, rmeggins
Target Milestone: ---
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-05-01 19:31:28 UTC

Description Rajnikant 2018-04-25 03:39:14 UTC
Description of problem:
ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.7

How reproducible:

The ansible code for installing the logging modules (ES, Kibana, Fluentd) does not work when deploying an ES cluster (minimum 3 nodes). This happens only on a fresh installation.

The file is located at roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml.
The playbook always fails because the ES pod never becomes ready (it is waiting for quorum while the ES cluster is being deployed).

<snip from restart_es_node.yml>
  command: >
    oc rollout latest {{ _es_node }} -n {{ openshift_logging_elasticsearch_namespace }}

- name: "Waiting for {{ _es_node }} to finish scaling up"
  oc_obj:
    state: list
    name: "{{ _es_node }}"
    namespace: "{{ openshift_logging_elasticsearch_namespace }}"
    kind: dc
  register: _dc_output
  until:
    - _dc_output.results.results[0].status is defined
    - _dc_output.results.results[0].status.readyReplicas is defined
    - _dc_output.results.results[0].status.readyReplicas > 0
    - _dc_output.results.results[0].status.updatedReplicas is defined
    - _dc_output.results.results[0].status.updatedReplicas > 0
  retries: 60
  delay: 30
</snip>
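
For context: the loop above waits on the DC status of a single node while that node's readiness probe is itself blocked waiting for quorum, which cannot form until the remaining ES nodes are also rolled out. A minimal sketch of the alternative ordering (hypothetical, assuming a variable such as _es_dc_names that lists all ES DeploymentConfig names) would trigger every rollout before waiting on any of them:

<snip hypothetical sketch, not from the playbook>
# Trigger rollouts for every ES DeploymentConfig first, then wait for readiness,
# so the nodes can reach quorum together instead of each rollout blocking alone.
- name: "Rolling out {{ item }}"
  command: >
    oc rollout latest {{ item }} -n {{ openshift_logging_elasticsearch_namespace }}
  with_items: "{{ _es_dc_names }}"

- name: "Waiting for {{ item }} to finish scaling up"
  oc_obj:
    state: list
    name: "{{ item }}"
    namespace: "{{ openshift_logging_elasticsearch_namespace }}"
    kind: dc
  register: _dc_output
  until:
    - _dc_output.results.results[0].status.readyReplicas is defined
    - _dc_output.results.results[0].status.readyReplicas > 0
  retries: 60
  delay: 30
  with_items: "{{ _es_dc_names }}"
</snip>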


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Jeff Cantrill 2018-04-25 14:28:11 UTC
I have reservations that simply adding a 30-second wait will consistently resolve this issue. What if the cluster takes a long time to pull the new images? What if Elasticsearch does not have enough memory and takes longer than 30 seconds to initialize?
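
A rough sketch of the alternative, assuming the ES image ships the es_util helper and that a hypothetical _es_pod_name variable resolves to the node's pod, would poll cluster health rather than sleep for a fixed interval:

<snip hypothetical sketch>
# Poll cluster health from inside the pod instead of relying on a fixed delay;
# the task keeps retrying until the health output reports green or yellow.
- name: "Waiting for {{ _es_node }} cluster health"
  command: >
    oc exec {{ _es_pod_name }} -n {{ openshift_logging_elasticsearch_namespace }}
    -c elasticsearch -- es_util --query=_cluster/health?pretty
  register: _health_output
  failed_when: false
  changed_when: false
  until: "'green' in _health_output.stdout or 'yellow' in _health_output.stdout"
  retries: 60
  delay: 30
</snip>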

Comment 5 ewolinet 2018-04-25 19:33:14 UTC
What version of openshift-ansible is being run against this? This should have been resolved already by [1], which prevents doing a health check against each ES node when we are scaling up [2].

Looking at the snippet you pasted, it seems you do not have the latest fixes.



[1] https://github.com/openshift/openshift-ansible/commit/15933df93f37e6fa3e70c2f724504c97ed109e3b

[2] https://github.com/openshift/openshift-ansible/blob/15933df93f37e6fa3e70c2f724504c97ed109e3b/roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml#L6
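
The shape of that change, as described above (the variable and task names here are hypothetical, not taken from the commit), is a guard that skips the per-node health check while the cluster is still scaling up:

<snip hypothetical sketch>
# Skip the per-node health check during scale-up, since quorum cannot be
# reached until the remaining ES nodes have been deployed.
- name: "Checking health of {{ _es_node }}"
  command: >
    oc exec {{ _es_pod_name }} -n {{ openshift_logging_elasticsearch_namespace }}
    -c elasticsearch -- es_util --query=_cluster/health?pretty
  when: not (__es_scaling_up | default(false) | bool)
  register: _cluster_health
  failed_when: false
  changed_when: false
</snip>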