Bug 1571517 - ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)
Summary: ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-25 03:39 UTC by Rajnikant
Modified: 2018-05-01 19:31 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-01 19:31:28 UTC
Target Upstream Version:



Description Rajnikant 2018-04-25 03:39:14 UTC
Description of problem:
ansible playbook fails as ES pod cannot get ready (Waiting for Quorum due to cluster deployment)

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.7

How reproducible:

The ansible code for installing the logging components (ES, Kibana, Fluentd) does not work when deploying a multi-node ES cluster (minimum 3 nodes). This happens only on a fresh installation.

The file is located at roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml.
The playbook always fails because the ES pod never becomes ready (it is waiting for quorum during cluster deployment).

<snip from restart_es_node.yml>
  command: >
    oc rollout latest {{ _es_node }} -n {{ openshift_logging_elasticsearch_namespace }}

- name: "Waiting for {{ _es_node }} to finish scaling up"
  oc_obj:
    state: list
    name: "{{ _es_node }}"
    namespace: "{{ openshift_logging_elasticsearch_namespace }}"
    kind: dc
  register: _dc_output
  until:
    - _dc_output.results.results[0].status is defined
    - _dc_output.results.results[0].status.readyReplicas is defined
    - _dc_output.results.results[0].status.readyReplicas > 0
    - _dc_output.results.results[0].status.updatedReplicas is defined
    - _dc_output.results.results[0].status.updatedReplicas > 0
  retries: 60
  delay: 30
</snip>
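
For context on the quorum symptom: the wait above only inspects the DeploymentConfig's replica counts, while the pod itself stays unready because Elasticsearch is still waiting for enough master-eligible nodes. On a fresh install of a 3-node cluster, the first node rolled out cannot reach a quorum of 2 until a second node joins. A generic sketch of the Elasticsearch 2.x discovery settings involved (values and the service name are illustrative, not copied from the deployed configuration):

  # elasticsearch.yml (illustrative) -- quorum settings for 3 master-eligible nodes
  discovery:
    zen:
      # quorum = floor(3 / 2) + 1 = 2, so a lone first node cannot elect a master
      minimum_master_nodes: 2
      ping:
        unicast:
          # headless service the ES nodes use to discover each other; name illustrative
          hosts: ["logging-es-cluster"]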


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Jeff Cantrill 2018-04-25 14:28:11 UTC
I have reservations that simply adding a 30-second wait will consistently resolve this issue. What if the cluster takes a long time to pull the new images? What if Elasticsearch does not have enough memory and takes longer than 30 seconds to initialize itself?
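
For reference, the wait task quoted in the description already polls rather than sleeping once: 60 retries at a 30-second delay allows roughly 30 minutes. A purely hypothetical variant that would make this budget tunable for slow image pulls or slow ES startup, rather than adding a fixed pause (the two variable names are illustrative and not part of openshift-ansible):

  - name: "Waiting for {{ _es_node }} to finish scaling up"
    oc_obj:
      state: list
      name: "{{ _es_node }}"
      namespace: "{{ openshift_logging_elasticsearch_namespace }}"
      kind: dc
    register: _dc_output
    until:
      - _dc_output.results.results[0].status is defined
      - _dc_output.results.results[0].status.readyReplicas is defined
      - _dc_output.results.results[0].status.readyReplicas > 0
      - _dc_output.results.results[0].status.updatedReplicas is defined
      - _dc_output.results.results[0].status.updatedReplicas > 0
    # hypothetical tunables; the defaults match the hard-coded values in the snippet
    retries: "{{ openshift_logging_es_rollout_retries | default(60) }}"
    delay: "{{ openshift_logging_es_rollout_delay | default(30) }}"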

Comment 5 ewolinet 2018-04-25 19:33:14 UTC
What version of openshift-ansible is being run against this cluster? This should already have been resolved by [1], which prevents doing a health check against each ES node when we are scaling up [2].

Looking at the snippet you pasted, it seems you do not have the latest fixes.



[1] https://github.com/openshift/openshift-ansible/commit/15933df93f37e6fa3e70c2f724504c97ed109e3b

[2] https://github.com/openshift/openshift-ansible/blob/15933df93f37e6fa3e70c2f724504c97ed109e3b/roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml#L6
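
For readers without the commit handy: the referenced change gates the per-node cluster health check so that it is skipped while the cluster is still being scaled up, instead of waiting on a quorum that cannot yet be reached. A rough, illustrative sketch of that kind of gating -- the __es_scaling_up variable, the _es_pod variable, and the curl invocation are assumptions for illustration, not copied from the commit:

  # Illustrative only; see the linked commit [2] for the actual change.
  - name: "Checking health of {{ _es_node }}"
    command: >
      oc exec -n {{ openshift_logging_elasticsearch_namespace }} {{ _es_pod }} --
      curl -s --cacert /etc/elasticsearch/secret/admin-ca
      --cert /etc/elasticsearch/secret/admin-cert
      --key /etc/elasticsearch/secret/admin-key
      https://localhost:9200/_cat/health
    register: _es_health
    # keep polling even if the command fails while the pod is still starting
    failed_when: false
    until: "'green' in _es_health.stdout or 'yellow' in _es_health.stdout"
    retries: 60
    delay: 30
    # skip the quorum-dependent check while additional nodes are still being added
    when: not (__es_scaling_up | default(false) | bool)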

