Bug 1544243 - Elasticsearch fails to scale up during installation when multiple replicas specified [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1540099 1581058
Blocks:
Reported: 2018-02-11 16:02 UTC by Andrew Block
Modified: 2018-05-22 05:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When creating an ES cluster of 3 or more nodes, the node quorum and recovery settings prevent the first ES node from reaching a ready, green state in time during a fresh install.
Consequence: The playbook times out waiting for the first ES node to be ready.
Fix: When creating new ES nodes, the playbook no longer waits for each one to become healthy, because the changed quorum and recovery settings require all nodes to be running at the same time.
Result: The playbook no longer times out when creating large clusters of ES nodes.
Clone Of:
Environment:
Last Closed: 2018-04-05 09:38:31 UTC
Target Upstream Version:
ewolinet: needinfo? (andrew.block)
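
The Doc Text attributes the timeout to node quorum. A minimal sketch of that reasoning follows, assuming the common Elasticsearch majority rule of floor(N / 2) + 1 master-eligible nodes; whether the playbook computes its quorum exactly this way is an assumption, and the function names are hypothetical.

```sh
#!/bin/sh
# Assumption: quorum follows the majority rule floor(N / 2) + 1.
es_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds only if `started` running nodes satisfy the quorum
# for a cluster of `size` nodes.
can_elect_master() {
    started=$1
    size=$2
    [ "$started" -ge "$(es_quorum "$size")" ]
}

for size in 1 3 5; do
    echo "cluster_size=$size quorum=$(es_quorum "$size")"
done
```

Under this assumption a 3-node cluster needs 2 running nodes, so a lone first node cannot reach quorum no matter how long the playbook retries.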


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3365421 None None None 2018-02-27 22:23:28 UTC
Red Hat Product Errata RHBA-2018:0636 None None None 2018-04-05 09:39:21 UTC

Description Andrew Block 2018-02-11 16:02:34 UTC
Description of problem:

The Aggregated Logging playbook cannot run to completion when multiple replicas of Elasticsearch are specified.

It fails to roll out the Elasticsearch replicas:

FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (60 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (59 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (58 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (57 retries left).

The issue can be worked around by specifying the following inventory variable:

logging_elasticsearch_rollout_override=true

Once the playbook completes, each Elasticsearch DeploymentConfig can be rolled out manually.

Version-Release number of selected component (if applicable):

3.7.23

How reproducible:

Always

Steps to Reproduce:
1. Specify multiple replicas of Elasticsearch in inventory

openshift_logging_es_number_of_replicas

2. Execute Aggregated Logging playbook

ansible-playbook [-i </path/to/inventory>] \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml

Actual results:

Playbook fails as Elasticsearch never becomes ready


Expected results:

Aggregated logging playbook completes successfully

Additional info:

Comment 1 ewolinet 2018-02-12 14:39:44 UTC
Andy,
is this during a fresh installation or an upgrade?

Comment 2 Andrew Block 2018-02-12 15:58:45 UTC
(In reply to ewolinet from comment #1)
> Andy,
> is this during a fresh installation or an upgrade?

It occurs on both installs and upgrades.

Comment 3 ewolinet 2018-02-12 16:53:57 UTC
This should resolve the fresh install issue: https://github.com/openshift/openshift-ansible/pull/7097

When you say that the upgrade fails, does the playbook ultimately fail, or do you just see "FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (# retries left)." show up in the logs a lot?

Also, to clarify: by upgrade, do you mean that an existing deployment of logging is being upgraded (not just that OCP is being upgraded and logging is being freshly installed)?

Comment 4 Mark McKinstry 2018-02-12 23:24:09 UTC
I've been using the below command to deploy the ES pods after using Andy's workaround:

for x in $(oc get dc -l component=es -o=custom-columns=NAME:.metadata.name --no-headers); do oc rollout latest $x; done;
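
The one-liner above can be expanded into a function that also waits for the rollouts to finish. This is only a sketch: the function name `rollout_es_dcs` is hypothetical, and it assumes `oc` is logged in to the cluster and already switched to the logging project. All rollouts are triggered before any is waited on, since the nodes may not report ready until their peers are also running.

```sh
#!/bin/sh
# Hypothetical helper: roll out every Elasticsearch DeploymentConfig,
# then wait for each rollout to complete.
rollout_es_dcs() {
    # List the DC names carrying the component=es label.
    dcs=$(oc get dc -l component=es \
        -o=custom-columns=NAME:.metadata.name --no-headers) || return 1

    # Trigger every rollout first so all nodes start together.
    for dc in $dcs; do
        oc rollout latest "dc/$dc" || return 1
    done

    # Only then block on each rollout in turn.
    for dc in $dcs; do
        oc rollout status "dc/$dc" || return 1
    done
}

# Usage, after the playbook has completed with the workaround variable set:
#   rollout_es_dcs
```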

Comment 7 Anping Li 2018-03-01 02:56:11 UTC
Same issue with openshift3/ose-ansible/images/v3.7?

Comment 10 Anping Li 2018-03-06 06:45:16 UTC
Pass with openshift-ansible:v3.7.36.

Comment 15 errata-xmlrpc 2018-04-05 09:38:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636

