Bug 1544243

Summary:	Elasticsearch fails to scale up during installation when multiple replicas specified
Product:	OpenShift Container Platform	Reporter:	Andrew Block <andrew.block>
Component:	Logging	Assignee:	ewolinet
Status:	CLOSED ERRATA	QA Contact:	Anping Li <anli>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.7.0	CC:	andrew.block, aos-bugs, dmoessne, mmckinst, per.carlson, qitang, rmeggins, stwalter, tlarsson, wsun
Target Milestone:	---
Target Release:	3.7.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: When creating an ES cluster of size 3+ the node quorum and recovery settings prevent the first ES node from ever reaching a ready and green state in time during a fresh install. Consequence: The playbook times out waiting for the first ES node to be ready. Fix: When we create new ES nodes, we do not wait for them to be healthy since the recovery settings and quorum would have changed and will need all nodes to be running at the same time. Result: We no longer see the playbook time out when creating large clusters of ES nodes.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-04-05 09:38:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1540099, 1581058
Bug Blocks:

Description Andrew Block 2018-02-11 16:02:34 UTC

Description of problem:

The Aggregated Logging playbook cannot run to completion when multiple replicas of Elasticsearch are specified.

It fails to roll out the Elasticsearch replicas:

FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (60 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (59 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (58 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (57 retries left).

The issue can be worked around by specifying the following inventory variable:

logging_elasticsearch_rollout_override=true

Once the playbook completes, each Elasticsearch DeploymentConfig can be rolled out manually.

Version-Release number of selected component (if applicable):

3.7.23

How reproducible:

Always

Steps to Reproduce:
1. Specify multiple replicas of Elasticsearch in inventory

openshift_logging_es_number_of_replicas

2. Execute Aggregated Logging playbook

ansible-playbook [-i </path/to/inventory>] \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml

Actual results:

The playbook fails because Elasticsearch never becomes ready.
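
The Doc Text in the header attributes the failure to node quorum and recovery settings. As a rough illustration of the quorum half of that, the sketch below assumes the usual majority rule, floor(N / 2) + 1, for an N-node cluster. The exact values the installer writes are not shown in this report, and the function names are made up for the example.

```shell
# Majority quorum for an N-node cluster: floor(N / 2) + 1.
# Assumption: the installer derives its quorum setting this way.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when `running` nodes are enough to reach quorum for a
# cluster of `size` nodes.
can_elect_master() {
    size=$1
    running=$2
    [ "$running" -ge "$(quorum "$size")" ]
}

# A fresh install that waits on the first node has exactly one node running.
for size in 1 3 5; do
    if can_elect_master "$size" 1; then
        verdict="can"
    else
        verdict="cannot"
    fi
    echo "cluster size $size: quorum $(quorum "$size"), a lone node $verdict elect a master"
done
```

Under that assumption, a lone node only reaches quorum when the cluster size is 1, which is consistent with the playbook timing out while it waits for the first node of a larger cluster.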


Expected results:

Aggregated logging playbook completes successfully

Additional info:

Comment 1 ewolinet 2018-02-12 14:39:44 UTC

Andy,
is this during a fresh installation or an upgrade?

Comment 2 Andrew Block 2018-02-12 15:58:45 UTC

(In reply to ewolinet from comment #1)
> Andy,
> is this during a fresh installation or an upgrade?

It occurs on both installs and upgrades.

Comment 3 ewolinet 2018-02-12 16:53:57 UTC

This should resolve the fresh install issue: https://github.com/openshift/openshift-ansible/pull/7097

When you say that the upgrade fails, is it that the playbook ultimately fails, or that "FAILED - RETRYING: Waiting for logging-es-data-master-hsjwgec4 to finish scaling up (# retries left)." shows up in the logs a lot?

Also, to clarify: by upgrade, do you mean that there is an existing deployment of logging and it is being upgraded (not just that OCP is being upgraded and logging is being freshly installed)?

Comment 4 Mark McKinstry 2018-02-12 23:24:09 UTC

I've been using the below command to deploy the ES pods after using Andy's workaround:

for x in $(oc get dc -l component=es -o=custom-columns=NAME:.metadata.name --no-headers); do oc rollout latest "$x"; done
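
A variation on the loop above, as a rough sketch: it triggers every rollout first and only then waits on each one, since with a multi-node quorum no single node is expected to become ready on its own. The `rollout_es` function name and the overridable `OC` variable are made up for illustration. `oc rollout latest` and `oc rollout status` are the standard subcommands.

```shell
# OC can be overridden (for example OC=echo) to preview the commands
# without touching a cluster. This knob is an illustrative assumption.
OC="${OC:-oc}"

# Usage: rollout_es NAME [NAME...]
rollout_es() {
    # Trigger all rollouts first so the nodes come up together.
    for dc in "$@"; do
        $OC rollout latest "dc/$dc" || return 1
    done
    # Only then block until each deployment reports completion.
    for dc in "$@"; do
        $OC rollout status "dc/$dc" || return 1
    done
}
```

It could be called with the same name list as the loop above: rollout_es $(oc get dc -l component=es -o=custom-columns=NAME:.metadata.name --no-headers)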

Comment 7 Anping Li 2018-03-01 02:56:11 UTC

Same issue with openshift3/ose-ansible/images/v3.7?

Comment 10 Anping Li 2018-03-06 06:45:16 UTC

Pass with openshift-ansible:v3.7.36.

Comment 15 errata-xmlrpc 2018-04-05 09:38:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636

Comment 17 Red Hat Bugzilla 2023-09-15 00:06:27 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days