Bug 1507469 - Tasks for nodes already in the cluster running unnecessarily during scaleup
Summary: Tasks for nodes already in the cluster running unnecessarily during scaleup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Michael Gugino
QA Contact: Gan Huang
URL:
Whiteboard: aos-scalability-37
Depends On:
Blocks:
 
Reported: 2017-10-30 10:24 UTC by Jiří Mencák
Modified: 2018-03-28 14:09 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 14:08:55 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2018:0489 -- last updated 2018-03-28 14:09:21 UTC

Description Jiří Mencák 2017-10-30 10:24:32 UTC
Description of problem:

According to OCP documentation (https://docs.openshift.com/container-platform/3.6/install_config/adding_hosts_to_existing_cluster.html#adding-nodes-advanced [step 6]), nodes that are already part of the OCP cluster should remain in the [nodes] section and new nodes should be added into [new_nodes] section.
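
For illustration, a minimal inventory for that documented flow might look like the following (host names are placeholders, not taken from this environment):

[OSEv3:children]
masters
nodes
new_nodes

[nodes]
master1.example.com
node1.example.com
node2.example.com

[new_nodes]
node501.example.com
node502.example.com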

Most of the tasks for nodes that are already part of the cluster are skipped, and those that do perform something on the target node seem insignificant, such as the "yum clean all" in openshift_repos : refresh cache.

While this works reasonably well on a small scale, it becomes a problem when scaling up to roughly 500+ hosts.  A scaleup from 800 to 1100 nodes took 6h 35min, which was unacceptable.  Scaleup times were increasing in proportion to the number of nodes in the cluster.  The workaround I've used to make the scaleup times bearable (3-4h for ~300 nodes) was to leave old worker nodes out of scaleups completely.  I have not observed any side effects, and scaleup times were nearly the same regardless of the size of the cluster from which I started the scaleup.

Version-Release number of the following components:
Any; last tested with OCP 3.7.0-0.184.0.

How reproducible:
Always

Steps to Reproduce:
1. Install a small OCP cluster.
2. Scale up the cluster with playbooks/byo/openshift-node/scaleup.yml
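
An example invocation, assuming a standard inventory path (the path is illustrative):

# ansible-playbook -i /etc/ansible/hosts playbooks/byo/openshift-node/scaleup.yml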

Actual results:
Tasks run unnecessarily on nodes which are already part of the cluster, and many tasks are skipped on those nodes.

Expected results:
Run install tasks only when necessary.

Additional info:
I've observed ~30 skipped tasks for each node which was already part of the cluster, plus one "yum clean all" (openshift_repos : refresh cache) task, which slows down scaleup times significantly at large scale.

Workaround:
Leave the worker nodes which are already part of the existing cluster out of scaleups.  Please confirm whether this has side effects.
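
A sketch of what the workaround inventory looks like during such a scaleup, assuming the masters stay listed and only the existing worker nodes are dropped (host names are placeholders):

[nodes]
master1.example.com

[new_nodes]
node801.example.com
node802.example.com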

Comment 2 Scott Dodson 2018-01-24 13:13:55 UTC
Mike,

Do you think your recent refactor has eliminated the tasks that were run on existing nodes during scaleup?

Comment 3 Michael Gugino 2018-01-24 14:39:06 UTC
Scott,

  Yes, I have removed most instances of oo_all_hosts. I believe only the new nodes should be touched during scale-up, as there is no need to consult the existing nodes.  This should be ready to verify in master.
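
For context only (this is not the actual diff from the PR referenced in comment 6), the general pattern is narrowing plays that used to target every host so that they only touch the nodes being added, e.g. a play like

- hosts: oo_all_hosts          # ran against every node, including existing ones
  tasks:
    - name: example task
      debug:
        msg: "touching {{ inventory_hostname }}"

becomes a play scoped to the nodes actually being configured, e.g. hosts: oo_nodes_to_config.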

Comment 5 Johnny Liu 2018-01-25 05:34:32 UTC
This bug is targeted for 3.9, but it is attached to a 3.5/3.6/3.7 errata; please attach it to the correct errata.

Comment 6 Michael Gugino 2018-01-25 14:07:27 UTC
The merged PR was: https://github.com/openshift/openshift-ansible/pull/6784

Comment 9 Gan Huang 2018-01-31 10:00:48 UTC
Verified in openshift-ansible-3.9.0-0.34.0.git.0.c7d9585.el7.noarch.rpm

According to the `PLAY RECAP` at the end, no changes were performed against the existing nodes; only 5 "ok" tasks were executed against them:

TASK [Gathering Facts]
TASK [Gathering Facts] 
TASK [group_by]
TASK [Gathering Facts]
TASK [group_by]

At large scale, I think this is acceptable and should not introduce significant performance issues now. Please feel free to move the bug back if you still see the issue.

Comment 12 errata-xmlrpc 2018-03-28 14:08:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

