Description of problem:
According to the OCP documentation (https://docs.openshift.com/container-platform/3.6/install_config/adding_hosts_to_existing_cluster.html#adding-nodes-advanced [step 6]), nodes that are already part of the OCP cluster should remain in the [nodes] section and new nodes should be added to the [new_nodes] section. Most of the tasks for nodes that are already part of the cluster are skipped, and those that perform something on the target node seem to be insignificant, such as "yum clean all" -- openshift_repos : refresh cache. While this works reasonably well on a small scale, it becomes a problem when scaling up to roughly 500+ hosts. A scaleup from 800->1100 nodes took 6h 35min, which was unacceptable. Scaleup times were increasing in proportion to the number of nodes in the cluster. The workaround I've used to make the scaleup times bearable (3-4h for ~300 nodes) was to leave the old worker nodes out of scaleups completely. I have not observed any side effects, and scaleup times were nearly the same regardless of the size of the cluster from which I started the scaleup.

Version-Release number of the following components:
Any, last tested with OCP 3.7.0-0.184.0.

How reproducible:
Always

Steps to Reproduce:
1. Install a small OCP cluster.
2. Scale up the cluster with playbooks/byo/openshift-node/scaleup.yml

Actual results:
Tasks run unnecessarily on nodes which are already part of the cluster, and many tasks are skipped.

Expected results:
Run install tasks only when necessary.

Additional info:
I've observed ~30 skipped tasks for each node which was already part of the cluster, plus one "yum clean all" -- openshift_repos : refresh cache task, which slows down scaleup times significantly at large scale.

Workaround: Leave the worker nodes which are already part of the existing cluster out of scaleups. Please confirm there are no side effects.
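For reference, the inventory layout the documentation describes can be sketched as follows; the hostnames, labels, and group members here are illustrative, not taken from the affected cluster:

```ini
# Illustrative openshift-ansible inventory for a node scaleup.
# Existing nodes stay in [nodes]; only the hosts being added go in [new_nodes].

[OSEv3:children]
masters
nodes
new_nodes

[masters]
master1.example.com

[nodes]
master1.example.com
node1.example.com openshift_node_labels="{'region': 'primary'}"
node2.example.com openshift_node_labels="{'region': 'primary'}"

[new_nodes]
node3.example.com openshift_node_labels="{'region': 'primary'}"
```

The scaleup is then run against this inventory with something like `ansible-playbook -i <inventory> playbooks/byo/openshift-node/scaleup.yml`, as in the reproduction steps above; the issue reported here is the work the playbook still performs on every host in [nodes].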
Mike, do you think your recent refactor has eliminated the tasks that were run on existing nodes during scaleup?
Scott, yes, I have removed most instances of oo_all_hosts. I believe only new nodes should be touched during scaleup, as there is no need to consult the existing nodes. This should be ready to verify in master.
This bug is targeted for 3.9, but it is attached to a 3.5/3.6/3.7 errata; please attach it to the correct errata.
PR Merged was: https://github.com/openshift/openshift-ansible/pull/6784
Verified in openshift-ansible-3.9.0-0.34.0.git.0.c7d9585.el7.noarch.rpm. According to the `PLAY RECAP` at the end, no changes were performed against the existing nodes; only 5 "ok" tasks executed against the existing nodes:

TASK [Gathering Facts]
TASK [Gathering Facts]
TASK [group_by]
TASK [Gathering Facts]
TASK [group_by]

At large scale, I think this is acceptable and should not introduce significant performance issues now. Please feel free to move this back if you still see the issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489