Description of problem:
In OSP 13, deploying a 100-node baremetal stack takes about 120 minutes with the default deployment method. On the same setup with config-download, the heat stack was created in about 84 minutes and the ansible run failed immediately afterwards because of NIC issues on the overcloud hardware. I fixed that and reran the deploy, which took 176 minutes to finish (had the heat stack also needed to be created, this could easily have taken 200+ minutes). Updating the already created heat stack (from the failed deploy) took only about 20 minutes, so the ansible run itself took 150+ minutes. This is extremely slow compared to non-config-download deployments, which finish in under 120 minutes at the same scale.

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:
100%

Steps to Reproduce:
1. Deploy a large overcloud with config-download
2. Time the deployment

Actual results:
Takes very long to deploy a large overcloud with config-download

Expected results:
Config-download deployment times should be comparable to non-config-download deployment times.

Additional info:
ansible.cfg laid down:

[defaults]
roles_path = /etc/ansible/roles:/usr/share/ansible/roles
retry_files_enabled = False
log_path = /var/lib/mistral/7d13efac-08c8-4f9c-8d30-37a8bb89c0a8/ansible.log
forks = 25
timeout = 30
gather_timeout = 30

[inventory]

[privilege_escalation]

[paramiko_connection]

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
control_path_dir = /var/lib/mistral/7d13efac-08c8-4f9c-8d30-37a8bb89c0a8/ansible-ssh
retries = 8
pipelining = True

[persistent_connection]

[accelerate]

[selinux]

[colors]

[diff]
The biggest offender seems to be the Host prep steps play. Because of the number of roles in this deployment, there is a lot of task skipping as we go through the task list role by role. The first task applies to only a single role, the second task applies to only a single role, and so on. All other roles are skipped for each task.

It seems to be much quicker to use a separate play per role rather than a separate task per role. Each play can be limited to just a single role, so Ansible does not need to compute what will be skipped.

I limited the testing to 2 roles to speed things up a bit. Using separate plays took the total run time from about 23m to about 4m. (Note that using strategy:free had little effect.)

# --tags host_prep_steps --limit Controller:r630Compute
real 23m31.119s
user 23m25.239s
sys 7m40.223s

# --tags host_prep_steps --limit Controller:r630Compute, strategy:free
real 23m48.704s
user 23m56.315s
sys 7m45.093s

# --tags host_prep_steps --limit Controller:r630Compute, separate plays
real 4m32.785s
user 4m16.400s
sys 1m42.193s

# --tags host_prep_steps --limit Controller:r630Compute, separate plays, strategy:free
real 4m15.653s
user 4m27.188s
sys 1m43.599s

# --tags host_prep_steps separate plays
real 12m23.470s
user 12m33.849s
sys 6m40.841s

We'd need to patch tripleo-heat-templates and backport to 13 to get this benefit.
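To illustrate why the layout matters, here is a rough sketch that counts how many (task, host) pairs Ansible has to consider under each layout. It is only a counting model, not a wall-clock predictor. Apart from Controller and r630Compute, the role names are invented, and all host and task counts are made-up illustration values, not taken from this deployment.

```python
# Rough counting model: how many (task, host) pairs are considered per layout.
# ASSUMPTION: all host counts and task counts below are hypothetical, and the
# role names other than Controller and r630Compute are invented.

def pairs_single_play(hosts_per_role, tasks_per_role):
    """One play over all hosts, one role-specific task list after another.

    Every task is considered for every host in the play; hosts that do not
    belong to the task's role end up as skipped.
    """
    total_hosts = sum(hosts_per_role.values())
    total_tasks = sum(tasks_per_role.values())
    return total_hosts * total_tasks


def pairs_separate_plays(hosts_per_role, tasks_per_role):
    """One play per role: each task is only considered for its own role's hosts."""
    return sum(hosts_per_role[role] * tasks_per_role[role]
               for role in hosts_per_role)


# Hypothetical 100-node overcloud with four roles and ten prep tasks per role.
hosts = {"Controller": 3, "r630Compute": 40, "r730Compute": 40, "CephStorage": 17}
tasks = {role: 10 for role in hosts}

single = pairs_single_play(hosts, tasks)
separate = pairs_separate_plays(hosts, tasks)

print("single play    :", single, "pairs")
print("separate plays :", separate, "pairs")
print("skipped pairs avoided:", single - separate)
```

With these made-up numbers, the single play considers 4000 pairs, of which 3000 are skips, while the per-role plays consider only the 1000 pairs that do real work. The gap widens as the number of roles grows, which fits the observation above that the slowdown is tied to the number of roles.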
Another thing: the tasks from HostPrepDeployment are actually run twice, once as a standalone deployment during pre_deploy_steps and again during host_prep_steps. We can remove the standalone deployment. This patch could be backported to 13:

https://review.opendev.org/#/c/623098

This should save a few minutes.
We could also look at backporting the patches where we parallelized pre and post deployments:

https://review.opendev.org/#/c/574474/
https://review.opendev.org/#/c/574473/

This would likely also save some time.
I've posted the following patches for review:

Remove HostPrepConfig:
https://review.opendev.org/679146 (rocky)
https://review.opendev.org/679147 (queens)

Parallelize pre/post deployments:
https://review.opendev.org/679151 (queens)
https://review.opendev.org/679152 (queens)

Use separate plays for Host prep steps:
https://review.opendev.org/679149 (master)
This patch will fix the issue with tripleo-ssh-known-hosts: https://review.opendev.org/#/c/680516/
I've done most of the backports of the patches mentioned in this BZ, except for https://review.opendev.org/#/c/680516/ for now; Kevin Carter is looking at that one.
I was building tripleo-heat-templates for another bug and picked up the fixes here, so the THT version has been added to FixedInVersion. Not moving to MODIFIED yet because of Comment 10.
https://review.opendev.org/#/c/680516/ has now been backported. Moving to MODIFIED at this time.
Is there an ETA for this patch? We have a 200-node OpenStack cloud and are considering switching to config-download. I guess without this patch it is going to be painfully slow. Also, are there any plans to backport the --config-download-only flag for the CLI?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3794