Description of problem:
In OSP 13, deploying a 100-baremetal-node stack takes about 120 minutes with the default deployment method.
On the same setup with config-download, the Heat stack creates in about 84 minutes, after which the Ansible run fails immediately due to NIC issues on the overcloud hardware. After fixing that and rerunning the deploy, it took 176 minutes to finish (if the Heat stack had also needed to be created, this could easily have taken 200+ minutes). Updating the already created Heat stack (from the failed deploy) took only about 20 minutes, so the Ansible run itself took 150+ minutes. This is extremely slow compared to non-config-download deployments, which finish in under 120 minutes at the same scale.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy a large overcloud with config-download
2. Time the deployment
Actual results:
It takes very long to deploy a large overcloud with config-download.
Expected results:
Config-download deployment times should be comparable to non-config-download deployment times.
The ansible.cfg laid down for the deployment:
roles_path = /etc/ansible/roles:/usr/share/ansible/roles
retry_files_enabled = False
log_path = /var/lib/mistral/7d13efac-08c8-4f9c-8d30-37a8bb89c0a8/ansible.log
forks = 25
timeout = 30
gather_timeout = 30
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
control_path_dir = /var/lib/mistral/7d13efac-08c8-4f9c-8d30-37a8bb89c0a8/ansible-ssh
retries = 8
pipelining = True
The biggest offender seems to be the "Host prep steps" play. Due to the number of roles in this deployment, there is a lot of task skipping as Ansible works through the task list role by role: the first task applies only to a single role, the second task applies only to another single role, and so on, so every other role's hosts are skipped for each task.
It is much quicker to use a separate play per role instead of a separate task per role. Each play can be limited to just a single role's hosts, so Ansible does not need to compute what will be skipped at every task.
I limited the testing to 2 roles to speed things up a bit. Using separate plays cut the total run time from 23m to 4m. (Note that using strategy: free had little effect.)
# --tags host_prep_steps --limit Controller:r630Compute
# --tags host_prep_steps --limit Controller:r630Compute, strategy:free
# --tags host_prep_steps --limit Controller:r630Compute, separate plays
# --tags host_prep_steps --limit Controller:r630Compute, separate plays, strategy:free
# --tags host_prep_steps separate plays
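The play-per-role change can be sketched roughly as follows. This is an illustration only, not the actual tripleo-heat-templates output; the role names, tag, and task file paths are placeholders:

```yaml
# Before (sketch): one play over all overcloud nodes, with every
# role's host prep tasks in a single task list. Each task is
# conditioned on role membership, so Ansible still has to evaluate
# and skip it on every host belonging to the other roles.
- hosts: overcloud
  tags: host_prep_steps
  tasks:
    - include_tasks: Controller/host_prep_tasks.yaml
      when: "'Controller' in group_names"
    - include_tasks: r630Compute/host_prep_tasks.yaml
      when: "'r630Compute' in group_names"

# After (sketch): one short play per role. Each play targets only
# that role's hosts, so there is no per-task skip computation for
# the other roles at all.
- hosts: Controller
  tags: host_prep_steps
  tasks:
    - include_tasks: Controller/host_prep_tasks.yaml

- hosts: r630Compute
  tags: host_prep_steps
  tasks:
    - include_tasks: r630Compute/host_prep_tasks.yaml
```

With many roles, the "before" shape makes every task cost O(all hosts) in condition evaluation, which is where the 23m-to-4m difference comes from.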
We'd need to patch tripleo-heat-templates and backport to 13 to get this benefit.
Another thing: the tasks from HostPrepDeployment are actually being run twice, once as a standalone deployment during pre_deploy_steps and again during host_prep_steps.
We can remove the standalone deployment. This patch https://review.opendev.org/#/c/623098 could be backported to 13.
This should save a few minutes.
We could also look at backporting the patch where we parallelized pre and post deployments:
This would likely also save some time.
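For context, a plausible way such parallelization works is Ansible's free strategy, which lets each host proceed through its deployment tasks without waiting for the other hosts at every task boundary. A minimal sketch, assuming a placeholder deployments.yaml task file (not the actual patch):

```yaml
# Sketch only: with the default linear strategy, a single slow host
# blocks every other host at each task. With strategy: free, each
# host runs its pre/post deployment tasks independently.
- hosts: overcloud
  strategy: free
  tasks:
    - include_tasks: deployments.yaml
```

Whether the real patch uses this mechanism or another form of batching should be checked against the review linked below.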
I've posted the following patches for review:
Parallelize pre/post deployments:
Use separate plays for Host prep steps:
This patch will fix the issue with tripleo-ssh-known-hosts: https://review.opendev.org/#/c/680516/
I've done most of the backports of the patches mentioned in this BZ, except for https://review.opendev.org/#/c/680516/ for now; Kevin Carter is looking at it.
I was building tripleo-heat-templates for another bug and picked up the fixes here so the THT version has been added to FixedInVersion. Not moving to MODIFIED yet because of Comment 10.
https://review.opendev.org/#/c/680516/ has now been backported. Moving to MODIFIED at this time.
Is there an ETA for this patch?
We have a 200-node OpenStack cloud and are considering switching to config-download.
I guess without this patch it is going to be painfully slow.
Also, are there any plans to backport the --config-download-only flag for the CLI?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.