Bug 1746537 - Deployment with config-download too slow when compared to non config-download deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: async
Target Release: 13.0 (Queens)
Assignee: Emilien Macchi
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-28 17:31 UTC by Sai Sindhur Malleni
Modified: 2019-11-07 14:02 UTC

Fixed In Version: openstack-tripleo-common-8.6.8-17.el7ost
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-07 14:02:10 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 680516 'None' MERGED Use blockinfile for tripleo-ssh-known-hosts 2020-05-02 18:06:06 UTC
OpenStack gerrit 681082 'None' MERGED Use blockinfile for tripleo-ssh-known-hosts 2020-05-02 18:06:06 UTC
Red Hat Product Errata RHBA-2019:3794 None None None 2019-11-07 14:02:37 UTC

Description Sai Sindhur Malleni 2019-08-28 17:31:31 UTC
Description of problem:
In OSP 13, deploying a 100-node bare metal stack takes about 120 minutes with the default (non config-download) deployment method. On the same setup, I tried config-download.

The heat stack was created in about 84 minutes, and the ansible run failed immediately afterwards due to NIC issues on the overcloud hardware. After fixing those, I reran the deploy, which took 176 minutes to finish (had the heat stack also needed to be created, this could easily have taken 200+ minutes). Updating the already created heat stack (from the failed deploy) took only about 20 minutes, so roughly 150+ minutes went to running ansible. This is extremely slow compared to non config-download deployments, which finish in under 120 minutes at the same scale.
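
The breakdown above can be checked with a quick calculation. This is a minimal sketch in Python that uses only the figures reported in this description; the variable names are illustrative, and it assumes that everything in the rerun other than the stack update was Ansible time.

```python
# Figures reported in the description, in minutes.
baseline_total = 120   # default (non config-download) deployment
stack_create = 84      # initial heat stack creation with config-download
rerun_total = 176      # rerun: stack update plus the ansible run
stack_update = 20      # approximate stack update time during the rerun

# Assumption: the rest of the rerun was spent running ansible.
ansible_time = rerun_total - stack_update

# Hypothetical fresh deployment: full stack creation followed by the same ansible run.
fresh_estimate = stack_create + ansible_time

print(f"ansible run: ~{ansible_time} min")
print(f"estimated fresh config-download deploy: ~{fresh_estimate} min")
print(f"slowdown vs. default method: {fresh_estimate / baseline_total:.1f}x")
```

The estimate for a fresh deployment is only an extrapolation from the rerun; it was not measured.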

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:
100%

Steps to Reproduce:
1. Deploy a large overcloud with config-download
2. Time the deployment

Actual results:
It takes very long to deploy a large overcloud with config-download (176 minutes for 100 nodes, even with the heat stack already created).

Expected results:
Config-download deployment times should be comparable to non config-download deployment times.

Additional info:

Comment 1 Sai Sindhur Malleni 2019-08-28 17:32:58 UTC
The ansible.cfg that was laid down:
[defaults]
roles_path = /etc/ansible/roles:/usr/share/ansible/roles
retry_files_enabled = False
log_path = /var/lib/mistral/7d13efac-08c8-4f9c-8d30-37a8bb89c0a8/ansible.log
forks = 25
timeout = 30
gather_timeout = 30

[inventory]

[privilege_escalation]

[paramiko_connection]

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
control_path_dir = /var/lib/mistral/7d13efac-08c8-4f9c-8d30-37a8bb89c0a8/ansible-ssh
retries = 8
pipelining = True

[persistent_connection]

[accelerate]

[selinux]

[colors]

[diff]

Comment 2 James Slagle 2019-08-28 19:29:21 UTC
The biggest offender seems to be the Host prep steps play. Due to the number of roles in this deployment, there is a lot of task skipping going on as we go through the task list role by role. The first task only applies to a single role, the second task only applies to a single role, etc. All other roles are skipped for each task.

It seems to be much quicker to use a separate play per role rather than a separate task per role. Each play can be limited to just a single role, so Ansible does not need to compute what will be skipped.
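
To illustrate why per-role plays help, here is a simple counting model in Python. It is only a sketch: it counts (task, host) pairs that have to be considered, and it does not measure Ansible itself. The host and task counts, and the OtherCompute role name, are made up for illustration.

```python
def single_play_evaluations(hosts_per_role, tasks_per_role):
    """One play over all hosts: every role's tasks are considered for every
    host, and hosts outside the owning role are skipped."""
    total_hosts = sum(hosts_per_role.values())
    total_tasks = sum(tasks_per_role.values())
    return total_hosts * total_tasks


def per_role_play_evaluations(hosts_per_role, tasks_per_role):
    """One play per role: each task is considered only for hosts in its role."""
    return sum(hosts_per_role[role] * tasks_per_role[role] for role in hosts_per_role)


# Hypothetical layout; these numbers are not from the bug report.
hosts = {"Controller": 3, "r630Compute": 47, "OtherCompute": 50}
tasks = {"Controller": 40, "r630Compute": 20, "OtherCompute": 20}

single = single_play_evaluations(hosts, tasks)
per_role = per_role_play_evaluations(hosts, tasks)
print(f"single play:    {single} task/host pairs ({single - per_role} skipped)")
print(f"per-role plays: {per_role} task/host pairs")
```

In this model the skipped pairs grow with both the number of roles and the number of hosts, which is consistent with the slowdown being most visible on large, multi-role deployments.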

I limited the testing to 2 roles to speed things up a bit. Using separate plays reduced the total run time from about 23m to about 4m. (Note that using strategy:free had little effect.)

# --tags host_prep_steps --limit Controller:r630Compute
real    23m31.119s
user    23m25.239s
sys     7m40.223s

# --tags host_prep_steps --limit Controller:r630Compute, strategy:free
real    23m48.704s
user    23m56.315s
sys     7m45.093s

# --tags host_prep_steps --limit Controller:r630Compute, separate plays
real    4m32.785s
user    4m16.400s
sys     1m42.193s

# --tags host_prep_steps --limit Controller:r630Compute, separate plays, strategy:free
real    4m15.653s
user    4m27.188s
sys     1m43.599s

# --tags host_prep_steps separate plays
real    12m23.470s
user    12m33.849s
sys     6m40.841s
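
The speedup implied by the two-role timings above can be worked out as follows. This is a small Python sketch; the labels are shorthand for the four `--limit Controller:r630Compute` runs, and only the `real` values are used.

```python
import re


def to_seconds(stamp):
    """Convert a `time` value such as '23m31.119s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` times from the runs limited to Controller:r630Compute.
real = {
    "single play": "23m31.119s",
    "single play, strategy:free": "23m48.704s",
    "separate plays": "4m32.785s",
    "separate plays, strategy:free": "4m15.653s",
}

baseline = to_seconds(real["single play"])
for label, stamp in real.items():
    seconds = to_seconds(stamp)
    print(f"{label:32s} {seconds:8.1f}s  {baseline / seconds:.2f}x")
```

The final run (all roles, separate plays, 12m23s) has no single-play counterpart in this comment, so no ratio is computed for it.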

We'd need to patch tripleo-heat-templates and backport to 13 to get this benefit.

Comment 3 James Slagle 2019-08-28 19:31:41 UTC
Another thing: the tasks from HostPrepDeployment are actually being run twice, once as a standalone deployment during pre_deploy_steps and again during host_prep_steps.

We can remove the standalone deployment. This patch https://review.opendev.org/#/c/623098 could be backported to 13.

This should save a few minutes.

Comment 4 James Slagle 2019-08-28 19:34:57 UTC
We could also look at backporting the patches where we parallelized pre and post deployments:

https://review.opendev.org/#/c/574474/
https://review.opendev.org/#/c/574473/

This would likely also save some time.

Comment 5 James Slagle 2019-08-28 22:07:34 UTC
I've posted the following patches for review:

Remove HostPrepConfig:

https://review.opendev.org/679146 (rocky)
https://review.opendev.org/679147 (queens)

Parallelize pre/post deployments:

https://review.opendev.org/679151 (queens)
https://review.opendev.org/679152 (queens)

Use separate plays for Host prep steps:

https://review.opendev.org/679149 (master)

Comment 8 James Slagle 2019-09-05 20:55:56 UTC
This patch will fix the issue with tripleo-ssh-known-hosts: https://review.opendev.org/#/c/680516/

Comment 10 Emilien Macchi 2019-09-09 15:34:01 UTC
I've done most of the backports of the patches mentioned in this BZ, except for https://review.opendev.org/#/c/680516/ for now; Kevin Carter is looking at it.

Comment 11 Bob Fournier 2019-09-09 17:34:24 UTC
I was building tripleo-heat-templates for another bug and picked up the fixes here, so the THT version has been added to Fixed In Version. Not moving to MODIFIED yet because of Comment 10.

Comment 12 Kevin Carter 2019-09-11 22:36:06 UTC
https://review.opendev.org/#/c/680516/ has now been backported. Moving to MODIFIED at this time.

Comment 22 Uemit Seren 2019-10-21 18:34:45 UTC
Is there an ETA for this patch?
We have a 200 node OpenStack cloud and are considering switching to config-download.
I guess without this patch this is going to be painfully slow.

Also, are there any plans to backport the --config-download-only flag for the CLI?

Comment 24 errata-xmlrpc 2019-11-07 14:02:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3794

