Bug 1911891 - deployment takes a long time when being run manually on the undercloud with ansible-playbook
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z6
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Alex Schultz
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Duplicates: 1956321 1962589
Depends On: 1935364 1935365 1935366 1935406 1949290 2058027
Blocks: 1974985
Reported: 2021-01-01 10:26 UTC by Dhruv Shah
Modified: 2024-03-25 17:42 UTC (History)
CC List: 19 users

Fixed In Version: openstack-tripleo-common-11.4.1-1.20210329153522.75bd92a.el8ost python-tripleoclient-12.3.2-1.20210329153521.ae58329.el8ost tripleo-ansible-0.5.1-1.20210323173503.902c3c8.el8ost openstack-tripleo-heat-templates-11.3.2-1.20210324233518.29a02c1.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1974985
Environment:
Last Closed: 2021-05-26 11:43:47 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1915761 0 None None None 2021-02-15 21:14:41 UTC
OpenStack gerrit 777416 0 None MERGED Reduce fact gathering 2021-05-26 11:49:55 UTC
OpenStack gerrit 777417 0 None MERGED Update skeleton role to use ansible_facts 2021-05-26 11:49:58 UTC
OpenStack gerrit 777418 0 None MERGED Use ansible_facts instead 2021-05-26 11:49:59 UTC
OpenStack gerrit 777419 0 None MERGED Fix bootstrap ansible_fact 2021-05-26 11:50:01 UTC
OpenStack gerrit 777420 0 None MERGED Drop service facts usage 2021-05-26 11:50:02 UTC
OpenStack gerrit 777421 0 None MERGED Use ansible_facts instead 2021-05-26 11:50:03 UTC
OpenStack gerrit 777422 0 None MERGED Reduce fact gathering 2021-05-26 11:50:04 UTC
OpenStack gerrit 777423 0 None MERGED Fix fact definition for deployments 2021-05-26 11:50:01 UTC
OpenStack gerrit 777424 0 None MERGED Drop inject_facts_as_vars 2021-05-26 11:50:06 UTC
OpenStack gerrit 777425 0 None MERGED Use include task for host prep tasks 2021-05-26 11:50:07 UTC
OpenStack gerrit 782550 0 None MERGED [TRAIN-Only] Update ansible python fact 2021-05-26 11:50:08 UTC
OpenStack gerrit 782579 0 None MERGED Revert "Revert "Add environment variables for better var handling"" 2021-05-26 11:50:09 UTC
Red Hat Issue Tracker OSP-645 0 None None None 2022-02-24 10:16:38 UTC
Red Hat Knowledge Base (Solution) 6975324 0 None None None 2022-09-12 05:43:47 UTC
Red Hat Product Errata RHSA-2021:2119 0 None None None 2021-05-26 11:44:24 UTC

Internal Links: 1935406

Description Dhruv Shah 2021-01-01 10:26:57 UTC
Description of problem:
Running the ansible-playbook command on the undercloud takes too much time:

time ./ansible-playbook-command.sh
PLAY RECAP ************************************************************************************************************************************************************************************************************************************
compute10r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute11r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute12r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute13r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute1r1-prod     : ok=295  changed=119  unreachable=0    failed=0    skipped=1008 rescued=0    ignored=0
compute2r1-prod     : ok=296  changed=118  unreachable=0    failed=0    skipped=1008 rescued=0    ignored=0
compute3r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute4r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute5r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute6r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute7r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute8r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute9r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
controller3v-prod   : ok=306  changed=133  unreachable=0    failed=0    skipped=1035 rescued=0    ignored=0
swift3v-prod        : ok=261  changed=108  unreachable=0    failed=0    skipped=1047 rescued=0    ignored=0
compute10r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute11r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute12r1-prod    : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute1r1-prod     : ok=295  changed=119  unreachable=0    failed=0    skipped=1008 rescued=0    ignored=0
compute1r2-prod     : ok=296  changed=120  unreachable=0    failed=0    skipped=1007 rescued=0    ignored=0
compute2r1-prod     : ok=296  changed=118  unreachable=0    failed=0    skipped=1008 rescued=0    ignored=0
compute3r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute4r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute5r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute6r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute7r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute8r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute9r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
controller2v-prod   : ok=288  changed=133  unreachable=0    failed=0    skipped=1026 rescued=0    ignored=0
swift2v-prod        : ok=261  changed=108  unreachable=0    failed=0    skipped=1047 rescued=0    ignored=0
compute10r1-prod     : ok=289  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute11r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute12r1-prod     : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute1r1-prod      : ok=295  changed=119  unreachable=0    failed=0    skipped=1008 rescued=0    ignored=0
compute1r2-prod      : ok=296  changed=120  unreachable=0    failed=0    skipped=1007 rescued=0    ignored=0
compute2r1-prod      : ok=295  changed=119  unreachable=0    failed=0    skipped=1008 rescued=0    ignored=0
compute3r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute4r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute5r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute6r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute7r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute8r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
compute9r1-prod      : ok=286  changed=114  unreachable=0    failed=0    skipped=1017 rescued=0    ignored=0
controller1v-prod    : ok=288  changed=133  unreachable=0    failed=0    skipped=1026 rescued=0    ignored=0
swift1v-prod         : ok=261  changed=108  unreachable=0    failed=0    skipped=1047 rescued=0    ignored=0
undercloud                 : ok=106  changed=30   unreachable=0    failed=0    skipped=8    rescued=0    ignored=0

Monday 14 December 2020  21:10:30 +0200 (0:00:00.067)       3:14:01.716 *******
===============================================================================
Wait for containers to start for step 4 using paunch --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 529.03s
Wait for puppet host configuration to finish ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 257.09s
Render all_nodes data as group_vars for overcloud ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 128.87s
Configure octavia on overcloud ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 111.31s
tripleo-hosts-entries : Render out the hosts entries ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 50.66s
tripleo-hieradata : Render hieradata from template ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 50.56s
Wait for puppet host configuration to finish ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 37.02s
Wait for puppet host configuration to finish ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 36.97s
Wait for puppet host configuration to finish ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 36.95s
Wait for puppet host configuration to finish ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 36.60s
Wait for container-puppet tasks (generate config) to finish --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 30.88s
Wait for container-puppet tasks (bootstrap tasks) for step 5 to finish ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 22.85s
Gathering Facts ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 17.22s
Gathering Facts ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 17.17s
install needed packages --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 15.50s
install needed packages --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 15.39s
Sync cached facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 15.01s
Sync cached facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.90s
Sync cached facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.86s
Sync cached facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.82s

real	194m3.649s
user	166m43.413s
sys	39m20.972s

Version-Release number of selected component (if applicable):


How reproducible:
easily reproducible

Steps to Reproduce:
1. Create multiple instances on a compute node.
2. If possible, add multiple NICs to each instance.
3. Run the ansible-playbook command from the undercloud.

Actual results:
ansible-playbook takes a significant amount of time to collect the facts

Expected results:
ansible-playbook should not take much time for collecting the facts

Additional info:

Comment 2 Alex Schultz 2021-01-05 15:07:24 UTC
This is likely an issue with Ansible itself if fact gathering is taking too long. It's likely not something we can work around in tripleo-ansible. If possible, can we have the full deployment output log so we can see if there are other issues at play? We have a BZ for 16.2 to try to improve things: https://bugzilla.redhat.com/show_bug.cgi?id=1897890

Comment 4 Oskari Lemmela 2021-01-11 12:07:05 UTC
BZ1897890 might help quite a lot, because we have 10 roles defined. Actually, the issue is not fact gathering itself: all the facts also need to be parsed when hosts are skipped. If multiple hosts have 5 MB or more of facts, skipping gets a lot slower. If each skip takes 10 s, then 1000 skipped tasks * 10 s = 10,000 s, or about 2.8 h, just for skipping tasks.
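
The estimate in this comment can be checked with a short sketch. The inputs (1000 skipped tasks, 10 s per skip) are the illustrative figures from the comment, not measurements, and the function name is hypothetical.

```python
def skip_overhead_hours(skipped_tasks: int, seconds_per_skip: float) -> float:
    """Return the wall-clock hours spent purely on skipping tasks."""
    return skipped_tasks * seconds_per_skip / 3600.0


# Illustrative values taken from the comment, not measured data.
hours = skip_overhead_hours(skipped_tasks=1000, seconds_per_skip=10.0)
print(f"{hours:.2f} h spent skipping")  # prints "2.78 h spent skipping"
```

Each host in the PLAY RECAP above reports roughly 1000 skipped tasks, so even a few seconds per skip would account for a large share of the 3h14m run.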

Comment 5 Alex Schultz 2021-01-11 13:57:02 UTC
Skipping tasks should not take 10s; however, we've seen that happen if you end up using a default ansible.cfg instead of the one we generate. If you can provide exactly what was run and whether there was an ansible.cfg in the current working directory, we can provide recommendations.

Comment 11 Alex Schultz 2021-02-15 17:59:50 UTC
So to provide a bit of an update, I am currently looking into how fact gathering actually affects task execution. I have found that on compute systems, the network fact gathering alone can dramatically increase the overall fact size due to all the tap interfaces created with the instances. This also has a side effect of increasing overall Ansible memory utilization, which can directly impact the speed at which tasks can be executed. This is likely an issue with Ansible itself, and we're investigating whether there's anything we can do from an OSP standpoint to reduce the impact. Currently I believe we rely on some of the network and hardware facts as part of the deployment, so we may not be able to just turn them off.

We might be able to reduce our reliance on them and improve things at a future time. In the meantime, if a customer wishes to write their own playbooks and execute them against these hosts, it's recommended to disable fact gathering if possible or reduce its scope. You can specify "gather_subset = !virtual,!ohai,!facter,!network,all" in ansible.cfg to reduce the amount of information collected, which should improve the overall execution time. I don't think this works with our deployment playbooks but I will be investigating this.
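
As a rough illustration of the setting suggested above, the sketch below renders a minimal ansible.cfg for a custom playbook. It assumes the option belongs in the `[defaults]` section and uses `ohai`, the subset name Ansible recognizes; the function name is hypothetical, and this is not the ansible.cfg that the deployment tooling generates.

```python
import configparser
import io


def render_ansible_cfg(gather_subset: str) -> str:
    """Render a minimal ansible.cfg that limits which facts are gathered."""
    # interpolation=None keeps the value literal (no % expansion).
    config = configparser.ConfigParser(interpolation=None)
    config["defaults"] = {"gather_subset": gather_subset}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Subset string from the comment above; review it before using it with
# playbooks that depend on network or hardware facts.
print(render_ansible_cfg("!virtual,!ohai,!facter,!network,all"))
```

The output can be saved as ansible.cfg in the directory the custom playbook is run from. Playbooks that read network facts would need those subsets left enabled.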

Comment 12 Alex Schultz 2021-02-18 20:34:41 UTC
I've filed an upstream Ansible issue to try to figure out how we can improve it. It turns out you can reduce the impact by disabling the INJECT_FACTS_AS_VARS setting in Ansible: https://docs.ansible.com/ansible/latest/reference_appendices/config.html#inject-facts-as-vars  There is still a significant performance impact, but it's greatly reduced in comparison to the existing impact. See https://github.com/ansible/ansible/issues/73654 for additional information. If a user is running Ansible manually, please consider setting this value to false (it's true by default).
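
For a manual run, one way to apply this without editing ansible.cfg is the `ANSIBLE_INJECT_FACT_VARS` environment variable, which maps to the INJECT_FACTS_AS_VARS setting. The sketch below only builds the environment and command line; the playbook and inventory file names and both helper names are hypothetical. With injection disabled, tasks have to read facts as `ansible_facts['hostname']` rather than `ansible_hostname`.

```python
import os
from typing import Dict, List, Optional


def manual_run_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with fact injection disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANSIBLE_INJECT_FACT_VARS"] = "False"
    return env


def build_command(playbook: str, inventory: str) -> List[str]:
    """Build an ansible-playbook command line for the given files."""
    return ["ansible-playbook", "-i", inventory, playbook]


env = manual_run_env({"PATH": "/usr/bin"})
# Hypothetical file names, for illustration only.
cmd = build_command("my-playbook.yaml", "inventory.yaml")
print(env["ANSIBLE_INJECT_FACT_VARS"], cmd)
# To actually run it: subprocess.run(cmd, env=env, check=True)
```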

Comment 19 David Rosenfeld 2021-04-16 20:49:23 UTC
Verified using this procedure:

Created a 3 compute, 3 control, 1 ceph deployment

Added interfaces to all three compute nodes using:
for i in $(seq 1 380); do ip tuntap add name dummy_tun$i mode tun; done
for i in $(seq 1 1274); do ip link add name dummy_br$i type bridge; done

Reran the deployment

With RHOS-16.1-RHEL-8-20210323.n.0 re-running deploy took 4635 seconds
With RHOS-16.1-RHEL-8-20210415.n.0 re-running deploy took 1939 seconds

Also did the same procedure with a 1 compute, 1 control, 1 ceph deployment

With RHOS-16.1-RHEL-8-20210323.n.0 re-running deploy took 2398 seconds
With RHOS-16.1-RHEL-8-20210415.n.0 re-running deploy took 1565 seconds

When using a large number of interfaces, a significant reduction in re-deploy time is seen with the fix.
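
To put the reported timings in relative terms, the reduction can be computed as below. The numbers are the four re-deploy times quoted in this comment; the function name and dictionary labels are only illustrative.

```python
def percent_reduction(before_seconds: float, after_seconds: float) -> float:
    """Return how much shorter the second run is, as a percentage of the first."""
    return (before_seconds - after_seconds) / before_seconds * 100.0


# (before fix, after fix) re-deploy times in seconds, from the comment above.
runs = {
    "3 compute / 3 control / 1 ceph": (4635, 1939),
    "1 compute / 1 control / 1 ceph": (2398, 1565),
}

for name, (before, after) in runs.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% less re-deploy time")
# 3 compute / 3 control / 1 ceph: 58.2% less re-deploy time
# 1 compute / 1 control / 1 ceph: 34.7% less re-deploy time
```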

Comment 25 Alex Schultz 2021-05-10 13:08:19 UTC
*** Bug 1956321 has been marked as a duplicate of this bug. ***

Comment 28 Alex Schultz 2021-05-20 19:03:34 UTC
*** Bug 1962589 has been marked as a duplicate of this bug. ***

Comment 36 errata-xmlrpc 2021-05-26 11:43:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenStack Platform 16.1.6 (tripleo-ansible) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2119

