Bug 1911891

Summary: deployment takes a long time when being run manually on the undercloud with ansible-playbook

| | | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Dhruv Shah <dhruv> |
| Component: | tripleo-ansible | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | Joe H. Rahme <jhakimra> |
| Severity: | high | Priority: | high |
| Version: | 16.1 (Train) | Keywords: | Triaged |
| Target Milestone: | z6 | Target Release: | 16.1 (Train on RHEL 8.2) |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Last Closed: | 2021-05-26 11:43:47 UTC |
| Bug Depends On: | 1935364, 1935365, 1935366, 1935406, 1949290, 2058027 | Bug Blocks: | 1974985 |
| Clones: | 1974985 (view as bug list) | | |
| CC: | aschultz, astupnik, bdobreli, drosenfe, ebarrera, jmelvin, johfulto, jpretori, kthakre, ljozsa, mschuppe, nchandek, nweinber, oskari.lemmela, pweeks, slinaber, vkommadi, whayutin, yatanaka | | |
| Fixed In Version: | openstack-tripleo-common-11.4.1-1.20210329153522.75bd92a.el8ost, python-tripleoclient-12.3.2-1.20210329153521.ae58329.el8ost, tripleo-ansible-0.5.1-1.20210323173503.902c3c8.el8ost, openstack-tripleo-heat-templates-11.3.2-1.20210324233518.29a02c1.el8ost | | |
Description
Dhruv Shah, 2021-01-01 10:26:57 UTC
This is likely an issue with ansible itself if gathering facts is taking too long. It's likely not something we can work around in tripleo-ansible. If possible, can we have the full deployment output log so we can see if there are other issues at play? We have a BZ for 16.2 to try and improve things: https://bugzilla.redhat.com/show_bug.cgi?id=1897890

BZ1897890 might help quite a lot, because we have 10 roles defined. Actually, the issue is not fact gathering itself: all the facts also need to be parsed when skipping hosts. If multiple hosts have 5 MB or more of facts, skipping gets a lot slower. If skipping a task takes 10 s, then 1000 tasks × 10 s ≈ 2.8 h spent just skipping tasks.

Skipping tasks should not take 10 s; however, we've seen that happen if you end up using a default ansible.cfg instead of the one we generate. If you can tell us exactly what was run and whether there was an ansible.cfg in the current working directory, we can provide recommendations.

To provide a bit of an update: I am currently looking into how fact gathering actually affects task execution. I have found that on compute systems, network fact gathering alone can dramatically increase the overall fact size, due to all the tap interfaces created for the instances. This has the side effect of increasing overall ansible memory utilization, which can directly impact the speed at which tasks are executed. This is likely an issue with ansible itself, and we're investigating whether there's anything we can do from an OSP standpoint to reduce the impact. We currently rely on some of the network and hardware facts as part of the deployment, so we may not be able to simply turn them off, but we might be able to reduce our reliance on them and improve things at a future time.

In the meantime, if a customer wishes to write their own playbooks and execute them against these hosts, it's recommended to disable fact gathering if possible, or to reduce its scope. You can specify "gather_subset = !virtual,!ohai,!facter,!network,all" in ansible.cfg to reduce the amount of information collected, which should improve the overall execution time. I don't think this works with our deployment playbooks, but I will be investigating that. I've filed an upstream ansible issue to try and figure out how we can improve it.

It turns out you can reduce the impact by disabling the INJECT_FACTS_AS_VARS setting in ansible: https://docs.ansible.com/ansible/latest/reference_appendices/config.html#inject-facts-as-vars. There is still a significant performance impact, but it's greatly reduced in comparison to the existing impact. See https://github.com/ansible/ansible/issues/73654 for additional information. If a user is running ansible manually, please consider configuring this value to false (it's true by default).

Verified using this procedure:

Created a 3 compute / 3 controller / 1 ceph deployment, then added interfaces to all three compute nodes using:

```
for i in $(seq 1 380); do ip tuntap add name dummy_tun$i mode tun; done
for i in $(seq 1 1274); do ip link add name dummy_br$i type bridge; done
```

and reran the deployment:

* With RHOS-16.1-RHEL-8-20210323.n.0, re-running the deploy took 4635 seconds.
* With RHOS-16.1-RHEL-8-20210415.n.0, re-running the deploy took 1939 seconds.

The same procedure with a 1 compute / 1 controller / 1 ceph deployment:

* With RHOS-16.1-RHEL-8-20210323.n.0, re-running the deploy took 2398 seconds.
* With RHOS-16.1-RHEL-8-20210415.n.0, re-running the deploy took 1565 seconds.

When using a large number of interfaces, a significant reduction in re-deploy time is seen with the fix.
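To make the two workarounds above concrete, here is a minimal ansible.cfg sketch combining them. The gather_subset value is taken verbatim from the comment above, and inject_facts_as_vars is the documented INJECT_FACTS_AS_VARS option; the exact set of excluded subsets should be adapted to what your playbooks actually consume.

```ini
# Minimal sketch of an ansible.cfg for manually-run playbooks; this is
# not the tripleo-generated configuration, just the two tunings above.
[defaults]
# Skip the expensive fact subsets; network facts in particular balloon
# on compute nodes with many tap interfaces.
gather_subset = !virtual,!ohai,!facter,!network,all
# Do not inject every fact as a top-level variable; facts remain
# reachable under ansible_facts.* and per-task overhead drops.
inject_facts_as_vars = False
```

One caveat: with inject_facts_as_vars disabled, playbooks that reference bare fact variables such as ansible_hostname must switch to ansible_facts['hostname'].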
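For users writing their own playbooks against these hosts, the same idea can also be applied per play rather than globally. A sketch, assuming a hypothetical inventory group named overcloud:

```yaml
# Hypothetical play: turn off implicit fact gathering, then collect
# only the minimal fact subset where facts are actually required.
- hosts: overcloud          # assumed inventory group name
  gather_facts: false       # skip the implicit setup run entirely
  tasks:
    - name: Gather only minimal facts
      ansible.builtin.setup:
        gather_subset:
          - '!all'          # drop every optional subset...
          - 'min'           # ...keeping only the minimal facts
    - name: Use a gathered fact
      ansible.builtin.debug:
        msg: "Running on {{ ansible_facts['hostname'] }}"
```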
*** Bug 1956321 has been marked as a duplicate of this bug. ***

*** Bug 1962589 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenStack Platform 16.1.6 (tripleo-ansible) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2119