Bug 1911891

Summary: deployment takes a long time when being run manually on the undercloud with ansible-playbook

| | | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Dhruv Shah <dhruv> |
| Component: | tripleo-ansible | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | Joe H. Rahme <jhakimra> |
| Severity: | high | Priority: | high |
| Version: | 16.1 (Train) | Keywords: | Triaged |
| Target Milestone: | z6 | Target Release: | 16.1 (Train on RHEL 8.2) |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Last Closed: | 2021-05-26 11:43:47 UTC |
| Bug Depends On: | 1935364, 1935365, 1935366, 1935406, 1949290, 2058027 | Bug Blocks: | 1974985 |
| Clones: | 1974985 (view as bug list) | | |
| CC: | aschultz, astupnik, bdobreli, drosenfe, ebarrera, jmelvin, johfulto, jpretori, kthakre, ljozsa, mschuppe, nchandek, nweinber, oskari.lemmela, pweeks, slinaber, vkommadi, whayutin, yatanaka | | |
| Fixed In Version: | openstack-tripleo-common-11.4.1-1.20210329153522.75bd92a.el8ost, python-tripleoclient-12.3.2-1.20210329153521.ae58329.el8ost, tripleo-ansible-0.5.1-1.20210323173503.902c3c8.el8ost, openstack-tripleo-heat-templates-11.3.2-1.20210324233518.29a02c1.el8ost | | |
Description
Dhruv Shah, 2021-01-01 10:26:57 UTC
This is likely an issue with ansible itself if gathering facts is taking too long. It's likely not something we can work around in tripleo-ansible. If possible, can we have the full deployment output log so we can see if there are other issues at play? We have a BZ for 16.2 to try and improve things: https://bugzilla.redhat.com/show_bug.cgi?id=1897890

BZ1897890 might help quite a lot, because we have 10 roles defined. Actually, the issue is not fact gathering itself: all the facts also need to be parsed when skipping hosts. If multiple hosts have 5 MB or more of facts, skipping gets a lot slower. If skipping a task takes 10 s, then 1000 tasks × 10 s ≈ 2.8 h spent just skipping tasks.

Skipping tasks should not take 10 s; however, we've seen that happen if you end up using a default ansible.cfg instead of the one we generate. If you can tell us exactly what was run and whether there was an ansible.cfg in the current working directory, we can provide recommendations.

To provide a bit of an update: I am currently looking into how fact gathering actually affects task execution. I have found that on compute systems, network fact gathering alone can dramatically increase the overall fact size, due to all the tap interfaces created for the instances. This has the side effect of increasing overall ansible memory utilization, which can directly impact the speed at which tasks are executed. This is likely an issue with ansible itself, and we're investigating whether there's anything we can do from an OSP standpoint to reduce the impact. We currently rely on some of the network and hardware facts as part of the deployment, so we may not be able to simply turn them off, but we might be able to reduce our reliance on them and improve things at a future time.

In the meantime, if a customer wishes to write their own playbooks and execute them against these hosts, it's recommended to disable fact gathering if possible, or to reduce its scope. You can specify "gather_subset = !virtual,!ohai,!facter,!network,all" in ansible.cfg to reduce the amount of information collected, which should improve the overall execution time. I don't think this works with our deployment playbooks, but I will be investigating that. I've filed an upstream ansible issue to try and figure out how we can improve it.

It turns out you can reduce the impact by disabling the INJECT_FACTS_AS_VARS setting in ansible: https://docs.ansible.com/ansible/latest/reference_appendices/config.html#inject-facts-as-vars. There is still a significant performance impact, but it's greatly reduced in comparison to the existing impact. See https://github.com/ansible/ansible/issues/73654 for additional information. If a user is running ansible manually, please consider configuring this value to false (it's true by default).

Verified using this procedure:

Created a 3 compute / 3 controller / 1 ceph deployment, then added interfaces to all three compute nodes using:

```
for i in $(seq 1 380); do ip tuntap add name dummy_tun$i mode tun; done
for i in $(seq 1 1274); do ip link add name dummy_br$i type bridge; done
```

and reran the deployment:

* With RHOS-16.1-RHEL-8-20210323.n.0, re-running the deploy took 4635 seconds.
* With RHOS-16.1-RHEL-8-20210415.n.0, re-running the deploy took 1939 seconds.

The same procedure with a 1 compute / 1 controller / 1 ceph deployment:

* With RHOS-16.1-RHEL-8-20210323.n.0, re-running the deploy took 2398 seconds.
* With RHOS-16.1-RHEL-8-20210415.n.0, re-running the deploy took 1565 seconds.

When using a large number of interfaces, a significant reduction in re-deploy time is seen with the fix.
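To make the two workarounds above concrete, here is a minimal ansible.cfg sketch combining them. The gather_subset value is taken verbatim from the comment above, and inject_facts_as_vars is the documented INJECT_FACTS_AS_VARS option; the exact set of excluded subsets should be adapted to what your playbooks actually consume.

```ini
# Minimal sketch of an ansible.cfg for manually-run playbooks; this is
# not the tripleo-generated configuration, just the two tunings above.
[defaults]
# Skip the expensive fact subsets; network facts in particular balloon
# on compute nodes with many tap interfaces.
gather_subset = !virtual,!ohai,!facter,!network,all
# Do not inject every fact as a top-level variable; facts remain
# reachable under ansible_facts.* and per-task overhead drops.
inject_facts_as_vars = False
```

One caveat: with inject_facts_as_vars disabled, playbooks that reference bare fact variables such as ansible_hostname must switch to ansible_facts['hostname'].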
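For users writing their own playbooks against these hosts, the same idea can also be applied per play rather than globally. A sketch, assuming a hypothetical inventory group named overcloud:

```yaml
# Hypothetical play: turn off implicit fact gathering, then collect
# only the minimal fact subset where facts are actually required.
- hosts: overcloud          # assumed inventory group name
  gather_facts: false       # skip the implicit setup run entirely
  tasks:
    - name: Gather only minimal facts
      ansible.builtin.setup:
        gather_subset:
          - '!all'          # drop every optional subset...
          - 'min'           # ...keeping only the minimal facts
    - name: Use a gathered fact
      ansible.builtin.debug:
        msg: "Running on {{ ansible_facts['hostname'] }}"
```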
*** Bug 1956321 has been marked as a duplicate of this bug. ***

*** Bug 1962589 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenStack Platform 16.1.6 (tripleo-ansible) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2119