Description of problem:
The config-download part (ansible) of OSP16.1 does not seem to scale well with the number of roles. In our production system (still OSP13) we have 8 composable roles. On our dev environment (OSP16.1) we use 2 of them, but we still specify the roles_data.yaml which contains all 8 composable roles. OSP13 using the os-collect-config method is considerably faster than OSP16.1 in config-download mode.

Here are the timings for a stack update (no changes) for 3 controllers + 4 hypervisors:
- OSP13 (os-collect-config): 31m
- OSP16.1 (config-download + 8 roles): 64m (heat + ansible), 51m (ansible)
- OSP16.1 (config-download + 2 roles): 44m (heat + ansible), 32m (ansible)

The main issue seems to be that the ansible deployment playbook is executed on all nodes, and the role-specific steps are then only included for the corresponding roles using when conditions.
There was a similar bug (https://bugzilla.redhat.com/show_bug.cgi?id=1746537#c2) regarding the speed of config-download in OSP13, which was traced to this issue and apparently fixed (https://review.opendev.org/#/c/679149/) and backported to OSP13; however, the fix is not included in OSP16.1. I checked the corresponding /usr/share/openstack-tripleo-heat-templates/common/deploy-steps.j2 file in OSP16.1.

To further speed up config-download, it would make sense to have separate plays not only for the host_prep_tasks tasks but also for all other steps.

How reproducible:
always

Steps to Reproduce:
1. Deploy initial overcloud
2. Stack update with 8 composable roles of which only 1 is used and measure time
3. Stack update with 1 composable role that is used and measure time

Actual results:
Stack update with 8 composable roles takes longer than with 1

Expected results:
Stack update with 8 composable roles should take the same amount of time as with 1

Additional info:
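A rough reading of the OSP16.1 timings above, as a sketch only. It assumes the extra ansible time grows linearly with the number of roles defined in roles_data.yaml; that linearity is an assumption made for illustration and was not measured in this report.

```python
# Ansible-only timings quoted in the report, in minutes.
# Key: number of roles defined in roles_data.yaml.
ansible_minutes = {8: 51, 2: 32}

# Assumption (not measured): overhead grows linearly with the
# number of defined-but-unused roles.
extra_roles = 8 - 2
extra_minutes = ansible_minutes[8] - ansible_minutes[2]
per_role_overhead = extra_minutes / extra_roles

print(f"{extra_minutes} extra minutes for {extra_roles} unused roles")
print(f"about {per_role_overhead:.1f} minutes per unused role")
```

Under that assumption, each unused role costs roughly three minutes of ansible time per stack update on this environment.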
We're aware of why this happens but the fix likely won't be available until 16.2.
Hi Alex, thanks for the information. Is there an ETA for OSP 16.2, and will it be based on upstream Train or a newer release? I couldn't find any information regarding the OSP roadmap beyond OSP 16.1.
From a tripleo standpoint, it'll be based on what's available in stable/train on a newer version of RHEL8. The issue here is the execution strategy of the deployment as part of ansible, which we have addressed from Ussuri onward, so we'll need to backport it to train and do additional testing.
I saw your blog post:
https://www.redhat.com/en/blog/faster-deployments-red-hat-openstack-platform-deployment-ansible-strategy-plugins

Wouldn't the tripleo_free strategy still benefit from having separate plays for each role instead of skipping the hosts via a when clause? Or was this benchmarked/profiled, and there is no significant runtime decrease when combining the tripleo_free strategy with separate plays?
No, it would not, because plays cannot be executed in parallel, whereas tasks can. If you split apart the plays, you have to parallelize the ansible execution, which is more complicated. With the tasks within a specific play being run in parallel and not limited in their execution order, we get something closer to what was invoked under heat, because it's similar to the previous Deployment phases. Execution has to stay aligned across the entire cloud, and the plays are what provide that alignment.

What we have right now, prior to the usage of tripleo_free, is that we're running deployment tasks for roles in serial, so while role 1 is executing, all the other roles are idle. The tripleo_free switch allows the play to continue on the non-targeted roles so they can move on to the tasks they need.

The issue with trying to backport this is that it requires big UX changes so that end users can track what is going on, which is why we're targeting 16.2 if possible instead of making such a giant shift in 16.1.4 or a later version.
Specifically, the more roles you have, the more overhead the following sections add as currently written:
https://github.com/openstack/tripleo-heat-templates/blob/0fdaaf51ea9c97d89781e691ffcf2666fdde8ab5/common/deploy-steps.j2#L586-L597
https://github.com/openstack/tripleo-heat-templates/blob/0fdaaf51ea9c97d89781e691ffcf2666fdde8ab5/common/deploy-steps.j2#L672-L676

When tripleo_free is used, all the nodes get to their tasks for execution as they hit this code, rather than having to wait for all the previous roles to finish before they start executing. Then all nodes stop proceeding with the deployment until the play is finished, and then they start to execute in order. This is similar to how the heat deployment execution process used to work, where all nodes would run their "step 1" tasks at the same time regardless of roles, but the whole process waited until "step 1" was fully complete before moving on to the next step.
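The difference between the two behaviours described in this comment can be sketched with a toy timing model. This is only an illustration: the role names and per-step durations are invented, the function names are hypothetical, and it does not model ansible or the tripleo_free plugin itself.

```python
from typing import Dict, List

# One dict per deployment step: role name -> minutes that role's tasks take.
Steps = List[Dict[str, float]]


def serial_roles_time(steps: Steps) -> float:
    """Roles run one after another inside each step, so while one role
    executes the others sit idle (the behaviour before tripleo_free)."""
    return sum(sum(step.values()) for step in steps)


def free_with_step_barrier_time(steps: Steps) -> float:
    """Every role starts its own tasks right away, but nobody moves to the
    next step until the slowest role has finished the current one."""
    return sum(max(step.values()) for step in steps)


# Invented durations, for illustration only.
steps: Steps = [
    {"Controller": 5.0, "Compute": 3.0, "UnusedRole": 0.5},
    {"Controller": 4.0, "Compute": 2.0, "UnusedRole": 0.5},
]

print("serial roles:", serial_roles_time(steps))
print("free + step barrier:", free_with_step_barrier_time(steps))
```

In this model, each extra role adds its own time to every step in the serial case, while in the barrier case it only matters if it is the slowest role in that step.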
Ah I see. Thanks for the detailed explanation.
Deployed job: DFG-df-deployment-16.2-virthost-3cont_1comp_3ceph_3db_2net_3msg-yes_UC_SSL-no_OC_SSL-ceph-ipv4-geneve-RHELOSP-31889
which has six composable roles.

Did a stack update and recorded:
Elapsed Time: 0:24:47.198112

Appended six unused roles (CountDefault: 0) to end of roles_data.yaml.
Did stack update and recorded:
Elapsed Time: 0:24:55.587699

Increased time was not seen with unused composable roles.
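For reference, the gap between the two recorded runs can be worked out as below. This is a small sketch using only the two Elapsed Time values quoted in this comment; the helper name parse_elapsed is hypothetical.

```python
from datetime import timedelta


def parse_elapsed(text: str) -> timedelta:
    """Parse an 'H:MM:SS.ffffff' elapsed-time string into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


baseline = parse_elapsed("0:24:47.198112")     # six composable roles
with_unused = parse_elapsed("0:24:55.587699")  # plus six unused roles

delta = with_unused - baseline
percent = 100 * delta.total_seconds() / baseline.total_seconds()

print(f"difference: {delta.total_seconds():.3f} s ({percent:.2f}%)")
```

The difference is a few seconds on a run of roughly 25 minutes, well under one percent.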
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483