Bug 1897890 - OSP16.1 config-download does not scale with increasing number of roles
Summary: OSP16.1 config-download does not scale with increasing number of roles
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Alex Schultz
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-15 11:09 UTC by Uemit Seren
Modified: 2024-10-01 17:06 UTC (History)
9 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-2.20201217112250.a0330d2.el8ost
Doc Type: Enhancement
Doc Text:
This enhancement improves the efficiency, performance, and execution time of deployment and update tasks for environments with a large number of roles. The logging output of the deployment process has been improved to include task IDs for better tracking of specific task executions, which can occur at different times. You can use the task IDs to correlate timing and execution when you troubleshoot executions.
Clone Of:
Environment:
Last Closed: 2021-09-15 07:09:58 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 742263 0 None MERGED Switch deploy steps to tripleo_free 2021-02-09 11:01:10 UTC
OpenStack gerrit 742264 0 None MERGED Use tripleo linear when not using tripleo free 2021-02-09 11:01:10 UTC
OpenStack gerrit 763242 0 None MERGED [TRAIN-Only] Remove __init__ from base strategy 2021-02-09 11:01:10 UTC
Red Hat Bugzilla 1746537 0 medium CLOSED Deployment with config-download too slow when compared to non config-download deployment 2023-10-06 18:35:43 UTC
Red Hat Issue Tracker OSP-3036 0 None None None 2024-10-01 17:06:58 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:10:28 UTC

Description Uemit Seren 2020-11-15 11:09:59 UTC
Description of problem:

The config-download part (Ansible) of OSP16.1 does not seem to scale well with the number of roles.

In our production system (still OSP13) we have 8 composable roles.

On our dev environment (OSP16.1) we use 2 of them, but we still specify a roles_data.yaml that contains all 8 composable roles.

OSP13, using the os-collect-config method, is considerably faster than OSP16.1 in config-download mode.
Here are the timings for a stack update (no changes) for 3 controllers + 4 hypervisors:

- OSP13 (os-collect-config): 31m
- OSP16.1 (config-download + 8 roles): 64m (heat + ansible), 51m (ansible)
- OSP16.1 (config-download + 2 roles): 44m (heat + ansible), 32m (ansible)

The main issue seems to be that the Ansible deployment playbook is executed on all nodes, while the role-specific steps are only included for the corresponding roles using when conditions.
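The pattern described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual rendered deploy-steps playbook; the host patterns, file names, and the tripleo_role_name variable are illustrative assumptions:

```yaml
# Hypothetical sketch of the reported pattern: a single play targets
# every overcloud node, and each role's task file is gated behind a
# `when` condition. Every host still evaluates (and skips) the task
# includes for every other role, so the per-host overhead grows with
# the number of roles defined in roles_data.yaml.
- hosts: overcloud
  tasks:
    - include_tasks: Controller/deploy_steps_tasks.yaml
      when: tripleo_role_name == 'Controller'
    - include_tasks: Compute/deploy_steps_tasks.yaml
      when: tripleo_role_name == 'Compute'
    # ...one gated include per role in roles_data.yaml
```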

There was a similar bug (https://bugzilla.redhat.com/show_bug.cgi?id=1746537#c2) regarding the speed of config-download in OSP13. It was traced to this issue, apparently fixed (https://review.opendev.org/#/c/679149/), and backported to OSP13; however, the fix is not included in OSP16.1. I verified this by checking the corresponding /usr/share/openstack-tripleo-heat-templates/common/deploy-steps.j2 file in OSP16.1.

To further speed up config-download, it would make sense to have separate plays not only for the host_prep_tasks tasks but for all other steps as well.

How reproducible:

always


Steps to Reproduce:
1. Deploy initial overcloud
2. Stack update with 8 composable roles of which only 1 is used and measure time
3. Stack update with 1 composable role that is used and measure time

Actual results:

Stack update with 8 composable roles takes longer than with 1

Expected results:

Stack update with 8 composable roles should take the same amount of time as with 1


Additional info:

Comment 1 Alex Schultz 2020-11-17 16:22:15 UTC
We're aware of why this happens but the fix likely won't be available until 16.2.

Comment 2 Uemit Seren 2020-11-17 17:00:25 UTC
Hi Alex, 
thanks for the information.

Is there an ETA for OSP 16.2, and will it be based on upstream Train or a newer release?
I couldn't find any information regarding the OSP roadmap beyond OSP 16.1.

Comment 3 Alex Schultz 2020-11-17 17:26:46 UTC
From a tripleo standpoint, it'll be based on what's available in stable/train on a newer version of RHEL8. The issue here is the execution strategy of the deployment within Ansible, which we have addressed in Ussuri onward, so we'll need to backport it to Train and do additional testing.

Comment 4 Uemit Seren 2020-11-17 18:43:03 UTC
I saw your blog post https://www.redhat.com/en/blog/faster-deployments-red-hat-openstack-platform-deployment-ansible-strategy-plugins 

Wouldn't the tripleo_free strategy still benefit from separate plays for each role instead of skipping hosts via a when clause? Or was this benchmarked/profiled and there is no significant runtime decrease when combining the tripleo_free strategy with separate plays?

Comment 5 Alex Schultz 2020-11-17 18:56:55 UTC
No, it would not, because plays cannot be executed in parallel whereas tasks can. If you split apart the plays, you have to parallelize the Ansible execution itself, which is more complicated. With the tasks within a specific play running in parallel and not limited in their execution order, we get something closer to what was invoked under Heat, because it's similar to the previous Deployment phases. There is a specific alignment of the execution that has to occur across an entire cloud, which is where the plays come in. What we have right now, prior to the use of tripleo_free, is that we're running deployment tasks for roles in serial, so while role 1 is executing, all the other roles are idle. The tripleo_free switch allows the play to continue on the non-targeted roles to the tasks they need. The issue with trying to backport this is that it requires big UX changes to allow end users to track what is going on, which is why we're targeting 16.2 instead of making such a giant shift in 16.1.4 or a later version.

Comment 6 Alex Schultz 2020-11-17 19:08:27 UTC
Specifically, the more roles you have, the more overhead time the following sections add as currently written:

https://github.com/openstack/tripleo-heat-templates/blob/0fdaaf51ea9c97d89781e691ffcf2666fdde8ab5/common/deploy-steps.j2#L586-L597
https://github.com/openstack/tripleo-heat-templates/blob/0fdaaf51ea9c97d89781e691ffcf2666fdde8ab5/common/deploy-steps.j2#L672-L676


When tripleo_free is used, all the nodes proceed to their tasks for execution as they hit this code, rather than having to wait for all the previous roles to finish before they start executing. All nodes then stop at the end of the play until it is finished, and the next play starts in order. This is similar to how the Heat deployment execution process used to occur, where all nodes would run their "step 1" tasks at the same time regardless of role, but the whole process waited until "step 1" was fully complete before moving on to the next step.
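The behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual rendered playbook; the host pattern and task file name are assumptions:

```yaml
# Hypothetical sketch: with the tripleo_free strategy each host runs
# through the play's tasks as fast as it can, skipping tasks gated to
# other roles without waiting for other hosts. The play boundary still
# acts as a cloud-wide synchronization point, much like the per-step
# phases of the old Heat-driven deployment.
- hosts: overcloud
  strategy: tripleo_free
  tasks:
    - include_tasks: deploy_steps_tasks_step1.yaml
# Hosts that finish step 1 early simply wait here; the next play
# (step 2) does not start until every host has completed this one.
```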

Comment 7 Uemit Seren 2020-11-17 19:52:26 UTC
Ah I see.  Thanks for the detailed explanation.

Comment 11 David Rosenfeld 2021-07-23 14:44:21 UTC
Deployed job:  DFG-df-deployment-16.2-virthost-3cont_1comp_3ceph_3db_2net_3msg-yes_UC_SSL-no_OC_SSL-ceph-ipv4-geneve-RHELOSP-31889 which has six composable roles. 

Did a stack update and recorded: Elapsed Time: 0:24:47.198112

Appended six unused roles (CountDefault: 0) to the end of roles_data.yaml

Did stack update and recorded:  Elapsed Time: 0:24:55.587699

No increase in time was seen with the unused composable roles.
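For reference, an appended unused role in the verification above would look roughly like this roles_data.yaml fragment. The role name is hypothetical; CountDefault and ServicesDefault are standard roles_data keys:

```yaml
# Hypothetical roles_data.yaml fragment: a role defined with
# CountDefault: 0 deploys no nodes, so with the fix a stack update
# should not get measurably slower just because the role is defined.
- name: UnusedRole1
  CountDefault: 0
  ServicesDefault: []
```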

Comment 15 errata-xmlrpc 2021-09-15 07:09:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

