Bug 1897890 - OSP16.1 config-download does not scale with increasing number of roles
Summary: OSP16.1 config-download does not scale with increasing number of roles
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Alex Schultz
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-15 11:09 UTC by Uemit Seren
Modified: 2024-10-01 17:06 UTC (History)
9 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-2.20201217112250.a0330d2.el8ost
Doc Type: Enhancement
Doc Text:
This enhancement improves the efficiency, performance, and execution time of deployment and update tasks for environments with a large number of roles. The logging output of the deployment process has been improved to include task IDs for better tracking of specific task executions, which can occur at different times. You can use the task IDs to correlate timing and execution when you troubleshoot executions.
Clone Of:
Environment:
Last Closed: 2021-09-15 07:09:58 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 742263 0 None MERGED Switch deploy steps to tripleo_free 2021-02-09 11:01:10 UTC
OpenStack gerrit 742264 0 None MERGED Use tripleo linear when not using tripleo free 2021-02-09 11:01:10 UTC
OpenStack gerrit 763242 0 None MERGED [TRAIN-Only] Remove __init__ from base strategy 2021-02-09 11:01:10 UTC
Red Hat Bugzilla 1746537 0 medium CLOSED Deployment with config-download too slow when compared to non config-download deployment 2023-10-06 18:35:43 UTC
Red Hat Issue Tracker OSP-3036 0 None None None 2024-10-01 17:06:58 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:10:28 UTC

Description Uemit Seren 2020-11-15 11:09:59 UTC
Description of problem:

The config-download part (Ansible) of OSP16.1 does not seem to scale well with the number of roles.

In our production system (still OSP13) we have 8 composable roles.

On our dev environment (OSP16.1) we use 2 of them, but we still specify a roles_data.yaml that contains all 8 composable roles.

OSP13, using the os-collect-config method, is considerably faster than OSP16.1 in config-download mode.
Here are the timings for a stack update (no changes) for 3 controllers + 4 hypervisors:

- OSP13 (os-collect-config): 31m
- OSP16.1 (config-download + 8 roles): 64m (heat + ansible), 51m (ansible)
- OSP16.1 (config-download + 2 roles): 44m (heat + ansible), 32m (ansible)

The main issue seems to be that the Ansible deployment playbook is executed on all nodes, while the role-specific steps are only included for the corresponding roles using when conditions.
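The pattern described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual rendered deploy-steps playbook; the host patterns, file names, and the tripleo_role_name variable are illustrative assumptions:

```yaml
# Hypothetical sketch of the reported pattern: a single play targets
# every overcloud node, and each role's task file is gated behind a
# `when` condition. Every host still evaluates (and skips) the task
# includes for every other role, so the per-host overhead grows with
# the number of roles defined in roles_data.yaml.
- hosts: overcloud
  tasks:
    - include_tasks: Controller/deploy_steps_tasks.yaml
      when: tripleo_role_name == 'Controller'
    - include_tasks: Compute/deploy_steps_tasks.yaml
      when: tripleo_role_name == 'Compute'
    # ...one gated include per role in roles_data.yaml
```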

There was a similar bug (https://bugzilla.redhat.com/show_bug.cgi?id=1746537#c2) regarding the speed of config-download in OSP13. It was traced to this issue, apparently fixed (https://review.opendev.org/#/c/679149/), and backported to OSP13; however, the fix is not included in OSP16.1. I verified this by checking the corresponding /usr/share/openstack-tripleo-heat-templates/common/deploy-steps.j2 file in OSP16.1.

To further speed up config-download, it would make sense to have separate plays not only for the host_prep_tasks tasks but for all other steps as well.

How reproducible:

always


Steps to Reproduce:
1. Deploy initial overcloud
2. Stack update with 8 composable roles of which only 1 is used and measure time
3. Stack update with 1 composable role that is used and measure time

Actual results:

Stack update with 8 composable roles takes longer than with 1

Expected results:

Stack update with 8 composable roles should take the same amount of time as with 1


Additional info:

Comment 1 Alex Schultz 2020-11-17 16:22:15 UTC
We're aware of why this happens but the fix likely won't be available until 16.2.

Comment 2 Uemit Seren 2020-11-17 17:00:25 UTC
Hi Alex, 
thanks for the information.

Is there an ETA for OSP 16.2, and will it be based on upstream Train or a newer release?
I couldn't find any information regarding the OSP roadmap beyond OSP 16.1.

Comment 3 Alex Schultz 2020-11-17 17:26:46 UTC
From a tripleo standpoint, it'll be based on what's available in stable/train on a newer version of RHEL8. The issue here is the execution strategy of the deployment within Ansible, which we have addressed in Ussuri onward, so we'll need to backport it to Train and do additional testing.

Comment 4 Uemit Seren 2020-11-17 18:43:03 UTC
I saw your blog post https://www.redhat.com/en/blog/faster-deployments-red-hat-openstack-platform-deployment-ansible-strategy-plugins 

Wouldn't the tripleo_free strategy still benefit from separate plays for each role instead of skipping hosts via a when clause? Or was this benchmarked/profiled and there is no significant runtime decrease when combining the tripleo_free strategy with separate plays?

Comment 5 Alex Schultz 2020-11-17 18:56:55 UTC
No, it would not, because plays cannot be executed in parallel whereas tasks can. If you split apart the plays, you have to parallelize the Ansible execution itself, which is more complicated. With the tasks within a specific play running in parallel and not limited in their execution order, we get something closer to what was invoked under Heat, because it's similar to the previous Deployment phases. There is a specific alignment of the execution that has to occur across an entire cloud, which is where the plays come in. What we have right now, prior to the use of tripleo_free, is that we're running deployment tasks for roles in serial, so while role 1 is executing, all the other roles are idle. The tripleo_free switch allows the play to continue on the non-targeted roles to the tasks they need. The issue with trying to backport this is that it requires big UX changes to allow end users to track what is going on, which is why we're targeting 16.2 instead of making such a giant shift in 16.1.4 or a later version.

Comment 6 Alex Schultz 2020-11-17 19:08:27 UTC
Specifically, the more roles you have, the more overhead time the following sections add as currently written:

https://github.com/openstack/tripleo-heat-templates/blob/0fdaaf51ea9c97d89781e691ffcf2666fdde8ab5/common/deploy-steps.j2#L586-L597
https://github.com/openstack/tripleo-heat-templates/blob/0fdaaf51ea9c97d89781e691ffcf2666fdde8ab5/common/deploy-steps.j2#L672-L676


When tripleo_free is used, all the nodes proceed to their tasks for execution as they hit this code, rather than having to wait for all the previous roles to finish before they start executing. All nodes then stop at the end of the play until it is finished, and the next play starts in order. This is similar to how the Heat deployment execution process used to occur, where all nodes would run their "step 1" tasks at the same time regardless of role, but the whole process waited until "step 1" was fully complete before moving on to the next step.
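The behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual rendered playbook; the host pattern and task file name are assumptions:

```yaml
# Hypothetical sketch: with the tripleo_free strategy each host runs
# through the play's tasks as fast as it can, skipping tasks gated to
# other roles without waiting for other hosts. The play boundary still
# acts as a cloud-wide synchronization point, much like the per-step
# phases of the old Heat-driven deployment.
- hosts: overcloud
  strategy: tripleo_free
  tasks:
    - include_tasks: deploy_steps_tasks_step1.yaml
# Hosts that finish step 1 early simply wait here; the next play
# (step 2) does not start until every host has completed this one.
```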

Comment 7 Uemit Seren 2020-11-17 19:52:26 UTC
Ah I see.  Thanks for the detailed explanation.

Comment 11 David Rosenfeld 2021-07-23 14:44:21 UTC
Deployed job:  DFG-df-deployment-16.2-virthost-3cont_1comp_3ceph_3db_2net_3msg-yes_UC_SSL-no_OC_SSL-ceph-ipv4-geneve-RHELOSP-31889 which has six composable roles. 

Did a stack update and recorded: Elapsed Time: 0:24:47.198112

Appended six unused roles (CountDefault: 0) to the end of roles_data.yaml

Did stack update and recorded:  Elapsed Time: 0:24:55.587699

No increase in time was seen with the unused composable roles.
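For reference, an appended unused role in the verification above would look roughly like this roles_data.yaml fragment. The role name is hypothetical; CountDefault and ServicesDefault are standard roles_data keys:

```yaml
# Hypothetical roles_data.yaml fragment: a role defined with
# CountDefault: 0 deploys no nodes, so with the fix a stack update
# should not get measurably slower just because the role is defined.
- name: UnusedRole1
  CountDefault: 0
  ServicesDefault: []
```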

Comment 15 errata-xmlrpc 2021-09-15 07:09:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

