Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2224177

Summary: [RFE]overcloud node provision doesn't execute network configuration in parallel
Product: Red Hat OpenStack
Reporter: Keigo Noha <knoha>
Component: python-tripleoclient
Assignee: Keigo Noha <knoha>
Status: CLOSED NOTABUG
QA Contact: David Rosenfeld <drosenfe>
Severity: low
Priority: low
Version: 17.1 (Wallaby)
CC: astupnik, elicohen, hjensas, jkreger, jslagle, mariel, mburns, sbaker
Target Milestone: ---
Keywords: Documentation, Reopened, Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2024-12-05 18:09:16 UTC
Type: Bug
Bug Blocks: 2222869

Description Keigo Noha 2023-07-20 06:57:55 UTC
Description of problem:
The current python-tripleoclient executes OS deployment in parallel according to the concurrency option.
~~~
class ProvisionNode(command.Command):
:
    def take_action(self, parsed_args):
:
        extra_vars = {
            "stack_name": parsed_args.stack,
            "baremetal_deployment": roles,
            "baremetal_deployed_path": output_path,
            "ssh_public_keys": ssh_key,
            "ssh_private_key_file": key,
            "ssh_user_name": parsed_args.overcloud_ssh_user,
            "node_timeout": parsed_args.timeout,
            "concurrency": parsed_args.concurrency,
            "manage_network_ports": True,
            "configure_networking": parsed_args.network_config,
            "working_dir": working_dir,
            "templates": parsed_args.templates,
            "overwrite": overwrite,
        }

        with oooutils.TempDirs() as tmp:
            oooutils.run_ansible_playbook(
                playbook='cli-overcloud-node-provision.yaml',
                inventory='localhost,',
                workdir=tmp,
                playbook_dir=constants.ANSIBLE_TRIPLEO_PLAYBOOKS,
                verbosity=oooutils.playbook_verbosity(self=self),
                extra_vars=extra_vars,
            )
~~~

However, the latter code, which performs the network configuration, is run per role.
~~~
        oooutils.run_role_playbooks(self, working_dir, roles_file_dir,
                                    roles, parsed_args.network_config)
~~~

A spine-leaf environment will have many custom roles, one per leaf.
With this implementation, the execution time increases directly with the number of roles.

To reduce node provisioning time, can we run this process in parallel, with a limit on the number of concurrent executions to prevent resource starvation?
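As a rough illustration of the requested behavior, per-role playbook runs could be dispatched through a bounded worker pool. This is a minimal sketch using Python's standard library only; `run_one_role` and the shape of the role dicts are stand-ins for the tripleoclient internals, not the actual API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_role_playbooks_parallel(roles, run_one_role, max_workers=4):
    """Run each role's playbooks concurrently, capping the number of
    simultaneous runs to avoid resource starvation on the undercloud.

    `run_one_role` is a hypothetical callable that runs all playbooks
    for a single role dict; it is not part of tripleoclient.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Skip roles with count == 0, mirroring the existing loop.
        futures = {pool.submit(run_one_role, role): role['name']
                   for role in roles if role.get('count', 1) != 0}
        for fut in as_completed(futures):
            fut.result()  # re-raise any playbook failure
```

The `max_workers` cap plays the same role as the existing `concurrency` option: roles beyond the cap simply queue until a worker frees up.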

Comment 2 Harald Jensås 2023-08-15 09:18:21 UTC
I think this bug is invalid?
The 'cli-overcloud-node-network-config.yaml' playbook runs outside the role loop without any limit on which roles it executes on, so Ansible will run it in parallel on all nodes in all roles.

However, the growvols play and any extra playbooks added by the operator do run per-role, with per-role "extra_vars".


https://opendev.org/openstack/python-tripleoclient/src/branch/stable/wallaby/tripleoclient/utils.py#L2837-L2872

2837 def run_role_playbooks(self, working_dir, roles_file_dir, roles,
2838                        network_config=True):
2839     inventory_file = os.path.join(working_dir,
2840                                   'tripleo-ansible-inventory.yaml')
2841     with open(inventory_file, 'r') as f:
2842         inventory = yaml.safe_load(f.read())
2843         
2844     growvols_play = 'cli-overcloud-node-growvols.yaml'
2845     growvols_path = rel_or_abs_path_role_playbook(
2846         constants.ANSIBLE_TRIPLEO_PLAYBOOKS, growvols_play)
2847         
2848     # Pre-Network Config
2849     for role in roles:
2850         if role.get('count', 1) == 0:
2851             continue
2852         
2853         role_playbooks = []
2854                     
2855         for x in role.get('ansible_playbooks', []):
2856             role_playbooks.append(x['playbook'])
2857         
2858             run_role_playbook(self, inventory, roles_file_dir, x['playbook'],
2859                               limit_hosts=role['name'],
2860                               extra_vars=x.get('extra_vars', {}))
2861                 
2862         if growvols_path not in role_playbooks:
2863             # growvols was not run with custom extra_vars, run it with defaults
2864             run_role_playbook(self, inventory,
2865                               constants.ANSIBLE_TRIPLEO_PLAYBOOKS,
2866                               growvols_play,
2867                               limit_hosts=role['name'])
2868 
2869     if network_config:
2870         # Network Config
2871         run_role_playbook(self, inventory, constants.ANSIBLE_TRIPLEO_PLAYBOOKS,
2872                           'cli-overcloud-node-network-config.yaml')

Comment 3 Steve Baker 2023-08-21 19:47:32 UTC
Closing for now because this kind of optimisation is probably too much effort for a 17.1 z stream change. If you can provide some specific timings for how slow this is in the spine and leaf case that would give us more perspective. But since it only runs on initial deployment it may still not be worth it to optimise.

Comment 4 Keigo Noha 2023-08-24 05:37:18 UTC
Hi Steve,

I'd like to introduce the scenario about this issue.

In a spine-leaf environment, a user needs to create multiple roles, one for each leaf.
Also, some users want to run their own Ansible playbooks at node provisioning time.
In this scenario, the increased number of roles directly affects the execution time of node provisioning.

Additionally, in a large-scale environment, a user doesn't add all nodes at the initial deployment.
In the usual use case, the user adds nodes on demand as the workload on each leaf grows.
In conclusion, adding nodes is a routine operation after the deployment has been opened to its users.

Would you please reconsider the work for this bugzilla?

Best regards,
Keigo Noha

Comment 5 Steve Baker 2023-09-18 20:17:18 UTC
As requested in comment #3 we can't evaluate this without having specific time measurements for how much the deployment slows as the number of roles increases.

Also we'd like to reiterate that the effort required to make this enhancement may be out of proportion with the benefit of parallelising this operation.

Comment 6 Steve Baker 2023-10-02 19:36:10 UTC
We don't have enough information to fully evaluate this currently, closing out for now.

Comment 7 Keigo Noha 2023-10-05 05:08:58 UTC
Hi Steve,

I was out of office for the last two weeks of September.
I looked at the customer's overcloud-baremetal-deployed.yaml.
In that file, they assign /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-kernelargs.yaml to set kernel args for DPDK.

That playbook triggers a node reboot. This means that increasing the number of roles multiplies the time spent on node reboots.
For example, when a node reboot takes 10 minutes, 3 roles consume 30 minutes, 9 roles consume 90 minutes, and so on.

So concurrency is the key to reducing the time spent on node reboots.
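A back-of-the-envelope model makes this concrete: with serial per-role execution the reboot cost is roles × reboot_time, while rebooting c roles' worth of nodes at a time reduces it to ceil(roles / c) × reboot_time. This sketch only models reboot wall-clock time and ignores all other provisioning overhead:

```python
import math

def reboot_minutes(num_roles, reboot_min=10, concurrency=1):
    """Total reboot wall-clock time when `concurrency` roles reboot at once.

    The 10-minute default comes from the example above; both values
    are illustrative, not measured.
    """
    return math.ceil(num_roles / concurrency) * reboot_min
```

With concurrency=1 this reproduces the numbers above (3 roles → 30 minutes, 9 roles → 90 minutes); with concurrency=3, the 9-role case drops back to 30 minutes.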

Best regards,
Keigo Noha

Comment 8 Julia Kreger 2023-10-09 19:56:58 UTC
Greetings Keigo,

Unfortunately, what you're seeking is feature-level work, with, at present, an indeterminable amount of effort required to achieve it for a shipped product.

The engineering team's priority is on future versions of OSP at this time. At this point, you will need to convert this to an RFE and make the case with Product Management to convince them that we should spend the time on this very specific performance enhancement.

For now, I'm marking this item as Closed/Won't-Fix. If you have any questions, please feel free to reach out, but I recommend discussing with Gil Rosenberg first.

Thanks,

-Julia

Comment 9 Keigo Noha 2023-11-09 06:34:21 UTC
Hi Julia,

My customer tested a 90-node deployment with 30 roles.
They found that it took 10 hours, and most of that time, 6.5 hours, was consumed by overcloud node provision.
They strongly request that we improve the current implementation, because it is not usable in large-scale deployments.

Could you please reconsider this topic for OSP17.1?

Best regards,
Keigo Noha

Comment 10 Julia Kreger 2023-11-09 14:26:52 UTC
Greetings Keigo,

I will present it to the team for consideration, but I can make no guarantees.

Our next Triage session is on Monday.

-Julia

Comment 14 Steve Baker 2023-12-11 20:56:58 UTC
This approach looks fine; we'll tag in the documentation team to decide how it might be appropriate to document this.

Comment 18 Steve Baker 2024-03-25 19:42:41 UTC
Assigning to Keigo to write this KB article, and setting a NEEDINFO on myself to find the instructions on how to write a KB article.