Bug 2224177
| Summary: | [RFE]overcloud node provision doesn't execute network configuration in parallel | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Keigo Noha <knoha> |
| Component: | python-tripleoclient | Assignee: | Keigo Noha <knoha> |
| Status: | CLOSED NOTABUG | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 17.1 (Wallaby) | CC: | astupnik, elicohen, hjensas, jkreger, jslagle, mariel, mburns, sbaker |
| Target Milestone: | --- | Keywords: | Documentation, Reopened, Triaged |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-12-05 18:09:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2222869 | ||
I think this bug is invalid? The 'cli-overcloud-node-network-config.yaml' playbook runs outside the role loop without any limit on which roles it executes on, so Ansible will run it in parallel on all nodes in all roles. However, the growvols play and any extra playbooks added by the operator do run per role, with per-role "extra_vars".

https://opendev.org/openstack/python-tripleoclient/src/branch/stable/wallaby/tripleoclient/utils.py#L2837-L2872

~~~
def run_role_playbooks(self, working_dir, roles_file_dir, roles,
                       network_config=True):
    inventory_file = os.path.join(working_dir,
                                  'tripleo-ansible-inventory.yaml')
    with open(inventory_file, 'r') as f:
        inventory = yaml.safe_load(f.read())

    growvols_play = 'cli-overcloud-node-growvols.yaml'
    growvols_path = rel_or_abs_path_role_playbook(
        constants.ANSIBLE_TRIPLEO_PLAYBOOKS, growvols_play)

    # Pre-Network Config
    for role in roles:
        if role.get('count', 1) == 0:
            continue

        role_playbooks = []

        for x in role.get('ansible_playbooks', []):
            role_playbooks.append(x['playbook'])

            run_role_playbook(self, inventory, roles_file_dir, x['playbook'],
                              limit_hosts=role['name'],
                              extra_vars=x.get('extra_vars', {}))

        if growvols_path not in role_playbooks:
            # growvols was not run with custom extra_vars, run it with defaults
            run_role_playbook(self, inventory,
                              constants.ANSIBLE_TRIPLEO_PLAYBOOKS,
                              growvols_play,
                              limit_hosts=role['name'])

    if network_config:
        # Network Config
        run_role_playbook(self, inventory, constants.ANSIBLE_TRIPLEO_PLAYBOOKS,
                          'cli-overcloud-node-network-config.yaml')
~~~

Closing for now because this kind of optimisation is probably too much effort for a 17.1 z-stream change. If you can provide some specific timings for how slow this is in the spine-and-leaf case, that would give us more perspective.
But since it only runs on initial deployment, it may still not be worth optimising.

Hi Steve,

I'd like to describe the scenario behind this issue. In a spine-leaf environment, a user needs to create multiple roles for each leaf. Also, some users want to run their own Ansible playbooks at node provisioning time. In this scenario, the increased number of roles directly affects the execution time of node provisioning.

Additionally, in a large-scale environment, a user does not add all nodes at the initial deployment. In the usual use case, the user adds nodes according to the workload demand on each leaf. In conclusion, adding nodes is a routine operation after the deployment has been opened to its users.

Would you please reconsider the work for this Bugzilla?

Best regards,
Keigo Noha

As requested in comment #3, we can't evaluate this without specific time measurements for how much the deployment slows as the number of roles increases. Also, we'd like to reiterate that the effort required to make this enhancement may be out of proportion to the benefit of parallelising this operation. We don't have enough information to fully evaluate this currently, so closing out for now.

Hi Steve,

I was out of office for the last 2 weeks of September. I looked at the customer's overcloud-baremetal-deployed.yaml. In the file, they assigned /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-kernelargs.yaml to set kernel args for DPDK. The playbook triggers a node reboot. That means increasing the number of roles multiplies the time consumed by node reboots. For example, when a node reboot takes 10 minutes, 3 roles consume 30 minutes, 9 roles consume 90 minutes, and so on. So concurrency is the key to reducing the time consumed by node reboots.

Best regards,
Keigo Noha

Greetings Keigo,

Unfortunately, what you're seeking is a feature level of work, with at present an indeterminable amount of effort to achieve for a shipped product.
The engineering team's priority is on future versions of OSP at this time. At this point, you will need to convert this to an RFE and make the case with Product Management to convince them that we should spend the time on this very specific performance enhancement. At this point, I'm marking this item as Closed/Wont-Fix. If you have any questions, please feel free to reach out, but I recommend discussing with Gil Rosenberg first.

Thanks,
-Julia

Hi Julia,

My customer ran a test deployment of 90 nodes with 30 roles. They found that it took 10 hours, and most of that time, 6.5 hours, was spent in overcloud node provision. They strongly request that we improve the current implementation because it is not usable in large-scale deployments.

Could you please reconsider this topic for OSP17.1?

Best regards,
Keigo Noha

Greetings Keigo,

I will present it to the team for consideration, but I can make no guarantees. Our next Triage session is on Monday.

-Julia

This approach looks fine; we'll tag in documentation to decide how it might be appropriate to document this. Assigning to Keigo to write this KB article, and setting a NEEDINFO for myself to find the instructions on how to write a KB article.
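
The reboot arithmetic raised earlier in this thread can be written down as a rough model. The sketch below assumes every role costs the same fixed time and that roles are processed in waves of at most `parallel` roles; both are simplifying assumptions, and `estimate_minutes` is a hypothetical helper, not part of tripleoclient.

```python
import math


def estimate_minutes(num_roles, minutes_per_role, parallel=1):
    """Rough wall-clock estimate when roles run in waves of `parallel` roles.

    Hypothetical model: assumes a uniform per-role cost, which real
    deployments will not match exactly.
    """
    if parallel < 1:
        raise ValueError('parallel must be at least 1')
    waves = math.ceil(num_roles / parallel)
    return waves * minutes_per_role


# Serial, per-role execution as described in the thread
print(estimate_minutes(3, 10))               # 30
print(estimate_minutes(9, 10))               # 90
# Hypothetical case: three roles processed at a time
print(estimate_minutes(9, 10, parallel=3))   # 30
```

Under this model the serial cost grows linearly with the number of roles, which matches the 30-minute and 90-minute figures quoted above.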
Description of problem:

Current python-tripleoclient executes OS deployment in parallel according to the concurrency option.

~~~
class ProvisionNode(command.Command):
    :
    def take_action(self, parsed_args):
        :
        extra_vars = {
            "stack_name": parsed_args.stack,
            "baremetal_deployment": roles,
            "baremetal_deployed_path": output_path,
            "ssh_public_keys": ssh_key,
            "ssh_private_key_file": key,
            "ssh_user_name": parsed_args.overcloud_ssh_user,
            "node_timeout": parsed_args.timeout,
            "concurrency": parsed_args.concurrency,
            "manage_network_ports": True,
            "configure_networking": parsed_args.network_config,
            "working_dir": working_dir,
            "templates": parsed_args.templates,
            "overwrite": overwrite,
        }

        with oooutils.TempDirs() as tmp:
            oooutils.run_ansible_playbook(
                playbook='cli-overcloud-node-provision.yaml',
                inventory='localhost,',
                workdir=tmp,
                playbook_dir=constants.ANSIBLE_TRIPLEO_PLAYBOOKS,
                verbosity=oooutils.playbook_verbosity(self=self),
                extra_vars=extra_vars,
            )
~~~

However, the later code, which configures networking, is run per role.

~~~
        oooutils.run_role_playbooks(self, working_dir, roles_file_dir,
                                    roles, parsed_args.network_config)
~~~

A spine-leaf environment will have many custom roles for the leaves. With this implementation, the number of execution cycles grows directly with the number of roles. To reduce node provisioning time, can we run this process in parallel, limiting the number of parallel executions at a time to prevent resource starvation?
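
As an illustration of the request above, per-role work could be pushed through a bounded worker pool so that at most a fixed number of roles are processed at once. This is only a sketch under stated assumptions, not the tripleoclient implementation: `run_roles_bounded`, `run_one_role`, and `max_parallel` are hypothetical names, and `run_one_role` stands in for whatever runs one role's playbooks.

```python
from concurrent.futures import ThreadPoolExecutor


def run_roles_bounded(roles, run_one_role, max_parallel=4):
    """Call run_one_role(role) for each active role, max_parallel at a time.

    Hypothetical helper: run_one_role is a caller-supplied callable, not a
    tripleoclient API. Roles with count == 0 are skipped, mirroring the
    check in run_role_playbooks.
    """
    active = [r for r in roles if r.get('count', 1) != 0]
    results = {}
    # The pool size caps how many roles are processed concurrently.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {r['name']: pool.submit(run_one_role, r) for r in active}
        for name, future in futures.items():
            # result() re-raises any exception raised in the worker.
            results[name] = future.result()
    return results


if __name__ == '__main__':
    # Made-up role names, for demonstration only.
    demo_roles = [
        {'name': 'ComputeLeaf1', 'count': 2},
        {'name': 'ComputeLeaf2', 'count': 0},
        {'name': 'ComputeLeaf3'},
    ]
    print(run_roles_bounded(demo_roles, lambda r: 'done', max_parallel=2))
```

The pool size plays the role of the limit asked for in the description. Whether concurrent playbook runs can safely share the same working directory and inventory is not addressed by this sketch.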