Description of problem:

The overcloud-baremetal-deploy.yaml example with subnet information in section 5 of https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.0/html/director_installation_and_usage/assembly_provisioning-and-deploying-your-overcloud#proc_provisioning-bare-metal-nodes-for-the-overcloud_ironic_provisioning fails on current 17.1 (python3-metalsmith-1.4.4-1.20230517141000.5e7461e.el9ost.noarch.rpm). Our downstream package is missing this patch:

commit 264836d59ac741424c3fad4d47e51073722c848f
Author: Harald Jensås <hjensas>
Date:   Thu Dec 9 15:20:29 2021 +0100

    Allow both 'network' and 'subnet' in NIC

Version-Release number of selected component (if applicable):
17.1
python3-metalsmith-1.4.4-1.20230517141000.5e7461e.el9ost.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Create an overcloud-baremetal-deploy.yaml that specifies both 'network' and 'subnet' for a NIC, as in the documentation example.
2. Run 'openstack overcloud node provision' with that file.
3. Provisioning fails during NIC validation.

Actual results:
The following command fails:

(undercloud) [stack@undercloud ~]$ openstack overcloud node provision --stack central --network-config -o /home/stack/templates/central/deployed_metal.yaml /home/stack/templates/central/overcloud-baremetal-deploy.yaml

The error seen (reformatted for readability):

Deploy attempt failed on node f18-h21-000-r640.tng.rdu2.scalelab.redhat.com (UUID 2fcd3a61-e212-48fc-9c4c-91b832a3ca9e), cleaning up
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/metalsmith/_provisioner.py", line 393, in provision_node
    nics.validate()
  File "/usr/lib/python3.9/site-packages/metalsmith/_nics.py", line 60, in validate
    result.append(('network', self._get_network(nic)))
  File "/usr/lib/python3.9/site-packages/metalsmith/_nics.py", line 136, in _get_network
    raise exceptions.InvalidNIC(
metalsmith.exceptions.InvalidNIC: Unexpected fields for a network: subnet

Expected results:
Provisioning succeeds with both 'network' and 'subnet' specified for a NIC, as documented.

Additional info:
We are testing edge deployments with L3 routing on 17.1. This is a blocker for us. I will try to see if I can manually patch in just this fix.
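For reference, a minimal role entry in the style of the documented example that triggers this validation error could look like the fragment below (role, network, and subnet names are illustrative, not taken from an actual deployment):

```yaml
# Hypothetical minimal overcloud-baremetal-deploy.yaml fragment.
# Names are illustrative only.
- name: Controller
  count: 1
  defaults:
    networks:
      - network: ctlplane
        vif: true
      - network: internal_api          # combining 'network' with
        subnet: internal_api_subnet    # 'subnet' raises InvalidNIC on
                                       # unpatched metalsmith 1.4.4
```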
Hey, we have a job doing this in downstream CI:

- name: Compute1
  count: 2
  hostname_format: 'c1-compute-%index%'
  defaults:
    profile: compute
    network_config:
      template: /home/stack/virt/network/three-nics-vlans/compute.j2
    networks:
      - network: ctlplane
        vif: true
      - network: internal_api
        subnet: internal_api1_subnet
      - network: storage
        subnet: storage1_subnet
      - network: tenant
        subnet: tenant1_subnet
  instances:
    - name: compute-0
      hostname: compute-0
    - name: compute-1
      hostname: compute-1

Which example are you using? Is it the specific nodes example? Can you attach your /home/stack/templates/central/overcloud-baremetal-deploy.yaml file, please?
So this is happening on the "ctlplane" network? In that case, the workaround is simply not to specify the subnet. The correct subnet is selected automatically based on the physical_network bridge mappings in Neutron. The physical_network property must be set on the bare metal ports, but that happens automatically when you introspect the nodes with OSP 17.x director.
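In sketch form, the suggested workaround turns a ctlplane NIC entry from the documented pattern into one without a 'subnet' key (subnet selection then relies on the port's physical_network, as described above):

```yaml
# Workaround sketch: omit 'subnet' on the ctlplane NIC entry.
networks:
  - network: ctlplane
    vif: true            # no 'subnet' key: the subnet is chosen
                         # automatically from the baremetal port's
                         # physical_network / Neutron bridge mappings
```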
Introspection succeeds. I got this example from the document. I am not sure if it is happening for ctlplane. Here is the file I used. Since I am doing this on bare metal, I made a few changes, but this file should be similar:

$ cat templates/central/overcloud-baremetal-deploy.yaml-bak

- name: Controller0
  count: 3
  defaults:
    resource_class: baremetal.control
    profile: control
    network_config:
      default_route_network:
        - External
      template: /home/stack/templates/central/network/leaf0/controller0.j2
    networks:
      - network: ctlplane
        subnet: leaf0
        vif: true
      - network: storage
        subnet: storage_leaf0_subnet
      - network: storage_mgmt
        subnet: storage_mgmt_leaf0_subnet
      - network: internal_api
        subnet: internalapi_leaf0_subnet
      - network: tenant
        subnet: tenant_leaf0_subnet
      - network: external
        subnet: external_leaf0_subnet
  ansible_playbooks:
    - playbook: /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-growvols.yaml
      extra_vars:
        growvols_args: >
          /=10GB
          /tmp=1GB
          /var/log=10GB
          /var/log/audit=1GB
          /home=10GB
          /srv=10GB
          /var=100%
- name: ComputeHCI-r640
  count: 4
  defaults:
    resource_class: baremetal.computel0
    profile: compute
    network_config:
      template: /home/stack/templates/central/network/leaf0/computehci-r640.j2
    networks:
      - network: ctlplane
        subnet: leaf0
        vif: true
      - network: storage
        subnet: storage_leaf0_subnet
      - network: internal_api
        subnet: internalapi_leaf0_subnet
      - network: tenant
        subnet: tenant_leaf0_subnet
  ansible_playbooks:
    - playbook: /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-growvols.yaml
      extra_vars:
        growvols_args: >
          /=10GB
          /tmp=1GB
          /var/log=10GB
          /var/log/audit=1GB
          /home=10GB
          /srv=10GB
          /var=100%

I have set the physical_network property as per the documentation. I have tried your recommendation and it helps me proceed, but PXE boot fails to find the MAC under pxelinux.cfg/<mac>. Is removing the subnet the solution for this DCN deployment? Should we remove this from the documentation example?
The error I am facing now is probably not caused by removing the subnet, as the ports created for the nodes have physical_network 'ctlplane'. Any idea what the issue could be?

http://perf1.lab.bos.redhat.com/jaison/edge-l3/osp-edge-backup2.tar.xz
The PXE boot issue was resolved too.
Confirmed the fix is in place in python3-metalsmith-1.4.4-17.1.20230815101022.5e7461e.el9ost.noarch.rpm from the latest compose RHOS-17.1-RHEL-9-20231122.n.1. This compose ran phases 1, 2, and 3 with no errors in the package.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:0209