Bug 2219641 - metalsmith does not recognize subnet field under network in overcloud-baremetal-deploy.yaml as per documentation
Summary: metalsmith does not recognize subnet field under network in overcloud-baremet...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-metalsmith
Version: 17.1 (Wallaby)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 17.1
Assignee: Harald Jensås
QA Contact: James E. LaBarre
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-04 17:06 UTC by Jaison Raju
Modified: 2024-01-16 14:32 UTC (History)

Fixed In Version: python-metalsmith-1.4.4-17.1.20230815101022.5e7461e.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-01-16 14:32:47 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary                                   Last Updated
OpenStack gerrit        827219          0        None      MERGED  Allow both 'network' and 'subnet' in NIC  2023-07-06 17:54:45 UTC
OpenStack gerrit        887867          0        None      MERGED  Allow both 'network' and 'subnet' in NIC  2023-07-31 19:53:26 UTC
Red Hat Issue Tracker   OSP-26308       0        None      None    None                                      2023-07-04 17:09:43 UTC
Red Hat Product Errata  RHBA-2024:0209  0        None      None    None                                      2024-01-16 14:32:49 UTC

Description Jaison Raju 2023-07-04 17:06:45 UTC
Description of problem:
The example in https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.0/html/director_installation_and_usage/assembly_provisioning-and-deploying-your-overcloud#proc_provisioning-bare-metal-nodes-for-the-overcloud_ironic_provisioning (section 5), which adds subnet information to overcloud-baremetal-deploy.yaml, fails with the current 17.1 package (python3-metalsmith-1.4.4-1.20230517141000.5e7461e.el9ost.noarch.rpm).

Our downstream package is missing this patch:
commit 264836d59ac741424c3fad4d47e51073722c848f
Author: Harald Jensås <hjensas>
Date:   Thu Dec 9 15:20:29 2021 +0100

    Allow both 'network' and 'subnet' in NIC


Version-Release number of selected component (if applicable):
17.1
python3-metalsmith-1.4.4-1.20230517141000.5e7461e.el9ost.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
The following command fails.
(undercloud) [stack@undercloud ~]$ openstack overcloud node provision --stack central --network-config -o /home/stack/templates/central/deployed_metal.yaml /home/stack/templates/central/overcloud-baremetal-deploy.yaml
The error seen (formatted in a more readable form) is:
Deploy attempt failed on node f18-h21-000-r640.tng.rdu2.scalelab.redhat.com (UUID 2fcd3a61-e212-48fc-9c4c-91b832a3ca9e), cleaning up
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/metalsmith/_provisioner.py", line 393, in provision_node
    nics.validate()
  File "/usr/lib/python3.9/site-packages/metalsmith/_nics.py", line 60, in validate
    result.append(('network', self._get_network(nic)))
  File "/usr/lib/python3.9/site-packages/metalsmith/_nics.py", line 136, in _get_network
    raise exceptions.InvalidNIC(
metalsmith.exceptions.InvalidNIC: Unexpected fields for a network: subnet
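
For readers following the traceback, the kind of check that produces this message can be pictured with the minimal sketch below. It is an illustration only, not the metalsmith source: the function name `check_network_nic`, the stand-in exception class, and both sets of allowed keys (including `fixed_ip`) are assumptions made for the example. It shows a NIC entry with both `network` and `subnet` being rejected under a stricter key set and accepted once `subnet` is allowed alongside `network`, which is what the patch title describes.

```python
# Illustrative sketch only; names and allowed-key sets are assumptions,
# not the actual metalsmith implementation.
class InvalidNIC(ValueError):
    """Stand-in for metalsmith.exceptions.InvalidNIC."""


def check_network_nic(nic, allowed=("network", "fixed_ip")):
    """Raise InvalidNIC if the NIC mapping has keys outside `allowed`."""
    unexpected = set(nic) - set(allowed)
    if unexpected:
        raise InvalidNIC(
            "Unexpected fields for a network: %s" % ", ".join(sorted(unexpected))
        )
    return nic


# NIC entry shaped like the ctlplane entry in the reporter's file.
nic = {"network": "ctlplane", "subnet": "leaf0"}

# Stricter (assumed pre-fix) key set: 'subnet' next to 'network' is rejected.
try:
    check_network_nic(nic)
except InvalidNIC as exc:
    print(exc)

# Relaxed (assumed post-fix) key set: the same entry passes validation.
check_network_nic(nic, allowed=("network", "subnet", "fixed_ip"))
```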

Expected results:


Additional info:
We are testing edge deployments with l3 routes on 17.1. This is a blocker for us.
I will try to see if I can manually patch just this fix.

Comment 1 Harald Jensås 2023-07-06 18:00:19 UTC
Hey, we have a job doing this in downstream CI:

- name: Compute1
  count: 2
  hostname_format: 'c1-compute-%index%'
  defaults:
    profile: compute
    network_config:
      template: /home/stack/virt/network/three-nics-vlans/compute.j2
    networks:
    - network: ctlplane
      vif: true
    - network: internal_api
      subnet: internal_api1_subnet
    - network: storage
      subnet: storage1_subnet
    - network: tenant
      subnet: tenant1_subnet
  instances:
  - name: compute-0
    hostname: compute-0
  - name: compute-1
    hostname: compute-1

- Which example are you using? Is it the specific nodes example?

Can you attach your /home/stack/templates/central/overcloud-baremetal-deploy.yaml file, please?

Comment 2 Harald Jensås 2023-07-06 18:17:30 UTC
So this is happening on the "ctlplane" network?
In that case, the workaround is simply to omit the subnet.
The correct subnet is selected automatically based on the physical_network bridge mappings in neutron.
The physical network property on the baremetal ports must be set, but this happens automatically when you introspect the nodes with OSP 17.x director.
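
As a rough illustration of the workaround, the sketch below removes the `subnet` key from `ctlplane` entries in role definitions that have already been parsed into Python lists and dictionaries. The helper name `drop_ctlplane_subnet` is hypothetical, and the sketch assumes the layout used in the examples in this bug (a `networks` list under `defaults`, and optionally under each entry in `instances`). Loading and writing the YAML file is left out on purpose.

```python
import copy


def drop_ctlplane_subnet(roles):
    """Return a copy of `roles` with 'subnet' removed from ctlplane entries.

    Hypothetical helper: assumes each role is a dict whose 'defaults' and
    'instances' entries may carry a 'networks' list of dicts.
    """
    cleaned = copy.deepcopy(roles)
    for role in cleaned:
        sections = [role.get("defaults", {})] + list(role.get("instances", []))
        for section in sections:
            for net in section.get("networks", []):
                if net.get("network") == "ctlplane":
                    # Leave subnet selection to neutron's physical_network mapping.
                    net.pop("subnet", None)
    return cleaned


roles = [
    {
        "name": "Controller0",
        "count": 3,
        "defaults": {
            "networks": [
                {"network": "ctlplane", "subnet": "leaf0", "vif": True},
                {"network": "storage", "subnet": "storage_leaf0_subnet"},
            ]
        },
    }
]

cleaned = drop_ctlplane_subnet(roles)
print(cleaned[0]["defaults"]["networks"])
```

Only the `ctlplane` entry is changed; the subnets on the other networks are left as they were, and the input list is not modified.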

Comment 3 Jaison Raju 2023-07-18 02:05:33 UTC
Introspection succeeds. I got this example from the document.
I am not sure if it is happening for ctlplane. Here is the file I used. Since I am doing this on baremetal, I made a few changes, but this file should be similar:

cat templates/central/overcloud-baremetal-deploy.yaml-bak
- name: Controller0
  count: 3
  defaults:
    resource_class: baremetal.control
    profile: control
    network_config:
      default_route_network:
      - External
      template: /home/stack/templates/central/network/leaf0/controller0.j2
    networks:
    - network: ctlplane
      subnet: leaf0
      vif: true
    - network: storage
      subnet: storage_leaf0_subnet
    - network: storage_mgmt
      subnet: storage_mgmt_leaf0_subnet
    - network: internal_api
      subnet: internalapi_leaf0_subnet
    - network: tenant
      subnet: tenant_leaf0_subnet
    - network: external
      subnet: external_leaf0_subnet
  ansible_playbooks:
    - playbook: /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-growvols.yaml
      extra_vars:
        growvols_args: >
          /=10GB
          /tmp=1GB
          /var/log=10GB
          /var/log/audit=1GB
          /home=10GB
          /srv=10GB
          /var=100%
- name: ComputeHCI-r640
  count: 4
  defaults:
    resource_class: baremetal.computel0
    profile: compute
    network_config:
      template: /home/stack/templates/central/network/leaf0/computehci-r640.j2
    networks:
    - network: ctlplane
      subnet: leaf0
      vif: true
    - network: storage
      subnet: storage_leaf0_subnet
    - network: internal_api
      subnet: internalapi_leaf0_subnet
    - network: tenant
      subnet: tenant_leaf0_subnet
  ansible_playbooks:
    - playbook: /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-growvols.yaml
      extra_vars:
        growvols_args: >
          /=10GB
          /tmp=1GB
          /var/log=10GB
          /var/log/audit=1GB
          /home=10GB
          /srv=10GB
          /var=100%
I have set the physical network property as per the documentation. I tried your recommendation and it lets me proceed, but PXE boot then fails to find the MAC under pxelinux.cfg/<mac>. Is removing the subnet the solution for this DCN deployment? Should we remove it from the documentation example?

The error I am facing now is probably not caused by removing the subnet, as the ports created for the nodes have physical-network 'ctlplane'.
Any idea what could be the issue?
http://perf1.lab.bos.redhat.com/jaison/edge-l3/osp-edge-backup2.tar.xz

Comment 4 Jaison Raju 2023-07-18 08:00:27 UTC
The PXE boot issue was resolved too.

Comment 10 James E. LaBarre 2023-12-04 22:42:51 UTC
Confirmed edits are in place in python3-metalsmith-1.4.4-17.1.20230815101022.5e7461e.el9ost.noarch.rpm from latest compose RHOS-17.1-RHEL-9-20231122.n.1

This compose ran phases 1, 2 & 3 with no errors in the package.

Comment 19 errata-xmlrpc 2024-01-16 14:32:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209

