Description of problem:

The ConnectX-5 cards get incorrect device names on the RHEL 9 compute node after the operating system upgrade.

Before the FFU we need the following configuration, because all overcloud nodes are initially RHEL 8 (note that the devnames for the mlx cards are enp4s0f0 and enp4s0f1):

~~~
(overcloud) [stack@undercloud-0 ~]$ cat ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/os-net-config-mappings.yaml
---
resource_registry:
  OS::TripleO::ComputeOvsDpdkSriov::NodeUserData: /usr/share/openstack-tripleo-heat-templates/firstboot/os-net-config-mappings.yaml

parameter_defaults:
  NetConfigDataLookup:
    computegroup:
      dmiString: "system-product-name"
      id: "PowerEdge R730"
      nic1: "eno1" # In biosdevname "em1"
      nic2: "eno2" # In biosdevname "em2"
      nic3: "enp130s0f0" # In biosdevname "p4p1"
      nic4: "enp130s0f1" # In biosdevname "p4p2"
      nic5: "enp130s0f2" # In biosdevname "p4p3"
      nic6: "enp130s0f3" # In biosdevname "p4p4"
      nic7: "enp6s0f0" # In biosdevname "p7p1"
      nic8: "enp6s0f1" # In biosdevname "p7p2"
      nic9: "enp6s0f2" # In biosdevname "p7p3"
      nic10: "enp6s0f3" # In biosdevname "p7p4"
      nic11: "enp4s0f0" # In biosdevname "p6p1"
      nic12: "enp4s0f1" # In biosdevname "p6p2"
~~~

The same devnames are used for NeutronPhysicalDevMappings (and the related parameters) in sriov-config.yaml:

~~~
(overcloud) [stack@undercloud-0 ~]$ cat ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/sriov-config.yaml
---
parameter_defaults:
  ComputeOvsDpdkSriovExtraConfig:
    neutron::agents::ml2::sriov::resource_provider_hypervisors: "enp4s0f0:%{hiera('fqdn_canonical')},enp4s0f1:%{hiera('fqdn_canonical')}"
  ComputeOvsDpdkSriovParameters:
    NeutronSriovResourceProviderBandwidths: enp4s0f0:4000000:9000000,enp4s0f1:4000000:9000000
    NovaPCIPassthrough:
      - address: "0000:06:00.2"
        trusted: "true"
        physical_network: "sriov-1"
      - address: "0000:06:00.3"
        trusted: "true"
        physical_network: "sriov-2"
      - address: "0000:04:00.0"
        trusted: "true"
        physical_network: "sriov-mlx-1"
      - address: "0000:04:00.1"
        trusted: "true"
        physical_network: "sriov-mlx-2"
    NeutronPhysicalDevMappings:
      - "sriov-1:enp6s0f2"
      - "sriov-2:enp6s0f3"
      - "sriov-mlx-1:enp4s0f0"
      - "sriov-mlx-2:enp4s0f1"
~~~

Those configs work fine for the RHEL 8 compute but not for the RHEL 9 compute, because there the devnames change to enp4s0f0np0 and enp4s0f1np1 respectively:

~~~
pci@0000:04:00.0  enp4s0f0np0  network        MT27800 Family [ConnectX-5]
pci@0000:04:00.1  enp4s0f1np1  network        MT27800 Family [ConnectX-5]
~~~

As a consequence, VMs cannot be scheduled on the RHEL 9 compute because of this error:

~~~
Fault: {'code': 500, 'created': '2023-07-19T09:44:50Z', 'message': 'Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology; Claim pci failed.', 'details': 'Traceback (most recent call last):\n File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2388, in _build_and_run_instance\n with self.rt.instance_claim(context, instance, node, allocs,\n File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner\n return f(*args, **kwargs)\n File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 171, in instance_claim\n claim = claims.Claim(context, instance, nodename, self, cn,\n File "/usr/lib/python3.9/site-packages/nova/compute/claims.py", line 72, in __init__\n self._claim_test(compute_node, limits)\n File "/usr/lib/python3.9/site-packages/nova/compute/claims.py", line 113, in _claim_test\n raise exception.ComputeResourcesUnavailable(reason=\nnova.exception.ComputeResourcesUnavailable: Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology; Claim pci failed.\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2237, in _do_build_and_run_instance\n self._build_and_run_instance(context, instance, image,\n File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2439, in _build_and_run_instance\n raise exception.RescheduledException(\nnova.exception.RescheduledException: Build of instance cbf54424-fa45-407d-9b66-df1a1a2b2baf was re-scheduled: Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology; Claim pci failed.\n'}
~~~

We can observe the following errors in /var/log/os-net-config.log of the RHEL 9 compute:

~~~
[root@computedpdksriov-0 ~]# grep ERROR /var/log/os-net-config.log.0
2023-07-19 05:39:48.590 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:39:48.590 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:48:09.871 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:48:09.871 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:02.918 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:02.918 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:40.160 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:40.160 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:52:29.064 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:52:29.064 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:53:27.165 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:53:27.165 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
~~~

And the VFs are not configured for the mlx cards:

~~~
[root@computedpdksriov-0 ~]# lshw -c network -businfo
Bus info          Device       Class          Description
=========================================================
pci@0000:04:00.0  enp4s0f0np0  network        MT27800 Family [ConnectX-5]
pci@0000:04:00.1  enp4s0f1np1  network        MT27800 Family [ConnectX-5]
pci@0000:06:00.0               network        Ethernet Controller X710 for 10GbE SFP+
...
~~~

This issue can be fixed manually by performing the following actions:

1. Modify these configuration files:

~~~
[root@computedpdksriov-0 ~]# diff /etc/os-net-config/mapping.yaml.orig /etc/os-net-config/mapping.yaml
4,5c4,5
< nic11: enp4s0f0
< nic12: enp4s0f1
---
> nic11: enp4s0f0np0
> nic12: enp4s0f1np1

[root@computedpdksriov-0 ~]# cat /var/lib/os-net-config/sriov_config.yaml
- device_type: pf
  link_mode: legacy
  name: enp6s0f2
  numvfs: 10
  promisc: 'off'
  vdpa: false
- device_type: pf
  link_mode: legacy
  name: enp6s0f3
  numvfs: 10
  promisc: 'off'
  vdpa: false
- device_type: pf
  link_mode: legacy
  name: enp4s0f0np0
  numvfs: 10
  promisc: 'off'
  vdpa: false
- device_type: pf
  link_mode: legacy
  name: enp4s0f1np1
  numvfs: 10
  promisc: 'off'
  vdpa: false

[root@computedpdksriov-0 ~]# diff /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/sriov_agent.ini.orig /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/sriov_agent.ini
153,155c153,155
< physical_device_mappings=sriov-1:enp6s0f2,sriov-2:enp6s0f3,sriov-mlx-1:enp4s0f0,sriov-mlx-2:enp4s0f1
< resource_provider_bandwidths=enp4s0f0:4000000:9000000,enp4s0f1:4000000:9000000
< resource_provider_hypervisors=enp4s0f0:computedpdksriov-0.localdomain,enp4s0f1:computedpdksriov-0.localdomain
---
> physical_device_mappings=sriov-1:enp6s0f2,sriov-2:enp6s0f3,sriov-mlx-1:enp4s0f0np0,sriov-mlx-2:enp4s0f1np1
> resource_provider_bandwidths=enp4s0f0np0:4000000:9000000,enp4s0f1np1:4000000:9000000
> resource_provider_hypervisors=enp4s0f0np0:computedpdksriov-0.localdomain,enp4s0f1np1:computedpdksriov-0.localdomain
~~~

2. Run: `os-net-config -c /etc/os-net-config/config.json`

3. Restart the tripleo_nova_compute.service and tripleo_neutron_sriov_agent.service services.

After performing those actions we can observe that the VFs are properly configured:

~~~
[root@computedpdksriov-0 ~]# lshw -c network -businfo
Bus info          Device       Class          Description
=========================================================
pci@0000:04:00.0  enp4s0f0np0  network        MT27800 Family [ConnectX-5]
pci@0000:04:00.1  enp4s0f1np1  network        MT27800 Family [ConnectX-5]
pci@0000:04:00.2  enp4s0f0v0   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.3  enp4s0f0v1   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.4  enp4s0f0v2   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.5  enp4s0f0v3   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.6  enp4s0f0v4   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.7  enp4s0f0v5   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.0  enp4s0f0v6   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.1  enp4s0f0v7   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.2  enp4s0f0v8   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.3  enp4s0f0v9   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.6  enp4s0f1v0   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.7  enp4s0f1v1   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.0  enp4s0f1v2   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.1  enp4s0f1v3   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.2  enp4s0f1v4   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.3  enp4s0f1v5   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.4  enp4s0f1v6   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.5  enp4s0f1v7   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.6  enp4s0f1v8   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.7  enp4s0f1v9   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:06:00.0               network        Ethernet Controller X710 for 10GbE SFP+
...
~~~

And the VMs can be created on RHEL 9 as well:

~~~
(overcloud) [stack@undercloud-0 ~]$ openstack server list --all --long -fyaml
- Availability Zone: nova
  Flavor: nfv_qe_base_flavor
  Host: computedpdksriov-0.localdomain
  ID: 0aa34d88-71d1-46c4-8782-3233f543f712
  Image ID: 5d6c36bc-400f-4573-9030-89ad743b39cb
  Image Name: rhel-guest-image-nfv-2-8.7-1660.x86_64.qcow2
  Name: tempest-TestNfvBasic-server-1014796180
  Networks:
    dpdk-mgmt:
    - 10.10.10.196
    - 10.46.141.162
    sriov_mellanox_vf:
    - 60.0.0.151
  Power State: 1
  Properties: {}
  Status: ACTIVE
  Task State: null
~~~

Version-Release number of selected component (if applicable):
17.1 after FFU

How reproducible:
100 %

Actual results:
VMs cannot be spawned on the RHEL 9 compute.

Expected results:
VMs can be created on the RHEL 9 compute. For that to work, the enp4s0fN devnames in the configuration must be adapted to enp4s0fNnpN for RHEL 9 computes.
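The mismatch reported by os-net-config above can be illustrated with a small check. This is only a sketch and not part of os-net-config: the helper name `find_renamed_nics` is hypothetical, and the rule that the new name is the old name plus an `np<N>` suffix is an assumption taken from the device names seen in this report.

```python
import re


def find_renamed_nics(mapping, available):
    """Return {alias: (configured_name, candidate)} for names missing on the host.

    candidate is the single available NIC named '<configured_name>np<N>'
    (assumed renaming pattern), or None when there is no unambiguous match.
    """
    available_set = set(available)
    stale = {}
    for alias, name in mapping.items():
        if name in available_set:
            continue  # configured name still exists, nothing to do
        pattern = re.escape(name) + r"np\d+"
        candidates = [nic for nic in available if re.fullmatch(pattern, nic)]
        stale[alias] = (name, candidates[0] if len(candidates) == 1 else None)
    return stale


# Values copied from the mapping file and the os-net-config error message.
mapping = {"nic1": "eno1", "nic10": "enp6s0f3",
           "nic11": "enp4s0f0", "nic12": "enp4s0f1"}
available = ["eno1", "eno2", "eno3", "eno4", "enp4s0f0np0", "enp4s0f1np1",
             "enp6s0f0", "enp6s0f1", "enp6s0f2", "enp6s0f3",
             "enp130s0f0", "enp130s0f1", "enp130s0f2", "enp130s0f3"]

stale = find_renamed_nics(mapping, available)
for alias, (old, new) in sorted(stale.items()):
    print(f"{alias}: {old} not found, candidate: {new}")
```

With the values from this report, only nic11 and nic12 are flagged, which matches the two entries changed in the mapping.yaml diff.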
You probably don't have `NICsPrefixesToUdev: ['en']` in system_upgrade.yaml.

The docs should mention that this parameter is needed on Mellanox and Dell machines. Not a blocker!
After including `NICsPrefixesToUdev: ['en']` in system_upgrade.yaml:

~~~
[stack@undercloud-0 ~]$ grep NICsPrefixesToUdev system_upgrade.yaml
  NICsPrefixesToUdev: ['en']
~~~

There are no errors in /var/log/os-net-config.log:

~~~
[root@computedpdksriov-0 ~]# grep -i error /var/log/os-net-config.log
[root@computedpdksriov-0 ~]#
~~~

The ConnectX-5 NICs keep their RHEL 8 names and the VFs are properly configured:

~~~
[root@computedpdksriov-0 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[root@computedpdksriov-0 ~]# lshw -c network -businfo
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  enp4s0f0    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.1  enp4s0f1    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.2  enp4s0f0v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.3  enp4s0f0v1  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.4  enp4s0f0v2  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.5  enp4s0f0v3  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.6  enp4s0f0v4  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.7  enp4s0f0v5  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.0  enp4s0f0v6  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.1  enp4s0f0v7  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.2  enp4s0f0v8  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.3  enp4s0f0v9  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.6  enp4s0f1v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.7  enp4s0f1v1  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.0  enp4s0f1v2  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.1  enp4s0f1v3  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.2  enp4s0f1v4  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.3  enp4s0f1v5  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.4  enp4s0f1v6  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.5  enp4s0f1v7  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.6  enp4s0f1v8  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.7  enp4s0f1v9  network        MT27800 Family [ConnectX-5 Virtual Function]
~~~

And VMs can be instantiated on the RHEL 9 compute:

~~~
+--------------------------------------+----------------------------------------+--------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+--------------------------------------+--------------------+-------------------+--------------------------------+------------+
| ID                                   | Name                                   | Status | Task State | Power State | Networks                                                                                                                              | Image Name                                   | Image ID                             | Flavor             | Availability Zone | Host                           | Properties |
+--------------------------------------+----------------------------------------+--------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+--------------------------------------+--------------------+-------------------+--------------------------------+------------+
| 0904c14a-e042-45a2-8d0b-50667979fb47 | tempest-TestNfvBasic-server-1890780728 | ACTIVE | None       | Running     | dpdk-data=10.10.20.178; dpdk-mgmt=10.10.10.133, 10.46.141.171; sriov_mellanox_vf=60.0.0.100; sriov_pf=50.0.0.198; sriov_vf=40.0.0.137 | rhel-guest-image-nfv-2-8.7-1660.x86_64.qcow2 | 6d2a4a12-bef3-4162-8708-23626a4c7c08 | nfv_qe_base_flavor | nova              | computedpdksriov-0.localdomain |            |
+--------------------------------------+----------------------------------------+--------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+--------------------------------------+--------------------+-------------------+--------------------------------+------------+
~~~
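The VF check done by eye on the `lshw -c network -businfo` output can also be scripted. The following is a rough sketch under stated assumptions: the helper name `count_vfs` is hypothetical, VF lines are recognised by the "Virtual Function" text in the description, and the parent PF is inferred from the `<pf>v<N>` device-name pattern seen in the output above.

```python
import re
from collections import Counter

# VF device names in the output above look like '<pf>v<index>', e.g. enp4s0f0v3.
VF_NAME = re.compile(r"^(?P<pf>\w+?)v\d+$")


def count_vfs(lshw_output):
    """Count Virtual Function lines per parent PF name prefix."""
    counts = Counter()
    for line in lshw_output.splitlines():
        fields = line.split()
        # Skip the header, the separator and anything that is not a PCI entry.
        if len(fields) < 3 or not fields[0].startswith("pci@"):
            continue
        if "Virtual Function" not in line:
            continue
        match = VF_NAME.match(fields[1])
        if match:
            counts[match.group("pf")] += 1
    return dict(counts)


# Shortened sample in the same layout as the lshw output in this report.
sample = """\
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  enp4s0f0    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.1  enp4s0f1    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.2  enp4s0f0v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.3  enp4s0f0v1  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.6  enp4s0f1v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:06:00.0              network        Ethernet Controller X710 for 10GbE SFP+
"""

vf_counts = count_vfs(sample)
print(vf_counts)
```

On the full output from the compute node, each ConnectX-5 PF should report 10 VFs, matching `numvfs: 10` in sriov_config.yaml; an empty result corresponds to the broken state shown in the description.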