Bug 2223941 - [FFU 16.2 -> 17.1] Wrong device name for Mellanox card after OS upgrade [NEEDINFO]
Summary: [FFU 16.2 -> 17.1] Wrong device name for Mellanox card after OS upgrade
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-19 10:47 UTC by Ricardo Diaz
Modified: 2023-08-03 15:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-20 17:45:25 UTC
Target Upstream Version:
Embargoed:
ifrangs: needinfo? (rhos-maint)




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26734 0 None None None 2023-07-19 10:49:45 UTC

Description Ricardo Diaz 2023-07-19 10:47:47 UTC
Description of problem:

Incorrect names for the ConnectX-5 cards in the RHEL 9 compute after the Operating System upgrade.

Before the FFU we need this configuration because all overcloud nodes initially run RHEL 8 (note that the device names for the mlx cards are enp4s0f0 and enp4s0f1):

~~~
(overcloud) [stack@undercloud-0 ~]$ cat ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/os-net-config-mappings.yaml 
---
resource_registry:
  OS::TripleO::ComputeOvsDpdkSriov::NodeUserData: /usr/share/openstack-tripleo-heat-templates/firstboot/os-net-config-mappings.yaml

parameter_defaults:
  NetConfigDataLookup:
    computegroup:
      dmiString: "system-product-name"
      id: "PowerEdge R730"
      nic1: "eno1" # In biosdevname "em1"
      nic2: "eno2" # In biosdevname "em2"
      nic3: "enp130s0f0" # In biosdevname "p4p1"
      nic4: "enp130s0f1" # In biosdevname "p4p2"
      nic5: "enp130s0f2" # In biosdevname "p4p3"
      nic6: "enp130s0f3" # In biosdevname "p4p4"
      nic7: "enp6s0f0" # In biosdevname "p7p1"
      nic8: "enp6s0f1" # In biosdevname "p7p2"
      nic9: "enp6s0f2" # In biosdevname "p7p3"
      nic10: "enp6s0f3" # In biosdevname "p7p4"
      nic11: "enp4s0f0" # In biosdevname "p6p1"
      nic12: "enp4s0f1" # In biosdevname "p6p2"
~~~

The same RHEL 8 device names are used in sriov-config.yaml, for example in NeutronPhysicalDevMappings:

~~~
(overcloud) [stack@undercloud-0 ~]$ cat ospd-16.2-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-hybrid/sriov-config.yaml 
---
parameter_defaults:
  ComputeOvsDpdkSriovExtraConfig:
    neutron::agents::ml2::sriov::resource_provider_hypervisors: "enp4s0f0:%{hiera('fqdn_canonical')},enp4s0f1:%{hiera('fqdn_canonical')}"
  ComputeOvsDpdkSriovParameters:
    NeutronSriovResourceProviderBandwidths: enp4s0f0:4000000:9000000,enp4s0f1:4000000:9000000
    NovaPCIPassthrough:
      - address: "0000:06:00.2"
        trusted: "true"
        physical_network: "sriov-1"
      - address: "0000:06:00.3"
        trusted: "true"
        physical_network: "sriov-2"
      - address: "0000:04:00.0"
        trusted: "true"
        physical_network: "sriov-mlx-1"
      - address: "0000:04:00.1"
        trusted: "true"
        physical_network: "sriov-mlx-2"

    NeutronPhysicalDevMappings:
      - "sriov-1:enp6s0f2"
      - "sriov-2:enp6s0f3"
      - "sriov-mlx-1:enp4s0f0"
      - "sriov-mlx-2:enp4s0f1"
~~~

Those configs work fine for the RHEL 8 compute but not for the RHEL 9 compute, because the device names change to enp4s0f0np0 and enp4s0f1np1 respectively:

~~~
pci@0000:04:00.0  enp4s0f0np0  network        MT27800 Family [ConnectX-5]            
pci@0000:04:00.1  enp4s0f1np1  network        MT27800 Family [ConnectX-5]     
~~~
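
The rename observed above can be sketched in Python. This is only an illustration under an assumption taken from this report: the RHEL 9 name is the RHEL 8 name plus `np<N>`, where `<N>` equals the PCI function number. The helper name `rhel9_style_name` is hypothetical, and other hardware (for example the X710 ports here, which keep their names) does not follow this pattern.

```python
import re


def rhel9_style_name(rhel8_name: str) -> str:
    """Append 'np<port>' to a name such as enp4s0f0.

    Assumption (from this report only): the port number equals the
    PCI function number, so enp4s0f0 becomes enp4s0f0np0.
    """
    match = re.fullmatch(r"enp\d+s\d+f(\d+)", rhel8_name)
    if match is None:
        raise ValueError(f"unexpected interface name: {rhel8_name}")
    return f"{rhel8_name}np{match.group(1)}"


# Only the two ConnectX-5 ports from this report are renamed.
renamed = {name: rhel9_style_name(name) for name in ("enp4s0f0", "enp4s0f1")}
print(renamed)
```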

As a consequence, VMs cannot be scheduled on RHEL 9 compute because of this error:

~~~
Fault: {'code': 500, 'created': '2023-07-19T09:44:50Z', 'message': 'Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology; Claim pci failed.', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2388, in _build_and_run_instance\n    with self.rt.instance_claim(context, instance, node, allocs,\n  File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner\n    return f(*args, **kwargs)\n  File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 171, in instance_claim\n    claim = claims.Claim(context, instance, nodename, self, cn,\n  File "/usr/lib/python3.9/site-packages/nova/compute/claims.py", line 72, in __init__\n    self._claim_test(compute_node, limits)\n  File "/usr/lib/python3.9/site-packages/nova/compute/claims.py", line 113, in _claim_test\n    raise exception.ComputeResourcesUnavailable(reason=\nnova.exception.ComputeResourcesUnavailable: Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology; Claim pci failed.\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2237, in _do_build_and_run_instance\n    self._build_and_run_instance(context, instance, image,\n  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2439, in _build_and_run_instance\n    raise exception.RescheduledException(\nnova.exception.RescheduledException: Build of instance cbf54424-fa45-407d-9b66-df1a1a2b2baf was re-scheduled: Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology; Claim pci failed.\n'}
~~~

We can observe the following errors in /var/log/os-net-config.log of RHEL 9 compute:

~~~
[root@computedpdksriov-0 ~]# grep ERROR /var/log/os-net-config.log.0 
2023-07-19 05:39:48.590 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:39:48.590 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:48:09.871 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:48:09.871 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:02.918 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:02.918 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:40.160 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:49:40.160 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:52:29.064 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:52:29.064 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:53:27.165 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
2023-07-19 05:53:27.165 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 not found in available nics (eno1, eno2, eno3, eno4, enp4s0f0np0, enp4s0f1np1, enp6s0f0, enp6s0f1, enp6s0f2, enp6s0f3, enp130s0f0, enp130s0f1, enp130s0f2, enp130s0f3)
~~~
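
As a rough aid for triaging these errors, the following Python sketch extracts each missing NIC from the log lines and proposes the single available NIC that starts with the same name. The function name `suggest_renames` is hypothetical, the prefix match is a heuristic rather than anything os-net-config does itself, and the sample lines shorten the list of available NICs for brevity.

```python
import re

LOG_PATTERN = re.compile(
    r"nic (?P<missing>\S+) not found in available nics \((?P<available>[^)]*)\)"
)


def suggest_renames(log_text: str) -> dict:
    """Map each missing NIC to the one available NIC that extends its name."""
    suggestions = {}
    for match in LOG_PATTERN.finditer(log_text):
        missing = match.group("missing")
        available = [n.strip() for n in match.group("available").split(",")]
        candidates = [n for n in available if n.startswith(missing) and n != missing]
        # Only suggest a rename when the match is unambiguous.
        if len(candidates) == 1:
            suggestions[missing] = candidates[0]
    return suggestions


# Sample lines modelled on the log above (available NIC list shortened).
sample = (
    "2023-07-19 05:39:48.590 ERROR os_net_config.objects.mapped_nics nic enp4s0f0 "
    "not found in available nics (eno1, eno2, enp4s0f0np0, enp4s0f1np1, enp6s0f0)\n"
    "2023-07-19 05:39:48.590 ERROR os_net_config.objects.mapped_nics nic enp4s0f1 "
    "not found in available nics (eno1, eno2, enp4s0f0np0, enp4s0f1np1, enp6s0f0)\n"
)
print(suggest_renames(sample))
```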

And the VFs are not properly configured for mlx:

~~~
[root@computedpdksriov-0 ~]# lshw -c network -businfo                                                                                                                                                              
Bus info          Device       Class          Description                                                                                                                                                          
=========================================================                                                                                                                                                          
pci@0000:04:00.0  enp4s0f0np0  network        MT27800 Family [ConnectX-5]                                                                                                                                          
pci@0000:04:00.1  enp4s0f1np1  network        MT27800 Family [ConnectX-5]                                                                                                                                          
pci@0000:06:00.0               network        Ethernet Controller X710 for 10GbE SFP+                                                                                                                              
...
~~~

This issue can be manually fixed by performing the following actions:

1. Modify these configuration files:
~~~
[root@computedpdksriov-0 ~]# diff /etc/os-net-config/mapping.yaml.orig /etc/os-net-config/mapping.yaml
4,5c4,5
<   nic11: enp4s0f0
<   nic12: enp4s0f1
---
>   nic11: enp4s0f0np0
>   nic12: enp4s0f1np1

[root@computedpdksriov-0 ~]# cat /var/lib/os-net-config/sriov_config.yaml 
- device_type: pf
  link_mode: legacy
  name: enp6s0f2
  numvfs: 10
  promisc: 'off'
  vdpa: false
- device_type: pf
  link_mode: legacy
  name: enp6s0f3
  numvfs: 10
  promisc: 'off'
  vdpa: false
- device_type: pf
  link_mode: legacy
  name: enp4s0f0np0
  numvfs: 10
  promisc: 'off'
  vdpa: false
- device_type: pf
  link_mode: legacy
  name: enp4s0f1np1
  numvfs: 10
  promisc: 'off'
  vdpa: false

[root@computedpdksriov-0 ~]# diff /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/sriov_agent.ini.orig /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/sriov_agent.ini
153,155c153,155
< physical_device_mappings=sriov-1:enp6s0f2,sriov-2:enp6s0f3,sriov-mlx-1:enp4s0f0,sriov-mlx-2:enp4s0f1
< resource_provider_bandwidths=enp4s0f0:4000000:9000000,enp4s0f1:4000000:9000000
< resource_provider_hypervisors=enp4s0f0:computedpdksriov-0.localdomain,enp4s0f1:computedpdksriov-0.localdomain
---
> physical_device_mappings=sriov-1:enp6s0f2,sriov-2:enp6s0f3,sriov-mlx-1:enp4s0f0np0,sriov-mlx-2:enp4s0f1np1
> resource_provider_bandwidths=enp4s0f0np0:4000000:9000000,enp4s0f1np1:4000000:9000000
> resource_provider_hypervisors=enp4s0f0np0:computedpdksriov-0.localdomain,enp4s0f1np1:computedpdksriov-0.localdomain
~~~

2. Run: `os-net-config -c /etc/os-net-config/config.json`

3. Restart the tripleo_nova_compute.service and tripleo_neutron_sriov_agent.service services
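
The sriov_agent.ini changes in the manual workaround amount to renaming device tokens inside comma-separated option values. A minimal Python sketch of that substitution follows; the helper `rename_devices` is hypothetical and works on option strings only, it does not read or write any file.

```python
def rename_devices(value: str, renames: dict) -> str:
    """Rename device tokens in a comma-separated, colon-delimited option value.

    A field is replaced only when it matches an old device name exactly,
    so network names, bandwidths and hostnames are left untouched.
    """
    entries = []
    for entry in value.split(","):
        fields = [renames.get(field, field) for field in entry.split(":")]
        entries.append(":".join(fields))
    return ",".join(entries)


renames = {"enp4s0f0": "enp4s0f0np0", "enp4s0f1": "enp4s0f1np1"}

mappings = "sriov-1:enp6s0f2,sriov-2:enp6s0f3,sriov-mlx-1:enp4s0f0,sriov-mlx-2:enp4s0f1"
bandwidths = "enp4s0f0:4000000:9000000,enp4s0f1:4000000:9000000"

print(rename_devices(mappings, renames))
print(rename_devices(bandwidths, renames))
```

The same function covers `physical_device_mappings`, `resource_provider_bandwidths`, and `resource_provider_hypervisors`, because the device name is always a whole colon-delimited field.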

After performing those actions we can observe that VFs are properly configured:

~~~
 [root@computedpdksriov-0 ~]# lshw -c network -businfo                                                                                                                                                              
Bus info          Device       Class          Description                                                                                                                                                          
=========================================================                                                                                                                                                          
pci@0000:04:00.0  enp4s0f0np0  network        MT27800 Family [ConnectX-5]                                                                                                                                          
pci@0000:04:00.1  enp4s0f1np1  network        MT27800 Family [ConnectX-5]                                                                                                                                          
pci@0000:04:00.2  enp4s0f0v0   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:00.3  enp4s0f0v1   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.4  enp4s0f0v2   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:00.5  enp4s0f0v3   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:00.6  enp4s0f0v4   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:00.7  enp4s0f0v5   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:01.0  enp4s0f0v6   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:01.1  enp4s0f0v7   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:01.2  enp4s0f0v8   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:01.3  enp4s0f0v9   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:02.6  enp4s0f1v0   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:02.7  enp4s0f1v1   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.0  enp4s0f1v2   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.1  enp4s0f1v3   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.2  enp4s0f1v4   network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.3  enp4s0f1v5   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.4  enp4s0f1v6   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.5  enp4s0f1v7   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.6  enp4s0f1v8   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:04:03.7  enp4s0f1v9   network        MT27800 Family [ConnectX-5 Virtual Function]                                                                                                                         
pci@0000:06:00.0               network        Ethernet Controller X710 for 10GbE SFP+   
...
~~~

And the VMs can be created on RHEL 9 as well:

~~~
(overcloud) [stack@undercloud-0 ~]$ openstack server list --all --long -fyaml
- Availability Zone: nova
  Flavor: nfv_qe_base_flavor
  Host: computedpdksriov-0.localdomain
  ID: 0aa34d88-71d1-46c4-8782-3233f543f712
  Image ID: 5d6c36bc-400f-4573-9030-89ad743b39cb
  Image Name: rhel-guest-image-nfv-2-8.7-1660.x86_64.qcow2
  Name: tempest-TestNfvBasic-server-1014796180
  Networks:
    dpdk-mgmt:
    - 10.10.10.196
    - 10.46.141.162
    sriov_mellanox_vf:
    - 60.0.0.151
  Power State: 1
  Properties: {}
  Status: ACTIVE
  Task State: null
~~~


Version-Release number of selected component (if applicable):
17.1 after FFU


How reproducible:
100 %

Actual results:
VMs cannot be spawned on RHEL 9 compute

Expected results:
VMs expected to be created on RHEL 9.

In order to do that the enp4s0fN devnames must be adapted to enp4s0fNnpN for RHEL 9 computes.

Comment 1 Lukas Bezdicka 2023-07-19 12:09:14 UTC
You probably don't have the following in system_upgrade.yaml:
NICsPrefixesToUdev: ['en']

The docs should mention that this setting is needed on Mellanox and Dell machines.

Not a blocker!

Comment 2 Ricardo Diaz 2023-07-20 17:27:44 UTC
After including `NICsPrefixesToUdev: ['en']` in system_upgrade.yaml:
~~~
[stack@undercloud-0 ~]$ grep NICsPrefixesToUdev system_upgrade.yaml 
  NICsPrefixesToUdev: ['en']
~~~

No errors in /var/log/os-net-config.log:
~~~
[root@computedpdksriov-0 ~]# grep -i error /var/log/os-net-config.log 
[root@computedpdksriov-0 ~]# 
~~~

And the ConnectX-5 NICs and VFs are properly configured:
~~~
[root@computedpdksriov-0 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 9.2 (Plow)

[root@computedpdksriov-0 ~]# lshw -c network -businfo
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  enp4s0f0    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.1  enp4s0f1    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.2  enp4s0f0v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.3  enp4s0f0v1  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.4  enp4s0f0v2  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.5  enp4s0f0v3  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.6  enp4s0f0v4  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.7  enp4s0f0v5  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.0  enp4s0f0v6  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.1  enp4s0f0v7  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.2  enp4s0f0v8  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:01.3  enp4s0f0v9  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.6  enp4s0f1v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.7  enp4s0f1v1  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.0  enp4s0f1v2  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.1  enp4s0f1v3  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.2  enp4s0f1v4  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.3  enp4s0f1v5  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.4  enp4s0f1v6  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.5  enp4s0f1v7  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.6  enp4s0f1v8  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:03.7  enp4s0f1v9  network        MT27800 Family [ConnectX-5 Virtual Function]
~~~
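
To check output like this without counting rows by hand, a small Python sketch can tally the Virtual Function rows per PF. The function name `count_vfs` is hypothetical, the parsing assumes the `lshw -c network -businfo` column layout shown above, and the sample is a shortened excerpt.

```python
import re
from collections import Counter

# VF device names in this report look like <pf>v<index>, e.g. enp4s0f0v3.
VF_NAME = re.compile(r"^(?P<pf>\S+?)v\d+$")


def count_vfs(lshw_output: str) -> Counter:
    """Count 'Virtual Function' rows per parent PF name."""
    counts = Counter()
    for line in lshw_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith("pci@"):
            continue
        match = VF_NAME.match(fields[1])
        if match and "Virtual Function" in line:
            counts[match.group("pf")] += 1
    return counts


# Shortened excerpt of the output above.
sample_lshw = """\
pci@0000:04:00.0  enp4s0f0    network        MT27800 Family [ConnectX-5]
pci@0000:04:00.2  enp4s0f0v0  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:00.3  enp4s0f0v1  network        MT27800 Family [ConnectX-5 Virtual Function]
pci@0000:04:02.6  enp4s0f1v0  network        MT27800 Family [ConnectX-5 Virtual Function]
"""
print(count_vfs(sample_lshw))
```

On the full output, both PFs should report the `numvfs: 10` value configured in sriov_config.yaml.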

And VMs can be instantiated on RHEL 9 compute:
~~~
+--------------------------------------+----------------------------------------+--------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+--------------------------------------+--------------------+-------------------+--------------------------------+------------+
| ID                                   | Name                                   | Status | Task State | Power State | Networks                                                                                                                              | Image Name                                   | Image ID                             | Flavor             | Availability Zone | Host                           | Properties |
+--------------------------------------+----------------------------------------+--------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+--------------------------------------+--------------------+-------------------+--------------------------------+------------+
| 0904c14a-e042-45a2-8d0b-50667979fb47 | tempest-TestNfvBasic-server-1890780728 | ACTIVE | None       | Running     | dpdk-data=10.10.20.178; dpdk-mgmt=10.10.10.133, 10.46.141.171; sriov_mellanox_vf=60.0.0.100; sriov_pf=50.0.0.198; sriov_vf=40.0.0.137 | rhel-guest-image-nfv-2-8.7-1660.x86_64.qcow2 | 6d2a4a12-bef3-4162-8708-23626a4c7c08 | nfv_qe_base_flavor | nova              | computedpdksriov-0.localdomain |            |
+--------------------------------------+----------------------------------------+--------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+--------------------------------------+--------------------+-------------------+--------------------------------+------------+
~~~

