Bug 2054417 - SR-IOV policies do not trigger the VFs provisioning until manually power cycling the node
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Vrinda
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-02-14 22:11 UTC by Manuel Rodriguez
Modified: 2022-05-31 15:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-31 15:01:23 UTC
Target Upstream Version:
Embargoed:



Description Manuel Rodriguez 2022-02-14 22:11:13 UTC
Description of problem:

After creating SR-IOV policies, neither the allocatable resources nor the VFs appear on the node until we manually power cycle it.


Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.8.31
Server Version: 4.8.31
Kubernetes Version: v1.21.6+b82a451

$ oc get csv -n openshift-sriov-network-operator
NAME                                        DISPLAY                   VERSION              REPLACES   PHASE
sriov-network-operator.4.8.0-202201210133   SR-IOV Network Operator   4.8.0-202201210133              Succeeded



How reproducible:
Always, on OCP 4.8 with Mellanox cards.
We are running single-node OpenShift (SNO).


Steps to Reproduce:
1. Deploy OCP 4.8 Latest
2. Install SR-IOV operator
3. Deploy one or multiple SR-IOV policies with a Mellanox card.
4. Verify the policies are created, but no VFs are created in the node.



Actual results:

The SR-IOV policies are created, but no VFs are listed on the interface and no SR-IOV allocatable resources appear on the node.

$ oc get SriovNetworkNodePolicy -n openshift-sriov-network-operator
NAME                       AGE
default                    62m
mlnx6-dpdk-node-policy01   32m
mlnx6-dpdk-node-policy02   32m
mlnx6-dpdk-node-policy03   32m
mlnx6-dpdk-node-policy04   32m

$ oc get node
NAME         STATUS   ROLES           AGE   VERSION
snohost-02   Ready    master,worker   82m   v1.21.6+b82a451

$ oc get node snohost-02 -o json | jq .status.allocatable
{
  "cpu": "111500m",
  "ephemeral-storage": "482690118881",
  "hugepages-1Gi": "64Gi",
  "hugepages-2Mi": "0",
  "memory": "195460796Ki",
  "pods": "250"
}

$ sudo ip link show ens8f1
9: ens8f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:e1:db:d9 brd ff:ff:ff:ff:ff:ff



Expected results:

- The requested VFs and allocatable resources are listed on the node. If a power cycle is required, the SR-IOV operator and MC should trigger the node reboot.

[kni05@sno-provisioner01 ~]$ oc get node snohost-02 -o json | jq .status.allocatable
{
  "cpu": "111500m",
  "ephemeral-storage": "482690118881",
  "hugepages-1Gi": "64Gi",
  "hugepages-2Mi": "0",
  "memory": "195460796Ki",
  "openshift.io/cucp": "4",
  "openshift.io/cuup": "4",
  "openshift.io/n3": "4",
  "openshift.io/n6": "4",
  "pods": "250"
}

[core@snohost-02 ~]$ sudo lspci -nn | grep Eth
...
d8:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
d8:02.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:02.3 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:02.4 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:02.5 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:02.6 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:02.7 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.3 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.4 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.5 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.6 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:03.7 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:04.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
d8:04.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
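
As a quick cross-check of the expected state, the VF lines in the lspci output can be counted and compared with the allocatable resources. The helper below is only a minimal sketch: the name count_vfs is hypothetical, and it assumes that VF devices carry the string "Virtual Function" in their description, as in the listing above.

```shell
# Hypothetical helper: count VF devices in lspci output read from stdin.
# Assumes each VF line contains the text "Virtual Function".
count_vfs() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the helper usable under "set -e".
    grep -c 'Virtual Function' || true
}

# Example (run on the node):
#   sudo lspci -nn | count_vfs
```

For the 16 VF lines shown above this prints 16, which matches the four openshift.io resources with 4 VFs each.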



Additional info:

[core@snohost-02 ~]$ sudo lspci -nn | grep Eth
...
d8:00.1 Ethernet controller [0200]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]

[core@snohost-02 ~]$ sudo ethtool -i ens8f1
driver: mlx5_core
version: 5.0-0
firmware-version: 20.27.4000 (MT_0000000236)
expansion-rom-version: 
bus-info: 0000:d8:00.1

must-gather data is attached; it was collected with the ose-sriov-operator-must-gather image.

Comment 4 Manuel Rodriguez 2022-02-17 03:59:41 UTC
Hi

I think I reproduced the problem and found how to fix it. The SR-IOV feature was not enabled on the NIC at the BIOS level, so lspci did not show the SR-IOV capabilities and the file /sys/class/net/<nic-name>/device/sriov_totalvfs did not exist. I reproduced this with a Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017] NIC, but I suspect the same behavior occurs with any NIC in the same situation.

We got the same results on a Dell R640 and an HP ProLiant DL360 Gen10. As soon as I enabled SR-IOV on the NIC itself (not the global BIOS option), the capabilities became visible and the sriov_totalvfs file was created, returning the value I set in the BIOS. Now I can create the SR-IOV policies and the VFs are created without a reboot; I can also delete the policies and the VFs get cleaned up.
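
For anyone hitting the same symptom, a pre-check along these lines may help before creating policies. It is a rough sketch, not something the operator provides: the function name check_sriov_capable and the optional sysfs-root argument are my own assumptions, and the only thing it relies on is the sriov_totalvfs file mentioned above.

```shell
# Hypothetical pre-check: does the OS see SR-IOV support on a NIC?
# $1 = interface name (e.g. ens8f1)
# $2 = sysfs root, defaults to /sys (overridable, e.g. for testing)
check_sriov_capable() {
    nic="$1"
    sysfs_root="${2:-/sys}"
    totalvfs_file="$sysfs_root/class/net/$nic/device/sriov_totalvfs"

    # The file is absent when SR-IOV is not enabled on the NIC itself.
    if [ ! -r "$totalvfs_file" ]; then
        echo "$nic: sriov_totalvfs missing, SR-IOV is probably disabled on the NIC"
        return 1
    fi

    total=$(cat "$totalvfs_file")
    # A non-numeric or zero value means no VFs can be created.
    if [ "$total" -gt 0 ] 2>/dev/null; then
        echo "$nic: SR-IOV available, up to $total VFs"
        return 0
    fi
    echo "$nic: sriov_totalvfs is $total, no VFs can be created"
    return 1
}

# Example (run on the node):
#   check_sriov_capable ens8f1
```

If this returns non-zero, the NIC-level SR-IOV setting in the BIOS is the first thing to look at.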

We also have a Dell R740 where we couldn't find the Mellanox ConnectX-6 card options anywhere in the BIOS. I checked some Dell blogs, but the options are not present at all. We suspect that old firmware or an old iDRAC version is causing this.

So at this point I do not think this is an OCP or operator issue, but if you have heard of anything related to Mellanox cards, I would appreciate hearing about it. I'll also post an update if we find anything in the next few days.

Thanks,

