Bug 1914414

Summary: SRIOV enablement for Emulex Corporation OneConnect NIC (10df:0720) is not working anymore
Product: OpenShift Container Platform Reporter: Denis Ollier <dollierp>
Component: Networking Assignee: Peng Liu <pliu>
Networking sub component: SR-IOV QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: bbennett, dosmith, pliu, zshi
Version: 4.7   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:29:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dmesg output from a worker node none

Description Denis Ollier 2021-01-08 18:52:15 UTC
Disclaimer
----------

I know this NIC model is not on the list of officially supported models, but it used to work fine with OCP 4.4 and/or 4.5.

Description of problem
----------------------

Trying to enable SRIOV for device Intel Corporation Ethernet Connection X722 (rev 09) (PCI ID 8086:37cc) fails with the following error:

> W0108 17:33:29.166332    5288 utils.go:70] DiscoverSriovDevices(): unable to parse device driver for device 0000:08:00.0 -> class: 'Network controller' vendor: 'Intel Corporation' product: 'Ethernet Connection X722' "error getting driver info for device 0000:08:00.0 readlink /sys/bus/pci/devices/0000:08:00.0/driver: no such file or directory"

Version-Release number of selected component
--------------------------------------------

> OCP: 4.7.0-fc.0
> RHCOS: 47.83.202012171901-0
> SR-IOV Network Operator: 4.6.0-202012160125.p0

Steps to Reproduce
------------------

1. Deploy an OCP BM IPI cluster.

2. Deploy the NFD operator to get nodes labeled as sriov capable:

> feature.node.kubernetes.io/network-sriov.capable: "true"
> feature.node.kubernetes.io/pci-10df.present: "true"
> feature.node.kubernetes.io/pci-10df.sriov.capable: "true"
> feature.node.kubernetes.io/pci-8086.present: "true"
> feature.node.kubernetes.io/pci-8086.sriov.capable: "true"

3. Deploy the SRIOV operator.

> NAME                                           DISPLAY                   VERSION                 REPLACES   PHASE
> sriov-network-operator.4.6.0-202012160125.p0   SR-IOV Network Operator   4.6.0-202012160125.p0              Succeeded

4. Disable the SR-IOV operator webhook as described in the OCP documentation:

> oc patch sriovoperatorconfig default \
>   --namespace='openshift-sriov-network-operator' \
>   --type='merge' \
>   --patch='{"spec":{"enableOperatorWebhook":false}}'

5. Create a SriovNetworkNodePolicy:

> ---
> kind: SriovNetworkNodePolicy
> apiVersion: sriovnetwork.openshift.io/v1
> metadata:
>   name: sriov-network-policy
>   namespace: openshift-sriov-network-operator
> spec:
>   resourceName: sriov_nics
>   nodeSelector:
>     feature.node.kubernetes.io/network-sriov.capable: "true"
>   nicSelector:
>     pfNames:
>       - ens2f0
>   deviceType: vfio-pci
>   numVfs: 16

Actual results
--------------

Nodes reboot several times during SR-IOV configuration, but no VFs are created for the ens2f0 NIC.

An error is logged by the sriov-network-config-daemon DaemonSet:

> W0108 17:33:29.166332    5288 utils.go:70] DiscoverSriovDevices(): unable to parse device driver for device 0000:08:00.0 -> class: 'Network controller' vendor: 'Intel Corporation' product: 'Ethernet Connection X722' "error getting driver info for device 0000:08:00.0 readlink /sys/bus/pci/devices/0000:08:00.0/driver: no such file or directory"

Expected results
----------------

16 VFs should be created for the ens2f0 NIC.

Additional info
---------------

No driver entry (normally a symlink) is actually present in /sys/bus/pci/devices/0000:08:00.0/:

> sudo ls -l /sys/bus/pci/devices/0000:08:00.0/
> 
> [snip]
> 
> -r--r--r--. 1 root root     4096 Jan  8 16:51 device
> -r--r--r--. 1 root root     4096 Jan  8 18:05 dma_mask_bits
> -rw-r--r--. 1 root root     4096 Jan  8 18:05 driver_override
> -rw-r--r--. 1 root root     4096 Jan  8 18:05 enable
> lrwxrwxrwx. 1 root root        0 Jan  8 16:53 firmware_node -> ../../../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01/device:f3/device:f5/device:fc/device:fd
> lrwxrwxrwx. 1 root root        0 Jan  8 16:53 iommu -> ../../../../../virtual/iommu/dmar4
> lrwxrwxrwx. 1 root root        0 Jan  8 16:53 iommu_group -> ../../../../../../kernel/iommu_groups/25
> 
> [snip]
> 
> -rw-r--r--. 1 root root     4096 Jan  8 18:05 sriov_drivers_autoprobe
> -rw-r--r--. 1 root root     4096 Jan  8 16:53 sriov_numvfs
> -r--r--r--. 1 root root     4096 Jan  8 18:05 sriov_offset
> -r--r--r--. 1 root root     4096 Jan  8 18:05 sriov_stride
> -r--r--r--. 1 root root     4096 Jan  8 16:53 sriov_totalvfs
> -r--r--r--. 1 root root     4096 Jan  8 18:05 sriov_vf_device
> 
> [snip]
> 

Looking at the code, we can see the following call chain:

> DiscoverSriovDevices()
> └─ https://github.com/openshift/sriov-network-operator/blob/release-4.7/pkg/utils/utils.go#L45-L119
>    └─ GetDriverName()
>       └─ https://github.com/openshift/sriov-network-operator/blob/release-4.7/vendor/github.com/intel/sriov-network-device-plugin/pkg/utils/utils.go#L343-L351

This code does not appear to have changed for some time, so the regression might come from a change outside of the SR-IOV operator (RHCOS kernel?).
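
For anyone triaging a similar report, the lookup that fails in the log can be approximated from a shell on the node. This is a minimal sketch, not the operator's code: judging only from the error message, GetDriverName() resolves the `driver` symlink of the PCI device, and the sketch assumes the driver name is the last path component of the symlink target. The function name `pci_driver_name` and the `SYSFS_ROOT` override are hypothetical and exist only to make the sketch testable.

```shell
#!/usr/bin/env bash
# Print the kernel driver bound to a PCI device, or fail if none is bound.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) used for testing.
pci_driver_name() {
    local addr="$1"
    local link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${addr}/driver"

    # An unbound device has no "driver" symlink at all, which is the
    # situation shown in the directory listing above.
    if [ ! -L "$link" ]; then
        echo "no driver bound to ${addr}" >&2
        return 1
    fi

    # Assumption: the symlink target ends in the driver name,
    # for example .../bus/pci/drivers/be2net
    basename "$(readlink "$link")"
}
```

On the affected node, `pci_driver_name 0000:08:00.0` would be expected to fail with "no driver bound", which is consistent with the `readlink` error in the daemon log.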

Comment 1 Peng Liu 2021-01-13 14:09:17 UTC
Could you check the dmesg output of the node? It looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1875338

Comment 2 Denis Ollier 2021-01-13 14:47:05 UTC
Created attachment 1747093
dmesg output from a worker node

Comment 3 Denis Ollier 2021-01-13 14:49:26 UTC
I didn't see any obvious error in dmesg, so I'm attaching it to the BZ for you to have a look.

Comment 4 Denis Ollier 2021-01-20 10:45:09 UTC
Thanks to Peng Liu, we found out that the error messages in the logs had misled my analysis.

The interface I'm trying to configure (ens2f0) is an Emulex Corporation OneConnect NIC (Skyhawk) (rev 11) with PCI id 10df:0720:

> Subsystem: Emulex Corporation Device e871
> Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 0
> Memory at dec0c000 (64-bit, prefetchable) [size=16K]
> Memory at de7e0000 (64-bit, prefetchable) [size=128K]
> Memory at de7c0000 (64-bit, prefetchable) [size=128K]
> Expansion ROM at dee80000 [disabled] [size=512K]
> Capabilities: [40] Power Management version 3
> Capabilities: [48] MSI-X: Enable+ Count=32 Masked-
> Capabilities: [c0] Express Endpoint, MSI 00
> Capabilities: [b8] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
> Capabilities: [160] Alternative Routing-ID Interpretation (ARI)
> Capabilities: [168] Device Serial Number 00-10-9b-ff-fe-35-88-b0
> Capabilities: [210] Secondary PCI Express
> Kernel driver in use: be2net
> Kernel modules: be2net

The Intel NIC mentioned in the logs is not the one I am trying to configure. In fact, this Intel NIC is not even listed by the `ip link show` command.
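
To avoid this kind of mix-up, an interface name can be mapped to its PCI address, vendor:device ID, and bound driver directly from sysfs. The following is a minimal sketch under the assumption that the interface is backed by a PCI device; the function name `nic_pci_info` and the `SYSFS_ROOT` override are hypothetical and exist only for illustration and testing.

```shell
#!/usr/bin/env bash
# Print "<ifname> <pci-address> <vendor>:<device> <driver>" for a network
# interface. SYSFS_ROOT is a hypothetical override (defaults to /sys).
nic_pci_info() {
    local ifname="$1"
    local dev="${SYSFS_ROOT:-/sys}/class/net/${ifname}/device"

    if [ ! -e "$dev" ]; then
        echo "${ifname}: no such interface or no backing device" >&2
        return 1
    fi

    local addr vendor device driver
    # The "device" entry is a symlink to the PCI device directory,
    # whose name is the PCI address.
    addr="$(basename "$(readlink -f "$dev")")"
    # The vendor and device files hold hex IDs such as 0x10df; drop the prefix.
    vendor="$(cat "$dev/vendor")"; vendor="${vendor#0x}"
    device="$(cat "$dev/device")"; device="${device#0x}"
    # A device with no bound driver has no "driver" symlink.
    if [ -L "$dev/driver" ]; then
        driver="$(basename "$(readlink "$dev/driver")")"
    else
        driver="none"
    fi

    echo "${ifname} ${addr} ${vendor}:${device} ${driver}"
}
```

Running `nic_pci_info ens2f0` on the node should report vendor:device 10df:0720 and driver be2net, in line with the `lspci` output above, rather than the Intel device named in the daemon log.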

Comment 5 Peng Liu 2021-01-20 10:55:17 UTC
The SR-IOV operator assumes that each node has at least one NIC from a supported vendor. This node has one Intel NIC, but its driver was not loaded as expected. As a result, the config daemon discovered no NICs from Intel or Mellanox, so none of the vendor plugins were loaded. With the current logic, the generic plugin does not configure any NICs on the node in that case.

Comment 7 Peng Liu 2021-06-27 12:10:55 UTC
It should be fixed by https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/145

Comment 9 zhaozhanqi 2021-07-14 07:33:16 UTC
Verified this bug on 4.9.0-202107090514

1. Add the unsupported adapter with `oc edit cm supported-nic-ids`.
2. Delete the config daemon pod and the webhook pods so that they are recreated.
3. Create a policy for the unsupported adapter; it is created successfully.

Comment 12 errata-xmlrpc 2021-10-18 17:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759