Bug 1914414 - SRIOV enablement for Emulex Corporation OneConnect NIC (10df:0720) is not working anymore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-08 18:52 UTC by Denis Ollier
Modified: 2021-10-18 17:29 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:29:03 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg output from a worker node (169.86 KB, text/plain)
2021-01-13 14:47 UTC, Denis Ollier


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:29:49 UTC

Description Denis Ollier 2021-01-08 18:52:15 UTC
Disclaimer
----------

I know this NIC model is not on the list of officially supported models, but it used to work fine with OCP 4.4 and/or 4.5.

Description of problem
----------------------

Trying to enable SRIOV for device Intel Corporation Ethernet Connection X722 (rev 09) (PCI ID 8086:37cc) fails with the following error:

> W0108 17:33:29.166332    5288 utils.go:70] DiscoverSriovDevices(): unable to parse device driver for device 0000:08:00.0 -> class: 'Network controller' vendor: 'Intel Corporation' product: 'Ethernet Connection X722' "error getting driver info for device 0000:08:00.0 readlink /sys/bus/pci/devices/0000:08:00.0/driver: no such file or directory"

Version-Release number of selected component
--------------------------------------------

> OCP: 4.7.0-fc.0
> RHCOS: 47.83.202012171901-0
> SR-IOV Network Operator: 4.6.0-202012160125.p0

Steps to Reproduce
------------------

1. Deploy an OCP BM IPI cluster.

2. Deploy the NFD operator to get nodes labeled as sriov capable:

> feature.node.kubernetes.io/network-sriov.capable: "true"
> feature.node.kubernetes.io/pci-10df.present: "true"
> feature.node.kubernetes.io/pci-10df.sriov.capable: "true"
> feature.node.kubernetes.io/pci-8086.present: "true"
> feature.node.kubernetes.io/pci-8086.sriov.capable: "true"

3. Deploy the SRIOV operator.

> NAME                                           DISPLAY                   VERSION                 REPLACES   PHASE
> sriov-network-operator.4.6.0-202012160125.p0   SR-IOV Network Operator   4.6.0-202012160125.p0              Succeeded

4. Disable sriov operator-webhook as described in OCP documentation:

> oc patch sriovoperatorconfig default \
>   --namespace='openshift-sriov-network-operator' \
>   --type='merge' \
>   --patch='{"spec":{"enableOperatorWebhook":false}}'

5. Create a SriovNetworkNodePolicy:

> ---
> kind: SriovNetworkNodePolicy
> apiVersion: sriovnetwork.openshift.io/v1
> metadata:
>   name: sriov-network-policy
>   namespace: openshift-sriov-network-operator
> spec:
>   resourceName: sriov_nics
>   nodeSelector:
>     feature.node.kubernetes.io/network-sriov.capable: "true"
>   nicSelector:
>     pfNames:
>       - ens2f0
>   deviceType: vfio-pci
>   numVfs: 16

Actual results
--------------

Nodes reboot several times during SRIOV configuration, but no VFs are created for the ens2f0 NIC.

An error is logged by the sriov-network-config-daemon DaemonSet:

> W0108 17:33:29.166332    5288 utils.go:70] DiscoverSriovDevices(): unable to parse device driver for device 0000:08:00.0 -> class: 'Network controller' vendor: 'Intel Corporation' product: 'Ethernet Connection X722' "error getting driver info for device 0000:08:00.0 readlink /sys/bus/pci/devices/0000:08:00.0/driver: no such file or directory"

Expected results
----------------

16 VFs should be created for the ens2f0 NIC.

Additional info
---------------

No driver file is actually present in /sys/bus/pci/devices/0000:08:00.0/:

> sudo ls -l /sys/bus/pci/devices/0000:08:00.0/
> 
> [snip]
> 
> -r--r--r--. 1 root root     4096 Jan  8 16:51 device
> -r--r--r--. 1 root root     4096 Jan  8 18:05 dma_mask_bits
> -rw-r--r--. 1 root root     4096 Jan  8 18:05 driver_override
> -rw-r--r--. 1 root root     4096 Jan  8 18:05 enable
> lrwxrwxrwx. 1 root root        0 Jan  8 16:53 firmware_node -> ../../../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01/device:f3/device:f5/device:fc/device:fd
> lrwxrwxrwx. 1 root root        0 Jan  8 16:53 iommu -> ../../../../../virtual/iommu/dmar4
> lrwxrwxrwx. 1 root root        0 Jan  8 16:53 iommu_group -> ../../../../../../kernel/iommu_groups/25
> 
> [snip]
> 
> -rw-r--r--. 1 root root     4096 Jan  8 18:05 sriov_drivers_autoprobe
> -rw-r--r--. 1 root root     4096 Jan  8 16:53 sriov_numvfs
> -r--r--r--. 1 root root     4096 Jan  8 18:05 sriov_offset
> -r--r--r--. 1 root root     4096 Jan  8 18:05 sriov_stride
> -r--r--r--. 1 root root     4096 Jan  8 16:53 sriov_totalvfs
> -r--r--r--. 1 root root     4096 Jan  8 18:05 sriov_vf_device
> 
> [snip]
> 

Looking at the code, we can see the following call chain:

> DiscoverSriovDevices()
> └─ https://github.com/openshift/sriov-network-operator/blob/release-4.7/pkg/utils/utils.go#L45-L119
>    └─ GetDriverName()
>       └─ https://github.com/openshift/sriov-network-operator/blob/release-4.7/vendor/github.com/intel/sriov-network-device-plugin/pkg/utils/utils.go#L343-L351

This code does not appear to have changed for some time, so the failure might be due to a change outside of the SR-IOV operator (RHCOS kernel?).

Comment 1 Peng Liu 2021-01-13 14:09:17 UTC
Could you check the dmesg of the node? It looks like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1875338

Comment 2 Denis Ollier 2021-01-13 14:47:05 UTC
Created attachment 1747093
dmesg output from a worker node

Comment 3 Denis Ollier 2021-01-13 14:49:26 UTC
I didn't see any obvious error in dmesg, so I'm attaching it to the BZ for you to have a look.

Comment 4 Denis Ollier 2021-01-20 10:45:09 UTC
Thanks to Peng Liu, we found out that the error messages in the logs had misled my analysis.

The interface I'm trying to configure (ens2f0) is an Emulex Corporation OneConnect NIC (Skyhawk) (rev 11) with PCI ID 10df:0720:

> Subsystem: Emulex Corporation Device e871
> Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 0
> Memory at dec0c000 (64-bit, prefetchable) [size=16K]
> Memory at de7e0000 (64-bit, prefetchable) [size=128K]
> Memory at de7c0000 (64-bit, prefetchable) [size=128K]
> Expansion ROM at dee80000 [disabled] [size=512K]
> Capabilities: [40] Power Management version 3
> Capabilities: [48] MSI-X: Enable+ Count=32 Masked-
> Capabilities: [c0] Express Endpoint, MSI 00
> Capabilities: [b8] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
> Capabilities: [160] Alternative Routing-ID Interpretation (ARI)
> Capabilities: [168] Device Serial Number 00-10-9b-ff-fe-35-88-b0
> Capabilities: [210] Secondary PCI Express
> Kernel driver in use: be2net
> Kernel modules: be2net

The Intel NIC mentioned in the logs is not the one I am trying to configure. That Intel NIC is not even listed by the `ip link show` command.

Comment 5 Peng Liu 2021-01-20 10:55:17 UTC
In the SR-IOV operator, we assume that there is at least one NIC from the supported vendors. On this node there is one Intel NIC, but its driver was not loaded as expected. As a result, the config daemon discovered no NICs from Intel or Mellanox, and none of the vendor plugins was loaded. With the current logic, the generic plugin will not configure any NICs on the node in this case.
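
The behaviour described above can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not the operator's implementation: the `nic` type, the plugin names, and `selectPlugins` are all hypothetical, and the sketch represents "the generic plugin will not configure any NICs" as the generic plugin simply not being selected.

```go
package main

import "fmt"

// nic is a simplified, hypothetical view of a PCI network device.
type nic struct {
	Name   string
	Vendor string // PCI vendor ID
	Driver string // empty when no kernel driver is bound
}

// vendorPlugins maps PCI vendor IDs to hypothetical plugin names.
var vendorPlugins = map[string]string{
	"8086": "intel_plugin",    // Intel
	"15b3": "mellanox_plugin", // Mellanox
}

// selectPlugins returns the plugins that would act on a node.
// Assumption: the generic plugin only runs alongside a vendor plugin.
func selectPlugins(nics []nic) []string {
	var plugins []string
	seen := map[string]bool{}
	for _, n := range nics {
		if n.Driver == "" {
			continue // devices without a bound driver are not discovered
		}
		if p, ok := vendorPlugins[n.Vendor]; ok && !seen[p] {
			seen[p] = true
			plugins = append(plugins, p)
		}
	}
	if len(plugins) > 0 {
		plugins = append(plugins, "generic_plugin")
	}
	return plugins
}

func main() {
	// The node from this report: Intel X722 without a driver, Emulex on be2net.
	node := []nic{
		{Name: "X722", Vendor: "8086", Driver: ""},
		{Name: "ens2f0", Vendor: "10df", Driver: "be2net"},
	}
	fmt.Println("plugins selected:", len(selectPlugins(node)))
}
```

Under these assumptions, the Emulex NIC is left unconfigured even though its own driver is loaded, because nothing on the node triggers a vendor plugin.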

Comment 7 Peng Liu 2021-06-27 12:10:55 UTC
It should be fixed by https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/145

Comment 9 zhaozhanqi 2021-07-14 07:33:16 UTC
Verified this bug on 4.9.0-202107090514

1. Add the unsupported adapter with `oc edit cm supported-nic-ids`.
2. Delete the config daemon pod and the webhook pods so that they are recreated.
3. Create a policy for the unsupported adapter; it is created successfully.

Comment 12 errata-xmlrpc 2021-10-18 17:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

