Disclaimer
----------
I know this NIC model is not part of the list of officially supported models, but it used to work fine with OCP 4.4 and/or 4.5.

Description of problem
----------------------
Trying to enable SRIOV for device Intel Corporation Ethernet Connection X722 (rev 09) (PCI ID 8086:37cc) fails with the following error:

> W0108 17:33:29.166332 5288 utils.go:70] DiscoverSriovDevices(): unable to parse device driver for device 0000:08:00.0 -> class: 'Network controller' vendor: 'Intel Corporation' product: 'Ethernet Connection X722' "error getting driver info for device 0000:08:00.0 readlink /sys/bus/pci/devices/0000:08:00.0/driver: no such file or directory"

Version-Release number of selected component
--------------------------------------------
> OCP: 4.7.0-fc.0
> RHCOS: 47.83.202012171901-0
> SR-IOV Network Operator: 4.6.0-202012160125.p0

Steps to Reproduce
------------------
1. Deploy an OCP BM IPI cluster.

2. Deploy the NFD operator to get nodes labeled as sriov capable:

> feature.node.kubernetes.io/network-sriov.capable: "true"
> feature.node.kubernetes.io/pci-10df.present: "true"
> feature.node.kubernetes.io/pci-10df.sriov.capable: "true"
> feature.node.kubernetes.io/pci-8086.present: "true"
> feature.node.kubernetes.io/pci-8086.sriov.capable: "true"

3. Deploy the SRIOV operator.

> NAME                                           DISPLAY                   VERSION                 REPLACES   PHASE
> sriov-network-operator.4.6.0-202012160125.p0   SR-IOV Network Operator   4.6.0-202012160125.p0              Succeeded

4. Disable sriov operator-webhook as described in OCP documentation:

> oc patch sriovoperatorconfig default \
>   --namespace='openshift-sriov-network-operator' \
>   --type='merge' \
>   --patch='{"spec":{"enableOperatorWebhook":false}}'

5. Create a SriovNetworkNodePolicy:

> ---
> kind: SriovNetworkNodePolicy
> apiVersion: sriovnetwork.openshift.io/v1
> metadata:
>   name: sriov-network-policy
>   namespace: openshift-sriov-network-operator
> spec:
>   resourceName: sriov_nics
>   nodeSelector:
>     feature.node.kubernetes.io/network-sriov.capable: "true"
>   nicSelector:
>     pfNames:
>     - ens2f0
>   deviceType: vfio-pci
>   numVfs: 16

Actual results
--------------
Nodes reboot several times during SRIOV configuration, but no VFs are created for the ens2f0 NIC. An error is logged by the sriov-network-config-daemon DaemonSet:

> W0108 17:33:29.166332 5288 utils.go:70] DiscoverSriovDevices(): unable to parse device driver for device 0000:08:00.0 -> class: 'Network controller' vendor: 'Intel Corporation' product: 'Ethernet Connection X722' "error getting driver info for device 0000:08:00.0 readlink /sys/bus/pci/devices/0000:08:00.0/driver: no such file or directory"

Expected results
----------------
16 VFs should be created for the ens2f0 NIC.

Additional info
---------------
No driver entry is actually present in /sys/bus/pci/devices/0000:08:00.0/:

> sudo ls -l /sys/bus/pci/devices/0000:08:00.0/
>
> [snip]
>
> -r--r--r--. 1 root root 4096 Jan 8 16:51 device
> -r--r--r--. 1 root root 4096 Jan 8 18:05 dma_mask_bits
> -rw-r--r--. 1 root root 4096 Jan 8 18:05 driver_override
> -rw-r--r--. 1 root root 4096 Jan 8 18:05 enable
> lrwxrwxrwx. 1 root root 0 Jan 8 16:53 firmware_node -> ../../../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01/device:f3/device:f5/device:fc/device:fd
> lrwxrwxrwx. 1 root root 0 Jan 8 16:53 iommu -> ../../../../../virtual/iommu/dmar4
> lrwxrwxrwx. 1 root root 0 Jan 8 16:53 iommu_group -> ../../../../../../kernel/iommu_groups/25
>
> [snip]
>
> -rw-r--r--. 1 root root 4096 Jan 8 18:05 sriov_drivers_autoprobe
> -rw-r--r--. 1 root root 4096 Jan 8 16:53 sriov_numvfs
> -r--r--r--. 1 root root 4096 Jan 8 18:05 sriov_offset
> -r--r--r--. 1 root root 4096 Jan 8 18:05 sriov_stride
> -r--r--r--. 1 root root 4096 Jan 8 16:53 sriov_totalvfs
> -r--r--r--. 1 root root 4096 Jan 8 18:05 sriov_vf_device
>
> [snip]

Looking at the code, we can see the following call chain:

> DiscoverSriovDevices()
> └─ https://github.com/openshift/sriov-network-operator/blob/release-4.7/pkg/utils/utils.go#L45-L119
>    └─ GetDriverName()
>       └─ https://github.com/openshift/sriov-network-operator/blob/release-4.7/vendor/github.com/intel/sriov-network-device-plugin/pkg/utils/utils.go#L343-L351

This code does not appear to have changed for some time, so the failure might be due to a change outside of the sriov operator (RHCOS kernel?).
Could you check the dmesg of the node? It looks like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1875338
Created attachment 1747093: dmesg output from a worker node
I didn't see any obvious error in dmesg. I'm attaching it to the BZ so that you can have a look.
Thanks to Peng Liu, we found out that the error messages in the logs had misled my analysis. The interface I'm trying to configure (ens2f0) is an Emulex Corporation OneConnect NIC (Skyhawk) (rev 11) with PCI ID 10df:0720:

> Subsystem: Emulex Corporation Device e871
> Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 0
> Memory at dec0c000 (64-bit, prefetchable) [size=16K]
> Memory at de7e0000 (64-bit, prefetchable) [size=128K]
> Memory at de7c0000 (64-bit, prefetchable) [size=128K]
> Expansion ROM at dee80000 [disabled] [size=512K]
> Capabilities: [40] Power Management version 3
> Capabilities: [48] MSI-X: Enable+ Count=32 Masked-
> Capabilities: [c0] Express Endpoint, MSI 00
> Capabilities: [b8] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
> Capabilities: [160] Alternative Routing-ID Interpretation (ARI)
> Capabilities: [168] Device Serial Number 00-10-9b-ff-fe-35-88-b0
> Capabilities: [210] Secondary PCI Express
> Kernel driver in use: be2net
> Kernel modules: be2net

The Intel NIC mentioned in the logs is not the one I am trying to configure; it is not even listed by the `ip link show` command.
In the sriov operator, we assume there is at least one NIC from the supported vendors. On this node there is one Intel NIC, but its driver was not loaded as expected. As a result, the config daemon discovered no NICs from Intel or Mellanox, and none of the vendor plugins was loaded. With the current logic, the generic plugin will not configure any NICs on the node in this case.
It should be fixed by https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/145
Verified this bug on 4.9.0-202107090514:

1. Add the unsupported adapter with `oc edit cm supported-nic-ids`.
2. Delete the config daemon pod and the webhook pods so that they are recreated.
3. Create a policy for the unsupported adapter; it is created successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759