Bug 1850505

Summary: SR-IOV webhook fails with "no matched NIC is selected by the nicSelector"
Product: OpenShift Container Platform
Component: Networking
Networking sub component: SR-IOV
Version: 4.5
Target Release: 4.6.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Petr Horáček <phoracek>
Assignee: Peng Liu <pliu>
QA Contact: zhaozhanqi <zzhao>
CC: gkapoor, igarciam
Flags: pliu: needinfo-
Last Closed: 2020-10-27 16:09:32 UTC
Attachments: sriov-webhook.log

Description Petr Horáček 2020-06-24 12:18:07 UTC
Description of problem:
The webhook rejects a new policy, claiming that it does not match any available NIC, even though matching NICs are available and this worked with the 4.4 SR-IOV operator.


Version-Release number of selected component (if applicable):
OCP 4.5.0-rc.2
SR-IOV operator 4.5 rh-verified-operators


How reproducible:
Always


Steps to Reproduce:
1. Check available NICs
    - deviceID: "1572"
      driver: i40e
      mtu: 1500
      name: ens2f1
      pciAddress: 0000:3b:00.1
      totalvfs: 64
      vendor: "8086"
2. Create a policy:
cat <<EOF | oc create -f -
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ens2f1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  nicSelector:
    rootDevices:
    - 0000:3b:00.1
    vendor: "8086"
    pfNames:
    - ens2f1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 5
  resourceName: sriov_net
EOF

Actual results:
It fails with:
Error from server (no matched NIC is selected by the nicSelector in CR policy-ens2f1): error when creating "STDIN": admission webhook "operator-webhook.sriovnetwork.openshift.io" denied the request: no matched NIC is selected by the nicSelector in CR policy-ens2f1


Expected results:
The policy should be accepted and configured.

Comment 1 Petr Horáček 2020-06-24 12:23:58 UTC
Created attachment 1698586 [details]
sriov-webhook.log

Comment 2 Peng Liu 2020-06-29 13:28:39 UTC
Hi Petr,

Could you share the output of 'oc get node -o yaml' and 'oc get sriovnetworknodestate -o yaml' for nodes cnvqe-10.lab.eng.tlv2.redhat.com and cnvqe-11.lab.eng.tlv2.redhat.com?

Comment 4 Peng Liu 2020-06-29 14:04:52 UTC
Hi Geetika,

I cannot reproduce this issue in my environment. Does it always happen in yours?
From the node state manifests you attached, it looks like the policy has been applied. Did you turn off the operator webhook to make that happen?

Comment 5 Petr Horáček 2020-06-29 14:17:02 UTC
Yes, we disabled the webhook to get past this issue. It always happened in our 4.5 environment.

Comment 7 Peng Liu 2020-06-29 15:19:16 UTC
The device you want to select is not in the supported NIC list, so it is blocked by https://github.com/openshift/sriov-network-operator/pull/204. This is expected behavior. If you want to configure an unsupported NIC model, you need to disable the operator webhook.

Comment 8 Petr Horáček 2020-06-29 15:43:47 UTC
Makes sense. So all that needs to be done here is to fix the error message from "no matched NIC is selected by the nicSelector" to something more appropriate?

Comment 9 Peng Liu 2020-06-30 02:52:25 UTC
How about changing it to "no supported NIC is selected by the nicSelector"?
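
To make the distinction between "no matched NIC" and "no supported NIC" concrete, here is a minimal Python sketch of the two failure cases discussed in comments 7 to 9. It is not the operator's actual code (the operator is written in Go); the names `SUPPORTED_MODELS`, `Nic`, `NicSelector`, `selector_matches` and `validate` are hypothetical, and the single supported entry (vendor 8086, device ID 158b, assumed here to stand for an XXV710) is an illustrative assumption rather than the real supported list.

```python
from dataclasses import dataclass, field

# Hypothetical supported-model list keyed by (vendor, deviceID).
# Assumption for illustration: 8086:158b stands in for a supported XXV710.
SUPPORTED_MODELS = {("8086", "158b")}


@dataclass
class Nic:
    name: str
    pci_address: str
    vendor: str
    device_id: str


@dataclass
class NicSelector:
    vendor: str = ""
    pf_names: list = field(default_factory=list)
    root_devices: list = field(default_factory=list)


def selector_matches(sel, nic):
    """Return True if every non-empty selector field matches the NIC."""
    if sel.vendor and sel.vendor != nic.vendor:
        return False
    if sel.pf_names and nic.name not in sel.pf_names:
        return False
    if sel.root_devices and nic.pci_address not in sel.root_devices:
        return False
    return True


def validate(sel, nics):
    """Return an error message, or None if the selector picks a supported NIC."""
    matched = [n for n in nics if selector_matches(sel, n)]
    if not matched:
        # The selector fields do not correspond to any device on the node.
        return "no matched NIC is selected by the nicSelector"
    supported = [n for n in matched if (n.vendor, n.device_id) in SUPPORTED_MODELS]
    if not supported:
        # A device matched, but its model is not in the supported list.
        return "no supported NIC is selected by the nicSelector"
    return None
```

With the NIC from the reproduction steps (vendor 8086, device ID 1572, ens2f1 at 0000:3b:00.1), this sketch returns the "no supported NIC" message, which describes the situation more accurately than "no matched NIC".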

Comment 10 Petr Horáček 2020-06-30 07:36:20 UTC
That sounds ok.

I have some second thoughts. I get that Red Hat can't officially support models it does not test. However, the operator is known to work with other similar models too, and it seems wasteful to me to hard-limit ourselves to a subset of them. Could we have a configuration option that disables the model check instead of disabling the whole webhook? It would have the same effect, it would be 100% explicit, and we would still be able to use the other features of the webhook.

It would help the upstream community, including me: I have an X710 (not the supported XXV710).
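
As a rough illustration of the option requested in this comment, the following Python sketch shows a model check that can be switched off explicitly while the rest of the selection logic keeps running. Everything here is hypothetical: `skip_model_check`, `select_nics` and `SUPPORTED_MODELS` are made-up names, the supported entry is an assumed placeholder, and nothing in this thread says the operator exposes such a setting.

```python
# Hypothetical supported-model list; the single entry is an assumed placeholder.
SUPPORTED_MODELS = {("8086", "158b")}


def select_nics(nics, pf_names, skip_model_check=False):
    """Return the NICs named in pf_names.

    Unsupported models are dropped unless skip_model_check is True,
    so the name-based selection still applies in both modes.
    """
    selected = []
    for nic in nics:
        if nic["name"] not in pf_names:
            continue  # selection by name is always enforced
        is_supported = (nic["vendor"], nic["deviceID"]) in SUPPORTED_MODELS
        if is_supported or skip_model_check:
            selected.append(nic)
    return selected


# The NIC from the reproduction steps, reduced to the fields used here.
x710 = {"name": "ens2f1", "vendor": "8086", "deviceID": "1572"}
default_result = select_nics([x710], ["ens2f1"])
relaxed_result = select_nics([x710], ["ens2f1"], skip_model_check=True)
```

In this sketch `default_result` is empty and `relaxed_result` contains the X710, which is the behavior asked for: the model restriction is lifted explicitly, without giving up the other validation.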

Comment 11 Peng Liu 2020-07-01 13:19:23 UTC
Hi Petr,
I agree that we should have a way to let users try unsupported NIC models. We're planning a proper, systematic solution for that; it is on our to-do list.
For this BZ, can we close it with the error message change?

Comment 12 Petr Horáček 2020-07-01 13:22:30 UTC
I can hardly complain about something which is unsupported, the message change would be great :)

For the solution you mention, could I track it anywhere? On Jira maybe?

Comment 13 Peng Liu 2020-07-01 13:49:25 UTC
It hasn't started yet. I'll keep you posted.

Comment 16 zhaozhanqi 2020-07-07 03:11:26 UTC
Verified this bug

Comment 18 errata-xmlrpc 2020-10-27 16:09:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196