Bug 2087172 - NNCP deployment flakiness when having multiple SriovNetworkNodePolicy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.11.2
Assignee: Petr Horáček
QA Contact: Yossi Segev
URL:
Whiteboard:
Duplicates: 2137250
Depends On: 2094025 2103433
Blocks: 2137250
Reported: 2022-05-17 13:45 UTC by Adi Zavalkovsky
Modified: 2023-01-12 14:09 UTC (History)

Fixed In Version: registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-kubernetes-nmstate-handler-rhel8:v4.11.0-202209292219.p0.g2d445f0.assembly.stream
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2094025 2137250 (view as bug list)
Environment:
Last Closed: 2023-01-12 14:08:55 UTC
Target Upstream Version:
Embargoed:


Attachments
SriovNetworkNodePolicy.yaml (1.41 KB, text/plain)
2022-05-17 13:45 UTC, Adi Zavalkovsky


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-18652 0 None None None 2022-10-27 11:56:47 UTC
Red Hat Issue Tracker CNV-19788 0 None None None 2022-10-27 11:56:45 UTC
Red Hat Product Errata RHEA-2023:0155 0 None None None 2023-01-12 14:09:12 UTC

Description Adi Zavalkovsky 2022-05-17 13:45:17 UTC
Created attachment 1880479
SriovNetworkNodePolicy.yaml

Description of problem:
When multiple SriovNetworkNodePolicy objects target the same interface with different configurations, NNCP deployment sometimes fails with the following message:
libnmstate.error.NmstateVerificationError
  Found VF ports count does not match desired 32, current is:
NNCE cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com.static-ip-cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com: libnmstate.error.NmstateVerificationError"

To clarify: the applied VF count is 0, because one SriovNetworkNodePolicy sets the desired count to 0. The other policy sets the desired count to 32.
Both policies are attached.
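
As an illustration of the conflict described above, here is a minimal Python sketch that flags a PF targeted by several policies with different VF counts. The two policy dicts are hypothetical, simplified stand-ins for the attached files (their names and exact selectors are assumptions); only the `spec.nicSelector.pfNames` and `spec.numVfs` fields seen elsewhere in this report are used.

```python
from collections import defaultdict


def conflicting_pfs(policies):
    """Return {pf_name: sorted distinct numVfs values} for every PF that
    more than one policy targets with different VF counts."""
    vfs_by_pf = defaultdict(set)
    for policy in policies:
        spec = policy.get("spec", {})
        num_vfs = spec.get("numVfs", 0)
        for pf in spec.get("nicSelector", {}).get("pfNames", []):
            vfs_by_pf[pf].add(num_vfs)
    return {pf: sorted(counts) for pf, counts in vfs_by_pf.items() if len(counts) > 1}


# Hypothetical stand-ins for the two attached policies (names are made up).
policies = [
    {"metadata": {"name": "policy-a"},
     "spec": {"nicSelector": {"pfNames": ["eno1"]}, "numVfs": 0}},
    {"metadata": {"name": "policy-b"},
     "spec": {"nicSelector": {"pfNames": ["eno1"]}, "numVfs": 32}},
]

print(conflicting_pfs(policies))  # {'eno1': [0, 32]}
```

A check like this only reports that two policies disagree about the same PF; it does not say which one the SR-IOV operator ends up applying.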

NNS info about said interface -
[adi@fedora cnv-tests]$ oc get nns cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com -o yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkState
...
    - ethernet:
        auto-negotiation: false
        duplex: full
        speed: 10000
        sr-iov:
          total-vfs: 0
          vfs: []
      ipv4:
        address:
        - ip: 10.1.156.17
          prefix-length: 24
        auto-dns: true
        auto-gateway: true
        auto-route-table-id: 0
        auto-routes: true
        dhcp: true
        enabled: true
      ipv6:
        address:
        - ip: fe80::e643:4bff:feec:8400
          prefix-length: 64
        auto-dns: true
        auto-gateway: true
        auto-route-table-id: 0
        auto-routes: true
        autoconf: true
        dhcp: true
        enabled: true
      lldp:
        enabled: false
      mac-address: E4:43:4B:EC:84:00
      mtu: 1500
      name: eno1
      state: up
      type: ethernet
...

Version-Release number of selected component (if applicable):
kubernetes-nmstate-handler v4.10.1-12

How reproducible:
On any OpenShift cluster with CNV and the SR-IOV operator.

Steps to Reproduce:
1. Deploy the attached SriovNetworkNodePolicy objects (the default one may already be applied when installing the SR-IOV operator, in which case there is no need to apply it).
2. Deploy the following NNCP (adjust values):
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-ip-cnv-qe-infra-18.cnvqe2.lab.eng.rdu2.redhat.com
spec:
  desiredState:
    interfaces:
    - ipv4:
        address:
        - ip: 10.1.156.18
          prefix-length: 24
        auto-dns: true
        dhcp: false
        enabled: true
      ipv6:
        address:
        - ip: fe80::e643:4bff:feec:76d0
          prefix-length: 64
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: true
      name: eno1
      state: up
      type: ethernet
  nodeSelector:
    kubernetes.io/hostname: cnv-qe-infra-18.cnvqe2.lab.eng.rdu2.redhat.com

Actual results:
NNCP deployment fails with the following message:
libnmstate.error.NmstateVerificationError
  Found VF ports count does not match desired 32, current is:
NNCE cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com.static-ip-cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com: libnmstate.error.NmstateVerificationError"

Expected results:
The NNCP should be applied successfully.

Additional info:
Two points that this bug should address:
1. Why is nmstate concerned with the SR-IOV configuration when it is not asked to change it?
2. Why isn't nmstate able to determine which of the two SR-IOV policies represents the actual desired state? If the SR-IOV operator was able to deploy both policies, nmstate shouldn't bother with this.

Comment 1 Quique Llorente 2022-05-26 10:07:51 UTC
@azavalko Can you also add the full nmstate logs, either from the NNCE digest or from the handler pod logs?

Clearly nmstate should not take SR-IOV into account here. I remember we were having similar issues with vxlan + openshift-sdn; in the end they fixed it by ignoring vxlan if it is not part of the configuration. A similar solution should fix this.
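
The idea of ignoring what is not part of the configuration can be sketched in Python as a verification step that only compares the keys present in the desired state. This is an illustration of the principle, not nmstate's actual implementation; `state_matches` and the sample dicts are hypothetical. The second desired state models the failing case where an SR-IOV count of 32 ends up in the state being verified; how it got there is not shown in this report.

```python
def state_matches(desired, current):
    """Return True if everything present in `desired` is also found in
    `current`. Keys that exist only in `current` are ignored."""
    if isinstance(desired, dict):
        return isinstance(current, dict) and all(
            key in current and state_matches(value, current[key])
            for key, value in desired.items()
        )
    if isinstance(desired, list):
        return (
            isinstance(current, list)
            and len(desired) == len(current)
            and all(state_matches(d, c) for d, c in zip(desired, current))
        )
    return desired == current


# Current state, reduced to a few fields from the NNS output in the description.
current = {
    "name": "eno1",
    "state": "up",
    "type": "ethernet",
    "ethernet": {"sr-iov": {"total-vfs": 0, "vfs": []}},
}

# Desired state that does not mention SR-IOV at all, like the NNCP in this bug.
desired_from_nncp = {"name": "eno1", "state": "up", "type": "ethernet"}

# Assumed failing case: an SR-IOV count of 32 is part of what gets verified.
desired_with_sriov = dict(desired_from_nncp, ethernet={"sr-iov": {"total-vfs": 32}})

print(state_matches(desired_from_nncp, current))   # True
print(state_matches(desired_with_sriov, current))  # False
```

With this kind of subset comparison, a policy that never mentions SR-IOV cannot fail verification because of the VF count.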

Comment 6 Petr Horáček 2022-07-14 12:58:56 UTC
We have an RPM of nmstate that should fix it. We would like to install it on nmstate Pods, to verify that it resolves the issue first.

The RPM build will expire and will get deleted in 10 days.

@azavalko are you able to reproduce the issue, so we can confirm that the new RPM fixes it?

Comment 7 awax 2022-07-14 13:56:56 UTC
The old RPM has expired. We now have a new one and will test it.

Comment 13 Petr Horáček 2022-07-21 08:01:31 UTC
Waiting until August 2 for the fix to become available in RHEL, so we can rebuild downstream images.

Comment 14 Petr Horáček 2022-08-22 10:07:31 UTC
The fix should become available with nmstate-1.2.1-4.el8_6. The current released knmstate is still using nmstate-1.2.1-3.el8_6: https://catalog.redhat.com/software/containers/openshift4/ose-kubernetes-nmstate-handler-rhel8/5e97379dbed8bd66f83dffb0?tag=v4.11.0-202208020235.p0.ga6744d1.assembly.stream&push_date=1660126963000&container-tabs=packages
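
Since verification of this bug depends on which nmstate RPM a handler image ships, a small Python sketch for comparing a package string against the fixed build (nmstate-1.2.1-4.el8_6, as stated in this comment) may help. The helper names are hypothetical, and the comparison is deliberately simplified: it looks only at the dotted version and the leading number of the release, and is not a full RPM version comparison.

```python
import re

# Version and leading release number of the build that carries the fix.
FIXED = ((1, 2, 1), 4)


def parse_nvr(nvr):
    """Split 'name-version-release[.arch]' into
    (name, version tuple, leading release number)."""
    name, version, release = re.match(r"^(.+)-([^-]+)-([^-]+)$", nvr).groups()
    version_tuple = tuple(int(part) for part in version.split("."))
    release_number = int(re.match(r"\d+", release).group())
    return name, version_tuple, release_number


def has_fix(nvr):
    """True if the package is at least as new as the fixed build."""
    _, version, release = parse_nvr(nvr)
    return (version, release) >= FIXED


print(has_fix("nmstate-1.2.1-3.el8_6.x86_64"))  # False
print(has_fix("nmstate-1.2.1-4.el8_6"))         # True
```

The package string itself still has to be read from the running handler image; how that is done is outside this sketch.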

Comment 15 Petr Horáček 2022-10-03 14:27:33 UTC
The fix should be available in the recent knmstate 4.11 builds

Comment 16 Yossi Segev 2022-10-20 10:32:17 UTC
I installed a new BM cluster (OCP 4.11.9) with the latest knmstate, and it still uses nmstate-1.2.1-3.el8_6.x86_64.
Can't verify this bug yet.

Comment 17 Yossi Segev 2022-10-27 11:51:31 UTC
Checked again with
OCP 4.11.12
kubernetes-nmstate-operator.4.11.0-202208300306
nmstate in use is still nmstate-1.2.1-3.el8_6.x86_64, so the fix is still not available for our clusters.

Comment 18 Yossi Segev 2022-11-03 07:18:15 UTC
Re-checked, and the nmstate version with the fix is still not installed for 4.11.

Comment 19 Petr Horáček 2022-11-04 12:18:17 UTC
I'm looking at an OCP 4.11.5 cluster, with kubernetes-nmstate-operator.4.11.0-202210250857. It has nmstate-handler using registry.redhat.io/openshift4/ose-kubernetes-nmstate-handler-rhel8@sha256:6fd8cf5eb2fd19d6ae70d832cc2314ebbd1db2403f2c9b530af493fa8cc11f1b image. This image should have the required nmstate RPM in it: nmstate-1.2.1-4

It seems that the nmstate operator on your cluster was much older than that. Is it possible that the cluster is not configured to upgrade nmstate automatically, so it is stuck on the original release?

Comment 22 Yossi Segev 2022-11-07 13:24:29 UTC
According to the k8s-nmstate team, there is currently an issue with pulling nmstate 4.11 images.
https://github.com/openshift/kubernetes-nmstate/pull/312
I'll track it to see when it is resolved.

Comment 23 Petr Horáček 2022-11-09 13:44:50 UTC
*** Bug 2137250 has been marked as a duplicate of this bug. ***

Comment 24 Yossi Segev 2022-11-13 19:02:39 UTC
I have just checked, and kubernetes-nmstate-operator.4.11.0-202208300306 is still the one that is installed (with OCP 4.11.13).
So the bug cannot be verified yet.

Comment 25 Mor Cohen 2022-11-14 07:49:07 UTC
@ysegev I found out that the following bundle is available in the 4.12 index image - kubernetes-nmstate-operator.4.12.0-202211110827, can you verify the bug with this operator version?

Comment 26 Yossi Segev 2022-11-14 09:59:31 UTC
> @ysegev I found out that the following bundle is available in the 4.12 index image - kubernetes-nmstate-operator.4.12.0-202211110827, can you verify the bug with this operator version?

Unfortunately not. This bug should be verified on 4.11, with 4.11 components, including nmstate.

Comment 27 Yossi Segev 2022-11-21 12:55:14 UTC
Changed target release to 4.11.2, as the nmstate fix is still not available for OCP 4.11 (checked 3 days ago).

Comment 29 Yossi Segev 2022-12-22 12:37:56 UTC
Latest 4.11.z deployment job (with OCP 4.11.20) still installs kubernetes-nmstate-operator.4.11.0-202208300306, so our clusters still don't have the fix, and the bug cannot be verified yet.

Comment 30 Mor Cohen 2022-12-27 08:42:59 UTC
Hey Yossi, I have updated the script for installing nmstate and now we install it from production.

Comment 31 Yossi Segev 2022-12-29 17:05:28 UTC
Verified on:
OCP 4.11.20
CNV 4.11.2
kubernetes-nmstate-operator.4.11.0-202212070335


1. The default SriovNetworkNodePolicy, similar to the one attached to the original bug description, was already applied on the cluster (as part of the SR-IOV installation).
2. I applied the following policy, which is applied on the same PF interface (eno1):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-network-policy-2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  nicSelector:
    pfNames:
    - eno1
    rootDevices:
    - "0000:19:00.0"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 10
  resourceName: sriov_nics_2

3. Applied this NodeNetworkConfigurationPolicy, which configures the same interface (eno1):

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-ip-cnv-qe-infra-24.cnvqe2.lab.eng.rdu2.redhat.com
spec:
  desiredState:
    interfaces:
    - ipv4:
        address:
        - ip: 10.1.156.18
          prefix-length: 24
        auto-dns: true
        dhcp: false
        enabled: false
      name: eno1
      state: up
      type: ethernet
  nodeSelector:
    kubernetes.io/hostname: cnv-qe-infra-24.cnvqe2.lab.eng.rdu2.redhat.com

The NNCP was applied successfully.

Comment 38 errata-xmlrpc 2023-01-12 14:08:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.11.2 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:0155

