Created attachment 1880479 [details]
SriovNetworkNodePolicy.yaml

Description of problem:
When there are multiple SriovNetworkNodePolicy objects for the same interface (each holding a different config), NNCP deployment sometimes fails with the following message:

libnmstate.error.NmstateVerificationError
Found VF ports count does not match desired 32, current is:
NNCE cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com.static-ip-cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com: libnmstate.error.NmstateVerificationError

To clarify: the applied VF ports count is 0, because one SriovNetworkNodePolicy sets the desired count to 0, while the other policy sets it to 32. Both policies are attached.

NNS info about the interface in question:

[adi@fedora cnv-tests]$ oc get nns cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com -o yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkState
...
- ethernet:
    auto-negotiation: false
    duplex: full
    speed: 10000
    sr-iov:
      total-vfs: 0
      vfs: []
  ipv4:
    address:
    - ip: 10.1.156.17
      prefix-length: 24
    auto-dns: true
    auto-gateway: true
    auto-route-table-id: 0
    auto-routes: true
    dhcp: true
    enabled: true
  ipv6:
    address:
    - ip: fe80::e643:4bff:feec:8400
      prefix-length: 64
    auto-dns: true
    auto-gateway: true
    auto-route-table-id: 0
    auto-routes: true
    autoconf: true
    dhcp: true
    enabled: true
  lldp:
    enabled: false
  mac-address: E4:43:4B:EC:84:00
  mtu: 1500
  name: eno1
  state: up
  type: ethernet
...

Version-Release number of selected component (if applicable):
kubernetes-nmstate-handler v4.10.1-12

How reproducible:
On any OpenShift cluster with CNV and the SR-IOV operator.

Steps to Reproduce:
1. Deploy the attached SriovNetworkNodePolicies (the default policy may already be applied when installing the SR-IOV operator, in which case there is no need to apply it).
2.
Deploy the following NNCP (adjust the values as needed):

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-ip-cnv-qe-infra-18.cnvqe2.lab.eng.rdu2.redhat.com
spec:
  desiredState:
    interfaces:
    - ipv4:
        address:
        - ip: 10.1.156.18
          prefix-length: 24
        auto-dns: true
        dhcp: false
        enabled: true
      ipv6:
        address:
        - ip: fe80::e643:4bff:feec:76d0
          prefix-length: 64
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: true
      name: eno1
      state: up
      type: ethernet
  nodeSelector:
    kubernetes.io/hostname: cnv-qe-infra-18.cnvqe2.lab.eng.rdu2.redhat.com

Actual results:
NNCP deployment fails with the following message:

libnmstate.error.NmstateVerificationError
Found VF ports count does not match desired 32, current is:
NNCE cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com.static-ip-cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com: libnmstate.error.NmstateVerificationError

Expected results:
The NNCP deployment should be applied successfully.

Additional info:
Two points that this bug should address:
1. Why should nmstate be concerned with SR-IOV config when it is not required to make changes?
2. Why isn't nmstate able to determine which of the two SR-IOV policies represents the actual desired state? If the SR-IOV operator was able to deploy both policies, nmstate shouldn't bother with this.
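For illustration, a minimal sketch of two conflicting SriovNetworkNodePolicy objects of the kind described above. This is a hypothetical reconstruction: the policy names, resourceName values, and selectors are placeholders, not taken from the attached SriovNetworkNodePolicy.yaml.

```yaml
# Hypothetical sketch -- names, nicSelector values and resourceName are
# placeholders; the actual attached policies may differ.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-a
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  nicSelector:
    pfNames:
    - eno1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 32          # one policy asks for 32 VFs on eno1
  resourceName: sriov_nics_a
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-b
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  nicSelector:
    pfNames:
    - eno1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 0           # the other policy asks for 0 VFs on the same PF
  resourceName: sriov_nics_b
```

With both applied to the same PF, the node ends up with one VF count while nmstate's verification expects the other, matching the NmstateVerificationError above.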
@azavalko Can you also add the full nmstate logs, either from the NNCE digest or from the handler pod logs? Clearly nmstate should not take SR-IOV into account here. I remember we were having similar issues with vxlan + openshift-sdn; in the end they fixed it by ignoring vxlan if it is not part of the configuration. A similar solution should fix this.
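To illustrate the point: the failing NNCP's desired state (from the bug description) never mentions sr-iov, so the VF count should not be part of verification. A minimal sketch of what is, and is not, declared:

```yaml
# Relevant part of the failing desired state, condensed from the bug
# description. No sr-iov stanza is present, so nmstate's "VF ports count"
# verification compares state it was never asked to change.
interfaces:
- name: eno1
  type: ethernet
  state: up
  ipv4:
    enabled: true
    dhcp: false
    address:
    - ip: 10.1.156.18
      prefix-length: 24
  # sr-iov: intentionally absent -- VF provisioning is owned by the
  # SR-IOV operator, not by this NNCP
```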
We have an RPM of nmstate that should fix it. We would like to install it on the nmstate pods first, to verify that it resolves the issue. The RPM build will expire and be deleted in 10 days. @azavalko, are you able to reproduce the issue, so we can confirm that the new RPM fixes it?
The old RPM has expired. We now have a new one and will test it.
Waiting until August 2 for the fix to become available in RHEL, so we can rebuild the downstream images.
The fix should become available with nmstate-1.2.1-4.el8_6. The currently released knmstate is still using nmstate-1.2.1-3.el8_6: https://catalog.redhat.com/software/containers/openshift4/ose-kubernetes-nmstate-handler-rhel8/5e97379dbed8bd66f83dffb0?tag=v4.11.0-202208020235.p0.ga6744d1.assembly.stream&push_date=1660126963000&container-tabs=packages
The fix should be available in the recent knmstate 4.11 builds
I installed a new bare-metal cluster (OCP 4.11.9) with the latest knmstate, and it still uses nmstate-1.2.1-3.el8_6.x86_64, so I can't verify this bug yet.
Checked again with OCP 4.11.12 and kubernetes-nmstate-operator.4.11.0-202208300306. The nmstate in use is still nmstate-1.2.1-3.el8_6.x86_64, so the fix is still not available on our clusters.
Re-checked, and the nmstate version with the fix is still not installed for 4.11.
I'm looking at an OCP 4.11.5 cluster with kubernetes-nmstate-operator.4.11.0-202210250857. Its nmstate-handler uses the registry.redhat.io/openshift4/ose-kubernetes-nmstate-handler-rhel8@sha256:6fd8cf5eb2fd19d6ae70d832cc2314ebbd1db2403f2c9b530af493fa8cc11f1b image, which should contain the required nmstate RPM: nmstate-1.2.1-4. It seems that the nmstate operator on your cluster was much older than that. Is it possible that the cluster is not configured to upgrade nmstate automatically, so it is stuck on the original release?
According to the k8s-nmstate team, there is currently an issue with pulling nmstate 4.11 images. https://github.com/openshift/kubernetes-nmstate/pull/312 I'll track it to see when it is resolved.
*** Bug 2137250 has been marked as a duplicate of this bug. ***
I have just checked, and kubernetes-nmstate-operator.4.11.0-202208300306 is still the one that is installed (with OCP 4.11.13). So the bug cannot be verified yet.
@ysegev I found out that the following bundle is available in the 4.12 index image - kubernetes-nmstate-operator.4.12.0-202211110827, can you verify the bug with this operator version?
> @ysegev I found out that the following bundle is available in the 4.12 index image - kubernetes-nmstate-operator.4.12.0-202211110827, can you verify the bug with this operator version? Unfortunately not; this bug should be verified on 4.11, with 4.11 components, including nmstate.
Changed target release to 4.11.2, as the nmstate fix is still not available for OCP 4.11 yet (checked 3 days ago).
Latest 4.11.z deployment job (with OCP 4.11.20) still installs kubernetes-nmstate-operator.4.11.0-202208300306, so our clusters still don't have the fix, and the bug cannot be verified yet.
Hey Yossi, I have updated the script for installing nmstate and now we install it from production.
Verified on:
OCP 4.11.20
CNV 4.11.2
kubernetes-nmstate-operator.4.11.0-202212070335

1. The default SriovNetworkNodePolicy, similar to the one attached to the original bug description, was already applied on the cluster (as part of the SR-IOV installation).

2. I applied the following policy, which is applied on the same PF interface (eno1):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-network-policy-2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  nicSelector:
    pfNames:
    - eno1
    rootDevices:
    - "0000:19:00.0"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 10
  resourceName: sriov_nics_2

3. Applied this NodeNetworkConfigurationPolicy, which sets the same interface (eno1):

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-ip-cnv-qe-infra-24.cnvqe2.lab.eng.rdu2.redhat.com
spec:
  desiredState:
    interfaces:
    - ipv4:
        address:
        - ip: 10.1.156.18
          prefix-length: 24
        auto-dns: true
        dhcp: false
        enabled: false
      name: eno1
      state: up
      type: ethernet
  nodeSelector:
    kubernetes.io/hostname: cnv-qe-infra-24.cnvqe2.lab.eng.rdu2.redhat.com

The NNCP was applied successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.11.2 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:0155