Description of problem:
NNCE is failing with error:
ERROR: State editing already in progress.\nCommit, roll back or wait before retrying
There do not appear to be any other errors.

Version-Release number of selected component (if applicable):
OCP 4.10

How reproducible:
1. Make sure a filtering bridge interface exists (using MachineConfig in our case).
2. Create X policies for X vlan and bridge devices on top of the filtering bridge interface.
3. If all went well: reboot a node, and check for the nnce/nncp failed status after the node becomes Ready again.

Actual results:
NNCE is not applied and goes into failing state.

Node failing:
The node had been working, but started to fail for unknown reasons. It has been restarted, but the same errors occur. I did notice an upstream bug fix by a Red Hat engineer (dprince), but I'm guessing it would only mask the real error.

error: at node reconcile creating NodeNetworkState: Error updating nodeNetworkState: Operation cannot be fulfilled on nodenetworkstates.nmstate.io \"worker-11\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on nodenetworkstates.nmstate.io \"worker-11\": the object has been modified; please apply your changes to the latest version and try again\nError updating nodeNetworkState\ngithub.com/nmstate/kubernetes-nmstate/pkg/helper.UpdateCurr
(In reply to Eswar Vadla from comment #0)
> Making sure a filtering bridge interface exists (using MachineConfig in our
> case).

Note that in 4.10 it should not be necessary to use machine-configs for network configuration and I wouldn't recommend it since then you have two different operators managing network config files (MCO and kubernetes-nmstate). It doesn't sound like that's the problem here, but if there's configuration they can't use kubernetes-nmstate for we would recommend the networkConfig mechanism to deploy that initially:

https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configuring-host-network-interfaces-in-the-install-config-yaml-file_ipi-install-installation-workflow

It uses the same syntax as kubernetes-nmstate so it's easy to modify on day 2 with the operator and there's no chance of conflicts.
We believe that this issue has been fixed in CNV 4.10.5. Could you upgrade to that version?
Eswar, considering Quique's answers, is anything else needed before we close this?
We have been able to reproduce the "in progress" issue. It looks like if the nmstate-handler pod gets restarted before committing the nmstate checkpoint, the next "nmstatectl set" operation will fail because the checkpoint is still lingering around. We are going to put in a fix that rolls back checkpoints at the beginning of nmstate-handler pods. This cannot be an issue with node restart, since checkpoints disappear if the NetworkManager daemon gets restarted.
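For illustration only, here is a toy Python model of the behavior described in this comment: a state edit is refused while an uncommitted checkpoint exists, and rolling back pending checkpoints at handler startup unblocks it. The names ToyNetworkManager, CheckpointInProgress and handler_startup are hypothetical and made up for this sketch; this is not the real NetworkManager, nmstate or kubernetes-nmstate code.

```python
class CheckpointInProgress(Exception):
    """Raised when a state edit is attempted while a checkpoint is pending."""


class ToyNetworkManager:
    """Hypothetical stand-in for the checkpoint behavior (not a real API)."""

    def __init__(self):
        self.state = {}
        self.pending_checkpoint = None  # snapshot of state before an edit

    def apply(self, desired, commit=True):
        # A lingering checkpoint blocks any new edit.
        if self.pending_checkpoint is not None:
            raise CheckpointInProgress("State editing already in progress")
        self.pending_checkpoint = dict(self.state)
        self.state = dict(desired)
        if commit:
            self.commit()

    def commit(self):
        # Keep the new state and drop the checkpoint.
        self.pending_checkpoint = None

    def rollback(self):
        # Restore the snapshot, if any, and drop the checkpoint.
        if self.pending_checkpoint is not None:
            self.state = self.pending_checkpoint
            self.pending_checkpoint = None


def handler_startup(nm):
    """Assumed shape of the fix: roll back leftovers before doing anything."""
    nm.rollback()


if __name__ == "__main__":
    nm = ToyNetworkManager()
    nm.apply({"br0": "up"}, commit=False)  # handler dies before committing
    try:
        nm.apply({"br1test": "up"})
    except CheckpointInProgress as err:
        print("apply failed:", err)
    handler_startup(nm)  # restarted handler cleans up first
    nm.apply({"br1test": "up"})
    print("state after fix:", nm.state)
```

In this model the second apply fails only because nothing ever committed or rolled back the first one, which matches the reproduction described above.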
Added an upstream fix to remove any pending checkpoint before applying a new state, for example when the handler pods are restarted. We have also measured the elapsed time of each operation on the current Python version of nmstatectl and on the future Rust version, and the difference is nearly an order of magnitude. So, regarding performance, we will just wait to integrate the future Rust version of nmstate.
POST waiting for https://github.com/openshift/kubernetes-nmstate/pull/334
Changes were merged downstream https://github.com/openshift/kubernetes-nmstate/pull/339
The bug fix was verified on a PSI cluster:
Openshift version: 4.13.0-0.nightly-2023-03-23-000343
CNV version: v4.13.0.rhel9-1836

Verification steps:
1. Decide on which node to create the NNCP (but don't create it yet).
2. Find the nmstate-handler pod that is running on that node.
3. From the executor, create a checkpoint on that handler (this will simulate the reboot of the node):

cat <<EOF | oc exec -it nmstate-handler-4sxq2 -n openshift-nmstate -- nmstatectl apply --no-commit --timeout 60
---
{
  "interfaces": [
    {
      "name": "br0",
      "type": "linux-bridge",
      "state": "up",
      "bridge": {
        "options": {
          "stp": {
            "enabled": false
          }
        },
        "port": []
      }
    }
  ]
}
EOF

4. Within the timeout specified (60 seconds in my case), create the NNCP:

cat <<EOF | oc create -f -
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-1
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens8
      ipv4:
        dhcp: false
        enabled: false
      ipv6:
        enabled: false
      name: br1test
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: n-awax-413-o-gg24b-worker-0-6jbcn
EOF

5. Wait for the NNCP to be configured successfully.

The bug is solved.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:3205