Bug 2116562

Summary:	NodeNetworkConfigurationPolicy "ERROR: State editing already in progress. Commit, roll back or wait before retrying"
Product:	Container Native Virtualization (CNV)	Reporter:	Eswar Vadla <evadla>
Component:	Networking	Assignee:	Quique Llorente <ellorent>
Status:	CLOSED ERRATA	QA Contact:	awax
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.10.0	CC:	alitke, bnemec, cstabler, ellorent, hchaturv, igarcia, nrozen, phoracek
Target Milestone:	---	Flags:	hchaturv: needinfo- hchaturv: needinfo-
Target Release:	4.13.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.13 nightly	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-05-18 02:55:41 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Eswar Vadla 2022-08-08 20:03:52 UTC

Description of problem:

NNCE is failing with error:

ERROR: State editing already in progress.\nCommit, roll back or wait before retrying

There do not appear to be any other errors.

Version-Release number of selected component (if applicable):
OCP 4.10

How reproducible:
Make sure a filtering bridge interface exists (using MachineConfig in our case).

Create X policies for X VLAN and bridge devices on top of the filtering bridge interface.

If all went well, reboot a node and check for a failed NNCE/NNCP status after the node becomes Ready again.


Actual results:

NNCE is not applied and goes into failing state.

node   Failing

The node had been working, but started to fail for unknown reasons. It has been restarted, but the same errors occur.

I did notice an upstream bug fix by a Red Hat engineer (dprince), but I'm guessing it only addresses the error below, which is masking the real one.



error: at node reconcile creating NodeNetworkState: Error updating nodeNetworkState: Operation cannot be fulfilled on nodenetworkstates.nmstate.io \"worker-11\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on nodenetworkstates.nmstate.io \"worker-11\": the object has been modified; please apply your changes to the latest version and try again\nError updating nodeNetworkState\ngithub.com/nmstate/kubernetes-nmstate/pkg/helper.UpdateCurr

Comment 3 Ben Nemec 2022-08-09 17:28:33 UTC

(In reply to Eswar Vadla from comment #0)
> Making sure a filtering bridge interface exists (using MachineConfig in our
> case).

Note that in 4.10 it should not be necessary to use machine-configs for network configuration, and I wouldn't recommend it, since you then have two different operators managing network config files (MCO and kubernetes-nmstate). It doesn't sound like that's the problem here, but if there's configuration they can't use kubernetes-nmstate for, we would recommend the networkConfig mechanism to deploy that initially: https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configuring-host-network-interfaces-in-the-install-config-yaml-file_ipi-install-installation-workflow

It uses the same syntax as kubernetes-nmstate, so it's easy to modify on day 2 with the operator and there's no chance of conflicts.

Comment 11 Petr Horáček 2022-10-11 08:26:37 UTC

We believe that this issue has been fixed in CNV 4.10.5. Could you upgrade to that version?

Comment 18 Petr Horáček 2022-12-08 10:19:23 UTC

Eswar, considering Quique's answers, is anything else needed before we close this?

Comment 23 Quique Llorente 2023-01-04 09:30:49 UTC

We have been able to reproduce the "in progress" issue. It looks like if the nmstate-handler pod gets restarted before committing the nmstate checkpoint, the next "nmstatectl set" operation fails because the checkpoint is still lingering around. We are going to add a fix that rolls back checkpoints when the nmstate-handler pods start.

This cannot be an issue with node restart, since checkpoints disappear when the NetworkManager daemon gets restarted.
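
For a handler that is already stuck on this error, the lingering checkpoint can presumably be discarded by hand. The following is a minimal sketch, not the upstream fix itself: the function name is hypothetical, the openshift-nmstate namespace is an assumption about where the handler pods run, and it assumes that "nmstatectl rollback" without an argument rolls back the pending checkpoint.

```shell
#!/usr/bin/env bash

# Sketch: discard a pending nmstate checkpoint from inside a handler pod.
# Hypothetical helper; the pod name is supplied by the caller and the
# namespace defaults to openshift-nmstate (assumption).
rollback_stale_checkpoint() {
  local pod="$1"
  local ns="${2:-openshift-nmstate}"
  # Assumption: with no checkpoint argument, rollback targets the pending one.
  oc exec "$pod" -n "$ns" -- nmstatectl rollback
}
```

Call it as "rollback_stale_checkpoint <handler-pod-name>". The command may exit non-zero when no checkpoint is pending, so treat a failure here as informational.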

Comment 24 Quique Llorente 2023-01-11 09:42:44 UTC

Added an upstream fix to remove any pending checkpoint before applying a new state, for example after the handler pods are restarted.

We have also measured the elapsed time of each operation with the current Python version of nmstatectl and the future Rust version,
and the difference is nearly an order of magnitude, so as far as performance is concerned we will just wait to integrate the future
Rust version of nmstate.

Comment 25 Quique Llorente 2023-01-31 12:16:11 UTC

POST waiting for https://github.com/openshift/kubernetes-nmstate/pull/334

Comment 28 Petr Horáček 2023-02-23 08:43:41 UTC

Changes were merged downstream: https://github.com/openshift/kubernetes-nmstate/pull/339

Comment 33 awax 2023-03-30 08:51:08 UTC

The bug fix was verified on a PSI cluster:
OpenShift version: 4.13.0-0.nightly-2023-03-23-000343
CNV version: v4.13.0.rhel9-1836

Verification steps:
1. Decide on which node to create the NNCP (but don't create it yet).
2. Find the nmstate-handler pod that is running on that node.
3. From the executor, create a checkpoint on that handler (this will simulate the reboot of the node):
cat <<EOF | oc exec -it nmstate-handler-4sxq2 -n openshift-nmstate -- nmstatectl apply --no-commit --timeout 60
---
{
  "interfaces": [
    {
      "name": "br0",
      "type": "linux-bridge",
      "state": "up",
      "bridge": {
        "options": {
          "stp": {
            "enabled": false
          }
        },
        "port": []
      }
    }
  ]
}
EOF

4. Within the timeout specified (60 seconds in my case), create the NNCP:
cat <<EOF | oc create -f -
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-1
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens8
      ipv4:
        dhcp: false
        enabled: false
      ipv6:
        enabled: false
      name: br1test
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: n-awax-413-o-gg24b-worker-0-6jbcn
EOF

5. Wait for the NNCP to be configured successfully.
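
Steps 2 and 5 can be scripted roughly as follows. This is a sketch under assumptions: the function names and the 120s default timeout are illustrative, the handler pod is matched by the nmstate-handler name prefix seen in step 3, and the NNCP is assumed to report an Available condition once it is configured successfully.

```shell
#!/usr/bin/env bash

# Step 2: print the nmstate-handler pod scheduled on the given node.
find_handler_pod() {
  local node="$1"
  oc get pods -n openshift-nmstate \
    --field-selector "spec.nodeName=${node}" -o name | grep nmstate-handler
}

# Step 5: block until the policy reports Available, or time out.
wait_for_nncp() {
  local name="$1"
  local timeout="${2:-120s}"
  oc wait "nncp/${name}" --for=condition=Available --timeout="${timeout}"
}
```

For the run above this would be "find_handler_pod n-awax-413-o-gg24b-worker-0-6jbcn" followed, after step 4, by "wait_for_nncp nncp-1".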

The bug is solved.

Comment 37 errata-xmlrpc 2023-05-18 02:55:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205