2116562 – NodeNetworkConfigurationPolicy "ERROR: State editing already in progress. Commit, roll back or wait before retrying"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.13.0
Assignee:	Quique Llorente
QA Contact:	awax
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-08-08 20:03 UTC by Eswar Vadla
Modified:	2023-05-18 02:56 UTC
CC List:	8 users
Fixed In Version:	4.13 nightly
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-05-18 02:55:41 UTC
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	hchaturv: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
Github	nmstate kubernetes-nmstate pull 1137	None	Merged	handler: Rollback always before Apply	2023-02-07 21:48:14 UTC
Red Hat Issue Tracker	CNV-21792	None	None	None	2022-11-02 10:22:38 UTC
Red Hat Product Errata	RHSA-2023:3205	None	None	None	2023-05-18 02:56:30 UTC

Description Eswar Vadla 2022-08-08 20:03:52 UTC

Description of problem:

NNCE is failing with error:

ERROR: State editing already in progress.\nCommit, roll back or wait before retrying

There do not appear to be any other errors.

Version-Release number of selected component (if applicable):
OCP 4.10

How reproducible:
Making sure a filtering bridge interface exists (using MachineConfig in our case).

Create X policies for X VLAN and bridge devices on top of the filtering bridge interface.

If all went well, reboot a node and check for a failed NNCE/NNCP status after the node becomes Ready again.
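
The final check can be run from the command line. The following is a minimal sketch, assuming `oc` is logged in to the affected cluster. `failing_enactments` is a hypothetical helper name, and matching on the word "Failing" assumes the status text reported under "Actual results".

```shell
# Hypothetical helper: print the enactments whose row contains "Failing".
# Assumes `oc get nnce` prints one enactment per line together with its status.
failing_enactments() {
  oc get nnce --no-headers | grep -w 'Failing'
}

# Usage, after the rebooted node reports Ready again:
#   oc get nncp
#   failing_enactments || echo "no failing enactments"
```

The helper returns a non-zero status when nothing matches, so it can also be used as a condition in a script.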


Actual results:

NNCE is not applied and goes into failing state.

node   Failing

The node had been working, but started to fail for unknown reasons. It has been restarted, but the same errors occur.

I did notice an upstream bug fix by a Red Hat engineer (dprince), but my guess is that it would only mask the real error.



error: at node reconcile creating NodeNetworkState: Error updating nodeNetworkState: Operation cannot be fulfilled on nodenetworkstates.nmstate.io \"worker-11\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on nodenetworkstates.nmstate.io \"worker-11\": the object has been modified; please apply your changes to the latest version and try again\nError updating nodeNetworkState\ngithub.com/nmstate/kubernetes-nmstate/pkg/helper.UpdateCurr

Comment 3 Ben Nemec 2022-08-09 17:28:33 UTC

(In reply to Eswar Vadla from comment #0)
> Making sure a filtering bridge interface exists (using MachineConfig in our
> case).

Note that in 4.10 it should not be necessary to use machine-configs for network configuration, and I wouldn't recommend it, since you then have two different operators managing network config files (MCO and kubernetes-nmstate). It doesn't sound like that's the problem here, but if there's configuration they can't use kubernetes-nmstate for, we would recommend the networkConfig mechanism to deploy that initially: https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configuring-host-network-interfaces-in-the-install-config-yaml-file_ipi-install-installation-workflow

It uses the same syntax as kubernetes-nmstate, so it's easy to modify on day 2 with the operator and there's no chance of conflicts.

Comment 11 Petr Horáček 2022-10-11 08:26:37 UTC

We believe that this issue has been fixed in CNV 4.10.5. Could you upgrade to that version?

Comment 18 Petr Horáček 2022-12-08 10:19:23 UTC

Eswar, considering Quique's answers, is anything else needed before we close this?

Comment 23 Quique Llorente 2023-01-04 09:30:49 UTC

We have been able to reproduce the "in progress" issue. It looks like if the nmstate-handler pod gets restarted before committing the nmstate checkpoint, the next "nmstatectl set" operation fails because the checkpoint is still lingering. We are going to add a fix that rolls back checkpoints when the nmstate-handler pods start.

This cannot be caused by a node restart, since checkpoints disappear when the NetworkManager daemon is restarted.
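
The idea of rolling back before applying can be sketched from a shell. This is only an illustration, assuming it runs where `nmstatectl` is available (for example inside the nmstate-handler pod) and that `nmstatectl rollback` with no argument discards the pending checkpoint. `apply_with_clean_checkpoint` is a hypothetical name; the real fix is in the handler code, not in a script.

```shell
# Hypothetical helper: clear any leftover checkpoint, then apply and commit.
apply_with_clean_checkpoint() {
  local state_file
  state_file="$1"
  # Discard a checkpoint left behind by an interrupted run. A failure here is
  # expected when no checkpoint is pending, so it is ignored.
  nmstatectl rollback >/dev/null 2>&1 || true
  # Apply without committing; this creates a fresh checkpoint.
  nmstatectl apply --no-commit --timeout 60 "$state_file" || return 1
  # Make the change permanent. A real caller would verify connectivity first.
  nmstatectl commit
}

# Usage: apply_with_clean_checkpoint /path/to/desired-state.yml
```

Without the initial rollback, the apply step is the one that would hit "State editing already in progress" while an old checkpoint is still pending.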

Comment 24 Quique Llorente 2023-01-11 09:42:44 UTC

Added an upstream fix that removes any pending checkpoint before applying a new state, which covers the case where the handler pods are restarted.

We have also measured the elapsed time of each operation on the current Python version of nmstatectl and on the future Rust version,
and the difference is close to an order of magnitude. For performance, we will therefore just wait until the future
Rust version of nmstate is integrated.

Comment 25 Quique Llorente 2023-01-31 12:16:11 UTC

POST waiting for https://github.com/openshift/kubernetes-nmstate/pull/334

Comment 28 Petr Horáček 2023-02-23 08:43:41 UTC

Changes were merged downstream https://github.com/openshift/kubernetes-nmstate/pull/339

Comment 33 awax 2023-03-30 08:51:08 UTC

The bug fix was verified on a PSI cluster:
OpenShift version: 4.13.0-0.nightly-2023-03-23-000343
CNV version: v4.13.0.rhel9-1836

Verification steps:
1. Decide on which node to create the NNCP (but don't create it yet).
2. Find the nmstate-handler pod that is running on that node.
3. From the executor, create a checkpoint on that handler (this simulates a checkpoint left pending by an interrupted handler pod):
cat <<EOF | oc exec -it nmstate-handler-4sxq2 -n openshift-nmstate -- nmstatectl apply --no-commit --timeout 60
---
{
  "interfaces": [
    {
      "name": "br0",
      "type": "linux-bridge",
      "state": "up",
      "bridge": {
        "options": {
          "stp": {
            "enabled": false
          }
        },
        "port": []
      }
    }
  ]
}
EOF

4. Within the timeout specified (60 seconds in my case), create the NNCP:
cat <<EOF | oc create -f -
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-1
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens8
      ipv4:
        dhcp: false
        enabled: false
      ipv6:
        enabled: false
      name: br1test
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: n-awax-413-o-gg24b-worker-0-6jbcn
EOF

5. Wait for the NNCP to be configured successfully.
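
Step 5 can be scripted rather than watched by hand. A minimal sketch, assuming the policy reports an `Available` condition once it is configured; `wait_for_nncp` is a hypothetical helper and the default timeout of 2m is arbitrary.

```shell
# Hypothetical helper: block until the named policy reports Available.
wait_for_nncp() {
  local name
  local timeout
  name="$1"
  timeout="${2:-2m}"
  oc wait nncp "$name" --for=condition=Available --timeout="$timeout"
}

# Usage: wait_for_nncp nncp-1
```

`oc wait` exits with a non-zero status if the condition is not met within the timeout, which makes the step easy to use in an automated check.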

The bug is solved.

Comment 37 errata-xmlrpc 2023-05-18 02:55:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205
