Bug 2116562
| Summary: | NodeNetworkConfigurationPolicy "ERROR: State editing already in progress. Commit, roll back or wait before retrying" | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Eswar Vadla <evadla> |
| Component: | Networking | Assignee: | Quique Llorente <ellorent> |
| Status: | CLOSED ERRATA | QA Contact: | awax |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.10.0 | CC: | alitke, bnemec, cstabler, ellorent, hchaturv, igarcia, nrozen, phoracek |
| Target Milestone: | --- | Flags: | hchaturv: needinfo-, hchaturv: needinfo- |
| Target Release: | 4.13.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.13 nightly | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-18 02:55:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Eswar Vadla
2022-08-08 20:03:52 UTC
(In reply to Eswar Vadla from comment #0)
> Making sure a filtering bridge interface exists (using MachineConfig in our
> case).

Note that in 4.10 it should not be necessary to use machine-configs for network configuration, and I wouldn't recommend it, since then you have two different operators managing network config files (MCO and kubernetes-nmstate). It doesn't sound like that's the problem here, but if there's configuration they can't use kubernetes-nmstate for, we would recommend the networkConfig mechanism to deploy that initially:
https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configuring-host-network-interfaces-in-the-install-config-yaml-file_ipi-install-installation-workflow
It uses the same syntax as kubernetes-nmstate, so it's easy to modify on day 2 with the operator and there's no chance of conflicts.

We believe that this issue has been fixed in CNV 4.10.5. Could you upgrade to that version?

Eswar, considering Quique's answers, is anything else needed before we close this?

We have been able to reproduce the "in progress" issue. It looks like if the nmstate-handler pod gets restarted before committing the nmstate checkpoint, the next "nmstatectl set" operation fails because the checkpoint is still lingering. We are going to add a fix that rolls back checkpoints when the nmstate-handler pods start. This cannot be an issue with node restart, since checkpoints disappear when the NetworkManager daemon is restarted.

Added an upstream fix to remove any pending checkpoint before applying new state, for example when the handler pods are restarted. We have also measured the elapsed time of each operation on the current Python version of nmstatectl and on the future Rust version, and the difference is nearly an order of magnitude, so regarding performance we will just wait to integrate the future Rust version of nmstate.
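The fix described above (discard any lingering checkpoint before applying new state) can be sketched as a small shell wrapper. This is only an illustration, not the actual kubernetes-nmstate change, which lives in the handler's Go code. The function name `apply_with_clean_checkpoint` and the `NMSTATECTL` override variable are hypothetical, and the sketch assumes that `nmstatectl rollback` without arguments rolls back the pending checkpoint and exits non-zero when there is none.

```shell
#!/bin/sh
# Hypothetical wrapper: NMSTATECTL can be overridden (e.g. with a stub) for testing.
NMSTATECTL="${NMSTATECTL:-nmstatectl}"

# apply_with_clean_checkpoint STATE_FILE
# Rolls back a checkpoint left behind by an interrupted apply, then applies
# the desired state from STATE_FILE.
apply_with_clean_checkpoint() {
    state_file="$1"

    # Assumption: rollback with no checkpoint argument targets the pending
    # checkpoint. A failure here usually means nothing was pending, so it is
    # ignored on purpose.
    "$NMSTATECTL" rollback >/dev/null 2>&1 || true

    # Apply the new state; its exit status is the function's exit status.
    "$NMSTATECTL" apply "$state_file"
}
```

Inside a handler pod this would be called as `apply_with_clean_checkpoint /path/to/state.yml`, where the path is a placeholder.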
POST waiting for https://github.com/openshift/kubernetes-nmstate/pull/334

Changes were merged downstream: https://github.com/openshift/kubernetes-nmstate/pull/339

The bug fix was verified on a PSI cluster:
Openshift version: 4.13.0-0.nightly-2023-03-23-000343
CNV version: v4.13.0.rhel9-1836
Verification steps:
1. Decide on which node to create the NNCP (but don't create it yet).
2. Find the nmstate-handler pod that is running on that node.
3. From the executor, create an uncommitted checkpoint on that handler (this simulates a checkpoint left pending by an interrupted apply, for example after a handler pod restart):
cat <<EOF | oc exec -it nmstate-handler-4sxq2 -n openshift-nmstate -- nmstatectl apply --no-commit --timeout 60
---
{
  "interfaces": [
    {
      "name": "br0",
      "type": "linux-bridge",
      "state": "up",
      "bridge": {
        "options": {
          "stp": {
            "enabled": false
          }
        },
        "port": []
      }
    }
  ]
}
EOF
4. Within the timeout specified (60 seconds in my case), create the NNCP:
cat <<EOF | oc create -f -
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-1
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens8
      ipv4:
        dhcp: false
        enabled: false
      ipv6:
        enabled: false
      name: br1test
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: n-awax-413-o-gg24b-worker-0-6jbcn
EOF
5. Wait for the NNCP to be configured successfully.
The bug is solved.
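Steps 2 and 5 of the verification can be scripted roughly as follows. This is a sketch under assumptions, not part of the original verification: the helper names `handler_pod_on_node` and `wait_for_nncp` and the `OC` override variable are hypothetical, the label `component=kubernetes-nmstate-handler` is assumed to be set on the handler pods, and the policy is assumed to report an `Available` condition once it has been applied.

```shell
#!/bin/sh
# Hypothetical helpers: OC can be overridden (e.g. with a stub) for testing.
OC="${OC:-oc}"

# Step 2: print the name of the nmstate-handler pod running on NODE.
# Assumption: handler pods carry the label component=kubernetes-nmstate-handler.
handler_pod_on_node() {
    node="$1"
    "$OC" get pods -n openshift-nmstate \
        -l component=kubernetes-nmstate-handler \
        --field-selector "spec.nodeName=$node" \
        -o jsonpath='{.items[0].metadata.name}'
}

# Step 5: block until the policy reports the Available condition,
# or fail once TIMEOUT (default 300s) has passed.
wait_for_nncp() {
    policy="$1"
    timeout="${2:-300s}"
    "$OC" wait "nncp/$policy" --for=condition=Available --timeout="$timeout"
}
```

For the run above, this would be `handler_pod_on_node n-awax-413-o-gg24b-worker-0-6jbcn` followed, after creating the policy, by `wait_for_nncp nncp-1`.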
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205