Created attachment 1792083 [details]
nmstate pod logs

Description of problem:
The NNCP is stuck in Progressing while all NNCEs seem to be complete. The configuration was not applied on any of the hosts. The following line in the logs looks suspicious:
"Operation cannot be fulfilled on nodenetworkconfigurationpolicies \"bigip-ha-policy\": Another node is working on configuration"

Version-Release number of selected component (if applicable):
CNV 2.6.5

How reproducible:
Unknown

Steps to Reproduce:
1. Apply multiple policies
2. (All of them are successfully applied at this point)
3. (A couple of days pass; it is not clear whether the policies were still healthy)
4. Upgrade CNV

Actual results:
One of the policies is successfully applied, the other is stuck in Progressing (for weeks).

Expected results:
Both should either succeed or fail with a clear error in the NNCE.

Additional info:
It is unclear whether the policy failed before the CNV upgrade or only after it.

Edit: Petr made some clarifications on the description
We don't have a lot to work with here. I'm hoping we will be able to figure out from the logs why we got stuck on Progressing.
Created attachment 1792101 [details]
nmstate logs

Logs after deleting and reapplying the bigip-ha-policy and thewsnmacvlanpolicy-bond1 policy. It is taking a very long time for the policy to apply.
Just for clarification: the issue is specific to CNV 2.6, since the log message "Another node is working on configuration" exists only there. In CNV 4.8 the mechanism is different. So it looks like after some time CNV 2.6 tries to reconcile the NNCPs again and fails.
@ncocker since this is an upgrade, can you check which RHCOS version the nodes are running? We may be hitting a libnm 0.3 vs NM 1.0 issue.
Also, we can try to run the NNCP in parallel so it does not wait for the other nodes (I think there is no issue with that here). To do that, add "parallel: true" to the NNCP spec.
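For illustration, a minimal Python sketch of the suggested change, operating on a policy manifest already loaded as a dict. Only the "parallel: true" key under the spec comes from the comment above; the helper name `enable_parallel`, the apiVersion, and the example desiredState are assumptions made up for this sketch and should be checked against the installed CRD.

```python
import copy


def enable_parallel(policy: dict) -> dict:
    """Return a copy of an NNCP manifest with `parallel: true` set in its spec.

    Hypothetical helper: the input is left untouched so the original
    manifest can still be compared against the patched one.
    """
    patched = copy.deepcopy(policy)
    patched.setdefault("spec", {})["parallel"] = True
    return patched


# Example manifest. The apiVersion and desiredState are assumed placeholders,
# not taken from the cluster in this report.
policy = {
    "apiVersion": "nmstate.io/v1beta1",
    "kind": "NodeNetworkConfigurationPolicy",
    "metadata": {"name": "bigip-ha-policy"},
    "spec": {"desiredState": {"interfaces": []}},
}

patched = enable_parallel(policy)
print(patched["spec"])
```

The patched dict can then be serialized back to YAML or JSON and applied with the usual tooling.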
It looks like during the upgrade a handler pod was restarted in the middle of a progressing NNCP, which left the field "nodeRunningUpdate: worker-3" in the policy status. As a result, the "worker-3" handler thinks that there is something going on and never finishes. There is a fix for that: https://github.com/nmstate/kubernetes-nmstate/pull/763. Can you verify that you have the latest build?
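To check a policy for this condition by hand, a rough Python sketch follows. It assumes the policy has been fetched as a dict (for example from JSON output) and that the set of nodes whose enactments are still Progressing has been collected separately. The function name `stale_lock_holder` and the staleness rule are assumptions made for this sketch, not the logic of the linked fix.

```python
from typing import Optional, Set


def stale_lock_holder(policy: dict, progressing_nodes: Set[str]) -> Optional[str]:
    """Return the node named in status.nodeRunningUpdate if it looks stale.

    Assumed rule: the lock is stale when the status still names a node,
    but that node has no enactment in Progressing state any more.
    """
    holder = policy.get("status", {}).get("nodeRunningUpdate")
    if holder and holder not in progressing_nodes:
        return holder
    return None


# Status as described in this report: the field still names worker-3
# although all enactments are complete (no node is progressing).
policy = {
    "metadata": {"name": "bigip-ha-policy"},
    "status": {"nodeRunningUpdate": "worker-3"},
}

print(stale_lock_holder(policy, progressing_nodes=set()))
```

A non-empty result only suggests a leftover lock; it does not confirm the cause.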
Thanks Quique, that's awesome.

I think we can mark this as a duplicate of [1] and [2] for 4.8 and 2.6.6 respectively.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1967771
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967887

Both of these are now on QA and should land in the targeted releases.

*** This bug has been marked as a duplicate of bug 1967771 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days