Bug 1973734

Summary: webscale: nncp is stuck in progressing
Product: Container Native Virtualization (CNV)
Reporter: Nabeel Cocker <ncocker>
Component: Networking
Assignee: Quique Llorente <ellorent>
Status: CLOSED DUPLICATE
QA Contact: Meni Yakove <myakove>
Severity: urgent
Docs Contact:
Priority: high
Version: 2.6.5
CC: agurenko, cnv-qe-bugs, dfediuck, mcornea, phoracek, rsdeor
Target Milestone: ---
Target Release: 4.8.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-21 11:33:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  nmstate pod logs (flags: none)
  nmstate logs (flags: none)

Description Nabeel Cocker 2021-06-18 15:14:46 UTC
Created attachment 1792083 [details]
nmstate pod logs

Description of problem:

The NNCP is stuck in Progressing while the NNCEs all appear to be complete. The configuration was not applied on any of the hosts. The following line in the logs looks suspicious: "Operation cannot be fulfilled on nodenetworkconfigurationpolicies \"bigip-ha-policy\": Another node is working on configuration"
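
For reference, a hedged sketch of roughly what the stuck policy looks like; the condition layout follows the usual NodeNetworkConfigurationPolicy status and the values here are illustrative, only the quoted message is taken from the logs:

# Illustrative only -- assumed shape of the stuck NNCP status, not captured from the cluster
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bigip-ha-policy
status:
  conditions:
  - type: Progressing
    status: "True"       # never flips back, so the policy appears stuck
  - type: Available
    status: "False"
# Meanwhile the handler logs repeat:
# Operation cannot be fulfilled on nodenetworkconfigurationpolicies "bigip-ha-policy":
# Another node is working on configuration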


Version-Release number of selected component (if applicable):
CNV 2.6.5


How reproducible:
Unknown


Steps to Reproduce:
1. Apply multiple policies (a hypothetical example is sketched after this list)
2. (All of them apply successfully at this point)
3. (A couple of days pass; it is not clear whether the setup was still healthy)
4. Upgrade CNV
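
For illustration, a minimal sketch of what step 1 means; the policy name and interfaces below are hypothetical, not the actual policies from this cluster:

# Hypothetical example of one of the policies applied in step 1;
# "multiple policies" means several such NNCPs applied the same way
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: example-policy-a
spec:
  desiredState:
    interfaces:
    - name: br-example
      type: linux-bridge
      state: up
      bridge:
        port:
        - name: eth1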

Actual results:
One of them is successfully applied; the other is stuck in Progressing (for weeks).


Expected results:
Both should either succeed or fail with a clear error in NNCE.


Additional info:
It is unclear whether the Policy failed before the upgrade or only after CNV upgrade.

Edit: Petr made some clarifications on the description

Comment 1 Petr Horáček 2021-06-18 15:27:15 UTC
We don't have a lot to work with here. I'm hoping we can figure out why we got stuck in Progressing based on the logs.

Comment 2 Nabeel Cocker 2021-06-18 16:48:36 UTC
Created attachment 1792101 [details]
nmstate logs

Logs after deleting and reapplying the bigip-ha-policy and the wsnmacvlanpolicy-bond1 policy. It is taking a very long time for the policies to apply.

Comment 3 Quique Llorente 2021-06-21 08:16:00 UTC
Just for clarification: the issue is related to CNV 2.6, since the log message "Another node is working on" only exists there; in CNV 4.8 the mechanism is different. So it looks like, after some time, CNV 2.6 tries to reconcile the NNCPs again and fails.

Comment 4 Quique Llorente 2021-06-21 08:22:41 UTC
@ncocker Since this is an upgrade, can you check what RHCOS version the nodes are running? We may be hitting a libnm 0.3 vs NM 1.0 issue.

Comment 5 Quique Llorente 2021-06-21 08:27:33 UTC
Also, we can try to run the NNCP in parallel so it does not wait for the other nodes (I think there is no issue with that here). To do that, add "parallel: true" to the NNCP spec.
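
A minimal sketch of what that might look like, assuming the flag sits at the top level of the NNCP spec as the comment suggests (not verified against the CRD):

apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bigip-ha-policy
spec:
  parallel: true            # assumption: top-level spec field, per the comment above
  desiredState:
    # ... existing desired state of the policy, unchanged ...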

Comment 6 Quique Llorente 2021-06-21 10:16:10 UTC
It looks like, during the upgrade, a handler pod got restarted in the middle of a progressing NNCP and left the field "nodeRunningUpdate: worker-3" in the status, so the "worker-3" handler thinks that something is still going on and never finishes. There is a fix for that: https://github.com/nmstate/kubernetes-nmstate/pull/763. Can you verify that you have the latest build?
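
For context, a hedged sketch of the leftover status described above; only the nodeRunningUpdate value is taken from the report, the surrounding fields are illustrative:

status:
  nodeRunningUpdate: worker-3   # stale entry left by the handler pod restarted mid-update
  conditions:
  - type: Progressing
    status: "True"              # so the policy never leaves Progressing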

Comment 8 Petr Horáček 2021-06-21 11:33:53 UTC
Thanks Quique, that's awesome. I think we can mark this as a duplicate of [1] and [2] for 4.8 and 2.6.6 respectively.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1967771
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967887

Both of these are now on QA and should land in targeted releases.

*** This bug has been marked as a duplicate of bug 1967771 ***

Comment 9 Red Hat Bugzilla 2023-09-15 01:10:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days