Bug 1973734 - webscale: nncp is stuck in progressing
Summary: webscale: nncp is stuck in progressing
Keywords:
Status: CLOSED DUPLICATE of bug 1967771
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.1
Assignee: Quique Llorente
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-18 15:14 UTC by Nabeel Cocker
Modified: 2023-09-15 01:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-21 11:33:53 UTC
Target Upstream Version:
Embargoed:


Attachments
nmstate pod logs (1.48 MB, application/octet-stream)
2021-06-18 15:14 UTC, Nabeel Cocker
nmstate logs (2.62 MB, application/octet-stream)
2021-06-18 16:48 UTC, Nabeel Cocker

Description Nabeel Cocker 2021-06-18 15:14:46 UTC
Created attachment 1792083 [details]
nmstate pod logs

Description of problem:

The NNCP is stuck in Progressing while all NNCEs appear to be complete. The configuration was not applied on any of the hosts. The following line in the logs looks suspicious: "Operation cannot be fulfilled on nodenetworkconfigurationpolicies \"bigip-ha-policy\": Another node is working on configuration"


Version-Release number of selected component (if applicable):
CNV 2.6.5


How reproducible:
Unknown


Steps to Reproduce:
1. Apply multiple policies
2. (All of them are successfully applied at this point)
3. (A couple of days pass; it is not clear whether the policies were still healthy)
4. Upgrade CNV

Actual results:
One of the policies is applied successfully; the other has been stuck in Progressing (for weeks).


Expected results:
Both should either succeed or fail with a clear error in NNCE.


Additional info:
It is unclear whether the policy failed before the upgrade or only after the CNV upgrade.

Edit: Petr made some clarifications on the description

Comment 1 Petr Horáček 2021-06-18 15:27:15 UTC
We don't have a lot to work with here. I'm hoping we will be able to figure out from the logs why we got stuck on Progressing.

Comment 2 Nabeel Cocker 2021-06-18 16:48:36 UTC
Created attachment 1792101 [details]
nmstate logs

Logs after deleting and reapplying the bigip-ha-policy and the wsnmacvlanpolicy-bond1 policy. It is taking a very long time for the policy to apply.

Comment 3 Quique Llorente 2021-06-21 08:16:00 UTC
Just for clarification: the issue is related to CNV 2.6, since the log message "Another node is working on" only exists there; in CNV 4.8 the mechanism is different. So it looks like, after some time, CNV 2.6 tries to reconcile the NNCPs again and fails.

Comment 4 Quique Llorente 2021-06-21 08:22:41 UTC
@ncocker since this is an upgrade, can you check which RHCOS version the nodes are running? We may be hitting a libnm 0.3 vs NM 1.0 issue.

Comment 5 Quique Llorente 2021-06-21 08:27:33 UTC
We can also try to run the NNCP in parallel so that it does not wait for the other nodes (I think there is no issue with that here). To do that, add "parallel: true" to the NNCP spec.
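
As a rough illustration of the suggestion above, the snippet below builds a JSON merge-patch body that sets the flag. It assumes the field sits directly under "spec" of the NNCP; only the "parallel: true" key comes from the comment, and the exact placement should be checked against the CRD of the installed version.

```python
import json

# Hypothetical merge-patch body; assumes "parallel" lives directly under "spec".
patch = {"spec": {"parallel": True}}

# Serialize to the JSON string a merge patch would carry.
body = json.dumps(patch)
print(body)
```

A body like this could be passed to a merge patch against the affected policy (for example bigip-ha-policy), or the same key could simply be added by editing the policy manifest.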

Comment 6 Quique Llorente 2021-06-21 10:16:10 UTC
It looks like a handler pod was restarted during the upgrade, in the middle of a progressing NNCP, and left the field "nodeRunningUpdate: worker-3" in the status. As a result, the "worker-3" handler thinks that an update is still in progress and never finishes. There is a fix for that: https://github.com/nmstate/kubernetes-nmstate/pull/763. Can you verify that you have the latest build?
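
The stale-claim situation described above can be sketched with a toy model. This is not the kubernetes-nmstate implementation and not the change made in the linked pull request; the class and function names are hypothetical, and only the "nodeRunningUpdate" field and the "worker-3" node name come from the comment (the other node names are made up for illustration).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyStatus:
    # Mirrors the "nodeRunningUpdate" status field quoted in the comment.
    node_running_update: Optional[str] = None


def try_claim(status: PolicyStatus, node: str) -> bool:
    """Claim the policy for `node` only if no node currently holds it."""
    if status.node_running_update is not None:
        return False  # corresponds to "Another node is working on configuration"
    status.node_running_update = node
    return True


def release(status: PolicyStatus, node: str) -> None:
    """Clear the claim once `node` has finished applying the policy."""
    if status.node_running_update == node:
        status.node_running_update = None


def simulate_restart_mid_update() -> PolicyStatus:
    status = PolicyStatus()
    assert try_claim(status, "worker-3")
    # The handler pod restarts here, so release() is never called.
    return status


status = simulate_restart_mid_update()
# After the restart every handler, including worker-3 itself, is refused.
blocked = [n for n in ("worker-1", "worker-2", "worker-3") if not try_claim(status, n)]
print(blocked)
```

In this model nothing ever clears the field after the restart, so the policy would stay in Progressing indefinitely unless something detects and resets the stale claim.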

Comment 8 Petr Horáček 2021-06-21 11:33:53 UTC
Thanks Quique, that's awesome. I think we can mark this as a duplicate of [1] and [2] for 4.8 and 2.6.6 respectively.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1967771
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967887

Both of these are now on QA and should land in targeted releases.

*** This bug has been marked as a duplicate of bug 1967771 ***

Comment 9 Red Hat Bugzilla 2023-09-15 01:10:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

