Bug 1973734

Summary: webscale: nncp is stuck in progressing
Product: Container Native Virtualization (CNV)
Reporter: Nabeel Cocker <ncocker>
Component: Networking
Assignee: Quique Llorente <ellorent>
Status: CLOSED DUPLICATE
QA Contact: Meni Yakove <myakove>
Severity: urgent
Docs Contact:
Priority: high
Version: 2.6.5
CC: agurenko, cnv-qe-bugs, dfediuck, mcornea, phoracek, rsdeor
Target Milestone: ---
Target Release: 4.8.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-21 11:33:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  nmstate pod logs (flags: none)
  nmstate logs (flags: none)

Description Nabeel Cocker 2021-06-18 15:14:46 UTC
Created attachment 1792083 [details]
nmstate pod logs

Description of problem:

The NNCP is stuck in Progressing while the NNCEs all appear to be complete. The configuration was not applied on any of the hosts. The following line in the logs looks suspicious: "Operation cannot be fulfilled on nodenetworkconfigurationpolicies \"bigip-ha-policy\": Another node is working on configuration"
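
For reference, a hedged sketch of roughly what the stuck policy looks like; the condition layout follows the usual NodeNetworkConfigurationPolicy status and the values here are illustrative, only the quoted message is taken from the logs:

# Illustrative only -- assumed shape of the stuck NNCP status, not captured from the cluster
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bigip-ha-policy
status:
  conditions:
  - type: Progressing
    status: "True"       # never flips back, so the policy appears stuck
  - type: Available
    status: "False"
# Meanwhile the handler logs repeat:
# Operation cannot be fulfilled on nodenetworkconfigurationpolicies "bigip-ha-policy":
# Another node is working on configuration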


Version-Release number of selected component (if applicable):
CNV 2.6.5


How reproducible:
Unknown


Steps to Reproduce:
1. Apply multiple policies (a hypothetical example is sketched after this list)
2. (All of them apply successfully at this point)
3. (A couple of days pass; it is not clear whether the setup was still healthy)
4. Upgrade CNV
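
For illustration, a minimal sketch of what step 1 means; the policy name and interfaces below are hypothetical, not the actual policies from this cluster:

# Hypothetical example of one of the policies applied in step 1;
# "multiple policies" means several such NNCPs applied the same way
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: example-policy-a
spec:
  desiredState:
    interfaces:
    - name: br-example
      type: linux-bridge
      state: up
      bridge:
        port:
        - name: eth1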

Actual results:
One of them is successfully applied; the other is stuck in Progressing (for weeks).


Expected results:
Both should either succeed or fail with a clear error in NNCE.


Additional info:
It is unclear whether the Policy failed before the upgrade or only after CNV upgrade.

Edit: Petr made some clarifications on the description

Comment 1 Petr Horáček 2021-06-18 15:27:15 UTC
We don't have a lot to work with here. I'm hoping we can figure out why we got stuck in Progressing based on the logs.

Comment 2 Nabeel Cocker 2021-06-18 16:48:36 UTC
Created attachment 1792101 [details]
nmstate logs

Logs after deleting and reapplying the bigip-ha-policy and the wsnmacvlanpolicy-bond1 policy. It is taking a very long time for the policies to apply.

Comment 3 Quique Llorente 2021-06-21 08:16:00 UTC
Just for clarification: the issue is related to CNV 2.6, since the log message "Another node is working on" only exists there; in CNV 4.8 the mechanism is different. So it looks like, after some time, CNV 2.6 tries to reconcile the NNCPs again and fails.

Comment 4 Quique Llorente 2021-06-21 08:22:41 UTC
@ncocker Since this is an upgrade, can you check what RHCOS version the nodes are running? We may be hitting a libnm 0.3 vs NM 1.0 issue.

Comment 5 Quique Llorente 2021-06-21 08:27:33 UTC
Also, we can try to run the NNCP in parallel so it does not wait for the other nodes (I think there is no issue with that here). To do that, add "parallel: true" to the NNCP spec.
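
A minimal sketch of what that might look like, assuming the flag sits at the top level of the NNCP spec as the comment suggests (not verified against the CRD):

apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bigip-ha-policy
spec:
  parallel: true            # assumption: top-level spec field, per the comment above
  desiredState:
    # ... existing desired state of the policy, unchanged ...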

Comment 6 Quique Llorente 2021-06-21 10:16:10 UTC
It looks like, during the upgrade, a handler pod got restarted in the middle of a progressing NNCP and left the field "nodeRunningUpdate: worker-3" in the status, so the "worker-3" handler thinks that something is still going on and never finishes. There is a fix for that: https://github.com/nmstate/kubernetes-nmstate/pull/763. Can you verify that you have the latest build?
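
For context, a hedged sketch of the leftover status described above; only the nodeRunningUpdate value is taken from the report, the surrounding fields are illustrative:

status:
  nodeRunningUpdate: worker-3   # stale entry left by the handler pod restarted mid-update
  conditions:
  - type: Progressing
    status: "True"              # so the policy never leaves Progressing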

Comment 8 Petr Horáček 2021-06-21 11:33:53 UTC
Thanks Quique, that's awesome. I think we can mark this as a duplicate of [1] and [2] for 4.8 and 2.6.6 respectively.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1967771
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967887

Both of these are now on QA and should land in targeted releases.

*** This bug has been marked as a duplicate of bug 1967771 ***

Comment 9 Red Hat Bugzilla 2023-09-15 01:10:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days