Bug 1967771 - nmstate is not progressing on a node and not configuring vlan filtering that causes an outage for VMs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.6.3
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Quique Llorente
QA Contact: Ofir Nash
URL:
Whiteboard:
Duplicates: 1973734 (view as bug list)
Depends On:
Blocks: 1967887
 
Reported: 2021-06-03 20:44 UTC by kseremet
Modified: 2023-09-15 01:11 UTC
CC List: 5 users

Fixed In Version: kubernetes-nmstate-handler-container-v4.8.0-18
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1967887 (view as bug list)
Environment:
Last Closed: 2021-07-27 14:32:39 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker CNV-12294 (last updated 2023-09-15 01:11:09 UTC)
Red Hat Product Errata RHSA-2021:2920 (last updated 2021-07-27 14:33:33 UTC)

Description kseremet 2021-06-03 20:44:32 UTC
Description of problem:

After an OCP upgrade, we realized that there was a network problem on a worker node. Roughly 30 VMs were scheduled on the affected node right after it was updated, rebooted, and became Ready, and those VMs lost their network access. After checking the node's network configuration, we found that VLAN filtering was not configured on the Linux bridge br1, which is used for VM networking. The node was restarted, but the problem still persists.
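
For reference, a quick way to confirm from the affected node whether VLAN filtering is enabled on the bridge (br1 is our bridge name; these are plain iproute2 commands, nothing CNV-specific):

# A correctly configured bridge reports "vlan_filtering 1" in the detailed output
ip -d link show br1

# List the VLANs currently configured on the bridge and its ports
bridge -d vlan show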

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Forcefully kill an nmstate-handler pod while it is progressing on a node and .status.nodeRunningUpdate is set.

2. The .status.nodeRunningUpdate field is not released, and the new nmstate-handler pod keeps complaining "Another node is working on configuration", ending up in a deadlock (see the command sketch after these steps).

3.
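
A command-level sketch of the steps above. The openshift-cnv namespace and the name=nmstate-handler pod label are assumptions about how the handler DaemonSet is deployed, and the node name is illustrative; adjust them to your environment.

# 1. While an NNCP is progressing on the node, force-delete the handler pod running there.
NODE=worker-1
POD=$(oc -n openshift-cnv get pods -l name=nmstate-handler \
      --field-selector spec.nodeName="$NODE" -o name)
oc -n openshift-cnv delete "$POD" --force --grace-period=0

# 2. The lock stays set. The report does not say which object carries
#    .status.nodeRunningUpdate, so grep both the policies and the node states.
oc get nncp,nns -o yaml | grep -i nodeRunningUpdate

# 3. The replacement handler pod loops on the lock instead of configuring the node.
oc -n openshift-cnv logs -l name=nmstate-handler --tail=200 | grep "Another node is working on configuration"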

Actual results:

Because the nmstate-handler pods are not doing anything, they are not configuring VLAN filtering, and all the VMs lose network access.
We were lucky because, in our setup, there is an NNCP for every node using a nodeSelector based on 'kubernetes.io/hostname' (see the example below), so the failure domain is limited in our case.
However, if there were a single NNCP shared by multiple nodes, this could easily end up as a full cluster outage, with all the VMs losing network access as the nodes are upgraded and rebooted one by one by the machine-config operator.
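
For context, the per-node policies mentioned above follow this pattern (a minimal sketch; the policy name, hostname, uplink NIC ens3, and apiVersion are illustrative and may differ per cluster and kubernetes-nmstate version). Applying such a policy is also what configures VLAN filtering on br1, which is exactly the part that never happens while the handler is stuck on the lock.

cat <<'EOF' | oc apply -f -
apiVersion: nmstate.io/v1beta1            # newer releases use nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-worker-1                      # one policy per node in our setup
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-1      # limits the failure domain to this node
  desiredState:
    interfaces:
    - name: br1
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens3                      # illustrative uplink NIC
EOF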


Expected results:

The nmstate-handler pods must be capable of releasing .status.nodeRunningUpdate and breaking the loop if the field was set by a previous instance of themselves.
Instead of constantly logging "Another node is working on configuration" and never progressing with the actual network configuration on the node, they should recognize that the lock was set by a previous instance that is no longer alive.

Additional info:

Comment 1 Quique Llorente 2021-06-04 07:55:10 UTC
As a workaround we can bypass the rollout by setting "parallel: false" on the NNCP, but we have to be sure the cluster will be OK if all the nodes are configured in parallel.

The u/s kubernetes-nmstate version is 0.37

Comment 2 Quique Llorente 2021-06-04 10:16:42 UTC
u/s fix for CNV 2.6 https://github.com/nmstate/kubernetes-nmstate/pull/763

Comment 3 Quique Llorente 2021-06-04 10:24:37 UTC
The workaround is "parallel: true" not "parallel: false".

Comment 4 Quique Llorente 2021-06-08 09:07:26 UTC
We are going to keep this BZ open since the solution for 4.8 is different from 2.6, which already has its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1967887

Comment 5 Petr Horáček 2021-06-10 09:27:53 UTC
This may be a blocker for 4.8.

Comment 6 Quique Llorente 2021-06-10 09:28:37 UTC
In progress fix https://github.com/nmstate/kubernetes-nmstate/pull/771

Comment 7 Petr Horáček 2021-06-21 11:33:52 UTC
*** Bug 1973734 has been marked as a duplicate of this bug. ***

Comment 8 Nabeel Cocker 2021-06-24 15:24:28 UTC
Team,

Had a question: would node reboots put us back into a state where the NNCP is stuck in Progressing again?

We had node reboots in the cluster and see NNCPs with "nomatchingnodes" or Progressing...

Just want clarification.

thanks
Nabeel

Comment 9 Petr Horáček 2021-06-25 08:59:41 UTC
I talked with Quique and he confirmed that after a reboot, the issue is expected to reappear. This should also be solved by the fix.

Are configured interfaces persisted after the reboot, even though the status gets stuck there?
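
(In case it helps to check: a rough way to see on a rebooted node whether the bridge and its NetworkManager profile survived, via a debug shell; the node name is illustrative.)

oc debug node/worker-1 -- chroot /host ip -d link show br1
oc debug node/worker-1 -- chroot /host nmcli connection show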

Comment 10 Ofir Nash 2021-06-28 21:12:44 UTC
Verified on nmstate-handler version v4.8.0-18.

Scenario checked: 
1. Create an NNCP that configures a Linux bridge on worker node X.
2. Delete the matching nmstate-handler pod of worker node X while the policy is progressing (Status: ConfigurationProgressing).
3. Verified that a new nmstate-handler pod is created and releases the lock, allowing the NNCP to progress and be applied successfully (a rough command sketch follows).
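
A rough command sketch of that verification flow (namespace, pod label, and policy name are assumptions, as in the reproduction sketch above):

# Watch the policy while deleting the handler pod on worker node X
oc get nncp br1-worker-1 --watch

# The per-node enactment should end up SuccessfullyConfigured once the new pod takes over
oc get nnce | grep worker-1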

We have an automated test that verifies this exact scenario: https://code.engineering.redhat.com/gerrit/c/cnv-tests/+/250668
(Currently under code review; it will be merged once approved.)

Comment 13 errata-xmlrpc 2021-07-27 14:32:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

Comment 14 Red Hat Bugzilla 2023-09-15 01:08:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

