+++ This bug was initially created as a clone of Bug #1967771 +++

Description of problem:

After an OCP upgrade, we realized that there is a network problem on a worker node. ~30 VMs were scheduled on the affected node right after it updated, rebooted, and became Ready, and those VMs lost their network access. After checking the node's network configuration, we realized that VLAN filtering was not configured on the Linux bridge br1, which is used for VM networking. The node was restarted but the problem still persists.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Forcefully kill an nmstate-handler pod while it is progressing on a node and .status.nodeRunningUpdate is set.
2. The .status.nodeRunningUpdate field is never released and the nmstate-handler pod keeps complaining "Another node is working on configuration", stuck in a deadlock.
3.

Actual results:
Because the nmstate-handler pods are not doing anything, they do not configure VLAN filtering and all the VMs lose network access. We were lucky: in our setup there is an NNCP for every node using a nodeSelector based on 'kubernetes.io/hostname', so the failure domain is limited. However, with a single NNCP shared by multiple nodes, this could easily end in a full cluster outage where every VM loses network access as the nodes are upgraded and rebooted one by one by the machine-config operator.

Expected results:
nmstate-handler pods must be able to release .status.nodeRunningUpdate and break the loop if it was set previously by themselves. Instead of constantly logging "Another node is working on configuration" and never progressing with the actual network configuration on the node, they should recognize that the lock was set by a previous instance that is no longer alive.

Additional info:

--- Additional comment from Quique Llorente on 2021-06-04 07:55:10 UTC ---

As a workaround we can bypass the rollout by setting "parallel: false" on the NNCP, but we have to be sure that the cluster will be OK if all the nodes are configured in parallel.

The u/s kubernetes-nmstate version is 0.37
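A quick way to confirm the stale lock described above (a sketch only: the policy name is a placeholder, the .status.nodeRunningUpdate field and the log message are taken from this report, and the handler namespace may differ per CNV/CNAO version):

  # Show which node is recorded as holding the serialized-rollout lock
  oc get nncp <policy-name> -o jsonpath='{.status.nodeRunningUpdate}'

  # Check whether an nmstate-handler pod is actually alive on that node
  # (openshift-cnv namespace is an assumption for a CNV install; adjust as needed)
  oc get pods -n openshift-cnv -o wide | grep nmstate-handler

If the recorded node no longer has a running nmstate-handler pod and the other handlers keep logging "Another node is working on configuration", the lock is stale and the rollout is deadlocked.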
Created a CNAO release with the fix u/s https://github.com/kubevirt/cluster-network-addons-operator/releases/tag/v0.44.5
Verified on a cluster installed with these versions:
Openshift: 4.7.18
Kubernetes Version: v1.20.0+87cc9a4
CNV: 2.6.6
nmstate-handler: v2.6.6-1

Reproduction scenario:

1. Apply the following NNCP to create a bridge on a selected worker node:

apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1test-nncp
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens9
      ipv4:
        dhcp: false
        enabled: false
      ipv6:
        enabled: false
      name: br1test
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: "net-yoss-266-s8cdb-worker-0-gtws9"

2. While the NNCP was still in status ConfigurationProgressing, I deleted the nmstate-handler pod running on the selected node.

Result:
a. The NNCP status (which I continuously checked using "oc get nncp -w") went blank, then back to ConfigurationProgressing.
b. A new nmstate-handler pod started running on the selected node.
c. Finally, the NNCP status went to SuccessfullyConfigured.
d. The bridge interface was created on the selected node.
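For reference, the verification steps above roughly map to the following commands (a sketch under assumptions: the openshift-cnv namespace, the name=nmstate-handler pod label, and the nns short name for NodeNetworkState are not stated in this report and may need adjusting for the environment):

  # Watch the policy status transitions in a separate terminal
  oc get nncp br1test-nncp -w

  # While the policy is ConfigurationProgressing, delete the handler pod on the selected node
  # (label and namespace are assumptions; alternatively pick the pod name from "oc get pods -o wide")
  oc -n openshift-cnv delete pod -l name=nmstate-handler \
    --field-selector spec.nodeName=net-yoss-266-s8cdb-worker-0-gtws9

  # After a new handler pod comes up and the policy reports SuccessfullyConfigured,
  # confirm the bridge exists in the node's reported network state
  oc get nns net-yoss-266-s8cdb-worker-0-gtws9 -o yaml | grep -A3 br1test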
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3119