Bug 1896469

Summary: In cluster with OVN Kubernetes networking - a node doesn't recover when configuring linux-bridge over its default NIC
Product: Container Native Virtualization (CNV)
Reporter: Yossi Segev <ysegev>
Component: Networking
Assignee: Quique Llorente <ellorent>
Status: CLOSED ERRATA
QA Contact: Meni Yakove <myakove>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 2.5.0
CC: cnv-qe-bugs, ellorent, phoracek, yboaron
Target Milestone: ---
Flags: yboaron: needinfo-
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kubernetes-nmstate-handler-container-v4.9.0-10
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-02 15:57:26 UTC
Type: Bug
Regression: ---
Bug Depends On: 1915850
Bug Blocks:
Attachments:
- nmstate crictl logs cnv 2.6 (flags: none)
- NetworkManager at debug level (flags: none)
- NodeNetworkState before apply the policy (flags: none)
- policy applied (flags: none)

Description Yossi Segev 2020-11-10 16:06:32 UTC
Description of problem:
As described in bz 1885605, a linux-bridge cannot be configured over the default NIC of a node in a cluster with OVN Kubernetes networking.
When this configuration is applied, the node enters the "NotReady" state and does not recover until it is rebooted.


Version-Release number of selected component (if applicable):
OCP 4.6
CNV 2.5


How reproducible:
Always


Steps to Reproduce:
Follow the steps of the description of bz 1885605 (https://bugzilla.redhat.com/show_bug.cgi?id=1885605#c0)
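For reference, a minimal sketch of the kind of NodeNetworkConfigurationPolicy involved (the exact policy is in bz 1885605; the bridge name `br1` and NIC name `ens3` here are assumptions, not taken from that bug):

```yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-over-default-nic
spec:
  desiredState:
    interfaces:
      - name: br1            # assumed bridge name
        type: linux-bridge
        state: up
        ipv4:
          dhcp: true
          enabled: true
        bridge:
          options:
            stp:
              enabled: false
          port:
            - name: ens3     # assumed name of the node's default NIC
```

With OVN Kubernetes, the default NIC is already consumed by the OVN/OVS setup, which is why enslaving it to a linux-bridge triggers the failure described here.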

Actual results:
Each node on which this configuration is applied gets to the "NotReady" state:
# oc get nodes
NAME       STATUS     ROLES    AGE    VERSION
worker-1   NotReady   worker   101m   v1.19.0+d59ce34



Expected results:
Node recovers and gets back to "Ready" state.


Additional info:
Workaround: Reboot the node (I only managed to do it via virsh):
# virsh list
 Id   Name              State
---------------------------------
 18   ostest_master_1   running
 19   ostest_master_2   running
 20   ostest_master_0   running
 23   ostest_worker_0   running
 24   ostest_worker_1   running

# virsh reboot ostest_worker_1

Comment 1 Quique Llorente 2020-12-14 13:34:02 UTC
Created attachment 1738972 [details]
nmstate handler pod logs

Comment 2 Petr Horáček 2021-01-06 11:39:58 UTC
Deferring this to 2.7. OVN is still a tech preview, and we document that reconfiguring the default iface using knmstate is not allowed when OVN is used.

Comment 3 Quique Llorente 2021-01-11 12:26:45 UTC
@yboaron can we retest this with latest CNV ?

Comment 4 Quique Llorente 2021-01-13 10:31:52 UTC
Created attachment 1746999 [details]
nmstate crictl logs cnv 2.6

These are the logs taken directly from the node; since we lose TCP connectivity, this was done using OpenStack noVNC.

Comment 5 Quique Llorente 2021-01-13 11:26:59 UTC
It looks like nmstate is not able to roll back this kind of configuration, since it involves both linux-bridge and OVS. The ping we do after the rollback also fails (since nmstate is not able to perform the rollback), and it ends with the handler trying to mark the NNCE as successful (which is wrong), but it cannot, since API server connectivity is broken.
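For context, the handler records the outcome of each policy in a per-node NodeNetworkConfigurationEnactment (NNCE) status. A sketch of the kind of condition it would try to write back in this scenario (field values here are illustrative assumptions, not taken from the attached logs):

```yaml
# Sketch of an NNCE status the handler attempts to write after the failed rollback.
# With API server connectivity broken, this update never lands.
status:
  conditions:
    - type: Available
      status: "True"                 # incorrect: the rollback actually failed
      reason: SuccessfullyConfigured # illustrative reason string
```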

I also suspect that nmstate 1.0 will fix this, since it does not allow the same slave to be attached to multiple devices in the first place, so it should be fixed as of CNV 2.8.

Comment 6 Quique Llorente 2021-01-13 11:39:04 UTC
Also note that restarting the node makes it accessible again.

Comment 7 Yossi Boaron 2021-01-13 11:46:52 UTC
@ellorent ,  I think you tagged the wrong Yossi

Comment 8 Quique Llorente 2021-01-13 12:42:18 UTC
Created attachment 1747053 [details]
NetworkManager at debug level

Comment 9 Quique Llorente 2021-01-13 12:50:57 UTC
Created attachment 1747055 [details]
NodeNetworkState before apply the policy

Comment 10 Quique Llorente 2021-01-13 12:51:24 UTC
Created attachment 1747056 [details]
policy applied

Comment 11 Quique Llorente 2021-01-13 14:48:14 UTC
Bug opened with the nmstate team: https://bugzilla.redhat.com/show_bug.cgi?id=1915850

Comment 12 Quique Llorente 2021-01-13 15:07:15 UTC
Just as a side note, restarting the worker restores connectivity.

Comment 13 Petr Horáček 2021-01-25 10:16:03 UTC
Moving this tracker to NEW. Keeping it until the linked nmstate bug gets resolved.

Comment 14 Quique Llorente 2021-05-24 09:23:11 UTC
Rollback is working fine at CNV 4.8 with nmstate 1.0.2; now we have to see if veth works fine too.

Comment 15 Quique Llorente 2021-05-24 09:40:53 UTC
(In reply to Quique Llorente from comment #14)
> Rollback is working fine at CNV 4.8 with nmstate 1.0.2; now we have to see
> if veth works fine too.

The cluster was using openshift-sdn, not OVNKubernetes.

Comment 16 Petr Horáček 2021-07-21 11:23:04 UTC
This should now be addressed in the latest rebuild of 4.9.

Comment 17 Ofir Nash 2021-08-08 10:47:34 UTC
Verified on a cluster with OVN Kubernetes networking.
Version verified: kubernetes-nmstate-handler-container v4.9.0-18

Steps verified:
1. Created and applied a Linux bridge over the default NIC (taken from here: https://bugzilla.redhat.com/show_bug.cgi?id=1885605)
2. The nodes that the NNCP was applied on recovered and are in status Ready:

[cnv-qe-jenkins@onash-490-ovn-9nbdm-executor extract-cnv-image-versions]$ oc get nodes
NAME                                 STATUS   ROLES    AGE    VERSION
onash-490-ovn-9nbdm-master-0         Ready    master   134m   v1.21.1+8268f88
onash-490-ovn-9nbdm-master-1         Ready    master   134m   v1.21.1+8268f88
onash-490-ovn-9nbdm-master-2         Ready    master   133m   v1.21.1+8268f88
onash-490-ovn-9nbdm-worker-0-8btgf   Ready    worker   117m   v1.21.1+8268f88
onash-490-ovn-9nbdm-worker-0-fvqv8   Ready    worker   117m   v1.21.1+8268f88
onash-490-ovn-9nbdm-worker-0-vzgqb   Ready    worker   113m   v1.21.1+8268f88

Comment 21 errata-xmlrpc 2021-11-02 15:57:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104