Bug 1896469 - In cluster with OVN Kubernetes networking - a node doesn't recover when configuring linux-bridge over its default NIC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.9.0
Assignee: Quique Llorente
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On: 1915850
Blocks:
Reported: 2020-11-10 16:06 UTC by Yossi Segev
Modified: 2021-11-02 15:58 UTC (History)

Fixed In Version: kubernetes-nmstate-handler-container-v4.9.0-10
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-02 15:57:26 UTC
Target Upstream Version:
Embargoed:
yboaron: needinfo-


Attachments
nmstate crictl logs cnv 2.6 (51.06 KB, text/plain)
2021-01-13 10:31 UTC, Quique Llorente
NetworkManager at debug level (7.66 MB, text/plain)
2021-01-13 12:42 UTC, Quique Llorente
NodeNetworkState before apply the policy (76 bytes, text/plain)
2021-01-13 12:50 UTC, Quique Llorente
policy applied (351 bytes, text/plain)
2021-01-13 12:51 UTC, Quique Llorente


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:4104 0 None None None 2021-11-02 15:58:40 UTC

Internal Links: 1915850

Description Yossi Segev 2020-11-10 16:06:32 UTC
Description of problem:
Due to bz 1885605, a linux-bridge cannot be configured over the default NIC of a node in a cluster with OVN Kubernetes networking.
When this configuration is applied, the node goes into the "NotReady" state and does not recover until it is rebooted.


Version-Release number of selected component (if applicable):
OCP 4.6
CNV 2.5


How reproducible:
Always


Steps to Reproduce:
Follow the steps of the description of bz 1885605 (https://bugzilla.redhat.com/show_bug.cgi?id=1885605#c0)

Actual results:
Each node on which this configuration is applied goes into the "NotReady" state:
# oc get nodes
NAME       STATUS     ROLES    AGE    VERSION
worker-1   NotReady   worker   101m   v1.19.0+d59ce34



Expected results:
Node recovers and gets back to "Ready" state.


Additional info:
Workaround: Reboot the node (I only managed to do it via virsh):
# virsh list
 Id   Name              State
---------------------------------
 18   ostest_master_1   running
 19   ostest_master_2   running
 20   ostest_master_0   running
 23   ostest_worker_0   running
 24   ostest_worker_1   running

# virsh reboot ostest_worker_1

Comment 1 Quique Llorente 2020-12-14 13:34:02 UTC
Created attachment 1738972 [details]
nmstate handler pod logs

Comment 2 Petr Horáček 2021-01-06 11:39:58 UTC
Deferring this to 2.7. OVN is still a tech preview, and we document that reconfiguring the default interface with knmstate is not allowed when OVN is used.

Comment 3 Quique Llorente 2021-01-11 12:26:45 UTC
@yboaron can we retest this with the latest CNV?

Comment 4 Quique Llorente 2021-01-13 10:31:52 UTC
Created attachment 1746999 [details]
nmstate crictl logs cnv 2.6

These are the logs taken directly from the node. Since we lose TCP connectivity, they were collected using OpenStack noVNC.

Comment 5 Quique Llorente 2021-01-13 11:26:59 UTC
It looks like nmstate is not able to roll back this kind of configuration, since it involves both linux-bridge and OVS. The ping we do after the rollback also fails (because nmstate could not do the rollback), and it ends with the handler trying to mark the NNCE as successful (which is wrong), but it cannot, since apiserver connectivity is broken.

Also, I suspect that nmstate 1.0 will fix this, since it refuses from the start to attach the same slave to multiple devices, so it should be fixed in CNV 2.8.
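
The sequence described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual kubernetes-nmstate handler code: the callables `apply`, `probe`, and `rollback` and the status labels `"success"` and `"failure"` are hypothetical placeholders. The point it shows is that a failed connectivity probe should always end in a failure report, even when the rollback itself does not restore connectivity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ApplyResult:
    status: str  # "success" or "failure" (hypothetical labels)
    events: List[str] = field(default_factory=list)


def apply_with_rollback(
    apply: Callable[[], None],
    probe: Callable[[], bool],
    rollback: Callable[[], None],
) -> ApplyResult:
    """Apply a network change, probe connectivity, roll back on failure."""
    events: List[str] = []

    apply()
    events.append("applied")

    if probe():
        events.append("probe-ok")
        return ApplyResult("success", events)

    events.append("probe-failed")
    rollback()
    events.append("rolled-back")

    # Probe again after the rollback. The outcome is recorded for
    # diagnostics only; the result stays "failure" either way.
    if probe():
        events.append("post-rollback-probe-ok")
    else:
        events.append("post-rollback-probe-failed")
    return ApplyResult("failure", events)


if __name__ == "__main__":
    # Simulate the situation from this bug: connectivity never comes back.
    result = apply_with_rollback(lambda: None, lambda: False, lambda: None)
    print(result.status, result.events)
```

In the scenario reported here, reporting the result would additionally require apiserver connectivity, which the sketch does not model.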

Comment 6 Quique Llorente 2021-01-13 11:39:04 UTC
Also note that restarting the node makes it accessible again.

Comment 7 Yossi Boaron 2021-01-13 11:46:52 UTC
@ellorent, I think you tagged the wrong Yossi.

Comment 8 Quique Llorente 2021-01-13 12:42:18 UTC
Created attachment 1747053 [details]
NetworkManager at debug level

Comment 9 Quique Llorente 2021-01-13 12:50:57 UTC
Created attachment 1747055 [details]
NodeNetworkState before apply the policy

Comment 10 Quique Llorente 2021-01-13 12:51:24 UTC
Created attachment 1747056 [details]
policy applied

Comment 11 Quique Llorente 2021-01-13 14:48:14 UTC
Bug opened with the nmstate team: https://bugzilla.redhat.com/show_bug.cgi?id=1915850

Comment 12 Quique Llorente 2021-01-13 15:07:15 UTC
Just as a side note, restarting the worker restores the connectivity.

Comment 13 Petr Horáček 2021-01-25 10:16:03 UTC
Moving this tracker to NEW. Keeping it until the linked nmstate bug gets resolved.

Comment 14 Quique Llorente 2021-05-24 09:23:11 UTC
Rollback is working fine in CNV 4.8 with nmstate 1.0.2; now we have to see if veth works fine too.

Comment 15 Quique Llorente 2021-05-24 09:40:53 UTC
(In reply to Quique Llorente from comment #14)
> Rollback is working fine in CNV 4.8 with nmstate 1.0.2; now we have to see
> if veth works fine too.

The cluster was using openshift-sdn, not OVNKubernetes.

Comment 16 Petr Horáček 2021-07-21 11:23:04 UTC
This should now be addressed in the latest rebuild of 4.9.

Comment 17 Ofir Nash 2021-08-08 10:47:34 UTC
Verified on a cluster with OVN Kubernetes networking.
Version verified: kubernetes-nmstate-handler-container v4.9.0-18

Steps verified:
1. Created and applied a Linux bridge over the default NIC (taken from here: https://bugzilla.redhat.com/show_bug.cgi?id=1885605)
2. The nodes that the NNCP was applied to recovered and are in Ready status:

[cnv-qe-jenkins@onash-490-ovn-9nbdm-executor extract-cnv-image-versions]$ oc get nodes
NAME                                 STATUS   ROLES    AGE    VERSION
onash-490-ovn-9nbdm-master-0         Ready    master   134m   v1.21.1+8268f88
onash-490-ovn-9nbdm-master-1         Ready    master   134m   v1.21.1+8268f88
onash-490-ovn-9nbdm-master-2         Ready    master   133m   v1.21.1+8268f88
onash-490-ovn-9nbdm-worker-0-8btgf   Ready    worker   117m   v1.21.1+8268f88
onash-490-ovn-9nbdm-worker-0-fvqv8   Ready    worker   117m   v1.21.1+8268f88
onash-490-ovn-9nbdm-worker-0-vzgqb   Ready    worker   113m   v1.21.1+8268f88
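
The readiness check in step 2 could also be scripted. The following is a minimal sketch, assuming the default tabular output of `oc get nodes` with `NAME` and `STATUS` columns as shown above; the function name `not_ready_nodes` is a hypothetical helper introduced for illustration, not part of any tool mentioned in this bug.

```python
from typing import List


def not_ready_nodes(oc_output: str) -> List[str]:
    """Return names of nodes whose STATUS column does not include Ready."""
    lines = [ln for ln in oc_output.splitlines() if ln.strip()]
    if not lines:
        return []

    # Locate the columns by header name rather than by fixed position.
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")

    bad: List[str] = []
    for ln in lines[1:]:
        cols = ln.split()
        # STATUS may be a comma-separated list, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in cols[status_idx].split(","):
            bad.append(cols[name_idx])
    return bad


if __name__ == "__main__":
    sample = (
        "NAME       STATUS     ROLES    AGE    VERSION\n"
        "worker-1   NotReady   worker   101m   v1.19.0+d59ce34\n"
    )
    print(not_ready_nodes(sample))
```

An empty result means every listed node reports Ready.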

Comment 21 errata-xmlrpc 2021-11-02 15:57:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104

