Description of problem:
After a node is rebooted, it loses network connectivity.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-22-134922

How reproducible:
Hit this issue in two clusters so far; both are UPI vSphere OVN clusters.

Steps to Reproduce:
1. Create a UPI vSphere OVN cluster.
2. Reboot one node (a sample reboot command is sketched below).
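For reference, one way to perform step 2 (a hedged sketch; the report does not say which reboot path was used, and compute-0 is simply the node that hit the issue here):

  # Reboot a worker through a debug pod on the host:
  oc debug node/compute-0 -- chroot /host systemctl reboot

  # Or over SSH, if the core user is reachable:
  ssh core@compute-0 'sudo systemctl reboot'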
Actual results:
The rebooted node stays in NotReady status.

oc get nodes -o wide
NAME                                STATUS     ROLES    AGE    VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
compute-0                           NotReady   worker   2d1h   v1.20.0+d9c52cc   172.31.246.28    172.31.246.28    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
compute-1                           Ready      worker   2d1h   v1.20.0+d9c52cc   172.31.246.22    172.31.246.22    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
control-plane-0                     Ready      master   2d2h   v1.20.0+d9c52cc   172.31.246.24    172.31.246.24    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
control-plane-1                     Ready      master   2d2h   v1.20.0+d9c52cc   172.31.246.19    172.31.246.19    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
control-plane-2                     Ready      master   2d2h   v1.20.0+d9c52cc   172.31.246.26    172.31.246.26    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
xiuwang-shared-w9nc5-worker-xvxdw   Ready      worker   25h    v1.20.0+d9c52cc   172.31.247.117   172.31.247.117   Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42

[core@compute-1 ~]$ ping 172.31.246.28
PING 172.31.246.28 (172.31.246.28) 56(84) bytes of data.
From 172.31.246.22 icmp_seq=1 Destination Host Unreachable
From 172.31.246.22 icmp_seq=2 Destination Host Unreachable
From 172.31.246.22 icmp_seq=3 Destination Host Unreachable
From 172.31.246.22 icmp_seq=4 Destination Host Unreachable
From 172.31.246.22 icmp_seq=5 Destination Host Unreachable
From 172.31.246.22 icmp_seq=6 Destination Host Unreachable
^C
--- 172.31.246.28 ping statistics ---
7 packets transmitted, 0 received, +6 errors, 100% packet loss, time 134ms
pipe 3

oc get pods -n openshift-ovn-kubernetes -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP               NODE                                NOMINATED NODE   READINESS GATES
ovnkube-master-p6mmt   6/6     Running   1          2d2h   172.31.246.19    control-plane-1                     <none>           <none>
ovnkube-master-p84jb   6/6     Running   3          2d2h   172.31.246.26    control-plane-2                     <none>           <none>
ovnkube-master-xck5f   6/6     Running   3          2d2h   172.31.246.24    control-plane-0                     <none>           <none>
ovnkube-node-2pbrt     3/3     Running   0          2d2h   172.31.246.28    compute-0                           <none>           <none>
ovnkube-node-444dz     3/3     Running   0          2d2h   172.31.246.22    compute-1                           <none>           <none>
ovnkube-node-4kz9g     3/3     Running   0          26h    172.31.247.117   xiuwang-shared-w9nc5-worker-xvxdw   <none>           <none>
ovnkube-node-cfckr     3/3     Running   0          2d2h   172.31.246.26    control-plane-2                     <none>           <none>
ovnkube-node-kdbpx     3/3     Running   0          2d2h   172.31.246.24    control-plane-0                     <none>           <none>
ovnkube-node-q58gd     3/3     Running   0          2d2h   172.31.246.19    control-plane-1                     <none>           <none>
ovs-node-4w575         1/1     Running   0          2d2h   172.31.246.19    control-plane-1                     <none>           <none>
ovs-node-cfrss         1/1     Running   0          2d2h   172.31.246.22    compute-1                           <none>           <none>
ovs-node-dpg9l         1/1     Running   0          26h    172.31.247.117   xiuwang-shared-w9nc5-worker-xvxdw   <none>           <none>
ovs-node-jdx4j         1/1     Running   0          2d2h   172.31.246.28    compute-0                           <none>           <none>
ovs-node-rc9cz         1/1     Running   0          2d2h   172.31.246.26    control-plane-2                     <none>           <none>
ovs-node-sb44p         1/1     Running   0          2d2h   172.31.246.24    control-plane-0                     <none>           <none>

oc logs ovnkube-node-2pbrt -n openshift-ovn-kubernetes -c ovnkube-node
Error from server: Get "https://172.31.246.28:10250/containerLogs/openshift-ovn-kubernetes/ovnkube-node-2pbrt/ovnkube-node": dial tcp 172.31.246.28:10250: connect: no route to host

Since the NotReady node cannot be reached over the network, it was checked from the vSphere console: the node has lost its IP address. Screenshot attached.

Expected results:
The node should work normally after a reboot.

Additional info:
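A minimal set of checks that can be run from the vSphere web console of the affected VM to confirm the symptom (a sketch, not output captured from this cluster; these are standard commands on RHCOS):

  ip addr show                                 # confirm no address is assigned to the uplink / br-ex
  nmcli device status                          # NetworkManager's view of each device
  systemctl status ovs-configuration.service   # the unit that sets up br-ex on OVN clusters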
Created attachment 1751115 [details]
Screenshot from vSphere console
I managed to recover the node by opening a web console through the vSphere UI and modifying the kernel args to boot into single-user mode. The problem is that ovs-configuration.service cannot perform "nmcli conn up ovs-if-phys0", and there seems to be a problem between NetworkManager and the OVS DB, as connection failures are logged. I am investigating why that is.
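For anyone hitting the same failure, a sketch of commands to narrow down the NetworkManager / OVS DB interaction from the recovered node (assumes console or SSH access; only the connection name ovs-if-phys0 is taken from the failure above, and related profile names may differ per cluster):

  journalctl -b -u ovs-configuration.service --no-pager     # where "nmcli conn up ovs-if-phys0" failed
  journalctl -b -u NetworkManager --no-pager | grep -i ovs  # NetworkManager <-> OVS DB connection errors
  ovs-vsctl show                                            # current OVS DB contents (br-ex and its ports)
  nmcli connection show                                     # is the ovs-if-phys0 profile present?
  nmcli connection up ovs-if-phys0                          # retry the failing activation by hand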
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633