Description of problem:
After a node is rebooted, it loses network connectivity.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-22-134922

How reproducible:
Hit this issue in two clusters so far; both are UPI vSphere OVN clusters.

Steps to Reproduce:
1. Create a UPI vSphere OVN cluster.
2. Reboot one node (a sample reboot command is sketched below).
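For reference, one way to perform step 2 (a hedged sketch; the report does not say which reboot path was used, and compute-0 is simply the node that hit the issue here):

  # Reboot a worker through a debug pod on the host:
  oc debug node/compute-0 -- chroot /host systemctl reboot

  # Or over SSH, if the core user is reachable:
  ssh core@compute-0 'sudo systemctl reboot'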
Actual results:
The rebooted node stays in NotReady status.

oc get nodes -o wide
NAME                                STATUS     ROLES    AGE    VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
compute-0                           NotReady   worker   2d1h   v1.20.0+d9c52cc   172.31.246.28    172.31.246.28    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
compute-1                           Ready      worker   2d1h   v1.20.0+d9c52cc   172.31.246.22    172.31.246.22    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
control-plane-0                     Ready      master   2d2h   v1.20.0+d9c52cc   172.31.246.24    172.31.246.24    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
control-plane-1                     Ready      master   2d2h   v1.20.0+d9c52cc   172.31.246.19    172.31.246.19    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
control-plane-2                     Ready      master   2d2h   v1.20.0+d9c52cc   172.31.246.26    172.31.246.26    Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42
xiuwang-shared-w9nc5-worker-xvxdw   Ready      worker   25h    v1.20.0+d9c52cc   172.31.247.117   172.31.247.117   Red Hat Enterprise Linux CoreOS 47.83.202101171239-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd9f17c8.el8.42

[core@compute-1 ~]$ ping 172.31.246.28
PING 172.31.246.28 (172.31.246.28) 56(84) bytes of data.
From 172.31.246.22 icmp_seq=1 Destination Host Unreachable
From 172.31.246.22 icmp_seq=2 Destination Host Unreachable
From 172.31.246.22 icmp_seq=3 Destination Host Unreachable
From 172.31.246.22 icmp_seq=4 Destination Host Unreachable
From 172.31.246.22 icmp_seq=5 Destination Host Unreachable
From 172.31.246.22 icmp_seq=6 Destination Host Unreachable
^C
--- 172.31.246.28 ping statistics ---
7 packets transmitted, 0 received, +6 errors, 100% packet loss, time 134ms
pipe 3

oc get pods -n openshift-ovn-kubernetes -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP               NODE                                NOMINATED NODE   READINESS GATES
ovnkube-master-p6mmt   6/6     Running   1          2d2h   172.31.246.19    control-plane-1                     <none>           <none>
ovnkube-master-p84jb   6/6     Running   3          2d2h   172.31.246.26    control-plane-2                     <none>           <none>
ovnkube-master-xck5f   6/6     Running   3          2d2h   172.31.246.24    control-plane-0                     <none>           <none>
ovnkube-node-2pbrt     3/3     Running   0          2d2h   172.31.246.28    compute-0                           <none>           <none>
ovnkube-node-444dz     3/3     Running   0          2d2h   172.31.246.22    compute-1                           <none>           <none>
ovnkube-node-4kz9g     3/3     Running   0          26h    172.31.247.117   xiuwang-shared-w9nc5-worker-xvxdw   <none>           <none>
ovnkube-node-cfckr     3/3     Running   0          2d2h   172.31.246.26    control-plane-2                     <none>           <none>
ovnkube-node-kdbpx     3/3     Running   0          2d2h   172.31.246.24    control-plane-0                     <none>           <none>
ovnkube-node-q58gd     3/3     Running   0          2d2h   172.31.246.19    control-plane-1                     <none>           <none>
ovs-node-4w575         1/1     Running   0          2d2h   172.31.246.19    control-plane-1                     <none>           <none>
ovs-node-cfrss         1/1     Running   0          2d2h   172.31.246.22    compute-1                           <none>           <none>
ovs-node-dpg9l         1/1     Running   0          26h    172.31.247.117   xiuwang-shared-w9nc5-worker-xvxdw   <none>           <none>
ovs-node-jdx4j         1/1     Running   0          2d2h   172.31.246.28    compute-0                           <none>           <none>
ovs-node-rc9cz         1/1     Running   0          2d2h   172.31.246.26    control-plane-2                     <none>           <none>
ovs-node-sb44p         1/1     Running   0          2d2h   172.31.246.24    control-plane-0                     <none>           <none>

oc logs ovnkube-node-2pbrt -n openshift-ovn-kubernetes -c ovnkube-node
Error from server: Get "https://172.31.246.28:10250/containerLogs/openshift-ovn-kubernetes/ovnkube-node-2pbrt/ovnkube-node": dial tcp 172.31.246.28:10250: connect: no route to host

Since the NotReady node cannot be reached over the network, it was checked from the vSphere console: the node has lost its IP address. Screenshot attached.

Expected results:
The node should work normally after a reboot.

Additional info:
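A minimal set of checks that can be run from the vSphere web console of the affected VM to confirm the symptom (a sketch, not output captured from this cluster; these are standard commands on RHCOS):

  ip addr show                                 # confirm no address is assigned to the uplink / br-ex
  nmcli device status                          # NetworkManager's view of each device
  systemctl status ovs-configuration.service   # the unit that sets up br-ex on OVN clusters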
Created attachment 1751115 [details]
Screenshot from vSphere console
I managed to recover the node by opening a web console through the vSphere UI and modifying the kernel args to boot into single-user mode. The problem is that ovs-configuration.service cannot perform "nmcli conn up ovs-if-phys0", and there seems to be a problem between NetworkManager and the OVS DB, as connection failures are logged. I am investigating why that is.
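For anyone hitting the same failure, a sketch of commands to narrow down the NetworkManager / OVS DB interaction from the recovered node (assumes console or SSH access; only the connection name ovs-if-phys0 is taken from the failure above, and related profile names may differ per cluster):

  journalctl -b -u ovs-configuration.service --no-pager     # where "nmcli conn up ovs-if-phys0" failed
  journalctl -b -u NetworkManager --no-pager | grep -i ovs  # NetworkManager <-> OVS DB connection errors
  ovs-vsctl show                                            # current OVS DB contents (br-ex and its ports)
  nmcli connection show                                     # is the ovs-if-phys0 profile present?
  nmcli connection up ovs-if-phys0                          # retry the failing activation by hand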
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633