Description of problem:

When rebooting a node in a 4.7 cluster with network type OVNKubernetes, it does not come back up. The cluster was deployed with IPI. We see the following error during the reboot:

```
[   89.059633] configure-ovs.sh[1749]: + echo 'No default route found on attempt: 11'
[   89.059908] configure-ovs.sh[1749]: No default route found on attempt: 11
[   89.060168] configure-ovs.sh[1749]: + sleep 5
[   94.060274] configure-ovs.sh[1749]: + '[' 11 -lt 12 ']'
[   94.060727] configure-ovs.sh[1749]: ++ ip route show default
[   94.061351] configure-ovs.sh[1749]: ++ awk '{if ($4 == "dev") print $5; exit}'
[   94.063165] configure-ovs.sh[1749]: + iface=
[   94.063272] configure-ovs.sh[1749]: + [[ -n '' ]]
[   94.063824] configure-ovs.sh[1749]: ++ awk '{if ($4 == "dev") print $5; exit}'
[   94.065250] configure-ovs.sh[1749]: ++ ip -6 route show default
[   94.067261] configure-ovs.sh[1749]: + iface=
[   94.067309] configure-ovs.sh[1749]: + [[ -n '' ]]
[   94.067651] configure-ovs.sh[1749]: + counter=12
[   94.067813] configure-ovs.sh[1749]: + echo 'No default route found on attempt: 12'
[   94.068223] configure-ovs.sh[1749]: No default route found on attempt: 12
[   94.068511] configure-ovs.sh[1749]: + sleep 5
[   99.068545] configure-ovs.sh[1749]: + '[' 12 -lt 12 ']'
[   99.068749] configure-ovs.sh[1749]: + '[' '' = br-ex ']'
[   99.068959] configure-ovs.sh[1749]: + '[' -z '' ']'
[   99.069202] configure-ovs.sh[1749]: + echo 'ERROR: Unable to find default gateway interface'
[   99.069603] configure-ovs.sh[1749]: ERROR: Unable to find default gateway interface
[   99.069871] configure-ovs.sh[1749]: + exit 1
```

Version-Release number of selected component (if applicable):
4.7 nightly

How reproducible:
Every time

Steps to Reproduce:
1. Create a cluster
2. Reboot the nodes

Actual results:

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS                        ROLES               AGE     VERSION
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com          Ready                         worker,worker-cnf   6h51m   v1.19.2+4abb4a7
cnfdd6.clus2.t5g.lab.eng.bos.redhat.com          NotReady,SchedulingDisabled   worker,worker-cnf   6h52m   v1.19.2+4abb4a7
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com          NotReady,SchedulingDisabled   worker              6h52m   v1.19.2+4abb4a7
dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com   NotReady,SchedulingDisabled   master,virtual      8h      v1.19.2+4abb4a7
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      8h      v1.19.2+4abb4a7
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      8h      v1.19.2+4abb4a7

Expected results:
The nodes should come back up after a reboot.

Additional info:
In a cluster with network type OpenShiftSDN, the nodes come back up just fine after a reboot. Couldn't get a must-gather due to an error: "error: gather did not start for pod must-gather-8fm27: timed out waiting for the condition"
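For context, the failing check in the trace corresponds roughly to the following shell logic. This is a reconstruction based only on the logged commands above, not the verbatim configure-ovs.sh:

```bash
#!/bin/bash
# Reconstruction of the route-detection loop seen in the trace above.
# Poll up to 12 times for an interface carrying a default route.
counter=0
iface=""
while [ "$counter" -lt 12 ]; do
  # Interface of the IPv4 default route ("default via <gw> dev <iface> ...")
  iface=$(ip route show default | awk '{if ($4 == "dev") print $5; exit}')
  [[ -n "$iface" ]] && break
  # Otherwise, fall back to the IPv6 default route
  iface=$(ip -6 route show default | awk '{if ($4 == "dev") print $5; exit}')
  [[ -n "$iface" ]] && break
  counter=$((counter + 1))
  echo "No default route found on attempt: $counter"
  sleep 5
done

# The trace shows the real script also special-cases an interface already
# named br-ex at this point before giving up.
if [ -z "$iface" ]; then
  echo "ERROR: Unable to find default gateway interface"
  exit 1
fi
```

On the affected nodes neither `ip route show default` nor `ip -6 route show default` prints anything, so all 12 attempts fail and the service exits 1.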
does `ip -6 r show default` give multiple lines of output? If so, this is a dup of bug 1896898 (which will be fixed soon)
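For reference, a node with multiple IPv6 default routes (e.g. router advertisements from two routers) would show something like this hypothetical output:

```
$ ip -6 r show default
default via fe80::1 dev ens2f0 proto ra metric 100
default via fe80::2 dev ens2f1 proto ra metric 101
```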
(In reply to Dan Winship from comment #1)
> does `ip -6 r show default` give multiple lines of output? If so, this is a
> dup of bug 1896898 (which will be fixed soon)

It returns zero lines of output :-) for both IPv4 and IPv6. We get an IP only on the provisioning NIC; the baremetal interface is down. It's a 4.7 bug; it doesn't happen to us with 4.6 at all. Also note again: it happens after a reboot and doesn't happen with OpenShiftSDN.
If you can't get a must-gather, can you get an sosreport from one of the nodes that's in the rebooted-and-now-not-coming-up state? Or if not that, at least "ip a" and "ip r".
(In reply to Dan Winship from comment #3)
> If you can't get a must-gather, can you get an sosreport from one of the
> nodes that's in the rebooted-and-now-not-coming-up state? Or if not that,
> at least "ip a" and "ip r".

[core@localhost ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 40:a6:b7:0d:a0:60 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.115/24 brd 172.22.0.255 scope global dynamic noprefixroute ens2f0
       valid_lft 2671sec preferred_lft 2671sec
    inet6 fe80::42a6:b7ff:fe0d:a060/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 40:a6:b7:0d:a0:61 brd ff:ff:ff:ff:ff:ff
4: ens7f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 40:a6:b7:0d:61:00 brd ff:ff:ff:ff:ff:ff
5: ens7f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 40:a6:b7:0d:61:01 brd ff:ff:ff:ff:ff:ff
6: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
    link/ether 0c:42:a1:6c:e0:74 brd ff:ff:ff:ff:ff:ff
7: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 0c:42:a1:6c:e0:75 brd ff:ff:ff:ff:ff:ff
8: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 26:23:20:0e:3b:c7 brd ff:ff:ff:ff:ff:ff
9: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether 4e:05:d4:b8:7c:78 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::4c05:d4ff:feb8:7c78/64 scope link
       valid_lft forever preferred_lft forever
10: ovn-k8s-mp0: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
    link/ether 6a:03:af:44:2c:4b brd ff:ff:ff:ff:ff:ff
11: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
    link/ether ce:92:34:2a:f8:d1 brd ff:ff:ff:ff:ff:ff
12: ovn-k8s-gw0: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
    link/ether 0a:58:a9:fe:00:01 brd ff:ff:ff:ff:ff:ff
13: br-local: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
    link/ether 96:d2:bc:19:df:4d brd ff:ff:ff:ff:ff:ff

[core@localhost ~]$ ip r
172.22.0.0/24 dev ens2f0 proto kernel scope link src 172.22.0.115 metric 100
I followed this doc to get an sosreport - https://access.redhat.com/solutions/3820762 - but couldn't because of the network issue. Is there a workaround?

[root@cnfdt18-installer ~]# ssh core.0.115
Red Hat Enterprise Linux CoreOS 47.82.202011171242-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
Last login: Wed Nov 18 13:27:29 2020 from 172.22.0.253
[systemd]
Failed Units: 4
  afterburn-hostname.service
  NetworkManager-wait-online.service
  node-valid-hostname.service
  ovs-configuration.service

[core@localhost ~]$ sudo -i
[systemd]
Failed Units: 4
  afterburn-hostname.service
  NetworkManager-wait-online.service
  node-valid-hostname.service
  ovs-configuration.service

[root@localhost ~]# toolbox
Trying to pull registry.redhat.io/rhel8/support-tools...
  Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 10.46.0.32:53: dial udp 10.46.0.32:53: connect: network is unreachable
Error: error pulling image "registry.redhat.io/rhel8/support-tools": unable to pull registry.redhat.io/rhel8/support-tools: unable to pull image: Error initializing source docker://registry.redhat.io/rhel8/support-tools:latest: error pinging docker registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 10.46.0.32:53: dial udp 10.46.0.32:53: connect: network is unreachable
Would you like to manually authenticate to registry: 'registry.redhat.io' and try again? [y/N] y
Username: ykashtan
Password:
Error: error authenticating creds for "registry.redhat.io": error pinging docker registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 10.46.0.32:53: dial udp 10.46.0.32:53: connect: network is unreachable
[root@localhost ~]# journalctl -xn
-- Logs begin at Wed 2020-11-18 10:18:07 UTC, end at Wed 2020-11-18 12:48:17 UTC. --
Nov 18 12:48:14 localhost.localdomain bash[4517]: time="2020-11-18T12:48:14Z" level=error msg="Failed to find a suitable node IP"
Nov 18 12:48:15 localhost.localdomain bash[4517]: time="2020-11-18T12:48:15Z" level=info msg="Checking whether address 127.0.0.1/8 lo contains VIP 10.46.55.109"
Nov 18 12:48:15 localhost.localdomain bash[4517]: time="2020-11-18T12:48:15Z" level=info msg="Checking whether address 172.22.0.115/24 ens2f0 contains VIP 10.46.55.109"
Nov 18 12:48:15 localhost.localdomain bash[4517]: time="2020-11-18T12:48:15Z" level=error msg="Failed to find a suitable node IP"
Nov 18 12:48:16 localhost.localdomain bash[4517]: time="2020-11-18T12:48:16Z" level=info msg="Checking whether address 127.0.0.1/8 lo contains VIP 10.46.55.109"
Nov 18 12:48:16 localhost.localdomain bash[4517]: time="2020-11-18T12:48:16Z" level=info msg="Checking whether address 172.22.0.115/24 ens2f0 contains VIP 10.46.55.109"
Nov 18 12:48:16 localhost.localdomain bash[4517]: time="2020-11-18T12:48:16Z" level=error msg="Failed to find a suitable node IP"
Nov 18 12:48:17 localhost.localdomain bash[4517]: time="2020-11-18T12:48:17Z" level=info msg="Checking whether address 127.0.0.1/8 lo contains VIP 10.46.55.109"
Nov 18 12:48:17 localhost.localdomain bash[4517]: time="2020-11-18T12:48:17Z" level=info msg="Checking whether address 172.22.0.115/24 ens2f0 contains VIP 10.46.55.109"
Nov 18 12:48:17 localhost.localdomain bash[4517]: time="2020-11-18T12:48:17Z" level=error msg="Failed to find a suitable node IP"
[systemd]
Failed Units: 4
  afterburn-hostname.service
  NetworkManager-wait-online.service
  node-valid-hostname.service
  ovs-configuration.service
Do I get this right that the nodes do not have a default route set?
Created attachment 1733286 [details]
screenshot
Yes, no default route: no IP, nothing on the baremetal interface. The provisioning VLAN is fine.
Can you compare with the nodes that don't have this issue? The fact that the node doesn't have a default route doesn't look right to me.
*** Bug 1902674 has been marked as a duplicate of this bug. ***
Found this bug separately and filed yet another duplicate of it in: https://bugzilla.redhat.com/show_bug.cgi?id=1903152#c0.
*** Bug 1903152 has been marked as a duplicate of this bug. ***
Copying the description from the duplicate since it contains a bit more investigation:

Description of problem:

4.7 makes use of OverlayFS to make sure that any changes that happen at runtime stay only at runtime. To do that, it mounts an OverlayFS at a new directory, /etc/NetworkManager/system-connections-merged, and tells NetworkManager to use it as its source of system connection configuration:

  lowerdir=/etc/NetworkManager/system-connections,upperdir=/run/nm-system-connections,workdir=/run/nm-system-connections-work

This happens before NetworkManager runs, as NetworkManager needs to be started pointing to /etc/NetworkManager/system-connections-merged. So the mount gets set up just after systemd finishes setting up the temporary directories.

Another part of the networking setup, done by ovs-configuration.service, is in charge of configuring NetworkManager and Open vSwitch for OVN-Kubernetes. The setup consists of checking which NetworkManager connection carries the default gateway and morphing it into a bridged connection.

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 4.7 with OVN Kubernetes
2. oc debug node/mynodename
3. chroot /host
4. systemctl reboot

Actual results:
The node boots up and is unable to set up its networking, so it appears as NotReady in `oc get nodes`. It also can't be accessed via `oc debug`.

Expected results:
After a short time, mynodename shows up as Ready in `oc get nodes` and can be accessed by doing oc debug node/mynodename.

Additional info:
The reason for this is that the NetworkManager configuration written by ovs-configuration.service ends up being ephemeral due to OverlayFS (its upper directory lives on /run), whereas the ovsdb configuration that comes from the same service is not. The inconsistency makes it impossible to boot.

Workarounds:
While the bug is being worked on, one can do the following to be able to reboot the nodes (see the sketch below):
1. oc debug into each node after it appears as Ready
2. Copy the contents of /etc/NetworkManager/system-connections-merged into /etc/NetworkManager/system-connections
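For illustration, the merged-view mount described above amounts to something like the following. The paths are taken from the description itself; in reality a systemd mount unit sets this up, so treat this as a sketch:

```bash
# Sketch of the OverlayFS mount described above. Because upperdir and workdir
# live under /run (a tmpfs), any connection profile NetworkManager writes
# through the merged view disappears on reboot.
mkdir -p /run/nm-system-connections /run/nm-system-connections-work
mount -t overlay overlay \
  -o lowerdir=/etc/NetworkManager/system-connections,upperdir=/run/nm-system-connections,workdir=/run/nm-system-connections-work \
  /etc/NetworkManager/system-connections-merged
```

And the workaround boils down to persisting the merged view back into the real lower directory, e.g.:

```bash
# Run from `oc debug node/<name>` followed by `chroot /host`; the file names
# are whatever connection profiles exist in the merged directory.
cp /etc/NetworkManager/system-connections-merged/* /etc/NetworkManager/system-connections/
```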
@yprokule @mcornea Hey guys, can you help move this to VERIFIED if you don't see this issue on your env? OpenShift QE currently doesn't have access to an IPI BM env.
Verified on 4.7.0-0.nightly-2020-12-09-112139
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633