I have just followed the procedure to migrate from OpenShiftSDN to OVNKubernetes on OCP 4.7.32 and I'm getting the exact same issue. After the nodes reboot, one of the interfaces is not created, so all ovnkube-node pods end up in CrashLoopBackOff because the ovnkube-node container is unable to start on any node (RHCOS and RHEL):

    $ oc get pods -o wide -n openshift-ovn-kubernetes
    NAME                   READY   STATUS             RESTARTS   AGE     IP               NODE                                         NOMINATED NODE   READINESS GATES
    ovnkube-master-9s48h   6/6     Running            0          3h49m   172.23.191.54    master-1.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-master-9vlhf   6/6     Running            0          3h49m   172.23.191.59    master-0.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-master-b882v   6/6     Running            2          3h49m   172.23.191.191   master-2.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-node-2q4wn     2/3     CrashLoopBackOff   5          173m    172.23.191.191   master-2.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-node-4hwkp     2/3     CrashLoopBackOff   20         173m    172.23.191.212   infra-1.prod-openshift4.redhatrules.local    <none>           <none>
    ovnkube-node-7mnl5     2/3     CrashLoopBackOff   5          173m    172.23.191.54    master-1.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-node-8fvhf     2/3     CrashLoopBackOff   5          173m    172.23.191.59    master-0.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-node-gpc4z     2/3     CrashLoopBackOff   19         173m    172.23.191.249   worker-1.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-node-hj89f     2/3     CrashLoopBackOff   20         173m    172.23.191.112   infra-2.prod-openshift4.redhatrules.local    <none>           <none>
    ovnkube-node-pmbtr     2/3     CrashLoopBackOff   19         173m    172.23.191.96    worker-0.prod-openshift4.redhatrules.local   <none>           <none>
    ovnkube-node-rqkwt     2/3     CrashLoopBackOff   27         173m    172.23.191.173   infra-0.prod-openshift4.redhatrules.local    <none>           <none>

Many other pods are in ContainerCreating state because of this.

Looking at the container, the error is the same:

    ovnkube-node:
      Container ID:  cri-o://f0e9430ab6537b698178895f51712b7917a93e43980f8bff1e5ddc1c3ec52f59
      Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ccd8b2cae88b05b57bb03d7321c2d59585471f25fda74c773b208a124cde1dfd
      Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ccd8b2cae88b05b57bb03d7321c2d59585471f25fda74c773b208a124cde1dfd
      Port:          29103/TCP
      Host Port:     29103/TCP
      Command:
        /bin/bash
        -c
        set -xe
        if [[ -f "/env/${K8S_NODE}" ]]; then
          set -o allexport
          source "/env/${K8S_NODE}"
          set +o allexport
        fi
        echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db_ip addresses"
        cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
        ovn_config_namespace=openshift-ovn-kubernetes
        echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port"
        iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
        iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
        retries=0
        while true; do
          # TODO: change to use '--request-timeout=30s', if https://github.com/kubernetes/kubernetes/issues/49343 is fixed.
          db_ip=$(timeout 30 kubectl get ep -n ${ovn_config_namespace} ovnkube-db -o jsonpath='{.subsets[0].addresses[0].ip}')
          if [[ -n "${db_ip}" ]]; then
            break
          fi
          (( retries += 1 ))
          if [[ "${retries}" -gt 40 ]]; then
            echo "E$(date "+%m%d %H:%M:%S.%N") - db endpoint never came up"
            exit 1
          fi
          echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db endpoint"
          sleep 5
        done

        echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node db_ip ${db_ip}"

        gateway_mode_flags=
        # Check to see if ovs is provided by the node. This is only for upgrade from 4.5->4.6 or
        # openshift-sdn to ovn-kube conversion
        if grep -q OVNKubernetes /etc/systemd/system/ovs-configuration.service ; then
          gateway_mode_flags="--gateway-mode local --gateway-interface br-ex"
        else
          gateway_mode_flags="--gateway-mode local --gateway-interface none"
        fi

        exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
          --nb-address "ssl:172.23.191.191:9641,ssl:172.23.191.54:9641,ssl:172.23.191.59:9641" \
          --sb-address "ssl:172.23.191.191:9642,ssl:172.23.191.54:9642,ssl:172.23.191.59:9642" \
          --nb-client-privkey /ovn-cert/tls.key \
          --nb-client-cert /ovn-cert/tls.crt \
          --nb-client-cacert /ovn-ca/ca-bundle.crt \
          --nb-cert-common-name "ovn" \
          --sb-client-privkey /ovn-cert/tls.key \
          --sb-client-cert /ovn-cert/tls.crt \
          --sb-client-cacert /ovn-ca/ca-bundle.crt \
          --sb-cert-common-name "ovn" \
          --config-file=/run/ovnkube-config/ovnkube.conf \
          --loglevel "${OVN_KUBE_LOG_LEVEL}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          ${gateway_mode_flags} \
          --metrics-bind-address "127.0.0.1:29103"
      State:          Waiting
        Reason:       CrashLoopBackOff
      Last State:     Terminated
        Reason:       Error
        Message:      18438 ovs.go:173] exec(3): stderr: ""
    I1101 15:44:09.603900   18438 ovs.go:169] exec(4): /usr/bin/ovs-ofctl dump-aggregate br-int
    I1101 15:44:09.624070   18438 ovs.go:172] exec(4): stdout: "NXST_AGGREGATE reply (xid=0x4): packet_count=0 byte_count=0 flow_count=4160\n"
    I1101 15:44:09.624139   18438 ovs.go:173] exec(4): stderr: ""
    I1101 15:44:09.624280   18438 ovs.go:169] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-master-0.pr -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_request=1400 external-ids:iface-id=k8s-master-0.prod-openshift4.redhatrules.local
    I1101 15:44:09.645430   18438 ovs.go:172] exec(5): stdout: ""
    I1101 15:44:09.645535   18438 ovs.go:173] exec(5): stderr: ""
    I1101 15:44:09.645637   18438 ovs.go:169] exec(6): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface ovn-k8s-mp0 mac_in_use
    I1101 15:44:09.662635   18438 ovs.go:172] exec(6): stdout: "\"ee:81:e3:04:05:dc\"\n"
    I1101 15:44:09.662743   18438 ovs.go:173] exec(6): stderr: ""
    I1101 15:44:09.662870   18438 ovs.go:169] exec(7): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=ee\:81\:e3\:04\:05\:dc
    I1101 15:44:09.679184   18438 ovs.go:172] exec(7): stdout: ""
    I1101 15:44:09.679290   18438 ovs.go:173] exec(7): stderr: ""
    I1101 15:44:09.796423   18438 gateway_init.go:162] Initializing Gateway Functionality
    I1101 15:44:09.797592   18438 gateway_localnet.go:173] Node local addresses initialized to: map[10.131.16.2:{10.131.16.0 fffff000} 127.0.0.1:{127.0.0.0 ff000000} 172.23.191.59:{172.23.188.0 fffffc00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::d0e1:beff:fed1:b453:{fe80:: ffffffffffffffff0000000000000000} fe80::ec81:e3ff:fe04:5dc:{fe80:: ffffffffffffffff0000000000000000} fe80::f2e1:222:f898:9e39:{fe80:: ffffffffffffffff0000000000000000}]
    I1101 15:44:09.798372   18438 helper_linux.go:73] Found default gateway interface enp1s0 172.23.188.2
    F1101 15:44:09.798601   18438 ovnkube.go:130] could not find IP addresses: failed to lookup link none: Link not found

Looking at a master, for example, it seems that many interfaces are not created and that some from OpenShiftSDN are still present:

    [root@master-2 ~]# nmcli -p dev
    =====================
      Status of devices
    =====================
    DEVICE          TYPE           STATE         CONNECTION
    --------------------------------------------------------------------------------------
    enp1s0          ethernet       connected     Wired connection 1
    vxlan_sys_4789  vxlan          disconnected  --
    genev_sys_6081  geneve         unmanaged     --
    lo              loopback       unmanaged     --
    br-int          ovs-bridge     unmanaged     --
    br0             ovs-bridge     unmanaged     --
    br-int          ovs-interface  unmanaged     --
    br0             ovs-interface  unmanaged     --
    ovn-k8s-mp0     ovs-interface  unmanaged     --
    tun0            ovs-interface  unmanaged     --
    br-int          ovs-port       unmanaged     --
    br0             ovs-port       unmanaged     --
    ovn-007ae0-0    ovs-port       unmanaged     --
    ovn-0fdca7-0    ovs-port       unmanaged     --
    ovn-3e439a-0    ovs-port       unmanaged     --
    ovn-49236f-0    ovs-port       unmanaged     --
    ovn-d4e052-0    ovs-port       unmanaged     --
    ovn-dd748d-0    ovs-port       unmanaged     --
    ovn-ee6264-0    ovs-port       unmanaged     --
    ovn-k8s-mp0     ovs-port       unmanaged     --
    tun0            ovs-port       unmanaged     --
    veth11d1fdb8    ovs-port       unmanaged     --
    veth74b0121f    ovs-port       unmanaged     --
    veth896a0680    ovs-port       unmanaged     --
    vetha87ca073    ovs-port       unmanaged     --
    vethd54c1fa3    ovs-port       unmanaged     --
    vethdd4c983b    ovs-port       unmanaged     --
    vethdea8fc48    ovs-port       unmanaged     --
    vethff72f48b    ovs-port       unmanaged     --
    vxlan0          ovs-port       unmanaged     --
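The container start script quoted in this comment picks the gateway interface by grepping /etc/systemd/system/ovs-configuration.service for the string OVNKubernetes, and the fatal "failed to lookup link none" message is consistent with the else branch having been taken. The decision can be reproduced on a node without restarting the pod. What follows is only a minimal sketch: the function name gateway_flags is hypothetical, while the default path and the two flag strings are copied from the script in the report.

```shell
#!/usr/bin/env bash
# Sketch: mirror the gateway-flag selection from the ovnkube-node start script.
# gateway_flags is a hypothetical helper name, not part of OpenShift.

# gateway_flags [UNIT_FILE]
# Prints the flags the start script would select for UNIT_FILE.
# UNIT_FILE defaults to the path the start script greps.
gateway_flags() {
  local unit_file="${1:-/etc/systemd/system/ovs-configuration.service}"
  # As in the original script, a missing file or a file without the
  # string OVNKubernetes falls through to the "none" branch.
  if grep -q OVNKubernetes "${unit_file}" 2>/dev/null; then
    echo "--gateway-mode local --gateway-interface br-ex"
  else
    echo "--gateway-mode local --gateway-interface none"
  fi
}
```

Running gateway_flags with no argument on an affected node shows which branch the container would take. If it prints the "none" variant, that matches the interface name in the fatal log line above.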
When doing the migration, did you follow the instructions in the 4.7 documentation or the 4.8 documentation? The procedures differ between 4.7 and 4.8.
It looks like a regression introduced in 4.7.18 by https://github.com/openshift/ovn-kubernetes/commit/1c28b968f1b0fb6252e7f4d3061b107d5cc498e0. Tested with both 4.7.17 and 4.7.18. The migration works with 4.7.17 but not 4.7.18.
I followed the documentation for OCP 4.7. I also did the migration with OCP 4.6.46 and everything went well: after starting the MCP, the nodes were updated with the rest of the network configuration.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.40 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5088
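Before starting a migration it may help to check whether the cluster's 4.7 z-stream falls between the release where the regression was reported (4.7.18) and the advisory release (4.7.40). The sketch below assumes that every 4.7 release in that half-open range is affected, which this thread does not confirm for each individual release. The function names are hypothetical, and the comparison relies on GNU sort -V.

```shell
#!/usr/bin/env bash
# Sketch: flag 4.7 z-stream versions between the reported regression and the fix.
# ASSUMPTION: all releases in [4.7.18, 4.7.40) behave like the ones tested in
# this thread (4.7.18 and 4.7.32). Other minor versions are not considered.

# version_lt A B: succeeds when version A sorts strictly before version B.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# in_affected_range VERSION: succeeds when 4.7.18 <= VERSION < 4.7.40.
in_affected_range() {
  local version="$1" first_bad="4.7.18" first_fixed="4.7.40"
  ! version_lt "${version}" "${first_bad}" &&
    version_lt "${version}" "${first_fixed}"
}
```

For example, `in_affected_range 4.7.32 && echo "update to 4.7.40 or later before migrating"` prints the warning, while 4.7.17 and 4.7.40 do not. The running version can be read with `oc get clusterversion version -o jsonpath='{.status.desired.version}'`.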