+++ This bug was initially created as a clone of Bug #1943334 +++

When an ovnkube-node pod is upgraded, the old pods are killed and new ones are started some time later; the observed gap between old and new can be a minute or more. During this time no pods can be started on the node, but it remains available for scheduling, so pods do get scheduled to it and time out. They will get retried, but it's pointless to try running pods while the node's networking is down.

One fix could be to taint the node NoSchedule in the ovnkube-node container's termination hook, and clear any existing taint when ovnkube-node starts. The ovnkube containers (and anything else network-related, like multus) might have to tolerate this taint. For example:

  lifecycle:
    preStop:
      exec:
        command:
        - /bin/bash
        - -c
        - |
          rm -f /etc/cni/net.d/10-ovn-kubernetes.conf
          kubectl taint nodes ${K8S_NODE} "k8s.ovn.org/network-unavailable:NoSchedule"

Then programmatically remove the taint in ovnkube-node after writing out the CNI config file, once everything is initialized.

----

The same strategy could likely be applied to openshift-sdn's node process.
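As a rough illustration of the "programmatically remove the taint" step, here is a minimal client-go sketch. This is not actual ovn-kubernetes code: the helper name clearNetworkUnavailableTaint and the reuse of the K8S_NODE env var (assumed to be injected via the downward API, as in the preStop hook above) are assumptions for illustration.

package main

import (
	"context"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const taintKey = "k8s.ovn.org/network-unavailable"

// clearNetworkUnavailableTaint removes the NoSchedule taint that the preStop
// hook added, signalling that node networking is initialized again.
func clearNetworkUnavailableTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Keep every taint except ours.
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != taintKey || t.Effect != corev1.TaintEffectNoSchedule {
			kept = append(kept, t)
		}
	}
	if len(kept) == len(node.Spec.Taints) {
		return nil // taint not present, nothing to do
	}
	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	if err := clearNetworkUnavailableTaint(context.Background(),
		kubernetes.NewForConfigOrDie(cfg), os.Getenv("K8S_NODE")); err != nil {
		log.Fatal(err)
	}
}

A production version would wrap the get/update in retry.RetryOnConflict (k8s.io/client-go/util/retry) to cope with concurrent node updates, and ovnkube-node's RBAC would need update permission on nodes.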
Closing this bug in favour of https://issues.redhat.com/browse/SDN-2241. The solution will have to be implemented in CRI-O.