+++ This bug was initially created as a clone of Bug #1943334 +++

When an ovnkube-node pod is upgraded, the old pods are killed and new ones are started some time later; the observed gap between old and new can be a minute or more. During this time no pods can be started on the node, but it remains available for scheduling, so pods do get scheduled to it and time out. They will get retried, but it's pointless to try running pods while the node's networking is down.

One fix could be to taint the node NoSchedule in the ovnkube-node container's termination hook, and clear any existing taint when ovnkube-node starts. The ovnkube containers (and anything else network-related, like multus) might have to tolerate this taint. For example:

  lifecycle:
    preStop:
      exec:
        command:
        - /bin/bash
        - -c
        - |
          rm -f /etc/cni/net.d/10-ovn-kubernetes.conf
          kubectl taint nodes ${K8S_NODE} "k8s.ovn.org/network-unavailable:NoSchedule"

Then programmatically remove the taint in ovnkube-node after writing out the CNI config file, once everything is initialized.

----

The same strategy could likely be applied to openshift-sdn's node process.
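As a rough illustration of the "programmatically remove the taint" step, here is a minimal client-go sketch. This is not actual ovn-kubernetes code: the helper name clearNetworkUnavailableTaint and the reuse of the K8S_NODE env var (assumed to be injected via the downward API, as in the preStop hook above) are assumptions for illustration.

package main

import (
	"context"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const taintKey = "k8s.ovn.org/network-unavailable"

// clearNetworkUnavailableTaint removes the NoSchedule taint that the preStop
// hook added, signalling that node networking is initialized again.
func clearNetworkUnavailableTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Keep every taint except ours.
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != taintKey || t.Effect != corev1.TaintEffectNoSchedule {
			kept = append(kept, t)
		}
	}
	if len(kept) == len(node.Spec.Taints) {
		return nil // taint not present, nothing to do
	}
	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	if err := clearNetworkUnavailableTaint(context.Background(),
		kubernetes.NewForConfigOrDie(cfg), os.Getenv("K8S_NODE")); err != nil {
		log.Fatal(err)
	}
}

A production version would wrap the get/update in retry.RetryOnConflict (k8s.io/client-go/util/retry) to cope with concurrent node updates, and ovnkube-node's RBAC would need update permission on nodes.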
Closing this bug in favour of https://issues.redhat.com/browse/SDN-2241. The solution will have to be implemented in CRI-O.