Bug 1943334 - [ovnkube] node pod should taint NoSchedule on termination; clear on startup
Summary: [ovnkube] node pod should taint NoSchedule on termination; clear on startup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Surya Seetharaman
QA Contact: Ying Wang
URL:
Whiteboard:
Depends On:
Blocks: 1943336 1943566
 
Reported: 2021-03-25 19:52 UTC by Dan Williams
Modified: 2021-10-18 17:30 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned to: 1943336
Environment:
Last Closed: 2021-10-18 17:29:50 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift ovn-kubernetes pull 671 (last updated 2021-08-19 12:49:23 UTC)
Github ovn-org ovn-kubernetes pull 2183, closed: Taint node with NoSchedule effect when ovnkube pod is down (last updated 2021-07-26 15:07:56 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:30:13 UTC)

Description Dan Williams 2021-03-25 19:52:38 UTC
When an ovnkube-node pod is upgraded, the old pod is killed and the new one is started some time later. Observed gap between old -> new can be 1m or more. During this time no pods can be started on the node, but the node is still available for scheduling, so pods do get scheduled there and then time out. They will get retried, but it's pointless to try to run pods while the node's networking is down.

One fix could be to taint the node NoSchedule in the ovnkube-node container termination hook, and clear any existing taint when ovnkube-node starts. ovnkube containers (and anything else network-y like multus) might have to tolerate this taint.

e.g.:

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - |
                rm -f /etc/cni/net.d/10-ovn-kubernetes.conf
                kubectl taint nodes ${K8S_NODE} "k8s.ovn.org/network-unavailable:NoSchedule"

and then programmatically remove the taint in ovnkube-node after writing out the CNI config file when everything is initialized.
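
For the startup side, a minimal sketch (not the actual ovn-kubernetes code; the helper name is made up and a client-go clientset is assumed) of clearing the taint once initialization is done:

package main

import (
	"context"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// removeNetworkUnavailableTaint drops the k8s.ovn.org/network-unavailable:NoSchedule
// taint, if present, once ovnkube-node has written the CNI config and finished
// initializing. Hypothetical helper, for illustration only.
func removeNetworkUnavailableTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
	for _, t := range node.Spec.Taints {
		if t.Key == "k8s.ovn.org/network-unavailable" && t.Effect == corev1.TaintEffectNoSchedule {
			continue // this is our taint; drop it
		}
		kept = append(kept, t)
	}
	if len(kept) == len(node.Spec.Taints) {
		return nil // taint not present, nothing to do
	}
	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := removeNetworkUnavailableTaint(context.Background(), client, os.Getenv("K8S_NODE")); err != nil {
		log.Fatal(err)
	}
}

A real implementation would also retry on update conflicts (e.g. with client-go's retry.RetryOnConflict), since kubelet and controllers update the Node object concurrently.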

Comment 1 Petr Muller 2021-03-26 13:35:42 UTC
Is this bug connected to one of the failures we're seeing in 4.7->4.8 OVN CI jobs, e.g. https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade? Would be nice to get them crosslinked.

Cluster upgrade.[sig-arch] Check if alerts are firing during or after upgrade success
openshift-tests.[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
operator.Run multi-stage test e2e-aws-ovn-upgrade - e2e-aws-ovn-upgrade-openshift-e2e-test container test
openshift-tests.[sig-network] pods should successfully create sandboxes by other
Cluster upgrade.[sig-cluster-lifecycle] cluster upgrade should be fast
openshift-tests.[sig-node] nodes should not go unready after being upgraded and go unready only once
openshift-tests.[sig-node] pods should never transition back to pending
openshift-tests.[sig-node] kubelet terminates kube-apiserver gracefully
Cluster upgrade.[sig-api-machinery] Kubernetes APIs remain available for new connections
Cluster upgrade.[sig-api-machinery] Kubernetes APIs remain available with reused connections
Cluster upgrade.[sig-api-machinery] OAuth APIs remain available for new connections
Cluster upgrade.[sig-api-machinery] OpenShift APIs remain available for new connections
Cluster upgrade.[sig-api-machinery] OpenShift APIs remain available with reused connections
Cluster upgrade.[sig-network-edge] Application behind service load balancer with PDB is not disrupted
Cluster upgrade.[sig-api-machinery] OAuth APIs remain available with reused connections
Cluster upgrade.[sig-network-edge] Cluster frontend ingress remain available

Comment 2 Dan Williams 2021-03-27 00:58:54 UTC
(In reply to Petr Muller from comment #1)
> Is this bug connected to one of the failures we're seeing in 4.7->4.8 OVN CI
> jobs, e.g.
> https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade?
> Would be nice to get them crosslinked.

If you see an upgrade fail because of the synthetic "create pod sandboxes" test due to:

1) multus ReadinessIndicator file
2) failed to retrieve annotations

Then yes, fixing this bug would theoretically address it, because both of those are due to nodes still being schedulable while networking is down. And obviously, if networking is down and you try to start a pod that needs the network, creating the sandbox will fail.

Comment 3 Surya Seetharaman 2021-04-13 06:29:27 UTC
Instead of tainting the node in the pre-stop hook, the approach will be to taint the node from the SIGTERM signal handler during pod termination, since that is considered more reliable. Working on a PR.
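
Roughly, the idea is a signal handler like the sketch below (illustrative only, not the actual PR; it assumes a client-go clientset and that the node name is passed in, e.g. via the downward API as K8S_NODE):

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// taintNodeOnSigterm blocks until SIGTERM, then adds the
// k8s.ovn.org/network-unavailable:NoSchedule taint so the scheduler stops
// placing pods on the node while ovnkube-node is down.
func taintNodeOnSigterm(client kubernetes.Interface, nodeName string) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	ctx := context.Background()
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return // best effort: termination proceeds either way
	}
	taint := corev1.Taint{
		Key:    "k8s.ovn.org/network-unavailable",
		Effect: corev1.TaintEffectNoSchedule,
	}
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, _ = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	taintNodeOnSigterm(kubernetes.NewForConfigOrDie(cfg), os.Getenv("K8S_NODE"))
}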

Comment 4 Antonio Ojea 2021-05-07 07:53:31 UTC
Isn't this imposing a restriction on the whole cluster? How will other components handle it?

I think that the CNO should handle the upgrade logic; giving pods permissions to taint nodes on restarts is risky, easy to go wrong and hard to debug.

In this case, the kubelet checks whether the network is ready when setting the node status, and both CRI-O and containerd can mark the node as not ready if the network is not ready. Wouldn't that be a better approach?

https://github.com/cri-o/cri-o/blob/8118d4a58693d1c7211b5fa73df1be2db1f9404f/server/runtime_status.go#L10-L14

Comment 5 Antonio Ojea 2021-05-07 08:04:17 UTC
(In reply to Antonio Ojea from comment #4)
> Isn't this imposing a restriction on the whole cluster? How will other
> components handle it?
> 

I found it, sorry: https://github.com/openshift/enhancements/blob/69993901761ad2cd320ffb2faf0a58d91a637ffd/CONVENTIONS.md

Comment 6 Dan Williams 2021-05-07 14:24:34 UTC
(In reply to Antonio Ojea from comment #4)
> Isn't this imposing a restriction on the whole cluster? How will other
> components handle it?
> 
> I think that the CNO should handle the upgrade logic; giving pods
> permissions to taint nodes on restarts is risky, easy to go wrong and hard
> to debug.
> 
> In this case, the kubelet checks whether the network is ready when setting
> the node status, and both CRI-O and containerd can mark the node as not
> ready if the network is not ready. Wouldn't that be a better approach?
> 
> https://github.com/cri-o/cri-o/blob/8118d4a58693d1c7211b5fa73df1be2db1f9404f/server/runtime_status.go#L10-L14

The issue is that *multus* is the primary network plugin that ocicni and kubelet are looking for. It's not a long-running process (though it should be), so it doesn't know to remove its own CNI config file (the one crio/ocicni looks for in the Status() call) when the real network plugin removes its config file on upgrade.

Ideally, Multus would inotify-watch the real network plugin config dir and update/remove its own config file (that CRIO looks for) as appropriate. But that doesn't happen yet.
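
Something along these lines, sketched with fsnotify (illustrative paths and function name, not Multus code):

package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

const (
	// Illustrative paths: the real plugin's config and the config that
	// crio/ocicni actually looks for.
	ovnConf    = "/etc/cni/net.d/10-ovn-kubernetes.conf"
	multusConf = "/etc/cni/net.d/00-multus.conf"
)

// watchRealPluginConf removes the multus config when the real plugin's config
// disappears (so CRI Status() starts failing), and could regenerate it when
// the config comes back.
func watchRealPluginConf() error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()
	if err := w.Add(filepath.Dir(ovnConf)); err != nil {
		return err
	}
	for {
		select {
		case ev := <-w.Events:
			if ev.Name != ovnConf {
				continue
			}
			switch {
			case ev.Op&(fsnotify.Remove|fsnotify.Rename) != 0:
				_ = os.Remove(multusConf) // real plugin gone: report network not ready
			case ev.Op&(fsnotify.Create|fsnotify.Write) != 0:
				// real plugin back: regenerate the multus config (elided)
			}
		case err := <-w.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}

func main() {
	if err := watchRealPluginConf(); err != nil {
		log.Fatal(err)
	}
}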

So you have a good point; we should probably discuss at the next team meeting what our approach should be.

Comment 7 Dan Winship 2021-05-10 16:12:14 UTC
> I think that the CNO should handle the upgrade logic

CNO doesn't know which node is upgrading at any given time; the DaemonSet controller controls that. So CNO would have to watch the generated Pods to figure that out. (Of course, that information would also be good for status reporting, so maybe it _should_ do this...)

> giving pods permissions to taint nodes on restarts is risky, easy to go wrong and hard to debug.

Yeah, I'm dubious about changing this this late in the release cycle.

> Ideally, Multus would inotify-watch the real network plugin config dir and update/remove its own config file (that CRIO looks for) as appropriate.

Kubelet does not have any concept of the network plugin *becoming* unavailable. It can deal with the network plugin being unavailable when kubelet starts, but it assumes that once the network plugin is available, it will stay available until the node reboots. (ie, It does not consider the possibility that the pod network might be maintained by a pod, or that you might upgrade the pod network software while kubelet is running.)

https://github.com/openshift/origin/pull/24906 was an attempt to think about how we might make this work better, although it doesn't actually taint the node, it just uses a "soft admit handler", so pods can still get scheduled to the node, they'll just stay stuck in Pending. I don't know if that would solve the timeout problems mentioned in comment 0. (Of course the real problem here is "Observed gap between old -> new can be 1m or more.". It's just like we had discussed before with the old OVS pods; we need to fix it so that the new pods get pulled before the old ones are killed.)

Comment 8 Dan Williams 2021-05-10 21:17:10 UTC
(In reply to Dan Winship from comment #7)
> > I think that the CNO should handle the upgrade logic
> 
> CNO doesn't know which node is upgrading at any given time; the DaemonSet
> controller controls that. So CNO would have to watch the generated Pods to
> figure that out. (Of course, that information would also be good for status
> reporting, so maybe it _should_ do this...)
> 
> > giving pods permissions to taint nodes on restarts is risky, easy to go wrong and hard to debug.
> 
> Yeah, I'm dubious about changing this this late in the release cycle.
> 
> > Ideally, Multus would inotify-watch the real network plugin config dir and update/remove its own config file (that CRIO looks for) as appropriate.
> 
> Kubelet does not have any concept of the network plugin *becoming*
> unavailable. It can deal with the network plugin being unavailable when
> kubelet starts, but it assumes that once the network plugin is available, it
> will stay available until the node reboots. (ie, It does not consider the
> possibility that the pod network might be maintained by a pod, or that you
> might upgrade the pod network software while kubelet is running.)

I'm not sure that's accurate? Kubelet calls the CRI's Status() function periodically, and CRIO in turn asks ocicni if the default network exists. If there is no default network anymore (because say the plugin removed its config) then ocicni will return an error, and CRIO will reflect that back to kubelet as a network error. Kubelet then uses that to set NetworkUnavailable status on the node, like happens on startup before the plugin writes the config. But I may have gotten the chain wrong?

Comment 9 Antonio Ojea 2021-05-11 10:07:29 UTC
I just tested it in KIND, which uses containerd, but let's assume it works like CRI-O here.

I had a cluster working for some time, then I removed the CNI configuration file, and after a few seconds the node went NotReady.

kubectl get nodes -o wide
kind-worker          NotReady 

kubelet logs

May 11 10:01:16 kind-worker kubelet[262]: E0511 10:01:16.765544     262 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
May 11 10:01:18 kind-worker kubelet[262]: I0511 10:01:18.696054     262 setters.go:577] "Node became not ready" node="kind-worker" condition={Type:Ready Status:False LastHeartbeatTime:2021-05-11 10:01:18.696012937 +0000 UTC m=+675272.120555559 LastTransitionTime:2021-05-11 10:01:18.696012937 +0000 UTC m=+675272.120555559 Reason:KubeletNotReady Message:container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized}

So, replacing files seems a bit aggressive, but it works.

Maybe we can extend ocicni/crio to provide the feature that Dan Winship mentions: handling the network plugin *becoming* unavailable?

Comment 10 Surya Seetharaman 2021-05-11 10:26:11 UTC
(In reply to Antonio Ojea from comment #9)
> I just tested it in KIND, which uses containerd, but let's assume it works
> like CRI-O here.
> 
> I had a cluster working for some time, then I removed the CNI configuration
> file, and after a few seconds the node went NotReady.
> 
> kubectl get nodes -o wide
> kind-worker          NotReady 
> 
> kubelet logs
> 
> May 11 10:01:16 kind-worker kubelet[262]: E0511 10:01:16.765544     262 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
> May 11 10:01:18 kind-worker kubelet[262]: I0511 10:01:18.696054     262 setters.go:577] "Node became not ready" node="kind-worker" condition={Type:Ready Status:False LastHeartbeatTime:2021-05-11 10:01:18.696012937 +0000 UTC m=+675272.120555559 LastTransitionTime:2021-05-11 10:01:18.696012937 +0000 UTC m=+675272.120555559 Reason:KubeletNotReady Message:container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized}
> 
> So, replacing files seems a bit aggressive, but it works.
> 
> Maybe we can extend ocicni/crio to provide the feature that Dan Winship
> mentions: handling the network plugin *becoming* unavailable?

Based on dcbw's comment and what you tried here ^ it seems Kubelet already knows to handle the case of "network not ready" and marks the node NotReady. To trigger that we need to do <xxx?? (definitely hoping something better than deleting config file)> maybe on multus ? But that would make it an openshift solution and it won't be available for non-openshift ovn-k8s - not sure if we care?

Also will there be exceptions as in critical component pods (api/etcd during upgrade - I don't know the order which pods get upgraded first etc.) which already tolerate the "NoSchedule" taint but might need to handle the "NotReady" state?

Comment 11 Antonio Ojea 2021-05-11 11:07:03 UTC
(In reply to Surya Seetharaman from comment #10)
> 
> Based on dcbw's comment and what you tried here ^ it seems Kubelet already
> knows to handle the case of "network not ready" and marks the node NotReady.
> To trigger that we need to do <xxx?? (definitely hoping something better
> than deleting config file)> maybe on multus ? But that would make it an
> openshift solution and it won't be available for non-openshift ovn-k8s - not
> sure if we care?

It is a container runtime solution; if we check the cri-o code, Status() checks whether there is a network defined in the CNI config folder:

https://github.com/cri-o/cri-o/blob/8118d4a58693d1c7211b5fa73df1be2db1f9404f/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L858-L862
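
In essence (a simplified paraphrase for illustration, not the actual ocicni code), that check amounts to:

package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// defaultNetworkReady approximates what the ocicni Status() path amounts to:
// the network counts as ready only if some CNI config file exists in the
// config directory; otherwise CRI-O reports "network not ready" to kubelet.
func defaultNetworkReady(confDir string) error {
	for _, pattern := range []string{"*.conf", "*.conflist", "*.json"} {
		matches, err := filepath.Glob(filepath.Join(confDir, pattern))
		if err != nil {
			return err
		}
		for _, m := range matches {
			if fi, err := os.Stat(m); err == nil && fi.Mode().IsRegular() {
				return nil // a default network config exists
			}
		}
	}
	return errors.New("cni config uninitialized: no network config found in " + confDir)
}

func main() {
	if err := defaultNetworkReady("/etc/cni/net.d"); err != nil {
		fmt.Println("network not ready:", err)
		return
	}
	fmt.Println("network ready")
}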

> 
> Also will there be exceptions as in critical component pods (api/etcd during
> upgrade - I don't know the order which pods get upgraded first etc.) which
> already tolerate the "NoSchedule" taint but might need to handle the
> "NotReady" state?

It seems this was heavily discussed already, in the same context: https://github.com/kubernetes/kubernetes/issues/45717

My reading is that this automatically taints the node:
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

So TIL that the network not-ready status is indeed a taint/toleration problem, and adding it directly from the CNI/networking pod is just shortcutting the CRI...

Comment 12 Surya Seetharaman 2021-05-11 11:33:19 UTC
Short-circuiting the CRI calls before they go all the way to the plugin, and setting not-ready earlier in the chain instead of doing it in the plugin, definitely seems to be a better solution. Let me check the history of that commit and understand better what that would look like.

Comment 13 Dan Winship 2021-05-11 12:59:20 UTC
(In reply to Dan Williams from comment #8)
> I'm not sure that's accurate? Kubelet calls the CRI's Status() function
> periodically, and CRIO in turn asks ocicni if the default network exists. If
> there is no default network anymore (because say the plugin removed its
> config) then ocicni will return an error, and CRIO will reflect that back to
> kubelet as a network error. Kubelet then uses that to set NetworkUnavailable
> status on the node, like happens on startup before the plugin writes the
> config. But I may have gotten the chain wrong?

Huh... I grepped for NetworkUnavailable and there were no hits except at startup time.

Ah, that's because it doesn't actually set NetworkUnavailable; it just sets the Ready condition to False / adds the NotReady taint whenever any of a number of kubelet subsystems reports an error. (Separately, if CRI Status returns a networking error, kubelet also blocks pod-network pods from being created regardless of tolerations, but does not block hostNetwork pods.)

OK, so yeah, this should work.

Comment 14 Dan Winship 2021-05-11 13:00:02 UTC
> OK, so yeah, this should work.

(because everything that tolerates NetworkUnavailable also tolerates NotReady)
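
For reference, a sketch in Go struct form (not copied from any actual manifest) of the kind of toleration pair such components carry; the well-known taint keys are node.kubernetes.io/network-unavailable and node.kubernetes.io/not-ready:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Tolerations of the kind critical networking components carry: anything that
// tolerates the network-unavailable taint also tolerates not-ready, so the
// kubelet marking the node NotReady does not block or evict those pods.
var criticalTolerations = []corev1.Toleration{
	{Key: "node.kubernetes.io/network-unavailable", Operator: corev1.TolerationOpExists},
	{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists},
}

func main() {
	for _, t := range criticalTolerations {
		fmt.Printf("tolerates %s (%s)\n", t.Key, t.Operator)
	}
}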

Comment 15 Surya Seetharaman 2021-07-23 10:56:40 UTC
Upstream commit merged.

Comment 17 Ying Wang 2021-08-26 08:18:42 UTC

Checked on version 4.9.0-0.nightly-2021-08-25-042335: when deleting the ovnkube-node pod, the taint 'k8s.ovn.org/network-unavailable:NoSchedule' can be seen on the worker node. After the pod is running again, the taint is removed.

Also checked the logs of the newly recreated ovnkube-node pod; the taint-removal messages are there.


% oc describe node ip-10-0-70-181.us-east-2.compute.internal | grep Tain
Taints:       k8s.ovn.org/network-unavailable:NoSchedule
% oc describe node ip-10-0-70-181.us-east-2.compute.internal | grep Tain
Taints:       node-role.kubernetes.io/master:NoSchedule
% oc describe node ip-10-0-70-181.us-east-2.compute.internal | grep Tain
Taints:       node-role.kubernetes.io/master:NoSchedule


% oc logs ovnkube-node-6b7wf -c ovnkube-node | grep taint
I0826 07:58:02.461748 2222618 node.go:577] MTU (9001) of gateway network interface br-ex is big enough to deal with Geneve header overhead (sum 8913). Making sure node is not tainted with &Taint{Key:k8s.ovn.org/mtu-too-small,Value:,Effect:NoSchedule,TimeAdded:<nil>,}...
I0826 07:58:02.479199 2222618 kube.go:191] Removed taint k8s.ovn.org/mtu-too-small:NoSchedule on node ip-10-0-70-181.us-east-2.compute.internal
I0826 07:58:02.829091 2222618 kube.go:179] Removing taint k8s.ovn.org/network-unavailable:NoSchedule from Node ip-10-0-70-181.us-east-2.compute.internal
I0826 07:58:02.850254 2222618 kube.go:191] Removed taint k8s.ovn.org/network-unavailable:NoSchedule on node ip-10-0-70-181.us-east-2.compute.internal

% oc version
Client Version: 4.9.0-0.nightly-2021-08-18-144658
Server Version: 4.9.0-0.nightly-2021-08-25-042335
Kubernetes Version: v1.22.0-rc.0+5c2f7cd

Comment 20 errata-xmlrpc 2021-10-18 17:29:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

