Bug 1906254
Summary: | MCDDrainError firing on node.kubernetes.io/unschedulable toleration contention | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | bembery
Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | low | Docs Contact: |
Priority: | low | |
Version: | 4.5 | CC: | kgarriso, travi, wking
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-07-15 00:17:59 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description bembery 2020-12-10 02:41:51 UTC
I'm fuzzy on how this gets up into the alert, but the underlying issue seems to be a fight going on between:

* the MCD trying to drain the CVO by setting node.kubernetes.io/unschedulable and then repeatedly killing CVO pods, while
* the CVO's ReplicaSet controller gamely creating replacements which tolerate unschedulable (background on why, from years ago, in [1]).

I don't understand taints and tolerations well enough [2], but if there's a way to say "We'd prefer the CVO to be scheduled on a node that doesn't have the node.kubernetes.io/unschedulable taint, but if going onto a tainted node is the only way to get scheduled, we'll accept that too", that seems like it would at least mitigate the contention.

[1]: https://github.com/openshift/cluster-version-operator/pull/182#discussion_r280948358
[2]: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

The unschedulable toleration does not make the cut in [1]. Auditing a recent 4.7 nightly [2]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + (. | tostring)'
openshift-cluster-version cluster-version-operator-c4dbbfcbb-vz27g {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-4sxd6 {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-j57xq {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-lpgbd {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-54kkg {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-lnssm {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-w57mj {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-5q62c {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-78px4 {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-vckmv {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-msdtk {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-x26hs {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-znqlz {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
```

[1]: https://github.com/openshift/enhancements/blame/94baf7dd83a909d04a00a99c117bdf90e53c5e63/CONVENTIONS.md#L165-L173
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056

MCD doesn't drain DaemonSet pods, because they can't get rescheduled on an alternate node, so here's a better audit:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + ($m.ownerReferences[].kind | tostring) + " " + (. | tostring)' | grep -v DaemonSet
openshift-cluster-version cluster-version-operator-c4dbbfcbb-vz27g ReplicaSet {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
```

This shows that the contention is just MCD vs. CVO.

Thanks for all the details Trevor, will look into this.

Closing this for now, as the immediate bug would not occur. We have made improvements to the drain logic such that this alert would not have fired. In 4.7+, drain timeouts are now an hour (your drain took 350s total). See: https://github.com/openshift/machine-config-operator/pull/2605

*** This bug has been marked as a duplicate of bug 1968759 ***
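For reference, the toleration flagged in the audits above would look like this in the CVO Deployment's pod template (a sketch reconstructed from the JSON output; the field placement follows the standard PodSpec, not a confirmed excerpt of the CVO manifest):

```yaml
# With operator: Exists and no tolerationSeconds, the pod tolerates the
# node.kubernetes.io/unschedulable NoSchedule taint indefinitely, so the
# ReplicaSet controller keeps scheduling replacements onto a cordoned node.
tolerations:
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
```

Note that, per the taints-and-tolerations docs linked above, the "soft" variant (PreferNoSchedule) exists only as a taint effect set on the node, not as a preference expressible in a toleration, so the "prefer an untainted node, but accept a tainted one" behavior described in the report isn't directly available from the pod side.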
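The second jq filter above can also be mirrored in a few lines of Python for anyone auditing a locally saved pods.json. This is a sketch, not part of the bug report; the sample data is trimmed to two pods from the audit output, with their owner kinds inferred from the fact that only the CVO pod survived the `grep -v DaemonSet` pass:

```python
# Trimmed stand-in for the CI-gathered pods.json (in practice, load it
# with json.load() from the gcsweb artifact instead).
pods = {
    "items": [
        {
            "metadata": {
                "namespace": "openshift-cluster-version",
                "name": "cluster-version-operator-c4dbbfcbb-vz27g",
                "ownerReferences": [{"kind": "ReplicaSet"}],
            },
            "spec": {"tolerations": [{
                "effect": "NoSchedule",
                "key": "node.kubernetes.io/unschedulable",
                "operator": "Exists",
            }]},
        },
        {
            # DaemonSet-owned pod; the audit excludes these because MCD
            # never drains them.
            "metadata": {
                "namespace": "openshift-sdn",
                "name": "sdn-controller-msdtk",
                "ownerReferences": [{"kind": "DaemonSet"}],
            },
            "spec": {"tolerations": [{
                "effect": "NoSchedule",
                "key": "node.kubernetes.io/unschedulable",
                "operator": "Exists",
            }]},
        },
    ]
}

def audit(pods):
    """Pods tolerating the unschedulable taint, excluding DaemonSet pods."""
    hits = []
    for item in pods["items"]:
        meta = item["metadata"]
        kinds = {ref["kind"] for ref in meta.get("ownerReferences", [])}
        if "DaemonSet" in kinds:
            continue
        for tol in item["spec"].get("tolerations", []):
            if tol.get("key") == "node.kubernetes.io/unschedulable":
                hits.append((meta["namespace"], meta["name"], tol))
    return hits
```

On the full 4.7 artifact this should leave only the CVO pod, matching the `grep -v DaemonSet` result above.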