Bug 1906254 - MCDDrainError firing on node.kubernetes.io/unschedulable toleration contention
Summary: MCDDrainError firing on node.kubernetes.io/unschedulable toleration contention
Status: CLOSED DUPLICATE of bug 1968759
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: x86_64
OS: Unspecified
Target Milestone: ---
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
Depends On:
Reported: 2020-12-10 02:41 UTC by bembery
Modified: 2021-07-15 00:17 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-07-15 00:17:59 UTC
Target Upstream Version:

Attachments (Terms of Use)

Description bembery 2020-12-10 02:41:51 UTC
Description of problem:

During an upgrade, SRE was paged for MCDDrainError: termination of the CVO pod exceeded the global drain timeout of 1m30s, apparently across 5 tries.

alertname = MCDDrainError
drain_time = 543.44871573 sec
endpoint = metrics
err = 5 tries: error when waiting for pod "cluster-version-operator-xx-xxx" terminating: global timeout reached: 1m30s
job = machine-config-daemon
namespace = openshift-machine-config-operator
pod = machine-config-daemon-zdspp
message = Drain failed on  , updates may be blocked. For more details:  oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

When looking at the MCO pod, the following error was listed:

5 tries: error when waiting for pod "cluster-version-operator-xxx-xx" terminating: global timeout reached: 1m30s

It was mentioned that the expectation is that the kubelet would KILL the CVO after 130s.

Actual results:

* MCDDrainError triggered

Expected results:

* The CVO pod is killed once terminationGracePeriodSeconds has expired

Comment 4 W. Trevor King 2020-12-10 07:18:03 UTC
I'm fuzzy on how this gets up into the alert, but the underlying issue seems to be a fight going on between:

* the MCD trying to drain the CVO by setting node.kubernetes.io/unschedulable and then repeatedly killing CVO pods, while
* the CVO's ReplicaSet controller gamely creates replacement pods, which tolerate the unschedulable taint (background on why, from years ago, in [1]).

I don't understand taints and tolerations well enough [2], but if there's a way to say "We'd prefer the CVO to be scheduled on a node that doesn't have the node.kubernetes.io/unschedulable taint, but if going onto a tainted node is the only way to get scheduled, we'll accept that too", that seems like it would at least mitigate the contention.

[1]: https://github.com/openshift/cluster-version-operator/pull/182#discussion_r280948358
[2]: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
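For reference, a sketch of the hard toleration the CVO pod carries (the fields match the audit output in the next comment). As far as I know, Kubernetes tolerations have no soft "prefer an untainted node" form: a pod either tolerates a taint or it doesn't, so the preference described above would have to be expressed some other way (e.g. via scheduler scoring), not through taints/tolerations alone:

```yaml
# Toleration on the CVO pod (per the audit below); 'operator: Exists'
# with no tolerationSeconds means the pod unconditionally tolerates
# cordoned (unschedulable) nodes.
tolerations:
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
```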

Comment 5 W. Trevor King 2020-12-10 07:23:57 UTC
The unschedulable toleration does not make the cut in [1].  Auditing a recent 4.7 nightly [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + (. | tostring)'
openshift-cluster-version cluster-version-operator-c4dbbfcbb-vz27g {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-4sxd6 {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-j57xq {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-lpgbd {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-54kkg {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-lnssm {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-w57mj {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-5q62c {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-78px4 {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-vckmv {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-msdtk {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-x26hs {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-znqlz {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}

[1]: https://github.com/openshift/enhancements/blame/94baf7dd83a909d04a00a99c117bdf90e53c5e63/CONVENTIONS.md#L165-L173
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056

Comment 6 W. Trevor King 2020-12-10 07:27:00 UTC
MCD doesn't drain DaemonSet pods, because they can't get rescheduled on an alternate node, so here's a better audit:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + ($m.ownerReferences[].kind | tostring) + " " + (. | tostring)' | grep -v DaemonSet
openshift-cluster-version cluster-version-operator-c4dbbfcbb-vz27g ReplicaSet {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}

This shows that the contention is just MCD vs. CVO.
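For anyone without jq handy, a rough Python equivalent of the filter above can be run against a locally saved pods.json. The sample document below is a hypothetical stand-in shaped like the CI artifact, using two pods taken from the audit output in the previous comments:

```python
import json

def unschedulable_tolerations(pods_doc, skip_kinds=("DaemonSet",)):
    """Yield (namespace, name, owner_kinds, toleration) for pods that
    tolerate node.kubernetes.io/unschedulable, skipping owner kinds
    (e.g. DaemonSet) whose pods the MCD does not drain anyway."""
    for pod in pods_doc["items"]:
        meta = pod["metadata"]
        kinds = [ref["kind"] for ref in meta.get("ownerReferences", [])]
        if any(k in skip_kinds for k in kinds):
            continue
        for tol in pod["spec"].get("tolerations", []):
            if tol.get("key") == "node.kubernetes.io/unschedulable":
                yield meta["namespace"], meta["name"], kinds, tol

# Hypothetical sample mirroring the CI artifact's shape:
sample = {
    "items": [
        {
            "metadata": {
                "namespace": "openshift-cluster-version",
                "name": "cluster-version-operator-c4dbbfcbb-vz27g",
                "ownerReferences": [{"kind": "ReplicaSet"}],
            },
            "spec": {
                "tolerations": [
                    {"effect": "NoSchedule",
                     "key": "node.kubernetes.io/unschedulable",
                     "operator": "Exists"}
                ]
            },
        },
        {
            "metadata": {
                "namespace": "openshift-sdn",
                "name": "sdn-controller-msdtk",
                "ownerReferences": [{"kind": "DaemonSet"}],
            },
            "spec": {
                "tolerations": [
                    {"effect": "NoSchedule",
                     "key": "node.kubernetes.io/unschedulable",
                     "operator": "Exists"}
                ]
            },
        },
    ]
}

# The DaemonSet-owned pod is filtered out; only the CVO pod remains.
for ns, name, kinds, tol in unschedulable_tolerations(sample):
    print(ns, name, kinds, json.dumps(tol))
```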

Comment 7 Kirsten Garrison 2020-12-10 08:04:11 UTC
Thanks for all the details Trevor, will look into this.

Comment 8 Kirsten Garrison 2021-07-15 00:17:59 UTC
Closing this for now, as the immediate bug would no longer occur.

We have made improvements to the drain logic such that this alert would not have fired. In 4.7+, drain timeouts are now an hour (your drain took 350s total):

See: https://github.com/openshift/machine-config-operator/pull/2605

*** This bug has been marked as a duplicate of bug 1968759 ***
