Description of problem:

During an upgrade, SRE was paged for MCDDrainError: CVO termination kept hitting the global drain timeout of 1m30s, apparently 5 times.

  alertname = MCDDrainError
  drain_time = 543.44871573 sec
  endpoint = metrics
  err = 5 tries: error when waiting for pod "cluster-version-operator-xx-xxx" terminating: global timeout reached: 1m30s
  job = machine-config-daemon
  namespace = openshift-machine-config-operator
  pod = machine-config-daemon-zdspp
  message = Drain failed on , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

When looking at the MCO pod, the following error was listed:

  5 tries: error when waiting for pod "cluster-version-operator-xxx-xx" terminating: global timeout reached: 1m30s

It was mentioned that the expectation is for the kubelet to kill the CVO after 130s.

Actual results:
* MCDDrainError triggered

Expected results:
* CVO killed once terminationGracePeriodSeconds has expired
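For reference, the 130s expectation above should correspond to the CVO deployment's terminationGracePeriodSeconds. A quick way to confirm it on an affected cluster (a sketch, assuming the usual openshift-cluster-version namespace and cluster-version-operator deployment name):

$ oc -n openshift-cluster-version get deployment cluster-version-operator -o json | jq '.spec.template.spec.terminationGracePeriodSeconds'  # per the 130s expectation in the report, this should come back as 130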
I'm fuzzy on how this gets up into the alert, but the underlying issue seems to be a fight going on between:

* the MCD trying to drain the CVO by setting node.kubernetes.io/unschedulable and then repeatedly killing CVO pods, while
* the CVO's ReplicaSet controller gamely creates replacements which tolerate unschedulable (background on why from years ago in [1]).

I don't understand taints and tolerations well enough [2], but if there's a way to say "we'd prefer the CVO to be scheduled on a node that doesn't have the node.kubernetes.io/unschedulable taint, but if going onto a tainted node is the only way to get scheduled, we'll accept that too", that seems like it would at least mitigate the contention.

[1]: https://github.com/openshift/cluster-version-operator/pull/182#discussion_r280948358
[2]: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
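For context, the toleration in question would come from the CVO Deployment's pod template; a quick way to see what it currently tolerates (a sketch, again assuming the openshift-cluster-version / cluster-version-operator names):

$ oc -n openshift-cluster-version get deployment cluster-version-operator -o json | jq '.spec.template.spec.tolerations'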
The unschedulable toleration does not make the cut in [1]. Auditing a recent 4.7 nightly [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + (. | tostring)'
openshift-cluster-version cluster-version-operator-c4dbbfcbb-vz27g {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-4sxd6 {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-j57xq {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-ingress-canary ingress-canary-lpgbd {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-54kkg {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-lnssm {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-machine-config-operator machine-config-server-w57mj {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-5q62c {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-78px4 {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-multus multus-admission-controller-vckmv {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-msdtk {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-x26hs {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
openshift-sdn sdn-controller-znqlz {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}

[1]: https://github.com/openshift/enhancements/blame/94baf7dd83a909d04a00a99c117bdf90e53c5e63/CONVENTIONS.md#L165-L173
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056
MCD doesn't drain DaemonSet pods, because they can't get rescheduled on an alternate node, so here's a better audit:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1336632578890797056/artifacts/e2e-aws/pods.json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + ($m.ownerReferences[].kind | tostring) + " " + (. | tostring)' | grep -v DaemonSet
openshift-cluster-version cluster-version-operator-c4dbbfcbb-vz27g ReplicaSet {"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}

Shows that this is just MCD vs. CVO.
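The same audit can also be run against a live cluster instead of the CI artifact; a rough equivalent using the same jq filter (pods missing tolerations or ownerReferences may need []? guards added):

$ oc get pods --all-namespaces -o json | jq -r '.items[] | .metadata as $m | .spec.tolerations[] | select(.key == "node.kubernetes.io/unschedulable") | $m.namespace + " " + $m.name + " " + ($m.ownerReferences[].kind | tostring) + " " + (. | tostring)' | grep -v DaemonSet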
Thanks for all the details, Trevor. Will look into this.
Closing this for now, as the immediate bug would no longer occur. We have made improvements to the drain logic such that this alert would not have fired: in 4.7+ the drain timeout is now an hour (your drain took 350s total). See: https://github.com/openshift/machine-config-operator/pull/2605

*** This bug has been marked as a duplicate of bug 1968759 ***
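If something similar shows up again on 4.7+, the drain attempts and their timing can be checked from the daemon logs referenced in the alert message (a sketch; substitute the actual pod name, and the grep is just a convenience filter):

$ oc -n openshift-machine-config-operator logs machine-config-daemon-<hash> -c machine-config-daemon | grep -i drain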