I believe that this is the related PR: https://github.com/openshift/machine-config-operator/pull/700
The above PR just merged.
There's an updated e2e test that checks for invalid tolerations (https://github.com/openshift/origin/pull/22752). It fails with the string `pods found with invalid tolerations`. A search of the last 7 days of CI logs shows no MCO-related failures containing that string:
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=pods+found+with+invalid+tolerations&maxAge=168h&context=5&type=all
This seems to indicate that this BZ has been properly fixed. We're going to run some additional tests to verify this functionally.
I stood up a cluster and stopped kubelet.service on the master node (using oc debug node). The etcd quorum guard pod running on that node became unschedulable after its tolerationSeconds elapsed. The machine config daemon pod running on the same node, which has no tolerationSeconds, was not evicted. This matches the behavior described in https://github.com/openshift/machine-config-operator/pull/700

$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.okd-2019-05-13-204940   True        False         105m    Cluster version is 4.1.0-0.okd-2019-05-13-204940

Closing as verified.
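For anyone re-running this verification, the difference between the two pods can be sketched as below. This is a simplified model of how Kubernetes treats a NoExecute taint, not code from the MCO or from the e2e test. It assumes taint-based eviction is in effect and that the stopped kubelet results in a `node.kubernetes.io/unreachable` NoExecute taint. The toleration lists and the 120-second value are hypothetical placeholders, not the actual manifests of the quorum guard or the machine config daemon.

```python
from typing import Optional

# Taint assumed to be applied to the node once its kubelet stops reporting.
UNREACHABLE = {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"}

# Hypothetical tolerations for illustration only; check the real manifests.
QUORUM_GUARD_TOLERATIONS = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 120},
]
MCD_TOLERATIONS = [
    {"operator": "Exists"},  # tolerates everything, with no time limit
]


def matches(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")
    if not key:
        # An empty key only matches when the operator is Exists.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def eviction_delay(tolerations: list, taint: dict) -> Optional[int]:
    """Seconds until eviction: 0 if untolerated, None if tolerated forever."""
    matching = [t for t in tolerations if matches(t, taint)]
    if not matching:
        return 0
    bounded = [t["tolerationSeconds"] for t in matching
               if t.get("tolerationSeconds") is not None]
    if not bounded:
        return None
    return max(0, min(bounded))


if __name__ == "__main__":
    print("quorum guard:", eviction_delay(QUORUM_GUARD_TOLERATIONS, UNREACHABLE))
    print("mcd:", eviction_delay(MCD_TOLERATIONS, UNREACHABLE))
```

In this model the quorum guard pod gets a finite delay and the daemon pod gets `None` (never evicted), which is the same split observed on the test cluster.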
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758