Bug 1706204

Summary: Remove unreachable tolerates from etcd quorum guard
Product: OpenShift Container Platform
Reporter: Robert Krawitz <rkrawitz>
Component: Machine Config Operator
Assignee: ravig <rgudimet>
Status: CLOSED ERRATA
QA Contact: Micah Abbott <miabbott>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: eparis, kgarriso, mnguyen, rkrawitz
Target Milestone: ---
Target Release: 4.1.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Kirsten Garrison 2019-05-07 19:07:47 UTC
I believe that this is the related PR: https://github.com/openshift/machine-config-operator/pull/700

Comment 3 Kirsten Garrison 2019-05-07 23:14:03 UTC
The above PR just merged.

Comment 5 Micah Abbott 2019-05-16 15:14:45 UTC
There's an updated e2e test that checks for invalid tolerations (https://github.com/openshift/origin/pull/22752), which fails with the string `pods found with invalid tolerations`.  A search of the last 7d of CI logs shows there are no failures with this problem related to the MCO.

https://ci-search-ci-search-next.svc.ci.openshift.org/?search=pods+found+with+invalid+tolerations&maxAge=168h&context=5&type=all

This seems to indicate this BZ has been properly fixed.  We're going to try some additional tests to functionally verify this.

Comment 6 Michael Nguyen 2019-05-16 19:33:23 UTC
I stood up a cluster and stopped kubelet.service on a master node (using oc debug node). The etcd quorum guard pod running on that node became unschedulable after its tolerationSeconds elapsed. The machine config daemon pod also running on the node, which has no tolerationSeconds, was not evicted. This behaves as described in https://github.com/openshift/machine-config-operator/pull/700
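
For anyone reading this later, the behavior seen above follows from how Kubernetes handles NoExecute taints: a pod with no matching toleration is evicted right away, a pod whose matching toleration sets tolerationSeconds is evicted once that time has elapsed, and a pod whose matching toleration has no tolerationSeconds stays bound indefinitely. Below is a rough Python sketch of that rule, not the controller's actual code. The function names, the 120-second value, and the two example toleration lists are hypothetical and were not taken from the MCO manifests. Taking the shortest finite tolerationSeconds when several tolerations match is a simplifying assumption of this sketch.

```python
from typing import Optional


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches the given taint."""
    # A toleration with no effect set matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists with an empty key tolerates every taint.
        return not key or key == taint["key"]
    # Equal: both key and value have to match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def eviction_delay(tolerations: list, taint: dict) -> Optional[float]:
    """Seconds a pod stays bound after a NoExecute taint is added.

    Returns 0.0 if the taint is not tolerated (immediate eviction) and
    None if the pod is never evicted because of this taint.
    """
    matching = [t for t in tolerations if tolerates(t, taint)]
    if not matching:
        return 0.0
    finite = [t["tolerationSeconds"] for t in matching if "tolerationSeconds" in t]
    if not finite:
        return None
    # Simplifying assumption: use the shortest finite toleration time.
    return float(max(0, min(finite)))


# Taint the node controller applies when a node stops reporting status.
unreachable = {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"}

# Hypothetical tolerations, for illustration only.
quorum_guard_like = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 120},
]
daemon_like = [{"operator": "Exists"}]  # tolerates everything, no time limit

print(eviction_delay(quorum_guard_like, unreachable))  # 120.0
print(eviction_delay(daemon_like, unreachable))        # None
print(eviction_delay([], unreachable))                 # 0.0
```

The first result corresponds to the quorum guard pod leaving the node after its tolerationSeconds, and the second to the daemon pod staying put.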


$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.okd-2019-05-13-204940   True        False         105m    Cluster version is 4.1.0-0.okd-2019-05-13-204940

Closing as verified.

Comment 8 errata-xmlrpc 2019-06-04 10:48:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758