Bug 1706204 - Remove unreachable tolerates from etcd quorum guard
Summary: Remove unreachable tolerates from etcd quorum guard
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: ravig
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-05-03 19:22 UTC by Robert Krawitz
Modified: 2019-06-04 10:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:19 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:0758 | 0 | None | None | None | 2019-06-04 10:48:27 UTC

Comment 2 Kirsten Garrison 2019-05-07 19:07:47 UTC
I believe that this is the related PR: https://github.com/openshift/machine-config-operator/pull/700

Comment 3 Kirsten Garrison 2019-05-07 23:14:03 UTC
The above PR just merged.

Comment 5 Micah Abbott 2019-05-16 15:14:45 UTC
There's an updated e2e test that checks for invalid tolerations (https://github.com/openshift/origin/pull/22752), which fails with the string `pods found with invalid tolerations`.  A search of the last 7d of CI logs shows there are no failures with this problem related to the MCO.

https://ci-search-ci-search-next.svc.ci.openshift.org/?search=pods+found+with+invalid+tolerations&maxAge=168h&context=5&type=all

This seems to indicate this BZ has been properly fixed.  We're going to try some additional tests to functionally verify this.

Comment 6 Michael Nguyen 2019-05-16 19:33:23 UTC
I stood up a cluster and stopped kubelet.service on a master node (using oc debug node). The etcd quorum guard pod running on that node became unschedulable after its tolerationSeconds elapsed. The machine config daemon pod running on the same node, which has no tolerationSeconds, was not evicted. This behaves as described in https://github.com/openshift/machine-config-operator/pull/700
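
For readers unfamiliar with why the two pods behaved differently, here is a minimal Python sketch of how Kubernetes treats a NoExecute taint such as node.kubernetes.io/unreachable. It is a simplified model of the matching rules, not the actual taint manager code. The function names, the 120-second value, and the two example toleration lists are hypothetical and are not taken from PR 700 or from the real pod specs.

```python
from typing import Optional

# Taint key the node controller applies when it loses contact with a node.
UNREACHABLE = "node.kubernetes.io/unreachable"


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    # A toleration with no effect matches taints of every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with an empty key matches every taint key.
        return not key or key == taint["key"]
    # Equal (the default operator) requires key and value to match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def eviction_delay(tolerations: list, taint: dict) -> Optional[float]:
    """Seconds until a NoExecute taint evicts the pod; None means never."""
    matching = [t for t in tolerations if tolerates(t, taint)]
    if not matching:
        return 0.0  # taint is not tolerated: evicted right away
    bounded = [t["tolerationSeconds"] for t in matching
               if t.get("tolerationSeconds") is not None]
    if not bounded:
        return None  # no tolerationSeconds set: tolerated indefinitely
    return float(max(0, min(bounded)))


taint = {"key": UNREACHABLE, "effect": "NoExecute"}

# Hypothetical bounded toleration, in the style of the quorum guard pod.
guard_like = [{"key": UNREACHABLE, "operator": "Exists",
               "effect": "NoExecute", "tolerationSeconds": 120}]

# Hypothetical unbounded toleration, in the style of the daemon pod.
daemon_like = [{"operator": "Exists"}]

print(eviction_delay(guard_like, taint))
print(eviction_delay(daemon_like, taint))
```

Under these assumptions, the pod with a bounded toleration is evicted once its tolerationSeconds run out, while the pod whose toleration has no tolerationSeconds stays bound to the node, which matches the behavior observed above.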


$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.okd-2019-05-13-204940   True        False         105m    Cluster version is 4.1.0-0.okd-2019-05-13-204940

Closing as verified.

Comment 8 errata-xmlrpc 2019-06-04 10:48:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
