Bug 1840358 - Possible to delete 2 masters simultaneously if kubelet unreachable
Summary: Possible to delete 2 masters simultaneously if kubelet unreachable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Michael Gugino
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1884195
 
Reported: 2020-05-26 19:07 UTC by Michael Gugino
Modified: 2020-10-27 16:01 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1884195 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:01:40 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-etcd-operator pull 426 (closed): Bug 1840358: etcd-quorum-guard remove toleration timeouts (last updated 2021-01-27 07:26:26 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:01:44 UTC)

Description Michael Gugino 2020-05-26 19:07:21 UTC
Description of problem:

If 2 of 3 master kubelets are unreachable, and at least one of the corresponding instances still has etcd up (e.g., etcd has quorum and the cluster is still working), then it's possible to delete the working etcd member via the machine-api (you can delete the non-working etcd member first, as well).


Version-Release number of selected component (if applicable):


How reproducible:

TBD.  I think it will happen more often than not.  Drain did get hung up on a pending pod that seemed to be ignored during the eviction calls, yet we still waited for it to be deleted; that pending pod did not have a deletion timestamp.  A subsequent set of pending pods accepted the eviction request and went into Terminating.

The pod in question was etcd-revision-pruner.


Steps to Reproduce:
1.  Stop the kubelet on 2/3 master nodes
2.  Delete first stopped master via machine-api
3.  Delete second stopped master via machine-api
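
A hedged command sketch of the steps above; node and machine names are placeholders, and the machines are assumed to live in the usual openshift-machine-api namespace:

# Stop the kubelet on two of the three masters (RHCOS nodes, via a debug pod):
% oc debug node/<master-node-1> -- chroot /host systemctl stop kubelet
% oc debug node/<master-node-2> -- chroot /host systemctl stop kubelet

# Delete the corresponding machines via the machine-api:
% oc -n openshift-machine-api delete machine <first-stopped-master>
% oc -n openshift-machine-api delete machine <second-stopped-master>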

Actual results:
Both machines are deleted successfully, causing the cluster to die.

Expected results:
The second master should not be deleted.

Additional info:

While the test procedure is entirely contrived, it's perfectly possible for a kubelet to hit a bug during a master outage.  If a kubelet crashes or otherwise becomes unreachable during that window, all it would take is one errant command deleting the wrong master for the whole cluster to die.  Worse yet, the instances are then gone, so recovery will be much more difficult.

This condition was brought about by setting skip-wait-for-delete-timeout (seconds) in the drain library.  Previously, pods stuck in Terminating would cause drain to hang indefinitely.  We don't want that behavior for workers, but it was inadvertently protecting us from this situation with masters.
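
The machine controller consumes the drain library directly, but the same knob is exposed on the CLI drain command, which makes the behavior easy to see; a hedged sketch (node name and timeout value are illustrative, not necessarily what the controller uses):

# Pods whose deletion timestamp is older than 60 seconds are skipped instead of
# waited on, which is what lets drain complete against an unreachable node:
% oc adm drain <master-node> --ignore-daemonsets --force --skip-wait-for-delete-timeout=60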

The underlying issue (IMO) is that the API server should use eviction rather than deletion to de-schedule pods from unreachable nodes.  As-is, the API marks everything deleted, so the pods all sit in the 'Terminating' state regardless of their actual state on the node.  Things get a little tricky if the API uses eviction and hits the scenario of one master hard-down and one unreachable but with its pods still running: the running pod may get marked Terminating via eviction, while subsequent eviction calls against the hard-down one will fail if there's a PDB in place.  Automated deletion of the running node might still proceed and cause an outage, but at that point it's more or less a multi-host outage and there's not much we can defend against.
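
To illustrate the distinction (a sketch; the pod name is a placeholder and eviction.json is a hypothetical file containing a policy/v1beta1 Eviction object for that pod):

# Plain deletion, roughly what happens to pods on unreachable nodes today;
# PDBs are never consulted and the pod simply goes Terminating:
% oc -n openshift-etcd delete pod <etcd-quorum-guard-pod>

# Eviction goes through the Eviction subresource, so the API server can refuse
# the request when a PodDisruptionBudget would be violated:
% oc create --raw /api/v1/namespaces/openshift-etcd/pods/<etcd-quorum-guard-pod>/eviction -f eviction.json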

Recommended Actions:

1) Downstream, we need to add a mechanism to ensure a master machine cannot be removed unless etcd gives us the OK.  This might include utilizing lifecycle hooks such as: https://docs.google.com/document/d/1pQLxOYA9r95jnrUZPimE4oajYjIPMcC9SPX2O8_nmRE/edit?ts=5eb19b4f#heading=h.6lu48ejhsolf

2) Upstream, ensure the API server uses eviction instead of delete for unreachable nodes.

https://github.com/kubernetes/kubernetes/issues/91465

Comment 1 Michael Gugino 2020-05-27 20:25:57 UTC
Looking into this some: etcd-quorum-guard only tolerates the NoExecute and NoSchedule taints for a short time (120 seconds).  After that point, the node lifecycle controller will delete it.  Once the pod is marked for deletion, it will not hang up our draining process.
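
For reference, the tolerations in question can be inspected like this (a sketch; the deployment is assumed to be the one managed by cluster-etcd-operator in the openshift-etcd namespace):

# Shows the toleration list, including any tolerationSeconds values:
% oc -n openshift-etcd get deployment etcd-quorum-guard -o jsonpath='{.spec.template.spec.tolerations}'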

On the flip side, if I remove the TolerationSeconds and the pod tolerates the condition indefinitely, then drain is blocked indefinitely by this issue: https://github.com/kubernetes/kubernetes/issues/80389#issuecomment-634429215

This means that even if we modify the node lifecycle controller to utilize eviction rather than delete, we're still in a race with PDBs, because once the pod goes Ready: False, PDBs will block us.

This can be mitigated with the following two PRs combined:  https://github.com/kubernetes/kubernetes/pull/83906 and https://github.com/kubernetes/kubernetes/pull/81175

The latter PR needs to be updated, but the basic premise is that if we can validate the PDBs actually have minAvailable (i.e., desired minus disrupted) healthy pods, then we know we can safely evict the not-ready pod, because it's already not counted towards the PDBs.
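
A sketch of that check against the PDB's status fields (the PDB name is illustrative; any PDB covering the pod would do):

# A not-ready pod is not counted in currentHealthy, so evicting it cannot push
# currentHealthy below desiredHealthy; that is the condition the linked PRs
# want the eviction path to recognize even when disruptionsAllowed is 0:
% oc -n openshift-etcd get pdb etcd-quorum-guard -o jsonpath='{.status.currentHealthy} {.status.desiredHealthy} {.status.disruptionsAllowed}'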

Comment 2 W. Trevor King 2020-06-25 06:10:41 UTC
> While the test procedure is entirely contrived.

While I was shadowing SRE yesterday, we saw a cluster in exactly this situation (one happy control-plane machine, one with a happy etcd member but dead kubelet, and one with both the etcd member and kubelet dead).  To recover, we exec'ed into the etcd member on the fully-healthy node, ran some etcdctl to find the fully dead node, bounced that node, waited for its etcd member to rejoin, and then bounced the semi-healthy node.  It recovered the cluster, but involved more poking around under the hood than I'd have liked ;).
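
For anyone landing in the same state, a rough sketch of that recovery from the CLI (pod and node names are placeholders; the etcd pods are assumed to be the usual etcd-<node-name> pods in openshift-etcd):

% oc -n openshift-etcd rsh etcd-<healthy-master-node>
# then, inside the etcd pod:
etcdctl member list -w table        # identify the member on the fully dead node
etcdctl endpoint health --cluster   # confirm quorum as each node is bounced back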

Comment 3 Michael Gugino 2020-06-25 13:28:34 UTC
(In reply to W. Trevor King from comment #2)
> > While the test procedure is entirely contrived.
> 
> While I was shadowing SRE yesterday, we saw a cluster in exactly this
> situation (one happy control-plane machine, one with a happy etcd member but
> dead kubelet, and one with both the etcd member and kubelet dead).  To
> recover, we exec'ed into the etcd member on the fully-healthy node, ran some
> etcdctl to find the fully dead node, bounced that node, waited for its etcd
> member to rejoin, and then bounced the semi-healthy node.  It recovered the
> cluster, but involved more poking around under the hood than I'd have liked
> ;).

Wow, great feedback.  I was uncertain how often or likely this would happen out in the real world.  These situations can certainly be tricky for our users.  Once we have control-plane replacement automation and MachineHealthChecks, we can account for this scenario in a variety of ways, though the reboot case is a tricky one.

Comment 6 sunzhaohua 2020-09-27 10:00:08 UTC
Verified
clusterversion:4.6.0-0.nightly-2020-09-26-202331
1.  Stop the kubelet on 2/3 master nodes
2.  Delete first stopped master via machine-api
3.  Delete second stopped master via machine-api

Neither could be deleted.
% ./oc get node
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-141-237.us-east-2.compute.internal   Ready      worker   7h36m   v1.19.0+fff8183
ip-10-0-144-243.us-east-2.compute.internal   NotReady   master   7h46m   v1.19.0+fff8183
ip-10-0-179-155.us-east-2.compute.internal   Ready      worker   7h36m   v1.19.0+fff8183
ip-10-0-180-200.us-east-2.compute.internal   NotReady   master   7h45m   v1.19.0+fff8183
ip-10-0-212-210.us-east-2.compute.internal   Ready      master   7h45m   v1.19.0+fff8183
ip-10-0-221-81.us-east-2.compute.internal    Ready      worker   7h36m   v1.19.0+fff8183

% ./oc get machine
NAME                                        PHASE     TYPE        REGION      ZONE         AGE
zhsun927aws-b2hfl-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   7h54m
zhsun927aws-b2hfl-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   7h54m
zhsun927aws-b2hfl-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   7h54m
zhsun927aws-b2hfl-worker-us-east-2a-l42dg   Running   m5.large    us-east-2   us-east-2a   7h40m
zhsun927aws-b2hfl-worker-us-east-2b-tdvj6   Running   m5.large    us-east-2   us-east-2b   7h40m
zhsun927aws-b2hfl-worker-us-east-2c-9d2g2   Running   m5.large    us-east-2   us-east-2c   7h40m

% ./oc delete machine zhsun927aws-b2hfl-master-0
machine.machine.openshift.io "zhsun927aws-b2hfl-master-0" deleted
^C
% ./oc delete machine zhsun927aws-b2hfl-master-1
machine.machine.openshift.io "zhsun927aws-b2hfl-master-1" deleted
^C

Comment 9 errata-xmlrpc 2020-10-27 16:01:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

