Description of problem:

If 2/3 master kubelets are unreachable, and at least one of the corresponding instances still has etcd up (e.g., etcd has quorum and the cluster is still working), then it is possible to delete the working etcd member via the machine-api (the non-working etcd member can be deleted first as well).

Version-Release number of selected component (if applicable):

How reproducible:
TBD. I think it will happen more often than not. Drain did get hung up on a pending pod that seemed to be ignored during the eviction calls, but we still waited for it to be deleted. The pending pod did not have a deletion timestamp. A subsequent set of pending pods accepted the eviction request and went Terminating. The pod in question was etcd-revision-pruner.

Steps to Reproduce:
1. Stop the kubelet on 2/3 master nodes
2. Delete the first stopped master via the machine-api
3. Delete the second stopped master via the machine-api

Actual results:
Both machines are deleted successfully, causing the cluster to die.

Expected results:
The second master should not get deleted.

Additional info:
While the test procedure is entirely contrived, it is perfectly possible for a kubelet to hit a bug while a master outage is already in progress. If the kubelet crashes or otherwise becomes unreachable during this time, all it takes is one errant command to delete the wrong master and the whole cluster dies; worse yet, the instances are now removed, so recovery will be much more difficult.

This condition was brought about by using the skip-wait-for-delete-timeout-seconds setting in the drain library (a minimal sketch of this setting follows after the recommended actions). Previously, pods stuck in Terminating would cause drain to hang indefinitely. We don't want that behavior for workers, but it was inadvertently protecting us from this situation with masters.

The underlying issue (IMO) is that the API should use eviction rather than deletion to de-schedule pods from unreachable nodes. As-is, the API marks everything deleted, so the pods are all in Terminating state regardless of their underlying state.

Things get a little tricky if the API uses eviction and hits the scenario of one master hard-down and one unreachable but with its pods still running. The running pod may get marked Terminating by eviction, while subsequent eviction calls against the hard-down node will fail if there is a PDB in place. Automated deletion of the running node might still proceed and cause an outage, but at that point it is more or less a multi-host outage and there is not much we can defend against.

Recommended Actions:
1) Downstream, add a mechanism to ensure a master machine cannot be removed unless etcd gives us the OK. This might include utilizing lifecycle hooks such as: https://docs.google.com/document/d/1pQLxOYA9r95jnrUZPimE4oajYjIPMcC9SPX2O8_nmRE/edit?ts=5eb19b4f#heading=h.6lu48ejhsolf
2) Upstream, ensure the API server uses eviction instead of delete for unreachable nodes. https://github.com/kubernetes/kubernetes/issues/91465
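For reference, here is a minimal, self-contained sketch of how the kubectl drain library (k8s.io/kubectl/pkg/drain) is typically configured with the SkipWaitForDeleteTimeoutSeconds option referenced above. This is not the machine controller's actual code; the KUBECONFIG-based client setup, the 60-second threshold, and the other helper settings are illustrative assumptions. It only shows why a pod already stuck in Terminating no longer blocks drain once this option is set.

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode runs the drain library's eviction/deletion logic against nodeName,
// roughly mirroring how a controller would configure the helper.
func drainNode(nodeName string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	helper := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1,
		Timeout:             5 * time.Minute,
		// Pods that have carried a deletion timestamp for longer than this many
		// seconds are skipped instead of waited on. This stops a pod stuck in
		// Terminating from blocking drain forever -- and is what removed the
		// accidental protection described in this report.
		SkipWaitForDeleteTimeoutSeconds: 60,
		Out:    os.Stdout,
		ErrOut: os.Stderr,
	}

	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		return fmt.Errorf("draining %s: %w", nodeName, err)
	}
	return nil
}

func main() {
	if err := drainNode(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}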
Looking into this some more: etcd-quorum-guard only tolerates NoExecute and NoSchedule for a short time (120 seconds). After that point, the node lifecycle controller will delete the pod. Once the pod is marked for deletion, it will not hold up our drain process.

On the flip side, if I remove the tolerationSeconds and the pod tolerates the condition indefinitely, then drain is blocked indefinitely by this issue: https://github.com/kubernetes/kubernetes/issues/80389#issuecomment-634429215

Which means that even if we modify the node lifecycle controller to use eviction rather than delete, we're still in a race with PDBs, because once the pod goes Ready: False, PDBs will block us. This can be mitigated with the following two PRs combined: https://github.com/kubernetes/kubernetes/pull/83906 and https://github.com/kubernetes/kubernetes/pull/81175

The latter PR needs to be updated, but the basic premise is that if we can validate the PDBs actually have minAvailable healthy pods (or Desired - Disrupted), then we know we can safely evict the not-ready pod, because it is already not counted towards the PDB. A rough sketch of that check follows below.
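A rough, client-side sketch of that premise, under stated assumptions: this is not the linked PRs' implementation (the real change would live in the apiserver's eviction logic); the package and function names are made up, and it assumes a client-go recent enough to have the policy/v1 PDB client and the EvictV1 helper. It only illustrates the arithmetic: a not-ready pod is not counted in CurrentHealthy, so if the remaining replicas already satisfy DesiredHealthy, evicting the not-ready pod cannot reduce availability further.

package pdbcheck

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictNotReadyPod evicts podName only when the PDB named pdbName already has
// enough healthy replicas without it.
func evictNotReadyPod(ctx context.Context, client kubernetes.Interface, ns, podName, pdbName string) error {
	pdb, err := client.PolicyV1().PodDisruptionBudgets(ns).Get(ctx, pdbName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// The not-ready pod is already excluded from CurrentHealthy, so this check
	// is about the *other* replicas: if they alone satisfy the budget, removing
	// the not-ready pod cannot make things worse.
	if pdb.Status.CurrentHealthy < pdb.Status.DesiredHealthy {
		return fmt.Errorf("PDB %s/%s would be violated: %d healthy < %d desired",
			ns, pdbName, pdb.Status.CurrentHealthy, pdb.Status.DesiredHealthy)
	}

	// Use eviction rather than a direct delete; the server-side PDB check still
	// applies, which is exactly what the PRs above propose to relax for
	// not-ready pods.
	return client.CoreV1().Pods(ns).EvictV1(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: ns},
	})
}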
> While the test procedure is entirely contrived.

While I was shadowing SRE yesterday, we saw a cluster in exactly this situation (one happy control-plane machine, one with a happy etcd member but a dead kubelet, and one with both the etcd member and the kubelet dead). To recover, we exec'ed into the etcd member on the fully-healthy node, ran some etcdctl to find the fully dead node, bounced that node, waited for its etcd member to rejoin, and then bounced the semi-healthy node. It recovered the cluster, but involved more poking around under the hood than I'd have liked ;).
(In reply to W. Trevor King from comment #2)
> > While the test procedure is entirely contrived.
>
> While I was shadowing SRE yesterday, we saw a cluster in exactly this
> situation (one happy control-plane machine, one with a happy etcd member but
> a dead kubelet, and one with both the etcd member and the kubelet dead). To
> recover, we exec'ed into the etcd member on the fully-healthy node, ran some
> etcdctl to find the fully dead node, bounced that node, waited for its etcd
> member to rejoin, and then bounced the semi-healthy node. It recovered the
> cluster, but involved more poking around under the hood than I'd have liked
> ;).

Wow, great feedback. I was uncertain how often or how likely this would happen out in the real world. These situations can certainly be tricky for our users. Once we have control-plane replacement automation and MachineHealthChecks, we can account for this scenario in a variety of ways, though the reboot case is a tricky one.
Verified on clusterversion: 4.6.0-0.nightly-2020-09-26-202331

1. Stop the kubelet on 2/3 master nodes
2. Delete first stopped master via machine-api
3. Delete second stopped master via machine-api

Neither master could be deleted.

% ./oc get node
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-141-237.us-east-2.compute.internal   Ready      worker   7h36m   v1.19.0+fff8183
ip-10-0-144-243.us-east-2.compute.internal   NotReady   master   7h46m   v1.19.0+fff8183
ip-10-0-179-155.us-east-2.compute.internal   Ready      worker   7h36m   v1.19.0+fff8183
ip-10-0-180-200.us-east-2.compute.internal   NotReady   master   7h45m   v1.19.0+fff8183
ip-10-0-212-210.us-east-2.compute.internal   Ready      master   7h45m   v1.19.0+fff8183
ip-10-0-221-81.us-east-2.compute.internal    Ready      worker   7h36m   v1.19.0+fff8183

% ./oc get machine
NAME                                        PHASE     TYPE        REGION      ZONE         AGE
zhsun927aws-b2hfl-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   7h54m
zhsun927aws-b2hfl-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   7h54m
zhsun927aws-b2hfl-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   7h54m
zhsun927aws-b2hfl-worker-us-east-2a-l42dg   Running   m5.large    us-east-2   us-east-2a   7h40m
zhsun927aws-b2hfl-worker-us-east-2b-tdvj6   Running   m5.large    us-east-2   us-east-2b   7h40m
zhsun927aws-b2hfl-worker-us-east-2c-9d2g2   Running   m5.large    us-east-2   us-east-2c   7h40m

% ./oc delete machine zhsun927aws-b2hfl-master-0
machine.machine.openshift.io "zhsun927aws-b2hfl-master-0" deleted
^C
% ./oc delete machine zhsun927aws-b2hfl-master-1
machine.machine.openshift.io "zhsun927aws-b2hfl-master-1" deleted
^C
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196