+++ This bug was initially created as a clone of Bug #1804738 +++

Description of problem:
If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, removes the unregistered node. This removal is currently not idempotent: if the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler calls to scale down the replicaset a second or third time. This can result in healthy nodes being removed from the cluster for no reason.

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:

Expected results:

Additional info:
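The failure mode in the description can be illustrated with a small model. This is only a sketch under stated assumptions, not the autoscaler's actual code or the fix that shipped: `Machine`, `MachineSet`, `naive_delete`, `idempotent_delete`, and the `worker-*` names are hypothetical, and the `deletion_requested` flag stands in for a Machine that has been marked for deletion but still exists because its finalizer has not been removed yet.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Machine:
    name: str
    # True once deletion was requested; the object lingers until the
    # (hypothetical) machine controller removes its finalizer.
    deletion_requested: bool = False


@dataclass
class MachineSet:
    replicas: int
    machines: Dict[str, Machine] = field(default_factory=dict)


def naive_delete(ms: MachineSet, name: str) -> None:
    """Non-idempotent: every call shrinks the set, even a repeated one."""
    ms.machines[name].deletion_requested = True
    ms.replicas -= 1


def idempotent_delete(ms: MachineSet, name: str) -> bool:
    """Scale down only the first time a given Machine is deleted."""
    machine = ms.machines.get(name)
    if machine is None or machine.deletion_requested:
        return False  # already gone or already being deleted: do nothing
    machine.deletion_requested = True
    ms.replicas -= 1
    return True


def make_set() -> MachineSet:
    names = ["worker-a", "worker-b", "worker-c"]
    return MachineSet(replicas=3, machines={n: Machine(n) for n in names})


# The autoscaler retries while the Machine is still waiting on its finalizer.
naive = make_set()
for _ in range(3):
    naive_delete(naive, "worker-a")

safe = make_set()
results = [idempotent_delete(safe, "worker-a") for _ in range(3)]

print(naive.replicas)  # 0: two healthy machines would be removed as well
print(safe.replicas)   # 2: only the machine behind the unregistered node
```

In this model the repeated calls are harmless only because the second variant checks whether deletion is already in progress before it touches the replica count.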
Verified
clusterversion: 4.3.9

Only the machine associated with the unregistered node was deleted.

$ oc get node
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-140-11.us-east-2.compute.internal    Ready      worker   30m     v1.16.2
ip-10-0-140-138.us-east-2.compute.internal   Ready      worker   30m     v1.16.2
ip-10-0-140-198.us-east-2.compute.internal   Ready      worker   5m39s   v1.16.2
ip-10-0-140-65.us-east-2.compute.internal    NotReady   worker   131m    v1.16.2
ip-10-0-142-13.us-east-2.compute.internal    Ready      master   139m    v1.16.2
ip-10-0-150-145.us-east-2.compute.internal   Ready      master   139m    v1.16.2
ip-10-0-158-176.us-east-2.compute.internal   Ready      worker   131m    v1.16.2
ip-10-0-172-251.us-east-2.compute.internal   Ready      master   139m    v1.16.2

$ oc get node
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-135-212.us-east-2.compute.internal   Ready      worker   70m     v1.16.2
ip-10-0-140-11.us-east-2.compute.internal    Ready      worker   120m    v1.16.2
ip-10-0-140-138.us-east-2.compute.internal   Ready      worker   120m    v1.16.2
ip-10-0-140-198.us-east-2.compute.internal   Ready      worker   95m     v1.16.2
ip-10-0-142-13.us-east-2.compute.internal    Ready      master   3h49m   v1.16.2
ip-10-0-150-145.us-east-2.compute.internal   Ready      master   3h49m   v1.16.2
ip-10-0-158-176.us-east-2.compute.internal   Ready      worker   3h40m   v1.16.2
ip-10-0-172-251.us-east-2.compute.internal   Ready      master   3h49m   v1.16.2
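The two listings above can also be compared mechanically. The following is a minimal sketch, assuming the `oc get node` output is pasted in as text; `parse_nodes`, `BEFORE`, and `AFTER` are hypothetical names for illustration, and the rows are copied from the listings with the whitespace condensed.

```python
def parse_nodes(listing: str) -> dict:
    """Map node name to STATUS from `oc get node` text, skipping the header."""
    rows = [line.split() for line in listing.strip().splitlines()]
    return {row[0]: row[1] for row in rows[1:]}


BEFORE = """
NAME STATUS ROLES AGE VERSION
ip-10-0-140-11.us-east-2.compute.internal Ready worker 30m v1.16.2
ip-10-0-140-138.us-east-2.compute.internal Ready worker 30m v1.16.2
ip-10-0-140-198.us-east-2.compute.internal Ready worker 5m39s v1.16.2
ip-10-0-140-65.us-east-2.compute.internal NotReady worker 131m v1.16.2
ip-10-0-142-13.us-east-2.compute.internal Ready master 139m v1.16.2
ip-10-0-150-145.us-east-2.compute.internal Ready master 139m v1.16.2
ip-10-0-158-176.us-east-2.compute.internal Ready worker 131m v1.16.2
ip-10-0-172-251.us-east-2.compute.internal Ready master 139m v1.16.2
"""

AFTER = """
NAME STATUS ROLES AGE VERSION
ip-10-0-135-212.us-east-2.compute.internal Ready worker 70m v1.16.2
ip-10-0-140-11.us-east-2.compute.internal Ready worker 120m v1.16.2
ip-10-0-140-138.us-east-2.compute.internal Ready worker 120m v1.16.2
ip-10-0-140-198.us-east-2.compute.internal Ready worker 95m v1.16.2
ip-10-0-142-13.us-east-2.compute.internal Ready master 3h49m v1.16.2
ip-10-0-150-145.us-east-2.compute.internal Ready master 3h49m v1.16.2
ip-10-0-158-176.us-east-2.compute.internal Ready worker 3h40m v1.16.2
ip-10-0-172-251.us-east-2.compute.internal Ready master 3h49m v1.16.2
"""

before = parse_nodes(BEFORE)
after = parse_nodes(AFTER)

# Nodes present only in the first listing were removed; nodes present only
# in the second listing were added in between.
removed = sorted(set(before) - set(after))
added = sorted(set(after) - set(before))

print("removed:", removed)
print("added:  ", added)
```

For these listings the only node that disappears is the NotReady worker `ip-10-0-140-65`, and the only new node is `ip-10-0-135-212`, which matches the statement that only the machine associated with the unregistered node was deleted.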
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1262