Description of problem:
If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, removes the unregistered node. This is currently not done idempotently: if the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler will call to scale down the replicaset a second or third time. This can result in healthy nodes being removed from the cluster for no reason.

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:

Expected results:

Additional info:
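To illustrate the failure mode described above, here is a minimal, self-contained sketch in Python. It is not the autoscaler's actual code: the `Machine` and `MachineSet` classes, the `marked_for_deletion` flag, and both scale-down functions are hypothetical stand-ins. It only shows why repeated scale-down calls for the same slow-to-delete Machine shrink the set more than intended, and how a guard on "already being deleted" would make the call idempotent.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    # Hypothetical flag standing in for "deletion already requested"
    # (for example a deletion timestamp or a delete annotation).
    marked_for_deletion: bool = False


@dataclass
class MachineSet:
    replicas: int
    machines: dict = field(default_factory=dict)


def naive_scale_down(ms: MachineSet, name: str) -> None:
    """Non-idempotent: every call lowers replicas, even on a retry."""
    ms.machines[name].marked_for_deletion = True
    ms.replicas -= 1


def idempotent_scale_down(ms: MachineSet, name: str) -> bool:
    """Lower replicas only the first time a given machine is marked."""
    machine = ms.machines.get(name)
    if machine is None or machine.marked_for_deletion:
        return False  # already gone or already being deleted: do nothing
    machine.marked_for_deletion = True
    ms.replicas -= 1
    return True


def make_set() -> MachineSet:
    """Build a three-machine set, matching step 2 of the reproducer."""
    names = ["w-a", "w-b", "w-c"]
    return MachineSet(replicas=3, machines={n: Machine(n) for n in names})


if __name__ == "__main__":
    # The machine "w-a" lost its instance; deletion is slow, so the
    # scale-down request is issued three times before it disappears.
    naive = make_set()
    for _ in range(3):
        naive_scale_down(naive, "w-a")
    print("naive replicas:", naive.replicas)

    guarded = make_set()
    for _ in range(3):
        idempotent_scale_down(guarded, "w-a")
    print("guarded replicas:", guarded.replicas)
```

In the naive variant the replica count drops once per retry, so healthy Machines would have to be removed to satisfy it. In the guarded variant only the first call has any effect.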
Verified
4.4.0-0.nightly-2020-02-21-045519

Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   142m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   142m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   142m
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   158m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   158m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   158m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m
Verified in 4.5
clusterversion: 4.5.0-0.ci-2020-02-25-010652

Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h41m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h41m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h41m
zhsun45-tcth9-worker-us-east-2a-8s4pd   Failed    m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h35m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h35m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h35m

[szh@localhost installer]$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h55m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h55m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h55m
zhsun45-tcth9-worker-us-east-2a-5679h   Running   m4.large    us-east-2   us-east-2a   11m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h49m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h49m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h49m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409