Bug 1805153

Summary: Machine Autoscaler does not remove nodes idempotently
Product: OpenShift Container Platform Reporter: Joel Speed <jspeed>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: ademicev, vlaad, zhsun
Version: 4.4   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
In certain scale-down scenarios, the autoscaler removed more nodes than intended. This took required capacity out of the cluster, interrupted workloads, and made a subsequent scale up necessary.
Story Points: ---
Clone Of: 1804738 Environment:
Last Closed: 2020-05-15 16:14:54 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1804738    
Bug Blocks: 1805160    

Description Joel Speed 2020-02-20 11:34:00 UTC
+++ This bug was initially created as a clone of Bug #1804738 +++

Description of problem:

If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, will remove the unregistered node.

This removal is currently not idempotent: if the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler will call to scale down the replicaset a second or third time.

This can result in healthy nodes being removed from the cluster for no reason.
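
The failure mode described above can be illustrated with a minimal sketch. This is not the autoscaler's code: the `Machine` and `MachineSet` classes, the `deleting` flag (standing in for a Machine that is already marked for deletion but still held by its finalizer), and both `scale_down_*` functions are hypothetical names used only to contrast a non-idempotent scale down with an idempotent one.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    deleting: bool = False  # hypothetical stand-in for "deletion already requested"


@dataclass
class MachineSet:
    replicas: int
    machines: dict = field(default_factory=dict)


def scale_down_naive(ms, name):
    """Non-idempotent: every call lowers replicas, even on a retry."""
    ms.machines[name].deleting = True
    ms.replicas -= 1


def scale_down_idempotent(ms, name):
    """Lower replicas only the first time a Machine is marked for deletion."""
    machine = ms.machines.get(name)
    if machine is None or machine.deleting:
        return False  # already gone or deletion in progress: nothing to do
    machine.deleting = True
    ms.replicas -= 1
    return True


def make_set():
    names = ["w-a", "w-b", "w-c"]
    return MachineSet(replicas=3, machines={n: Machine(n) for n in names})


naive = make_set()
safe = make_set()

# Simulate three removal attempts for the same Machine while its
# finalizer has not yet been removed.
for _ in range(3):
    scale_down_naive(naive, "w-a")
    scale_down_idempotent(safe, "w-a")

print(naive.replicas)  # 0: two healthy replicas were lost as well
print(safe.replicas)   # 2: only the intended replica was removed
```

In the naive variant the repeated calls drain the whole set, which matches the symptom reported here. The guarded variant treats a repeated request for the same Machine as a no-op.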

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:


Expected results:


Additional info:

Comment 3 sunzhaohua 2020-02-25 07:08:31 UTC
Verified
4.4.0-0.nightly-2020-02-21-045519


Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   142m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   142m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   142m
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m



$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   158m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   158m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   158m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m
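
One way to check the two listings above mechanically is to compare the sets of Machine names. The sketch below is only an illustration: the `machine_names` helper is a hypothetical name, and the worker rows are copied from the two `oc get machine` outputs in this comment (master rows omitted for brevity).

```python
def machine_names(listing):
    """Return the set of Machine names from `oc get machine` table output."""
    lines = [ln for ln in listing.strip().splitlines() if ln.strip()]
    return {ln.split()[0] for ln in lines[1:]}  # skip the header row


before = """
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m
"""

after = """
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m
"""

removed = machine_names(before) - machine_names(after)
added = machine_names(after) - machine_names(before)

print(sorted(removed))  # ['zhsun2-k9bts-w-a-45r2z']
print(sorted(added))    # ['zhsun2-k9bts-w-a-lh8jc']
```

Exactly one Machine disappears between the two listings (the Failed one) and one new Machine appears, which is consistent with the verification result stated above.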