Bug 1804738

Summary: Machine Autoscaler does not remove nodes idempotently
Product: OpenShift Container Platform Reporter: Joel Speed <jspeed>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: ademicev
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
In certain scale-down scenarios, the autoscaler removed more nodes than intended. This took required capacity out of the cluster, forcing a subsequent scale-up and interrupting workloads.
Story Points: ---
Clone Of:
: 1805153 1805160 Environment:
Last Closed: 2020-07-13 17:16:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1805153    

Description Joel Speed 2020-02-19 14:27:19 UTC
Description of problem:

If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, will remove the unregistered node.
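As a rough illustration of the timeout check described above, here is a minimal Python sketch. It is not the autoscaler's actual code: the `Machine` record, its fields, and the `machines_to_remove` helper are hypothetical names introduced for this example; only the 15-minute window comes from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Taken from the description: unregistered nodes are removed after 15 minutes.
UNREGISTERED_TIMEOUT = timedelta(minutes=15)


@dataclass
class Machine:
    """Hypothetical, simplified view of a Machine for this sketch."""
    name: str
    node_registered: bool                          # True if a Node is linked to the Machine
    unregistered_since: Optional[datetime] = None  # when it was first seen without a Node


def machines_to_remove(machines: List[Machine], now: datetime) -> List[str]:
    """Return the names of Machines that have been unregistered for the full timeout."""
    expired = []
    for m in machines:
        # Skip Machines that have a Node, or that were never seen unregistered.
        if m.node_registered or m.unregistered_since is None:
            continue
        if now - m.unregistered_since >= UNREGISTERED_TIMEOUT:
            expired.append(m.name)
    return expired
```

Under these assumptions, a Machine whose instance disappeared 16 minutes ago is selected for removal, while one that disappeared 5 minutes ago is left alone.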

This removal is currently not idempotent: if the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler calls to scale down the replicaset a second or third time.

This can result in healthy nodes being removed from the cluster for no reason.
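The difference between the non-idempotent behaviour and an idempotent one can be sketched in Python as follows. This is an illustration under stated assumptions, not the fix that shipped: `MachineState`, `MachineSet`, `delete_node_naive`, and `delete_node_idempotent` are hypothetical names, and the `deleting` flag stands in for whatever marker (such as a deletion timestamp) the real code would consult.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MachineState:
    """Hypothetical record; `deleting` stands in for a deletion marker on the Machine."""
    name: str
    deleting: bool = False


@dataclass
class MachineSet:
    """Hypothetical, simplified scalable resource that owns the Machines."""
    replicas: int
    machines: Dict[str, MachineState] = field(default_factory=dict)


def delete_node_naive(ms: MachineSet, name: str) -> None:
    # Non-idempotent: every call lowers the replica count, even if the
    # Machine is already being deleted and is only waiting on its finalizer.
    ms.machines[name].deleting = True
    ms.replicas -= 1


def delete_node_idempotent(ms: MachineSet, name: str) -> bool:
    # Idempotent: a Machine that is unknown or already marked for deletion
    # is skipped, so repeated calls lower the replica count at most once.
    machine = ms.machines.get(name)
    if machine is None or machine.deleting:
        return False
    machine.deleting = True
    ms.replicas -= 1
    return True
```

With three replicas and three calls for the same Machine, the naive version ends at zero replicas, while the idempotent version ends at two, which matches the expectation that only the Machine tied to the unregistered node is removed.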

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:


Expected results:


Additional info:

Comment 3 sunzhaohua 2020-02-21 10:54:57 UTC
Verified
4.4.0-0.nightly-2020-02-21-045519


Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   142m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   142m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   142m
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m



$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   158m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   158m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   158m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m

Comment 4 sunzhaohua 2020-02-25 08:23:23 UTC
Verified in 4.5
clusterversion: 4.5.0-0.ci-2020-02-25-010652

Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h41m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h41m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h41m
zhsun45-tcth9-worker-us-east-2a-8s4pd   Failed    m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h35m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h35m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h35m

$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h55m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h55m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h55m
zhsun45-tcth9-worker-us-east-2a-5679h   Running   m4.large    us-east-2   us-east-2a   11m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h49m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h49m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h49m

Comment 6 errata-xmlrpc 2020-07-13 17:16:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409