Bug 1804738 - Machine Autoscaler does not remove nodes idempotently
Summary: Machine Autoscaler does not remove nodes idempotently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1805153
Reported: 2020-02-19 14:27 UTC by Joel Speed
Modified: 2020-07-13 17:16 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In certain scale-down scenarios, the autoscaler removed more nodes than intended. This took required capacity away from the cluster, forced a subsequent scale-up, and interrupted workloads.
Clone Of:
Clones: 1805153 1805160
Environment:
Last Closed: 2020-07-13 17:16:11 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                        Private  Priority  Status  Summary                                                      Last Updated
Github                  openshift kubernetes-autoscaler pull 125  0        None      closed  BUG 1804738: Ensure DeleteNodes doesn't delete a node twice  2021-01-13 08:42:10 UTC
Red Hat Product Errata  RHBA-2020:2409                            0        None      None    None                                                         2020-07-13 17:16:36 UTC

Description Joel Speed 2020-02-19 14:27:19 UTC
Description of problem:

If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, will remove the unregistered node.

This removal is currently not idempotent. If the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler will call to scale down the replicaset a second or third time.

This can result in healthy nodes being removed from the cluster for no reason.
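
To illustrate what an idempotent removal would look like, here is a minimal Python sketch. It is not the autoscaler code and not the actual fix; `Machine`, `MachineSet`, `delete_nodes` and the `deleting` flag are hypothetical stand-ins (the flag models a Machine that already has a deletion request pending while its finalizer is still present). The point is only that a repeated call for the same machine must not decrement the replica count again.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    # Hypothetical flag: True once deletion has been requested but the
    # Machine object still exists (finalizer not yet removed).
    deleting: bool = False


@dataclass
class MachineSet:
    replicas: int
    machines: dict = field(default_factory=dict)


def delete_nodes(machine_set, names):
    """Request deletion of the named machines, at most once each.

    Returns the names for which a new deletion was actually issued.
    """
    issued = []
    for name in names:
        machine = machine_set.machines.get(name)
        if machine is None or machine.deleting:
            # Unknown or already being deleted: the repeat call is a no-op,
            # so the replica count is not decremented a second time.
            continue
        machine.deleting = True
        machine_set.replicas -= 1
        issued.append(name)
    return issued


def build_example():
    """Three-machine set, matching the minimum used in the reproduction steps."""
    machines = {n: Machine(n) for n in ("worker-a", "worker-b", "worker-c")}
    return MachineSet(replicas=3, machines=machines)


if __name__ == "__main__":
    ms = build_example()
    # Simulate the autoscaler retrying while the Machine is slow to go away.
    for attempt in range(3):
        issued = delete_nodes(ms, ["worker-a"])
        print(attempt, issued, ms.replicas)
```

In this model the replica count drops from 3 to 2 on the first call and stays at 2 on the retries. Without the `deleting` check, each retry would decrement it again, which is the behaviour described above.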

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:


Expected results:


Additional info:

Comment 3 sunzhaohua 2020-02-21 10:54:57 UTC
Verified
4.4.0-0.nightly-2020-02-21-045519


Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   142m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   142m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   142m
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m



$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   158m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   158m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   158m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m

Comment 4 sunzhaohua 2020-02-25 08:23:23 UTC
Verified in 4.5
clusterversion: 4.5.0-0.ci-2020-02-25-010652

Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h41m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h41m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h41m
zhsun45-tcth9-worker-us-east-2a-8s4pd   Failed    m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h35m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h35m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h35m

$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h55m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h55m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h55m
zhsun45-tcth9-worker-us-east-2a-5679h   Running   m4.large    us-east-2   us-east-2a   11m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h49m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h49m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h49m

Comment 6 errata-xmlrpc 2020-07-13 17:16:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

