Bug 1805153 - Machine Autoscaler does not remove nodes idempotently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1804738
Blocks: 1805160
Reported: 2020-02-20 11:34 UTC by Joel Speed
Modified: 2020-05-15 16:14 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When scaling down, in certain scenarios, the autoscaler would remove more than the intended number of nodes, removing required capacity from the cluster and resulting in a scale up being required and interruption to workloads.
Clone Of: 1804738
Environment:
Last Closed: 2020-05-15 16:14:54 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
GitHub | openshift kubernetes-autoscaler pull 128 | 0 | None | closed | [release-4.4] BUG 1805153: Ensure DeleteNodes doesn't delete a node twice | 2020-05-14 10:23:41 UTC

Description Joel Speed 2020-02-20 11:34:00 UTC
+++ This bug was initially created as a clone of Bug #1804738 +++

Description of problem:

If a cloud provider instance is removed, the Machine Autoscaler determines that the corresponding Machine has an unregistered node and, after 15 minutes, removes the unregistered node.

This removal is currently not idempotent: if the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler calls to scale down the replicaset a second or third time.

This can result in healthy nodes being removed from the cluster for no reason.
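
To illustrate what an idempotent removal could look like, here is a minimal Go sketch. It is not the code from the linked pull request; the `Machine` and `MachineSet` types, the `Deleting` flag, and the `DeleteNode` method are hypothetical simplifications. The assumed idea is that a machine already marked for deletion is skipped, so repeated calls during a slow deletion do not lower the replica count again.

```go
package main

import "fmt"

// Machine is a simplified, hypothetical stand-in for a Machine object.
type Machine struct {
	Name     string
	Deleting bool // assumed marker: true once deletion has been requested
}

// MachineSet is a simplified, hypothetical stand-in for a scalable group.
type MachineSet struct {
	Replicas int
	Machines map[string]*Machine
}

// DeleteNode marks the named machine for deletion and lowers the replica
// count, but only on the first call for that machine. It reports whether
// this call changed anything.
func (ms *MachineSet) DeleteNode(name string) (bool, error) {
	m, ok := ms.Machines[name]
	if !ok {
		return false, fmt.Errorf("machine %q not found", name)
	}
	if m.Deleting {
		// Deletion is already in progress (for example, the finalizer has
		// not been removed yet), so a repeat call must not scale down again.
		return false, nil
	}
	m.Deleting = true
	ms.Replicas--
	return true, nil
}

// newExampleSet builds a three-machine set with made-up names.
func newExampleSet() *MachineSet {
	return &MachineSet{
		Replicas: 3,
		Machines: map[string]*Machine{
			"worker-a": {Name: "worker-a"},
			"worker-b": {Name: "worker-b"},
			"worker-c": {Name: "worker-c"},
		},
	}
}

func main() {
	ms := newExampleSet()
	// Simulate the autoscaler retrying while the machine is slow to go away.
	for i := 0; i < 3; i++ {
		changed, err := ms.DeleteNode("worker-a")
		fmt.Println(changed, err, ms.Replicas)
	}
}
```

In this sketch the replica count drops from 3 to 2 on the first call and stays at 2 on the retries, which is the behaviour the report asks for.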

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:


Expected results:


Additional info:

Comment 3 sunzhaohua 2020-02-25 07:08:31 UTC
Verified on 4.4.0-0.nightly-2020-02-21-045519


Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   142m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   142m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   142m
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m



$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   158m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   158m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   158m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m

