1804738 – Machine Autoscaler does not remove nodes idempotently

Summary: Machine Autoscaler does not remove nodes idempotently

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1805153

Reported:	2020-02-19 14:27 UTC by Joel Speed
Modified:	2020-07-13 17:16 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	When scaling down, in certain scenarios, the autoscaler would remove more than the intended number of nodes, removing required capacity from the cluster and resulting in a scale up being required and interruption to workloads.
Clone Of:
Clones:	1805153 1805160
Environment:
Last Closed:	2020-07-13 17:16:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kubernetes-autoscaler pull 125	0	None	closed	BUG 1804738: Ensure DeleteNodes doesn't delete a node twice	2021-01-13 08:42:10 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:16:36 UTC

Description Joel Speed 2020-02-19 14:27:19 UTC

Description of problem:

If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, removes the unregistered node.

This removal is currently not idempotent. If the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler calls to scale down the replicaset a second or third time.

As a result, healthy nodes can be removed from the cluster for no reason.
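
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch in Python, not the actual patch: the `Machine` and `MachineSet` classes, the `deleting` flag (standing in for a Machine that is marked for deletion but still held by its finalizer), and both `delete_nodes_*` functions are hypothetical names invented for the illustration. It assumes the autoscaler retries the same removal on each loop while the Machine is still present.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    # Hypothetical flag: True once deletion has been requested but the
    # finalizer has not been removed yet, so the object still exists.
    deleting: bool = False


@dataclass
class MachineSet:
    replicas: int
    machines: dict = field(default_factory=dict)

    def add(self, name):
        self.machines[name] = Machine(name)


def delete_nodes_naive(ms, names):
    """Non-idempotent: every call shrinks the replica count again."""
    for name in names:
        ms.machines[name].deleting = True
    ms.replicas -= len(names)


def delete_nodes_idempotent(ms, names):
    """Idempotent: machines already being deleted are skipped."""
    pending = [
        n for n in names
        if n in ms.machines and not ms.machines[n].deleting
    ]
    for name in pending:
        ms.machines[name].deleting = True
    ms.replicas -= len(pending)
    return pending


def simulate(delete_fn, retries=3):
    """Request removal of the same machine `retries` times in a row."""
    ms = MachineSet(replicas=3)
    for name in ("worker-a", "worker-b", "worker-c"):
        ms.add(name)
    for _ in range(retries):
        delete_fn(ms, ["worker-a"])
    return ms.replicas
```

With three retries, `simulate(delete_nodes_naive)` ends at 0 replicas, while `simulate(delete_nodes_idempotent)` ends at 2, which matches the intended outcome of removing only the one unregistered node.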

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

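Actual results:

Several Machines are deleted at once, including Machines backing healthy nodes.

Expected results:

Only the Machine associated with the unregistered node is deleted.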

Additional info:

Comment 3 sunzhaohua 2020-02-21 10:54:57 UTC

Verified in 4.4
clusterversion: 4.4.0-0.nightly-2020-02-21-045519


Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   142m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   142m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   142m
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   136m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   136m



$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-m-0         Running   n1-standard-4   us-central1   us-central1-a   158m
zhsun2-k9bts-m-1         Running   n1-standard-4   us-central1   us-central1-b   158m
zhsun2-k9bts-m-2         Running   n1-standard-4   us-central1   us-central1-c   158m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
zhsun2-k9bts-w-a-z2c9x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-b-hvkfh   Running   n1-standard-4   us-central1   us-central1-b   152m
zhsun2-k9bts-w-c-g7h6v   Running   n1-standard-4   us-central1   us-central1-c   152m
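
The before/after comparison above can also be checked mechanically. The following is a minimal Python sketch, assuming the default column layout of `oc get machine` shown in this comment (a header row, then one machine per row with the name in the first column). The helper names `machine_names` and `diff_machines` are made up for this example, and the sample data is a subset of the worker rows from the two listings.

```python
def machine_names(oc_output):
    """Return the set of machine names from `oc get machine` text output."""
    lines = [ln for ln in oc_output.strip().splitlines() if ln.strip()]
    # Skip the header row; the name is the first whitespace-separated field.
    return {ln.split()[0] for ln in lines[1:]}


def diff_machines(before, after):
    """Return (removed, added) machine names between two listings."""
    b, a = machine_names(before), machine_names(after)
    return sorted(b - a), sorted(a - b)


BEFORE = """\
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-w-a-45r2z   Failed    n1-standard-4   us-central1   us-central1-a   16m
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   136m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   16m
"""

AFTER = """\
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun2-k9bts-w-a-dd9n2   Running   n1-standard-4   us-central1   us-central1-a   152m
zhsun2-k9bts-w-a-jc79x   Running   n1-standard-4   us-central1   us-central1-a   32m
zhsun2-k9bts-w-a-lh8jc   Running   n1-standard-4   us-central1   us-central1-a   8m26s
"""

removed, added = diff_machines(BEFORE, AFTER)
```

For this sample, `removed` contains only `zhsun2-k9bts-w-a-45r2z` (the Failed machine) and `added` contains only `zhsun2-k9bts-w-a-lh8jc`, so exactly one machine was deleted and one replacement was created.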

Comment 4 sunzhaohua 2020-02-25 08:23:23 UTC

Verified in 4.5
clusterversion: 4.5.0-0.ci-2020-02-25-010652

Only the machine associated with the unregistered node was deleted.

$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h41m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h41m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h41m
zhsun45-tcth9-worker-us-east-2a-8s4pd   Failed    m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h35m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   23m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h35m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h35m
$ oc get machine
NAME                                    PHASE     TYPE        REGION      ZONE         AGE
zhsun45-tcth9-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   4h55m
zhsun45-tcth9-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   4h55m
zhsun45-tcth9-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   4h55m
zhsun45-tcth9-worker-us-east-2a-5679h   Running   m4.large    us-east-2   us-east-2a   11m
zhsun45-tcth9-worker-us-east-2a-92mmd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-fstfv   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-km77d   Running   m4.large    us-east-2   us-east-2a   4h49m
zhsun45-tcth9-worker-us-east-2a-qtkcd   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-z6rk8   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2a-znh2d   Running   m4.large    us-east-2   us-east-2a   37m
zhsun45-tcth9-worker-us-east-2b-v6mwk   Running   m4.large    us-east-2   us-east-2b   4h49m
zhsun45-tcth9-worker-us-east-2c-fs4r6   Running   m4.large    us-east-2   us-east-2c   4h49m

Comment 6 errata-xmlrpc 2020-07-13 17:16:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409