1805160 – Machine Autoscaler does not remove nodes idempotently

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.3.z
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:	1805153
Blocks:

Reported:	2020-02-20 11:48 UTC by Joel Speed
Modified:	2020-04-08 07:40 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1804738
Environment:
Last Closed:	2020-04-08 07:39:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kubernetes-autoscaler pull 129	0	None	closed	[release-4.3] BUG 1805160: Ensure DeleteNodes doesn't delete a node twice	2020-04-23 20:47:53 UTC
Red Hat Product Errata	RHBA-2020:1262	0	None	None	None	2020-04-08 07:40:03 UTC

Description Joel Speed 2020-02-20 11:48:25 UTC

+++ This bug was initially created as a clone of Bug #1804738 +++

Description of problem:

If a cloud provider instance is removed, the Machine Autoscaler determines that the Machine has an unregistered node and, after 15 minutes, will remove the unregistered node.

This removal is currently not idempotent: if the Machine takes some time to be deleted (for example, because the Machine controller is slow to remove the finalizer), the Autoscaler will call to scale down the MachineSet's replica count a second or third time for the same node.

This can result in healthy nodes being removed from the cluster for no reason.
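
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not the autoscaler's actual code (which is written in Go) and not the change made in the linked pull request. The names `Machine`, `MachineSet`, `delete_node_naive`, and `delete_node_idempotent` are hypothetical, and the `deleting` flag is an assumed stand-in for a Machine that has been marked for deletion but still exists because its finalizer has not been removed.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Machine:
    name: str
    # Assumption: True means deletion was requested but the object still
    # exists (e.g. the finalizer has not been removed yet).
    deleting: bool = False


@dataclass
class MachineSet:
    replicas: int
    machines: Dict[str, Machine] = field(default_factory=dict)


def delete_node_naive(ms: MachineSet, name: str) -> None:
    """Non-idempotent: every call scales down, even for the same machine."""
    ms.machines[name].deleting = True
    ms.replicas -= 1


def delete_node_idempotent(ms: MachineSet, name: str) -> bool:
    """Scale down at most once per machine.

    Returns True if this call triggered the scale-down, False if the
    machine is unknown or was already marked for deletion.
    """
    machine = ms.machines.get(name)
    if machine is None or machine.deleting:
        return False
    machine.deleting = True
    ms.replicas -= 1
    return True
```

With a three-replica set, calling `delete_node_naive` three times for the same slow-to-delete machine drives `replicas` to 0, which corresponds to healthy machines being removed. Calling `delete_node_idempotent` three times leaves `replicas` at 2, because the second and third calls see that the machine is already being deleted and do nothing.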

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Easily reproducible

Steps to Reproduce:
1. Deploy an OpenShift cluster with a Machine Autoscaler pointing to a MachineSet
2. Ensure there are at least 3 nodes in the MachineSet (you may need to add extra workloads)
3. Terminate an instance from the cloud provider side
4. Wait for 15 minutes and observe several Machines being deleted at once

Actual results:


Expected results:


Additional info:

Comment 3 sunzhaohua 2020-03-27 06:52:27 UTC

Verified

clusterversion: 4.3.9

Only the machine associated with the unregistered node was deleted.

$ oc get node
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-140-11.us-east-2.compute.internal    Ready      worker   30m     v1.16.2
ip-10-0-140-138.us-east-2.compute.internal   Ready      worker   30m     v1.16.2
ip-10-0-140-198.us-east-2.compute.internal   Ready      worker   5m39s   v1.16.2
ip-10-0-140-65.us-east-2.compute.internal    NotReady   worker   131m    v1.16.2
ip-10-0-142-13.us-east-2.compute.internal    Ready      master   139m    v1.16.2
ip-10-0-150-145.us-east-2.compute.internal   Ready      master   139m    v1.16.2
ip-10-0-158-176.us-east-2.compute.internal   Ready      worker   131m    v1.16.2
ip-10-0-172-251.us-east-2.compute.internal   Ready      master   139m    v1.16.2
$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-135-212.us-east-2.compute.internal   Ready    worker   70m     v1.16.2
ip-10-0-140-11.us-east-2.compute.internal    Ready    worker   120m    v1.16.2
ip-10-0-140-138.us-east-2.compute.internal   Ready    worker   120m    v1.16.2
ip-10-0-140-198.us-east-2.compute.internal   Ready    worker   95m     v1.16.2
ip-10-0-142-13.us-east-2.compute.internal    Ready    master   3h49m   v1.16.2
ip-10-0-150-145.us-east-2.compute.internal   Ready    master   3h49m   v1.16.2
ip-10-0-158-176.us-east-2.compute.internal   Ready    worker   3h40m   v1.16.2
ip-10-0-172-251.us-east-2.compute.internal   Ready    master   3h49m   v1.16.2

Comment 5 errata-xmlrpc 2020-04-08 07:39:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1262
