Bug 2025767

Summary:	VMs orphaned during machineset scaleup
Product:	OpenShift Container Platform	Reporter:	Matt Bargenquast <mbargenq>
Component:	Cloud Compute	Assignee:	Joel Speed <jspeed>
Cloud Compute sub component:	Other Providers	QA Contact:	sunzhaohua <zhsun>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	cattias, cblecker, dofinn, haowang, todabasi, travi, wking
Version:	4.9	Keywords:	ServiceDeliveryBlocker
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: In certain scenarios, the Machine API could reconcile a Machine before AWS has synced a VM creation across its API. Consequence: AWS reports that the newly created VM does not exist, which causes Machine API to fail the Machine. Fix: Wait to mark a Machine as provisioned until the AWS API has synced and reports the instance exists. Result: Likelihood of leaking instances has been reduced.	Story Points:	---
Clone Of:
Clones:	2029993 (view as bug list)		Environment:
Last Closed:	2022-03-10 16:30:15 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2029993
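
The Fix in the Doc Text above (wait until the AWS API reports the instance before marking the Machine as provisioned) can be illustrated with a minimal Python sketch. This is not the Machine API code. The function names, the `instance_exists` callback, and the timeout values are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable


def wait_until_instance_visible(
    instance_exists: Callable[[str], bool],  # hypothetical lookup against the cloud API
    instance_id: str,
    timeout_s: float = 60.0,   # assumed value, not taken from the bug
    interval_s: float = 2.0,   # assumed value, not taken from the bug
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the cloud API reports the instance or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if instance_exists(instance_id):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def phase_after_create(instance_exists: Callable[[str], bool], instance_id: str, **kwargs) -> str:
    """Report 'Provisioned' only once the instance is visible; otherwise ask for a retry.

    Returning 'Requeue' instead of failing the Machine is the point of the sketch:
    a lookup miss right after creation is treated as 'not synced yet', not as 'gone'.
    """
    if wait_until_instance_visible(instance_exists, instance_id, **kwargs):
        return "Provisioned"
    return "Requeue"
```

The design choice shown here is that a "not found" answer immediately after a create call is treated as temporary, so the controller retries the lookup rather than creating another instance.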

Description Matt Bargenquast 2021-11-23 02:36:14 UTC

Description of problem:

OVERVIEW:
Creation of a new MachineSet on an AWS cluster left 30 orphaned VM instances in the cloud account. The final (31st) VM became a machine in the cluster and was cleaned up upon deletion of the MachineSet; however, the other 30 VM instances were not cleaned up.

MORE DETAIL:
Prior to a control plane upgrade from 4.9.7 to 4.9.8, a machineset was created on the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade").

In AWS's CloudTrail logs, we were able to see a record of 31 RunInstances events occurring over the course of a 90-second period between 2021-11-22T10:00:00Z and 2021-11-22T10:01:30Z.
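
A count like this could be reproduced from exported CloudTrail records with a short Python sketch. It assumes the records are already loaded as dictionaries carrying the standard `eventName` and `eventTime` fields, with timestamps in the second-resolution UTC form quoted above. The function name is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Mapping

# Timestamp form used in the report, e.g. 2021-11-22T10:00:00Z
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def count_run_instances(records: Iterable[Mapping[str, str]], start: str, end: str) -> int:
    """Count RunInstances events whose eventTime falls within [start, end]."""
    lo = datetime.strptime(start, TIME_FORMAT)
    hi = datetime.strptime(end, TIME_FORMAT)
    return sum(
        1
        for record in records
        if record.get("eventName") == "RunInstances"
        and lo <= datetime.strptime(record["eventTime"], TIME_FORMAT) <= hi
    )
```

Called with the window from this report, `count_run_instances(records, "2021-11-22T10:00:00Z", "2021-11-22T10:01:30Z")` would be expected to return 31 if `records` holds the account's events for that period.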

All of these VMs were successfully provisioned. The final VM (instance ID i-03b0dd3b1a3768d2e) was the one that eventually became a machine in the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf").

After the upgrade, the MachineSet was deleted. The "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf" machine was removed. The other 30 VMs remained in a Running state in the account.

Some examples of names of other VMs and their instance IDs are:

ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-z4c9t i-07848599a136cb67b
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-cnc9r i-062d82b69b1c79bb4
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-n88tn i-0937232775e2a3303
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-4w4rx i-0586366aa6646d366
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-tq2bp i-041898c1f2390f701
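
To list such orphans, the instance names in the account can be compared against the Machine names known to the cluster. The Python sketch below assumes that each instance's name matches the name of the Machine that created it, as the examples above suggest. The function name and the input shapes are hypothetical.

```python
from typing import Dict, List, Set


def find_orphaned_instances(
    cloud_instances: Dict[str, str],  # instance ID -> instance name (assumed to equal the Machine name)
    machine_names: Set[str],          # names of Machines that exist in the cluster
    name_prefix: str,                 # limit the check to one MachineSet's instances
) -> List[str]:
    """Return IDs of instances with the given name prefix that have no matching Machine."""
    return sorted(
        instance_id
        for instance_id, name in cloud_instances.items()
        if name.startswith(name_prefix) and name not in machine_names
    )


# Data taken from this report: one adopted instance and two of the orphans.
PREFIX = "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-"
INSTANCES = {
    "i-03b0dd3b1a3768d2e": PREFIX + "htpsf",
    "i-07848599a136cb67b": PREFIX + "z4c9t",
    "i-062d82b69b1c79bb4": PREFIX + "cnc9r",
}
MACHINES = {PREFIX + "htpsf"}
ORPHANS = find_orphaned_instances(INSTANCES, MACHINES, PREFIX)
```

With this input, `ORPHANS` contains the two instance IDs other than i-03b0dd3b1a3768d2e.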

The new MachineSet which was created was identical to the cluster's existing "ocmquayrop01uw2-h5thf-worker-us-west-2a" MachineSet, except for the following differences:

- Replicas was set to "1"
- spec.template.labels."upgrade.managed.openshift.io" was defined and set to true
- spec.template.labels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"
- spec.selector.matchLabels."upgrade.managed.openshift.io" was defined and set to true
- spec.selector.matchLabels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"

All other labels and content were the same.
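
The differences listed above can be expressed as a small transformation of the existing MachineSet. This Python sketch treats the MachineSet as a plain dictionary. It assumes the template labels live under `spec.template.metadata.labels` (abbreviated as `spec.template.labels` in the list above) and that the label value is the string "true". The function name and the `suffix` parameter are hypothetical.

```python
import copy
from typing import Any, Dict

MACHINESET_LABEL = "machine.openshift.io/cluster-api-machineset"
UPGRADE_LABEL = "upgrade.managed.openshift.io"


def derive_upgrade_machineset(base: Dict[str, Any], suffix: str = "-upgrade") -> Dict[str, Any]:
    """Return a copy of `base` with only the differences described in this report applied."""
    machineset = copy.deepcopy(base)  # leave the original MachineSet untouched
    name = base["metadata"]["name"] + suffix
    machineset["metadata"]["name"] = name
    machineset["spec"]["replicas"] = 1

    spec = machineset["spec"]
    selector_labels = spec.setdefault("selector", {}).setdefault("matchLabels", {})
    template_labels = (
        spec.setdefault("template", {}).setdefault("metadata", {}).setdefault("labels", {})
    )
    # The selector and the template must carry the same labels so the new
    # MachineSet selects only the machines it creates.
    for labels in (selector_labels, template_labels):
        labels[UPGRADE_LABEL] = "true"
        labels[MACHINESET_LABEL] = name
    return machineset
```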

Version-Release number of selected component (if applicable):

4.9.7

How reproducible:

We have also seen this on another 4.9.7 cluster in the last week.

Actual results:

30 orphaned VMs left in the account after deletion of the MachineSet.

Expected results:

No orphaned VMs left in the account after deletion of the MachineSet.

Additional info:

Comment 17 sunzhaohua 2021-12-09 02:34:49 UTC

It's hard to reproduce this issue. I ran a regression test for the new changes and no new issues were found. Moving to verified.
clusterversion: 4.10.0-0.nightly-2021-12-06-201335
Test run: https://polarion.engineering.redhat.com/polarion/#/project/OSE/testrun?id=20211203-0941

Comment 25 errata-xmlrpc 2022-03-10 16:30:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056