Bug 2025767 - VMs orphaned during machineset scaleup
Summary: VMs orphaned during machineset scaleup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 2029993
 
Reported: 2021-11-23 02:36 UTC by Matt Bargenquast
Modified: 2022-03-10 16:30 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In certain scenarios, the Machine API could reconcile a Machine before AWS had synced a VM creation across its API.
Consequence: AWS reports that the newly created VM does not exist, which causes the Machine API to fail the Machine.
Fix: Wait to mark a Machine as provisioned until the AWS API has synced and reports that the instance exists.
Result: The likelihood of leaking instances has been reduced.
Clone Of:
Cloned To: 2029993
Environment:
Last Closed: 2022-03-10 16:30:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-provider-aws pull 11 0 None open Bug 2025767: Prevent Machine from being considered provisioned until it exists in AWS 2021-12-02 17:05:55 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:30:26 UTC

Description Matt Bargenquast 2021-11-23 02:36:14 UTC
Description of problem:

OVERVIEW:
Creation of a new MachineSet on an AWS cluster resulted in 30 orphaned VM instances being left in the cloud account. The 31st VM was the one that successfully became a Machine and was cleaned up upon deletion of the MachineSet; the remaining 30 VM instances were not cleaned up.

MORE DETAIL:
Prior to a control plane upgrade from 4.9.7 to 4.9.8, a machineset was created on the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade").

In AWS's CloudTrail logs, we were able to see a record of 31 RunInstances events occurring over a 90-second period between 2021-11-22T10:00:00Z and 2021-11-22T10:01:30Z.

All of these VMs were successfully provisioned. The final VM (instance ID i-03b0dd3b1a3768d2e) was the one that eventually became a machine in the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf"). 

After the upgrade, the MachineSet was deleted. The "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf" machine was removed. The other 30 VMs remained in a Running state in the account.

Some examples of names of other VMs and their instance IDs are:

ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-z4c9t i-07848599a136cb67b	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-cnc9r i-062d82b69b1c79bb4	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-n88tn i-0937232775e2a3303	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-4w4rx i-0586366aa6646d366	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-tq2bp i-041898c1f2390f701	

The new MachineSet which was created was identical to the cluster's existing "ocmquayrop01uw2-h5thf-worker-us-west-2a" MachineSet, except for the following differences:

  - Replicas was set to "1"
  - spec.template.labels."upgrade.managed.openshift.io" was defined and set to true
  - spec.template.labels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"
  - spec.selector.matchLabels."upgrade.managed.openshift.io" was defined and set to true
  - spec.selector.matchLabels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"
 
All other labels and content were the same.
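Reconstructed from the list of differences above, the relevant parts of the new MachineSet manifest would look roughly like this. This is a sketch assuming the standard MachineSet schema, not the actual manifest from the cluster; note that in that schema the template labels live under spec.template.metadata.labels.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade
spec:
  replicas: 1
  selector:
    matchLabels:
      upgrade.managed.openshift.io: "true"
      machine.openshift.io/cluster-api-machineset: ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade
  template:
    metadata:
      labels:
        upgrade.managed.openshift.io: "true"
        machine.openshift.io/cluster-api-machineset: ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade
```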

Version-Release number of selected component (if applicable):

4.9.7

How reproducible:

We have also seen this on another 4.9.7 cluster in the last week.

Actual results:

30 orphaned VMs left in the account after deletion of the MachineSet.

Expected results:

No orphaned VMs left in the account after deletion of the MachineSet.

Additional info:

Comment 17 sunzhaohua 2021-12-09 02:34:49 UTC
It's hard to reproduce this issue. I ran a regression test against the new changes and no new issues were found. Moving to VERIFIED.
clusterversion: 4.10.0-0.nightly-2021-12-06-201335
Test run: https://polarion.engineering.redhat.com/polarion/#/project/OSE/testrun?id=20211203-0941

Comment 25 errata-xmlrpc 2022-03-10 16:30:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

