Bug 2025767

Summary: VMs orphaned during machineset scaleup
Product: OpenShift Container Platform Reporter: Matt Bargenquast <mbargenq>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: cattias, cblecker, dofinn, haowang, todabasi, travi, wking
Version: 4.9 Keywords: ServiceDeliveryBlocker
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: In certain scenarios, the Machine API could reconcile a Machine before AWS had synced the VM creation across its API.
Consequence: AWS reports that the newly created VM does not exist, which causes the Machine API to fail the Machine.
Fix: Wait to mark a Machine as provisioned until the AWS API has synced and reports that the instance exists.
Result: The likelihood of leaking instances has been reduced.
Story Points: ---
Clone Of:
: 2029993 (view as bug list) Environment:
Last Closed: 2022-03-10 16:30:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2029993    
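
The Doc Text above describes the fix as waiting until the cloud API reports the instance before marking the Machine as provisioned. A minimal sketch of that idea follows. It is not the Machine API code and it does not call AWS; `EventuallyConsistentCloud` and `provision` are hypothetical names, and the lag is simulated by a counter.

```python
class EventuallyConsistentCloud:
    """Hypothetical stand-in for a cloud API that lags behind its own writes."""

    def __init__(self, lag_calls):
        # Number of lookups that miss before a new instance becomes visible.
        self._lag_calls = lag_calls
        self._instances = {}
        self._counter = 0

    def run_instance(self, name):
        """Create an instance and return its ID immediately."""
        self._counter += 1
        instance_id = f"i-{self._counter:04d}"
        self._instances[instance_id] = {"name": name, "misses_left": self._lag_calls}
        return instance_id

    def instance_exists(self, instance_id):
        """Report the instance only after the simulated lag has elapsed."""
        record = self._instances.get(instance_id)
        if record is None:
            return False
        if record["misses_left"] > 0:
            record["misses_left"] -= 1
            return False
        return True


def provision(cloud, name, max_checks=10):
    """Create one instance, then wait for the API to report it."""
    instance_id = cloud.run_instance(name)
    for _ in range(max_checks):
        if cloud.instance_exists(instance_id):
            return {"name": name, "instance_id": instance_id, "phase": "Provisioned"}
        # Not visible yet: keep the instance ID and check again instead of
        # failing the machine and issuing another create.
    return {"name": name, "instance_id": instance_id, "phase": "Provisioning"}


cloud = EventuallyConsistentCloud(lag_calls=3)
machine = provision(cloud, "worker-a")
```

The point of the sketch is that the instance ID is retained while the lookup misses, so a "not found" answer shortly after creation does not lead to a second create call.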

Description Matt Bargenquast 2021-11-23 02:36:14 UTC
Description of problem:

OVERVIEW:
Creation of a new MachineSet on an AWS cluster resulted in 30 orphaned VM instances left in the cloud account. The final (successful) machine VM was created and then cleaned up upon deletion of the MachineSet; however, the remaining 30 VM instances were not cleaned up.

MORE DETAIL:
Prior to a control plane upgrade from 4.9.7 to 4.9.8, a machineset was created on the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade").

In AWS's CloudTrail logs, we were able to see a record of 31 RunInstances events occurring over a 90-second period between 2021-11-22T10:00:00Z and 2021-11-22T10:01:30Z.

All of these VMs were successfully provisioned. The final VM (instance ID i-03b0dd3b1a3768d2e) was the one that eventually became a machine in the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf"). 

After the upgrade, the MachineSet was deleted. The "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf" machine was removed. The other 30 VMs remained in a Running state in the account.

Some examples of names of other VMs and their instance IDs are:

ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-z4c9t i-07848599a136cb67b	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-cnc9r i-062d82b69b1c79bb4	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-n88tn i-0937232775e2a3303	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-4w4rx i-0586366aa6646d366	
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-tq2bp i-041898c1f2390f701	
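
Orphans like the ones listed above can be found by comparing the instances in the cloud account with the Machines the cluster knows about. The sketch below is illustrative only: `find_orphans` is a hypothetical helper, and the input records are hand-written from the names and IDs in this report rather than fetched from AWS or the cluster. It models the state before the MachineSet was deleted, when only the "htpsf" Machine existed in the cluster.

```python
def find_orphans(cloud_instances, cluster_machine_names):
    """Return running instances whose name matches no Machine in the cluster."""
    known = set(cluster_machine_names)
    return [
        inst
        for inst in cloud_instances
        if inst["state"] == "running" and inst["name"] not in known
    ]


PREFIX = "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-"

# Instance records copied from this report; "state" is assumed to be
# "running" for all of them, as described above.
instances = [
    {"name": PREFIX + "htpsf", "id": "i-03b0dd3b1a3768d2e", "state": "running"},
    {"name": PREFIX + "z4c9t", "id": "i-07848599a136cb67b", "state": "running"},
    {"name": PREFIX + "cnc9r", "id": "i-062d82b69b1c79bb4", "state": "running"},
    {"name": PREFIX + "n88tn", "id": "i-0937232775e2a3303", "state": "running"},
    {"name": PREFIX + "4w4rx", "id": "i-0586366aa6646d366", "state": "running"},
    {"name": PREFIX + "tq2bp", "id": "i-041898c1f2390f701", "state": "running"},
]

# Only one Machine object existed in the cluster for this MachineSet.
machines = [PREFIX + "htpsf"]

orphans = find_orphans(instances, machines)
orphan_ids = [inst["id"] for inst in orphans]
```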

The new MachineSet which was created was identical to the cluster's existing "ocmquayrop01uw2-h5thf-worker-us-west-2a" MachineSet, except for the following differences:

  - Replicas was set to "1"
  - spec.template.labels."upgrade.managed.openshift.io" was defined and set to true
  - spec.template.labels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"
  - spec.selector.matchLabels."upgrade.managed.openshift.io" was defined and set to true
  - spec.selector.matchLabels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"
 
All other labels and content were the same.
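
For reference, the differences listed above can be expressed as a small transformation of the existing MachineSet. This is a sketch under assumptions, not the tooling that created the MachineSet: `make_upgrade_machineset` is a hypothetical helper, the base object is heavily abbreviated, its replica count of 3 is made up, and the template labels are assumed to live under spec.template.metadata.labels (written as spec.template.labels in the list above). Label values are written as strings.

```python
import copy


def make_upgrade_machineset(base, suffix="-upgrade"):
    """Copy a MachineSet dict and apply the differences listed in this report."""
    ms = copy.deepcopy(base)
    name = base["metadata"]["name"] + suffix
    ms["metadata"]["name"] = name
    ms["spec"]["replicas"] = 1

    extra_labels = {
        "upgrade.managed.openshift.io": "true",
        "machine.openshift.io/cluster-api-machineset": name,
    }
    # The selector and the template labels must agree so that the MachineSet
    # selects the Machines it creates.
    ms["spec"]["selector"].setdefault("matchLabels", {}).update(extra_labels)
    template_meta = ms["spec"]["template"].setdefault("metadata", {})
    template_meta.setdefault("labels", {}).update(extra_labels)
    return ms


# Abbreviated, hypothetical base object; a real MachineSet has many more fields.
base_machineset = {
    "metadata": {"name": "ocmquayrop01uw2-h5thf-worker-us-west-2a"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {}},
        "template": {"metadata": {"labels": {}}},
    },
}

upgrade_machineset = make_upgrade_machineset(base_machineset)
```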

Version-Release number of selected component (if applicable):

4.9.7

How reproducible:

We have also seen this on another 4.9.7 cluster in the last week.

Actual results:

30 orphaned VMs left in the account after deletion of the MachineSet.

Expected results:

No orphaned VMs left in the account after deletion of the MachineSet.

Additional info:

Comment 17 sunzhaohua 2021-12-09 02:34:49 UTC
This issue is hard to reproduce. I ran a regression test against the new changes and found no new issues. Moving to verified.
clusterversion: 4.10.0-0.nightly-2021-12-06-201335
Test run: https://polarion.engineering.redhat.com/polarion/#/project/OSE/testrun?id=20211203-0941

Comment 25 errata-xmlrpc 2022-03-10 16:30:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056