Description of problem: OVERVIEW: Creation of a new MachineSet on an aWS cluster resulted in 30 orphaned VM instances left in the cloud account. The 30th (successful) machine VM was successfully created and cleaned up upon deletion of the MachineSet, however the remaining VM instances were not cleaned up. MORE DETAIL: Prior to a control plane upgrade from 4.9.7 to 4.9.8, a machineset was created on the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"). In AWS's CloudTrail logs, we were able to see a record of 31 RunInstances events occuring over the course of a 90 second period between 2021-11-22T10:00:00Z and 2021-11-22T10:01:30Z. All of these VMs were successfully provisioned. The final VM (instance ID i-03b0dd3b1a3768d2e) was the one that eventually became a machine in the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf"). After the upgrade, the MachineSet was deleted. The "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf" machine was removed. The other 30 VMs remained in a Running state in the account. Some examples of names of other VMs and their instance IDs are: ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-z4c9t i-07848599a136cb67b ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-cnc9r i-062d82b69b1c79bb4 ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-n88tn i-0937232775e2a3303 ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-4w4rx i-0586366aa6646d366 ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-tq2bp i-041898c1f2390f701 The new MachineSet which was created was identical to the cluster's existing "ocmquayrop01uw2-h5thf-worker-us-west-2a" MachineSet, except for the following differences: - Replicas was set to "1" - spec.template.labels."upgrade.managed.openshift.io" was defined and set to true - spec.template.labels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade" - spec.selector.matchLabels."upgrade.managed.openshift.io" was defined and set to true - spec.selector.matchLabels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade" All other labels and content were the same. Version-Release number of selected component (if applicable): 4.9.7 How reproducible: We have also seen this on another 4.9.7 cluster in the last week. Actual results: 30 orphaned VMs left in the account after deletion of the MachineSet. Expected results: No orphaned VMs left in the account after deletion of the MachineSet. Additional info:
It's hard to reproduce this issue. I run a regression test for the new changes, no new issues were found. Move to verified clusterversion: 4.10.0-0.nightly-2021-12-06-201335 Test run: https://polarion.engineering.redhat.com/polarion/#/project/OSE/testrun?id=20211203-0941
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056