Bug 2025767
| Summary: | VMs orphaned during machineset scaleup | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matt Bargenquast <mbargenq> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | cattias, cblecker, dofinn, haowang, todabasi, travi, wking |
| Version: | 4.9 | Keywords: | ServiceDeliveryBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: In certain scenarios, the Machine API could reconcile a Machine before AWS has synced a VM creation across its API.<br>Consequence: AWS reports that the newly created VM does not exist, which causes the Machine API to fail the Machine.<br>Fix: Wait to mark a Machine as provisioned until the AWS API has synced and reports that the instance exists.<br>Result: The likelihood of leaking instances has been reduced. (A sketch of this wait follows the table.) | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 2029993 (view as bug list) | Environment: | |
| Last Closed: | 2022-03-10 16:30:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2029993 | | |
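The Doc Text above summarizes the fix at a high level: do not treat a Machine as provisioned until AWS's eventually consistent API actually reports the new instance. The snippet below is a minimal illustrative sketch of that idea using the aws-sdk-go EC2 waiter; it is not the actual machine-api-provider-aws change, and the region, session setup, and instance ID are assumptions for demonstration only.

```go
// Minimal sketch (not the shipped machine-api code) of the approach in the
// Doc Text: after RunInstances returns, wait until DescribeInstances can see
// the new instance before treating the Machine as provisioned, so an
// eventually consistent read does not fail the Machine and leak the VM.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// waitForInstanceVisible blocks until the EC2 API reports that instanceID
// exists, or until the waiter times out. The waiter polls DescribeInstances
// and tolerates the "instance not found" responses that eventual consistency
// can produce right after RunInstances.
func waitForInstanceVisible(client *ec2.EC2, instanceID string) error {
	return client.WaitUntilInstanceExists(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
}

func main() {
	// Region is an assumption; the affected cluster ran in us-west-2.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := ec2.New(sess)

	// Placeholder instance ID taken from this report; in a controller this
	// would be the ID returned by RunInstances for the new Machine.
	instanceID := "i-03b0dd3b1a3768d2e"

	if err := waitForInstanceVisible(client, instanceID); err != nil {
		// If the instance never becomes visible, the Machine should not be
		// marked provisioned; the controller can requeue and retry instead
		// of failing the Machine.
		log.Fatalf("instance %s not visible yet: %v", instanceID, err)
	}
	fmt.Printf("instance %s is visible to the EC2 API\n", instanceID)
}
```

Retrying until the instance is visible, rather than failing the Machine on the first not-found response, is what reduces the chance of a created VM being abandoned by the controller.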
It's hard to reproduce this issue. I ran a regression test for the new changes and no new issues were found. Moving to verified.

clusterversion: 4.10.0-0.nightly-2021-12-06-201335
Test run: https://polarion.engineering.redhat.com/polarion/#/project/OSE/testrun?id=20211203-0941

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Description of problem:

OVERVIEW:

Creation of a new MachineSet on an AWS cluster resulted in 30 orphaned VM instances left in the cloud account. The 30th (successful) machine VM was successfully created and cleaned up upon deletion of the MachineSet; however, the remaining VM instances were not cleaned up.

MORE DETAIL:

Prior to a control plane upgrade from 4.9.7 to 4.9.8, a MachineSet was created on the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"). In AWS's CloudTrail logs, we were able to see a record of 31 RunInstances events occurring over the course of a 90 second period between 2021-11-22T10:00:00Z and 2021-11-22T10:01:30Z. All of these VMs were successfully provisioned. The final VM (instance ID i-03b0dd3b1a3768d2e) was the one that eventually became a machine in the cluster ("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf").

After the upgrade, the MachineSet was deleted. The "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-htpsf" machine was removed. The other 30 VMs remained in a Running state in the account. Some examples of the names of other VMs and their instance IDs are:

ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-z4c9t i-07848599a136cb67b
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-cnc9r i-062d82b69b1c79bb4
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-n88tn i-0937232775e2a3303
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-4w4rx i-0586366aa6646d366
ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-tq2bp i-041898c1f2390f701

The new MachineSet was identical to the cluster's existing "ocmquayrop01uw2-h5thf-worker-us-west-2a" MachineSet, except for the following differences:

- Replicas was set to "1"
- spec.template.labels."upgrade.managed.openshift.io" was defined and set to true
- spec.template.labels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"
- spec.selector.matchLabels."upgrade.managed.openshift.io" was defined and set to true
- spec.selector.matchLabels."machine.openshift.io/cluster-api-machineset" was defined and set to "ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade"

All other labels and content were the same.

Version-Release number of selected component (if applicable):

4.9.7

How reproducible:

We have also seen this on another 4.9.7 cluster in the last week.

Actual results:

30 orphaned VMs left in the account after deletion of the MachineSet (a hedged lookup sketch follows this description).

Expected results:

No orphaned VMs left in the account after deletion of the MachineSet.

Additional info:
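As context for the "Actual results" above, the sketch below shows one way the leaked VMs could be enumerated with the aws-sdk-go EC2 client by filtering on the deleted MachineSet's name prefix. This is not part of the bug's fix or of any tooling referenced in this report; the region, the name-prefix filter, and the assumption that each instance carries a Name tag matching its machine name are illustrative only, based on the instance names listed above.

```go
// Hypothetical helper (not from this bug's fix) that lists running EC2
// instances whose Name tag starts with the deleted MachineSet's name prefix,
// i.e. candidates for the 30 orphaned VMs described in this report.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Region is an assumption; the affected cluster ran in us-west-2.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := ec2.New(sess)

	out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			// Instances created for the upgrade MachineSet share this name prefix.
			{
				Name:   aws.String("tag:Name"),
				Values: []*string{aws.String("ocmquayrop01uw2-h5thf-worker-us-west-2a-upgrade-*")},
			},
			// Only instances still running are candidates for cleanup.
			{
				Name:   aws.String("instance-state-name"),
				Values: []*string{aws.String("running")},
			},
		},
	})
	if err != nil {
		log.Fatalf("DescribeInstances failed: %v", err)
	}

	for _, reservation := range out.Reservations {
		for _, inst := range reservation.Instances {
			// Cross-check each candidate against existing Machines before
			// terminating anything; this sketch only prints the instance IDs.
			fmt.Println(aws.StringValue(inst.InstanceId))
		}
	}
}
```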