Bug 1921656
| Summary: | Cannot delete a Machine if a VM got stuck in ERROR | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michał Dulko <mdulko> |
| Component: | Cloud Compute | Assignee: | Maysa Macedo <mdemaced> |
| Cloud Compute sub component: | OpenStack Provider | QA Contact: | Jon Uriarte <juriarte> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | akesarka, ancollin, dgautam, itbrown, m.andre, mbooth, mdemaced, mfedosin, palonsor, pprinett |
| Version: | 4.7 | Keywords: | Triaged |
| Target Milestone: | --- | Flags: | mdemaced: needinfo- |
| Target Release: | 4.10.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2109866 (view as bug list) | Environment: | |
| Last Closed: | 2022-08-09 02:36:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2109866 | | |
| Bug Blocks: | | | |
Description
Michał Dulko 2021-01-28 11:53:37 UTC

Potentially a duplicate. Emilio to investigate.

It seems like some OpenShift clusters have a bug where machines that error out cannot be deleted by the machine API. Just for my sanity, verify that the machine-controller is actually trying to delete it and failing. If that is the case, then the Machine object is correct to remain stuck in a deleting state; see https://bugzilla.redhat.com/show_bug.cgi?id=1856270

I was able to delete the VM easily as my SoS tenant, so my bet is that the problem is not with the OpenStack API itself. If you want an easy reproducer, here it is: create a new subnet on the main network and then try to add that subnet as a secondary interface in a MachineSet. OpenStack won't allow that VM to spawn and you'll get a VM stuck in ERROR, causing the problem. (A command-level sketch of this reproducer follows the thread below.)

That's good to know, I'll try that and take a look. It's possible that something is not correct in our control flow.

Looks like there are 2 action items here:
1. If you run oc delete machine xxxx, the user expects to be able to delete that machine, even if it is not finished deploying and has finalizers.
2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart enough to recognize that the instance is gone, and will delete the machine in OpenShift as a result.

(In reply to egarcia from comment #5)
> Looks like there are 2 action items here:
> 1. If you run oc delete machine xxxx, the user expects to be able to delete
> that machine, even if it is not finished deploying and has finalizers.
> 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> enough to recognize that the instance is gone, and will delete the machine
> in OpenShift as a result.

Hm, IMO #1 is true. It should also be possible to delete the machine if it's in the ERROR state. #2 may be problematic: monitoring the OpenStack API for resource existence is heavy (OpenStack APIs are heavy in general). But machine-api should certainly be able to handle deletion of a Machine object even if the corresponding VM is gone.

(In reply to Michał Dulko from comment #6)
> (In reply to egarcia from comment #5)
> > Looks like there are 2 action items here:
> > 1. If you run oc delete machine xxxx, the user expects to be able to delete
> > that machine, even if it is not finished deploying and has finalizers.
> > 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> > enough to recognize that the instance is gone, and will delete the machine
> > in OpenShift as a result.
>
> Hm, IMO #1 is true. It should also be possible to delete the machine if it's
> in the ERROR state. #2 may be problematic: monitoring the OpenStack API for
> resource existence is heavy (OpenStack APIs are heavy in general). But
> machine-api should certainly be able to handle deletion of a Machine object
> even if the corresponding VM is gone.

Lucky for us, CAPO is constantly hitting the OpenStack APIs to check on the state of machines XD.

*** Bug 1994625 has been marked as a duplicate of this bug. ***

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing
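For reference, a minimal command-level sketch of the reproducer described in the thread above, written against a cluster like the one in the verification below. The subnet name, the CIDR, and the networks/subnets layout in the providerSpec patch are illustrative assumptions, not values taken from this bug; check the actual MachineSet before applying anything like this.

    # Create an extra subnet on the cluster's primary network (name and CIDR are assumptions).
    $ openstack subnet create reproducer-subnet \
        --network ostest-pl27z-openshift \
        --subnet-range 192.168.99.0/24

    # Add that subnet as a secondary interface in the worker MachineSet
    # (assumed providerSpec field layout for the OpenStack provider).
    $ oc -n openshift-machine-api patch machineset ostest-pl27z-worker-0 \
        --type=json \
        -p '[{"op": "add",
              "path": "/spec/template/spec/providerSpec/value/networks/-",
              "value": {"subnets": [{"uuid": "<reproducer-subnet-uuid>"}]}}]'

    # Scale up: Nova refuses to spawn the new VM, the instance lands in ERROR,
    # and before the fix the corresponding Machine could not be deleted.
    $ oc -n openshift-machine-api scale machineset ostest-pl27z-worker-0 --replicas=4

    # Try to remove the broken Machine and watch whether it leaves the Deleting phase.
    $ oc -n openshift-machine-api delete machine <machine-in-error>
    $ oc -n openshift-machine-api get machines -w

    # To exercise the second action item discussed above, the ERROR VM can also
    # be removed directly in OpenStack before deleting the Machine; machine-api
    # should still let the Machine go away.
    $ openstack server delete <error-vm-name> --wait
    $ oc -n openshift-machine-api delete machine <machine-in-error>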
Verified in 4.10.26 on top of OSP 16.2.2.

Verification steps:

1. Check the VMs in OSP and the machines and nodes in OCP:

    $ openstack server list
    +--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
    | ID                                   | Name                        | Status | Networks                            | Image              | Flavor    |
    +--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
    | dc1f798c-2287-44fb-8c6d-56630ad2129c | ostest-pl27z-worker-0-grp7q | ACTIVE | ostest-pl27z-openshift=10.196.1.134 | ostest-pl27z-rhcos | m4.xlarge |
    | 71442f4d-b957-4ab6-bf37-96ba7b06e0d0 | ostest-pl27z-worker-0-vdtfd | ACTIVE | ostest-pl27z-openshift=10.196.0.31  | ostest-pl27z-rhcos | m4.xlarge |
    | c56a29c4-71de-4b9d-acce-21a3318bb4f1 | ostest-pl27z-worker-0-p7wxm | ACTIVE | ostest-pl27z-openshift=10.196.2.125 | ostest-pl27z-rhcos | m4.xlarge |
    | 48b08730-cea3-4920-b9fc-bd5a6b49dde6 | ostest-pl27z-master-2       | ACTIVE | ostest-pl27z-openshift=10.196.3.149 | ostest-pl27z-rhcos | m4.xlarge |
    | c361a70f-b0cb-4e37-afc5-22a4d7c2ab65 | ostest-pl27z-master-1       | ACTIVE | ostest-pl27z-openshift=10.196.1.21  | ostest-pl27z-rhcos | m4.xlarge |
    | 5fdae607-c1bd-417d-a0b0-3d516fd1cf23 | ostest-pl27z-master-0       | ACTIVE | ostest-pl27z-openshift=10.196.1.76  | ostest-pl27z-rhcos | m4.xlarge |
    +--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+

    $ oc -n openshift-machine-api get machineset
    NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
    ostest-pl27z-worker-0   3         3         3       3           83m

    $ oc get machines -A
    NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
    openshift-machine-api   ostest-pl27z-master-0         Running                                  83m
    openshift-machine-api   ostest-pl27z-master-1         Running                                  83m
    openshift-machine-api   ostest-pl27z-master-2         Running                                  83m
    openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  73m
    openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   73m
    openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   73m

2. Scale the machineset up and immediately back down, so there is no time for the VM to reach ACTIVE status (the machine will be deleted while it is still being created):

    $ oc scale machineset ostest-htm84-worker-0 -n openshift-machine-api --replicas=4; oc scale machineset ostest-htm84-worker-0 -n openshift-machine-api --replicas=3

    $ oc get machines -A
    NAMESPACE               NAME                          PHASE      TYPE        REGION      ZONE   AGE
    openshift-machine-api   ostest-pl27z-master-0         Running                                   85m
    openshift-machine-api   ostest-pl27z-master-1         Running                                   85m
    openshift-machine-api   ostest-pl27z-master-2         Running                                   85m
    openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                   76m
    openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running    m4.xlarge   regionOne   nova   76m
    openshift-machine-api   ostest-pl27z-worker-0-r62vt                                             0s
    openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running    m4.xlarge   regionOne   nova   76m

    $ oc get machines -A
    NAMESPACE               NAME                          PHASE      TYPE        REGION      ZONE   AGE
    openshift-machine-api   ostest-pl27z-master-0         Running                                   85m
    openshift-machine-api   ostest-pl27z-master-1         Running                                   85m
    openshift-machine-api   ostest-pl27z-master-2         Running                                   85m
    openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                   76m
    openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running    m4.xlarge   regionOne   nova   76m
    openshift-machine-api   ostest-pl27z-worker-0-r62vt   Deleting                                  1s
    openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running    m4.xlarge   regionOne   nova   76m
3. Check the new machine is deleted (and not in a continuous Deleting status):

    $ oc get machines -A
    NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
    openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
    openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
    openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
    openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
    openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
    openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.26 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5875