Bug 1921656 - Cannot delete a Machine if a VM got stuck in ERROR
Summary: Cannot delete a Machine if a VM got stuck in ERROR
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Maysa Macedo
QA Contact: Jon Uriarte
URL:
Whiteboard:
Duplicates: 1994625 (view as bug list)
Depends On: 2109866
Blocks:
 
Reported: 2021-01-28 11:53 UTC by Michał Dulko
Modified: 2022-09-02 12:22 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2109866 (view as bug list)
Environment:
Last Closed: 2022-08-09 02:36:03 UTC
Target Upstream Version:
Embargoed:
Flags: mdemaced: needinfo-




Links:
* GitHub openshift/cluster-api-provider-openstack pull 240 (open): Bug 1921656: Remove dependence on annotation to allow Server deletion (last updated 2022-07-22 12:41:00 UTC)
* Red Hat Knowledge Base Solution 6974350 (last updated 2022-09-02 12:22:44 UTC)
* Red Hat Product Errata RHSA-2022:5875 (last updated 2022-08-09 02:37:33 UTC)

Description Michał Dulko 2021-01-28 11:53:37 UTC
Description of problem:
I ran into this when trying to configure a Machine to have 2 interfaces. The VM failed with "Failed to allocate the network(s)" and landed in the ERROR state. The Machine stayed in the Provisioning state, and even deleting the MachineSet didn't help. So I deleted the VM manually and then tried deleting the Machine. It ended up like this:

I0128 11:11:48.523301       1 controller.go:171] ostest-4mjhl-double-7xv92: reconciling Machine
I0128 11:11:48.523339       1 controller.go:211] ostest-4mjhl-double-7xv92: reconciling machine triggers delete
W0128 11:11:51.600815       1 machineservice.go:825] Couldn't delete all instance  ports: Resource not found
E0128 11:11:51.628437       1 actuator.go:574] Machine error ostest-4mjhl-double-7xv92: error deleting Openstack instance: Resource not found
E0128 11:11:51.628481       1 controller.go:232] ostest-4mjhl-double-7xv92: failed to delete machine: error deleting Openstack instance: Resource not found
E0128 11:11:51.628542       1 controller.go:237] controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "controller"="machine_controller" "name"="ostest-4mjhl-double-7xv92" "namespace"="openshift-machine-api" 

I bet there are two problems here: the VM got into the ERROR state, yet the Machine stayed in Provisioning. And then deleting the VM manually didn't really help either.
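
For the record, the ERROR state and the "Failed to allocate the network(s)" fault can be confirmed directly in Nova with something like the following (server name taken from the logs above; the fault field is only populated for errored servers):

$ openstack server show ostest-4mjhl-double-7xv92 -c status -c fault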

Version-Release number of selected component (if applicable):


How reproducible:
?

Steps to Reproduce:
See the description above.

Actual results:
The Machine deletion never completes on its own; the finalizer needs to be removed manually.
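
A minimal sketch of that manual workaround, assuming the machine name from the logs above (this clears the Machine's finalizers, typically machine.machine.openshift.io, and should only be used as a last resort):

$ oc -n openshift-machine-api patch machine ostest-4mjhl-double-7xv92 --type=merge -p '{"metadata":{"finalizers":null}}'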

Expected results:
The Machine just accepts that the VM is gone; not being able to delete it is a minor problem and shouldn't block the Machine removal.


Additional info:

Comment 1 Martin André 2021-02-10 16:31:44 UTC
Potentially a duplicate. Emilio to investigate.

Comment 2 egarcia 2021-02-23 16:33:32 UTC
It seems like some OpenStack clusters have a bug where machines that error out cannot be deleted by the Machine API. Just for my sanity, please verify that the machine-controller is actually trying to delete the machine and failing. If that is the case, then the Machine object is correct to remain stuck in a deleting state; see: https://bugzilla.redhat.com/show_bug.cgi?id=1856270

Comment 3 Michał Dulko 2021-02-23 16:54:54 UTC
I was able to delete the VM easily as my SoS tenant, so my bet is that the problem is not with the OpenStack API itself. If you want an easy reproducer, here it is: just create a new subnet on the main network and then try to add that subnet as a secondary interface in a MachineSet. OpenStack won't allow that VM to spawn, and you'll get a VM stuck in ERROR, causing the problem.
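
Roughly, with placeholder names (the network name, CIDR and subnet name below are illustrative; use your cluster's <infra-id>-openshift network):

$ openstack subnet create --network <infra-id>-openshift --subnet-range 192.168.123.0/24 extra-subnet

Then reference extra-subnet as an additional interface in the worker MachineSet's providerSpec and scale the MachineSet up; the new VM should go to ERROR with "Failed to allocate the network(s)".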

Comment 4 egarcia 2021-02-23 17:00:15 UTC
That's good to know; I'll try that and take a look. It's possible that something is not correct in our control flow.

Comment 5 egarcia 2021-02-24 17:22:24 UTC
Looks like there are 2 action items here:
1. If you run oc delete machine xxxx the user expects to be able to delete that machine, even if it is not finished deploying and has finalizers.
2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart enough to recognize that the instance is gone, and will delete the machine in OpenShift as a result.

Comment 6 Michał Dulko 2021-02-25 08:48:37 UTC
(In reply to egarcia from comment #5)
> Looks like there are 2 action items here:
> 1. If you run oc delete machine xxxx the user expects to be able to delete
> that machine, even if it is not finished deploying and has finalizers.
> 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> enough to recognize that the instance is gone, and will delete the machine
> in OpenShift as a result.

Hm, IMO #1 is true. It should also be possible to delete the machine if it's in the ERROR state. #2 may be problematic: monitoring the OpenStack API for resource existence is heavy (OpenStack APIs are heavy in general). But machine-api should definitely be able to handle deletion of a Machine object even if the corresponding VM is gone.

Comment 7 egarcia 2021-02-25 16:53:42 UTC
(In reply to Michał Dulko from comment #6)
> (In reply to egarcia from comment #5)
> > Looks like there are 2 action items here:
> > 1. If you run oc delete machine xxxx the user expects to be able to delete
> > that machine, even if it is not finished deploying and has finalizers.
> > 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> > enough to recognize that the instance is gone, and will delete the machine
> > in OpenShift as a result.
> 
> Hm, IMO #1 is true. Also it should be able to delete it if it's in ERROR
> state too. #2 may be problematic, monitoring OpenStack API for resources
> existence is heavy (OpenStack APIs are heavy in general). But for sure
> machine-api should be able to handle deletion of a Machine object even if
> corresponding VM is gone.

Lucky for us, CAPO is constantly hitting the OpenStack APIs to check on the state of machines XD.

Comment 17 Adolfo Duarte 2021-08-20 04:03:07 UTC
*** Bug 1994625 has been marked as a duplicate of this bug. ***

Comment 21 ShiftStack Bugwatcher 2021-11-25 16:11:18 UTC
Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 35 Jon Uriarte 2022-08-04 11:03:15 UTC
Verified in 4.10.26 on top of OSP 16.2.2.

Verification steps:

1. Check the VMs in OSP, and the machines and nodes in OCP:

$ openstack server list
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
| ID                                   | Name                        | Status | Networks                            | Image              | Flavor    |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
| dc1f798c-2287-44fb-8c6d-56630ad2129c | ostest-pl27z-worker-0-grp7q | ACTIVE | ostest-pl27z-openshift=10.196.1.134 | ostest-pl27z-rhcos | m4.xlarge |
| 71442f4d-b957-4ab6-bf37-96ba7b06e0d0 | ostest-pl27z-worker-0-vdtfd | ACTIVE | ostest-pl27z-openshift=10.196.0.31  | ostest-pl27z-rhcos | m4.xlarge |
| c56a29c4-71de-4b9d-acce-21a3318bb4f1 | ostest-pl27z-worker-0-p7wxm | ACTIVE | ostest-pl27z-openshift=10.196.2.125 | ostest-pl27z-rhcos | m4.xlarge |
| 48b08730-cea3-4920-b9fc-bd5a6b49dde6 | ostest-pl27z-master-2       | ACTIVE | ostest-pl27z-openshift=10.196.3.149 | ostest-pl27z-rhcos | m4.xlarge |
| c361a70f-b0cb-4e37-afc5-22a4d7c2ab65 | ostest-pl27z-master-1       | ACTIVE | ostest-pl27z-openshift=10.196.1.21  | ostest-pl27z-rhcos | m4.xlarge |
| 5fdae607-c1bd-417d-a0b0-3d516fd1cf23 | ostest-pl27z-master-0       | ACTIVE | ostest-pl27z-openshift=10.196.1.76  | ostest-pl27z-rhcos | m4.xlarge |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+

$ oc -n openshift-machine-api get machineset
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-pl27z-worker-0   3         3         3       3           83m

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  83m
openshift-machine-api   ostest-pl27z-master-1         Running                                  83m
openshift-machine-api   ostest-pl27z-master-2         Running                                  83m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  73m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   73m
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   73m

2. Scale up and scale down the machineset so there is no time for the VM to reach ACTIVE status (the machine will be deleted while it is still being created):

$ oc scale machineset ostest-pl27z-worker-0 -n openshift-machine-api --replicas=4; oc scale machineset ostest-pl27z-worker-0 -n openshift-machine-api --replicas=3

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-r62vt                                            0s
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m

$ oc get machines -A
NAMESPACE               NAME                          PHASE      TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                   85m
openshift-machine-api   ostest-pl27z-master-1         Running                                   85m
openshift-machine-api   ostest-pl27z-master-2         Running                                   85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                   76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running    m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-r62vt   Deleting                                  1s
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running    m4.xlarge   regionOne   nova   76m


3. Check that the new machine is deleted (and not stuck in a perpetual Deleting status):

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m

Comment 37 errata-xmlrpc 2022-08-09 02:36:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.26 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5875

