Bug 1921656 - Cannot delete a Machine if a VM got stuck in ERROR
Summary: Cannot delete a Machine if a VM got stuck in ERROR
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Maysa Macedo
QA Contact: Jon Uriarte
URL:
Whiteboard:
Duplicates: 1994625 (view as bug list)
Depends On: 2109866
Blocks:
 
Reported: 2021-01-28 11:53 UTC by Michał Dulko
Modified: 2022-09-02 12:22 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2109866 (view as bug list)
Environment:
Last Closed: 2022-08-09 02:36:03 UTC
Target Upstream Version:
Embargoed:
Flags: mdemaced: needinfo-




Links:
* GitHub openshift/cluster-api-provider-openstack pull 240 (open): Bug 1921656: Remove dependence on annotation to allow Server deletion (last updated 2022-07-22 12:41:00 UTC)
* Red Hat Knowledge Base Solution 6974350 (last updated 2022-09-02 12:22:44 UTC)
* Red Hat Product Errata RHSA-2022:5875 (last updated 2022-08-09 02:37:33 UTC)

Description Michał Dulko 2021-01-28 11:53:37 UTC
Description of problem:
I ran into this when trying to configure a Machine to have 2 interfaces. The VM failed with "Failed to allocate the network(s)" and landed in the ERROR state. The Machine stayed in the Provisioning state, and even deleting the MachineSet didn't help. So I deleted the VM manually and then tried deleting the Machine. It ended up like this:

I0128 11:11:48.523301       1 controller.go:171] ostest-4mjhl-double-7xv92: reconciling Machine
I0128 11:11:48.523339       1 controller.go:211] ostest-4mjhl-double-7xv92: reconciling machine triggers delete
W0128 11:11:51.600815       1 machineservice.go:825] Couldn't delete all instance  ports: Resource not found
E0128 11:11:51.628437       1 actuator.go:574] Machine error ostest-4mjhl-double-7xv92: error deleting Openstack instance: Resource not found
E0128 11:11:51.628481       1 controller.go:232] ostest-4mjhl-double-7xv92: failed to delete machine: error deleting Openstack instance: Resource not found
E0128 11:11:51.628542       1 controller.go:237] controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "controller"="machine_controller" "name"="ostest-4mjhl-double-7xv92" "namespace"="openshift-machine-api" 

I bet there are two problems here: the VM got into the ERROR state, yet the Machine stayed in Provisioning. And then deleting the VM manually didn't really help either.
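
For the record, the ERROR state and the "Failed to allocate the network(s)" fault can be confirmed directly in Nova with something like the following (server name taken from the logs above; the fault field is only populated for errored servers):

$ openstack server show ostest-4mjhl-double-7xv92 -c status -c fault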

Version-Release number of selected component (if applicable):


How reproducible:
?

Steps to Reproduce:
See the description above.

Actual results:
The Machine deletion never completes on its own; the finalizer needs to be removed manually.
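
A minimal sketch of that manual workaround, assuming the machine name from the logs above (this clears the Machine's finalizers, typically machine.machine.openshift.io, and should only be used as a last resort):

$ oc -n openshift-machine-api patch machine ostest-4mjhl-double-7xv92 --type=merge -p '{"metadata":{"finalizers":null}}'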

Expected results:
The Machine just accepts that the VM is gone; not being able to delete it is a minor problem and shouldn't block the Machine removal.


Additional info:

Comment 1 Martin André 2021-02-10 16:31:44 UTC
Potentially a duplicate. Emilio to investigate.

Comment 2 egarcia 2021-02-23 16:33:32 UTC
It seems like some OpenStack clusters have a bug where machines that error out cannot be deleted by the Machine API. Just for my sanity, please verify that the machine-controller is actually trying to delete the machine and failing. If that is the case, then the Machine object is correct to remain stuck in a deleting state; see: https://bugzilla.redhat.com/show_bug.cgi?id=1856270

Comment 3 Michał Dulko 2021-02-23 16:54:54 UTC
I was able to delete the VM easily as my SoS tenant, so my bet is that the problem is not with the OpenStack API itself. If you want an easy reproducer, here it is: just create a new subnet on the main network and then try to add that subnet as a secondary interface in a MachineSet. OpenStack won't allow that VM to spawn, and you'll get a VM stuck in ERROR, causing the problem.
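
Roughly, with placeholder names (the network name, CIDR and subnet name below are illustrative; use your cluster's <infra-id>-openshift network):

$ openstack subnet create --network <infra-id>-openshift --subnet-range 192.168.123.0/24 extra-subnet

Then reference extra-subnet as an additional interface in the worker MachineSet's providerSpec and scale the MachineSet up; the new VM should go to ERROR with "Failed to allocate the network(s)".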

Comment 4 egarcia 2021-02-23 17:00:15 UTC
That's good to know; I'll try that and take a look. It's possible that something is not correct in our control flow.

Comment 5 egarcia 2021-02-24 17:22:24 UTC
Looks like there are 2 action items here:
1. If you run oc delete machine xxxx the user expects to be able to delete that machine, even if it is not finished deploying and has finalizers.
2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart enough to recognize that the instance is gone, and will delete the machine in OpenShift as a result.

Comment 6 Michał Dulko 2021-02-25 08:48:37 UTC
(In reply to egarcia from comment #5)
> Looks like there are 2 action items here:
> 1. If you run oc delete machine xxxx the user expects to be able to delete
> that machine, even if it is not finished deploying and has finalizers.
> 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> enough to recognize that the instance is gone, and will delete the machine
> in OpenShift as a result.

Hm, IMO #1 is true. It should also be possible to delete the machine if it's in the ERROR state. #2 may be problematic: monitoring the OpenStack API for resource existence is heavy (OpenStack APIs are heavy in general). But machine-api should definitely be able to handle deletion of a Machine object even if the corresponding VM is gone.

Comment 7 egarcia 2021-02-25 16:53:42 UTC
(In reply to Michał Dulko from comment #6)
> (In reply to egarcia from comment #5)
> > Looks like there are 2 action items here:
> > 1. If you run oc delete machine xxxx the user expects to be able to delete
> > that machine, even if it is not finished deploying and has finalizers.
> > 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> > enough to recognize that the instance is gone, and will delete the machine
> > in OpenShift as a result.
> 
> Hm, IMO #1 is true. Also it should be able to delete it if it's in ERROR
> state too. #2 may be problematic, monitoring OpenStack API for resources
> existence is heavy (OpenStack APIs are heavy in general). But for sure
> machine-api should be able to handle deletion of a Machine object even if
> corresponding VM is gone.

Lucky for us, CAPO is constantly hitting the OpenStack APIs to check on the state of machines XD.

Comment 17 Adolfo Duarte 2021-08-20 04:03:07 UTC
*** Bug 1994625 has been marked as a duplicate of this bug. ***

Comment 21 ShiftStack Bugwatcher 2021-11-25 16:11:18 UTC
Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 35 Jon Uriarte 2022-08-04 11:03:15 UTC
Verified in 4.10.26 on top of OSP 16.2.2.

Verification steps:

1. Check the VMs in OSP, and the machines and nodes in OCP:

$ openstack server list
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
| ID                                   | Name                        | Status | Networks                            | Image              | Flavor    |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
| dc1f798c-2287-44fb-8c6d-56630ad2129c | ostest-pl27z-worker-0-grp7q | ACTIVE | ostest-pl27z-openshift=10.196.1.134 | ostest-pl27z-rhcos | m4.xlarge |
| 71442f4d-b957-4ab6-bf37-96ba7b06e0d0 | ostest-pl27z-worker-0-vdtfd | ACTIVE | ostest-pl27z-openshift=10.196.0.31  | ostest-pl27z-rhcos | m4.xlarge |
| c56a29c4-71de-4b9d-acce-21a3318bb4f1 | ostest-pl27z-worker-0-p7wxm | ACTIVE | ostest-pl27z-openshift=10.196.2.125 | ostest-pl27z-rhcos | m4.xlarge |
| 48b08730-cea3-4920-b9fc-bd5a6b49dde6 | ostest-pl27z-master-2       | ACTIVE | ostest-pl27z-openshift=10.196.3.149 | ostest-pl27z-rhcos | m4.xlarge |
| c361a70f-b0cb-4e37-afc5-22a4d7c2ab65 | ostest-pl27z-master-1       | ACTIVE | ostest-pl27z-openshift=10.196.1.21  | ostest-pl27z-rhcos | m4.xlarge |
| 5fdae607-c1bd-417d-a0b0-3d516fd1cf23 | ostest-pl27z-master-0       | ACTIVE | ostest-pl27z-openshift=10.196.1.76  | ostest-pl27z-rhcos | m4.xlarge |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+

$ oc -n openshift-machine-api get machineset
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-pl27z-worker-0   3         3         3       3           83m

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  83m
openshift-machine-api   ostest-pl27z-master-1         Running                                  83m
openshift-machine-api   ostest-pl27z-master-2         Running                                  83m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  73m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   73m
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   73m

2. Scale up and scale down the machineset so there is no time for the VM to reach ACTIVE status (the machine will be deleted while it is still being created):

$ oc scale machineset ostest-pl27z-worker-0 -n openshift-machine-api --replicas=4; oc scale machineset ostest-pl27z-worker-0 -n openshift-machine-api --replicas=3

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-r62vt                                            0s
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m

$ oc get machines -A
NAMESPACE               NAME                          PHASE      TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                   85m
openshift-machine-api   ostest-pl27z-master-1         Running                                   85m
openshift-machine-api   ostest-pl27z-master-2         Running                                   85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                   76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running    m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-r62vt   Deleting                                  1s
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running    m4.xlarge   regionOne   nova   76m


3. Check that the new machine is deleted (and not stuck in a perpetual Deleting status):

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m

Comment 37 errata-xmlrpc 2022-08-09 02:36:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.26 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5875

