Bug 1856270

Summary:	Machine cannot be deleted if it is stuck in Provisioning status
Product:	OpenShift Container Platform	Reporter:	sunzhaohua <zhsun>
Component:	Cloud Compute	Assignee:	egarcia
Cloud Compute sub component:	OpenStack Provider	QA Contact:	David Sanz <dsanzmor>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	low	CC:	adduarte, ansverma, egarcia, m.andre, mfedosin, mgugino, oarribas, pprinett
Version:	4.6
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-10-27 16:13:54 UTC	Type:	Bug

Description sunzhaohua 2020-07-13 08:55:25 UTC

Description of problem:
Sometimes a machine cannot be created successfully because of network issues or resource limitations, and it gets stuck in Provisioning status. If we then try to delete such a machine, it gets stuck in Deleting status. The finalizer must be removed manually from the Machine object before the object can be deleted.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-12-014740

How reproducible:
Always

Steps to Reproduce:
1. Create a new machine; because of network issues or resource limitations, the machine gets stuck in Provisioning status
2. Delete the newly created machine
3. Check machines

Actual results:
The machine is stuck in Deleting status and cannot be deleted. The finalizer must be removed manually from the Machine object before the machine can be deleted.

$ oc get machine
NAME                              PHASE      TYPE        REGION      ZONE   AGE
machineset-clone-27609-w55dh      Deleting                                  65m
zhsun713osp1-g499t-master-0       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-master-1       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-master-2       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-worker-5rfpn   Running    m1.large    regionOne   nova   125m
zhsun713osp1-g499t-worker-9wr7m   Running    m1.large    regionOne   nova   125m
zhsun713osp1-g499t-worker-mwm2q   Running    m1.large    regionOne   nova   125m

W0713 08:22:06.180218       1 machineservice.go:847] Couldn't delete all instance  ports: Resource not found
E0713 08:22:08.231420       1 actuator.go:538] Machine error machineset-clone-27609-w55dh: error deleting Openstack instance: Resource not found
E0713 08:22:08.231461       1 controller.go:230] machineset-clone-27609-w55dh: failed to delete machine: error deleting Openstack instance: Resource not found
I0713 08:22:09.231931       1 controller.go:169] machineset-clone-27609-w55dh: reconciling Machine
I0713 08:22:09.231969       1 controller.go:209] machineset-clone-27609-w55dh: reconciling machine triggers delete
I0713 08:22:09.248643       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle

Expected results:
A machine stuck in Provisioning status can be deleted.


Additional info:

Comment 3 egarcia 2020-08-11 20:50:02 UTC

So, what we can do as a stopgap is remove the finalizer when we get a "Resource not found" delete error, and force a manual delete. However, I am curious as to why it is failing to delete the instance stuck in provisioning in the first place. Is there more info about the instance, or about why you think that might have happened, that you can give me?

Comment 4 Michael Gugino 2020-08-17 21:51:22 UTC

Removing the finalizer if there is still a VM that needs to be removed is not what we want to do.  The finalizer should only be removed if we know the instance is gone.  If there is a situation that requires an OpenStack administrator to remove the instance (e.g., we can't do it from the actuator/provider), then we should not remove the finalizer and should let the machine continue to fail.  This would be a bug in OpenStack, and the machine being stuck in deleting is exactly what we want.  After the user removes the instance from the cloud, the actuator will work as normal and the machine will go away because the cloud (OpenStack) is now returning the proper information.

If there is something that can be done inside the actuator to either 1) Verify the instance is actually gone or 2) Make the instance go away via some other api call, we need to do one of those two things.

In any case, removing the finalizer for an unhandled error is not what we want.  If the cloud will always return this phantom instance (bug in OpenStack), and we cannot detect this condition via the API, the answer is to let the machine continue to fail, create some documentation around this as a known issue, and instruct the user (not the machine-controller) to remove this finalizer if this condition is encountered.
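
To make the rule above concrete, here is a minimal Go sketch of "only remove the finalizer when the instance is known to be gone". It is not the actual actuator code: InstanceClient, ErrNotFound, CanRemoveFinalizer and fakeClient are hypothetical names invented for illustration, and the real OpenStack provider may structure this differently.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for the cloud's
// "Resource not found" error seen in the logs.
var ErrNotFound = errors.New("Resource not found")

// InstanceClient is a hypothetical, minimal view of the cloud API.
type InstanceClient interface {
	// Delete asks the cloud to delete the instance.
	Delete(id string) error
	// Exists reports whether the cloud still lists the instance.
	Exists(id string) (bool, error)
}

// CanRemoveFinalizer attempts the delete and returns true only when a
// separate lookup confirms the instance is absent. Any other error, or
// an instance that is still listed, keeps the finalizer in place.
func CanRemoveFinalizer(c InstanceClient, id string) (bool, error) {
	if err := c.Delete(id); err != nil && !errors.Is(err, ErrNotFound) {
		return false, fmt.Errorf("deleting instance %q: %w", id, err)
	}
	// Delete succeeded or reported "not found": do not trust that alone,
	// verify with an independent lookup.
	exists, err := c.Exists(id)
	if err != nil {
		return false, fmt.Errorf("verifying instance %q is gone: %w", id, err)
	}
	return !exists, nil
}

// fakeClient is an in-memory stand-in used only for the demo below.
type fakeClient struct {
	deleteErr error
	exists    bool
	existsErr error
}

func (f fakeClient) Delete(string) error         { return f.deleteErr }
func (f fakeClient) Exists(string) (bool, error) { return f.exists, f.existsErr }

func main() {
	// Phantom instance: delete says "not found" but the lookup still lists it.
	ok, err := CanRemoveFinalizer(fakeClient{deleteErr: ErrNotFound, exists: true}, "vm-1")
	fmt.Println(ok, err) // false <nil>

	// Instance is really gone: safe to remove the finalizer.
	ok, err = CanRemoveFinalizer(fakeClient{deleteErr: ErrNotFound}, "vm-1")
	fmt.Println(ok, err) // true <nil>
}
```

In the phantom-instance case the function returns false without an error, so a caller would simply requeue and the machine would stay in Deleting until the instance is removed from the cloud.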

Comment 5 egarcia 2020-08-20 14:43:23 UTC

In this case, we will just document the workaround.
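
The documented workaround is not quoted in this bug. As a rough sketch only, manually removing the finalizer could be done with a JSON merge patch that sets metadata.finalizers to null. The Go snippet below builds that patch and prints an `oc patch` command for the machine from the report; the helper name finalizerClearPatch is hypothetical, and the openshift-machine-api namespace is an assumption, not something stated in this bug. Per comment 4, this should only be done by the user after confirming the instance no longer exists in OpenStack.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finalizerClearPatch returns a JSON merge patch that sets
// metadata.finalizers to null, which removes all finalizers.
func finalizerClearPatch() ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers": nil,
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, err := finalizerClearPatch()
	if err != nil {
		panic(err)
	}
	// Machine name is taken from the report; the namespace is assumed.
	fmt.Printf("oc patch machine %s -n %s --type=merge -p '%s'\n",
		"machineset-clone-27609-w55dh", "openshift-machine-api", p)
}
```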

Comment 9 David Sanz 2020-08-26 13:32:07 UTC

Verified, as the fix is in the docs.

Comment 11 errata-xmlrpc 2020-10-27 16:13:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 13 Red Hat Bugzilla 2023-09-15 00:34:06 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days