Bug 1856270 - Machine couldn't be deleted if machine is stuck in Provisioning status [NEEDINFO]
Summary: Machine couldn't be deleted if machine is stuck in Provisioning status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: egarcia
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-13 08:55 UTC by sunzhaohua
Modified: 2021-02-26 17:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:13:54 UTC
Target Upstream Version:
oarribas: needinfo? (egarcia)



Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4083 0 None closed Bug 1856270: Update known issues with info about provisioning state node bug 2021-01-14 05:58:37 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:14:21 UTC

Description sunzhaohua 2020-07-13 08:55:25 UTC
Description of problem:
Sometimes a machine cannot be created successfully because of network issues or resource limitations, and it gets stuck in Provisioning status. When we try to delete such a machine, it gets stuck in Deleting status. The finalizer must be removed from the Machine object manually before the object can be deleted.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-12-014740

How reproducible:
Always

Steps to Reproduce:
1. Create a new machine. Because of network issues or resource limitations, the machine gets stuck in Provisioning status
2. Delete the newly created machine
3. Check machines

Actual results:
The machine is stuck in Deleting status and cannot be deleted. The finalizer must be removed manually from the Machine object before the machine can be deleted.

$ oc get machine
NAME                              PHASE      TYPE        REGION      ZONE   AGE
machineset-clone-27609-w55dh      Deleting                                  65m
zhsun713osp1-g499t-master-0       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-master-1       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-master-2       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-worker-5rfpn   Running    m1.large    regionOne   nova   125m
zhsun713osp1-g499t-worker-9wr7m   Running    m1.large    regionOne   nova   125m
zhsun713osp1-g499t-worker-mwm2q   Running    m1.large    regionOne   nova   125m

W0713 08:22:06.180218       1 machineservice.go:847] Couldn't delete all instance  ports: Resource not found
E0713 08:22:08.231420       1 actuator.go:538] Machine error machineset-clone-27609-w55dh: error deleting Openstack instance: Resource not found
E0713 08:22:08.231461       1 controller.go:230] machineset-clone-27609-w55dh: failed to delete machine: error deleting Openstack instance: Resource not found
I0713 08:22:09.231931       1 controller.go:169] machineset-clone-27609-w55dh: reconciling Machine
I0713 08:22:09.231969       1 controller.go:209] machineset-clone-27609-w55dh: reconciling machine triggers delete
I0713 08:22:09.248643       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle

Expected results:
A machine stuck in Provisioning status can be deleted.


Additional info:

Comment 3 egarcia 2020-08-11 20:50:02 UTC
So, what we can do as a stopgap is remove the finalizer when we get a "Resource not found" delete error, and force a manual delete. However, I am curious as to why it is failing to delete the instance stuck in provisioning in the first place. Is there more info about the instance, or about why you think that might have happened, that you can give me?

Comment 4 Michael Gugino 2020-08-17 21:51:22 UTC
Removing the finalizer if there is still a VM that needs to be removed is not what we want to do.  The finalizer should only be removed if we know the instance is gone.  If there is a situation that requires an OpenStack administrator to remove the instance (e.g., we can't do it from the actuator/provider), then we should not remove the finalizer and should let the machine continue to fail.  This would be a bug in OpenStack, and the machine being stuck in deleting is exactly what we want.  After the user removes the instance from the cloud, the actuator will work like normal and the machine will go away because the cloud (OpenStack) is now returning the proper information.

If there is something that can be done inside the actuator to either 1) Verify the instance is actually gone or 2) Make the instance go away via some other api call, we need to do one of those two things.

In any case, removing the finalizer for an unhandled error is not what we want.  If the cloud will always return this phantom instance (bug in OpenStack), and we cannot detect this condition via the API, the answer is to let the machine continue to fail, create some documentation around this as a known issue, and instruct the user (not the machine-controller) to remove this finalizer if this condition is encountered.
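
As an illustration only, the policy described in this comment can be sketched as a small decision function. This is a hypothetical Python model, not the actuator's code (the machine controller is written in Go); `InstanceState`, `should_remove_finalizer`, and `reconcile_delete` are names made up for this sketch, and the three states are an assumed simplification of what the cloud API can report.

```python
from enum import Enum


class InstanceState(Enum):
    """Hypothetical summary of what the cloud reports for a Machine's instance."""
    PRESENT = "present"            # the instance is still listed by the cloud
    CONFIRMED_GONE = "gone"        # the cloud positively reports no such instance
    UNKNOWN = "unknown"            # the API call failed with an unhandled error


def should_remove_finalizer(state: InstanceState) -> bool:
    """Return True only when the instance is known to be gone.

    An unhandled error (UNKNOWN) keeps the finalizer in place, so the
    Machine stays in Deleting until the cloud returns proper information.
    """
    return state is InstanceState.CONFIRMED_GONE


def reconcile_delete(state: InstanceState) -> str:
    """Describe the action a delete reconcile would take for a given state."""
    if should_remove_finalizer(state):
        return "remove finalizer"
    if state is InstanceState.PRESENT:
        return "delete instance and requeue"
    return "report error and requeue"
```

Under this model, the phantom-instance case from the report maps to `UNKNOWN`: the controller keeps failing, and removing the finalizer is left to the user.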

Comment 5 egarcia 2020-08-20 14:43:23 UTC
In this case, we will just document the workaround.
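
The workaround referred to here is the manual finalizer removal from the bug description. The report does not give the exact command, so the following is only a sketch under stated assumptions: it assumes the `oc` CLI, a JSON merge patch that sets `metadata.finalizers` to null, and the `openshift-machine-api` namespace. The helper name `build_remove_finalizers_command` is hypothetical, and the function only builds the command; it does not run it.

```python
import json
import shlex
from typing import List


def build_remove_finalizers_command(
    machine: str,
    namespace: str = "openshift-machine-api",  # assumed namespace, not stated in the report
) -> List[str]:
    """Build an `oc patch` argv that clears all finalizers on a Machine.

    In a JSON merge patch, setting a field to null removes it, so the
    Machine object can be garbage-collected once the patch is applied.
    """
    patch = json.dumps({"metadata": {"finalizers": None}})
    return [
        "oc", "patch", "machine", machine,
        "-n", namespace,
        "--type=merge",
        "-p", patch,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command for review instead of executing it.
    print(shlex.join(build_remove_finalizers_command("machineset-clone-27609-w55dh")))
```

Per Comment 4, this should only be applied by the user after confirming that the instance no longer exists in OpenStack, since clearing the finalizer stops the controller from attempting any further cleanup.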

Comment 9 David Sanz 2020-08-26 13:32:07 UTC
Verified, as the fix is in the docs.

Comment 11 errata-xmlrpc 2020-10-27 16:13:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

