Description of problem:
Sometimes a machine cannot be created successfully because of network issues or resource limitations and gets stuck in Provisioning status. If we then try to delete such a machine, it gets stuck in Deleting status. We must remove the finalizer from the Machine object before the Machine object can be deleted.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-12-014740

How reproducible:
Always

Steps to Reproduce:
1. Create a new machine; because of network issues or resource limitations, the machine gets stuck in Provisioning status
2. Delete the newly created machine
3. Check machines

Actual results:
The machine is stuck in Deleting status and cannot be deleted. The finalizer must be removed manually from the Machine object before the machine can be deleted.

$ oc get machine
NAME                              PHASE      TYPE        REGION      ZONE   AGE
machineset-clone-27609-w55dh      Deleting                                  65m
zhsun713osp1-g499t-master-0       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-master-1       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-master-2       Running    m1.xlarge   regionOne   nova   135m
zhsun713osp1-g499t-worker-5rfpn   Running    m1.large    regionOne   nova   125m
zhsun713osp1-g499t-worker-9wr7m   Running    m1.large    regionOne   nova   125m
zhsun713osp1-g499t-worker-mwm2q   Running    m1.large    regionOne   nova   125m

W0713 08:22:06.180218 1 machineservice.go:847] Couldn't delete all instance ports: Resource not found
E0713 08:22:08.231420 1 actuator.go:538] Machine error machineset-clone-27609-w55dh: error deleting Openstack instance: Resource not found
E0713 08:22:08.231461 1 controller.go:230] machineset-clone-27609-w55dh: failed to delete machine: error deleting Openstack instance: Resource not found
I0713 08:22:09.231931 1 controller.go:169] machineset-clone-27609-w55dh: reconciling Machine
I0713 08:22:09.231969 1 controller.go:209] machineset-clone-27609-w55dh: reconciling machine triggers delete
I0713 08:22:09.248643 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle

Expected results:
A machine stuck in Provisioning status can be deleted.

Additional info:
So, what we can do as a stopgap is remove the finalizer when we get a "Resource not found" delete error, and force a manual delete. However, I am curious as to why it is failing to delete the instance stuck in provisioning in the first place. Is there more info about the instance, or about why you think that might have happened, that you can give me?
Removing the finalizer if there is still a VM that needs to be removed is not what we want to do. The finalizer should only be removed if we know the instance is gone.

If there is a situation that requires an OpenStack administrator to remove the instance (e.g., we can't do it from the actuator/provider), then we should not remove the finalizer and should let the machine continue to fail. This would be a bug in OpenStack, and the machine being stuck in Deleting is exactly what we want. After the user removes the instance from the cloud, the actuator will work as normal and the machine will go away, because the cloud (OpenStack) is now returning the proper information.

If there is something that can be done inside the actuator to either 1) verify the instance is actually gone or 2) make the instance go away via some other API call, we need to do one of those two things. In any case, removing the finalizer for an unhandled error is not what we want.

If the cloud will always return this phantom instance (bug in OpenStack), and we cannot detect this condition via the API, the answer is to let the machine continue to fail, create some documentation around this as a known issue, and instruct the user (not the machine-controller) to remove the finalizer if this condition is encountered.
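To make the rule above concrete, here is a minimal Go sketch of a delete path that drops the finalizer only after an independent check confirms the instance is gone. It is not the actuator's actual code: `InstanceService`, `fakeCloud`, `reconcileDelete`, `ErrNotFound`, and the finalizer string are all hypothetical names made up for illustration, and the real OpenStack client calls are replaced by an in-memory fake.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for the cloud's "Resource not found" error.
var ErrNotFound = errors.New("Resource not found")

// machineFinalizer is a placeholder value, not the real finalizer name.
const machineFinalizer = "example.com/machine-finalizer"

// InstanceService is a hypothetical, minimal view of the cloud API.
type InstanceService interface {
	DeleteInstance(id string) error
	InstanceExists(id string) (bool, error)
}

// Machine is a stripped-down stand-in for the Machine object.
type Machine struct {
	Name       string
	InstanceID string
	Finalizers []string
}

// reconcileDelete removes the finalizer only when the instance is verifiably gone.
func reconcileDelete(svc InstanceService, m *Machine) error {
	// "Not found" on delete is tolerated here, but it is not proof by itself.
	if err := svc.DeleteInstance(m.InstanceID); err != nil && !errors.Is(err, ErrNotFound) {
		return fmt.Errorf("deleting instance: %w", err)
	}
	// Verify independently before touching the finalizer.
	exists, err := svc.InstanceExists(m.InstanceID)
	if err != nil {
		return fmt.Errorf("verifying instance is gone: %w", err)
	}
	if exists {
		// Keep the finalizer so the Machine stays in Deleting and is retried.
		return fmt.Errorf("instance %q still exists, keeping finalizer", m.InstanceID)
	}
	kept := make([]string, 0, len(m.Finalizers))
	for _, f := range m.Finalizers {
		if f != machineFinalizer {
			kept = append(kept, f)
		}
	}
	m.Finalizers = kept
	return nil
}

// fakeCloud simulates a cloud; "phantom" instances cannot be deleted but are still listed.
type fakeCloud struct {
	instances map[string]bool
	phantom   map[string]bool
}

func (f *fakeCloud) DeleteInstance(id string) error {
	if f.phantom[id] || !f.instances[id] {
		return ErrNotFound
	}
	delete(f.instances, id)
	return nil
}

func (f *fakeCloud) InstanceExists(id string) (bool, error) {
	return f.instances[id] || f.phantom[id], nil
}

func main() {
	gone := &Machine{Name: "gone", InstanceID: "id-a", Finalizers: []string{machineFinalizer}}
	fmt.Println(reconcileDelete(&fakeCloud{}, gone), gone.Finalizers)

	stuck := &Machine{Name: "stuck", InstanceID: "id-b", Finalizers: []string{machineFinalizer}}
	cloud := &fakeCloud{phantom: map[string]bool{"id-b": true}}
	fmt.Println(reconcileDelete(cloud, stuck), stuck.Finalizers)
}
```

In the phantom case the sketch keeps returning an error and leaves the finalizer in place, which matches the position above: the machine stays in Deleting until someone removes the instance from the cloud or removes the finalizer by hand.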
In this case, we will just document the workaround.
Verified, as the fix is in the docs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days