Description of problem:
If a vSphere node object has been deleted and the Machine's associated instance has entered a bad state that prevents it from rejoining the cluster, the Machine cannot be deleted and remains stuck in the deleting phase indefinitely.

Errors in machine reconciler log:
```
I0628 22:43:56.945572 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

The logic behind this is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) when the node object does not exist?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete the vSphere node.
2. Make the vSphere instance un-configurable by the cluster. (In the case of the Windows Machine Config Operator, this was done by removing the ability to SSH into the instance.)
3. Attempt to delete the Machine.

Actual results:
The Machine is stuck in the deleting phase.

Expected results:
The Machine is deleted.

Additional info:
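The suggestion above (treat a missing node as "not reachable" instead of as an error) could look roughly like the following sketch. It is not the operator's actual code: `errNodeNotFound`, `node`, `nodeGetter`, and `fakeNodes` are hypothetical stand-ins so the example builds with the standard library alone, and reachability is simplified to a single boolean. In the real controller the not-found check would more likely be `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical sentinel standing in for the Kubernetes
// API "not found" error returned when a node object does not exist.
var errNodeNotFound = errors.New("node not found")

// node is a hypothetical, minimal stand-in for a Kubernetes node object.
type node struct {
	Name  string
	Ready bool
}

// nodeGetter is a hypothetical abstraction over the API client lookup.
type nodeGetter interface {
	GetNode(name string) (*node, error)
}

// checkNodeReachable reports whether the named node exists and is Ready.
// A missing node is reported as not reachable with a nil error, so the
// delete path can go on to destroy the VM instead of failing reconciliation.
func checkNodeReachable(c nodeGetter, name string) (bool, error) {
	n, err := c.GetNode(name)
	if err != nil {
		if errors.Is(err, errNodeNotFound) {
			return false, nil
		}
		// Any other lookup failure is still surfaced to the caller.
		return false, fmt.Errorf("can't check node status before vm destroy: %w", err)
	}
	return n.Ready, nil
}

// fakeNodes is an in-memory nodeGetter used only for this illustration.
type fakeNodes map[string]*node

// GetNode returns the stored node or an error wrapping errNodeNotFound.
func (f fakeNodes) GetNode(name string) (*node, error) {
	n, ok := f[name]
	if !ok {
		return nil, fmt.Errorf("nodes %q: %w", name, errNodeNotFound)
	}
	return n, nil
}

func main() {
	nodes := fakeNodes{"worker-0": &node{Name: "worker-0", Ready: true}}
	for _, name := range []string{"worker-0", "windows-host"} {
		reachable, err := checkNodeReachable(nodes, name)
		fmt.Printf("%s: reachable=%t err=%v\n", name, reachable, err)
	}
}
```

With this behaviour the caller would see the deleted `windows-host` node as unreachable with no error, while other lookup failures would still abort the delete and be retried.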
Verified
clusterversion: 4.9.0-0.nightly-2021-07-07-021823

Steps to Reproduce:
1. Delete the vSphere node.
$ oc get node
NAME                          STATUS   ROLES    AGE     VERSION
zhsunvs1-x4w87-master-0       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-master-1       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-master-2       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-worker-4zxp7   Ready    worker   4h32m   v1.21.1+0228142

$ oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
zhsunvs1-x4w87-master-0       Running                          4h40m
zhsunvs1-x4w87-master-1       Running                          4h40m
zhsunvs1-x4w87-master-2       Running                          4h40m
zhsunvs1-x4w87-worker-4zxp7   Running                          4h37m
zhsunvs1-x4w87-worker-7mxp5   Running                          4h37m

2. Delete the machine. The machine is deleted successfully.
$ oc delete machine zhsunvs1-x4w87-worker-7mxp5
machine.machine.openshift.io "zhsunvs1-x4w87-worker-7mxp5" deleted

$ oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
zhsunvs1-x4w87-master-0       Running                          4h44m
zhsunvs1-x4w87-master-1       Running                          4h44m
zhsunvs1-x4w87-master-2       Running                          4h44m
zhsunvs1-x4w87-worker-4zxp7   Running                          4h41m
zhsunvs1-x4w87-worker-6f689   Running                          2m55s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759