Bug 1977369

Summary: vSphere Machines stuck in deleting phase if associated Node object is deleted
Product: OpenShift Container Platform Reporter: Sebastian Soto <ssoto>
Component: Cloud Compute Assignee: dmoiseev
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: dgautam, dmoiseev, jhou, mimccune, miyadav, palshure, zhsun
Version: 4.8   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A Machine was deleted after its Node object had already been deleted and its instance had entered a state that prevented it from rejoining the cluster. Consequence: The Machine remained stuck in the "Deleting" phase permanently. Fix: The vSphere machine controller has been updated to properly detect when the Node object has been deleted. Result: Machines in this state can now be deleted properly.
Story Points: ---
Clone Of:
: 1977634 1977637 Environment:
Last Closed: 2021-10-18 17:36:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1977634    

Description Sebastian Soto 2021-06-29 14:27:41 UTC
Description of problem:

If the Node object for a vSphere Machine has been deleted and the Machine's associated instance has entered a bad state that prevents it from rejoining the cluster, the Machine cannot be deleted and remains stuck in the deleting phase indefinitely.

Errors in machine reconciler log:
```
 I0628 22:43:56.945572 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

The logic behind this is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) when the Node object does not exist?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158
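
A minimal sketch of that suggestion, for illustration only. It is not the actual machine-api-operator code: the node type, the nodeGetter interface, the fakeNodes helper, and the errNodeNotFound sentinel are all hypothetical stand-ins, and the readiness check is simplified to a single boolean. A real controller would more likely detect the missing object with apierrors.IsNotFound from k8s.io/apimachinery/pkg/api/errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical stand-in for the Kubernetes API
// "not found" error (assumption: the real code would use IsNotFound).
var errNodeNotFound = errors.New("node not found")

// node is a hypothetical, minimal stand-in for a Kubernetes Node object.
type node struct {
	ready bool
}

// nodeGetter is a hypothetical abstraction over the API client.
type nodeGetter interface {
	getNode(name string) (*node, error)
}

// checkNodeReachable reports whether the named node exists and is ready.
// A missing Node object is treated as "not reachable" rather than as an
// error, so the caller can go on to destroy the VM instead of retrying.
func checkNodeReachable(g nodeGetter, name string) (bool, error) {
	n, err := g.getNode(name)
	if err != nil {
		if errors.Is(err, errNodeNotFound) {
			return false, nil
		}
		// Any other failure is still surfaced to the caller.
		return false, fmt.Errorf("getting node %q: %w", name, err)
	}
	return n.ready, nil
}

// fakeNodes is an in-memory nodeGetter used only for this illustration.
type fakeNodes map[string]*node

func (f fakeNodes) getNode(name string) (*node, error) {
	n, ok := f[name]
	if !ok {
		return nil, fmt.Errorf("nodes %q: %w", name, errNodeNotFound)
	}
	return n, nil
}

func main() {
	nodes := fakeNodes{"worker-a": &node{ready: true}}
	for _, name := range []string{"worker-a", "windows-host"} {
		reachable, err := checkNodeReachable(nodes, name)
		fmt.Printf("%s: reachable=%t err=%v\n", name, reachable, err)
	}
}
```

With this shape, the delete path only fails on genuine API errors; a deleted Node no longer blocks the VM destroy step.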

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete vSphere node
2. Make the vSphere instance unconfigurable by the cluster. (In the case of the Windows Machine Config Operator, this was done by removing the ability to SSH into the instance.)
3. Attempt to delete the Machine

Actual results:
The Machine is stuck in the deleting phase.

Expected results:
The Machine is deleted.

Additional info:

Comment 2 sunzhaohua 2021-07-08 08:14:53 UTC
Verified 
clusterversion: 4.9.0-0.nightly-2021-07-07-021823
Steps to Reproduce:
1. Delete vSphere node
$ oc get node
NAME                          STATUS   ROLES    AGE     VERSION
zhsunvs1-x4w87-master-0       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-master-1       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-master-2       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-worker-4zxp7   Ready    worker   4h32m   v1.21.1+0228142

$ oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
zhsunvs1-x4w87-master-0       Running                          4h40m
zhsunvs1-x4w87-master-1       Running                          4h40m
zhsunvs1-x4w87-master-2       Running                          4h40m
zhsunvs1-x4w87-worker-4zxp7   Running                          4h37m
zhsunvs1-x4w87-worker-7mxp5   Running                          4h37m
2. Delete the Machine. The Machine is deleted successfully.
$ oc delete machine zhsunvs1-x4w87-worker-7mxp5
machine.machine.openshift.io "zhsunvs1-x4w87-worker-7mxp5" deleted
$ oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
zhsunvs1-x4w87-master-0       Running                          4h44m
zhsunvs1-x4w87-master-1       Running                          4h44m
zhsunvs1-x4w87-master-2       Running                          4h44m
zhsunvs1-x4w87-worker-4zxp7   Running                          4h41m
zhsunvs1-x4w87-worker-6f689   Running                          2m55s

Comment 8 errata-xmlrpc 2021-10-18 17:36:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759