Bug 1977369 - vSphere Machines stuck in deleting phase if associated Node object is deleted
Summary: vSphere Machines stuck in deleting phase if associated Node object is deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: dmoiseev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1977634
Reported: 2021-06-29 14:27 UTC by Sebastian Soto
Modified: 2021-10-18 17:37 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A Machine is deleted after its Node object has already been deleted and its instance has entered a state that prevents it from rejoining the cluster.
Consequence: The Machine stays stuck in the "Deleting" phase permanently.
Fix: The vSphere machine controller has been updated to detect when the Node object has been deleted.
Result: Machines in this state can now be deleted.
Clone Of:
: 1977634 1977637
Environment:
Last Closed: 2021-10-18 17:36:56 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-api-operator pull 882 | 0 | None | open | Bug 1977369: Prevent machine from stucking in Deleting phase if node object not found | 2021-06-30 10:26:43 UTC
Red Hat Knowledge Base (Solution) | 6242431 | 0 | None | None | None | 2021-08-05 13:22:39 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:37:20 UTC

Description Sebastian Soto 2021-06-29 14:27:41 UTC
Description of problem:

If a vSphere Node object has been deleted and the Machine's associated instance has entered a bad state that prevents it from joining the cluster again, the Machine cannot be deleted and remains stuck in the deleting phase, seemingly indefinitely.

Errors in machine reconciler log:
```
 I0628 22:43:56.945572 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

The logic behind this is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) in the case of the node object not existing?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete the vSphere node.
2. Make the vSphere instance un-configurable by the cluster. (In the case of the Windows Machine Config Operator, this was done by removing the ability to SSH into the instance.)
3. Attempt to delete the Machine.

Actual results:
The Machine is stuck in the deleting phase.

Expected results:
The Machine is deleted.

Additional info:

Comment 2 sunzhaohua 2021-07-08 08:14:53 UTC
Verified 
clusterversion: 4.9.0-0.nightly-2021-07-07-021823
Steps to Reproduce:
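1. Delete the vSphere node (zhsunvs1-x4w87-worker-7mxp5 no longer appears in the node list, but its Machine still exists).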
$ oc get node
NAME                          STATUS   ROLES    AGE     VERSION
zhsunvs1-x4w87-master-0       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-master-1       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-master-2       Ready    master   4h39m   v1.21.1+0228142
zhsunvs1-x4w87-worker-4zxp7   Ready    worker   4h32m   v1.21.1+0228142

$ oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
zhsunvs1-x4w87-master-0       Running                          4h40m
zhsunvs1-x4w87-master-1       Running                          4h40m
zhsunvs1-x4w87-master-2       Running                          4h40m
zhsunvs1-x4w87-worker-4zxp7   Running                          4h37m
zhsunvs1-x4w87-worker-7mxp5   Running                          4h37m
2. Delete the Machine. The Machine is deleted successfully.
$ oc delete machine zhsunvs1-x4w87-worker-7mxp5
machine.machine.openshift.io "zhsunvs1-x4w87-worker-7mxp5" deleted
$ oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
zhsunvs1-x4w87-master-0       Running                          4h44m
zhsunvs1-x4w87-master-1       Running                          4h44m
zhsunvs1-x4w87-master-2       Running                          4h44m
zhsunvs1-x4w87-worker-4zxp7   Running                          4h41m
zhsunvs1-x4w87-worker-6f689   Running                          2m55s

Comment 8 errata-xmlrpc 2021-10-18 17:36:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

