Bug 1977634

Summary: vSphere Machines stuck in deleting phase if associated Node object is deleted
Product: OpenShift Container Platform Reporter: Milind Yadav <miyadav>
Component: Cloud Compute Assignee: dmoiseev
Cloud Compute sub component: Other Providers QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: dmoiseev, jhou, mimccune, miyadav, ssoto, zhsun
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: Unspecified   
OS: Unspecified   
Clone Of: 1977369 Environment:
Last Closed: 2021-08-31 16:17:10 UTC Type: ---
Bug Depends On: 1977369    
Bug Blocks: 1977637, 1989648    

Description Milind Yadav 2021-06-30 08:13:34 UTC
+++ This bug was initially created as a clone of Bug #1977369 +++

Description of problem:

If a vSphere node object has been deleted and the Machine's associated instance has entered a bad state, which prevents it from being joined to the cluster again, the Machine cannot be deleted and will be stuck in the deleting phase seemingly forever.

Errors in machine reconciler log:
```
 I0628 22:43:56.945572 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

The logic behind this error is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) in the case of the node object not existing?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158
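
A minimal sketch of the behavior suggested above, under stated assumptions: it is not the operator's actual code. The `nodeGetter` interface, the `node` type with a single `ready` flag, the `mapNodes` fake and the `errNodeNotFound` sentinel are all hypothetical stand-ins so that the example runs with the standard library only; the real controller would talk to the Kubernetes API and would more likely test the error with `apierrors.IsNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical stand-in for the API server's
// "not found" error.
var errNodeNotFound = errors.New("node not found")

// node is a hypothetical, trimmed-down view of a Node object.
type node struct {
	ready bool
}

// nodeGetter is a hypothetical lookup interface standing in for the API client.
type nodeGetter interface {
	GetNode(name string) (*node, error)
}

// checkNodeReachable reports whether the node backing a machine is reachable.
// A missing Node object is treated as "not reachable" with no error, so the
// delete path can go on to destroy the VM instead of failing on every retry.
func checkNodeReachable(c nodeGetter, name string) (bool, error) {
	n, err := c.GetNode(name)
	if err != nil {
		if errors.Is(err, errNodeNotFound) {
			return false, nil
		}
		// Any other lookup failure is still surfaced to the caller.
		return false, fmt.Errorf("can't check node status before vm destroy: %w", err)
	}
	return n.ready, nil
}

// mapNodes is an in-memory nodeGetter used only for this illustration.
type mapNodes map[string]*node

func (m mapNodes) GetNode(name string) (*node, error) {
	n, ok := m[name]
	if !ok {
		return nil, fmt.Errorf("nodes %q: %w", name, errNodeNotFound)
	}
	return n, nil
}

func main() {
	nodes := mapNodes{"worker-a": {ready: true}}
	for _, name := range []string{"worker-a", "windows-host"} {
		reachable, err := checkNodeReachable(nodes, name)
		fmt.Printf("%s: reachable=%v err=%v\n", name, reachable, err)
	}
}
```

With this shape, only the "node object does not exist" case is swallowed; other lookup errors still abort the delete and are retried.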

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete vSphere node
2. Cause the vSphere instance to become unconfigurable by the cluster. (In the case of the Windows Machine Config Operator, this was done by removing the ability to SSH into the instance.)
3. Attempt to delete the Machine

Actual results:
The Machine is stuck in the deleting phase.

Expected results:
The Machine is deleted.

Additional info:

==============================================================================
Another scenario caused by the same issue

Cluster version is 4.7.0-0.nightly-2021-06-26-014854

Steps :
1. Create an mhc using the definition below:
Expected & Actual: mhc created successfully

2. Delete the worker node referenced by the machineset that the mhc monitors
Node deleted successfully
[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   173m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-pvp8p   Ready    worker   24m    v1.20.0+87cc9a4

[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE     VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   3h15m   v1.20.0+87cc9a4

3. A new machine is provisioned and the old one is deleted

[miyadav@miyadav ~]$ oc get machines 
NAME                               PHASE         TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                              3h28m
miyadav-30vsp-p4gh2-master-1       Running                              3h28m
miyadav-30vsp-p4gh2-master-2       Running                              3h28m
miyadav-30vsp-p4gh2-worker-84gsl   Running                              3h22m
miyadav-30vsp-p4gh2-worker-p8g8p   Provisioned                          72s
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                             48m
.
.
[miyadav@miyadav ~]$ oc get machines
NAME                               PHASE      TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                           3h52m
miyadav-30vsp-p4gh2-master-1       Running                           3h52m
miyadav-30vsp-p4gh2-master-2       Running                           3h52m
miyadav-30vsp-p4gh2-worker-84gsl   Running                           3h46m
miyadav-30vsp-p4gh2-worker-p8g8p   Running                           24m
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                          72m


Expected and actual:
New machine provisioned successfully.
Old one stuck in deleting state with the error below:
Events:
  Type     Reason          Age                   From                           Message
  ----     ------          ----                  ----                           -------
  Normal   Create          70m                   vspherecontroller              Created Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Warning  FailedCreate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Create machine: task task-3107089 has not finished
  Warning  FailedUpdate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Update machine: task task-3107089 has not finished
  Normal   Update          30m (x14 over 69m)    vspherecontroller              Updated Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Normal   MachineDeleted  22m                   machinehealthcheck-controller  Machine openshift-machine-api/mhc2/miyadav-30vsp-p4gh2-worker-pvp8p/miyadav-30vsp-p4gh2-worker-pvp8p has been remediated by requesting to delete Machine object
  Warning  FailedDelete    3m19s (x20 over 22m)  vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Delete machine: miyadav-30vsp-p4gh2-worker-pvp8p: Can't check node status before vm destroy: nodes "miyadav-30vsp-p4gh2-worker-pvp8p" not found

Additional info: Will attach must-gather
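
The two `oc get machines` listings in this scenario show the same Machine in the `Deleting` phase roughly 24 minutes apart. As a rough sketch of how that comparison could be automated, the Go snippet below parses two listings and reports the Machines that are `Deleting` in both. The helper names `phases` and `stillDeleting` are hypothetical, and the parsing assumes the default table layout where the phase is the second column.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// phases parses `oc get machines` table output into a name -> phase map.
// It assumes the default column order (NAME first, PHASE second).
func phases(listing string) map[string]string {
	out := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(listing))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 2 || f[0] == "NAME" {
			continue // skip the header and blank or malformed lines
		}
		out[f[0]] = f[1]
	}
	return out
}

// stillDeleting returns, in sorted order, the machines that are in the
// Deleting phase in both the earlier and the later listing.
func stillDeleting(before, after string) []string {
	b, a := phases(before), phases(after)
	var names []string
	for name, phase := range a {
		if phase == "Deleting" && b[name] == "Deleting" {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Abbreviated copies of the two listings from this report.
	before := `NAME                               PHASE         TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-worker-p8g8p   Provisioned                          72s
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                             48m`
	after := `NAME                               PHASE      TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-worker-p8g8p   Running                           24m
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                          72m`

	for _, name := range stillDeleting(before, after) {
		fmt.Println("possibly stuck in Deleting:", name)
	}
}
```

A Machine that appears in both listings is only a candidate: how long a normal deletion takes depends on the environment, so the interval between snapshots has to be chosen accordingly.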

Comment 3 Milind Yadav 2021-08-17 11:54:42 UTC
Validated on cluster version 4.8.0-0.nightly-2021-08-17-004424

The Machine did not get stuck in the deleting phase as reported earlier.

[miyadav@miyadav vsphere]$ oc get machineset 
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-1708bz-qd2z9-worker   2         2         2       2           41m

[miyadav@miyadav vsphere]$ vi mhc
mhc_master.yaml  mhc_wind.yaml    mhc.yaml         
[miyadav@miyadav vsphere]$ vi mhc.yaml
[miyadav@miyadav vsphere]$ oc create -f  mhc.yaml 
machinehealthcheck.machine.openshift.io/mhc2 created

[miyadav@miyadav vsphere]$ oc get machines -o wide
NAME                                PHASE     TYPE   REGION   ZONE   AGE   NODE                                PROVIDERID                                       STATE
miyadav-1708bz-qd2z9-master-0       Running                          42m   miyadav-1708bz-qd2z9-master-0       vsphere://422c006b-b924-00f3-94e6-bf53e8a76f42   poweredOn
miyadav-1708bz-qd2z9-master-1       Running                          42m   miyadav-1708bz-qd2z9-master-1       vsphere://422c5747-ffe0-d3a2-fd6a-7c97eace5ab6   poweredOn
miyadav-1708bz-qd2z9-master-2       Running                          42m   miyadav-1708bz-qd2z9-master-2       vsphere://422c2a72-2823-d21f-6e2e-6f713b8de8dd   poweredOn
miyadav-1708bz-qd2z9-worker-hvxb9   Running                          38m   miyadav-1708bz-qd2z9-worker-hvxb9   vsphere://422c8da3-37ae-9918-b478-a4df5a074987   poweredOn
miyadav-1708bz-qd2z9-worker-nhftb   Running                          38m   miyadav-1708bz-qd2z9-worker-nhftb   vsphere://422c09b6-d202-7c43-a325-90e4e80544b6   poweredOn

[miyadav@miyadav vsphere]$ oc delete node miyadav-1708bz-qd2z9-worker-hvxb9
node "miyadav-1708bz-qd2z9-worker-hvxb9" deleted

[miyadav@miyadav vsphere]$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0
mhc2                              3              2                  1

[miyadav@miyadav vsphere]$ oc get machines -o wide
NAME                                PHASE          TYPE   REGION   ZONE   AGE   NODE                                PROVIDERID                                       STATE
miyadav-1708bz-qd2z9-master-0       Running                               42m   miyadav-1708bz-qd2z9-master-0       vsphere://422c006b-b924-00f3-94e6-bf53e8a76f42   poweredOn
miyadav-1708bz-qd2z9-master-1       Running                               42m   miyadav-1708bz-qd2z9-master-1       vsphere://422c5747-ffe0-d3a2-fd6a-7c97eace5ab6   poweredOn
miyadav-1708bz-qd2z9-master-2       Running                               42m   miyadav-1708bz-qd2z9-master-2       vsphere://422c2a72-2823-d21f-6e2e-6f713b8de8dd   poweredOn
miyadav-1708bz-qd2z9-worker-nhftb   Running                               39m   miyadav-1708bz-qd2z9-worker-nhftb   vsphere://422c09b6-d202-7c43-a325-90e4e80544b6   poweredOn
miyadav-1708bz-qd2z9-worker-nmtx6   Provisioning                          14s 


[miyadav@miyadav vsphere]$ oc get machines -o wide 
NAME                                PHASE     TYPE   REGION   ZONE   AGE   NODE                                PROVIDERID                                       STATE
miyadav-1708bz-qd2z9-master-0       Running                          52m   miyadav-1708bz-qd2z9-master-0       vsphere://422c006b-b924-00f3-94e6-bf53e8a76f42   poweredOn
miyadav-1708bz-qd2z9-master-1       Running                          52m   miyadav-1708bz-qd2z9-master-1       vsphere://422c5747-ffe0-d3a2-fd6a-7c97eace5ab6   poweredOn
miyadav-1708bz-qd2z9-master-2       Running                          52m   miyadav-1708bz-qd2z9-master-2       vsphere://422c2a72-2823-d21f-6e2e-6f713b8de8dd   poweredOn
miyadav-1708bz-qd2z9-worker-nhftb   Running                          48m   miyadav-1708bz-qd2z9-worker-nhftb   vsphere://422c09b6-d202-7c43-a325-90e4e80544b6   poweredOn
miyadav-1708bz-qd2z9-worker-nmtx6   Running                          10m   miyadav-1708bz-qd2z9-worker-nmtx6   vsphere://422c3973-947e-1c4d-3b4a-b14fe6882d4a   poweredOn
[miyadav@miyadav vsphere]$ 



Additional info:
Moved to VERIFIED based on the above results.

Comment 4 ximhan 2021-08-20 07:26:57 UTC
OpenShift engineering has decided to NOT ship 4.8.6 on 8/23 due to the following issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All the fixes will now be included in 4.8.7 on 8/30.

Comment 8 errata-xmlrpc 2021-08-31 16:17:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.9 bug fix), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3247