+++ This bug was initially created as a clone of Bug #1977369 +++

Description of problem:

If a vSphere node object has been deleted and the Machine's associated instance has entered a bad state, which prevents it from being joined to the cluster again, the Machine cannot be deleted and will be stuck in the deleting phase seemingly forever.

Errors in machine reconciler log:
```
I0628 22:43:56.945572 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

Logic behind this is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) in the case of the node object not existing?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete vSphere node
2. Cause vSphere instance to be un-configurable by the cluster. (In case of Windows Machine Config Operator this was removing the ability to SSH into the instance)
3. Attempt to delete the Machine

Actual results:
The Machine is stuck in the deleting phase.

Expected results:
The Machine is deleted.
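
To illustrate the suggestion in the description (have checkNodeReachable return false with no error when the node object does not exist), here is a minimal, self-contained sketch. It is not the operator's actual code: `errNodeNotFound`, `node`, `nodeGetter`, and `fakeCluster` are hypothetical stand-ins so the example runs without a cluster, and the signature of `checkNodeReachable` here is an assumption. In the real controller the not-found check would presumably be done with `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNodeNotFound = errors.New("not found")

// node is a simplified, hypothetical stand-in for a Kubernetes Node object.
type node struct {
	Name  string
	Ready bool
}

// nodeGetter is a hypothetical lookup interface standing in for the API client.
type nodeGetter interface {
	GetNode(name string) (*node, error)
}

// fakeCluster is an in-memory nodeGetter used only for this example.
type fakeCluster map[string]*node

// GetNode returns the named node, or an error wrapping errNodeNotFound.
func (c fakeCluster) GetNode(name string) (*node, error) {
	n, ok := c[name]
	if !ok {
		return nil, fmt.Errorf("nodes %q: %w", name, errNodeNotFound)
	}
	return n, nil
}

// checkNodeReachable reports whether the node exists and is Ready.
// A missing node is treated as "not reachable" rather than as an error,
// so that a caller on the delete path can proceed instead of retrying forever.
func checkNodeReachable(g nodeGetter, name string) (bool, error) {
	n, err := g.GetNode(name)
	if err != nil {
		if errors.Is(err, errNodeNotFound) {
			return false, nil // node object is gone: unreachable, not a failure
		}
		return false, err // any other lookup error is still surfaced
	}
	return n.Ready, nil
}

func main() {
	cluster := fakeCluster{
		"worker-a": &node{Name: "worker-a", Ready: true},
	}

	reachable, err := checkNodeReachable(cluster, "worker-a")
	fmt.Println("worker-a:", reachable, err)

	// "windows-host" has no node object, as in the log line above.
	reachable, err = checkNodeReachable(cluster, "windows-host")
	fmt.Println("windows-host:", reachable, err)
}
```

With this behaviour, the "Can't check node status before vm destroy" path would only be hit for genuine lookup failures, not for a node that has already been deleted.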
Additional info:

==============================================================================
Another scenario caused by the same issue

Cluster version is 4.7.0-0.nightly-2021-06-26-014854

Steps:

1. Create a mhc using below :

Expected & Actual : mhc created successfully

2. Delete the worker node referenced by the machineset being monitored by the mhc

Node deleted successfully

[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   173m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-pvp8p   Ready    worker   24m    v1.20.0+87cc9a4

[miyadav@miyadav ~]$ oc get nodes oc get machines
NAME                               STATUS   ROLES    AGE     VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   3h15m   v1.20.0+87cc9a4

3. New machine provisioned and old one deleted

[miyadav@miyadav ~]$ oc get machines
NAME                               PHASE         TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                              3h28m
miyadav-30vsp-p4gh2-master-1       Running                              3h28m
miyadav-30vsp-p4gh2-master-2       Running                              3h28m
miyadav-30vsp-p4gh2-worker-84gsl   Running                              3h22m
miyadav-30vsp-p4gh2-worker-p8g8p   Provisioned                          72s
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                             48m
.
.
[miyadav@miyadav ~]$ oc get machines
NAME                               PHASE      TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                           3h52m
miyadav-30vsp-p4gh2-master-1       Running                           3h52m
miyadav-30vsp-p4gh2-master-2       Running                           3h52m
miyadav-30vsp-p4gh2-worker-84gsl   Running                           3h46m
miyadav-30vsp-p4gh2-worker-p8g8p   Running                           24m
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                          72m

Expected and actual : New machine provisioned successfully. The old one is stuck in the Deleting state with the error below:

Events:
  Type     Reason          Age                   From                           Message
  ----     ------          ----                  ----                           -------
  Normal   Create          70m                   vspherecontroller              Created Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Warning  FailedCreate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Create machine: task task-3107089 has not finished
  Warning  FailedUpdate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Update machine: task task-3107089 has not finished
  Normal   Update          30m (x14 over 69m)    vspherecontroller              Updated Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Normal   MachineDeleted  22m                   machinehealthcheck-controller  Machine openshift-machine-api/mhc2/miyadav-30vsp-p4gh2-worker-pvp8p/miyadav-30vsp-p4gh2-worker-pvp8p has been remediated by requesting to delete Machine object
  Warning  FailedDelete    3m19s (x20 over 22m)  vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Delete machine: miyadav-30vsp-p4gh2-worker-pvp8p: Can't check node status before vm destroy: nodes "miyadav-30vsp-p4gh2-worker-pvp8p" not found

Additional info : Will attach must-gather
Validated on - Cluster version is 4.8.0-0.nightly-2021-08-17-004424

The Machine did not get stuck in the Deleting phase as reported earlier.

[miyadav@miyadav vsphere]$ oc get machineset
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-1708bz-qd2z9-worker   2         2         2       2           41m

[miyadav@miyadav vsphere]$ vi mhc
mhc_master.yaml  mhc_wind.yaml    mhc.yaml
[miyadav@miyadav vsphere]$ vi mhc.yaml
[miyadav@miyadav vsphere]$ oc create -f mhc.yaml
machinehealthcheck.machine.openshift.io/mhc2 created

[miyadav@miyadav vsphere]$ oc get machines -o wide
NAME                                PHASE     TYPE   REGION   ZONE   AGE   NODE                                PROVIDERID                                       STATE
miyadav-1708bz-qd2z9-master-0       Running                          42m   miyadav-1708bz-qd2z9-master-0       vsphere://422c006b-b924-00f3-94e6-bf53e8a76f42   poweredOn
miyadav-1708bz-qd2z9-master-1       Running                          42m   miyadav-1708bz-qd2z9-master-1       vsphere://422c5747-ffe0-d3a2-fd6a-7c97eace5ab6   poweredOn
miyadav-1708bz-qd2z9-master-2       Running                          42m   miyadav-1708bz-qd2z9-master-2       vsphere://422c2a72-2823-d21f-6e2e-6f713b8de8dd   poweredOn
miyadav-1708bz-qd2z9-worker-hvxb9   Running                          38m   miyadav-1708bz-qd2z9-worker-hvxb9   vsphere://422c8da3-37ae-9918-b478-a4df5a074987   poweredOn
miyadav-1708bz-qd2z9-worker-nhftb   Running                          38m   miyadav-1708bz-qd2z9-worker-nhftb   vsphere://422c09b6-d202-7c43-a325-90e4e80544b6   poweredOn

[miyadav@miyadav vsphere]$ oc delete node miyadav-1708bz-qd2z9-worker-hvxb9
node "miyadav-1708bz-qd2z9-worker-hvxb9" deleted

[miyadav@miyadav vsphere]$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0
mhc2                              3              2                  1

[miyadav@miyadav vsphere]$ oc get machines -o wide
NAME                                PHASE          TYPE   REGION   ZONE   AGE   NODE                                PROVIDERID                                       STATE
miyadav-1708bz-qd2z9-master-0       Running                               42m   miyadav-1708bz-qd2z9-master-0       vsphere://422c006b-b924-00f3-94e6-bf53e8a76f42   poweredOn
miyadav-1708bz-qd2z9-master-1       Running                               42m   miyadav-1708bz-qd2z9-master-1       vsphere://422c5747-ffe0-d3a2-fd6a-7c97eace5ab6   poweredOn
miyadav-1708bz-qd2z9-master-2       Running                               42m   miyadav-1708bz-qd2z9-master-2       vsphere://422c2a72-2823-d21f-6e2e-6f713b8de8dd   poweredOn
miyadav-1708bz-qd2z9-worker-nhftb   Running                               39m   miyadav-1708bz-qd2z9-worker-nhftb   vsphere://422c09b6-d202-7c43-a325-90e4e80544b6   poweredOn
miyadav-1708bz-qd2z9-worker-nmtx6   Provisioning                          14s

[miyadav@miyadav vsphere]$ oc get machines -o wide
NAME                                PHASE     TYPE   REGION   ZONE   AGE   NODE                                PROVIDERID                                       STATE
miyadav-1708bz-qd2z9-master-0       Running                          52m   miyadav-1708bz-qd2z9-master-0       vsphere://422c006b-b924-00f3-94e6-bf53e8a76f42   poweredOn
miyadav-1708bz-qd2z9-master-1       Running                          52m   miyadav-1708bz-qd2z9-master-1       vsphere://422c5747-ffe0-d3a2-fd6a-7c97eace5ab6   poweredOn
miyadav-1708bz-qd2z9-master-2       Running                          52m   miyadav-1708bz-qd2z9-master-2       vsphere://422c2a72-2823-d21f-6e2e-6f713b8de8dd   poweredOn
miyadav-1708bz-qd2z9-worker-nhftb   Running                          48m   miyadav-1708bz-qd2z9-worker-nhftb   vsphere://422c09b6-d202-7c43-a325-90e4e80544b6   poweredOn
miyadav-1708bz-qd2z9-worker-nmtx6   Running                          10m   miyadav-1708bz-qd2z9-worker-nmtx6   vsphere://422c3973-947e-1c4d-3b4a-b14fe6882d4a   poweredOn
[miyadav@miyadav vsphere]$

Additional info:
Moved to VERIFIED based on the above results.
OpenShift engineering has decided to NOT ship 4.8.6 on 8/23 due to the following issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All the fixes that were part of it will now be included in 4.8.7 on 8/30.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.9 bug fix), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3247