+++ This bug was initially created as a clone of Bug #1977369 +++

Description of problem:

If a vSphere Node object has been deleted and the Machine's associated instance has entered a bad state that prevents it from being joined to the cluster again, the Machine cannot be deleted and will be stuck in the Deleting phase seemingly forever.

Errors in the machine reconciler log:

```
I0628 22:43:56.945572       1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

The logic behind this is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) in the case of the Node object not existing?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete the vSphere node.
2. Cause the vSphere instance to be un-configurable by the cluster (in the case of the Windows Machine Config Operator, this was removing the ability to SSH into the instance).
3. Attempt to delete the Machine.

Actual results:
The Machine is stuck in the Deleting phase.

Expected results:
The Machine is deleted.
Additional info:

==============================================================================

Another scenario caused by this:

Cluster version is 4.7.0-0.nightly-2021-06-26-014854

Steps:

1. Create an MHC using the below:

apiVersion: "machine.openshift.io/v1beta1"
kind: "MachineHealthCheck"
metadata:
  name: mhc2
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-30vsp-p4gh2
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-30vsp-p4gh2-worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 3

Expected & actual: MHC created successfully.

2. Delete the worker node referenced by the MachineSet being monitored by the MHC.

Node deleted successfully:

[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   173m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-pvp8p   Ready    worker   24m    v1.20.0+87cc9a4

[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE     VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   3h15m   v1.20.0+87cc9a4

3. New machine provisioned and old one deleted:

[miyadav@miyadav ~]$ oc get machines
NAME                               PHASE         TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                              3h28m
miyadav-30vsp-p4gh2-master-1       Running                              3h28m
miyadav-30vsp-p4gh2-master-2       Running                              3h28m
miyadav-30vsp-p4gh2-worker-84gsl   Running                              3h22m
miyadav-30vsp-p4gh2-worker-p8g8p   Provisioned                          72s
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                             48m
.
.
[miyadav@miyadav ~]$ oc get machines
NAME                               PHASE      TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                           3h52m
miyadav-30vsp-p4gh2-master-1       Running                           3h52m
miyadav-30vsp-p4gh2-master-2       Running                           3h52m
miyadav-30vsp-p4gh2-worker-84gsl   Running                           3h46m
miyadav-30vsp-p4gh2-worker-p8g8p   Running                           24m
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                          72m

Expected and actual: New machine provisioned successfully. Old one stuck in the Deleting state with the below error:

Events:
  Type     Reason          Age                   From                           Message
  ----     ------          ----                  ----                           -------
  Normal   Create          70m                   vspherecontroller              Created Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Warning  FailedCreate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Create machine: task task-3107089 has not finished
  Warning  FailedUpdate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Update machine: task task-3107089 has not finished
  Normal   Update          30m (x14 over 69m)    vspherecontroller              Updated Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Normal   MachineDeleted  22m                   machinehealthcheck-controller  Machine openshift-machine-api/mhc2/miyadav-30vsp-p4gh2-worker-pvp8p/miyadav-30vsp-p4gh2-worker-pvp8p has been remediated by requesting to delete Machine object
  Warning  FailedDelete    3m19s (x20 over 22m)  vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Delete machine: miyadav-30vsp-p4gh2-worker-pvp8p: Can't check node status before vm destroy: nodes "miyadav-30vsp-p4gh2-worker-pvp8p" not found

Additional info: Will attach must-gather.
I have also encountered a Machine stuck in the Deleting status for a long time:

# oc get machine -n openshift-machine-api
NAME                             PHASE      TYPE   REGION   ZONE   AGE
cluster1-storage-vlan-40-h8mm5   Running                           44d
cluster1-storage-vlan-40-s2lv4   Running                           44d
cluster1-storage-vlan-40-tqsd6   Running                           44d
cluster1-worker-vlan-40-2w7hx    Running                           7h58m
cluster1-worker-vlan-40-kxbvw    Deleting                          4d1h
cluster1-worker-vlan-50-lqs9m    Deleting                          4d1h

# oc describe machine -n openshift-machine-api cluster1-worker-vlan-40-kxbvw
Name:         cluster1-worker-vlan-40-kxbvw
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=cluster1-vh9zx
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=cluster1-worker-vlan-40
              machine.openshift.io/region=
              machine.openshift.io/zone=
Annotations:  machine.openshift.io/instance-state: poweredOn
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:             2021-07-08T00:01:50Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2021-07-11T15:27:56Z
  Finalizers:
    machine.machine.openshift.io
  Generate Name:  cluster1-worker-vlan-40-
  Generation:     3
  Managed Fields:
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:machine.openshift.io/cluster-api-machine-role:
          f:machine.openshift.io/cluster-api-machine-type:
          f:machine.openshift.io/cluster-api-machineset:
        f:ownerReferences:
          .:
          k:{"uid":"a7bab249-e4f0-45b3-9d7a-9b403462f69b"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:metadata:
          .:
          f:labels:
            .:
            f:node-role.kubernetes.io/app:
            f:node-role.kubernetes.io/vlan-40:
        f:providerSpec:
          .:
          f:value:
            .:
            f:apiVersion:
            f:credentialsSecret:
            f:diskGiB:
            f:kind:
            f:memoryMiB:
            f:metadata:
            f:network:
            f:numCPUs:
            f:numCoresPerSocket:
            f:snapshot:
            f:template:
            f:userDataSecret:
            f:workspace:
    Manager:      machineset-controller
    Operation:    Update
    Time:         2021-07-08T00:01:50Z
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:machine.openshift.io/instance-state:
        f:finalizers:
          .:
          v:"machine.machine.openshift.io":
        f:labels:
          f:machine.openshift.io/region:
          f:machine.openshift.io/zone:
      f:spec:
        f:providerID:
      f:status:
        .:
        f:addresses:
        f:phase:
        f:providerStatus:
          .:
          f:conditions:
          f:instanceId:
          f:instanceState:
          f:taskRef:
    Manager:      machine-controller-manager
    Operation:    Update
    Time:         2021-07-11T15:28:16Z
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:lastUpdated:
        f:nodeRef:
          .:
          f:kind:
          f:name:
          f:uid:
    Manager:      nodelink-controller
    Operation:    Update
    Time:         2021-07-11T18:24:10Z
  Owner References:
    API Version:           machine.openshift.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  cluster1-worker-vlan-40
    UID:                   a7bab249-e4f0-45b3-9d7a-9b403462f69b
  Resource Version:  82676685
  Self Link:         /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/cluster1-worker-vlan-40-kxbvw
  UID:               0c3b83b1-9dfb-495d-ac58-a6b831d38435
Spec:
  Metadata:
    Labels:
      node-role.kubernetes.io/app:
      node-role.kubernetes.io/vlan-40:
  Provider ID:  vsphere://422b6bc8-fabd-939d-0d45-ed987f81334d
  Provider Spec:
    Value:
      API Version:  vsphereprovider.openshift.io/v1beta1
      Credentials Secret:
        Name:       vsphere-cloud-credentials
      Disk Gi B:    120
      Kind:         VSphereMachineProviderSpec
      Memory Mi B:  8192
      Metadata:
        Creation Timestamp:  <nil>
      Network:
        Devices:
          Network Name:  OCP4-Lab2-vLan-40
          Network Name:  OCP4-Lab2-vLan-Trunk
          Network Name:  OCP4-Lab2-vLan-Trunk
      Num CP Us:             2
      Num Cores Per Socket:  1
      Snapshot:
      Template:              rhcos-4.7.7-x86_64
      User Data Secret:
        Name:  worker-user-data
      Workspace:
        Datacenter:  Datacenter1
        Datastore:   datastore3
        Folder:      /Datacenter1/vm/OCP4-Lab2/03-DataCenter/OCP/Cluster-1/cluster1-worker-vlan-40
        Server:      192.168.1.6
Status:
  Addresses:
    Address:  192.168.40.55
    Type:     InternalIP
    Address:  fe80::5c56:1229:ae1e:dfbd
    Type:     InternalIP
    Address:  cluster1-worker-vlan-40-kxbvw
    Type:     InternalDNS
  Last Updated:  2021-07-12T00:43:14Z
  Node Ref:
    Kind:  Node
    Name:  cluster1-worker-vlan-40-kxbvw
    UID:   6cb0b173-c2f1-44aa-8cca-cc3671edff10
  Phase:  Deleting
  Provider Status:
    Conditions:
      Last Probe Time:       2021-07-08T00:01:50Z
      Last Transition Time:  2021-07-08T00:01:50Z
      Message:               Machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreation
    Instance Id:     422b6bc8-fabd-939d-0d45-ed987f81334d
    Instance State:  poweredOn
    Task Ref:        task-61052
Events:  <none>

# oc get nodes
NAME                                       STATUS                     ROLES                AGE     VERSION
cluster1-storage-vlan-40-h8mm5             Ready                      storage,worker       44d     v1.20.0+2817867
cluster1-storage-vlan-40-s2lv4             Ready                      storage,worker       44d     v1.20.0+2817867
cluster1-storage-vlan-40-tqsd6             Ready                      storage,worker       44d     v1.20.0+2817867
cluster1-worker-vlan-40-2w7hx              Ready                      app,vlan-40,worker   7h51m   v1.20.0+2817867
cluster1-worker-vlan-40-kxbvw              Ready,SchedulingDisabled   app,vlan-40,worker   4d1h    v1.20.0+2817867
cluster1-worker-vlan-50-lqs9m              Ready,SchedulingDisabled   app,vlan-50,worker   4d1h    v1.20.0+2817867
master-01.cluster1.ocp4.example.internal   Ready                      master               56d     v1.20.0+2817867
master-02.cluster1.ocp4.example.internal   Ready                      master               56d     v1.20.0+2817867
master-03.cluster1.ocp4.example.internal   Ready                      master               56d     v1.20.0+2817867

# oc describe nodes cluster1-worker-vlan-40-kxbvw
Name:               cluster1-worker-vlan-40-kxbvw
Roles:              app,vlan-40,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=cluster1-worker-vlan-40-kxbvw
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/app=
                    node-role.kubernetes.io/vlan-40=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        csi.volume.kubernetes.io/nodeid: {"openshift-storage.cephfs.csi.ceph.com":"cluster1-worker-vlan-40-kxbvw","openshift-storage.rbd.csi.ceph.com":"cluster1-worker-vlan-40-kxb...
                    machine.openshift.io/machine: openshift-machine-api/cluster1-worker-vlan-40-kxbvw
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-a7aa7de76b7ef645f66b332beb7766dd
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-a7aa7de76b7ef645f66b332beb7766dd
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 08 Jul 2021 08:09:33 +0800
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  cluster1-worker-vlan-40-kxbvw
  AcquireTime:     <unset>
  RenewTime:       Mon, 12 Jul 2021 09:16:32 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  ExternalIP:  192.168.40.55
  InternalIP:  192.168.40.55
  Hostname:    cluster1-worker-vlan-40-kxbvw
Capacity:
  cpu:                2
  ephemeral-storage:  125293548Ki
  hugepages-2Mi:      0
  memory:             8153700Ki
  pods:               250
Allocatable:
  cpu:                1500m
  ephemeral-storage:  114396791822
  hugepages-2Mi:      0
  memory:             7002724Ki
  pods:               250
System Info:
  Machine ID:                 6a4377b6bbff45ba9b177c0418ee0291
  System UUID:                c86b2b42-bdfa-9d93-0d45-ed987f81334d
  Boot ID:                    71ec56c1-6526-4465-900a-2aaf347f1230
  Kernel Version:             4.18.0-240.22.1.el8_3.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
  Kubelet Version:            v1.20.0+2817867
  Kube-Proxy Version:         v1.20.0+2817867
ProviderID:                   vsphere://422b6bc8-fabd-939d-0d45-ed987f81334d
Non-terminated Pods:          (14 in total)
  Namespace                               Name                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                          ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-c4cg8                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4d1h
  openshift-dns                           dns-default-j8lcv             65m (4%)      0 (0%)      131Mi (1%)       0 (0%)         4d1h
  openshift-image-registry                node-ca-j5x2h                 10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4d1h
  openshift-ingress-canary                ingress-canary-j994v          10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4d1h
  openshift-machine-config-operator       machine-config-daemon-jxszk   40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         4d1h
  openshift-monitoring                    node-exporter-dn5kx           9m (0%)       0 (0%)      210Mi (3%)       0 (0%)         4d1h
  openshift-multus                        multus-nn76n                  10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         4d1h
  openshift-multus                        network-metrics-daemon-s9h7p  20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         4d1h
  openshift-network-diagnostics           network-check-target-7sh8h    10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         4d1h
  openshift-nmstate                       nmstate-handler-g79nz         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d
  openshift-sdn                           sdn-jv4g5                     110m (7%)     0 (0%)      220Mi (3%)       0 (0%)         4d1h
  openshift-storage                       csi-cephfsplugin-nfrlg        0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d1h
  openshift-storage                       csi-rbdplugin-9tn4h           0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d1h
  percona-test-1                          cluster1-haproxy-1            200m (13%)    0 (0%)      1G (13%)         0 (0%)         4d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                494m (32%)        0 (0%)
  memory             2075838976 (28%)  0 (0%)
  ephemeral-storage  0 (0%)            0 (0%)
  hugepages-2Mi      0 (0%)            0 (0%)
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  98m (x60 over 4d1h)  kubelet  Node cluster1-worker-vlan-40-kxbvw status is now: NodeHasSufficientMemory
My OCP version is 4.7.16, vSphere UPI + MachineSet.
@welin Not related to this bug; the Node is still there:

```
cluster1-worker-vlan-40-kxbvw   Ready,SchedulingDisabled   app,vlan-40,worker   4d1h   v1.20.0+2817867
```
*** This bug has been marked as a duplicate of bug 1989648 ***