Bug 1977637 - vSphere Machines stuck in deleting phase if associated Node object is deleted
Summary: vSphere Machines stuck in deleting phase if associated Node object is deleted
Keywords:
Status: CLOSED DUPLICATE of bug 1989648
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: dmoiseev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1977634
Blocks:
 
Reported: 2021-06-30 08:14 UTC by Milind Yadav
Modified: 2021-08-19 10:23 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1977369
Environment:
Last Closed: 2021-08-19 10:23:36 UTC
Target Upstream Version:
Embargoed:



Description Milind Yadav 2021-06-30 08:14:29 UTC
+++ This bug was initially created as a clone of Bug #1977369 +++

Description of problem:

If a vSphere Node object has been deleted and the Machine's associated instance has entered a bad state that prevents it from rejoining the cluster, the Machine cannot be deleted and remains stuck in the deleting phase seemingly forever.

Errors in machine reconciler log:
```
 I0628 22:43:56.945572 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="e2e-fhn5z: reconciler failed to Delete machine: e2e-fhn5z: Can't check node status before vm destroy: nodes \"windows-host\" not found" "object"=
```

The logic behind this is here:
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/reconciler.go#L276-L280

Perhaps checkNodeReachable should return false (with no error) when the Node object does not exist?
https://github.com/openshift/machine-api-operator/blob/e461e729a3077016228f34d0db8845b421a91c6d/pkg/controller/vsphere/machine_scope.go#L158
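The suggestion above could be sketched roughly as below. This is a simplified standalone snippet, not the actual machine-api-operator code: errNodeNotFound, getNode, and the map-backed node store are hypothetical stand-ins for the apierrors.IsNotFound check and the client lookup done in machine_scope.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound stands in for the NotFound error the Kubernetes
// client returns when the Node object has been deleted.
var errNodeNotFound = errors.New("node not found")

// getNode is a hypothetical lookup standing in for the machine scope's
// client.Get call on the Node object. The map maps node names to their
// Ready status.
func getNode(name string, nodes map[string]bool) (bool, error) {
	ready, ok := nodes[name]
	if !ok {
		return false, errNodeNotFound
	}
	return ready, nil
}

// checkNodeReachable sketches the suggested behaviour: if the Node
// object no longer exists, treat it as unreachable (false, nil) instead
// of returning an error, so Machine deletion can proceed.
func checkNodeReachable(name string, nodes map[string]bool) (bool, error) {
	ready, err := getNode(name, nodes)
	if err != nil {
		if errors.Is(err, errNodeNotFound) {
			// Node is gone: not reachable, but not a reason to
			// block the delete reconcile.
			return false, nil
		}
		return false, err
	}
	return ready, nil
}

func main() {
	reachable, err := checkNodeReachable("windows-host", map[string]bool{})
	fmt.Println(reachable, err)
}
```

With this shape, a deleted Node yields (false, nil) and the reconciler can continue with the VM destroy, rather than failing with "Can't check node status before vm destroy".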

Version-Release number of selected component (if applicable):
4.8 RC

How reproducible:
Always

Steps to Reproduce:
1. Delete vSphere node
2. Cause the vSphere instance to become unconfigurable by the cluster. (In the case of the Windows Machine Config Operator, this was done by removing the ability to SSH into the instance.)
3. Attempt to delete the Machine

Actual results:
The Machine is stuck in the deleting phase.

Expected results:
The Machine is deleted.

Additional info:

==============================================================================
Another scenario caused by the same issue

Cluster version is 4.7.0-0.nightly-2021-06-26-014854

Steps:
1. Create an MHC using the manifest below:

apiVersion: "machine.openshift.io/v1beta1"
kind: "MachineHealthCheck"
metadata:
  name: mhc2
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-30vsp-p4gh2
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-30vsp-p4gh2-worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 3


Expected & actual: MHC created successfully

2. Delete the worker node referenced by the MachineSet that the MHC monitors.
The node is deleted successfully:
[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h5m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   173m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-pvp8p   Ready    worker   24m    v1.20.0+87cc9a4

[miyadav@miyadav ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE     VERSION
miyadav-30vsp-p4gh2-master-0       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-1       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-master-2       Ready    master   3h27m   v1.20.0+87cc9a4
miyadav-30vsp-p4gh2-worker-84gsl   Ready    worker   3h15m   v1.20.0+87cc9a4

3. A new machine is provisioned and the old one remains in the Deleting phase:

[miyadav@miyadav ~]$ oc get machines 
NAME                               PHASE         TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                              3h28m
miyadav-30vsp-p4gh2-master-1       Running                              3h28m
miyadav-30vsp-p4gh2-master-2       Running                              3h28m
miyadav-30vsp-p4gh2-worker-84gsl   Running                              3h22m
miyadav-30vsp-p4gh2-worker-p8g8p   Provisioned                          72s
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                             48m
.
.
[miyadav@miyadav ~]$ oc get machines
NAME                               PHASE      TYPE   REGION   ZONE   AGE
miyadav-30vsp-p4gh2-master-0       Running                           3h52m
miyadav-30vsp-p4gh2-master-1       Running                           3h52m
miyadav-30vsp-p4gh2-master-2       Running                           3h52m
miyadav-30vsp-p4gh2-worker-84gsl   Running                           3h46m
miyadav-30vsp-p4gh2-worker-p8g8p   Running                           24m
miyadav-30vsp-p4gh2-worker-pvp8p   Deleting                          72m


Expected and actual:
The new machine is provisioned successfully.
The old machine is stuck in the Deleting state with the error below:
Events:
  Type     Reason          Age                   From                           Message
  ----     ------          ----                  ----                           -------
  Normal   Create          70m                   vspherecontroller              Created Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Warning  FailedCreate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Create machine: task task-3107089 has not finished
  Warning  FailedUpdate    70m                   vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Update machine: task task-3107089 has not finished
  Normal   Update          30m (x14 over 69m)    vspherecontroller              Updated Machine miyadav-30vsp-p4gh2-worker-pvp8p
  Normal   MachineDeleted  22m                   machinehealthcheck-controller  Machine openshift-machine-api/mhc2/miyadav-30vsp-p4gh2-worker-pvp8p/miyadav-30vsp-p4gh2-worker-pvp8p has been remediated by requesting to delete Machine object
  Warning  FailedDelete    3m19s (x20 over 22m)  vspherecontroller              miyadav-30vsp-p4gh2-worker-pvp8p: reconciler failed to Delete machine: miyadav-30vsp-p4gh2-worker-pvp8p: Can't check node status before vm destroy: nodes "miyadav-30vsp-p4gh2-worker-pvp8p" not found

Additional info: will attach must-gather.

Comment 2 kevin 2021-07-12 01:17:29 UTC
I have also encountered machines stuck in the Deleting status for a long time:

# oc get machine -n openshift-machine-api
NAME                             PHASE      TYPE   REGION   ZONE   AGE
cluster1-storage-vlan-40-h8mm5   Running                           44d
cluster1-storage-vlan-40-s2lv4   Running                           44d
cluster1-storage-vlan-40-tqsd6   Running                           44d
cluster1-worker-vlan-40-2w7hx    Running                           7h58m
cluster1-worker-vlan-40-kxbvw    Deleting                          4d1h
cluster1-worker-vlan-50-lqs9m    Deleting                          4d1h

# oc describe machine -n openshift-machine-api cluster1-worker-vlan-40-kxbvw 
Name:         cluster1-worker-vlan-40-kxbvw
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=cluster1-vh9zx
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=cluster1-worker-vlan-40
              machine.openshift.io/region=
              machine.openshift.io/zone=
Annotations:  machine.openshift.io/instance-state: poweredOn
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:             2021-07-08T00:01:50Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2021-07-11T15:27:56Z
  Finalizers:
    machine.machine.openshift.io
  Generate Name:  cluster1-worker-vlan-40-
  Generation:     3
  Managed Fields:
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:machine.openshift.io/cluster-api-machine-role:
          f:machine.openshift.io/cluster-api-machine-type:
          f:machine.openshift.io/cluster-api-machineset:
        f:ownerReferences:
          .:
          k:{"uid":"a7bab249-e4f0-45b3-9d7a-9b403462f69b"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:metadata:
          .:
          f:labels:
            .:
            f:node-role.kubernetes.io/app:
            f:node-role.kubernetes.io/vlan-40:
        f:providerSpec:
          .:
          f:value:
            .:
            f:apiVersion:
            f:credentialsSecret:
            f:diskGiB:
            f:kind:
            f:memoryMiB:
            f:metadata:
            f:network:
            f:numCPUs:
            f:numCoresPerSocket:
            f:snapshot:
            f:template:
            f:userDataSecret:
            f:workspace:
    Manager:      machineset-controller
    Operation:    Update
    Time:         2021-07-08T00:01:50Z
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:machine.openshift.io/instance-state:
        f:finalizers:
          .:
          v:"machine.machine.openshift.io":
        f:labels:
          f:machine.openshift.io/region:
          f:machine.openshift.io/zone:
      f:spec:
        f:providerID:
      f:status:
        .:
        f:addresses:
        f:phase:
        f:providerStatus:
          .:
          f:conditions:
          f:instanceId:
          f:instanceState:
          f:taskRef:
    Manager:      machine-controller-manager
    Operation:    Update
    Time:         2021-07-11T15:28:16Z
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:lastUpdated:
        f:nodeRef:
          .:
          f:kind:
          f:name:
          f:uid:
    Manager:    nodelink-controller
    Operation:  Update
    Time:       2021-07-11T18:24:10Z
  Owner References:
    API Version:           machine.openshift.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  cluster1-worker-vlan-40
    UID:                   a7bab249-e4f0-45b3-9d7a-9b403462f69b
  Resource Version:        82676685
  Self Link:               /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/cluster1-worker-vlan-40-kxbvw
  UID:                     0c3b83b1-9dfb-495d-ac58-a6b831d38435
Spec:
  Metadata:
    Labels:
      node-role.kubernetes.io/app:      
      node-role.kubernetes.io/vlan-40:  
  Provider ID:                          vsphere://422b6bc8-fabd-939d-0d45-ed987f81334d
  Provider Spec:
    Value:
      API Version:  vsphereprovider.openshift.io/v1beta1
      Credentials Secret:
        Name:       vsphere-cloud-credentials
      Disk Gi B:    120
      Kind:         VSphereMachineProviderSpec
      Memory Mi B:  8192
      Metadata:
        Creation Timestamp:  <nil>
      Network:
        Devices:
          Network Name:      OCP4-Lab2-vLan-40
          Network Name:      OCP4-Lab2-vLan-Trunk
          Network Name:      OCP4-Lab2-vLan-Trunk
      Num CP Us:             2
      Num Cores Per Socket:  1
      Snapshot:              
      Template:              rhcos-4.7.7-x86_64
      User Data Secret:
        Name:  worker-user-data
      Workspace:
        Datacenter:  Datacenter1
        Datastore:   datastore3
        Folder:      /Datacenter1/vm/OCP4-Lab2/03-DataCenter/OCP/Cluster-1/cluster1-worker-vlan-40
        Server:      192.168.1.6
Status:
  Addresses:
    Address:     192.168.40.55
    Type:        InternalIP
    Address:     fe80::5c56:1229:ae1e:dfbd
    Type:        InternalIP
    Address:     cluster1-worker-vlan-40-kxbvw
    Type:        InternalDNS
  Last Updated:  2021-07-12T00:43:14Z
  Node Ref:
    Kind:  Node
    Name:  cluster1-worker-vlan-40-kxbvw
    UID:   6cb0b173-c2f1-44aa-8cca-cc3671edff10
  Phase:   Deleting
  Provider Status:
    Conditions:
      Last Probe Time:       2021-07-08T00:01:50Z
      Last Transition Time:  2021-07-08T00:01:50Z
      Message:               Machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreation
    Instance Id:             422b6bc8-fabd-939d-0d45-ed987f81334d
    Instance State:          poweredOn
    Task Ref:                task-61052
Events:                      <none>

#  oc get nodes
NAME                                       STATUS                     ROLES                AGE     VERSION
cluster1-storage-vlan-40-h8mm5             Ready                      storage,worker       44d     v1.20.0+2817867
cluster1-storage-vlan-40-s2lv4             Ready                      storage,worker       44d     v1.20.0+2817867
cluster1-storage-vlan-40-tqsd6             Ready                      storage,worker       44d     v1.20.0+2817867
cluster1-worker-vlan-40-2w7hx              Ready                      app,vlan-40,worker   7h51m   v1.20.0+2817867
cluster1-worker-vlan-40-kxbvw              Ready,SchedulingDisabled   app,vlan-40,worker   4d1h    v1.20.0+2817867
cluster1-worker-vlan-50-lqs9m              Ready,SchedulingDisabled   app,vlan-50,worker   4d1h    v1.20.0+2817867
master-01.cluster1.ocp4.example.internal   Ready                      master               56d     v1.20.0+2817867
master-02.cluster1.ocp4.example.internal   Ready                      master               56d     v1.20.0+2817867
master-03.cluster1.ocp4.example.internal   Ready                      master               56d     v1.20.0+2817867

# oc describe nodes cluster1-worker-vlan-40-kxbvw
Name:               cluster1-worker-vlan-40-kxbvw
Roles:              app,vlan-40,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=cluster1-worker-vlan-40-kxbvw
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/app=
                    node-role.kubernetes.io/vlan-40=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"openshift-storage.cephfs.csi.ceph.com":"cluster1-worker-vlan-40-kxbvw","openshift-storage.rbd.csi.ceph.com":"cluster1-worker-vlan-40-kxb...
                    machine.openshift.io/machine: openshift-machine-api/cluster1-worker-vlan-40-kxbvw
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-a7aa7de76b7ef645f66b332beb7766dd
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-a7aa7de76b7ef645f66b332beb7766dd
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 08 Jul 2021 08:09:33 +0800
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  cluster1-worker-vlan-40-kxbvw
  AcquireTime:     <unset>
  RenewTime:       Mon, 12 Jul 2021 09:16:32 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 12 Jul 2021 09:12:01 +0800   Mon, 12 Jul 2021 08:41:58 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  ExternalIP:  192.168.40.55
  InternalIP:  192.168.40.55
  Hostname:    cluster1-worker-vlan-40-kxbvw
Capacity:
  cpu:                2
  ephemeral-storage:  125293548Ki
  hugepages-2Mi:      0
  memory:             8153700Ki
  pods:               250
Allocatable:
  cpu:                1500m
  ephemeral-storage:  114396791822
  hugepages-2Mi:      0
  memory:             7002724Ki
  pods:               250
System Info:
  Machine ID:                             6a4377b6bbff45ba9b177c0418ee0291
  System UUID:                            c86b2b42-bdfa-9d93-0d45-ed987f81334d
  Boot ID:                                71ec56c1-6526-4465-900a-2aaf347f1230
  Kernel Version:                         4.18.0-240.22.1.el8_3.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
  Kubelet Version:                        v1.20.0+2817867
  Kube-Proxy Version:                     v1.20.0+2817867
ProviderID:                               vsphere://422b6bc8-fabd-939d-0d45-ed987f81334d
Non-terminated Pods:                      (14 in total)
  Namespace                               Name                            CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                            ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-c4cg8                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4d1h
  openshift-dns                           dns-default-j8lcv               65m (4%)      0 (0%)      131Mi (1%)       0 (0%)         4d1h
  openshift-image-registry                node-ca-j5x2h                   10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4d1h
  openshift-ingress-canary                ingress-canary-j994v            10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4d1h
  openshift-machine-config-operator       machine-config-daemon-jxszk     40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         4d1h
  openshift-monitoring                    node-exporter-dn5kx             9m (0%)       0 (0%)      210Mi (3%)       0 (0%)         4d1h
  openshift-multus                        multus-nn76n                    10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         4d1h
  openshift-multus                        network-metrics-daemon-s9h7p    20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         4d1h
  openshift-network-diagnostics           network-check-target-7sh8h      10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         4d1h
  openshift-nmstate                       nmstate-handler-g79nz           0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d
  openshift-sdn                           sdn-jv4g5                       110m (7%)     0 (0%)      220Mi (3%)       0 (0%)         4d1h
  openshift-storage                       csi-cephfsplugin-nfrlg          0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d1h
  openshift-storage                       csi-rbdplugin-9tn4h             0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d1h
  percona-test-1                          cluster1-haproxy-1              200m (13%)    0 (0%)      1G (13%)         0 (0%)         4d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                494m (32%)        0 (0%)
  memory             2075838976 (28%)  0 (0%)
  ephemeral-storage  0 (0%)            0 (0%)
  hugepages-2Mi      0 (0%)            0 (0%)
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  98m (x60 over 4d1h)  kubelet  Node cluster1-worker-vlan-40-kxbvw status is now: NodeHasSufficientMemory

Comment 3 kevin 2021-07-12 01:18:10 UTC
My OCP version is 4.7.16, vSphere UPI + MachineSet.

Comment 4 dmoiseev 2021-07-13 10:11:00 UTC
@welin This is not related to this bug; the Node is still there:

```cluster1-worker-vlan-40-kxbvw              Ready,SchedulingDisabled   app,vlan-40,worker   4d1h    v1.20.0+2817867```

Comment 5 Joel Speed 2021-08-19 10:23:36 UTC

*** This bug has been marked as a duplicate of bug 1989648 ***

