Description of problem:
Pod deletion and volume detach happen asynchronously, so a pod could be deleted before its volume is detached from the node.
When deleting a Machine, this can cause issues for vsphere-volume: if the node is deleted before the volume has been detached successfully, the underlying volume will be deleted together with the Machine.
After Machine deletion, its volumes should remain untouched.
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1883993
Upstream issue: https://github.com/kubernetes-sigs/cluster-api/issues/4707
Validated on:
[miyadav@miyadav vsphere]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-26-040328   True        False         39m     Cluster version is 4.9.0-0.nightly-2021-08-26-040328
1. Create a PVC with the YAML below.
[miyadav@miyadav vsphere]$ oc create -f pvc.yaml
Result: PVC created successfully
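The pvc.yaml used here is not attached to the bug; the following is a minimal sketch of what it could look like (the claim name and size are placeholders, and the cluster's default vSphere StorageClass is assumed):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1          # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi    # placeholder size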
2. Create a deployment that uses the PVC with the YAML below.
[miyadav@miyadav vsphere]$ oc create -f deploymentyaml.yaml
Result: deployment created successfully
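deploymentyaml.yaml is likewise not attached; a minimal sketch, assuming a single-replica deployment named dep1 (matching the pod seen below) that mounts the PVC above. The image, labels, and mount path are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dep1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dep1
  template:
    metadata:
      labels:
        app: dep1
    spec:
      containers:
        - name: app
          image: registry.access.redhat.com/ubi8/ubi-minimal   # placeholder image
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: data
              mountPath: /data                                 # placeholder path
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc1                                    # must match the PVC above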
4. Stop the kubelet on the node running the pod, then delete the Machine that owns that Node object; we should see proper log messages showing the disks being detached before the VM is destroyed.
[miyadav@miyadav vsphere]$ oc get pods -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-78bf97c749-5xvkp   2/2     Running   0          38m     10.130.0.30   miyadav-2708-hptnr-master-1       <none>           <none>
cluster-baremetal-operator-688fcf9594-dvwvk    2/2     Running   0          38m     10.130.0.21   miyadav-2708-hptnr-master-1       <none>           <none>
dep1-64495756b4-sqd7c                          1/1     Running   0          4m26s   10.131.0.31   miyadav-2708-hptnr-worker-h8nmj   <none>           <none>
machine-api-controllers-7f49d8bbbb-nfj5g       7/7     Running   0          35m     10.128.0.11   miyadav-2708-hptnr-master-2       <none>           <none>
machine-api-operator-779c45669b-c8dht          2/2     Running   0          38m     10.130.0.25   miyadav-2708-hptnr-master-1       <none>           <none>
[miyadav@miyadav vsphere]$ oc debug node/miyadav-2708-hptnr-worker-h8nmj
Starting pod/miyadav-2708-hptnr-worker-h8nmj-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.39
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# systemctl stop kubelet
Removing debug pod ...
[miyadav@miyadav vsphere]$ oc delete machine miyadav-2708-hptnr-worker-h8nmj
machine.machine.openshift.io "miyadav-2708-hptnr-worker-h8nmj" deleted
I0827 04:18:36.039313 1 reconciler.go:284] miyadav-2708-hptnr-worker-h8nmj: node not ready, kubelet unreachable for some reason. Detaching disks before vm destroy.
I0827 04:18:36.053559 1 reconciler.go:792] miyadav-2708-hptnr-worker-h8nmj: Updating provider status
I0827 04:18:36.057589 1 machine_scope.go:102] miyadav-2708-hptnr-worker-h8nmj: patching machine
E0827 04:18:36.082391 1 actuator.go:57] miyadav-2708-hptnr-worker-h8nmj error: miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling
E0827 04:18:36.082442 1 controller.go:239] miyadav-2708-hptnr-worker-h8nmj: failed to delete machine: miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling
E0827 04:18:36.082486 1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling" "name"="miyadav-2708-hptnr-worker-h8nmj" "namespace"="openshift-machine-api"
Looks good to me; will wait for some time in case there is any input on the test steps. If there are no comments, I will move to VERIFIED.
Test case looks good to me. The only suggestion I would add is to check the PVC/disk to make sure it's still OK, e.g. check it still exists in vCenter and that there are no errors reported on the PVC object.
Thanks @Joel, I checked from the vSphere side as well; the vmdk persisted even after the Machine was deleted and a new Machine was provisioned in its place.
Moving to VERIFIED.
Validated on a different cluster today; even after deleting the Machine, I could see the disk still exists.
[miyadav@miyadav ~]$ govc datastore.ls -l '5137595f-7ce3-e95a-5c03-06d835dea807' | grep 'miyadav-2708'
12.0MB Mon Aug 30 05:45:14 2021 miyadav-2708-htqh4-dyn-pvc-413e0eaa-549c-4aa4-b969-bbc96550a6d3.vmdk
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.