Description of problem: Pod deletion and volume detach happen asynchronously, so a pod can be deleted before its volume is detached from the node. When deleting a machine, this can cause issues for vsphere-volume: if the node is deleted before the volume detach succeeds, the underlying volume is deleted together with the Machine.

Expected results: After machine deletion its volumes should remain untouched.

Related to https://bugzilla.redhat.com/show_bug.cgi?id=1883993
Upstream issue: https://github.com/kubernetes-sigs/cluster-api/issues/4707
Validated on:

[miyadav@miyadav vsphere]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-26-040328   True        False         39m     Cluster version is 4.9.0-0.nightly-2021-08-26-040328

1. Create the PVC (the contents of pvc.yaml are not captured here; a possible example is sketched after these steps).
[miyadav@miyadav vsphere]$ oc create -f pvc.yaml
persistentvolumeclaim/pvc4 created
Result: PVC created successfully.

2. Create a deployment that uses the PVC, with the yaml below.
[miyadav@miyadav vsphere]$ oc create -f deploymentyaml.yaml
deployment.apps/dep1 created

apiVersion: apps/v1
kind: Deployment
metadata:
  name: "dep1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: "myfrontend"
        image: "quay.io/openshifttest/hello-openshift@sha256:aaea76ff622d2f8bcb32e538e7b3cd0ef6d291953f3e7c9f556c1ba5baf47e2e"
        ports:
        - containerPort: 80
          name: "http-server"
        volumeMounts:
        - mountPath: "/var/www/html"
          name: "pvol"
      volumes:
      - name: "pvol"
        persistentVolumeClaim:
          claimName: "pvc4"

Result: deployment created successfully.

3. Stop the kubelet on the node running the pod, then delete the machine that owns that node object. The machine controller logs should show the disks being detached before the VM is destroyed.

[miyadav@miyadav vsphere]$ oc get pods -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-78bf97c749-5xvkp   2/2     Running   0          38m     10.130.0.30   miyadav-2708-hptnr-master-1       <none>           <none>
cluster-baremetal-operator-688fcf9594-dvwvk    2/2     Running   0          38m     10.130.0.21   miyadav-2708-hptnr-master-1       <none>           <none>
dep1-64495756b4-sqd7c                          1/1     Running   0          4m26s   10.131.0.31   miyadav-2708-hptnr-worker-h8nmj   <none>           <none>
machine-api-controllers-7f49d8bbbb-nfj5g       7/7     Running   0          35m     10.128.0.11   miyadav-2708-hptnr-master-2       <none>           <none>
machine-api-operator-779c45669b-c8dht          2/2     Running   0          38m     10.130.0.25   miyadav-2708-hptnr-master-1       <none>           <none>

[miyadav@miyadav vsphere]$ oc debug node/miyadav-2708-hptnr-worker-h8nmj
Starting pod/miyadav-2708-hptnr-worker-h8nmj-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.39
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# systemctl stop kubelet

Removing debug pod ...
[miyadav@miyadav vsphere]$ oc delete machine miyadav-2708-hptnr-worker-h8nmj
machine.machine.openshift.io "miyadav-2708-hptnr-worker-h8nmj" deleted

Machine controller logs (truncated):
.
.
I0827 04:18:36.039313 1 reconciler.go:284] miyadav-2708-hptnr-worker-h8nmj: node not ready, kubelet unreachable for some reason. Detaching disks before vm destroy.
I0827 04:18:36.053559 1 reconciler.go:792] miyadav-2708-hptnr-worker-h8nmj: Updating provider status
I0827 04:18:36.057589 1 machine_scope.go:102] miyadav-2708-hptnr-worker-h8nmj: patching machine
E0827 04:18:36.082391 1 actuator.go:57] miyadav-2708-hptnr-worker-h8nmj error: miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling
E0827 04:18:36.082442 1 controller.go:239] miyadav-2708-hptnr-worker-h8nmj: failed to delete machine: miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling
E0827 04:18:36.082486 1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling" "name"="miyadav-2708-hptnr-worker-h8nmj" "namespace"="openshift-machine-api"
.
.
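For reference, the pvc.yaml used in step 1 was not included above. A minimal sketch of what it may have looked like, assuming a ReadWriteOnce claim against the cluster's default vSphere storage class (the storage class name "thin" and the 1Gi size are assumptions, not taken from the report; only the claim name pvc4 is):

# Hypothetical pvc.yaml for step 1; storageClassName and size are assumed values.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc4
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: thin
  resources:
    requests:
      storage: 1Gi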
Additional Info: Looks good to me. I will wait for some time for any inputs on the test steps; if there are no comments, I will move this to VERIFIED.
Test case looks good to me. The only suggestion I would add is to check the PVC/disk to make sure it is still OK, e.g. check that it still exists in vCenter and that no errors are reported on the PVC object.
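A sketch of how the cluster-side part of that check could be done, run in the project where the PVC was created (the commands are standard oc usage; no output from the test cluster is shown here):

[miyadav@miyadav vsphere]$ oc get pvc pvc4
[miyadav@miyadav vsphere]$ oc describe pvc pvc4

The first command confirms the claim still exists and is Bound; the Events section in the describe output should be free of attach/detach or deletion errors.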
Thanks @Joel, I checked from the vSphere side as well; the vmdk persisted even after the machine was deleted and a new machine was provisioned in its place. Moving to VERIFIED.

Validated on a different cluster today; even after deleting the machine, the disk still exists:

[miyadav@miyadav ~]$ govc datastore.ls -l '5137595f-7ce3-e95a-5c03-06d835dea807' | grep 'miyadav-2708'
  12.0MB  Mon Aug 30 05:45:14 2021  miyadav-2708-htqh4-dyn-pvc-413e0eaa-549c-4aa4-b969-bbc96550a6d3.vmdk
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759