Description of problem: Pod deletion and volume detach happen asynchronously, so a pod can be deleted before its volume is detached from the node. When deleting a machine, this can cause issues for vsphere-volume: if the node is deleted before the volume detach succeeds, the underlying volume is deleted together with the Machine.

Expected results: After machine deletion its volumes should remain untouched.

Related to https://bugzilla.redhat.com/show_bug.cgi?id=1883993
Upstream issue: https://github.com/kubernetes-sigs/cluster-api/issues/4707
Validated on:

[miyadav@miyadav vsphere]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-26-040328   True        False         39m     Cluster version is 4.9.0-0.nightly-2021-08-26-040328

1. Create the PVC (the contents of pvc.yaml are not captured here; a possible example is sketched after these steps).
[miyadav@miyadav vsphere]$ oc create -f pvc.yaml
persistentvolumeclaim/pvc4 created
Result: PVC created successfully.

2. Create a deployment that uses the PVC, with the yaml below.
[miyadav@miyadav vsphere]$ oc create -f deploymentyaml.yaml
deployment.apps/dep1 created

apiVersion: apps/v1
kind: Deployment
metadata:
  name: "dep1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: "myfrontend"
        image: "quay.io/openshifttest/hello-openshift@sha256:aaea76ff622d2f8bcb32e538e7b3cd0ef6d291953f3e7c9f556c1ba5baf47e2e"
        ports:
        - containerPort: 80
          name: "http-server"
        volumeMounts:
        - mountPath: "/var/www/html"
          name: "pvol"
      volumes:
      - name: "pvol"
        persistentVolumeClaim:
          claimName: "pvc4"

Result: deployment created successfully.

3. Stop the kubelet on the node running the pod, then delete the machine that owns that node object. The machine controller logs should show the disks being detached before the VM is destroyed.

[miyadav@miyadav vsphere]$ oc get pods -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-78bf97c749-5xvkp   2/2     Running   0          38m     10.130.0.30   miyadav-2708-hptnr-master-1       <none>           <none>
cluster-baremetal-operator-688fcf9594-dvwvk    2/2     Running   0          38m     10.130.0.21   miyadav-2708-hptnr-master-1       <none>           <none>
dep1-64495756b4-sqd7c                          1/1     Running   0          4m26s   10.131.0.31   miyadav-2708-hptnr-worker-h8nmj   <none>           <none>
machine-api-controllers-7f49d8bbbb-nfj5g       7/7     Running   0          35m     10.128.0.11   miyadav-2708-hptnr-master-2       <none>           <none>
machine-api-operator-779c45669b-c8dht          2/2     Running   0          38m     10.130.0.25   miyadav-2708-hptnr-master-1       <none>           <none>

[miyadav@miyadav vsphere]$ oc debug node/miyadav-2708-hptnr-worker-h8nmj
Starting pod/miyadav-2708-hptnr-worker-h8nmj-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.39
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# systemctl stop kubelet

Removing debug pod ...
[miyadav@miyadav vsphere]$ oc delete machine miyadav-2708-hptnr-worker-h8nmj
machine.machine.openshift.io "miyadav-2708-hptnr-worker-h8nmj" deleted

Machine controller logs (truncated):
.
.
I0827 04:18:36.039313 1 reconciler.go:284] miyadav-2708-hptnr-worker-h8nmj: node not ready, kubelet unreachable for some reason. Detaching disks before vm destroy.
I0827 04:18:36.053559 1 reconciler.go:792] miyadav-2708-hptnr-worker-h8nmj: Updating provider status
I0827 04:18:36.057589 1 machine_scope.go:102] miyadav-2708-hptnr-worker-h8nmj: patching machine
E0827 04:18:36.082391 1 actuator.go:57] miyadav-2708-hptnr-worker-h8nmj error: miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling
E0827 04:18:36.082442 1 controller.go:239] miyadav-2708-hptnr-worker-h8nmj: failed to delete machine: miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling
E0827 04:18:36.082486 1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="miyadav-2708-hptnr-worker-h8nmj: reconciler failed to Delete machine: destroying vm in progress, reconciling" "name"="miyadav-2708-hptnr-worker-h8nmj" "namespace"="openshift-machine-api"
.
.
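For reference, the pvc.yaml used in step 1 was not included above. A minimal sketch of what it may have looked like, assuming a ReadWriteOnce claim against the cluster's default vSphere storage class (the storage class name "thin" and the 1Gi size are assumptions, not taken from the report; only the claim name pvc4 is):

# Hypothetical pvc.yaml for step 1; storageClassName and size are assumed values.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc4
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: thin
  resources:
    requests:
      storage: 1Gi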
Additional Info: Looks good to me. I will wait for some time for any inputs on the test steps; if there are no comments, I will move this to VERIFIED.
Test case looks good to me. The only suggestion I would add is to check the PVC/disk to make sure it is still OK, e.g. check that it still exists in vCenter and that no errors are reported on the PVC object.
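A sketch of how the cluster-side part of that check could be done, run in the project where the PVC was created (the commands are standard oc usage; no output from the test cluster is shown here):

[miyadav@miyadav vsphere]$ oc get pvc pvc4
[miyadav@miyadav vsphere]$ oc describe pvc pvc4

The first command confirms the claim still exists and is Bound; the Events section in the describe output should be free of attach/detach or deletion errors.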
Thanks @Joel, I checked from the vSphere side as well; the vmdk persisted even after the machine was deleted and a new machine was provisioned in its place. Moving to VERIFIED.

Validated on a different cluster today; even after deleting the machine, the disk still exists:

[miyadav@miyadav ~]$ govc datastore.ls -l '5137595f-7ce3-e95a-5c03-06d835dea807' | grep 'miyadav-2708'
  12.0MB  Mon Aug 30 05:45:14 2021  miyadav-2708-htqh4-dyn-pvc-413e0eaa-549c-4aa4-b969-bbc96550a6d3.vmdk
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759