Bug 1869330
Summary: | Deletion of PVC while performing Ceph/OCS pod deletion leaves behind PV in Released state | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Rachael <rgeorge>
Component: | csi-driver | Assignee: | Madhu Rajanna <mrajanna>
Status: | CLOSED ERRATA | QA Contact: | Oded <oviner>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.5 | CC: | ebenahar, hchiramm, jijoy, madam, mrajanna, muagarwa, nberry, ocs-bugs
Target Milestone: | --- | |
Target Release: | OCS 4.5.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | 4.5.0-64.ci | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-09-15 10:18:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Comment 5
Neha Berry
2020-08-17 16:43:25 UTC

Proposing as a blocker, as this issue was seen in two different test cases in tier4 executions, and it would help to get an initial analysis.

(In reply to Neha Berry from comment #5)
> Proposing as a blocker as this issue was seen in two different test cases
> for tier4 executions and it would help to get an initial analysis

I'm sorry, but I'm really missing an understanding here of why this is even considered a blocker. Would you like to defer a release until this is fixed? Why? What's the user impact? What's the frequency? How hard is it to recover from it?

Not a blocker. Simple fix, the PR is already there.

(In reply to Mudit Agarwal from comment #12)
> Not a blocker. Simple fix, PR is already there.

True. The backport is merged too. This should not be a blocker, though.

Hi Madhu,

Can you please confirm whether the root cause is the same in the case below? This time it is another test case, which kills one MDS daemon while PVC deletion is in progress.

Test case: pv_services.test_daemon_kill_during_pvc_pod_deletion_and_io.TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mds]

The test case failed due to a timeout while waiting for PV pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c to be deleted:

E TimeoutError: Timeout when waiting for pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c to delete. Describe output:
E Name:            pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c
E Labels:          <none>
E Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
E Finalizers:      [kubernetes.io/pv-protection]
E StorageClass:    ocs-storagecluster-cephfs
E Status:          Released
E Claim:           namespace-test-539904cade1f4ea6b7e0b7fd60e74e6d/pvc-test-cd1e10c8a9a447fa82c2657eaefda116
E Reclaim Policy:  Delete
E Access Modes:    RWX
E VolumeMode:      Filesystem
E Capacity:        3Gi
E Node Affinity:   <none>
E Message:
E Source:
E     Type:              CSI (a Container Storage Interface (CSI) volume source)
E     Driver:            openshift-storage.cephfs.csi.ceph.com
E     FSType:            ext4
E     VolumeHandle:      0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212
E     ReadOnly:          false
E     VolumeAttributes:  clusterID=openshift-storage
E                        fsName=ocs-storagecluster-cephfilesystem
E                        storage.kubernetes.io/csiProvisionerIdentity=1597555290158-8081-openshift-storage.cephfs.csi.ceph.com
E Events:
E   Type     Reason              Age                    From                                                                                                                      Message
E   ----     ------              ----                   ----                                                                                                                      -------
E   Warning  VolumeFailedDelete  7m24s                  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-c748c89bf-b6bj6_658b2045-2da6-4e75-a8c6-e2f400f2b1ba  rpc error: code = DeadlineExceeded desc = context deadline exceeded
E   Warning  VolumeFailedDelete  6m39s (x6 over 7m23s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-c748c89bf-b6bj6_658b2045-2da6-4e75-a8c6-e2f400f2b1ba  rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212 already exists
E   Warning  VolumeFailedDelete  2m50s (x3 over 6m6s)   openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-c748c89bf-b6bj6_658b2045-2da6-4e75-a8c6-e2f400f2b1ba  rpc error: code = Aborted desc = an operation with the given Volume ID pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c already exists

ocs_ci/ocs/ocp.py:655: TimeoutError

I don't see the "subvolume 'csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212' does not exist" error in the PV describe output, but it is present in the csi-cephfsplugin container log of csi-cephfsplugin-provisioner-c748c89bf-b6bj6.
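For context on the repeated `Aborted` events above: ceph-csi keeps a per-volume-ID operation guard, so while one DeleteVolume call for a volume is still in flight (here, stuck on the missing subvolume), every retried delete for the same ID is rejected with "an operation with the given Volume ID ... already exists". Below is a minimal Python sketch of that kind of guard; it is purely illustrative (the real implementation lives in Go inside ceph-csi, and the `VolumeLocks`/`try_acquire` names here are hypothetical):

```python
import threading

class VolumeLocks:
    """Illustrative per-volume-ID operation guard, modeled on the
    behavior visible in the provisioner events: a second operation
    on the same volume ID is rejected while the first is in flight."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, volume_id: str) -> bool:
        """Return True if no other operation currently holds this ID."""
        with self._mutex:
            if volume_id in self._in_flight:
                return False
            self._in_flight.add(volume_id)
            return True

    def release(self, volume_id: str) -> None:
        with self._mutex:
            self._in_flight.discard(volume_id)


locks = VolumeLocks()

def delete_volume(volume_id: str) -> None:
    if not locks.try_acquire(volume_id):
        # This is the condition the CSI sidecar surfaces as:
        # rpc error: code = Aborted desc = an operation with the
        # given Volume ID <id> already exists
        raise RuntimeError(
            f"an operation with the given Volume ID {volume_id} already exists")
    try:
        pass  # the actual subvolume deletion would happen here
    finally:
        locks.release(volume_id)
```

This also explains why the events alternate between DeadlineExceeded and Aborted: the first delete times out at the gRPC deadline but keeps holding the guard, so the external provisioner's retries are aborted immediately.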
2020-08-16T19:31:04.938542271Z E0816 19:31:04.938488 1 volume.go:82] ID: 1768 Req-ID: 0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212 failed to get the rootpath for the vol csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212 (an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212 --group_name csi -m 172.30.103.71:6789,172.30.230.64:6789,172.30.195.226:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]) stdError Error ENOENT: subvolume 'csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212' does not exist

2020-08-16T19:31:04.938598891Z E0816 19:31:04.938561 1 utils.go:163] ID: 1768 Req-ID: 0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c already exists

must-gather - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/prsurve-slave-14/prsurve-slave-14_20200816T040807/logs/failed_testcase_ocs_logs_1597555947/test_daemon_kill_during_pvc_pod_deletion_and_io%5bCephFileSystem-mds%5d_ocs_logs/

Test case log - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/prsurve-slave-14/prsurve-slave-14_20200816T040807/logs/ocs-ci-logs-1597555947/by_outcome/failed/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py/TestDaemonKillDuringMultipleDeleteOperations/test_daemon_kill_during_pvc_pod_deletion_and_io-CephFileSystem-mds/

Keywords: Regression
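The log above shows the likely shape of the underlying problem: the delete path errors out when the backing subvolume is already gone (ENOENT), even though the CSI spec requires DeleteVolume to be idempotent, i.e. deleting an already-deleted volume must succeed. I have not traced the exact fix PR here, but the general shape of such a fix can be sketched in Python; the `delete_subvolume` helper below is hypothetical (the real driver is written in Go and uses the ceph CLI/APIs directly rather than shelling out like this):

```python
import errno
import subprocess

def delete_subvolume(fs_name: str, subvol: str, group: str = "csi") -> None:
    """Hypothetical idempotent delete: if the subvolume is already gone
    (ENOENT, i.e. exit status 2 from the ceph CLI as in the log above),
    treat the delete as successful instead of failing."""
    result = subprocess.run(
        ["ceph", "fs", "subvolume", "rm", fs_name, subvol,
         "--group_name", group],
        capture_output=True, text=True)
    if result.returncode == 0:
        return
    if result.returncode == errno.ENOENT and "does not exist" in result.stderr:
        # Subvolume was already removed (e.g. by an earlier, partially
        # completed delete) -- report success so the PV can be reclaimed
        # instead of staying in Released.
        return
    raise RuntimeError(f"subvolume rm failed: {result.stderr.strip()}")
```

With a non-idempotent delete, a PV whose subvolume was removed in a partially completed earlier attempt can never finish its Delete reclaim and stays in Released, which matches the symptom in this bug's summary.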
Elad, I would like to point out that this is NOT a regression. This code has existed since OCS 4.2 and has been carried over in every release to date. The issue is not hit in most cases and is visible only in some corner scenarios; that is why neither customers nor QE hit it, even though we qualified all previous releases with the same issue present.

Can you please remove the regression flag from this bug?

Please share if you have any questions on this.
These test cases pass:

https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py
Test cases:
pytest.param(*[constants.CEPHFILESYSTEM, 'delete_pvcs', 'mgr'], polarion_id-OCS-920)
pytest.param(*[constants.CEPHFILESYSTEM, 'delete_pvcs', 'cephfsplugin_provisioner'], polarion_id-OCS-951, bugzilla-1860891)

https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py
Test cases:
pytest.param(*[constants.CEPHFILESYSTEM, 'mgr'], polarion_id-OCS-813)
pytest.param(*[constants.CEPHFILESYSTEM, 'cephfsplugin'], polarion_id-OCS-1012)

SetUp:
Provider: VMware
OCP Version: 4.5.0-0.nightly-2020-08-27-074254
OCS Version: 4.5.0-67.ci

Test Process:

1. Create 12 PVCs:
$ oc get pvc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-test-0947d77312f6427ca5216cc8ea987e98   Bound    pvc-3750674b-12f6-4b45-8963-649a11b86f55   3Gi        RWO            ocs-storagecluster-cephfs   6m21s
pvc-test-0bf78014db2c4f45bd53f955f0a74434   Bound    pvc-03cfc44e-d9b0-4b75-b9ef-9542527986ad   3Gi        RWO            ocs-storagecluster-cephfs   6m22s
pvc-test-39ff51d48b9c47ce9c528cd398518f22   Bound    pvc-b00c1875-9648-4b1e-b18e-b19564047356   3Gi        RWO            ocs-storagecluster-cephfs   6m19s
pvc-test-67b89b5ae93a47ea88a385ab24986ba1   Bound    pvc-f92bfcdb-6445-4431-b603-6ea72396b4f1   3Gi        RWX            ocs-storagecluster-cephfs   6m14s
pvc-test-77952c0b13da438989711d3b4606db5d   Bound    pvc-39baab3b-f76e-4d1b-acf5-da325c3aba82   3Gi        RWX            ocs-storagecluster-cephfs   6m13s
pvc-test-80ae0cf8ff1e4b87b4ca4660b0de08be   Bound    pvc-1b8e78d3-265c-47b7-bf86-946232f182bb   3Gi        RWX            ocs-storagecluster-cephfs   6m12s
pvc-test-9925bf805ef9497f8180ce6a72ca2936   Bound    pvc-1b88c3a0-a414-4bc3-acc3-e0219378c5c2   3Gi        RWX            ocs-storagecluster-cephfs   6m17s
pvc-test-bc291f3364474e6abb8c157da8afdbff   Bound    pvc-665b77d2-5ea7-49aa-b20d-daf21cc6bbfa   3Gi        RWO            ocs-storagecluster-cephfs   6m20s
pvc-test-bc4c173bae9a4b9bbb21dae77e837b6a   Bound    pvc-a5e65bda-c9ca-4f84-b4bb-0cca3e54d25e   3Gi        RWO            ocs-storagecluster-cephfs   6m18s
pvc-test-bd60536efe304bc1805a140f739fbe0b   Bound    pvc-746a2fe2-bb4f-4b75-86d6-73e08153c66f   3Gi        RWO            ocs-storagecluster-cephfs   6m23s
pvc-test-c5fb4bccdbdd400e8860d9ffc1fb7fb7   Bound    pvc-bd570231-a32f-4ce9-8bfb-9fe9e5ee1f96   3Gi        RWX            ocs-storagecluster-cephfs   6m16s
pvc-test-ddb63817a30141f989dc75de2c0db261   Bound    pvc-e4fcfc66-a20e-4b74-91b3-ecf0696f29cd   3Gi        RWX            ocs-storagecluster-cephfs   6m15s

2. Create 18 pods [image: nginx]:
$ oc get pods -n namespace-test-9531f6e7b986489e9b91d98e06ef2616
NAME                                               READY   STATUS    RESTARTS   AGE
pod-test-cephfs-1889c2ba8a2b4520bf36fc0b8fc99766   1/1     Running   0          7m33s
pod-test-cephfs-1eb293004907409a841c5f0a605c3014   1/1     Running   0          4m30s
pod-test-cephfs-287dd71e19334f50a17890b144a9b66f   1/1     Running   0          5m40s
pod-test-cephfs-31afce50e02f4995a0fe6933ad6190d8   1/1     Running   0          8m21s
pod-test-cephfs-3c6bece91a884518a7414cbc2f3d885c   1/1     Running   0          7m52s
pod-test-cephfs-3d1e81c26d4249dca591d4e520e77ada   1/1     Running   0          6m28s
pod-test-cephfs-4868ae1c97d140a69f7f06ba67e08579   1/1     Running   0          9m14s
pod-test-cephfs-5264688cb5024840a34dd8816b084009   1/1     Running   0          4m48s
pod-test-cephfs-7422ebfecb03482c8e4566d9ae785f8d   1/1     Running   0          7m19s
pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33   1/1     Running   0          9m33s
pod-test-cephfs-88e9425699b7461eb3edf1e6f5ab6a40   1/1     Running   0          5m2s
pod-test-cephfs-90d3f55bd60140d5a5b2ad496d1148d4   1/1     Running   0          6m14s
pod-test-cephfs-90eb64578702403f93b83ccc5181ed7b   1/1     Running   0          7m
pod-test-cephfs-a4c8ca74d4ef497aa4bedef2e3e022f7   1/1     Running   0          5m59s
pod-test-cephfs-c417cd7f54734439af22aeb79a55e298   1/1     Running   0          3m50s
pod-test-cephfs-e3af369d1a224240a7291a8dbd69c2a3   1/1     Running   0          8m40s
pod-test-cephfs-f21f92c6f74b4ba1848e1d315e43c735   1/1     Running   0          5m16s
pod-test-cephfs-f6866c1e18dc49019a424991dc3d719d   1/1     Running   0          4m9s

3. Install FIO on all pods:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 rsh pod-test-cephfs-3c6bece91a884518a7414cbc2f3d885c
* which apt-get
* apt-get update
* apt-get -y install fio

4. Run the FIO command on all pods:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 rsh pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33 fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33_io --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

5. Delete all 18 pods:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 delete Pod pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33 --wait=false

6. Check pod status:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 get Pod pod-test-cephfs-4868ae1c97d140a69f7f06ba67e08579

7. Verify that the mount points are removed from the nodes after deleting the pods:
Starting pod/compute-0-debug ...
To use host binaries, run `chroot /host`
17:47:15 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc debug nodes/compute-2 -- df
17:47:26 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: Starting pod/compute-2-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...

8. Fetch the image UUID associated with each PVC:
oc get pv/pvc-746a2fe2-bb4f-4b75-86d6-73e08153c66f -o jsonpath='{.spec.csi.volumeHandle}'
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 get PersistentVolumeClaim -o yaml

9. Delete PVCs:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 delete PersistentVolumeClaim pvc-test-bd60536efe304bc1805a140f739fbe0b

10. Delete the rook-ceph-mgr-a pod [in parallel with step 9]:
oc -n openshift-storage delete Pod rook-ceph-mgr-a-6944ff7f79-8cxc7 --grace-period=0 --force

11. Check PVC in Bound state:
oc get PersistentVolume pvc-1b8e78d3-265c-47b7-bf86-946232f182bb -o yaml

12. Go to the Ceph tools pod and run the "ceph fs" command:
oc -n openshift-storage rsh rook-ceph-tools-5b9cbc586c-m9whq
ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-4beef4d8-e86f-11ea-9e5a-0a580a80020a csi --format json

13. Check Ceph health:
Ceph cluster health is HEALTH_OK.

This test case also passes:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_creation_and_io.py
pytest.param(*[constants.CEPHFILESYSTEM, 'mgr'], polarion_id-OCS-1108)

SetUp:
Provider: VMware
OCP Version: 4.5.0-0.nightly-2020-08-27-074254
OCS Version: 4.5.0-67.ci
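For anyone reproducing this verification outside ocs-ci, the essential pass/fail criterion in the runs above is that each PV backing a deleted PVC leaves the Released state and disappears within a bounded time. A minimal standalone sketch of that check, assuming an `oc` client configured against the cluster (the helper name, timeout, and polling interval are illustrative, not ocs-ci's actual API):

```python
import subprocess
import time

def wait_for_pv_delete(pv_name: str, timeout: int = 180, interval: int = 10) -> None:
    """Poll until the PV is gone; raise TimeoutError with the describe
    output (as the failed ocs-ci run above did) if it still exists."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["oc", "get", "pv", pv_name],
            capture_output=True, text=True)
        if result.returncode != 0 and "not found" in result.stderr.lower():
            return  # PV deleted -- reclaim completed successfully
        time.sleep(interval)
    describe = subprocess.run(
        ["oc", "describe", "pv", pv_name],
        capture_output=True, text=True).stdout
    raise TimeoutError(
        f"Timeout when waiting for {pv_name} to delete. "
        f"Describe output: {describe}")

# Example, using the PV left behind in the failed MDS-kill run above:
# wait_for_pv_delete("pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c")
```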
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days