Description of problem (please be as detailed as possible and provide log snippets):

When the csi-cephfsplugin-provisioner pod is deleted while a set of CephFS PVCs is being deleted, one PV remained in Released state. This issue seems to be the same as bug 1793387.

PV pvc-583e7736-8876-477c-b4ed-ed82dad3f03b describe output:

Name:            pvc-583e7736-8876-477c-b4ed-ed82dad3f03b
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-cephfs
Status:          Released
Claim:           namespace-test-89a261d05a2f4a768c27bf1777c5bd6d/pvc-test-45e7c874f21b49b5983feb46c061a30e
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        3Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.cephfs.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       fsName=ocs-storagecluster-cephfilesystem
                       storage.kubernetes.io/csiProvisionerIdentity=1595514090126-8081-openshift-storage.cephfs.csi.ceph.com
Events:
  Type     Reason              Age                 From                                                                                                                      Message
  ----     ------              ----                ----                                                                                                                      -------
  Warning  VolumeFailedDelete  32s (x8 over 104s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-584f787449-78qp9_dbeafef8-8f5d-4127-a83a-dddd74265c73  rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi -m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]

csi-cephfsplugin container log from the csi-cephfsplugin-provisioner-584f787449-78qp9 pod:
2020-07-23T19:57:11.839462083Z I0723 19:57:11.839419 1 utils.go:157] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 GRPC call: /csi.v1.Controller/DeleteVolume
2020-07-23T19:57:11.839817167Z I0723 19:57:11.839442 1 utils.go:158] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016"}
2020-07-23T19:57:11.839992954Z I0723 19:57:11.839972 1 util.go:48] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 cephfs: EXEC ceph [-m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs dump --format=json]
2020-07-23T19:57:12.172265247Z I0723 19:57:12.172218 1 util.go:48] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 cephfs: EXEC ceph [-m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs ls --format=json]
2020-07-23T19:57:12.922403083Z E0723 19:57:12.922362 1 volume.go:75] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 failed to get the rootpath for the vol csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016(an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi -m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***])
2020-07-23T19:57:12.9224392Z E0723 19:57:12.922421 1 utils.go:161] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 GRPC error: rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi -m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-t4c/jnk-ai3c33-t4c_20200723T134130/logs/failed_testcase_ocs_logs_1595514856/test_disruptive_during_pod_pvc_deletion%5bCephFileSystem-delete_pvcs-cephfsplugin_provisioner%5d_ocs_logs/

Version of all relevant components (if applicable):
Cluster Version: 4.4.0-0.nightly-2020-07-23-025224
OCS operator: v4.4.2-503.ci
CSI Driver version: release-4.4, Git version 6057d566b2d94c19a996869613ec7eb7530275e4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Reporting first instance

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Seems to be the same as bug 1793387

Steps to Reproduce:
1. Create 12 CephFS PVCs and verify they are Bound (PV reclaim policy 'Delete').
2. Start deleting the PVCs in a loop.
3. While step 2 is progressing, delete the csi-cephfsplugin-provisioner leader pod.
4. Wait for the PVCs to be deleted.
5. Wait for the PVs to be deleted. (A shell sketch of these steps is included under Additional info below.)

This test case is automated:
tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-cephfsplugin_provisioner]

Actual results:
One PV remained in Released state due to the "VolumeFailedDelete" error.

Expected results:
PVs should be deleted.

Additional info:
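For reference, a rough shell sketch of the reproduction steps above. The automated test linked above performs the equivalent via ocs-ci; the namespace and PVC names here are illustrative, while the storage class, access mode and size match the PV shown in the description.

# Rough reproduction sketch (illustrative names)
oc new-project pvc-delete-test

# Step 1: create 12 CephFS PVCs and verify they reach Bound
for i in $(seq 1 12); do
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-$i
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
EOF
done
oc get pvc            # all PVCs should show Bound

# Step 2: start deleting the PVCs in a loop (in the background)
for i in $(seq 1 12); do oc delete pvc cephfs-pvc-$i --wait=false; done &

# Step 3: while the deletions are in flight, delete the provisioner leader pod
# (the leader can be identified from the provisioner pod logs / leases)
oc -n openshift-storage delete pod <csi-cephfsplugin-provisioner-leader-pod>

# Steps 4 and 5: wait for the PVCs and their PVs to be deleted
oc get pvc
oc get pv | grep ocs-storagecluster-cephfs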
Yes, it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1793387 because here also deletion is failing with ENOENT. @Jilju, do we have the system intact? Can we check whether the subvolume (csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a83001) is present or not on the backing cephfs volume?
(In reply to Mudit Agarwal from comment #3)
> Yes, it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1793387
> because here also deletion is failing with ENOENT.
>
> @Jilju, do we have the system intact?

Sorry, the cluster is not available now. It was destroyed after automation execution.

> Can we check whether the subvolume
> (csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a83001) is present or not on the
> backing cephfs volume?
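For reference, had the cluster still been available, a check along the following lines (run from the Rook toolbox, assuming the rook-ceph-tools pod is deployed) would have shown whether the subvolume was still present on the backing CephFS volume:

# Open a shell in the toolbox pod
oc -n openshift-storage rsh "$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)"

# List the CSI subvolumes on the backing CephFS volume
ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi

# Or query the specific subvolume; an ENOENT error here means it is already gone
ceph fs subvolume getpath ocs-storagecluster-cephfilesystem \
    csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi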
In some ceph versions, if the subvolume is not present, ceph returns a "does not exist" error message, and in other versions a "not found" error message.

In ceph 14.2.10:

sh-4.2# ceph version
ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
sh-4.2# ceph fs subvolume getpath myfs csi-vol-a24a3d97-c7f4-11ea-8cfc-0242ac110012 --group_name csi
Error ENOENT: subvolume 'csi-vol-a24a3d97-c7f4-11ea-8cfc-0242ac110012' does not exist

In ceph 14.2.4:

sh-4.2# ceph version
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
sh-4.2# ceph fs subvolume getpath myfs testing --group_name=csi
Error ENOENT: Subvolume 'testing' not found

This is a regression on the ceph fs core side; we have fixed ceph-csi to handle both cases in https://github.com/ceph/ceph-csi/pull/1247. Will backport it to downstream.
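For illustration only (the actual fix is Go code in ceph-csi PR 1247): the idea is that either error string should be recognized as ENOENT, i.e. the subvolume is already gone, so DeleteVolume can be treated as successful instead of leaving the PV in Released state. A minimal shell sketch of that behaviour, reusing the example names above:

# Sketch only: treat both "does not exist" and "not found" as "already deleted"
out=$(ceph fs subvolume getpath myfs csi-vol-a24a3d97-c7f4-11ea-8cfc-0242ac110012 --group_name csi 2>&1)
rc=$?
if [ "$rc" -ne 0 ]; then
    case "$out" in
        *"does not exist"*|*"not found"*)
            echo "subvolume already absent; treat the delete as successful" ;;
        *)
            echo "unexpected failure: $out" >&2
            exit "$rc" ;;
    esac
fi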
Thanks, Madhu. This can be hit depending on the ceph version, but it is a corner case and would mostly be hit in a disruptive scenario like this one. Not a blocker for OCS 4.5; it can be pushed to OCS 4.6.
Downstream PRs for both 4.4 [1] and 4.5 [2]:

[1] https://github.com/openshift/ceph-csi/pull/5
[2] https://github.com/openshift/ceph-csi/pull/6
Yes, the above should fix the issue.
Providing the devel_ack because it is a regression and we already have a simple fix for the issue.

Also, it looks like this particular test case will always fail with this ceph version, which makes it a test blocker?

@Madhu, please wait for all the acks before merging.
Verified in version:

Cluster Version: 4.5.0-0.nightly-2020-08-03-123303
OCS operator: v4.5.0-515.ci
rook_csi_ceph: cephcsi@sha256:244099ffc77fe965cd258e105aeff127de08673830a679ecb2525d9220e161fb
rook_ceph: rook-ceph@sha256:6aaf689232cb7fcb44e37dc1c34b17c7cc81d5fe244cfb4277fafdb5a3865ee4

Executed test cases:

tests/manage/pv_services/test_pvc_disruptive.py::TestPVCDisruption::test_pvc_disruptive[CephFileSystem-create_pvc-cephfsplugin_provisioner]
tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-cephfsplugin_provisioner]
tests/manage/pv_services/test_resource_deletion_during_pvc_pod_creation_and_io.py::TestResourceDeletionDuringCreationOperations::test_resource_deletion_during_pvc_pod_creation_and_io[CephFileSystem-cephfsplugin_provisioner]
tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-cephfsplugin_provisioner]

Test run: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10551/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754