Description of problem (please be as detailed as possible and provide log snippets):

When the csi-cephfsplugin-provisioner pod is deleted while a set of CephFS PVCs is being deleted, one PV remained in Released state. This issue seems to be the same as bug 1793387.

PV pvc-583e7736-8876-477c-b4ed-ed82dad3f03b describe output:

Name:            pvc-583e7736-8876-477c-b4ed-ed82dad3f03b
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-cephfs
Status:          Released
Claim:           namespace-test-89a261d05a2f4a768c27bf1777c5bd6d/pvc-test-45e7c874f21b49b5983feb46c061a30e
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        3Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.cephfs.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       fsName=ocs-storagecluster-cephfilesystem
                       storage.kubernetes.io/csiProvisionerIdentity=1595514090126-8081-openshift-storage.cephfs.csi.ceph.com
Events:
  Type     Reason              Age                 From                                                                                                                      Message
  ----     ------              ----                ----                                                                                                                      -------
  Warning  VolumeFailedDelete  32s (x8 over 104s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-584f787449-78qp9_dbeafef8-8f5d-4127-a83a-dddd74265c73  rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi -m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]

csi-cephfsplugin container log from the csi-cephfsplugin-provisioner-584f787449-78qp9 pod:
2020-07-23T19:57:11.839462083Z I0723 19:57:11.839419 1 utils.go:157] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 GRPC call: /csi.v1.Controller/DeleteVolume
2020-07-23T19:57:11.839817167Z I0723 19:57:11.839442 1 utils.go:158] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016"}
2020-07-23T19:57:11.839992954Z I0723 19:57:11.839972 1 util.go:48] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 cephfs: EXEC ceph [-m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs dump --format=json]
2020-07-23T19:57:12.172265247Z I0723 19:57:12.172218 1 util.go:48] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 cephfs: EXEC ceph [-m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs ls --format=json]
2020-07-23T19:57:12.922403083Z E0723 19:57:12.922362 1 volume.go:75] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 failed to get the rootpath for the vol csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016(an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi -m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***])
2020-07-23T19:57:12.9224392Z E0723 19:57:12.922421 1 utils.go:161] ID: 28 Req-ID: 0001-0011-openshift-storage-0000000000000001-185cfa6b-cd1d-11ea-ae58-0a580a830016 GRPC error: rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi -m 172.30.193.176:6789,172.30.56.46:6789,172.30.19.132:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-t4c/jnk-ai3c33-t4c_20200723T134130/logs/failed_testcase_ocs_logs_1595514856/test_disruptive_during_pod_pvc_deletion%5bCephFileSystem-delete_pvcs-cephfsplugin_provisioner%5d_ocs_logs/

Version of all relevant components (if applicable):
Cluster Version: 4.4.0-0.nightly-2020-07-23-025224
OCS operator: v4.4.2-503.ci
CSI Driver version: release-4.4, Git version 6057d566b2d94c19a996869613ec7eb7530275e4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Reporting first instance

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Seems to be the same as bug 1793387

Steps to Reproduce:
1. Create 12 CephFS PVCs and verify they are Bound (PV reclaim policy 'Delete').
2. Start deleting the PVCs in a loop.
3. While step 2 is progressing, delete the csi-cephfsplugin-provisioner leader pod.
4. Wait for the PVCs to be deleted.
5. Wait for the PVs to be deleted. (A shell sketch of these steps is included under Additional info below.)

This test case is automated:
tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-cephfsplugin_provisioner]

Actual results:
One PV remained in Released state due to the "VolumeFailedDelete" error.

Expected results:
PVs should be deleted.

Additional info:
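For reference, a rough shell sketch of the reproduction steps above. The automated test linked above performs the equivalent via ocs-ci; the namespace and PVC names here are illustrative, while the storage class, access mode and size match the PV shown in the description.

# Rough reproduction sketch (illustrative names)
oc new-project pvc-delete-test

# Step 1: create 12 CephFS PVCs and verify they reach Bound
for i in $(seq 1 12); do
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-$i
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
EOF
done
oc get pvc            # all PVCs should show Bound

# Step 2: start deleting the PVCs in a loop (in the background)
for i in $(seq 1 12); do oc delete pvc cephfs-pvc-$i --wait=false; done &

# Step 3: while the deletions are in flight, delete the provisioner leader pod
# (the leader can be identified from the provisioner pod logs / leases)
oc -n openshift-storage delete pod <csi-cephfsplugin-provisioner-leader-pod>

# Steps 4 and 5: wait for the PVCs and their PVs to be deleted
oc get pvc
oc get pv | grep ocs-storagecluster-cephfs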
Yes, it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1793387 because here also deletion is failing with ENOENT. @Jilju, do we have the system intact? Can we check whether the subvolume (csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a83001) is present or not on the backing cephfs volume?
(In reply to Mudit Agarwal from comment #3)
> Yes, it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1793387
> because here also deletion is failing with ENOENT.
>
> @Jilju, do we have the system intact?

Sorry, the cluster is not available now. It was destroyed after automation execution.

> Can we check whether the subvolume
> (csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a83001) is present or not on the
> backing cephfs volume?
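For reference, had the cluster still been available, a check along the following lines (run from the Rook toolbox, assuming the rook-ceph-tools pod is deployed) would have shown whether the subvolume was still present on the backing CephFS volume:

# Open a shell in the toolbox pod
oc -n openshift-storage rsh "$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)"

# List the CSI subvolumes on the backing CephFS volume
ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi

# Or query the specific subvolume; an ENOENT error here means it is already gone
ceph fs subvolume getpath ocs-storagecluster-cephfilesystem \
    csi-vol-185cfa6b-cd1d-11ea-ae58-0a580a830016 --group_name csi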
In some ceph versions, if the subvolume is not present, ceph returns a "does not exist" error message, and in other versions a "not found" error message.

In ceph 14.2.10:

sh-4.2# ceph version
ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
sh-4.2# ceph fs subvolume getpath myfs csi-vol-a24a3d97-c7f4-11ea-8cfc-0242ac110012 --group_name csi
Error ENOENT: subvolume 'csi-vol-a24a3d97-c7f4-11ea-8cfc-0242ac110012' does not exist

In ceph 14.2.4:

sh-4.2# ceph version
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
sh-4.2# ceph fs subvolume getpath myfs testing --group_name=csi
Error ENOENT: Subvolume 'testing' not found

This is a regression on the ceph fs core side; we have fixed ceph-csi to handle both cases in https://github.com/ceph/ceph-csi/pull/1247. Will backport it to downstream.
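For illustration only (the actual fix is Go code in ceph-csi PR 1247): the idea is that either error string should be recognized as ENOENT, i.e. the subvolume is already gone, so DeleteVolume can be treated as successful instead of leaving the PV in Released state. A minimal shell sketch of that behaviour, reusing the example names above:

# Sketch only: treat both "does not exist" and "not found" as "already deleted"
out=$(ceph fs subvolume getpath myfs csi-vol-a24a3d97-c7f4-11ea-8cfc-0242ac110012 --group_name csi 2>&1)
rc=$?
if [ "$rc" -ne 0 ]; then
    case "$out" in
        *"does not exist"*|*"not found"*)
            echo "subvolume already absent; treat the delete as successful" ;;
        *)
            echo "unexpected failure: $out" >&2
            exit "$rc" ;;
    esac
fi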
Thanks, Madhu. This can be hit depending on the ceph version, but it is a corner case and would mostly be hit in a disruptive scenario like this one. Not a blocker for OCS 4.5; it can be pushed to OCS 4.6.
Downstream PRs for both 4.4 [1] and 4.5 [2]:

[1] https://github.com/openshift/ceph-csi/pull/5
[2] https://github.com/openshift/ceph-csi/pull/6
Yes, the above should fix the issue.
Providing the devel_ack because it is a regression and we already have a simple fix for the issue.

Also, it looks like this particular test case will always fail with this ceph version, which makes it a test blocker?

@Madhu, please wait for all the acks before merging.
Verified in version:

Cluster Version: 4.5.0-0.nightly-2020-08-03-123303
OCS operator: v4.5.0-515.ci
rook_csi_ceph: cephcsi@sha256:244099ffc77fe965cd258e105aeff127de08673830a679ecb2525d9220e161fb
rook_ceph: rook-ceph@sha256:6aaf689232cb7fcb44e37dc1c34b17c7cc81d5fe244cfb4277fafdb5a3865ee4

Executed test cases:

tests/manage/pv_services/test_pvc_disruptive.py::TestPVCDisruption::test_pvc_disruptive[CephFileSystem-create_pvc-cephfsplugin_provisioner]
tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-cephfsplugin_provisioner]
tests/manage/pv_services/test_resource_deletion_during_pvc_pod_creation_and_io.py::TestResourceDeletionDuringCreationOperations::test_resource_deletion_during_pvc_pod_creation_and_io[CephFileSystem-cephfsplugin_provisioner]
tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-cephfsplugin_provisioner]

Test run: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10551/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754