Description of problem (please be as detailed as possible and provide log snippets):

When an RBD PVC is being thick provisioned (PVC in Pending state), if the RBD provisioner leader and the PVC itself are deleted, the RBD image will not be deleted.

Test case error:

Error Details
AssertionError: Wait timeout: RBD image ['csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be'] is not deleted. Check the logs to ensure that this is the stale image of the deleted PVC.

Logs from the test case showing the "rbd du" output of the image:

20:05:58 - MainThread - tests.manage.pv_services.test_delete_pvc_while_provisioning - INFO - rbd du output of the image csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be:
NAME                                          PROVISIONED  USED
csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be       15 GiB  380 MiB

# rbd info -p ocs-storagecluster-cephblockpool csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be
rbd image 'csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be':
        size 15 GiB in 3840 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 4974e4cec38bf
        block_name_prefix: rbd_data.4974e4cec38bf
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Thu May 20 20:05:47 2021
        access_timestamp: Thu May 20 20:05:47 2021
        modify_timestamp: Thu May 20 20:05:47 2021

# rbd du -p ocs-storagecluster-cephblockpool csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be
warning: fast-diff map is not enabled for csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be. operation may be slow.
NAME                                          PROVISIONED  USED
csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be       15 GiB  380 MiB

Deleted PVC name: pvc-test-54307ed030f54a6297181b8db280659, namespace: namespace-test-be58187cd3134d74b4ac04027

Test case: tests/manage/pv_services/test_delete_pvc_while_provisioning.py::TestDeletePvcWhileProvisioning::test_delete_rbd_pvc_while_thick_provisioning[rbdplugin_provisioner]

OCS and OCP must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-17may/jijoy-17may_20210517T110547/logs/failed_testcase_ocs_logs_1621541014/test_delete_rbd_pvc_while_thick_provisioning%5brbdplugin_provisioner%5d_ocs_logs/

Test case log:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-17may/jijoy-17may_20210517T110547/logs/ocs-ci-logs-1621541014/tests/manage/pv_services/test_delete_pvc_while_provisioning.py/TestDeletePvcWhileProvisioning/test_delete_rbd_pvc_while_thick_provisioning-rbdplugin_provisioner

===============================================================================================================

Version of all relevant components (if applicable):
OCS 4.8.0-394.ci
OCP 4.8.0-0.nightly-2021-05-15-141455
Ceph Version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
rook_csi_provisioner ose-csi-external-provisioner@sha256:0d1cab421c433c213d37043dd0dbaa6a2942ccf1d21d35afc32e35ce8216ddec

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
A partially thick provisioned stale RBD image will be left behind.

Is there any workaround available to the best of your knowledge?
Delete the RBD image manually.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:
RBD thick provisioning is a new feature in OCS 4.8.

Steps to Reproduce:
1. Start creating an RBD PVC of size 15 GiB. Use a thick-provision-enabled storage class.
2. While step 1 is in progress (PVC in Pending state), delete the csi-rbdplugin-provisioner leader pod.
3. Immediately after step 2 (PVC still in Pending state), delete the PVC.
4. Wait for the PVC to get deleted.
5. Wait for the corresponding RBD image to get deleted.

Or run the ocs-ci test case tests/manage/pv_services/test_delete_pvc_while_provisioning.py::TestDeletePvcWhileProvisioning::test_delete_rbd_pvc_while_thick_provisioning[rbdplugin_provisioner]

Actual results:
PVC is deleted. The RBD image is not deleted.

Expected results:
The RBD image should be deleted.

Additional info:
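The steps above can be sketched as a script. This is a dry-run sketch, not part of ocs-ci: the StorageClass name, PVC name, and label selector are assumptions taken from typical ODF deployments — verify them on your cluster. By default the script only prints the `oc` commands; set OC=oc to execute against a live cluster.

```shell
#!/bin/sh
# Dry-run sketch of reproduction steps 1-5. Set OC=oc to run for real;
# names below (StorageClass, PVC, label selector) are illustrative assumptions.
: "${OC:=echo oc}"

# Step 1: start creating a 15 GiB PVC with a thick-provision-enabled class.
cat > /tmp/pvc-thick-race.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-thick-race
  namespace: default
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 15Gi
  storageClassName: ocs-storagecluster-ceph-rbd-thick
EOF
$OC apply -f /tmp/pvc-thick-race.yaml

# Step 2: while the PVC is still Pending, delete the provisioner leader pod
# (deleting all csi-rbdplugin-provisioner pods also takes out the leader).
$OC -n openshift-storage delete pod -l app=csi-rbdplugin-provisioner --wait=false

# Step 3: immediately delete the still-Pending PVC.
$OC -n default delete pvc pvc-thick-race --wait=false

# Steps 4-5: after the PVC is gone, a leftover csi-vol-* image in
# ocs-storagecluster-cephblockpool (check with rbd ls / rbd du from the
# ceph tools pod) reproduces the bug.
```

The race only manifests when the provisioner restart and the PVC deletion both land inside the thick-provisioning window, so steps 2 and 3 need to run back to back.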
Jilju, this should happen without thick provisioning also (though the window is very small). Is it possible to verify?
(In reply to Mudit Agarwal from comment #3)
> Jilju, this should happen without thick provisioning also (though the window
> is very small)
> Is it possible to verify?

Hi Mudit,

I tried to test this multiple times. The provisioning was very quick.

I0527 13:02:45.222042 1 controller.go:1335] provision "default/pvc-test-2" class "ocs-storagecluster-ceph-rbd": started
I0527 13:02:45.275805 1 controller.go:1442] provision "default/pvc-test-2" class "ocs-storagecluster-ceph-rbd": volume "pvc-a83265d5-3cb7-48b4-a698-f56a79a56d7c" provisioned
I0527 13:02:45.275823 1 controller.go:1459] provision "default/pvc-test-2" class "ocs-storagecluster-ceph-rbd": succeeded

So I was not able to delete the PVC before the provisioning completed.

Tested in version: ocs-operator.v4.8.0-399.ci
As mentioned by Madhu in https://bugzilla.redhat.com/show_bug.cgi?id=1962956#c2, this is not something that can be handled by the ceph-csi driver. This is an existing issue that becomes more visible with thick provisioning. Moving it out for now; we need to raise a bug against OCP and see if we can fix it.
The problem arises only when both the provisioner and the pending PVC are deleted. A PVC can be deleted while thick provisioning is in progress. But the suggested workaround gives the impression that a thick PVC cannot be deleted at all while provisioning is in progress. It is true that we do not expect the user to check whether the provisioner was respun before trying to delete a Pending PVC. The actual information we want to convey is: "Do not delete a Pending RBD thick PVC if the active RBD provisioner pod was restarted after the start of PVC creation." The above statement can be rephrased, but this is the actual situation that can leave a stale image.
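The precaution described above can be sketched as a quick check: compare the provisioner pods' start times with the PVC's creation timestamp before deleting a Pending thick PVC. This is a hypothetical helper, not documented procedure; the PVC name, namespace, and label selector are assumptions for illustration. Dry-run by default; set OC=oc to run against a live cluster.

```shell
#!/bin/sh
# Sketch: detect whether the provisioner was respun after PVC creation.
# Dry-run by default (prints the oc commands); set OC=oc to execute.
# pvc-test/default and the label selector are illustrative assumptions.
: "${OC:=echo oc}"

# When was the PVC created?
$OC get pvc pvc-test -n default -o jsonpath='{.metadata.creationTimestamp}'

# When did the provisioner pods start? Any startTime newer than the PVC's
# creationTimestamp means the provisioner was restarted mid-provision, and
# deleting the still-Pending PVC now risks leaving a stale image (this bug).
$OC get pods -n openshift-storage -l app=csi-rbdplugin-provisioner \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.startTime}{"\n"}{end}'
```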
I see Jilju responded. Clearing needinfo on me.
In comment #5, Mudit states that a bug needs to be created against OCP to get a resolution to this issue. Has that happened?
(In reply to Adam Litke from comment #13)
> In comment #5, Mudit states that a bug needs to be created against OCP to
> get a resolution to this issue. Has that happened?

Not yet, Mudit is going to take care of that (today?).
Raised https://bugzilla.redhat.com/show_bug.cgi?id=1990428
This is not something ceph-csi can address. This bug is a tracker for the OCP distribution that ODF builds upon; the affected components will be changed accordingly.
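Until the OCP-side fix lands, the only remedy for images left by this race is the manual-cleanup workaround from the description. A minimal offline sketch of finding cleanup candidates: diff the image names from `rbd ls <pool>` against the csi-vol suffixes of the existing PVs' volumeHandles; anything in the pool that no PV owns is stale. The helper is hypothetical (not part of ocs-ci), and sample data stands in for live cluster output here.

```shell
#!/bin/sh
# Hypothetical stale-image finder for the manual-cleanup workaround.
# /tmp/rbd_images.txt stands in for `rbd ls ocs-storagecluster-cephblockpool`
# output; /tmp/pv_handles.txt for the csi-vol suffixes of PV volumeHandles.
cat > /tmp/rbd_images.txt <<'EOF'
csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be
csi-vol-aaaaaaaa-0000-0000-0000-000000000000
EOF

cat > /tmp/pv_handles.txt <<'EOF'
csi-vol-aaaaaaaa-0000-0000-0000-000000000000
EOF

# comm(1) needs sorted input; -23 keeps lines unique to the first file,
# i.e. images present in the pool but owned by no PV.
sort -o /tmp/rbd_images.txt /tmp/rbd_images.txt
sort -o /tmp/pv_handles.txt /tmp/pv_handles.txt
comm -23 /tmp/rbd_images.txt /tmp/pv_handles.txt

# Each reported image can then be removed from the ceph tools pod with:
#   rbd rm -p ocs-storagecluster-cephblockpool <image>
```

Run against the sample data, the sketch reports only the c4471bfa image — the stale image from this bug — and not the PV-owned one.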