Bug 1962956 - [Tracker for OCP BZ #1990428] [RBD][Thick] Deleting the PVC and RBD provisioner leader pod while thick provisioning is progressing, will leave a stale image
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: distribution
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Eran Tamir
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1990428
Blocks: 1966894
 
Reported: 2021-05-20 20:42 UTC by Jilju Joy
Modified: 2023-08-09 16:43 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Deleting the pending PVC and RBD provisioner leader pod while thick provisioning is progressing will leave a stale image and OMAP metadata

When an RBD PVC is being thick provisioned, the Persistent Volume Claim (PVC) is in a `Pending` state. If the RBD provisioner leader and the PVC itself are deleted, the RBD image and OMAP metadata will not be deleted. To avoid this issue, do not delete the PVC while the thick provisioning is in progress.
Clone Of:
Environment:
Last Closed: 2022-01-31 17:19:51 UTC
Embargoed:



Description Jilju Joy 2021-05-20 20:42:51 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When an RBD PVC is being thick provisioned (PVC in Pending state), if the RBD provisioner leader pod and the PVC itself are deleted, the RBD image will not be deleted.


Test case error:

Error Details
AssertionError: Wait timeout: RBD image ['csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be'] is not deleted. Check the logs to ensure that this is the stale image of the deleted


Logs from the test case showing the "rbd du" output of the image.

20:05:58 - MainThread - tests.manage.pv_services.test_delete_pvc_while_provisioning - INFO - rbd du output of the image csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be:
NAME                                         PROVISIONED USED
csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be      15 GiB 380 MiB


# rbd info -p ocs-storagecluster-cephblockpool csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be
rbd image 'csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be':
	size 15 GiB in 3840 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 4974e4cec38bf
	block_name_prefix: rbd_data.4974e4cec38bf
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Thu May 20 20:05:47 2021
	access_timestamp: Thu May 20 20:05:47 2021
	modify_timestamp: Thu May 20 20:05:47 2021


# rbd du -p ocs-storagecluster-cephblockpool csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be
warning: fast-diff map is not enabled for csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be. operation may be slow.
NAME                                         PROVISIONED USED    
csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be      15 GiB 380 MiB 


Deleted PVC name: pvc-test-54307ed030f54a6297181b8db280659, namespace: namespace-test-be58187cd3134d74b4ac04027


Test case : tests/manage/pv_services/test_delete_pvc_while_provisioning.py::TestDeletePvcWhileProvisioning::test_delete_rbd_pvc_while_thick_provisioning[rbdplugin_provisioner]


OCS and OCP must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-17may/jijoy-17may_20210517T110547/logs/failed_testcase_ocs_logs_1621541014/test_delete_rbd_pvc_while_thick_provisioning%5brbdplugin_provisioner%5d_ocs_logs/


Test case log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-17may/jijoy-17may_20210517T110547/logs/ocs-ci-logs-1621541014/tests/manage/pv_services/test_delete_pvc_while_provisioning.py/TestDeletePvcWhileProvisioning/test_delete_rbd_pvc_while_thick_provisioning-rbdplugin_provisioner


===============================================================================================================
Version of all relevant components (if applicable):
OCS 4.8.0-394.ci
OCP 4.8.0-0.nightly-2021-05-15-141455
Ceph Version: 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
rook_csi_provisioner: ose-csi-external-provisioner@sha256:0d1cab421c433c213d37043dd0dbaa6a2942ccf1d21d35afc32e35ce8216ddec


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
A partially thick-provisioned stale RBD image will be left in the pool.


Is there any workaround available to the best of your knowledge?
Yes. Delete the stale RBD image manually (a sketch follows below).
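For reference, a minimal sketch of that cleanup, run from the rook-ceph toolbox pod. The pool and image names are the ones from this report; rbd status and rbd rm are standard commands, but the OMAP object name in the last step is an assumption about how ceph-csi names its per-volume metadata objects, so verify with rados ls before removing anything:

# 1. Confirm the image is stale: rbd status should list no watchers
rbd status -p ocs-storagecluster-cephblockpool csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be

# 2. Remove the stale image
rbd rm -p ocs-storagecluster-cephblockpool csi-vol-c4471bfa-b9a6-11eb-9f59-0a580a8102be

# 3. (Assumption) ceph-csi keeps a per-volume OMAP object named after the
#    volume UUID; list the csi.* objects and remove the leftover one
rados -p ocs-storagecluster-cephblockpool ls | grep '^csi\.'
rados -p ocs-storagecluster-cephblockpool rm csi.volume.c4471bfa-b9a6-11eb-9f59-0a580a8102be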


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Not a regression; RBD thick provisioning is a new feature in OCS 4.8.

Steps to Reproduce:
1. Start creating an RBD PVC of size 15 GiB, using a storage class with thick provisioning enabled.
2. While step 1 is in progress (PVC in Pending state), delete the csi-rbdplugin-provisioner leader pod.
3. Immediately after step 2 (PVC still in Pending state), delete the PVC.
4. Wait for the PVC to get deleted.
5. Wait for the corresponding RBD image to get deleted.

Alternatively, run the ocs-ci test case
tests/manage/pv_services/test_delete_pvc_while_provisioning.py::TestDeletePvcWhileProvisioning::test_delete_rbd_pvc_while_thick_provisioning[rbdplugin_provisioner]
(a scripted sketch of the manual steps is shown below).
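For illustration, the manual steps can be scripted roughly as follows. This is a sketch, not the ocs-ci test: the storage class and PVC names are assumptions, and the provisioner leader pod must be identified from the "became leader" message in the csi-provisioner container logs:

# 1. Start creating a 15 GiB PVC against a thick-provisioning storage class
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-thick-test
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 15Gi
  storageClassName: ocs-storagecluster-ceph-rbd-thick
EOF

# 2. While the PVC is Pending, delete the csi-rbdplugin-provisioner leader pod
oc -n openshift-storage delete pod <leader-pod-name>

# 3. Immediately delete the still-Pending PVC
oc delete pvc pvc-thick-test

# 4./5. The PVC is removed, but a csi-vol-* image stays in the pool
#       (run from the toolbox pod)
rbd ls -p ocs-storagecluster-cephblockpool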

Actual results:
The PVC is deleted, but the RBD image is not.

Expected results:
The RBD image should be deleted.

Additional info:

Comment 3 Mudit Agarwal 2021-05-26 13:45:08 UTC
Jilju, this should happen without thick provisioning as well (though the window is very small).
Is it possible to verify?

Comment 4 Jilju Joy 2021-05-27 13:12:21 UTC
(In reply to Mudit Agarwal from comment #3)
> Jilju, this should happen without thick provisioning as well (though the
> window is very small).
> Is it possible to verify?

Hi Mudit,

I tried to test this multiple times. The provisioning was very quick. 
I0527 13:02:45.222042       1 controller.go:1335] provision "default/pvc-test-2" class "ocs-storagecluster-ceph-rbd": started
I0527 13:02:45.275805       1 controller.go:1442] provision "default/pvc-test-2" class "ocs-storagecluster-ceph-rbd": volume "pvc-a83265d5-3cb7-48b4-a698-f56a79a56d7c" provisioned
I0527 13:02:45.275823       1 controller.go:1459] provision "default/pvc-test-2" class "ocs-storagecluster-ceph-rbd": succeeded

So I was not able to delete the PVC before the provisioning completed.
 
Tested in version: ocs-operator.v4.8.0-399.ci

Comment 5 Mudit Agarwal 2021-05-31 08:34:51 UTC
As Madhu mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1962956#c2, this is not something that can be handled by the ceph-csi driver.
This is an existing issue that is simply more visible with thick provisioning.

Moving it out for now; we need to raise a bug with OCP and see if we can fix it there.

Comment 7 Jilju Joy 2021-06-18 14:30:56 UTC
The problem arises only when both the provisioner and the pending PVC are deleted. The PVC can be deleted while thick provisioning is in progress.
But the suggested workaround gives the impression that a thick PVC cannot be deleted at all while provisioning is in progress.
It is true that we do not expect the user to check whether the provisioner was respinned before trying to delete a Pending PVC.

The actual information we want to convey is: "Do not delete a Pending RBD thick PVC if the active RBD provisioner pod was restarted after the start of the PVC creation."
The above statement can be rephrased, but this is the actual situation that can leave a stale image.
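One hedged way to act on that statement: before deleting a Pending thick PVC, check whether the provisioner leader changed after the PVC was created. The csi-provisioner sidecar uses leader election, so the current holder can be inspected; the exact Lease object name below is an assumption (it is derived from the driver name, and some sidecar versions use a ConfigMap instead of a Lease):

# List the leader-election leases and find the RBD provisioner's entry
oc -n openshift-storage get leases

# Show the current holder (the lease name here is an assumption)
oc -n openshift-storage get lease openshift-storage-rbd-csi-ceph-com \
    -o jsonpath='{.spec.holderIdentity}{"\n"}'

# If the holder changed after the PVC was created, deleting the Pending
# PVC can leave a stale image; let provisioning finish first.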

Comment 8 Neha Berry 2021-07-27 17:21:53 UTC
I see Jilju responded. Clearing needinfo on me.

Comment 13 Adam Litke 2021-08-02 20:46:33 UTC
In comment #5, Mudit states that a bug needs to be created against OCP to get a resolution to this issue.  Has that happened?

Comment 14 Niels de Vos 2021-08-05 11:34:45 UTC
(In reply to Adam Litke from comment #13)
> In comment #5, Mudit states that a bug needs to be created against OCP to
> get a resolution to this issue.  Has that happened?

Not yet, Mudit is going to take care of that (today?).

Comment 15 Mudit Agarwal 2021-08-09 04:05:26 UTC
Raised https://bugzilla.redhat.com/show_bug.cgi?id=1990428

Comment 16 Niels de Vos 2021-11-15 10:40:56 UTC
This is not something ceph-csi can address. This bug is a tracker for the issue in the OCP distribution that ODF builds upon; the fix will come by updating the affected OCP components accordingly.

