Bug 2222020

Summary: Slow data removal for PVC with ODF VolSync enabled
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: odf-dr
Sub component: unclassified
Version: 4.13
Type: Bug
Status: NEW
Severity: unspecified
Priority: unspecified
Reporter: Elvir Kuric <ekuric>
Assignee: Benamar Mekhissi <bmekhiss>
QA Contact: krishnaram Karthick <kramdoss>
CC: muagarwa, odf-bz-bot
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---

Description Elvir Kuric 2023-07-11 14:36:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In this test we created 100 pods, each writing 10 GB per pod; this was a randrw test running for 10 hours (with --time-based=1).
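For reference, a workload of this shape can be approximated with a fio job file like the sketch below; the file name, mount path, and I/O options are illustrative assumptions, not the exact job that was used:

$ cat randrw-10g.fio
# illustrative fio job, not the exact one used in the test
[randrw-10g]
rw=randrw
size=10g
# hypothetical mount path of the PVC inside each pod
directory=/mnt/data
time_based=1
runtime=36000
ioengine=libaio
direct=1
bs=4k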

Writing to the Ceph backend was fine, and VolSync replicated data between cluster1 and cluster2.

After the test finished, we deleted the pods/PVCs/ReplicationSources and VolumeReplicationGroups on the primary and secondary clusters (we also deleted the ReplicationDestinations on the secondary cluster), and all objects were deleted.
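For illustration, the cleanup was roughly of this form (the namespace is a placeholder; these are not the exact commands used):

# run against the primary (and equivalently the secondary) cluster
$ oc -n <workload-namespace> delete pod --all
$ oc -n <workload-namespace> delete pvc --all
$ oc -n <workload-namespace> delete replicationsource --all
$ oc -n <workload-namespace> delete volumereplicationgroup --all
# on the secondary cluster only
$ oc -n <workload-namespace> delete replicationdestination --all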

However, "ceph df" on primary cluster did not shown that all storage space is reclaimed back. 20h after pods / pvc are deleted there is still 120 GB ( even there is nobody/nothing using this cluster ) 

this is problematic:

"ocs-storagecluster-cephfilesystem-data0                12  128  112 GiB   57.30k  336 GiB   2.31    4.6 TiB"

$ ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  17 TiB  1.2 TiB   1.2 TiB       6.55
TOTAL  18 TiB  17 TiB  1.2 TiB   1.2 TiB       6.55
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   54 MiB       15  162 MiB      0    4.6 TiB
ocs-storagecluster-cephblockpool                        2  512  265 GiB  130.13k  794 GiB   5.29    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              3    8      0 B        0      0 B      0    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.control          4    8      0 B        8      0 B      0    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    5    8    541 B       11  1.6 KiB      0    4.6 TiB
.rgw.root                                               6    8  5.7 KiB       16  180 KiB      0    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.log              8    8  1.7 MiB      340  7.0 MiB      0    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             9    8  4.6 KiB       14  126 KiB      0    4.6 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16  769 MiB      322  2.3 GiB   0.02    4.6 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11  128  1.0 KiB        2   24 KiB      0    4.6 TiB
ocs-storagecluster-cephfilesystem-data0                12  128  112 GiB   57.30k  336 GiB   2.31    4.6 TiB
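Possibly useful for triage: CephFS frees space asynchronously through the MDS stray/purge-queue mechanism, so one way to check whether the deletions are still being worked through is to look at the MDS purge-queue and stray counters. A sketch, assuming the default ODF MDS daemon names and that the commands are run from the rook-ceph tools pod:

$ ceph fs status
$ ceph tell mds.ocs-storagecluster-cephfilesystem-a perf dump purge_queue
$ ceph tell mds.ocs-storagecluster-cephfilesystem-a perf dump mds_cache | grep -i stray

Decreasing purge-queue and num_strays counters would suggest the MDS is still purging the deleted files, i.e. slow reclamation rather than leaked space.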


Version of all relevant components (if applicable):
ceph version 17.2.6-26.el9cp (ef7b8da24916178ade693b2fd0de13b917f53865) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes
Is there any workaround available to the best of your knowledge?

NA
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
I believe so, but it was only tested once.

Can this issue be reproduced from the UI?
NA


If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Create a VolSync setup between cluster1 and cluster2 with ODF v4.13.
2. Create 100 pods, each writing 10 GB per pod, with a test duration of approximately 10 hours.
3. Delete the pods/PVCs/ReplicationSources and VolumeReplicationGroups on the primary and secondary clusters.
4. Check "ceph df" on the primary cluster (see the sketch after this list).
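For step 4, a minimal way to watch reclamation progress (run from the rook-ceph tools pod; pool name taken from the "ceph df" output above):

$ watch -n 60 'ceph df | grep ocs-storagecluster-cephfilesystem-data0'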

The secondary cluster is fine; "ceph df" there shows that the test data were deleted.


Actual results:
Data purge from the CephFS volume is slow.

Expected results:
Data deletion should be faster; the space should be reclaimed soon after the PVCs are deleted.

Additional info: