Bug 2221094

Summary: slow data replication in ODF DR VolSync setup
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: odf-dr
Sub component: volume-replication-operator
Reporter: Elvir Kuric <ekuric>
Assignee: Benamar Mekhissi <bmekhiss>
QA Contact: krishnaram Karthick <kramdoss>
CC: bmekhiss, muagarwa, odf-bz-bot, rtalur
Status: NEW
Severity: unspecified
Priority: unspecified
Version: 4.13
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Elvir Kuric 2023-07-07 08:51:45 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

ODF VolSync does not sync all data. 

Test description:

On an ODF VolSync-enabled cluster we did the following:

- Created 3 pods, each writing 333 GB, for a ~1 TB data set in total (a sketch of one writer pod and its PVC follows the listings below).

pod-1                                          0/1     Completed   0          22h
pod-2                                          0/1     Completed   0          22h
pod-3                                          0/1     Completed   0          22h
volsync-rsync-tls-src-perf-test-pvc-1-t62sn    1/1     Running     0          21h
volsync-rsync-tls-src-perf-test-pvc-2-9q8hc    1/1     Running     0          21h
volsync-rsync-tls-src-perf-test-pvc-3-nvmd8    1/1     Running     0          21h

The test pods are pod-1, pod-2, and pod-3.

The test duration was 1h, which is why these pods are in the "Completed" status.
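
For reference, a minimal sketch of one PVC/writer-pod pair of the kind used in this test. Names, storage class, size, and access mode match the listings in this report, but the namespace, container image, and dd-based writer command are assumptions, not the exact manifests from this run:

cat <<'EOF' | oc apply -n <app-namespace> -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-test-pvc-1
  labels:
    app: perf-test            # assumed label; Ramen's DRPlacementControl pvcSelector typically matches PVC labels
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ocs-storagecluster-cephfs
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
spec:
  restartPolicy: Never
  containers:
  - name: writer
    image: registry.access.redhat.com/ubi9/ubi                       # assumed writer image
    command: ["sh", "-c", "dd if=/dev/urandom of=/data/testfile bs=1M count=341000"]   # roughly 333 GiB
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: perf-test-pvc-1
EOF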

The PVCs created during this test are:

oc get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    AGE
perf-test-pvc-1               Bound    pvc-32c1ec59-d5de-4221-9604-e3587ff68a16   500Gi      RWX            ocs-storagecluster-cephfs       22h
perf-test-pvc-2               Bound    pvc-3b0d4a80-8325-4d50-94ae-0577b087e3bd   500Gi      RWX            ocs-storagecluster-cephfs       22h
perf-test-pvc-3               Bound    pvc-12c1f1b1-b91b-4730-bee8-a56cfbafe459   500Gi      RWX            ocs-storagecluster-cephfs       22h
volsync-perf-test-pvc-1-src   Bound    pvc-4c1c9e36-62ec-4e18-97bf-128e876a5689   500Gi      ROX            ocs-storagecluster-cephfs-vrg   21h
volsync-perf-test-pvc-2-src   Bound    pvc-19efa2da-e873-4a5b-882d-ee302099fcac   500Gi      ROX            ocs-storagecluster-cephfs-vrg   21h
volsync-perf-test-pvc-3-src   Bound    pvc-0651c93e-1a4d-4ef4-90eb-142ba5af5da4   500Gi      ROX            ocs-storagecluster-cephfs-vrg   21h


The VolSync pods mount the corresponding VolSync PVCs; see:

http://perf148b.perf.lab.eng.bos.redhat.com/bz/volsync/slowop/logs/mount-volsyncpod.txt
http://perf148b.perf.lab.eng.bos.redhat.com/bz/volsync/slowop/logs/df-h-volsyncpod.txt 
http://perf148b.perf.lab.eng.bos.redhat.com/bz/volsync/slowop/logs/logs_volsync-rsync-tls-src-perf-test-pvc-1-t62sn.txt
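
(Those files were presumably collected with commands along these lines; the namespace is a placeholder:

oc exec -n <app-namespace> volsync-rsync-tls-src-perf-test-pvc-1-t62sn -- mount
oc exec -n <app-namespace> volsync-rsync-tls-src-perf-test-pvc-1-t62sn -- df -h
oc logs -n <app-namespace> volsync-rsync-tls-src-perf-test-pvc-1-t62sn
)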

What we noticed is the following:

The VolSync pods keep running for hours:

volsync-rsync-tls-src-perf-test-pvc-1-t62sn    1/1     Running     0          21h
volsync-rsync-tls-src-perf-test-pvc-2-9q8hc    1/1     Running     0          21h
volsync-rsync-tls-src-perf-test-pvc-3-nvmd8    1/1     Running     0          21h

The ReplicationSources on the primary cluster are:

NAME              SOURCE            LAST SYNC              DURATION           NEXT SYNC
perf-test-pvc-1   perf-test-pvc-1   2023-07-06T09:46:59Z   47m18.199782607s   2023-07-06T09:50:00Z
perf-test-pvc-2   perf-test-pvc-2   2023-07-06T09:50:25Z   50m44.683806909s   2023-07-06T09:55:00Z
perf-test-pvc-3   perf-test-pvc-3   2023-07-06T09:49:17Z   49m24.772300706s   2023-07-06T09:50:00Z
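
The columns above come from the ReplicationSource resources on the primary managed cluster; roughly equivalent query commands (namespace is a placeholder):

oc get replicationsources -n <app-namespace>
# drill into one source, e.g. last sync duration and completion time:
oc get replicationsource perf-test-pvc-1 -n <app-namespace> \
  -o jsonpath='{.status.lastSyncDuration}{"\n"}{.status.lastSyncTime}{"\n"}'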

Ramen DR cluster operator log: http://perf148b.perf.lab.eng.bos.redhat.com/bz/volsync/slowop/log_ramen-dr-cluster-operator-manager.txt

On the primary cluster we see:

ceph df 
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  14 TiB  3.9 TiB   3.9 TiB      21.23
TOTAL  18 TiB  14 TiB  3.9 TiB   3.9 TiB      21.23
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   51 MiB       14  154 MiB      0    3.7 TiB
ocs-storagecluster-cephblockpool                        2  512  252 GiB  126.99k  757 GiB   6.31    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              3    8      0 B        0      0 B      0    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.control          4    8      0 B        8      0 B      0    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    5    8  3.8 KiB       11   12 KiB      0    3.7 TiB
.rgw.root                                               6    8  5.7 KiB       16  180 KiB      0    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.log              8    8  1.6 MiB      340  6.7 MiB      0    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             9    8  4.2 KiB       14  125 KiB      0    3.7 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16  494 MiB      168  1.4 GiB   0.01    3.7 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11  128  1.0 KiB        2   24 KiB      0    3.7 TiB
ocs-storagecluster-cephfilesystem-data0                12  128  2.0 TiB  511.49k  3.1 TiB  21.98    3.7 TiB

On the secondary cluster we see:

ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  17 TiB  1.2 TiB   1.2 TiB       6.37
TOTAL  18 TiB  17 TiB  1.2 TiB   1.2 TiB       6.37
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   53 MiB       15  160 MiB      0    4.7 TiB
ocs-storagecluster-cephblockpool                        2  512  153 GiB   39.78k  460 GiB   3.06    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.control          3    8      0 B        8      0 B      0    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    4    8  7.3 KiB       11   22 KiB      0    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.log              5    8  1.7 MiB      340  7.0 MiB      0    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             6    8  8.5 KiB       14  138 KiB      0    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    4.7 TiB
.rgw.root                                               8    8  5.7 KiB       16  180 KiB      0    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              9    8      0 B        0      0 B      0    4.7 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16  3.1 GiB      837  9.3 GiB   0.06    4.7 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11  128    1 KiB        1   12 KiB      0    4.7 TiB
ocs-storagecluster-cephfilesystem-data0                12  128  232 GiB   59.42k  696 GiB   4.56    4.7 TiB
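
Both "ceph df" outputs above were presumably taken from the rook-ceph toolbox on each cluster. A rough way to compare the CephFS data pool usage on the two sides (assumes the toolbox deployment is enabled on each cluster and that kubeconfig contexts named cluster1/cluster2 exist):

for ctx in cluster1 cluster2; do
  echo "== $ctx =="
  oc --context "$ctx" -n openshift-storage exec deploy/rook-ceph-tools -- \
    sh -c 'ceph df | grep -E "TOTAL|cephfilesystem-data0"'
done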


Version of all relevant components (if applicable):

ODF v4.13
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create an ODF DR setup with VolSync.
2. Create 3 pods on the primary side, mount a PVC in each, and write 300-400 GB per pod, with schedulingInterval: 5m and a test runtime of 1h.
3. Establish replication between cluster1 and cluster2.
4. Monitor whether the VolSync pods on the primary side finish syncing and whether "ceph df" becomes equal on both clusters (a rough monitoring loop is sketched after this list).
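
For step 4, a rough monitoring loop; this is illustrative only, with the namespace, the kubeconfig contexts, and the 300 s sleep (matching the 5m schedulingInterval) as placeholders/assumptions:

while true; do
  date
  # source-side mover pods and per-PVC sync status
  oc --context cluster1 -n <app-namespace> get pods | grep volsync-rsync-tls-src || true
  oc --context cluster1 -n <app-namespace> get replicationsources
  # compare CephFS data pool usage on both clusters
  for ctx in cluster1 cluster2; do
    echo "== $ctx =="
    oc --context "$ctx" -n openshift-storage exec deploy/rook-ceph-tools -- \
      sh -c 'ceph df | grep cephfilesystem-data0'
  done
  sleep 300
done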


Actual results:
Data does not finish syncing between cluster1 and cluster2 in the ODF VolSync setup.


Expected results:
Data syncs between cluster1 and cluster2 in the ODF VolSync setup.

Additional info:

Logs and must-gather from cluster1 and cluster2:

http://perf148b.perf.lab.eng.bos.redhat.com/bz/volsync/slowop/