Description of problem (please be as detailed as possible and provide log snippets):

[DR] rbd-mirror performing a full copy of the RBD image every scheduling interval

Version of all relevant components (if applicable):
ODF version: 4.9.0-248.ci
OCP version: 4.9.0-0.nightly-2021-11-12-222121

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproducible?
Yes, always

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a DR cluster
2. Run a workload; make sure the PVC sizes are large (100-500 GiB or more)
3. Check the snapshot list output:
```
for i in $(rbd ls -p ocs-storagecluster-cephblockpool); do rbd snap ls ocs-storagecluster-cephblockpool/$i --all; done
```

Actual results:
```
SNAPID  NAME                                                                                           SIZE     PROTECTED  TIMESTAMP                 NAMESPACE
  5415  .mirror.non_primary.6b821980-027d-4008-a40f-ceb8c72291a2.75346ad4-d698-4f0a-9b50-8dcfc847d9ed  100 GiB             Thu Nov 25 06:03:00 2021  mirror (non-primary peer_uuids:[] 1b9a2222-0ab5-4179-9e3b-c40f53d4c194:6362 copied)
  5465  .mirror.non_primary.6b821980-027d-4008-a40f-ceb8c72291a2.afb61695-269c-4920-b2a1-ac13a9bb8f1d  100 GiB             Thu Nov 25 06:16:04 2021  mirror (non-primary peer_uuids:[] 1b9a2222-0ab5-4179-9e3b-c40f53d4c194:6434 83% copied)
```

Expected results:
There should be only a differential copy after the initial full sync.

Additional info:
The scheduling interval was set to 5m.
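For reference, a minimal sketch (standard rbd CLI commands, assuming the same pool as above) of how the mirror-snapshot schedule can be inspected; note that in ODF the 5m interval is normally driven by the DR operator rather than set by hand, so the `add` command below is illustrative only:

```
# List configured mirror-snapshot schedules for the pool
rbd mirror snapshot schedule ls --pool ocs-storagecluster-cephblockpool --recursive

# Show when each image is next due for a mirror snapshot
rbd mirror snapshot schedule status --pool ocs-storagecluster-cephblockpool

# Manually setting a 5m pool-level schedule would look like this
# (illustrative; in ODF/DR the schedule is managed by the operator)
rbd mirror snapshot schedule add --pool ocs-storagecluster-cephblockpool 5m
```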
This issue can be observed easily in a high-latency environment, but it can also be reproduced easily in a low-latency environment by simply creating a large RBD volume. The following 100G volume took around 11 minutes to be transferred to the other cluster.

Mon Nov 29 14:01:35 -- rbd create ocs-storagecluster-cephblockpool/test_1110 --size 100G

Mon Nov 29 14:10:56 -- rbd snap ls ocs-storagecluster-cephblockpool/test_1110 --all
```
SNAPID  NAME                                                                                           SIZE     PROTECTED  TIMESTAMP                 NAMESPACE
  3395  .mirror.non_primary.24a3c8b6-e6e1-4739-9319-6408c3e6b38f.adf4bc4d-fc18-4854-8fd2-271672275d5c  100 GiB             Mon Nov 29 19:02:52 2021  mirror (non-primary peer_uuids:[] 5b050cc6-40bd-4df1-960a-c880dfb4ae93:3526 85% copied)
```

Mon Nov 29 14:11:32 -- rbd snap ls ocs-storagecluster-cephblockpool/test_1110 --all
```
SNAPID  NAME                                                                                           SIZE     PROTECTED  TIMESTAMP                 NAMESPACE
  3395  .mirror.non_primary.24a3c8b6-e6e1-4739-9319-6408c3e6b38f.adf4bc4d-fc18-4854-8fd2-271672275d5c  100 GiB             Mon Nov 29 19:02:52 2021  mirror (non-primary peer_uuids:[] 5b050cc6-40bd-4df1-960a-c880dfb4ae93:3526 copied)
```

JC, Annette, and Madhu suggested using the fast-diff feature. And sure enough, once I recreated the image with fast-diff enabled, the transfer time dropped to around 20 seconds.

```
rbd create ocs-storagecluster-cephblockpool/test_2220 --size 100G --image-feature object-map,fast-diff,exclusive-lock
```
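As a side note, fast-diff can in principle also be enabled on an existing image instead of recreating it. This is a sketch of the standard rbd commands, not a workaround validated against an actively mirrored image (whether rbd-mirror picks up the change mid-replication is an open assumption):

```
# exclusive-lock is a prerequisite for object-map/fast-diff
rbd feature enable ocs-storagecluster-cephblockpool/test_1110 exclusive-lock
rbd feature enable ocs-storagecluster-cephblockpool/test_1110 object-map fast-diff

# Rebuild the object map so already-written extents are tracked
rbd object-map rebuild ocs-storagecluster-cephblockpool/test_1110

# Verify the enabled features
rbd info ocs-storagecluster-cephblockpool/test_1110 | grep features
```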
Hi, I believe this BZ is critical, as its impact might extend beyond ODF-related resources to other resources in the cluster, and even beyond ODF/OCP. The reason is that large amounts of data being fully replicated every 5 minutes could lead to sustained high network and compute utilization, which could saturate the clusters and the customer's network and in turn lead to starvation among other clients. Therefore, proposing this as a blocker for 4.9.1.
(In reply to Elad from comment #5)
> Hi, I believe this BZ is critical, as its impact might extend beyond
> ODF-related resources to other resources in the cluster, and even beyond
> ODF/OCP.
> The reason is that large amounts of data being fully replicated every 5
> minutes could lead to sustained high network and compute utilization,
> which could saturate the clusters and the customer's network and in turn
> lead to starvation among other clients.
> Therefore, proposing this as a blocker for 4.9.1.

The core issue is fixed in 4.9.1 based on this clone: https://bugzilla.redhat.com/show_bug.cgi?id=2030745

The fix requires that images be created with the fast-diff feature, and hence requires a separate StorageClass for DR. That issue is being tracked for improvements as per comment #4.

I would hence state that further fixes in this regard are not a target for 4.9.1 (IOW, merging the StorageClass for both the DR and non-DR cases).
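For illustration, a minimal sketch of what such a separate DR StorageClass could look like, requesting fast-diff (and its prerequisites) through the ceph-csi imageFeatures parameter. The class name and secret references below are placeholder conventions, not values confirmed from this cluster:

```
cat <<EOF | oc create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-ceph-rbd-dr   # hypothetical name for the DR class
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: ocs-storagecluster-cephblockpool
  # fast-diff lets rbd-mirror compute deltas instead of re-reading
  # the whole image on every scheduling interval
  imageFeatures: layering,exclusive-lock,object-map,fast-diff
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
EOF
```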
Agree with Shyam: the fix required for TP is already in 4.9.1, and a doc bug is in place to help users with the StorageClass creation. We don't intend to put any further fixes into 4.9.z for this, so removing the 4.9.z flag.
This has already been fixed and is being tracked in Jira, so closing the BZ.
https://issues.redhat.com/browse/RHSTOR-2502