Bug 2026575
| Summary: | [DR] rbd-mirror performing full copy of rbd image every scheduling interval | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> | |
| Component: | csi-driver | Assignee: | Madhu Rajanna <mrajanna> | |
| Status: | CLOSED DEFERRED | QA Contact: | Elad <ebenahar> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.9 | CC: | bmekhiss, bniver, ebenahar, jespy, jmishra, kseeger, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, srangana, ypadia | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2030745 | Environment: | |
| Last Closed: | 2022-05-26 09:43:43 UTC | Type: | Bug | |
| Bug Blocks: | 2030745, 2032914 | |||
Description
Pratik Surve
2021-11-25 07:36:07 UTC
This issue can be observed easily in a high-latency environment, but it can also be reproduced easily in a low-latency environment by simply creating a large RBD volume. The following 100G volume took around 11 minutes to be transferred to the other cluster.

Mon Nov 29 14:01:35 -- rbd create ocs-storagecluster-cephblockpool/test_1110 --size 100G

Mon Nov 29 14:10:56 -- rbd snap ls ocs-storagecluster-cephblockpool/test_1110 --all

```
SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE
3395 .mirror.non_primary.24a3c8b6-e6e1-4739-9319-6408c3e6b38f.adf4bc4d-fc18-4854-8fd2-271672275d5c 100 GiB Mon Nov 29 19:02:52 2021 mirror (non-primary peer_uuids:[] 5b050cc6-40bd-4df1-960a-c880dfb4ae93:3526 85% copied)
```

Mon Nov 29 14:11:32 -- rbd snap ls ocs-storagecluster-cephblockpool/test_1110 --all

```
SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE
3395 .mirror.non_primary.24a3c8b6-e6e1-4739-9319-6408c3e6b38f.adf4bc4d-fc18-4854-8fd2-271672275d5c 100 GiB Mon Nov 29 19:02:52 2021 mirror (non-primary peer_uuids:[] 5b050cc6-40bd-4df1-960a-c880dfb4ae93:3526 copied)
```

JC, Annette, and Madhu suggested using the fast-diff option. And sure enough, once I recreated an image with fast-diff, the transfer time dropped to around 20 seconds.

```
rbd create ocs-storagecluster-cephblockpool/test_2220 --size 100G --image-feature object-map,fast-diff,exclusive-lock
```

Hi, I believe this BZ is critical, as its impact might be not only on ODF-related resources but also on other resources in the cluster and even beyond ODF/OCP.
The reason is that large amounts of data being fully replicated every 5 minutes could potentially lead to constant high network and compute utilization, which could suffocate the clusters and the customer network, which in turn could lead to starvation among other clients.
Therefore, proposing as a blocker for 4.9.1.

(In reply to Elad from comment #5)
> Hi, I believe this BZ is critical, as its impact might be not only on
> ODF-related resources but also on other resources in the cluster and even
> beyond ODF/OCP.
> The reason is that large amounts of data being fully replicated every 5
> minutes could potentially lead to constant high network and compute
> utilization, which could suffocate the clusters and the customer network,
> which in turn could lead to starvation among other clients.
> Therefore, proposing as a blocker for 4.9.1.

The core issue is fixed in 4.9.1 based on this clone: https://bugzilla.redhat.com/show_bug.cgi?id=2030745

The fix requires that images be created using the fast-diff feature, and hence requires a separate StorageClass for DR. This issue is being tracked for improvements as per comment #4.

I would hence state that further fixes in this regard are not a target for 4.9.1 (IOW, merging the StorageClass for both DR and non-DR cases).

Agree with Shyam, the fix required for TP is already in 4.9.1 and a doc bug is in place to help users with the SC creation etc. We don't intend to put any further fixes in 4.9.z for this, removing the 4.9.z flag.

We have already fixed it and it is being tracked in Jira, closing the BZ. https://issues.redhat.com/browse/RHSTOR-2502
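Since the fix depends on images being created with the fast-diff feature, it may help to check whether a given image already has it. The following is only a minimal sketch, not something taken from this BZ: the helper name `image_has_fast_diff` is hypothetical, and it assumes the plain-text output of `rbd info` contains a `features:` line listing the enabled image features.

```shell
#!/bin/sh
# Hypothetical helper (not part of the rbd CLI): returns success if the
# "features:" line printed by `rbd info` for the image lists fast-diff.
# Assumption: `rbd info` plain-text output has a line starting with "features:".
image_has_fast_diff() {
    # $1 is an image spec such as ocs-storagecluster-cephblockpool/test_2220
    rbd info "$1" \
        | grep -E '^[[:space:]]*features:' \
        | grep -q 'fast-diff'
}
```

It could be used as `image_has_fast_diff ocs-storagecluster-cephblockpool/test_1110 || echo "fast-diff not enabled"`. Anchoring the match on the `features:` line avoids picking up the string from unrelated lines of the output.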