Bug 2213237 - [rbd-mirror]: improvements: RPO not met when adding latency between clusters
Summary: [rbd-mirror]: improvements: RPO not met when adding latency between clusters
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 9.0
Assignee: Ilya Dryomov
QA Contact:
URL:
Whiteboard:
Depends On: 2181055
Blocks:
 
Reported: 2023-06-07 15:06 UTC by Vasishta
Modified: 2024-11-11 22:21 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2181055
Environment:
Last Closed:
Embargoed:


Links
System                    ID           Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker  59732        0        None      None    None     2023-06-07 15:06:33 UTC
Red Hat Issue Tracker     RHCEPH-6824  0        None      None    None     2023-06-13 17:29:18 UTC

Description Vasishta 2023-06-07 15:06:33 UTC
+++ This bug was initially created as a clone of Bug #2181055 +++

Cloning the BZ because the initial BZ was used to track the fix mentioned in https://tracker.ceph.com/issues/59393

Creating this tracker to track further investigations.


From Paul Cuzner's testing of rbd-mirror:

Introducing a network delay between clusters for a small workload (100 images, 2,500 IOPS) showed the
following (see the latency-injection sketch after this list):
o Measuring the effect of 20ms latency applied before the rbd-mirror relationships were created
▪ After several hours, tests with as few as 50 images (250GB of data) were not able to achieve
synchronization
o Measuring the effect of various network latencies after the initial synchronization was complete
▪ At 10ms, the sync time is extended by at least 30% but replication success remains
consistent.
▪ Based on cloudping data, there are NO compatible AWS regions that exhibit this
latency
▪ At 20ms latency, network bandwidth and CPU load imply replication is not happening, but
snapshot timestamps are changing – it's just very, very slow!
▪ Changes to concurrent_image_syncs are needed to force rbd-mirror to run more
concurrent sessions to accommodate the 20ms network delay. The downside of this
strategy is increased CPU load, as more sync tasks are handled concurrently.
▪ Using the cloudping data with a 20ms ceiling there are 10 regions that have the
potential to support snapshot rbd-mirror across 14 region-to-region combinations
(code and output)
▪ At 50ms latency, with 50 concurrent image syncs, the images do not replicate within the
replication interval. Snapshots are taken at the primary cluster, but after 2 hours the
secondary site has not been able to achieve a consistent state with the primary cluster.
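
For reproducing this kind of setup, inter-cluster latency is commonly injected with tc/netem on the interface that carries replication traffic. A minimal sketch follows; the interface name and the delay value are assumptions, not details taken from Paul's report:

  # Add 20ms of one-way delay on the replication-facing interface (eth0 is a placeholder).
  tc qdisc add dev eth0 root netem delay 20ms

  # Verify the qdisc, and remove it once testing is done.
  tc qdisc show dev eth0
  tc qdisc del dev eth0 root netem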

--- Additional comment from Josh Durgin on 2023-03-23 00:15:01 UTC ---

Full report here: https://ibm.ent.box.com/s/95b8wq63mu1y3pn81f5la38la901bg19/file/1164853673665

--- Additional comment from Ilya Dryomov on 2023-05-10 16:41:50 UTC ---

Pushed to ceph-6.1-rhel-patches.

--- Additional comment from errata-xmlrpc on 2023-05-10 20:01:39 UTC ---

This bug has been added to advisory RHBA-2023:112314 by Thomas Serlin (tserlin)

--- Additional comment from errata-xmlrpc on 2023-05-10 20:01:40 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2023:112314-01
https://errata.devel.redhat.com/advisory/112314

--- Additional comment from Vasishta on 2023-05-23 15:34:03 UTC ---

Release verification is in progress; observing 50 images with IO running and a 30-100ms delay.

--- Additional comment from Vasishta on 2023-05-31 04:56:15 UTC ---

Verification is in progress.

Observed the creation of 10 images of 10G each, done in sequence, without any other IO running.
Apparently it took 2 hours for all the images to complete with a 1G NIC and 100ms delay.
Trying to narrow down the behavior with IO running on the images; will update further.

Also, while trying out a few things around this, we hit https://bugzilla.redhat.com/show_bug.cgi?id=2210790

--- Additional comment from Vasishta on 2023-06-06 05:10:23 UTC ---

Tried concurrent IO on 40+ images, with another 50-60 images under snapshot scheduling (empty snaps), on a 1G network.

With 16GB of IO (across all 40 images) within each 5-minute interval, observed a big difference between a 100ms delay and a 300ms delay.
Got to know that the issue has been identified and a fix is being tracked under https://tracker.ceph.com/issues/59732

Currently we are working on verifying the fix for https://tracker.ceph.com/issues/59393

--- Additional comment from Vasishta on 2023-06-06 08:22:21 UTC ---

Hi Josh,

As you may already know, two issues have been identified to fix/improve RPO not being met with a delay in place between the clusters:
https://tracker.ceph.com/issues/59393 and https://tracker.ceph.com/issues/59732

Currently we have a fix for https://tracker.ceph.com/issues/59393.
Since this BZ's summary depicts a larger scope, please let me know your thoughts on tracking https://tracker.ceph.com/issues/59732: should we use the same BZ or a different (new) tracker?

--- Additional comment from Ilya Dryomov on 2023-06-06 08:59:45 UTC ---

(In reply to Vasishta from comment #7)
> Tried concurrent IO on 40+ images, with another 50-60 images under snapshot
> scheduling (empty snaps), on a 1G network.
> 
> With 16GB of IO (across all 40 images) within each 5-minute interval,
> observed a big difference between a 100ms delay and a 300ms delay.
> Got to know that the issue has been identified and a fix is being tracked
> under https://tracker.ceph.com/issues/59732

Hi Vasi,

Can you clarify what the "big difference" was?  https://tracker.ceph.com/issues/59732 is very light on details, just as your report here.

--- Additional comment from Vasishta on 2023-06-06 09:03:59 UTC ---

Here is an exercise (test) that was performed, and its details -

In a cluster with a 1G network and 300ms latency:

a. At around the same time, started creating 10 thick-provisioned images of 10G each (with mirroring enabled and snapshot scheduling at 5 minutes) [1], along with concurrent IO on 40 images [2]. (A CLI sketch of this kind of setup follows this list.)
b. For the 1st newly created thick-provisioned image, which got created before Tue Jun  6 04:40:37 [1], stuck empty primary snapshots can still be seen even at 8:40.
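
For reference, a minimal sketch of how such an image is typically created, mirror-enabled, and put on a 5-minute snapshot schedule, assuming the pool is already enabled for image-mode mirroring; the pool and image names below are placeholders, not the actual test names:

  # Create a thick-provisioned 10G image (pool/image names are hypothetical).
  rbd create --size 10G --thick-provision mirror_pool/test_image

  # Enable snapshot-based mirroring on the image and add a 5-minute schedule.
  rbd mirror image enable mirror_pool/test_image snapshot
  rbd mirror snapshot schedule add --pool mirror_pool --image test_image 5m

  # Confirm the schedule is in place.
  rbd mirror snapshot schedule ls --pool mirror_pool --image test_image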

Example: Image cr4m_p8_b6_0
snap ls at 8:42 (4 hours after image creation == 4 hours after IO stopped on this image)
snap ls at primary and secondary - http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/b6_0_image_snap_ls
The last snapshot, which is stuck at 0% copying, must be an empty snap that has been stuck copying for almost an hour.

Snap ls of all 10 such images, which were created 4 hours before the current time and have had no IO performed on them, but which have been stuck for hours copying empty snapshots [3]
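
For context, a hedged sketch of the kind of commands behind these listings; the pool name is a placeholder:

  # List all snapshots (including mirror snapshots) for the image; on the secondary,
  # non-primary mirror snapshots indicate how far they have been copied.
  rbd snap ls --all mirror_pool/cr4m_p8_b6_0

  # Basic image details, as captured in [4].
  rbd info mirror_pool/cr4m_p8_b6_0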

Even after IO stopped on 30 images (only 10 images still in progress, and that too with a very small quantity of IO), it can be observed that snapshot copying was stalled for an hour for image cr4m_p8_b6_0 (as an example).

Can we please check what could be causing this issue?

primary mgr log http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/ceph-mgr.cali020.urvbse.log.gz
secondary mgr log http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/ceph-mgr.cali025.ffrpac.log.gz
secondary rbd mirror client - http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/ceph-client.rbd-mirror.pluto009.ghicsc.log

[1] creation cli log- http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/j6_image_creation_log
[2] fio profile on 30 images (in batches of 10 images), run in a loop 3 times (a full invocation sketch follows these notes):
--runtime 3600 --rw=randwrite --bs=4k --size 25G --ioengine=libaio --numjobs=4 --time_based --group_reporting
fio profile on 10 images:
--runtime 33600 --rw=randwrite --bs=4k --size 25G --ioengine=libaio --numjobs=4 --time_based --group_reporting
[3] http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/b6_images_snap_ls_primary
http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/b6_images_snap_ls_secondary
[4] http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/b6_0_image_info
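
For reference, a hedged reconstruction of a complete fio invocation based on the profile above; the device path and iodepth are assumptions, since the report does not say how the images were exposed to fio:

  # Assumes the image is krbd-mapped; /dev/rbd0 and --iodepth=16 are placeholders.
  fio --name=mirror-load --filename=/dev/rbd0 --rw=randwrite --bs=4k --size=25G \
      --ioengine=libaio --iodepth=16 --direct=1 --numjobs=4 \
      --runtime=3600 --time_based --group_reporting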

--- Additional comment from Vasishta on 2023-06-06 09:13:57 UTC ---

(In reply to Ilya Dryomov from comment #9)
> (In reply to Vasishta from comment #7)
> Can you clarify what the "big difference" was? 
> https://tracker.ceph.com/issues/59732 is very light on details, just as your
> report here.

Ilya,
sorry, I haven't recorded the observation properly for this one.

Just rephrasing and adding what I remember:
- Tried concurrent IO on 40+ images, with another 50-60 images under snapshot scheduling (empty snaps), on a 1G network.
With other factors being the same, I observed around a 20-minute difference in complete syncing between a 100ms delay and a 300ms delay.

--- Additional comment from Vasishta on 2023-06-06 09:26:40 UTC ---

Extending comment #10

Currently parallel IO is being performed on only 10 images [1].
IO that was started on another 30 images was stopped around 1:30 hours ago; observing stuck snapshots that were created on the primary 2 hours back.

Example: cr4m_p8_b1_0 (this image was created 20 days ago and has been under a 5-minute snapshot schedule)
image info and snap ls from both clusters - 
http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/example2

Can these images be examined to see whether this was caused by any of the identified issues or not?


[1] fio profile 
--runtime 33600 --rw=randwrite --bs=4k --size 25G --ioengine=libaio --numjobs=4 --time_based --group_reporting

--- Additional comment from Vasishta on 2023-06-06 13:36:02 UTC ---

(In reply to Vasishta from comment #12)
> Extending comment #10
> 
> Currently parallel IO is being performed on only 10 images [1].
> IO that was started on another 30 images was stopped around 1:30 hours ago;
> observing stuck snapshots that were created on the primary 2 hours back.
> 
> Example: cr4m_p8_b1_0 (this image was created 20 days ago and has been
> under a 5-minute snapshot schedule)
> image info and snap ls from both clusters - 
> http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/example2
> 
> Can these images be examined to see whether this was caused by any of the
> identified issues or not?
> 
> 
> [1] fio profile 
> --runtime 33600 --rw=randwrite --bs=4k --size 25G --ioengine=libaio
> --numjobs=4 --time_based --group_reporting


For the image from the above example: there has been no IO on this image for the last 6+ hours, and the time gap between the syncing snapshot and the latest primary snapshot hasn't fully caught up, but it has reduced to a 35-minute difference from 1:30 hours.

Added the latest snap ls output from primary and secondary to the same file:
http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/BZ_2181055/example2
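
A hedged sketch of how this per-image replication lag can be checked; the pool name is a placeholder:

  # Show the image's mirroring state; for snapshot-based mirroring the description
  # typically includes local/remote snapshot timestamps that indicate the lag.
  rbd mirror image status mirror_pool/cr4m_p8_b1_0

  # Pool-wide view across all mirrored images.
  rbd mirror pool status --verbose mirror_pool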

--- Additional comment from Ilya Dryomov on 2023-06-06 18:04:07 UTC ---

As discussed in the standup, I would suggest separating the concerns here.  This BZ is tracking a fix for https://tracker.ceph.com/issues/59393, which Chris was able to reproduce with just a single image on a tiny cluster.  No work has been done on https://tracker.ceph.com/issues/59732 (we haven't even established its validity yet -- notice that it says "investigate"...), so tying it to the verification of this BZ is fruitless.

Having ~40 + another 50-60 images is a plausible but very different scenario.  As noted in the description (which is a paste from Paul's report), you would likely need to tweak rbd_mirror_concurrent_image_syncs and possibly other options to accommodate that number of images under 300ms delay (which is 3x bigger than our nominal target for replication between us-east and us-west regions).
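
A minimal sketch of how that tuning might be applied via the MON config database; the daemon section name (client.rbd-mirror.a) and the value 20 are placeholders, not recommended settings:

  # Check the value currently applied to the rbd-mirror daemon, then raise it.
  ceph config get client.rbd-mirror.a rbd_mirror_concurrent_image_syncs
  ceph config set client.rbd-mirror.a rbd_mirror_concurrent_image_syncs 20

Depending on the option, the daemon may need a restart to pick the new value up; the trade-off is higher CPU load from the additional concurrent sync sessions, as noted in Paul's report.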

