Bug 2181055 - [rbd-mirror] RPO not met when adding latency between clusters
Summary: [rbd-mirror] RPO not met when adding latency between clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 6.1
Assignee: Christopher Hoffman
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 2213237 2192813
 
Reported: 2023-03-23 00:15 UTC by Josh Durgin
Modified: 2023-06-21 09:19 UTC
9 users

Fixed In Version: ceph-17.2.6-46.el9cp
Doc Type: Bug Fix
Doc Text:
.Snapshot removal step is moved to the primary site
Previously, due to latency between sites, a remote cluster could not grab the lock quickly enough to remove a synced snapshot while I/O was underway. This caused the mirror image snapshot sync process to get stuck during snapshot removal and prevented further snapshots from syncing. With this fix, the snapshot removal step is moved to the primary site, which grabs the lock fast enough to remove the snapshot, so the mirror image snapshot sync no longer gets stuck and works as expected.
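The stuck-sync symptom can be checked from the CLI on either cluster; a minimal sketch, assuming a pool named `data` and an image named `vm1` (both names are illustrative, not from this report):

```shell
# Overall mirroring health for the pool (hypothetical pool name "data")
rbd mirror pool status data --verbose

# Per-image state; a sync stuck at snapshot removal shows the image
# lingering in a "syncing" description without making progress
rbd mirror image status data/vm1

# List all snapshots, including mirror snapshots; mirror snapshots
# accumulating on the image is a symptom of the removal step failing
rbd snap ls --all data/vm1
```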
Clone Of:
: 2213237 (view as bug list)
Environment:
Last Closed: 2023-06-15 09:16:51 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 59393 0 None None None 2023-04-19 15:46:59 UTC
Red Hat Issue Tracker RHCEPH-6292 0 None None None 2023-03-23 00:15:46 UTC

Description Josh Durgin 2023-03-23 00:15:01 UTC
From Paul Cuzner's testing of rbd-mirror:

Introducing a network delay between clusters for a small workload (100 images, 2,500 IOPS) showed the following:

o Measuring the effect of 20ms latency applied before the rbd-mirror relationships were created
▪ After several hours, tests with as few as 50 images (250GB of data) were not able to achieve synchronization
o Measuring the effect of various network latencies after the initial synchronization was complete
▪ At 10ms, the sync time is extended by at least 30%, but replication success remains consistent.
▪ Based on cloudping data, there are NO compatible AWS regions that exhibit this latency
▪ At 20ms latency, network bandwidth and CPU load imply replication is not happening, but snapshot timestamps are changing – it's just very, very slow!
▪ Changes to concurrent_image_syncs are needed to force rbd-mirror to run more concurrent sessions to accommodate the 20ms network delay. The downside of this strategy is increased CPU load, as more sync tasks are handled concurrently.
▪ Using the cloudping data with a 20ms ceiling, there are 10 regions that have the potential to support snapshot rbd-mirror across 14 region-to-region combinations (code and output)
▪ At 50ms latency, with 50 concurrent image syncs, the images do not replicate within the replication interval. Snapshots are taken at the primary cluster, but after 2 hours the secondary site has not been able to achieve a consistent state with the primary cluster
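The latency injection and concurrency tuning described above can be sketched as follows. This is an assumed reproduction recipe, not the exact commands from the test: the interface name `eth0`, the `client.rbd-mirror` config section, and the value 50 are illustrative.

```shell
# Inject a 20ms one-way delay on the replication NIC using netem
# (interface name is an assumption)
tc qdisc add dev eth0 root netem delay 20ms

# Raise the number of images rbd-mirror syncs concurrently; the report
# used up to 50 concurrent image syncs at higher latencies
ceph config set client.rbd-mirror rbd_mirror_concurrent_image_syncs 50

# Verify the setting took effect
ceph config get client.rbd-mirror rbd_mirror_concurrent_image_syncs

# Remove the injected delay once testing is done
tc qdisc del dev eth0 root netem
```

The trade-off noted in the test applies: more concurrent syncs mean higher CPU load on the rbd-mirror daemon host.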

Comment 17 errata-xmlrpc 2023-06-15 09:16:51 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

