Bug 2105308

Summary: [rbd-mirror]: secondary images reporting error (stopping_replay, stopped, error) which secondary seen split-brain and client blocklist
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: RBD-Mirror
Assignee: Ilya Dryomov <idryomov>
Status: NEW ---
QA Contact: Sunil Angadi <sangadi>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 5.2
CC: ceph-eng-bugs, cephqe-warriors, idryomov, jdurgin, sangadi, vereddy
Target Milestone: ---
Flags: sangadi: needinfo+
Target Release: 7.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vasishta 2022-07-08 13:50:08 UTC
Description of problem:
Configured mirroring with 26 images on both clusters, with the snapshot schedule set to 2 min on individual images. Ran I/O on a few images.
(No relocate operations)
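
For reference, a minimal sketch of how a per-image 2-minute schedule like the one described above could be set up. It only builds the `rbd mirror snapshot schedule add` command lines and does not run them. The pool name `mirror_pool` and the image names `image_0` to `image_25` are hypothetical and not taken from this report.

```python
def schedule_commands(pool, images, interval="2m"):
    """Build one `rbd mirror snapshot schedule add` argv per image."""
    return [
        ["rbd", "mirror", "snapshot", "schedule", "add",
         "--pool", pool, "--image", image, interval]
        for image in images
    ]


# Hypothetical pool and image names, used for illustration only.
images = [f"image_{i}" for i in range(26)]
commands = schedule_commands("mirror_pool", images)

# Print the commands so they can be reviewed before being run by hand.
for argv in commands:
    print(" ".join(argv))
```

Each printed line can be run as-is on a node with access to the cluster; passing the lists to `subprocess.run` would be an alternative.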

Created and deleted 25-75 images with a snapshot schedule on one of the clusters.

After observing a backlog of mirror snapshot copies to the peer clusters, moved the rbd-mirror daemon on both clusters to a node with higher network capacity. Also relocated the mirroring daemon in the cluster whose images had the issue to another host.

Scaled up the number of monitors in the cluster with primary images (the one with the above issues), appending public_network.

Observed that the cluster with 102 primary images + 26 secondary images reported all images in an error state (some images fluctuating between stopping_replay, stopped, and error).

Mirror image descriptions were:
failed to refresh remote image
failed to unlink local peer from remote image
stopping replay
stopped
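
A minimal sketch of how the affected secondary images could be listed from collected status output. It assumes each entry is a dict with `name`, `state`, and `description` keys, as in the JSON form of `rbd mirror image status`; treat these key names as an assumption and check them against the actual output. The sample entries are illustrative and are not data from this cluster. Note that primary images normally report `up+stopped`, so the check is meant for secondary images only.

```python
HEALTHY_STATE = "up+replaying"  # expected state for a healthy secondary image


def unhealthy_images(statuses):
    """Return (name, state, description) for each image not in HEALTHY_STATE."""
    return [
        (s.get("name", "?"), s.get("state", "unknown"), s.get("description", ""))
        for s in statuses
        if s.get("state") != HEALTHY_STATE
    ]


# Illustrative sample, not taken from the affected cluster.
sample = [
    {"name": "image_0", "state": "up+replaying", "description": "replaying"},
    {"name": "image_1", "state": "up+error",
     "description": "failed to refresh remote image"},
    {"name": "image_2", "state": "up+stopping_replay",
     "description": "stopping replay"},
    {"name": "image_3", "state": "up+stopped", "description": "stopped"},
]

for name, state, description in unhealthy_images(sample):
    print(f"{name}: {state} ({description})")
```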

Version-Release number of selected component (if applicable):
16.2.8-65.el8cp

How reproducible:
Tried once

Steps to Reproduce:
(Mentioned in description)

Actual results:
All secondary images report an error (some images fluctuate between stopping_replay, stopped, and error).

Expected results:
Secondary images should be up+replaying

Additional info: