1424687 – [rbd-mirror] : after split-brain is detected, unable to resync image using 'rbd mirror image resync <image-spec>'

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RBD-Mirror
Sub Component:
Version:	2.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	2.2
Assignee:	Ilya Dryomov
QA Contact:	Vasishta
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-02-18 04:43 UTC by Rachana Patel
Modified:	2023-05-25 08:41 UTC
CC List:	7 users
Fixed In Version:	RHEL: ceph-10.2.5-28.el7cp Ubuntu: ceph_10.2.5-20redhat1xenial
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-03-14 15:49:45 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	18191	None	None	None	2017-02-18 13:30:35 UTC
Red Hat Issue Tracker	RHCEPH-6736	None	None	None	2023-05-25 08:41:45 UTC
Red Hat Product Errata	RHBA-2017:0514	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.2 bug fix and enhancement update	2017-03-21 07:24:26 UTC

Description Rachana Patel 2017-02-18 04:43:51 UTC

Description of problem:
=======================
After failback, an attempt to resync the image on the second secondary site (site C) did not sync the image, because a split-brain was detected.


Version-Release number of selected component (if applicable):
=============================================================
10.2.5-13.el7cp.x86_64


How reproducible:
=================
always


Steps to Reproduce:
===================
1. Set up 3 clusters: site A is the primary; site B and site C are secondary sites
(site B has a bidirectional relationship with A, while C has a one-directional one).
2. Enable pool-level or image-level mirroring for a few images.
3. Create images and let them sync to the secondaries (A->B, A->C).
4. Demote the images on site A. Promote the mirrored images on site B.
5. Do some I/O on the images from site B.
6. Let the data sync from site B to site A (B->A).
7. Demote the images on site B and promote the images on site A.
8. Issue the resync command on site C to sync the images with site A.


[root@magna099 ubuntu]# rbd mirror image resync i-mirror/image1 --cluster slave2
Flagged image for resync from primary
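
The failover and failback sequence in steps 4 to 8 can be sketched as the shell commands below. This is an illustrative outline, not the exact commands used in this report: the cluster names `master` (site A) and `slave1` (site B) are assumptions, while `slave2` (site C), the pool `i-mirror`, and the image `image1` come from the transcript above. The function names are hypothetical. By default the script only prints the commands; set `RBD=rbd` to run them.

```shell
#!/bin/sh
# Sketch of reproduction steps 4-8.
# ASSUMPTION: "master" (site A) and "slave1" (site B) are made-up cluster
# names; "slave2" (site C), the pool and the image come from the report.
RBD="${RBD:-echo rbd}"            # dry run by default; export RBD=rbd to execute
IMAGE="${IMAGE:-i-mirror/image1}"
SITE_A="${SITE_A:-master}"
SITE_B="${SITE_B:-slave1}"
SITE_C="${SITE_C:-slave2}"

failover_to_b() {
    # Step 4: demote on A first, then promote on B
    $RBD mirror image demote "$IMAGE" --cluster "$SITE_A"
    $RBD mirror image promote "$IMAGE" --cluster "$SITE_B"
}

failback_to_a() {
    # Step 7: demote on B, then promote on A again
    $RBD mirror image demote "$IMAGE" --cluster "$SITE_B"
    $RBD mirror image promote "$IMAGE" --cluster "$SITE_A"
}

resync_on_c() {
    # Step 8: flag the copy on C for a resync from the primary
    $RBD mirror image resync "$IMAGE" --cluster "$SITE_C"
}

failover_to_b
# Steps 5-6 happen here: write to the image on B and wait for B->A to sync.
failback_to_a
resync_on_c
```

In a real run each phase needs time to settle before the next one starts; the sketch does not wait for replication.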





Actual results:
===============
Checking the image status on site C shows an error:
[root@magna099 ubuntu]#  rbd mirror image status i-mirror/image1 --cluster slave2
image1:
  global_id:   37ce0781-30ef-416f-9c3a-5ed6124b55ec
  state:       up+error
  description: error bootstrapping replay
  last_update: 2017-02-17 17:45:28
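
When scripting a check for this failure, the `state` field of the status output can be extracted and compared against the healthy value. The helper below is a minimal sketch that assumes the output layout shown above (a `state:` line carrying a single value). The function names are hypothetical, and treating `up+replaying` as the healthy state of a non-primary image is an assumption of this sketch.

```shell
#!/bin/sh
# mirror_state: read `rbd mirror image status` output on stdin and print
# the value of the first "state:" line.
mirror_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# is_replaying: succeed only if the state read from stdin is up+replaying.
# ASSUMPTION: up+replaying is what a healthy non-primary image reports.
is_replaying() {
    [ "$(mirror_state)" = "up+replaying" ]
}

# Example input: the status output captured in this report.
sample='image1:
  global_id:   37ce0781-30ef-416f-9c3a-5ed6124b55ec
  state:       up+error
  description: error bootstrapping replay
  last_update: 2017-02-17 17:45:28'

printf '%s\n' "$sample" | mirror_state
if printf '%s\n' "$sample" | is_replaying; then
    echo "image is replaying"
else
    echo "image is not replaying; resync did not take effect"
fi
```

Against a live cluster the same helper could be fed directly, for example `rbd mirror image status i-mirror/image1 --cluster slave2 | mirror_state`.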


Expected results:
=================
The resync on site C should complete and the image should sync with site A. Instead, image resync is not happening.

Comment 6 Federico Lucifredi 2017-02-21 00:38:53 UTC

In the repro, step 1 reads "(site B has bidirectional relation with A while C has one-directional)". What is the difference there?

Comment 7 Federico Lucifredi 2017-02-21 00:42:17 UTC

Multiple secondaries are not a blocker for release 2.2.

Comment 8 Jason Dillaman 2017-02-21 00:48:34 UTC

@Federico: "site B has bidirectional relation with A while C has one-directional" means site B was configured to mirror primary images from site A and site A was configured to mirror primary images from site B. Site C was configured to only mirror primary images from site A.
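
In terms of peer registration, that topology might look like the sketch below. It is illustrative only and was not posted as part of this bug: it assumes pool-mode mirroring on the `i-mirror` pool, a `client.admin` user, and the hypothetical cluster names `master` (site A) and `slave1` (site B); only `slave2` (site C) appears in this report. The `add_peer` helper is also hypothetical. The script prints the commands by default; set `RBD=rbd` to run them.

```shell
#!/bin/sh
# Sketch of the peer layout: A <-> B bidirectional, A -> C one-directional.
# ASSUMPTION: cluster names "master"/"slave1", the client.admin user and
# pool-mode mirroring are illustrative choices, not taken from the report.
RBD="${RBD:-echo rbd}"                  # dry run by default
POOL="${POOL:-i-mirror}"
USER_NAME="${USER_NAME:-client.admin}"

# add_peer <local-cluster> <remote-cluster>
# Registers <remote-cluster> as a peer of the pool on <local-cluster>, so the
# rbd-mirror daemon on <local-cluster> can mirror primary images from it.
add_peer() {
    $RBD mirror pool peer add "$POOL" "$USER_NAME@$2" --cluster "$1"
}

# Mirroring must be enabled on the pool in every cluster involved.
for site in master slave1 slave2; do
    $RBD mirror pool enable "$POOL" pool --cluster "$site"
done

add_peer slave1 master   # B mirrors primary images from A
add_peer master slave1   # A mirrors primary images from B (bidirectional)
add_peer slave2 master   # C mirrors primary images from A only
```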

This resync issue is an issue regardless of whether or not multiple secondaries are in-use should you hit a split-brain condition.

Comment 9 Federico Lucifredi 2017-02-21 00:51:31 UTC

Thanks Jason, understood.

One-directional A->B with an optional A->C is the key use case. 

If we can get bi-directional A->B and B->A for different images/pools in this release, that is great. Do not worry about multiple secondaries at this late stage, we can punt those bugs to 2.3.

Comment 13 Rachana Patel 2017-02-23 03:09:45 UTC

Executed the cases below to verify the defect

precondition
============
--> Have 3 clusters. Site A is the primary; site B and site C are secondary sites
(site B has a bidirectional relationship with A, while C has a one-directional one)
--> Enable pool-level or image-level mirroring for a few images.
--> Create images and let them sync to the secondaries (A->B, A->C).


1) orderly shutdown
a)failover
 --> demote image on A, promote image on B
 --> shutdown cluster A
 --> I/O on image from cluster B

b)Failback
 --> bring up cluster A and let image sync to A
 --> demote image on B , promote image on A
 --> resync image on C
 --> do I/O on image from cluster A and let it sync to cluster B & C

2) non-orderly shutdown
a)failover
 --> bring down cluster A
 --> force promote image on B
 --> **WORKAROUND** - restart rbd-mirror on cluster B
 --> do I/O on image from cluster B

b)Failback
 --> bring cluster A back
 --> demote Image on A, resync Image on A
 --> demote image on cluster B, promote image on cluster A
 --> resync image from cluster C


Resync worked in both cases, hence moving back to VERIFIED.
Verified with version 10.2.5-29.el7cp.x86_64
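
The non-orderly case (2) differs from the orderly one mainly in the forced promotion and the extra resync on cluster A. A rough sketch of that command sequence follows. The cluster names `master` (A) and `slave1` (B) are assumptions, `slave2` (C) and the image spec are taken from the original description, and the function names are hypothetical. The workaround of restarting rbd-mirror on cluster B is left as a comment because the service name depends on the deployment. The script prints the commands by default; set `RBD=rbd` to run them.

```shell
#!/bin/sh
# Sketch of case 2 (non-orderly shutdown).
# ASSUMPTION: "master" (A) and "slave1" (B) are made-up cluster names.
RBD="${RBD:-echo rbd}"            # dry run by default
IMAGE="${IMAGE:-i-mirror/image1}"
SITE_A="${SITE_A:-master}"
SITE_B="${SITE_B:-slave1}"
SITE_C="${SITE_C:-slave2}"

nonorderly_failover() {
    # Cluster A is down, so the image cannot be demoted there first;
    # force-promote the copy on B instead.
    $RBD mirror image promote --force "$IMAGE" --cluster "$SITE_B"
    # WORKAROUND from the comment above: restart rbd-mirror on cluster B
    # here (deployment-specific, not shown).
}

nonorderly_failback() {
    # A is back with a stale primary: demote it and resync it from B.
    $RBD mirror image demote "$IMAGE" --cluster "$SITE_A"
    $RBD mirror image resync "$IMAGE" --cluster "$SITE_A"
    # Once A has caught up, swap the roles back.
    $RBD mirror image demote "$IMAGE" --cluster "$SITE_B"
    $RBD mirror image promote "$IMAGE" --cluster "$SITE_A"
    # Finally resync the copy on C from the primary on A.
    $RBD mirror image resync "$IMAGE" --cluster "$SITE_C"
}

nonorderly_failover
# I/O on the image from cluster B happens here.
nonorderly_failback
```

The sketch does not wait for replication between steps; in practice the image status on each cluster should be checked before moving on.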

Comment 14 Rachana Patel 2017-02-23 14:29:33 UTC

(In reply to Rachana Patel from comment #13)
> Executed the cases below to verify the defect
> 
> precondition
> ============
> --> have 3 cluster. Site A being primary and Site B and site C are secondary
> sites
> (site B has bidirectional relation with A while C has one-directional)
> --> enable pool level or image level mirroring for few images.
> --> create images and let it sync to secondary.(A->B, A->C)
> 
> 
> 1) orderly shutdown
> a)failover
>  --> demote image on A, promote image on B
>  --> shutdown cluster A
>  --> I/O on image from cluster B
> 
> b)Failback
>  --> bring up cluster A and let image sync to A
>  --> demote image on B , promote image on A
>  --> resync image on C
this should be 'resync image on cluster C from cluster A'
>  --> do I/O on image from cluster A and let it sync to cluster B & C
> 
> 2) nonorderly shutdown
> a)failover
>  --> bring down cluster A
>  --> force promote image on B
>  --> **WORKAROUND** - restart rbd-mirror on cluster B
>  --> do I/O on image from cluster B
> 
> b)Failback
>  --> bring cluster A back
>  --> demote Image on A, resync Image on A
>  --> demote image on cluster B, promote image on cluster A
>  --> resync image from cluster C
it should be 'resync image from cluster A to cluster C'
> 
> 
> resync worked in both cases, hence moving back to verified
> verified with version - 10.2.5-29.el7cp.x86_64

Comment 16 errata-xmlrpc 2017-03-14 15:49:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html
