1448066 – [rbd-mirror] : Hitting split-brain during failback after orderly shutdown

Summary: [rbd-mirror] : Hitting split-brain during failback after orderly shutdown

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RBD
Sub Component:
Version:	2.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	2.3
Assignee:	Jason Dillaman
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-05-04 13:10 UTC by Vasishta
Modified:	2022-02-21 18:41 UTC
CC List:	3 users
Fixed In Version:	RHEL: ceph-10.2.7-15.el7 Ubuntu: ceph_10.2.7-17redhat1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-06-19 13:32:49 UTC
Embargoed:
Dependent Products:

Attachments
File contains a snippet of log from both clusters in mirroring setup (9.34 KB, text/plain) 2017-05-04 13:10 UTC, Vasishta

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	19858	None	None	None	2017-05-04 16:32:00 UTC
Red Hat Issue Tracker	RHCEPH-3516	None	None	None	2022-02-21 18:41:09 UTC
Red Hat Product Errata	RHBA-2017:1497	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.3 bug fix and enhancement update	2017-06-19 17:24:11 UTC

Description Vasishta 2017-05-04 13:10:44 UTC

Created attachment 1276341
File contains a snippet of log from both clusters in mirroring setup

Description of problem:
Some images report split-brain when failing back after an orderly shutdown.

Version-Release number of selected component (if applicable):
ceph version 10.2.7-13.el7cp


Steps followed:
1. Set up mirroring between two clusters.
2. Created images and attached them to VMs as disks.
3. Demoted the images to secondary on the local cluster and promoted them to primary on the remote cluster.
4. Performed an orderly shutdown of all nodes of the primary (local) cluster.
5. Accessed the images from VMs created in the secondary cluster.
6. Powered on all nodes of the primary cluster, then promoted the images back to primary on the local cluster and demoted them to secondary on the remote cluster.
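
The demote/promote sequence in steps 3 and 6 can be sketched roughly as below. This is only an illustration under assumptions: the local cluster name "One" is hypothetical (only "Two" appears in this report), the demote-before-promote ordering is assumed, and the functions are not the exact commands that were run.

```shell
# Sketch of the orderly failover (step 3) and failback (step 6).
# ASSUMPTIONS: the local cluster is named "One" (hypothetical); the remote
# cluster is "Two" as in the report; images are the two shown in the report.
LOCAL_CLUSTER=${LOCAL_CLUSTER:-One}
REMOTE_CLUSTER=${REMOTE_CLUSTER:-Two}
IMAGES=${IMAGES:-"data1/vmdisk1 data2/vmdisk2"}

failover() {
    # Demote on the local cluster first, then promote on the remote cluster.
    for img in $IMAGES; do
        rbd mirror image demote "$img" --cluster "$LOCAL_CLUSTER" || return 1
        rbd mirror image promote "$img" --cluster "$REMOTE_CLUSTER" || return 1
    done
}

failback() {
    # Reverse direction: demote on the remote cluster, promote on the local one.
    for img in $IMAGES; do
        rbd mirror image demote "$img" --cluster "$REMOTE_CLUSTER" || return 1
        rbd mirror image promote "$img" --cluster "$LOCAL_CLUSTER" || return 1
    done
}
```

In a real run, the demotion has to propagate to the peer before the promote succeeds, so some wait between the two calls is likely needed.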

Actual results:

$ for i in {data1/vmdisk1,data2/vmdisk2};do  sudo rbd mirror image status $i --cluster Two;done
vmdisk1:
  global_id:   5a7495f1-9cdc-4d6f-b920-6a11a6b988d9
  state:       up+error
  description: split-brain detected
  last_update: 2017-05-04 12:21:26
vmdisk2:
  global_id:   91be31d3-5795-47b6-b917-22aa37f8d5f4
  state:       up+error
  description: split-brain detected
  last_update: 2017-05-04 12:21:31

$ for i in {data1/vmdisk1,data2/vmdisk2};do  sudo rbd info $i --cluster Two;done
rbd image 'vmdisk1':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.2a2cc56d2c3f7
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
	flags: 
	journal: 2a2cc56d2c3f7
	mirroring state: enabled
	mirroring global id: 5a7495f1-9cdc-4d6f-b920-6a11a6b988d9
	mirroring primary: false
rbd image 'vmdisk2':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.2a2cd1b539fa5
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
	flags: 
	journal: 2a2cd1b539fa5
	mirroring state: enabled
	mirroring global id: 91be31d3-5795-47b6-b917-22aa37f8d5f4
	mirroring primary: false

Expected results:
Split-brain should not be reported after an orderly failback.
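
A small helper along these lines could flag the condition automatically. It is a sketch only: the function name `check_split_brain` is hypothetical, and it simply searches the `rbd mirror image status` output for the description string shown in the actual results.

```shell
# Hypothetical helper: report every image whose mirror status shows split-brain.
# Usage: check_split_brain CLUSTER IMAGE [IMAGE...]
# Returns 0 if no image reports split-brain, 1 otherwise.
check_split_brain() {
    local cluster=$1
    shift
    local rc=0 img
    for img in "$@"; do
        # Match the description text seen in the report's status output.
        if rbd mirror image status "$img" --cluster "$cluster" \
                | grep -q 'split-brain detected'; then
            echo "split-brain: $img"
            rc=1
        fi
    done
    return $rc
}
```

For the setup in this report it would be called as `check_split_brain Two data1/vmdisk1 data2/vmdisk2`.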

Additional info -
Tried reproducing the issue on the same images after resyncing them manually with 'rbd mirror image resync'. Hit the same issue again.
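
For reference, the manual resync mentioned above might look like the sketch below. The function name `resync_images` is hypothetical, and running it against cluster Two is an assumption based on the status output, where the images on Two are non-primary.

```shell
# Hypothetical wrapper: request a resync for each listed image.
# Usage: resync_images CLUSTER IMAGE [IMAGE...]
# ASSUMPTION: CLUSTER is the side holding the non-primary copy of the image.
resync_images() {
    local cluster=$1
    shift
    local img
    for img in "$@"; do
        rbd mirror image resync "$img" --cluster "$cluster" || return 1
    done
}
```

Example: `resync_images Two data1/vmdisk1 data2/vmdisk2`.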

Comment 4 Vasishta 2017-05-04 14:35:34 UTC

Hi,

One thing I forgot to mention in the summary: both clusters were configured for IPv6, and IPv4 was disabled on the mirroring nodes.

Regards,
Vasishta

Comment 5 Jason Dillaman 2017-05-04 16:32:00 UTC

It looks like the issue occurs when you gracefully fail back an unmodified image. We didn't catch it in automated testing because we write new data to the image after a failover and then verify that the images remain in sync before and after a failback.
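
A regression check for the missing case could look roughly like the sketch below: fail over, write nothing, fail back, then inspect the mirror status. This is not the upstream test. The function name, the `SETTLE_SECS` pause, and the return codes are all assumptions made for illustration.

```shell
# Hypothetical regression check: fail over and back WITHOUT writing to the image.
# Usage: failback_unmodified IMAGE LOCAL_CLUSTER REMOTE_CLUSTER
# Returns 0 if clean, 1 if split-brain is reported, 2 if an rbd command fails.
# ASSUMPTION: SETTLE_SECS is a crude pause that lets each state change propagate.
failback_unmodified() {
    local img=$1 local_cluster=$2 remote_cluster=$3
    local settle=${SETTLE_SECS:-30}

    # Failover: local becomes non-primary, remote becomes primary.
    rbd mirror image demote "$img" --cluster "$local_cluster" || return 2
    sleep "$settle"
    rbd mirror image promote "$img" --cluster "$remote_cluster" || return 2

    # Deliberately no writes here: the unmodified image is the failing case.
    sleep "$settle"

    # Failback: remote becomes non-primary, local becomes primary again.
    rbd mirror image demote "$img" --cluster "$remote_cluster" || return 2
    sleep "$settle"
    rbd mirror image promote "$img" --cluster "$local_cluster" || return 2
    sleep "$settle"

    # The remote copy is now non-primary; it must not report split-brain.
    if rbd mirror image status "$img" --cluster "$remote_cluster" \
            | grep -q 'split-brain detected'; then
        return 1
    fi
    return 0
}
```

A fixed sleep is used only to keep the sketch short; polling the mirror status until it settles would be more robust.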

Comment 6 Jason Dillaman 2017-05-05 14:56:30 UTC

Upstream PR: https://github.com/ceph/ceph/pull/14977

Comment 12 Vasishta 2017-05-12 14:05:00 UTC

Hi All,

Couldn't reproduce the issue. Moving this bug to verified state.

Regards,
Vasishta

Comment 14 errata-xmlrpc 2017-06-19 13:32:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1497
