Created attachment 1276341
File contains a snippet of log from both clusters in mirroring setup

Description of problem:
Hitting split-brain on some images while performing failback after an orderly shutdown.

Version-Release number of selected component (if applicable):
ceph version 10.2.7-13.el7cp

Steps followed:
1. Set up mirroring between 2 clusters
2. Created images and attached them to VMs as disks
3. Demoted images in local to secondary and promoted images in remote to primary
4. Performed an orderly shutdown on all nodes of the cluster
5. Accessed images from VMs created in secondary node
6. Turned on all nodes in primary, promoted images back to primary in local and demoted them to secondary in remote

Actual results:

$ for i in {data1/vmdisk1,data2/vmdisk2};do sudo rbd mirror image status $i --cluster Two;done
vmdisk1:
  global_id:   5a7495f1-9cdc-4d6f-b920-6a11a6b988d9
  state:       up+error
  description: split-brain detected
  last_update: 2017-05-04 12:21:26
vmdisk2:
  global_id:   91be31d3-5795-47b6-b917-22aa37f8d5f4
  state:       up+error
  description: split-brain detected
  last_update: 2017-05-04 12:21:31

$ for i in {data1/vmdisk1,data2/vmdisk2};do sudo rbd info $i --cluster Two;done
rbd image 'vmdisk1':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.2a2cc56d2c3f7
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
	flags:
	journal: 2a2cc56d2c3f7
	mirroring state: enabled
	mirroring global id: 5a7495f1-9cdc-4d6f-b920-6a11a6b988d9
	mirroring primary: false
rbd image 'vmdisk2':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.2a2cd1b539fa5
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
	flags:
	journal: 2a2cd1b539fa5
	mirroring state: enabled
	mirroring global id: 91be31d3-5795-47b6-b917-22aa37f8d5f4
	mirroring primary: false

Expected results:
Split-brain issue shouldn't appear.

Additional info:
Tried reproducing this issue on the same images after resyncing them manually with 'rbd mirror image resync'. Hit the same issue again.
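
When checking many images for the state reported above, the status output can be filtered for the split-brain description. The following is only a minimal sketch: the function name find_split_brain is hypothetical, and it assumes the plain-text layout of 'rbd mirror image status' seen in this report (an "<image>:" header line followed by "key: value" lines).

```shell
#!/bin/sh
# find_split_brain: read `rbd mirror image status` text on stdin and print
# the name of every image whose description mentions split-brain.
# Hypothetical helper; assumes the plain-text layout shown in this report.
find_split_brain() {
    awk '
        # Header line such as "vmdisk1:" (a single field ending in a colon).
        NF == 1 && /:$/ { name = $1; sub(/:$/, "", name); next }
        # Description line that reports split-brain for the current image.
        $1 == "description:" && /split-brain/ { print name }
    '
}
```

It could be fed from the same loop used in the report, for example by piping the output of the 'rbd mirror image status' loop into find_split_brain; with the output captured here it would list both vmdisk1 and vmdisk2.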
Hi, One thing I missed mentioning in the summary: both clusters were configured for IPv6, and IPv4 was disabled on the mirroring nodes. Regards, Vasishta
Looks like the issue is triggered when you gracefully fail back an unmodified image. We didn't catch it in automated testing because we write new data to the image after a failover and then verify that the images remain in sync before and after a failback.
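
One reading of the trigger described above, as a command sequence: fail over gracefully, write nothing to the image, then fail back gracefully. The sketch below only prints the commands rather than running them. The function name print_failback_sequence is hypothetical, and the cluster names are assumptions supplied by the caller (only "Two" appears in this report, and it is assumed here to be the remote side).

```shell
#!/bin/sh
# print_failback_sequence: print (not run) the rbd commands for a graceful
# failover followed by a graceful failback with no writes in between.
# Arguments: local cluster name, remote cluster name, image spec (pool/image).
# Hypothetical helper; the cluster names passed in are assumptions.
print_failback_sequence() {
    local_cluster=$1
    remote_cluster=$2
    image=$3

    # Graceful failover: demote on the local cluster, promote on the remote.
    echo "rbd mirror image demote $image --cluster $local_cluster"
    echo "rbd mirror image promote $image --cluster $remote_cluster"

    # No data is written to the image here; that is the case the automated
    # tests did not cover.

    # Graceful failback: demote on the remote cluster, promote on the local.
    echo "rbd mirror image demote $image --cluster $remote_cluster"
    echo "rbd mirror image promote $image --cluster $local_cluster"

    # Check the mirroring state on the remote side afterwards.
    echo "rbd mirror image status $image --cluster $remote_cluster"
}
```

For example, print_failback_sequence One Two data1/vmdisk1 prints five commands, where "One" is a placeholder name for the local cluster.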
Upstream PR: https://github.com/ceph/ceph/pull/14977
Hi All, Couldn't reproduce the issue. Moving this bug to verified state. Regards, Vasishta
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1497