Bug 1343941 - Hitting a Split-Brain when multiple images are getting synced in parallel
Summary: Hitting a Split-Brain when multiple images are getting synced in parallel
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Jason Dillaman
QA Contact: Tanay Ganguly
URL:
Whiteboard:
Depends On:
Blocks: 1343229
Reported: 2016-06-08 11:29 UTC by Tanay Ganguly
Modified: 2017-07-30 15:30 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-14 11:41:17 UTC
Embargoed:


Attachments
Master Node log (2.67 MB, text/plain), 2016-06-08 11:29 UTC, Tanay Ganguly
New Resync Log (2.11 MB, application/x-gzip), 2016-06-08 11:30 UTC, Tanay Ganguly
Older Resync file (2.06 MB, application/x-gzip), 2016-06-08 11:32 UTC, Tanay Ganguly


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 16200 0 None None None 2016-06-08 23:55:21 UTC

Description Tanay Ganguly 2016-06-08 11:29:08 UTC
Created attachment 1165928 [details]
Master Node log

Description of problem:
I am hitting a split-brain on the slave node while more than one image is being resynced.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-12.el7cp

How reproducible:
Hit it once

Steps to Reproduce:
1. Create an image on the master node without enabling journaling.
2. Write about 10 GB of data to the image.
3. Once the write completes, enable the journaling feature on the master node (resync starts).
4. On a different, existing image that was already synced with the slave node, disable journaling.
5. Start bench-write on that image, write some new data, and then stop it.
6. Enable journaling on that image again (resync starts):
rbd feature enable RBD/testing3 journaling --cluster master
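
The steps above might be scripted roughly as follows. This is only a sketch: the report quotes just the final `rbd feature enable` command, so the new image's name (`testing-new`), the image size, and the `bench-write` totals are assumptions, and a bounded `--io-total` stands in for "write some data and then stop". It also assumes mirroring is already configured between the two clusters so that enabling journaling triggers a sync. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch only: image names, sizes, and bench-write options are assumptions.
# The one command quoted in the report is the last `rbd feature enable`.
set -euo pipefail

POOL="RBD"               # pool name as it appears in the report
CLUSTER="master"         # cluster name as it appears in the report
OLD_IMAGE="testing3"     # image named in the report (steps 4-6)
NEW_IMAGE="testing-new"  # hypothetical name for the image from step 1
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands, 0 = execute them

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

repro() {
  # Step 1: new image with exclusive-lock (needed by journaling) but no journaling.
  run rbd create "$POOL/$NEW_IMAGE" --size 20480 \
    --image-feature layering --image-feature exclusive-lock --cluster "$CLUSTER"
  # Step 2: write about 10 GB to the new image.
  run rbd bench-write "$POOL/$NEW_IMAGE" --io-total 10G --cluster "$CLUSTER"
  # Step 3: enable journaling on the new image (first sync starts).
  run rbd feature enable "$POOL/$NEW_IMAGE" journaling --cluster "$CLUSTER"
  # Step 4: disable journaling on the existing, already-synced image.
  run rbd feature disable "$POOL/$OLD_IMAGE" journaling --cluster "$CLUSTER"
  # Step 5: write new data to the existing image (assumed amount).
  run rbd bench-write "$POOL/$OLD_IMAGE" --io-total 1G --cluster "$CLUSTER"
  # Step 6: re-enable journaling on the existing image (second sync starts).
  run rbd feature enable "$POOL/$OLD_IMAGE" journaling --cluster "$CLUSTER"
}

repro
```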

Actual results:
After enabling journaling again, I see a split-brain.
At this point both images were trying to sync.

The older image sync stopped at 55% (see step 3).
The new sync reported a split-brain (see step 6).

Expected results:
There should not be a split-brain

Additional info:
Logs of both resyncs (new and old) from the slave node
Log of the master node
--------------------------------------------------------------------------------

systemctl status -l ceph-rbd-mirror@master
● ceph-rbd-mirror - Ceph rbd mirror daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-06-08 15:14:30 IST; 1h 42min ago
 Main PID: 86496 (rbd-mirror)
   CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror
           └─86496 /usr/bin/rbd-mirror -f --cluster master --id master --setuser ceph --setgroup ceph

Jun 08 15:14:30 cephqe3.lab.eng.blr.redhat.com systemd[1]: Started Ceph rbd mirror daemon.
Jun 08 15:14:30 cephqe3.lab.eng.blr.redhat.com systemd[1]: Starting Ceph rbd mirror daemon...
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.217415 7f4fa57fa700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f4f7408bb10 handle_get_remote_tag_class: failed to retrieve remote client: (2) No such file or directory
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.217475 7f4fd569f700 -1 rbd::mirror::ImageReplayer: 0x7f4f7400c650 [1/e923d0ee-37b7-483e-9621-ecb70c545eee] operator(): start failed: (2) No such file or directory
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.230753 7f4fb7fff700 -1 JournalMetadata: operator(): failed to watch journal(2) No such file or directory
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.230778 7f4fb7fff700 -1 JournalMetadata: failed to initialize immutable metadata: (2) No such file or directory


systemctl status -l ceph-rbd-mirror@slave
● ceph-rbd-mirror - Ceph rbd mirror daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-06-08 10:58:10 UTC; 11min ago
 Main PID: 678 (rbd-mirror)
   CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror
           └─678 /usr/bin/rbd-mirror -f --cluster slave --id slave --setuser ceph --setgroup ceph

Jun 08 11:07:34 magna003 rbd-mirror[678]: 2016-06-08 11:07:34.512360 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b740019f0 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:07:34 magna003 rbd-mirror[678]: 2016-06-08 11:07:34.827056 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:08:05 magna003 rbd-mirror[678]: 2016-06-08 11:08:05.371161 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74006240 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:08:05 magna003 rbd-mirror[678]: 2016-06-08 11:08:05.695272 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:08:49 magna003 rbd-mirror[678]: 2016-06-08 11:08:49.085661 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74003160 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:08:49 magna003 rbd-mirror[678]: 2016-06-08 11:08:49.549062 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:09:21 magna003 rbd-mirror[678]: 2016-06-08 11:09:21.401637 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74003160 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:09:30 magna003 rbd-mirror[678]: 2016-06-08 11:09:30.704921 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:09:48 magna003 rbd-mirror[678]: 2016-06-08 11:09:48.849965 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74005d30 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:09:49 magna003 rbd-mirror[678]: 2016-06-08 11:09:49.190766 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
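
For triage, journal output like the above can be summarized with a couple of small filters. This is a minimal sketch, assuming the message formats match the lines pasted in this report; the function names `count_split_brain` and `failed_starts` are hypothetical helpers, not part of Ceph.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, and the patterns assume the
# rbd-mirror message formats shown in the log excerpt above.
set -euo pipefail

# Count "split-brain detected" messages read from stdin.
count_split_brain() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c 'split-brain detected' || true
}

# Summarize ImageReplayer start failures read from stdin, grouped by the
# bracketed image identifier and the error text, with an occurrence count.
failed_starts() {
  sed -n 's/.*ImageReplayer: [^ ]* \[\([^]]*\)\] operator(): start failed: \(.*\)$/\1 \2/p' \
    | sort | uniq -c
}
```

Either function reads log text on stdin, so a saved log or the service journal can be piped in, for example `journalctl -u ceph-rbd-mirror@slave | failed_starts` (the unit name here is taken from the `systemctl status` command above).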

Comment 2 Tanay Ganguly 2016-06-08 11:30:57 UTC
Created attachment 1165929 [details]
New Resync Log

This is a tar file; please rename it to open it.

Comment 3 Tanay Ganguly 2016-06-08 11:32:01 UTC
Created attachment 1165930 [details]
Older Resync file

This is a tar file; please rename it to open it.

Comment 4 Jason Dillaman 2016-06-09 13:52:09 UTC
@Tanay: I am having a hard time understanding your test case.  Can you provide the exact commands you ran?

Comment 5 Jason Dillaman 2016-06-09 14:34:23 UTC
@Tanay: also, how were these logs generated?  The "older" log (which shows the in-progress sync @ 55%) just abruptly ends mid-sync.  Did rbd-mirror crash?

Comment 12 Jason Dillaman 2016-06-14 11:41:17 UTC
Moving to CLOSED/NOTABUG for now since the only way to reproduce it was to put the system in an inconsistent state. In such a case, it is expected to see this behavior. If a similar issue appears after retesting BZ #1344274, we can re-evaluate and open a new BZ.

