Bug 1343941

Summary: Hitting a Split-Brain when multiple images are getting synced in parallel
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tanay Ganguly <tganguly>
Component: RBD
Assignee: Jason Dillaman <jdillama>
Status: CLOSED NOTABUG
QA Contact: Tanay Ganguly <tganguly>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 2.0
CC: ceph-eng-bugs, hnallurv, hyelloji, kurs, mlawrenc, tganguly
Target Milestone: rc
Target Release: 2.0
Hardware: x86_64
OS: Linux
Last Closed: 2016-06-14 11:41:17 UTC
Type: Bug
Bug Blocks: 1343229    
Attachments (description, flags):
Master Node log (flags: none)
New Resync Log (flags: none)
Older Resync file (flags: none)

Description Tanay Ganguly 2016-06-08 11:29:08 UTC
Created attachment 1165928 [details]
Master Node log

Description of problem:
I am hitting a split-brain on the Slave Node while more than one image is being resynced.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-12.el7cp

How reproducible:
Hit it once

Steps to Reproduce:
1. Create an image on the Master Node; do not enable journaling.
2. Write about 10G of data to the image.
3. Once the write completes, enable the journaling feature on the Master Node (resync starts).
4. On a different, already existing image that had previously synced with the Slave Node, disable journaling.
5. Start bench-write on that image, write some new data, and then stop it.
6. Enable journaling on that image again (resync starts):
rbd feature enable RBD/testing3 journaling --cluster master
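
The steps above quote only the final `rbd feature enable` call, so the following is a sketch of one possible reading of them, not the reporter's actual procedure. The image name `RBD/testing1`, the image size, the use of bench-write for step 2, and the 1 GiB total in step 5 are all hypothetical; `RBD/testing3` and `--cluster master` come from the report. The script only builds and prints the command lines (a dry run) and does not execute anything.

```python
import shlex

CLUSTER = "master"            # cluster name used in the report
NEW_IMAGE = "RBD/testing1"    # hypothetical: the report does not name this image
OLD_IMAGE = "RBD/testing3"    # image named in the report's quoted command
GIB = 1024 ** 3


def rbd(*args):
    """Build an rbd argv list that targets the master cluster."""
    return ["rbd", *args, "--cluster", CLUSTER]


def reproduction_commands():
    """Return (step, argv) pairs for one possible reading of steps 1-6."""
    return [
        # Step 1: create the image (size in MB is a placeholder); journaling
        # is assumed not to be part of the default feature set.
        (1, rbd("create", "--size", "20480", NEW_IMAGE)),
        # Step 2: write about 10G; bench-write is an assumption, the report
        # does not say which tool produced the data.
        (2, rbd("bench-write", NEW_IMAGE, "--io-total", str(10 * GIB))),
        # Step 3: enable journaling, which starts the first sync.
        (3, rbd("feature", "enable", NEW_IMAGE, "journaling")),
        # Step 4: disable journaling on the already mirrored image.
        (4, rbd("feature", "disable", OLD_IMAGE, "journaling")),
        # Step 5: write new data while journaling is off (amount is a placeholder).
        (5, rbd("bench-write", OLD_IMAGE, "--io-total", str(1 * GIB))),
        # Step 6: re-enable journaling, as quoted in the report.
        (6, rbd("feature", "enable", OLD_IMAGE, "journaling")),
    ]


if __name__ == "__main__":
    for step, argv in reproduction_commands():
        print(f"step {step}: {shlex.join(argv)}")
```

Keeping the commands as data makes it easy to compare this reading against the exact commands once they are supplied.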

Actual results:
After enabling journaling again, I see a split-brain.
At that point both images were trying to sync.

The older image's sync stopped at 55% (see step 3).
The new sync failed, reporting split-brain (see step 6).

Expected results:
There should not be a split-brain

Additional info:
Logs of both resyncs (new and old) from the Slave Node
Log of the Master Node
--------------------------------------------------------------------------------

systemctl status -l ceph-rbd-mirror@master
● ceph-rbd-mirror - Ceph rbd mirror daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-06-08 15:14:30 IST; 1h 42min ago
 Main PID: 86496 (rbd-mirror)
   CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror
           └─86496 /usr/bin/rbd-mirror -f --cluster master --id master --setuser ceph --setgroup ceph

Jun 08 15:14:30 cephqe3.lab.eng.blr.redhat.com systemd[1]: Started Ceph rbd mirror daemon.
Jun 08 15:14:30 cephqe3.lab.eng.blr.redhat.com systemd[1]: Starting Ceph rbd mirror daemon...
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.217415 7f4fa57fa700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f4f7408bb10 handle_get_remote_tag_class: failed to retrieve remote client: (2) No such file or directory
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.217475 7f4fd569f700 -1 rbd::mirror::ImageReplayer: 0x7f4f7400c650 [1/e923d0ee-37b7-483e-9621-ecb70c545eee] operator(): start failed: (2) No such file or directory
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.230753 7f4fb7fff700 -1 JournalMetadata: operator(): failed to watch journal(2) No such file or directory
Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.230778 7f4fb7fff700 -1 JournalMetadata: failed to initialize immutable metadata: (2) No such file or directory


systemctl status -l ceph-rbd-mirror@slave
● ceph-rbd-mirror - Ceph rbd mirror daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-06-08 10:58:10 UTC; 11min ago
 Main PID: 678 (rbd-mirror)
   CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror
           └─678 /usr/bin/rbd-mirror -f --cluster slave --id slave --setuser ceph --setgroup ceph

Jun 08 11:07:34 magna003 rbd-mirror[678]: 2016-06-08 11:07:34.512360 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b740019f0 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:07:34 magna003 rbd-mirror[678]: 2016-06-08 11:07:34.827056 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:08:05 magna003 rbd-mirror[678]: 2016-06-08 11:08:05.371161 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74006240 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:08:05 magna003 rbd-mirror[678]: 2016-06-08 11:08:05.695272 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:08:49 magna003 rbd-mirror[678]: 2016-06-08 11:08:49.085661 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74003160 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:08:49 magna003 rbd-mirror[678]: 2016-06-08 11:08:49.549062 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:09:21 magna003 rbd-mirror[678]: 2016-06-08 11:09:21.401637 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74003160 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:09:30 magna003 rbd-mirror[678]: 2016-06-08 11:09:30.704921 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
Jun 08 11:09:48 magna003 rbd-mirror[678]: 2016-06-08 11:09:48.849965 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b74005d30 handle_get_remote_tags: split-brain detected -- skipping image replay
Jun 08 11:09:49 magna003 rbd-mirror[678]: 2016-06-08 11:09:49.190766 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists
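
The slave excerpt above repeats the same two messages several times. As an illustration only, the following sketch tallies rbd-mirror error lines by bracketed image tag and error kind. The regular expressions are assumptions fitted to the lines quoted in this report, not a documented log format, and the helper names (`classify`, `summarize`) are hypothetical.

```python
import errno
import re
from collections import Counter

LINE_RE = re.compile(
    r"rbd-mirror\[\d+\]: \S+ \S+ \S+ -1 "   # date, time, thread id, log level
    r"(?P<component>[\w:]+): "
    r"(?:0x[0-9a-f]+ )?"                     # object address, if any
    r"(?:\[(?P<image>[^\]]+)\] )?"           # bracketed image tag, if any
    r"(?P<message>.*)$"
)
ERRNO_RE = re.compile(r"\((\d+)\) ")         # e.g. "(17) File exists"

# Three lines copied from the report, plus one non-error systemd line.
SAMPLE = [
    "Jun 08 11:07:34 magna003 rbd-mirror[678]: 2016-06-08 11:07:34.512360 7f7baaffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f7b740019f0 handle_get_remote_tags: split-brain detected -- skipping image replay",
    "Jun 08 11:07:34 magna003 rbd-mirror[678]: 2016-06-08 11:07:34.827056 7f7bdaffd700 -1 rbd::mirror::ImageReplayer: 0x7f7b74004370 [1/696a499b-9cc1-44d5-8e08-2b581ef24aba] operator(): start failed: (17) File exists",
    "Jun 08 16:29:03 cephqe3.lab.eng.blr.redhat.com rbd-mirror[86496]: 2016-06-08 16:29:03.230753 7f4fb7fff700 -1 JournalMetadata: operator(): failed to watch journal(2) No such file or directory",
    "Jun 08 15:14:30 cephqe3.lab.eng.blr.redhat.com systemd[1]: Started Ceph rbd mirror daemon.",
]


def classify(line):
    """Return (image tag or "-", error kind), or None for non-error lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    message = match.group("message")
    if "split-brain detected" in message:
        kind = "split-brain"
    else:
        code = ERRNO_RE.search(message)
        # Map the numeric code to its symbolic name, e.g. 17 -> EEXIST.
        kind = errno.errorcode.get(int(code.group(1)), "unknown") if code else "other"
    return match.group("image") or "-", kind


def summarize(lines):
    """Count error lines per (image tag, error kind)."""
    results = (classify(line) for line in lines)
    return Counter(result for result in results if result is not None)


if __name__ == "__main__":
    for (image, kind), count in summarize(SAMPLE).items():
        print(f"{image}\t{kind}\t{count}")
```

Feeding it the output of `systemctl status -l` or the attached logs would show at a glance which image tags hit `EEXIST` and how often split-brain was reported.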

Comment 2 Tanay Ganguly 2016-06-08 11:30:57 UTC
Created attachment 1165929 [details]
New Resync Log

This is a tar file; please rename it in order to open it.

Comment 3 Tanay Ganguly 2016-06-08 11:32:01 UTC
Created attachment 1165930 [details]
Older Resync file

This is a tar file; please rename it in order to open it.

Comment 4 Jason Dillaman 2016-06-09 13:52:09 UTC
@Tanay: I am having a hard time understanding your test case.  Can you provide the exact commands you ran?

Comment 5 Jason Dillaman 2016-06-09 14:34:23 UTC
@Tanay: also, how were these logs generated?  The "older" log (which shows the in-progress sync @ 55%) just abruptly ends mid-sync.  Did rbd-mirror crash?

Comment 12 Jason Dillaman 2016-06-14 11:41:17 UTC
Moving to CLOSED/NOTABUG for now since the only way to reproduce it was to put the system in an inconsistent state. In such a case, it is expected to see this behavior. If a similar issue appears after retesting BZ #1344274, we can re-evaluate and open a new BZ.