Bug 1336755

Summary: [RBD-Mirroring] Images are not getting synced to the Slave Cluster
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tanay Ganguly <tganguly>
Component: RBD
Assignee: Jason Dillaman <jdillama>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Priority: unspecified
Version: 2.0
CC: amaredia, ceph-eng-bugs, hnallurv, hyelloji, kdreyer, kurs
Target Milestone: rc
Target Release: 2.0
Hardware: x86_64
OS: Linux
Fixed In Version: RHEL: ceph-10.2.1-12.el7cp
Last Closed: 2016-08-23 19:38:37 UTC
Type: Bug
Attachments:
  RBD_create.sh
  Slave Log
  Master Log
  Hemanth: rbd-mirror.log

Description Tanay Ganguly 2016-05-17 12:06:35 UTC
Created attachment 1158253 [details]
RBD_create.sh

Description of problem:
I ran a script to create 500 images on the master cluster, but only 51 of them synced to the remote cluster. I am also seeing a crash on both clusters (slave and master).

Version-Release number of selected component (if applicable):
ceph version 10.2.1-1.el7cp 

How reproducible:
Once

Steps to Reproduce:
1. Set up RBD mirroring in pool mode.
2. Ran the attached script (RBD_create.sh). It creates an RBD image, takes a snapshot, protects the snapshot, and then clones it, 500 times over; a rough sketch of an equivalent setup and loop follows this list.
3. After a while, only 51 images had synced to the slave cluster.
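
The attached RBD_create.sh is not reproduced in this report; the following is a minimal sketch of an equivalent pool-mode setup and create/snap/protect/clone loop. The pool name (pool1), cluster names (master/slave), and image size are assumptions, not taken from the attachment:

POOL=pool1

# Enable pool-mode mirroring on both clusters and register each
# cluster as the other's peer.
rbd --cluster master mirror pool enable ${POOL} pool
rbd --cluster slave mirror pool enable ${POOL} pool
rbd --cluster master mirror pool peer add ${POOL} client.admin@slave
rbd --cluster slave mirror pool peer add ${POOL} client.admin@master

# Create 500 images; snapshot, protect, and clone each one.
# journaling (which requires exclusive-lock) makes an image eligible
# for mirroring; layering is required for cloning.
for i in $(seq 1 500); do
    rbd --cluster master create ${POOL}/img_${i} --size 1G \
        --image-feature layering,exclusive-lock,journaling
    rbd --cluster master snap create ${POOL}/img_${i}@snap
    rbd --cluster master snap protect ${POOL}/img_${i}@snap
    rbd --cluster master clone ${POOL}/img_${i}@snap ${POOL}/clone_${i}
done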

Actual results:
A crash is seen on both the master and the slave node.

Expected results:
All images should get synced.


Additional info:
Logs and core dump.

Comment 2 Tanay Ganguly 2016-05-17 12:06:58 UTC
Created attachment 1158254 [details]
Slave Log

Comment 3 Tanay Ganguly 2016-05-17 12:07:20 UTC
Created attachment 1158255 [details]
Master Log

Comment 6 Hemanth Kumar 2016-05-19 12:21:12 UTC
Created attachment 1159452 [details]
Hemanth: rbd-mirror.log

Hi Jason, 

The same segmentation fault is seen again while synchronizing 2 GB of data on a single image.

Description of problem:
-----------------------
Synchronization stopped after transferring a few KB of data to the destination RBD client, and the rbd-mirror daemon on the primary crashed.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
v10.2.1-3 (ceph-10.2.1-3.el7cp)

Steps to Reproduce:
-------------------
1. Configured RBD mirroring between two clusters running freshly installed v10.2.1-3 ceph packages.

2. Created a few images on the primary RBD host; all of them synced to the remote RBD host.

3. Attached one of the images to a VM as a raw disk and mounted it with an XFS filesystem.

4. Copied data onto the mounted path and observed the synchronization; a sketch of steps 3-4 follows this list.
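
A minimal sketch of steps 3-4, assuming the raw disk appears inside the guest as /dev/vdb (the device name, mount point, and data size are assumptions):

# Inside the guest VM -- /dev/vdb is an assumed device name.
mkfs.xfs /dev/vdb
mkdir -p /mnt/data1
mount /dev/vdb /mnt/data1

# Generate roughly 2 GB of write traffic on the mirrored image.
dd if=/dev/urandom of=/mnt/data1/testfile bs=1M count=2048

# From the secondary cluster, watch the replication status.
rbd --cluster slave --pool pool1 mirror pool status --verbose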

Actual results:
---------------
Synchronization stopped after transferring a few KB of data.

Expected results:
------------------
Synchronization should happen flawlessly.

Additional info:
-----------------
2016-05-19 08:08:23.988243 7faae37fe700 -1 *** Caught signal (Segmentation fault) **
 in thread 7faae37fe700 thread_name:fn_anonymous

 ceph version 10.2.1-3.el7cp (f6e1bde2840e1da621601bad87e15fd3f654c01e)
 1: (()+0x3780ea) [0x7fab0ce700ea]
 2: (()+0xf100) [0x7fab023ca100]
 3: (()+0xa80aa) [0x7fab032010aa]
 4: (librados::IoCtx::aio_operate(std::string const&, librados::AioCompletion*, librados::ObjectWriteOperation*)+0x4b) [0x7fab031bab7b]
 5: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::update_mirror_image_status(bool, rbd::mirror::ImageReplayer<librbd::ImageCtx>::State)+0x8c8) [0x7fab0ccc6608]
 6: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::update_mirror_image_status(bool, rbd::mirror::ImageReplayer<librbd::ImageCtx>::State)::{lambda(int)#1}::operator()(int) const+0xd7) [0x7fab0ccc7df7]
 7: (FunctionContext::finish(int)+0x2a) [0x7fab0ccc0afa]
 8: (Context::complete(int)+0x9) [0x7fab0ccb62a9]
 9: (()+0x9d6ed) [0x7fab031f66ed]
 10: (()+0x85af9) [0x7fab031deaf9]
 11: (()+0x16f7d6) [0x7fab032c87d6]
 12: (()+0x7dc5) [0x7fab023c2dc5]
 13: (clone()+0x6d) [0x7fab012a8ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Attached the log file: rbd-mirror.log

Comment 11 Hemanth Kumar 2016-05-23 09:22:56 UTC
Synchronization is also failing on the 10.2.1-6 build.

Here is what I did:
1. Upgraded ceph to 10.2.1-6 and restarted the services.
2. The pending data started to sync.
3. Added more data on the primary image.
4. The new data failed to sync to the secondary RBD host.

On Primary:
-------------
[root@magna020 yum.repos.d]# rbd du data1 --cluster master -p pool1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED 
data1      10240M 4192M 

On Remote:
-----------
[root@magna069 ~]# rbd du data1 --cluster slave -p pool1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED 
data1      10240M 2256M
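
(The fast-diff warning in the du output above is incidental to the sync failure. If desired, the feature can be enabled dynamically; a sketch, assuming the pool/image names from this report, with object-map as a prerequisite of fast-diff:)

rbd --cluster master feature enable pool1/data1 object-map fast-diff
# Rebuild the object map so existing extents are accounted for.
rbd --cluster master object-map rebuild pool1/data1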

RBD log on the remote rbd node:
----------------------------
[root@magna069 ceph]# tailf qemu-guest-16989.log
2016-05-23 07:24:02.883340 7fa734fc5c40  0 set uid:gid to 167:167 (ceph:ceph)
2016-05-23 07:24:02.883371 7fa734fc5c40  0 ceph version 10.2.1-6.el7cp (339d1fb5d73e5f113c8538a45e95b3777038da36), process rbd-mirror, pid 16989
2016-05-23 07:31:04.045929 7fa7251e1700 -1 JournalPlayer: missing prior journal entry: Entry[tag_tid=2, entry_tid=924, data size=16777183]
2016-05-23 07:31:04.045952 7fa7251e1700 -1 rbd-mirror: ImageReplayer[1/5e4e2ae8944a]::handle_replay_complete: replay encountered an error: (42) No message of desired type
2016-05-23 07:32:35.695730 7fa7251e1700 -1 JournalPlayer: missing prior journal entry: Entry[tag_tid=2, entry_tid=956, data size=1054]
2016-05-23 07:32:35.695751 7fa7251e1700 -1 rbd-mirror: ImageReplayer[1/5e4e2ae8944a]::handle_replay_complete: replay encountered an error: (42) No message of desired type
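
The "missing prior journal entry" errors above point at the image journal on the primary; a possible next diagnostic step is a sketch like the following, using the jewel rbd journal subcommands with the pool/image names from this report:

rbd --cluster master journal info --pool pool1 --image data1
rbd --cluster master journal status --pool pool1 --image data1
# Walk the journal entries and report anything unexpected.
rbd --cluster master journal inspect --pool pool1 --image data1 --verbose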

-------------------------------------------

Comment 13 Jason Dillaman 2016-05-23 12:17:44 UTC
@Hemanth: I restarted rbd-mirror and replication continued.
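
(For reference, on a systemd-managed host the restart would look something like this; the instance name "admin" is an assumed client ID, not taken from this report:)

# Restart the rbd-mirror daemon on the secondary cluster's mirror host.
systemctl restart ceph-rbd-mirror@admin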

# rbd --cluster slave --pool pool1 mirror pool status --verbose
health: OK
images: 4 total
    4 replaying

data1:
  global_id:   037458d0-4516-4b68-908f-ab6fce7de7a7
  state:       up+replaying
  description: replaying, master_position=[object_number=287, tag_tid=2, entry_tid=1491], mirror_position=[object_number=287, tag_tid=2, entry_tid=1491], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data3:
  global_id:   8f65f371-a354-43c1-914a-08795c192192
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data2:
  global_id:   aaaddb15-6cb1-49dd-83fc-e253600a26f9
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data4:
  global_id:   d1938591-a852-4773-ab13-d7cba7f8f0d3
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

# rbd --cluster slave --pool pool1 du data1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED 
data1      10240M 4192M

Comment 14 Ken Dreyer (Red Hat) 2016-05-24 21:44:57 UTC
https://github.com/ceph/ceph/pull/9282 needs to be merged to master, then cherry-picked downstream.

Comment 19 Hemanth Kumar 2016-06-27 12:11:25 UTC
This crash was not seen in the v10.2.2-5 build.
Moving to the verified state.

Comment 21 errata-xmlrpc 2016-08-23 19:38:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html