Bug 1336755 - [RBD-Mirroring] Images are not getting synced to the Slave Cluster
Summary: [RBD-Mirroring] Images are not getting synced to the Slave Cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.0
Assignee: Jason Dillaman
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-17 12:06 UTC by Tanay Ganguly
Modified: 2017-07-30 15:28 UTC (History)
6 users

Fixed In Version: RHEL: ceph-10.2.1-12.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:38:37 UTC
Target Upstream Version:


Attachments (Terms of Use)
RBD_create.sh (455 bytes, application/x-shellscript)
2016-05-17 12:06 UTC, Tanay Ganguly
no flags Details
Slave Log (42.13 KB, text/plain)
2016-05-17 12:06 UTC, Tanay Ganguly
no flags Details
Master Log (59.35 KB, text/plain)
2016-05-17 12:07 UTC, Tanay Ganguly
no flags Details
Hemanth : rbd-mirror.log (2.42 MB, text/plain)
2016-05-19 12:21 UTC, Hemanth Kumar
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 14937 0 None None None 2016-05-17 15:52:42 UTC
Ceph Project Bug Tracker 15909 0 None None None 2016-05-17 16:05:54 UTC
Ceph Project Bug Tracker 15993 0 None None None 2016-05-23 12:16:29 UTC
Red Hat Product Errata RHBA-2016:1755 0 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description Tanay Ganguly 2016-05-17 12:06:35 UTC
Created attachment 1158253 [details]
RBD_create.sh

Description of problem:
I ran a script to create 500 images on the master cluster, but only 51 of them synced to the remote cluster. I am also seeing a crash on both clusters (slave and master).

Version-Release number of selected component (if applicable):
ceph version 10.2.1-1.el7cp 

How reproducible:
Once

Steps to Reproduce:
1. Set up RBD mirroring in pool mode.
2. Ran the attached script (RBD_create.sh). It creates an RBD image, takes a snapshot, protects it, and then clones it (500 times).
3. After a while, I saw that only 51 images had synced to the slave cluster.
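The attached RBD_create.sh is not inlined in this report; a minimal sketch of the create/snap/protect/clone loop it presumably performs is below. Pool name, image size, and image features are assumptions, not taken from the attachment. The sketch prints the rbd commands rather than executing them, so it can be inspected anywhere; pipe its output to sh on the primary cluster to actually run it.

```shell
#!/bin/sh
# Print the rbd commands for a create/snap/protect/clone loop.
# Pool "pool1", 1G size, and the feature list are hypothetical;
# journaling is required for journal-based RBD mirroring.
make_images() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        echo "rbd create pool1/img_$i --size 1G --image-feature layering,exclusive-lock,journaling"
        echo "rbd snap create pool1/img_$i@snap1"
        echo "rbd snap protect pool1/img_$i@snap1"
        echo "rbd clone pool1/img_$i@snap1 pool1/img_${i}_clone"
        i=$((i + 1))
    done
}

make_images 500
```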

Actual results:
Seeing a crash in both the Master and Slave Node.

Expected results:
All images should get synced.


Additional info:
Logs, core dump

Comment 2 Tanay Ganguly 2016-05-17 12:06:58 UTC
Created attachment 1158254 [details]
Slave Log

Comment 3 Tanay Ganguly 2016-05-17 12:07:20 UTC
Created attachment 1158255 [details]
Master Log

Comment 6 Hemanth Kumar 2016-05-19 12:21:12 UTC
Created attachment 1159452 [details]
Hemanth : rbd-mirror.log

Hi Jason, 

The same segmentation fault is seen again with 2 GB of data synchronization on a single image.

Description of problem:
-----------------------
Synchronization stopped after transferring a few KB of data to the destination rbd client. The primary rbd-mirror daemon crashed.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
v10.2.1.3

Steps to Reproduce:
-------------------
1. Configured RBD mirroring between 2 clusters freshly installed with v10.2.1.3 ceph packages.

2. Created a few images on the primary rbd host; all of them synced to the remote rbd host.

3. Mapped one of the images to a VM as a raw disk and mounted it with an xfs filesystem.

4. Copied data onto the mount path and observed the synchronization.
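Steps 3-4 above can be sketched as the commands below. This is a dry-run sketch: the device node, mount point, and payload size are assumptions, and the commands are printed rather than executed since they need a live cluster and root access. Replace the echo_cmd wrapper with direct invocation on a real host.

```shell
#!/bin/sh
# Dry-run: print the map/format/mount/write sequence described above.
# /dev/rbd0 is the device rbd map would typically assign to the first
# mapped image; all paths here are hypothetical.
echo_cmd() { printf '%s\n' "$*"; }

echo_cmd rbd map pool1/data1 --cluster master
echo_cmd mkfs.xfs /dev/rbd0
echo_cmd mkdir -p /mnt/rbd-test
echo_cmd mount /dev/rbd0 /mnt/rbd-test
echo_cmd dd if=/dev/zero of=/mnt/rbd-test/payload bs=1M count=2048
```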

Actual results:
---------------
Synchronization stopped after transferring a few KB of data.

Expected results:
------------------
Synchronization should happen flawlessly.

Additional info:
-----------------
2016-05-19 08:08:23.988243 7faae37fe700 -1 *** Caught signal (Segmentation fault) **
 in thread 7faae37fe700 thread_name:fn_anonymous

 ceph version 10.2.1-3.el7cp (f6e1bde2840e1da621601bad87e15fd3f654c01e)
 1: (()+0x3780ea) [0x7fab0ce700ea]
 2: (()+0xf100) [0x7fab023ca100]
 3: (()+0xa80aa) [0x7fab032010aa]
 4: (librados::IoCtx::aio_operate(std::string const&, librados::AioCompletion*, librados::ObjectWriteOperation*)+0x4b) [0x7fab031bab7b]
 5: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::update_mirror_image_status(bool, rbd::mirror::ImageReplayer<librbd::ImageCtx>::State)+0x8c8) [0x7fab0ccc6608]
 6: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::update_mirror_image_status(bool, rbd::mirror::ImageReplayer<librbd::ImageCtx>::State)::{lambda(int)#1}::operator()(int) const+0xd7) [0x7fab0ccc7df7]
 7: (FunctionContext::finish(int)+0x2a) [0x7fab0ccc0afa]
 8: (Context::complete(int)+0x9) [0x7fab0ccb62a9]
 9: (()+0x9d6ed) [0x7fab031f66ed]
 10: (()+0x85af9) [0x7fab031deaf9]
 11: (()+0x16f7d6) [0x7fab032c87d6]
 12: (()+0x7dc5) [0x7fab023c2dc5]
 13: (clone()+0x6d) [0x7fab012a8ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Attached the Log File :- rbd-mirror.log

Comment 11 Hemanth Kumar 2016-05-23 09:22:56 UTC
Synchronization is not happening on the 10.2.1.6 build either.

Here is what I followed :-
1. Upgraded ceph to 10.2.1.6 and restarted the services.
2. Pending data started to sync.
3. Added more data to the primary image.
4. Data failed to sync to the secondary rbd host.

On Primary :
-------------
[root@magna020 yum.repos.d]# rbd du data1 --cluster master -p pool1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED 
data1      10240M 4192M 

On Remote :
-----------
[root@magna069 ~]# rbd du data1 --cluster slave -p pool1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED 
data1      10240M 2256M

RBD Log on remote rbd node :-
----------------------------
[root@magna069 ceph]# tailf qemu-guest-16989.log
2016-05-23 07:24:02.883340 7fa734fc5c40  0 set uid:gid to 167:167 (ceph:ceph)
2016-05-23 07:24:02.883371 7fa734fc5c40  0 ceph version 10.2.1-6.el7cp (339d1fb5d73e5f113c8538a45e95b3777038da36), process rbd-mirror, pid 16989
2016-05-23 07:31:04.045929 7fa7251e1700 -1 JournalPlayer: missing prior journal entry: Entry[tag_tid=2, entry_tid=924, data size=16777183]
2016-05-23 07:31:04.045952 7fa7251e1700 -1 rbd-mirror: ImageReplayer[1/5e4e2ae8944a]::handle_replay_complete: replay encountered an error: (42) No message of desired type
2016-05-23 07:32:35.695730 7fa7251e1700 -1 JournalPlayer: missing prior journal entry: Entry[tag_tid=2, entry_tid=956, data size=1054]
2016-05-23 07:32:35.695751 7fa7251e1700 -1 rbd-mirror: ImageReplayer[1/5e4e2ae8944a]::handle_replay_complete: replay encountered an error: (42) No message of desired type

-------------------------------------------

Comment 13 Jason Dillaman 2016-05-23 12:17:44 UTC
@Hemanth: I restarted rbd-mirror and replication continued.

# rbd --cluster slave --pool pool1 mirror pool status --verbose
health: OK
images: 4 total
    4 replaying

data1:
  global_id:   037458d0-4516-4b68-908f-ab6fce7de7a7
  state:       up+replaying
  description: replaying, master_position=[object_number=287, tag_tid=2, entry_tid=1491], mirror_position=[object_number=287, tag_tid=2, entry_tid=1491], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data3:
  global_id:   8f65f371-a354-43c1-914a-08795c192192
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data2:
  global_id:   aaaddb15-6cb1-49dd-83fc-e253600a26f9
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data4:
  global_id:   d1938591-a852-4773-ab13-d7cba7f8f0d3
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

# rbd --cluster slave --pool pool1 du data1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED 
data1      10240M 4192M

Comment 14 Ken Dreyer (Red Hat) 2016-05-24 21:44:57 UTC
https://github.com/ceph/ceph/pull/9282 needs to be merged to master, then cherry-picked downstream.

Comment 19 Hemanth Kumar 2016-06-27 12:11:25 UTC
This crash was not seen in the v10.2.2-5 build.
Moving to the verified state.

Comment 21 errata-xmlrpc 2016-08-23 19:38:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

