Bug 1336755
| Field | Value |
|---|---|
| Summary | [RBD-Mirroring] Images are not getting synced to the Slave Cluster |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | RBD |
| Version | 2.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Tanay Ganguly <tganguly> |
| Assignee | Jason Dillaman <jdillama> |
| QA Contact | Hemanth Kumar <hyelloji> |
| CC | amaredia, ceph-eng-bugs, hnallurv, hyelloji, kdreyer, kurs |
| Target Milestone | rc |
| Target Release | 2.0 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | RHEL: ceph-10.2.1-12.el7cp |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2016-08-23 19:38:37 UTC |
Created attachment 1158254 [details]
Slave Log
Created attachment 1158255 [details]
Master Log
Created attachment 1159452 [details]
Hemanth : rbd-mirror.log
Hi Jason,
The same segmentation fault is seen again with 2 GB of data synchronization on a single image.
Description of problem:
-----------------------
Synchronization stopped after transferring a few KBs of data to the destination rbd client, and the rbd-mirror daemon on the primary side crashed.
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
v10.2.1.3
Steps to Reproduce:
-------------------
1. Configured RBD mirroring between 2 different clusters running freshly installed v10.2.1.3 ceph packages.
2. Created a few images on the primary rbd host; all of the created images synced to the remote rbd host.
3. Mapped one of the images to a VM as a raw disk and mounted it as an XFS filesystem.
4. Copied data onto that mounted path and observed the synchronization.
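The steps above can be sketched roughly as follows. This is a hypothetical reconstruction, not the reporter's exact commands: the cluster names (`master`/`slave`), pool (`pool1`), image (`data1`), and mount point are assumptions.

```shell
# Journal-based mirroring requires the exclusive-lock and journaling features.
rbd create pool1/data1 --size 10240 \
    --image-feature exclusive-lock,journaling --cluster master

# On the client: map the image as a raw block device, format, and mount it.
rbd map pool1/data1 --cluster master      # maps to e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/data1

# Generate writes, then watch replication progress from the secondary side.
dd if=/dev/urandom of=/mnt/data1/testfile bs=1M count=2048
rbd mirror pool status pool1 --verbose --cluster slave
```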
Actual results:
---------------
Synchronization stopped after transferring a few KBs of data.
Expected results:
------------------
Synchronization should happen flawlessly.
Additional info:
-----------------
2016-05-19 08:08:23.988243 7faae37fe700 -1 *** Caught signal (Segmentation fault) **
in thread 7faae37fe700 thread_name:fn_anonymous
ceph version 10.2.1-3.el7cp (f6e1bde2840e1da621601bad87e15fd3f654c01e)
1: (()+0x3780ea) [0x7fab0ce700ea]
2: (()+0xf100) [0x7fab023ca100]
3: (()+0xa80aa) [0x7fab032010aa]
4: (librados::IoCtx::aio_operate(std::string const&, librados::AioCompletion*, librados::ObjectWriteOperation*)+0x4b) [0x7fab031bab7b]
5: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::update_mirror_image_status(bool, rbd::mirror::ImageReplayer<librbd::ImageCtx>::State)+0x8c8) [0x7fab0ccc6608]
6: (rbd::mirror::ImageReplayer<librbd::ImageCtx>::update_mirror_image_status(bool, rbd::mirror::ImageReplayer<librbd::ImageCtx>::State)::{lambda(int)#1}::operator()(int) const+0xd7) [0x7fab0ccc7df7]
7: (FunctionContext::finish(int)+0x2a) [0x7fab0ccc0afa]
8: (Context::complete(int)+0x9) [0x7fab0ccb62a9]
9: (()+0x9d6ed) [0x7fab031f66ed]
10: (()+0x85af9) [0x7fab031deaf9]
11: (()+0x16f7d6) [0x7fab032c87d6]
12: (()+0x7dc5) [0x7fab023c2dc5]
13: (clone()+0x6d) [0x7fab012a8ced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
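Per the NOTE above, the raw offsets in the backtrace can be resolved once the matching debuginfo is available. A sketch, assuming the `ceph-debuginfo` package for 10.2.1-3.el7cp is installed and that frame 1's offset (`0x3780ea`) falls in the `rbd-mirror` binary (the actual module would need to be confirmed from the load map):

```shell
# Produce an annotated disassembly of the crashing binary.
objdump -rdS /usr/bin/rbd-mirror > rbd-mirror.objdump

# Translate the module-relative offset into a function name and source line
# (-C demangles C++, -f prints the function, -i expands inlined frames).
addr2line -Cfie /usr/bin/rbd-mirror 0x3780ea
```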
Attached the log file: rbd-mirror.log
Synchronization is not happening on the 10.2.1.6 build either. Here is what I followed:
1. Upgraded the ceph version to 10.2.1.6 and restarted the services.
2. Pending data started to sync.
3. Added more data on the primary image.
4. Data failed to sync to the secondary rbd host.

On Primary:
-----------
[root@magna020 yum.repos.d]# rbd du data1 --cluster master -p pool1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED
data1      10240M 4192M

On Remote:
----------
[root@magna069 ~]# rbd du data1 --cluster slave -p pool1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED
data1      10240M 2256M

RBD log on the remote rbd node:
-------------------------------
[root@magna069 ceph]# tailf qemu-guest-16989.log
2016-05-23 07:24:02.883340 7fa734fc5c40 0 set uid:gid to 167:167 (ceph:ceph)
2016-05-23 07:24:02.883371 7fa734fc5c40 0 ceph version 10.2.1-6.el7cp (339d1fb5d73e5f113c8538a45e95b3777038da36), process rbd-mirror, pid 16989
2016-05-23 07:31:04.045929 7fa7251e1700 -1 JournalPlayer: missing prior journal entry: Entry[tag_tid=2, entry_tid=924, data size=16777183]
2016-05-23 07:31:04.045952 7fa7251e1700 -1 rbd-mirror: ImageReplayer[1/5e4e2ae8944a]::handle_replay_complete: replay encountered an error: (42) No message of desired type
2016-05-23 07:32:35.695730 7fa7251e1700 -1 JournalPlayer: missing prior journal entry: Entry[tag_tid=2, entry_tid=956, data size=1054]
2016-05-23 07:32:35.695751 7fa7251e1700 -1 rbd-mirror: ImageReplayer[1/5e4e2ae8944a]::handle_replay_complete: replay encountered an error: (42) No message of desired type

@Hemanth: restarted rbd-mirror and replication continued.
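The `rbd du` warning above ("fast-diff map is not enabled") is incidental to this bug, but can be silenced by enabling the feature. A sketch, assuming the `pool1`/`data1` names from the output above; `fast-diff` depends on `object-map`, which depends on `exclusive-lock`:

```shell
# Enable the dependent features on the existing image.
rbd feature enable pool1/data1 object-map,fast-diff --cluster master

# Rebuild the object map so extents written before the feature was
# enabled are accounted for.
rbd object-map rebuild pool1/data1 --cluster master

# du now uses the fast-diff map instead of scanning every object.
rbd du pool1/data1 --cluster master
```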
# rbd --cluster slave --pool pool1 mirror pool status --verbose
health: OK
images: 4 total
    4 replaying

data1:
  global_id:   037458d0-4516-4b68-908f-ab6fce7de7a7
  state:       up+replaying
  description: replaying, master_position=[object_number=287, tag_tid=2, entry_tid=1491], mirror_position=[object_number=287, tag_tid=2, entry_tid=1491], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data3:
  global_id:   8f65f371-a354-43c1-914a-08795c192192
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data2:
  global_id:   aaaddb15-6cb1-49dd-83fc-e253600a26f9
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

data4:
  global_id:   d1938591-a852-4773-ab13-d7cba7f8f0d3
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
  last_update: 2016-05-23 12:17:04

# rbd --cluster slave --pool pool1 du data1
warning: fast-diff map is not enabled for data1. operation may be slow.
NAME  PROVISIONED  USED
data1      10240M 4192M

https://github.com/ceph/ceph/pull/9282 needs to be merged to master, then cherry-picked downstream.

This crash was not seen in the v10.2.2-5 build. Moving to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-1755.html
Created attachment 1158253 [details]
RBD_create.sh

Description of problem:
-----------------------
I ran a script to create 500 images on the master cluster, but only 51 of the images synced to the remote cluster. Also seeing a crash on both nodes (slave and master).

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
ceph version 10.2.1-1.el7cp

How reproducible:
-----------------
Once

Steps to Reproduce:
-------------------
1. Set up RBD mirroring in pool mode.
2. Ran the attached script (PFA). It creates an RBD image, takes a snapshot, protects it, and then clones it (500 times).
3. After a while, only 51 images had synced to the slave cluster.

Actual results:
---------------
Seeing a crash on both the master and slave nodes.

Expected results:
-----------------
All images should get synced.

Additional info:
----------------
Logs, core dump
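The attached RBD_create.sh is not reproduced here; based only on the description above (create, snap, protect, clone, 500 times), its loop might look roughly like the sketch below. All names (`pool1`, the `img-N` prefix, `snap1`) and the image size are assumptions, not taken from the attachment.

```shell
#!/bin/bash
# Hypothetical reconstruction of the create/snap/protect/clone loop.
# In pool-mode mirroring, only images with the journaling feature enabled
# are replicated, hence the explicit feature list.
POOL=pool1
for i in $(seq 1 500); do
    rbd create ${POOL}/img-${i} --size 1024 \
        --image-feature layering,exclusive-lock,journaling
    rbd snap create ${POOL}/img-${i}@snap1
    rbd snap protect ${POOL}/img-${i}@snap1
    rbd clone ${POOL}/img-${i}@snap1 ${POOL}/img-${i}-clone
done
```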