Description of problem:
Added a third site to an existing multisite active-active configuration. The data from the master zone was still being synced when a segfault occurred on running the "radosgw-admin sync status" command.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

Steps to Reproduce:
1. Configure a multisite active-active cluster with two zones and wait until all data is synced between them.
2. Add a third zone to the configuration and restart the radosgw service on all the zones (a command sketch follows this comment).
3. Metadata and data sync should start from the master zone to the newly added zone.
4. Check the sync progress using the radosgw-admin command.

Actual results:
Segmentation fault occurred when running the following command:

# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
*** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin
 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x54267a) [0x7f726a85767a]
 2: (()+0xf100) [0x7f726078b100]
 3: (RGWShardCollectCR::operate()+0x1fe) [0x7f726a6684fe]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f726a5db99e]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f726a5dd901]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f726a5de4c0]
 7: (RGWRemoteDataLog::read_source_log_shards_info(std::map<int, RGWDataChangesLogInfo, std::less<int>, std::allocator<std::pair<int const, RGWDataChangesLogInfo> > >*)+0xb2) [0x7f726a68d1f2]
 8: (main()+0xe7e5) [0x7f72741844b5]
 9: (__libc_start_main()+0xf5) [0x7f725febdb15]
 10: (()+0x35641) [0x7f7274193641]
2016-04-18 09:53:49.308014 7f7274133a40 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin

Additional info:
Also found these errors in the rgw logs:

2016-04-18 09:53:36.796191 7fd4a5ffb700 0 ERROR: lease cr failed, done early
2016-04-18 09:53:36.796214 7fd4a5ffb700 0 ERROR: incremental sync on new-bucket bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.5 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:36.821654 7fd415ffb700 0 ERROR: failed to wait for op, ret=-22: POST http://magna059:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03
2016-04-18 09:53:43.073586 7fd36b7ee700 1 civetweb: 0x7fd3c8004df0: 10.8.128.75 - - [18/Apr/2016:09:53:42 +0000] "GET /admin/log/ HTTP/1.1" 200 0 - -
2016-04-18 09:53:43.239843 7fd417fff700 0 ERROR: lease cr failed, done early
2016-04-18 09:53:43.239869 7fd417fff700 0 ERROR: incremental sync on container2 bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.6 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:43.312791 7fd415ffb700 0 ERROR: failed to wait for op, ret=-22: POST http://magna075:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03

Running sync status again did not crash, but the data is behind.
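For reference, a rough sketch of the zone-addition flow from step 2, assuming the Jewel-era multisite workflow; the host names, credentials, and instance name below are placeholders and not taken from this setup (zonegroup "us" and zone "us-3" are from the output above):

(run on the third site; <...> values are placeholders)
# radosgw-admin realm pull --url=http://<master-host>:8080 --access-key=<access-key> --secret=<secret-key>
# radosgw-admin period pull --url=http://<master-host>:8080 --access-key=<access-key> --secret=<secret-key>
# radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-3 --endpoints=http://<third-host>:8080 --access-key=<access-key> --secret=<secret-key>
# radosgw-admin period update --commit
# systemctl restart ceph-radosgw@rgw.<instance>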
# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        oldest incremental change not applied: 2016-04-18 10:19:09.0.769808s
                source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 8 shards
                        oldest incremental change not applied: 2016-04-18 09:52:05.0.15445s

The oldest incremental change is from before 09:52 and it hasn't synced since. That was around the time the segfault appeared.
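To narrow down which source zone the lagging shards belong to, the per-source data sync state can also be dumped (a hedged suggestion only; --source-zone is the standard option, and output detail varies by build). Using the zone names from the output above:

# radosgw-admin data sync status --source-zone=us-1
# radosgw-admin data sync status --source-zone=us-2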
This might already have been fixed in the latest upstream code.
Moving to verified. I don't see this issue in 10.2.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html