Bug 1328070 - RGW multisite: Segfault while running radosgw-admin sync status on a newly added zone
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Orit Wasserman
QA Contact: shilpa
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-18 11:10 UTC by shilpa
Modified: 2017-07-30 15:58 UTC
CC List: 9 users

Fixed In Version: ceph-10.2.0-1.el7cp
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:36:38 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1755 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description shilpa 2016-04-18 11:10:14 UTC
Description of problem:
Added a third zone to an existing multisite active-active configuration. While data from the master zone was being synced, a segfault occurred when I ran the "radosgw-admin sync status" command.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64


Steps to Reproduce:
1. Configure a multisite active-active cluster with two zones, with all data synced between them.
2. Now add a third zone to the configuration and restart the radosgw service on all the zones. 
3. The metadata and data sync should start from the master zone to the newly added zone. 
4. Check the sync status using radosgw-admin command to check progress.
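The reproduction steps above can be sketched with radosgw-admin commands. This is a hypothetical outline only: the endpoint URL and the key variables are placeholders, not values taken from this report.

```shell
# Step 2 (sketch): add a third zone "us-3" to the existing zonegroup "us"
# and commit the updated period. Endpoint and keys below are placeholders.
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-3 \
    --endpoints=http://gateway-us-3:8080 \
    --access-key="$SYSTEM_ACCESS_KEY" --secret="$SYSTEM_SECRET_KEY"
radosgw-admin period update --commit

# Restart the gateway on each zone so the new period takes effect,
# then check sync progress (step 4):
systemctl restart ceph-radosgw.target
radosgw-admin sync status
```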

Actual results:

Segmentation fault occurred when I ran the following command:

      
# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
*** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin
 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x54267a) [0x7f726a85767a]
 2: (()+0xf100) [0x7f726078b100]
 3: (RGWShardCollectCR::operate()+0x1fe) [0x7f726a6684fe]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f726a5db99e]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f726a5dd901]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f726a5de4c0]
 7: (RGWRemoteDataLog::read_source_log_shards_info(std::map<int, RGWDataChangesLogInfo, std::less<int>, std::allocator<std::pair<int const, RGWDataChangesLogInfo> > >*)+0xb2) [0x7f726a68d1f2]
 8: (main()+0xe7e5) [0x7f72741844b5]
 9: (__libc_start_main()+0xf5) [0x7f725febdb15]
 10: (()+0x35641) [0x7f7274193641]
2016-04-18 09:53:49.308014 7f7274133a40 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin


Additional info:

Also found these errors in rgw logs:

2016-04-18 09:53:36.796191 7fd4a5ffb700  0 ERROR: lease cr failed, done early 
2016-04-18 09:53:36.796214 7fd4a5ffb700  0 ERROR: incremental sync on new-bucket bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.5 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:36.821654 7fd415ffb700  0 ERROR: failed to wait for op, ret=-22: POST http://magna059:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03
2016-04-18 09:53:43.073586 7fd36b7ee700  1 civetweb: 0x7fd3c8004df0: 10.8.128.75 - - [18/Apr/2016:09:53:42 +0000] "GET /admin/log/ HTTP/1.1" 200 0 - -
2016-04-18 09:53:43.239843 7fd417fff700  0 ERROR: lease cr failed, done early 
2016-04-18 09:53:43.239869 7fd417fff700  0 ERROR: incremental sync on container2 bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.6 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:43.312791 7fd415ffb700  0 ERROR: failed to wait for op, ret=-22: POST http://magna075:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03

However, running sync status again did not crash, but the data sync is behind:


# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        oldest incremental change not applied: 2016-04-18 10:19:09.0.769808s
                source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 8 shards
                        oldest incremental change not applied: 2016-04-18 09:52:05.0.15445s


The oldest incremental change is from before 09:52 and has not been applied since; that was around the time the segfault appeared.

Comment 2 Yehuda Sadeh 2016-04-18 16:09:35 UTC
This might have already been fixed in the latest upstream.

Comment 5 shilpa 2016-06-30 09:25:07 UTC
Moving to verified. I don't see this issue in 10.2.2.

Comment 7 errata-xmlrpc 2016-08-23 19:36:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

