Bug 1328070

Summary: RGW multisite: Segfault while running radosgw-admin sync status on a newly added zone
Product: Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: RGW
Assignee: Orit Wasserman <owasserm>
Status: CLOSED ERRATA
QA Contact: shilpa <smanjara>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 2.0
CC: cbodley, ceph-eng-bugs, gmeno, hnallurv, kbader, kdreyer, mbenjamin, owasserm, sweil
Target Milestone: rc
Target Release: 2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-10.2.0-1.el7cp
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 19:36:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description shilpa 2016-04-18 11:10:14 UTC
Description of problem:
Added a third site to an existing multisite active-active configuration. While data from the master zone was being synced to the new zone, a segfault occurred when I ran the "radosgw-admin sync status" command.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64


Steps to Reproduce:
1. Configure a multisite active-active cluster with two zones. All data is synced between the two zones.
2. Add a third zone to the configuration and restart the radosgw service on all the zones.
3. The metadata and data sync should start from the master zone to the newly added zone.
4. Check sync progress on the new zone with the "radosgw-admin sync status" command.
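Step 4 can be scripted so that a crash is distinguished from an ordinary non-zero exit. The following is a minimal sketch, not part of the original report: the helper name run_and_classify is hypothetical, and it assumes a POSIX host where a process killed by a signal is reported by Python's subprocess module as a negative return code. Whether radosgw-admin ends with the signal itself or with a plain error exit after printing its backtrace is not stated in this report, so both cases are returned to the caller.

```python
import signal
import subprocess


def run_and_classify(argv, timeout=60):
    """Run a command and say whether it exited normally or died on a signal.

    Returns a (status, stdout) tuple, where status is either
    "exit:<code>" or "signal:<NAME>" (for example "signal:SIGSEGV").
    """
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was
        # terminated by signal number -returncode.
        return "signal:" + signal.Signals(-proc.returncode).name, proc.stdout
    return "exit:%d" % proc.returncode, proc.stdout
```

On the newly added zone this would be called as run_and_classify(["radosgw-admin", "sync", "status"]); any status beginning with "signal:" indicates the kind of crash described in this bug.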

Actual results:

A segmentation fault occurred when I ran the following command:

      
# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
*** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin
 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x54267a) [0x7f726a85767a]
 2: (()+0xf100) [0x7f726078b100]
 3: (RGWShardCollectCR::operate()+0x1fe) [0x7f726a6684fe]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f726a5db99e]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f726a5dd901]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f726a5de4c0]
 7: (RGWRemoteDataLog::read_source_log_shards_info(std::map<int, RGWDataChangesLogInfo, std::less<int>, std::allocator<std::pair<int const, RGWDataChangesLogInfo> > >*)+0xb2) [0x7f726a68d1f2]
 8: (main()+0xe7e5) [0x7f72741844b5]
 9: (__libc_start_main()+0xf5) [0x7f725febdb15]
 10: (()+0x35641) [0x7f7274193641]
2016-04-18 09:53:49.308014 7f7274133a40 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin
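To triage a trace like the one above, it can help to pull out the first frame that carries a resolved symbol. The sketch below is illustrative only: the function names are hypothetical, and the regular expression assumes the frame layout shown in this report ("N: (symbol+0xoffset) [0xaddress]", with "()" standing for an unresolved symbol).

```python
import re

# Matches lines such as:
#  3: (RGWShardCollectCR::operate()+0x1fe) [0x7f726a6684fe]
FRAME_RE = re.compile(
    r"^\s*(\d+): \((.*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]\s*$"
)


def parse_frames(trace_text):
    """Return one dict per backtrace frame found in trace_text."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue  # not a frame line (banner, version string, ...)
        index, symbol, offset, address = m.groups()
        frames.append({
            "index": int(index),
            # "()" is what the trace prints when no symbol was resolved.
            "symbol": None if symbol == "()" else symbol,
            "offset": int(offset, 16),
            "address": int(address, 16),
        })
    return frames


def first_named_frame(frames):
    """Return the first frame with a resolved symbol, or None."""
    for frame in frames:
        if frame["symbol"] is not None:
            return frame
    return None
```

Applied to the trace in this report, the first named frame is frame 3, RGWShardCollectCR::operate(), which is reached from RGWRemoteDataLog::read_source_log_shards_info() in frame 7.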


Additional info:

Also found these errors in rgw logs:

2016-04-18 09:53:36.796191 7fd4a5ffb700  0 ERROR: lease cr failed, done early 
2016-04-18 09:53:36.796214 7fd4a5ffb700  0 ERROR: incremental sync on new-bucket bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.5 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:36.821654 7fd415ffb700  0 ERROR: failed to wait for op, ret=-22: POST http://magna059:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03
2016-04-18 09:53:43.073586 7fd36b7ee700  1 civetweb: 0x7fd3c8004df0: 10.8.128.75 - - [18/Apr/2016:09:53:42 +0000] "GET /admin/log/ HTTP/1.1" 200 0 - -
2016-04-18 09:53:43.239843 7fd417fff700  0 ERROR: lease cr failed, done early 
2016-04-18 09:53:43.239869 7fd417fff700  0 ERROR: incremental sync on container2 bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.6 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:43.312791 7fd415ffb700  0 ERROR: failed to wait for op, ret=-22: POST http://magna075:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03
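The retcode=-16 and ret=-22 values in these log lines look like negated errno values, which would make them EBUSY and EINVAL respectively. That reading is an assumption based on the usual Ceph convention of returning negative errno codes; the report itself does not say so. A small sketch (helper names are hypothetical) for decoding such values on the host where the logs are read:

```python
import errno
import os
import re

# Picks up both "ret=-22" and "retcode=-16" as they appear in the rgw log.
RET_RE = re.compile(r"\bret(?:code)?=(-?\d+)")


def decode_ret(value):
    """Map a (possibly negated) errno value to 'NAME (description)'."""
    code = abs(value)
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s (%s)" % (name, os.strerror(code))


def annotate(line):
    """Return the decoded errno for a log line, or None if it has none."""
    m = RET_RE.search(line)
    if m is None:
        return None
    return decode_ret(int(m.group(1)))
```

Errno numbering is platform-specific, so the decoding should be run on the same kind of system that produced the logs (Linux in this case).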

However, running sync status again did not crash, but the data is behind:


# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        oldest incremental change not applied: 2016-04-18 10:19:09.0.769808s
                source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 8 shards
                        oldest incremental change not applied: 2016-04-18 09:52:05.0.15445s


The oldest incremental change not yet applied from us-1 is timestamped 09:52:05, and nothing has synced from that shard since. That is around the time the segfault appeared (09:53:49).
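To track whether the lag reported above shrinks over time, the sync status text can be reduced to a per-source count of lagging shards. This is a rough sketch with a hypothetical function name; it relies only on the "source: <id> (<name>)" and "data is behind on N shards" lines shown in this report, and it assumes that a source with no "behind" line is caught up.

```python
import re

SOURCE_RE = re.compile(r"source:\s+(\S+)\s+\(([^)]+)\)")
BEHIND_RE = re.compile(r"data is behind on (\d+) shards?")


def shards_behind(status_text):
    """Return {source zone name: number of data shards reported behind}."""
    behind = {}
    current = None
    for line in status_text.splitlines():
        m = SOURCE_RE.search(line)
        if m:
            # Start of a new per-source block; assume caught up until
            # a "behind" line says otherwise.
            current = m.group(2)
            behind[current] = 0
            continue
        m = BEHIND_RE.search(line)
        if m and current is not None:
            behind[current] = int(m.group(1))
    return behind
```

For the second sync status output in this report, the function reports 2 lagging shards for us-2 and 8 for us-1. For the first (crashed) output it returns an empty mapping, because the command died before printing any data sync source.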

Comment 2 Yehuda Sadeh 2016-04-18 16:09:35 UTC
This might already have been fixed in the latest upstream.

Comment 5 shilpa 2016-06-30 09:25:07 UTC
Moving to verified. I don't see this issue in 10.2.2.

Comment 7 errata-xmlrpc 2016-08-23 19:36:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html