Description of problem:
Added a third site to an existing multisite active-active configuration. The data from the master zone was still being synced when a segfault occurred on running the "radosgw-admin sync status" command.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

Steps to Reproduce:
1. Configure a multisite active-active cluster with two zones and wait until all data is synced between them.
2. Add a third zone to the configuration and restart the radosgw service on all the zones (a command sketch follows this comment).
3. Metadata and data sync should start from the master zone to the newly added zone.
4. Check the sync progress using the radosgw-admin command.

Actual results:
Segmentation fault occurred when running the following command:

# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
*** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin
 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x54267a) [0x7f726a85767a]
 2: (()+0xf100) [0x7f726078b100]
 3: (RGWShardCollectCR::operate()+0x1fe) [0x7f726a6684fe]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f726a5db99e]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f726a5dd901]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f726a5de4c0]
 7: (RGWRemoteDataLog::read_source_log_shards_info(std::map<int, RGWDataChangesLogInfo, std::less<int>, std::allocator<std::pair<int const, RGWDataChangesLogInfo> > >*)+0xb2) [0x7f726a68d1f2]
 8: (main()+0xe7e5) [0x7f72741844b5]
 9: (__libc_start_main()+0xf5) [0x7f725febdb15]
 10: (()+0x35641) [0x7f7274193641]
2016-04-18 09:53:49.308014 7f7274133a40 -1 *** Caught signal (Segmentation fault) **
 in thread 7f7274133a40 thread_name:radosgw-admin

Additional info:
Also found these errors in the rgw logs:

2016-04-18 09:53:36.796191 7fd4a5ffb700 0 ERROR: lease cr failed, done early
2016-04-18 09:53:36.796214 7fd4a5ffb700 0 ERROR: incremental sync on new-bucket bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.5 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:36.821654 7fd415ffb700 0 ERROR: failed to wait for op, ret=-22: POST http://magna059:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03
2016-04-18 09:53:43.073586 7fd36b7ee700 1 civetweb: 0x7fd3c8004df0: 10.8.128.75 - - [18/Apr/2016:09:53:42 +0000] "GET /admin/log/ HTTP/1.1" 200 0 - -
2016-04-18 09:53:43.239843 7fd417fff700 0 ERROR: lease cr failed, done early
2016-04-18 09:53:43.239869 7fd417fff700 0 ERROR: incremental sync on container2 bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.6 shard_id=-1 failed, retcode=-16
2016-04-18 09:53:43.312791 7fd415ffb700 0 ERROR: failed to wait for op, ret=-22: POST http://magna075:8080/admin/log?type=data&notify&source-zone=017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0&rgwx-zonegroup=e66e1293-e63b-4afe-9dad-3397647dfb03

Running sync status again did not crash, but the data is behind.
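For reference, a rough sketch of the zone-addition flow from step 2, assuming the Jewel-era multisite workflow; the host names, credentials, and instance name below are placeholders and not taken from this setup (zonegroup "us" and zone "us-3" are from the output above):

(run on the third site; <...> values are placeholders)
# radosgw-admin realm pull --url=http://<master-host>:8080 --access-key=<access-key> --secret=<secret-key>
# radosgw-admin period pull --url=http://<master-host>:8080 --access-key=<access-key> --secret=<secret-key>
# radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-3 --endpoints=http://<third-host>:8080 --access-key=<access-key> --secret=<secret-key>
# radosgw-admin period update --commit
# systemctl restart ceph-radosgw@rgw.<instance>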
# radosgw-admin sync status
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 017d72f3-dd2e-48a3-ab51-cb2a9b73b2c0 (us-3)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        oldest incremental change not applied: 2016-04-18 10:19:09.0.769808s
                source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 8 shards
                        oldest incremental change not applied: 2016-04-18 09:52:05.0.15445s

The oldest incremental change is from before 09:52 and it hasn't synced since. That was around the time the segfault appeared.
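To narrow down which source zone the lagging shards belong to, the per-source data sync state can also be dumped (a hedged suggestion only; --source-zone is the standard option, and output detail varies by build). Using the zone names from the output above:

# radosgw-admin data sync status --source-zone=us-1
# radosgw-admin data sync status --source-zone=us-2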
This might already have been fixed in the latest upstream code.
Moving to verified. I don't see this issue in 10.2.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html