Bug 1353972 - Master zone radosgw process segfaults during I/O and sync operations
Summary: Master zone radosgw process segfaults during I/O and sync operations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.0
Assignee: Casey Bodley
QA Contact: shilpa
URL:
Whiteboard:
Depends On: 1354156
Blocks:
 
Reported: 2016-07-08 14:50 UTC by shilpa
Modified: 2017-07-31 14:15 UTC (History)
CC List: 10 users

Fixed In Version: RHEL: ceph-10.2.2-18.el7cp Ubuntu: ceph_10.2.2-14redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:43:47 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 16603 None None None 2016-07-08 14:51:21 UTC
Red Hat Product Errata RHBA-2016:1755 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description shilpa 2016-07-08 14:50:01 UTC
Description of problem:
Uploaded a large amount of data from both RGW nodes: around 50 buckets with roughly 100 GB of data in total. During the sync operations, the radosgw process segfaulted on the master zone.
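
A rough sketch of the kind of load used (the exact client tool, endpoint, credentials and object sizes were not recorded; the values below are placeholders, not the real parameters):

import boto3

# Placeholder endpoint and credentials for the master zone gateway.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw-master.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

payload = b"x" * (4 * 1024 * 1024)        # 4 MiB per object (assumed size)
for i in range(50):                       # ~50 buckets, as described above
    bucket = "synctest-%02d" % i
    s3.create_bucket(Bucket=bucket)
    for j in range(500):                  # 50 * 500 * 4 MiB ~= 100 GB in total
        s3.put_object(Bucket=bucket, Key="obj-%05d" % j, Body=payload)

The same kind of loop was run against the other zone's endpoint at the same time, so data sync was active in both directions while the upload was in progress.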


Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-15.el7cp.x86_64

Actual results:

    0> 2016-07-08 09:38:32.318800 7fc485ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc485ffb700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc59936d22a]
 2: (()+0xf100) [0x7fc59879e100]
 3: (std::__detail::_List_node_base::_M_transfer(std::__detail::_List_node_base*, std::__detail::_List_node_base*)+0x10) [0x7fc5982f1010]
 4: (RGWOmapAppend::flush_pending()+0x23) [0x7fc5990f0993]
 5: (RGWOmapAppend::append(std::string const&)+0x98) [0x7fc5990f0a48]
 6: (RGWDataSyncSingleEntryCR::operate()+0x82a) [0x7fc5991ad94a]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc5990e8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc5990ea9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc5990eb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc599192912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc599255289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc5991fa083]
 13: (()+0x7dc5) [0x7fc598796dc5]
 14: (clone()+0x6d) [0x7fc597da0ced]



Additional info:

Currently unable to run RGW I/O on both zones.

Comment 2 Casey Bodley 2016-07-08 15:06:46 UTC
upstream fix pending review: https://github.com/ceph/ceph/pull/10157

Comment 6 shilpa 2016-07-09 10:29:39 UTC
When I tried to continue testing after an rgw restart on both nodes, I noticed that the non-master zone segfaults with a different stack trace a few seconds after the master zone does.

2016-07-09 08:06:39.522101 7fc10d7e2700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc10d7e2700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc192d5d22a]
 2: (()+0xf100) [0x7fc19218e100]
 3: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)+0x1b) [0x7fc191d30f3b]
 4: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fc192ae60e3]
 5: (RGWRadosRemoveOmapKeysCR::RGWRadosRemoveOmapKeysCR(RGWRados*, rgw_bucket const&, std::string const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&)+0x128) [0x7fc192ae2968]
 6: (RGWDataSyncSingleEntryCR::operate()+0xa96) [0x7fc192b9dbb6]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc192ad8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc192ada9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc192adb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc192b82912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc192c45289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc192bea083]
 13: (()+0x7dc5) [0x7fc192186dc5]
 14: (clone()+0x6d) [0x7fc191790ced]

Not sure if this is related to the original segfault on master.

Comment 7 Casey Bodley 2016-07-11 13:42:20 UTC
(In reply to shilpa from comment #6)
> Not sure if this is related to the original segfault on master.

The fix will address this segfault as well.

Comment 8 shilpa 2016-07-12 13:37:47 UTC
Running on 10.2.2-18, I hit a segfault again, this time on the non-master node during object upload and sync operations.

2016-07-12 13:07:27.107029 7f3879ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3879ffb700 thread_name:radosgw

 ceph version 10.2.2-18.el7cp (408019449adec8263014b356737cf326544ea7c6)
 1: (()+0x54e2ba) [0x7f39102ab2ba]
 2: (()+0xf100) [0x7f390f6dc100]
 3: (RGWCoroutinesStack::wakeup()+0xe) [0x7f39100274ce]
 4: (RGWBucketShardIncrementalSyncCR::operate()+0xfed) [0x7f39100d639d]
 5: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f3910026a4e]
 6: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f39100289f1]
 7: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f39100295b0]
 8: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7f39100d0932]
 9: (RGWDataSyncProcessorThread::process()+0x49) [0x7f3910193319]
 10: (RGWRadosThread::Worker::entry()+0x133) [0x7f3910138113]
 11: (()+0x7dc5) [0x7f390f6d4dc5]
 12: (clone()+0x6d) [0x7f390ecdeced]

Comment 18 shilpa 2016-07-26 06:19:32 UTC
I haven't seen this occur since 10.2.2-23. Moving to verified.
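
For reference, a sketch of the kind of sanity check run on each zone during verification. It simply polls "radosgw-admin sync status" (the standard Jewel-era command); the exact verification steps, zone names, and pass criteria here are assumptions, not the recorded test plan.

import subprocess

def sync_status():
    # Run 'radosgw-admin sync status' on a node in this zone; default
    # cluster name and keyring paths are assumed.
    result = subprocess.run(
        ["radosgw-admin", "sync", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(sync_status())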

Comment 20 errata-xmlrpc 2016-08-23 19:43:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

