Bug 1353972

Summary: Master zone radosgw process segfaults during I/O and sync operations
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: RGW
Assignee: Casey Bodley <cbodley>
Status: CLOSED ERRATA
QA Contact: shilpa <smanjara>
Severity: high
Priority: unspecified
Version: 2.0
CC: cbodley, ceph-eng-bugs, ceph-qe-bugs, hnallurv, kbader, kdreyer, mbenjamin, owasserm, sweil, yehuda
Target Milestone: rc
Target Release: 2.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-10.2.2-18.el7cp Ubuntu: ceph_10.2.2-14redhat1xenial
Doc Type: If docs needed, set a value
Last Closed: 2016-08-23 19:43:47 UTC
Type: Bug
Bug Depends On: 1354156

Description shilpa 2016-07-08 14:50:01 UTC
Description of problem:
Uploaded large amounts of data from both rgw nodes: around 50 buckets with around 100GB of data. During the sync operations, the rgw process segfaulted on the master zone.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-15.el7cp.x86_64

Actual results:

    0> 2016-07-08 09:38:32.318800 7fc485ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc485ffb700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc59936d22a]
 2: (()+0xf100) [0x7fc59879e100]
 3: (std::__detail::_List_node_base::_M_transfer(std::__detail::_List_node_base*, std::__detail::_List_node_base*)+0x10) [0x7fc5982f1010]
 4: (RGWOmapAppend::flush_pending()+0x23) [0x7fc5990f0993]
 5: (RGWOmapAppend::append(std::string const&)+0x98) [0x7fc5990f0a48]
 6: (RGWDataSyncSingleEntryCR::operate()+0x82a) [0x7fc5991ad94a]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc5990e8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc5990ea9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc5990eb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc599192912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc599255289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc5991fa083]
 13: (()+0x7dc5) [0x7fc598796dc5]
 14: (clone()+0x6d) [0x7fc597da0ced]



Additional info:

Currently unable to run RGW I/O on either zone.

Comment 2 Casey Bodley 2016-07-08 15:06:46 UTC
upstream fix pending review: https://github.com/ceph/ceph/pull/10157

Comment 6 shilpa 2016-07-09 10:29:39 UTC
While trying to continue testing after an rgw restart on both nodes, I noticed that the non-master zone segfaults with a different stack trace a few seconds after the master zone segfaults.

2016-07-09 08:06:39.522101 7fc10d7e2700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc10d7e2700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc192d5d22a]
 2: (()+0xf100) [0x7fc19218e100]
 3: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)+0x1b) [0x7fc191d30f3b]
 4: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fc192ae60e3]
 5: (RGWRadosRemoveOmapKeysCR::RGWRadosRemoveOmapKeysCR(RGWRados*, rgw_bucket const&, std::string const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&)+0x128) [0x7fc192ae2968]
 6: (RGWDataSyncSingleEntryCR::operate()+0xa96) [0x7fc192b9dbb6]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc192ad8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc192ada9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc192adb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc192b82912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc192c45289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc192bea083]
 13: (()+0x7dc5) [0x7fc192186dc5]
 14: (clone()+0x6d) [0x7fc191790ced]

Not sure if this is related to the original segfault on master.
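
One way to compare the two traces is to strip the offsets and addresses from each frame and look at the frames they share, counting from the bottom of the stack. The following is only a minimal sketch in Python, assuming the traces are pasted as text in the format shown above; `parse_frames` and `common_tail` are hypothetical helper names for illustration and are not part of Ceph.

```python
import re

# Matches frames such as
#   " 4: (RGWOmapAppend::flush_pending()+0x23) [0x7fc5990f0993]"
# and captures the symbol without its +0x offset or the address.
FRAME_RE = re.compile(r"^\s*\d+:\s+\((.*)\+0x[0-9a-f]+\)\s+\[0x[0-9a-f]+\]\s*$")


def parse_frames(trace):
    """Return the symbols in a pasted trace, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    return frames


def common_tail(a, b):
    """Return the frames two traces share, counted from the outermost frame."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


if __name__ == "__main__":
    # Excerpts of the master and non-master traces from this report.
    master = """
 5: (RGWOmapAppend::append(std::string const&)+0x98) [0x7fc5990f0a48]
 6: (RGWDataSyncSingleEntryCR::operate()+0x82a) [0x7fc5991ad94a]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc5990e8a2e]
"""
    non_master = """
 4: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fc192ae60e3]
 6: (RGWDataSyncSingleEntryCR::operate()+0xa96) [0x7fc192b9dbb6]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc192ad8a2e]
"""
    for symbol in common_tail(parse_frames(master), parse_frames(non_master)):
        print(symbol)
```

Applied to the two full traces above, the shared frames begin at RGWDataSyncSingleEntryCR::operate(), so both crashes happen inside the same data sync coroutine even though the innermost frames differ.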

Comment 7 Casey Bodley 2016-07-11 13:42:20 UTC
(In reply to shilpa from comment #6)
> Not sure if this is related to the original segfault on master.

The fix will address this segfault as well.

Comment 8 shilpa 2016-07-12 13:37:47 UTC
Running on 10.2.2-18, I hit a segfault again with the stack trace below, this time on the non-master node during object upload and sync operations.

2016-07-12 13:07:27.107029 7f3879ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3879ffb700 thread_name:radosgw

 ceph version 10.2.2-18.el7cp (408019449adec8263014b356737cf326544ea7c6)
 1: (()+0x54e2ba) [0x7f39102ab2ba]
 2: (()+0xf100) [0x7f390f6dc100]
 3: (RGWCoroutinesStack::wakeup()+0xe) [0x7f39100274ce]
 4: (RGWBucketShardIncrementalSyncCR::operate()+0xfed) [0x7f39100d639d]
 5: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f3910026a4e]
 6: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f39100289f1]
 7: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f39100295b0]
 8: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7f39100d0932]
 9: (RGWDataSyncProcessorThread::process()+0x49) [0x7f3910193319]
 10: (RGWRadosThread::Worker::entry()+0x133) [0x7f3910138113]
 11: (()+0x7dc5) [0x7f390f6d4dc5]
 12: (clone()+0x6d) [0x7f390ecdeced]

Comment 18 shilpa 2016-07-26 06:19:32 UTC
I haven't seen this occur since 10.2.2-23. Moving to verified.

Comment 20 errata-xmlrpc 2016-08-23 19:43:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html