Description of problem:
Uploaded large amounts of data from both rgw nodes: around 50 buckets with around 100 GB of data. During the sync operations, the rgw process segfaulted on the master zone.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-15.el7cp.x86_64

Actual results:
     0> 2016-07-08 09:38:32.318800 7fc485ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc485ffb700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc59936d22a]
 2: (()+0xf100) [0x7fc59879e100]
 3: (std::__detail::_List_node_base::_M_transfer(std::__detail::_List_node_base*, std::__detail::_List_node_base*)+0x10) [0x7fc5982f1010]
 4: (RGWOmapAppend::flush_pending()+0x23) [0x7fc5990f0993]
 5: (RGWOmapAppend::append(std::string const&)+0x98) [0x7fc5990f0a48]
 6: (RGWDataSyncSingleEntryCR::operate()+0x82a) [0x7fc5991ad94a]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc5990e8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc5990ea9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc5990eb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc599192912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc599255289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc5991fa083]
 13: (()+0x7dc5) [0x7fc598796dc5]
 14: (clone()+0x6d) [0x7fc597da0ced]

Additional info:
Currently unable to run RGW I/O on either zone.
Upstream fix pending review: https://github.com/ceph/ceph/pull/10157
When I tried to continue testing after restarting rgw on the two nodes, I noticed that the non-master zone segfaults with a different stack trace a few seconds after the master zone segfaults.

2016-07-09 08:06:39.522101 7fc10d7e2700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc10d7e2700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc192d5d22a]
 2: (()+0xf100) [0x7fc19218e100]
 3: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)+0x1b) [0x7fc191d30f3b]
 4: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fc192ae60e3]
 5: (RGWRadosRemoveOmapKeysCR::RGWRadosRemoveOmapKeysCR(RGWRados*, rgw_bucket const&, std::string const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&)+0x128) [0x7fc192ae2968]
 6: (RGWDataSyncSingleEntryCR::operate()+0xa96) [0x7fc192b9dbb6]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc192ad8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc192ada9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc192adb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc192b82912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc192c45289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc192bea083]
 13: (()+0x7dc5) [0x7fc192186dc5]
 14: (clone()+0x6d) [0x7fc191790ced]

Not sure if this is related to the original segfault on master.
(In reply to shilpa from comment #6)
> Not sure if this is related to the original segfault on master.

The fix will address this segfault as well.
Running on 10.2.2-18, I hit this stack trace again, this time on the non-master node, during object upload and sync operations.

2016-07-12 13:07:27.107029 7f3879ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3879ffb700 thread_name:radosgw

 ceph version 10.2.2-18.el7cp (408019449adec8263014b356737cf326544ea7c6)
 1: (()+0x54e2ba) [0x7f39102ab2ba]
 2: (()+0xf100) [0x7f390f6dc100]
 3: (RGWCoroutinesStack::wakeup()+0xe) [0x7f39100274ce]
 4: (RGWBucketShardIncrementalSyncCR::operate()+0xfed) [0x7f39100d639d]
 5: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f3910026a4e]
 6: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f39100289f1]
 7: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f39100295b0]
 8: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7f39100d0932]
 9: (RGWDataSyncProcessorThread::process()+0x49) [0x7f3910193319]
 10: (RGWRadosThread::Worker::entry()+0x133) [0x7f3910138113]
 11: (()+0x7dc5) [0x7f390f6d4dc5]
 12: (clone()+0x6d) [0x7f390ecdeced]
I haven't seen this occur since 10.2.2-23. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1755.html