Bug 1368196

Summary: 0.34-17 segfault under load in HA and lot of QMF schema queries
Product: Red Hat Enterprise MRG Reporter: Pavel Moravec <pmoravec>
Component: qpid-cppAssignee: messaging-bugs <messaging-bugs>
Status: POST --- QA Contact: Messaging QE <messaging-qe-bugs>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2CC: jross, mcressma
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Pavel Moravec 2016-08-18 15:55:24 UTC
Description of problem:
Under unknown scenario, but having a 0.34-17 broker:
- in HA cluster with some backup nodes (the problematic is the primary broker)
- under some stress test
- processing a lot of QMF schema queries (>5 "get list of bindings" queries run in parallel)
- doing periodicProcessing
- sending execution.sync on a session to a backup broker

Then calculating the encodedSize of ExecutionSyncBody segfaults with backtrace:

#5  0x00000030c27940c2 in qpid::framing::AMQFrame::encodedSize (this=0x7f89a7ffd1a0)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/AMQFrame.cpp:46

(full BT below)


Version-Release number of selected component (if applicable):
qpid-cpp-server-* 0.34-17


How reproducible:
???


Steps to Reproduce:
1. ???


Actual results:


Expected results:


Additional info:
Backtrace:

Thread 1 (Thread 0x7f89a7fff700 (LWP 28118)):
#0  0x0000003d6129d28c in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib64/libstdc++.so.6
#1  0x00000030c2778d40 in _Construct<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__first=<value optimized out>, __last="", __result=0x30c2a72630)
    at /usr/include/c++/4.4.7/bits/stl_construct.h:80
#2  uninitialized_copy<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*> (
    __first=<value optimized out>, __last="", __result=0x30c2a72630) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:74
#3  uninitialized_copy<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*> (
    __first=<value optimized out>, __last="", __result=0x30c2a72630) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:116
#4  std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__first=<value optimized out>, __last="", __result=0x30c2a72630)
    at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:256
#5  0x00000030c27940c2 in qpid::framing::AMQFrame::encodedSize (this=0x7f89a7ffd1a0)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/AMQFrame.cpp:46
#6  0x00000030c2784bbc in qpid::SessionState::senderRecord (this=0x2925120, f=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/SessionState.cpp:123
#7  0x00000030c2792b9d in qpid::amqp_0_10::SessionHandler::handleOut (this=0x2810870, f=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/SessionHandler.cpp:138
#8  0x00000030c27a7b7a in qpid::framing::Proxy::send (this=0x2810940, b=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/Proxy.cpp:37
#9  0x00000030c26d7d09 in qpid::framing::AMQP_ClientProxy::Execution::sync (this=0x2810940)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/AMQP_ClientProxy.cpp:178
#10 0x00000030c2eeb532 in qpid::broker::SessionState::deliver (this=<value optimized out>, message=...,
    destination=<value optimized out>, isRedelivered=false, ttl=0, acceptMode=<value optimized out>,
    acquireMode=qpid::framing::message::ACQUIRE_MODE_NOT_ACQUIRED, annotations=std::map with 0 elements, sync=true)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionState.cpp:325
#11 0x00000030c2ecf368 in qpid::broker::SemanticStateConsumerImpl::deliver (this=0x2833a20, cursor=..., msg=..., consumer=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SemanticState.cpp:396
#12 0x00007f89af4af5dc in qpid::ha::ReplicatingSubscription::sendEvent (this=0x2833a20, event=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/ha/ReplicatingSubscription.cpp:306
#13 0x00007f89af4af81f in qpid::ha::ReplicatingSubscription::sendIdEvent (this=<value optimized out>, pos=<value optimized out>,
    l=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/ha/ReplicatingSubscription.cpp:297
#14 0x00007f89af4b101b in qpid::ha::ReplicatingSubscription::deliver (this=0x2833a20, c=..., m=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/ha/ReplicatingSubscription.cpp:212
#15 0x00000030c2e32557 in qpid::broker::Queue::dispatch (this=<value optimized out>, c=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/Queue.cpp:507
#16 0x00000030c2ed25bb in qpid::broker::SemanticStateConsumerImpl::doDispatch (this=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SemanticState.cpp:663
#17 0x00007f89af4afc14 in qpid::ha::ReplicatingSubscription::doDispatch (this=0x2833a20)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/ha/ReplicatingSubscription.cpp:317
#18 0x00000030c2ecc221 in qpid::broker::SemanticStateConsumerImpl::doOutput (this=0x2833a20)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SemanticState.cpp:729
#19 0x00000030c27b7221 in qpid::sys::AggregateOutput::doOutput (this=0x7f896c091768)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AggregateOutput.cpp:54
#20 0x00000030c2e58069 in qpid::broker::amqp_0_10::Connection::doOutput (this=0x7f896c091720)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:400
#21 0x00000030c2de5791 in qpid::amqp_0_10::Connection::encode (this=0x7f896c005310, buffer=<value optimized out>, size=65536)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:101
#22 0x00000030c27b9f54 in qpid::sys::AsynchIOHandler::idle (this=0x28f1e10)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:221
#23 0x00000030c2737968 in operator() (this=0x29269c0, h=...) at /usr/include/boost/function/function_template.hpp:1013
#24 qpid::sys::posix::AsynchIO::writeable (this=0x29269c0, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:582
#25 0x00000030c27bea83 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>,
    a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#26 0x00000030c27bd3ee in qpid::sys::DispatchHandle::processEvent (this=0x29269c8, type=qpid::sys::Poller::WRITABLE)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#27 0x00000030c275dfad in process (this=0x266b9a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#28 qpid::sys::Poller::run (this=0x266b9a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#29 0x00000030c275216a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#30 0x0000003d5f607a51 in start_thread (arg=0x7f89a7fff700) at pthread_create.c:301
#31 0x0000003d5f2e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 2 Pavel Moravec 2016-08-18 16:02:00 UTC
Mine analysis:

- many threads (# 2-4,6-10 and 15) are processing QMF "get list of bindings" request
- thread 14 routing a message via TopicExchange
- thread 10 doing periodicProcessing
- thread 5 creates a subscription for HA backup broker
- thread 1 segfaults when calculating encodedSize of ExecutionSyncBody frame - inside some std::string stuff, but ExecutionSyncBody has quite static size not depending on any string

It smells me like QMF requests & periodicProcessing caused some memory corruption that caused encodedSize to segfault..

Comment 3 Alan Conway 2016-09-22 20:47:57 UTC
Theoretical fix at:

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1368196

Not proven. The fix will log 'warning epoll:' messages if we hit any of the fixed codepaths, so if the reproducer doesn't crash and we see those log messages, then we are probably in good shape.

Bug 1368196 - Fix theoretcial race condition in EpollPoller

Theoretical fix for apparent race condition that allows AsynchIOHandler
connection resources to be deleted while still in use.

The theory: To safely delete a PollerHandle it must be:

1. not concurrently be in use by any worker thread.
2. un-registered from epoll.
3. not eligible to be re-activated in epoll.

DeletionManager only enforces 1. There is no enforcement of 2 or 3 in EpollPoller.
There are assert() macros showing that EpollPoller.cpp *assumes* that 2 & 3 hold,
but these have no effect in a release build if they are incorrect.

This patch replaces the asserts with if statements. If the assert would be true,
the behavior is unchanged, otherwise a safe behavior is substituted.