Description of problem:
Frequent flapping (dis)connections to a qpid broker in an HA cluster, combined with frequent QMF requests to drop the established connections, cause the broker to segfault with this backtrace:

#0 0x0000003f6f032625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003f6f033e05 in abort () at abort.c:92
#2 0x0000003fa46bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3 0x0000003fa46bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x0000003fa46bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x0000003fa46bd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6 0x00000031589b7221 in qpid::sys::AggregateOutput::doOutput (this=0x7fb564002198) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AggregateOutput.cpp:54
#7 0x0000003159058069 in qpid::broker::amqp_0_10::Connection::doOutput (this=0x7fb564002150) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:400
#8 0x0000003158fe5791 in qpid::amqp_0_10::Connection::encode (this=0x7fb564001c20, buffer=<value optimized out>, size=65536) at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:101
#9 0x00000031589b9f54 in qpid::sys::AsynchIOHandler::idle (this=0x7fb564026b30) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:221
#10 0x0000003158937968 in operator() (this=0x7fb564005140, h=...) at /usr/include/boost/function/function_template.hpp:1013
#11 qpid::sys::posix::AsynchIO::writeable (this=0x7fb564005140, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:582
#12 0x00000031589bea83 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#13 0x00000031589bd3ee in qpid::sys::DispatchHandle::processEvent (this=0x7fb564005148, type=qpid::sys::Poller::WRITABLE) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#14 0x000000315895dfad in process (this=0xec3a40) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#15 qpid::sys::Poller::run (this=0xec3a40) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#16 0x000000315895216a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#17 0x0000003f6f407a51 in start_thread (arg=0x7fb58a298700) at pthread_create.c:301
#18 0x0000003f6f0e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Version-Release number of selected component (if applicable):
qpid-cpp-server 0.34-17

How reproducible:
100% within 30 minutes

Steps to Reproduce:
1. Run a 3-node HA cluster.
2. Run this script:

for i in $(seq 1 $receivers); do
  while true; do
    id=$(($((RANDOM))%1000))
    qpid-receive -a "qqq_${id}; {create:always, delete:always, node:{ x-declare:{auto-delete:True}, x-bindings:[{exchange:'amq.direct', queue:'qqq_${id}', key:'qqq_${id}'}, {exchange:'amq.fanout', queue:'qqq_${id}'}]}}" --timeout=2 --connection-options "{'heartbeat':1}" > /dev/null 2>&1 &
    pid=$!
    sleep $(($((RANDOM))%10))
    kill $pid
  done &
done &

while true; do
  for i in $(./close_my_connection | grep qpid | tail -n20); do
    ./close_connection $i
  done
  sleep 2
done &

while true; do
  for i in $(seq 1 50); do
    qpid-receive -a "qqqq_$(($((RANDOM))%100)); {create:always, delete:always, node:{ x-declare:{auto-delete:True}}}" --timeout=1 &
  done
  wait
done &

(I will provide the sources of close_connection* soon; they simply call a QMF method to close a qpid connection.)

3. Wait.

Actual results:
3. The broker segfaults with the above backtrace.

Expected results:
3. No segfault.

Additional info:
Different backtraces were also seen, such as:

#0 0x0000003d6129d25b in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib64/libstdc++.so.6
#1 0x00000030c2778d40 in _Construct<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > ( __first=<value optimized out>, __last=Cannot access memory at address 0x0 ) at /usr/include/c++/4.4.7/bits/stl_construct.h:80
#2 uninitialized_copy<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*> (__first=<value optimized out>, __last=Cannot access memory at address 0x0 ) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:74
#3 uninitialized_copy<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*> (__first=<value optimized out>, __last=Cannot access memory at address 0x0 ) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:116
#4 std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__first=<value optimized out>, __last=Cannot access memory at address 0x0 ) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:256
#5 0x00000030c27940c2 in qpid::framing::AMQFrame::encodedSize (this=0x7efea0a45cc0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/AMQFrame.cpp:46
#6 0x00000030c2de50ba in qpid::amqp_0_10::Connection::encode (this=0x7efe60099d30, buffer=<value optimized out>, size=65536) at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:94
#7 0x00000030c27b9f54 in qpid::sys::AsynchIOHandler::idle (this=0x7efea00a50f0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:221
#8 0x00000030c2737968 in operator() (this=0x7efea0489d20, h=...) at /usr/include/boost/function/function_template.hpp:1013
#9 qpid::sys::posix::AsynchIO::writeable (this=0x7efea0489d20, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:582
#10 0x00000030c27bea83 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#11 0x00000030c27bd3ee in qpid::sys::DispatchHandle::processEvent (this=0x7efea0489d28, type=qpid::sys::Poller::READ_WRITABLE) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#12 0x00000030c275dfad in process (this=0xcd59a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#13 qpid::sys::Poller::run (this=0xcd59a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#14 0x00000030c275216a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#15 0x0000003d5f607a51 in start_thread (arg=0x7efeb3341700) at pthread_create.c:301
#16 0x0000003d5f2e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

These are assumed to share the same root cause: a connection object is deleted while a thread is still using it.
Theoretical fix at:
http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1368196

Not proven. The fix logs 'warning epoll:' messages whenever one of the fixed code paths is hit, so if the reproducer no longer crashes and those messages appear in the log, we are probably in good shape.

Bug 1368196 - Fix theoretical race condition in EpollPoller

Theoretical fix for an apparent race condition that allows AsynchIOHandler connection resources to be deleted while still in use.

The theory: to be safely deleted, a PollerHandle must be:
1. not concurrently in use by any worker thread.
2. un-registered from epoll.
3. not eligible to be re-activated in epoll.

DeletionManager only enforces 1. There is no enforcement of 2 or 3 in EpollPoller. There are assert() macros showing that EpollPoller.cpp *assumes* that 2 & 3 hold, but these have no effect in a release build, so a violated assumption goes undetected.

This patch replaces the asserts with if statements. If the assert would be true, the behavior is unchanged; otherwise a safe behavior is substituted.
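
For illustration only, the assert-to-if change described above can be sketched as follows. This is not the actual patch: HandleInfo, safeToDelete, deleteWithAssert, deleteWithCheck and the exact log text are hypothetical names invented for this sketch, and "skip the delete and log a warning" is merely one assumed form of the "safe behavior". The only detail taken from the comment above is the 'warning epoll:' prefix.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical model of the per-handle state a poller tracks.
// These are NOT the real EpollPoller/PollerHandle types.
struct HandleInfo {
    bool inUseByWorker = false;     // condition 1: a worker thread is using it
    bool registeredInEpoll = false; // condition 2: still registered in epoll
    bool mayBeReactivated = false;  // condition 3: could be re-armed in epoll
    bool deleted = false;
};

// All three conditions from the commit message must hold before deletion.
bool safeToDelete(const HandleInfo& h) {
    return !h.inUseByWorker && !h.registeredInEpoll && !h.mayBeReactivated;
}

// Assert-only style: when NDEBUG is defined (typical for release builds)
// the assert expands to nothing and the handle is deleted unconditionally.
void deleteWithAssert(HandleInfo& h) {
    assert(safeToDelete(h));
    h.deleted = true;
}

// Checked style: the precondition is tested at run time in every build.
// On violation, log a warning and skip the delete (assumed safe behavior).
bool deleteWithCheck(HandleInfo& h, std::ostream& log) {
    if (!safeToDelete(h)) {
        log << "warning epoll: handle still known to poller, not deleting\n";
        return false;
    }
    h.deleted = true;
    return true;
}

// Small driver: returns what gets logged when a still-registered handle
// is passed to the checked variant.
std::string demoUnsafeDelete() {
    HandleInfo h;
    h.registeredInEpoll = true; // violates condition 2
    std::ostringstream log;
    deleteWithCheck(h, log);
    return log.str();
}
```

When the precondition holds, deleteWithCheck behaves exactly like deleteWithAssert. When it does not, the checked variant leaves the handle alone and emits a log line, which is the kind of message to look for when running the reproducer against the patched build.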