Bug 1376570 - segfault when frequently dropping connections via QMF
Summary: segfault when frequently dropping connections via QMF
Keywords:
Status: POST
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
: ---
Assignee: messaging-bugs
QA Contact: Messaging QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-15 18:35 UTC by Pavel Moravec
Modified: 2021-03-03 23:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Pavel Moravec 2016-09-15 18:35:48 UTC
Description of problem:
Frequently flapping connections to a qpid broker in an HA cluster, combined with frequent QMF requests to drop the connected ones, cause the broker to segfault with this backtrace:

#0  0x0000003f6f032625 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003f6f033e05 in abort () at abort.c:92
#2  0x0000003fa46bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003fa46bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003fa46bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003fa46bd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x00000031589b7221 in qpid::sys::AggregateOutput::doOutput (this=0x7fb564002198) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AggregateOutput.cpp:54
#7  0x0000003159058069 in qpid::broker::amqp_0_10::Connection::doOutput (this=0x7fb564002150) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:400
#8  0x0000003158fe5791 in qpid::amqp_0_10::Connection::encode (this=0x7fb564001c20, buffer=<value optimized out>, size=65536)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:101
#9  0x00000031589b9f54 in qpid::sys::AsynchIOHandler::idle (this=0x7fb564026b30) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:221
#10 0x0000003158937968 in operator() (this=0x7fb564005140, h=...) at /usr/include/boost/function/function_template.hpp:1013
#11 qpid::sys::posix::AsynchIO::writeable (this=0x7fb564005140, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:582
#12 0x00000031589bea83 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#13 0x00000031589bd3ee in qpid::sys::DispatchHandle::processEvent (this=0x7fb564005148, type=qpid::sys::Poller::WRITABLE) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#14 0x000000315895dfad in process (this=0xec3a40) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#15 qpid::sys::Poller::run (this=0xec3a40) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#16 0x000000315895216a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#17 0x0000003f6f407a51 in start_thread (arg=0x7fb58a298700) at pthread_create.c:301
#18 0x0000003f6f0e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115


Version-Release number of selected component (if applicable):
qpid-cpp-server 0.34-17


How reproducible:
100% within 30 minutes


Steps to Reproduce:
1. Run a 3-node HA cluster
2. Run this script:

for i in $(seq 1 $receivers); do
	while true; do
		id=$(($((RANDOM))%1000))
		qpid-receive -a "qqq_${id}; {create:always, delete:always, node:{ x-declare:{auto-delete:True}, x-bindings:[{exchange:'amq.direct', queue:'qqq_${id}', key:'qqq_${id}'}, {exchange:'amq.fanout', queue:'qqq_${id}'}]}}" --timeout=2 --connection-options "{'heartbeat':1}" > /dev/null 2>&1 &
		pid=$!
		sleep $(($((RANDOM))%10))
		kill $pid
	done &
done &

while true; do
        for i in $(./close_my_connection | grep qpid | tail -n20); do
                ./close_connection $i
        done
        sleep 2
done &

while true; do
        for i in $(seq 1 50); do
               qpid-receive -a "qqqq_$(($((RANDOM))%100)); {create:always, delete:always, node:{ x-declare:{auto-delete:True}}}" --timeout=1 &
        done
        wait
done &

(I will provide the sources of close_connection* soon; they simply call a QMF method to close a qpid connection.)

3. Wait


Actual results:
3. The broker segfaults with the above backtrace


Expected results:
3. no segfault


Additional info:
Different backtraces were also seen, for example:

#0  0x0000003d6129d25b in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
    () from /usr/lib64/libstdc++.so.6
#1  0x00000030c2778d40 in _Construct<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (
    __first=<value optimized out>, __last=Cannot access memory at address 0x0
) at /usr/include/c++/4.4.7/bits/stl_construct.h:80
#2  uninitialized_copy<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*> (__first=<value optimized out>, __last=Cannot access memory at address 0x0
) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:74
#3  uninitialized_copy<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*> (__first=<value optimized out>, __last=Cannot access memory at address 0x0
) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:116
#4  std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__first=<value optimized out>, __last=Cannot access memory at address 0x0
) at /usr/include/c++/4.4.7/bits/stl_uninitialized.h:256
#5  0x00000030c27940c2 in qpid::framing::AMQFrame::encodedSize (this=0x7efea0a45cc0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/AMQFrame.cpp:46
#6  0x00000030c2de50ba in qpid::amqp_0_10::Connection::encode (this=0x7efe60099d30, buffer=<value optimized out>, size=65536)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:94
#7  0x00000030c27b9f54 in qpid::sys::AsynchIOHandler::idle (this=0x7efea00a50f0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:221
#8  0x00000030c2737968 in operator() (this=0x7efea0489d20, h=...) at /usr/include/boost/function/function_template.hpp:1013
#9  qpid::sys::posix::AsynchIO::writeable (this=0x7efea0489d20, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:582
#10 0x00000030c27bea83 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#11 0x00000030c27bd3ee in qpid::sys::DispatchHandle::processEvent (this=0x7efea0489d28, type=qpid::sys::Poller::READ_WRITABLE)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#12 0x00000030c275dfad in process (this=0xcd59a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#13 qpid::sys::Poller::run (this=0xcd59a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#14 0x00000030c275216a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#15 0x0000003d5f607a51 in start_thread (arg=0x7efeb3341700) at pthread_create.c:301
#16 0x0000003d5f2e896d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

These are assumed to have the same root cause: the connection object is deleted while a thread is still using it.

Comment 1 Alan Conway 2016-09-22 20:48:16 UTC
Theoretical fix at:

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1368196

Not proven. The fix will log 'warning epoll:' messages if we hit any of the fixed codepaths, so if the reproducer doesn't crash and we see those log messages, then we are probably in good shape.

Bug 1368196 - Fix theoretical race condition in EpollPoller

Theoretical fix for apparent race condition that allows AsynchIOHandler
connection resources to be deleted while still in use.

The theory: To safely delete a PollerHandle it must be:

1. not concurrently in use by any worker thread.
2. un-registered from epoll.
3. not eligible to be re-activated in epoll.

DeletionManager only enforces 1. There is no enforcement of 2 or 3 in EpollPoller.
There are assert() macros showing that EpollPoller.cpp *assumes* that 2 & 3 hold,
but these have no effect in a release build if they are incorrect.
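
As a rough illustration of the three conditions above, they can be written as a single predicate. This is only a sketch: HandleState, its fields and safeToDelete() are hypothetical names made up for this example and are not taken from EpollPoller.cpp or DeletionManager.

```cpp
// Hypothetical per-handle bookkeeping; none of these names come from
// the qpid sources.
struct HandleState {
    int  activeWorkers;  // worker threads currently running a callback on the handle
    bool registered;     // handle is still registered with epoll
    bool canRearm;       // handle could still be re-activated in epoll
};

// Returns true only when all three conditions from the list hold.
inline bool safeToDelete(const HandleState& s) {
    return s.activeWorkers == 0  // 1. not in use by any worker thread
        && !s.registered         // 2. un-registered from epoll
        && !s.canRearm;          // 3. not eligible to be re-activated
}
```

In these terms, enforcing only condition 1 corresponds to checking activeWorkers alone, which leaves the other two flags unchecked at deletion time.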

This patch replaces the asserts with if statements. If the assert would be true,
the behavior is unchanged, otherwise a safe behavior is substituted.
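
The shape of that change might look roughly like the sketch below. It is not the actual patch: Handle, HandleStatus and processIfActive() are hypothetical, and only the 'warning epoll:' log prefix is taken from the comment above.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>

// Hypothetical handle states, for illustration only.
enum HandleStatus { INACTIVE, ACTIVE, DELETED };

struct Handle {
    HandleStatus status;
};

// Before (sketch):  assert(h.status == ACTIVE);
// That check is compiled out when NDEBUG is defined, so a release build
// would continue with a handle in an unexpected state.
//
// After (sketch): a runtime check that also exists in release builds.
// If the old assertion would have held, behavior is unchanged; otherwise
// the event is skipped and a warning is logged.
inline bool processIfActive(Handle& h, std::ostream& log) {
    if (h.status != ACTIVE) {
        log << "warning epoll: ignoring event for handle in unexpected state\n";
        return false;  // safe fallback: do not touch the handle
    }
    // ... normal event dispatch would happen here ...
    return true;
}
```

Passing the log stream as a parameter is just a convenience for this sketch; the broker would use its own logging.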

