Bug 1712228

Summary: sporadic qpidd segfault during shutdown while attaching new link
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: 3.2
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: x86_64
OS: Linux
Reporter: Pavel Moravec <pmoravec>
Assignee: Mike Cressman <mcressma>
QA Contact: Messaging QE <messaging-qe-bugs>
CC: gsim, jfrancin, jross, mcressma, zkraus
Keywords: OtherQA
Target Milestone: 3.2.13
Fixed In Version: qpid-cpp-1.36.0-22
Last Closed: 2019-07-15 07:54:52 UTC
Type: Bug

Description Pavel Moravec 2019-05-21 05:44:19 UTC
Description of problem:
When a broker is shut down while it is just about to attach a link to a federated broker, a concurrency bug can cause a segfault.

Backtraces (see Additional info) show the problem is concurrent manipulation of qpid::sys::HandleSet: a new link adds an entry to the set at the same time as broker shutdown clears the set's contents.

The segfault is quite rare (I reproduced it once in a few hours of testing; the customer sees it with roughly 2% probability) and most probably happens only during shutdown.
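
For orientation, a rough standalone C++11 sketch of the data structure being fought over. This is simplified and hypothetical: the real code is qpid::sys::HandleSet in EpollPoller.cpp, and the class name and internals below are made up for illustration only.

#include <mutex>
#include <set>

struct PollerHandle {};                     // stands in for qpid::sys::PollerHandle

class HandleRegistry {                      // hypothetical stand-in for qpid::sys::HandleSet
    std::mutex lock;
    std::set<PollerHandle*> handles;        // the std::_Rb_tree seen in frames #0-#4 below
public:
    // link-attach path: Poller::registerHandle() -> HandleSet::add()
    void add(PollerHandle* h) {
        std::lock_guard<std::mutex> l(lock);
        handles.insert(h);
    }
    // normal teardown of one handle: Poller::unregisterHandle() -> HandleSet::remove()
    void remove(PollerHandle* h) {
        std::lock_guard<std::mutex> l(lock);
        handles.erase(h);
    }
    // broker shutdown path: HandleSet::cleanup() empties the whole set
    void cleanup() {
        std::set<PollerHandle*> detached;
        {
            std::lock_guard<std::mutex> l(lock);
            handles.swap(detached);
        }
        // ... deliver DISCONNECTED to everything in 'detached' ...
        // A handle add()ed by the link timer after this swap can be left
        // behind and can touch state the shutdown path has already torn down.
    }
};

Even if each individual call takes the lock, the two call paths in the backtraces are not ordered against each other: the link-maintenance timer can still call add() while, or after, shutdown runs cleanup().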


Version-Release number of selected component (if applicable):
qpid-cpp-server-1.36.0-21.el6.x86_64
(but expected in any version)


How reproducible:
low probability


Steps to Reproduce:
1) Add to /etc/hosts:
127.0.0.1 test-1 test-2 test-3 test-4 test-5 test-6 test-7 test-8 test-9 test-0

2) Run this script to (A) start one broker accepting only SSL connections, and (B) repeatedly restart a 2nd broker that creates many links to the first broker (all of them will fail because they are not SSL connections, and some re-attach attempt will happen during the 2nd broker's shutdown at the "right" time):

killall qpidd
sleep 1

commonOptions="--log-to-stderr=no --log-to-stdout=no --log-enable=critical+"
# 1st broker: requires SSL and client auth, so the plain-TCP link connections below keep failing
qpidd --require-encryption=yes --ssl-require-client-authentication=yes --ssl-port=5671 --log-to-file=qpidd.log $commonOptions &

while true; do
	echo "$(date): starting 2nd broker.."
	# 2nd broker: aggressive link maintenance interval so re-attach attempts keep firing
	qpidd --worker-threads 12 --port=5673 --pid-dir ~/.qpidd5673 --data-dir ~/.qpidd5673 --store-dir ~/.qpidd5673 --log-to-file=qpidd.5673.log --link-maintenance-interval 0.01 $commonOptions &
	pid=$!
	sleep 1
	echo "$(date): adding some links.."
	for i in $(seq 0 9); do qpid-route link add localhost:5673 test-${i}:5672 & done
	while [ $(ps aux | grep python | grep -c qpid-route) -gt 0 ]; do
		sleep 1
	done
	sleep 10
	echo "$(date): killing 2nd broker.."
	# shut the 2nd broker down while link re-attach attempts may still be in flight
	kill $pid
	sleep 5
done


Actual results:
Sometimes, a coredump is generated during shutdown of the broker.


Expected results:
No coredump; the broker shuts down cleanly.


Additional info:
Backtraces:
Program terminated with signal 11, Segmentation fault.
#0  std::_Rb_tree_insert_and_rebalance (__insert_left=<value optimized out>, __x=0x7fdd4c0164a0, __p=<value optimized out>, __header=...) at ../../../../libstdc++-v3/src/tree.cc:248
248	    __root->_M_color = _S_black;
Missing separate debuginfos, use: debuginfo-install libibverbs-1.1.8-4.el6.x86_64 librdmacm-1.0.21-0.el6.x86_64 xerces-c-3.0.1-20.el6.x86_64 xqilla-2.2.3-8.el6.x86_64
(gdb) bt
#0  std::_Rb_tree_insert_and_rebalance (__insert_left=<value optimized out>, __x=0x7fdd4c0164a0, __p=<value optimized out>, __header=...) at ../../../../libstdc++-v3/src/tree.cc:248
#1  0x00007fdd8584b894 in std::_Rb_tree<qpid::sys::PollerHandle*, qpid::sys::PollerHandle*, std::_Identity<qpid::sys::PollerHandle*>, std::less<qpid::sys::PollerHandle*>, std::allocator<qpid::sys::PollerHandle*> >::_M_insert_ (this=0x746400, __x=<value optimized out>, __p=0x7fdd4c003710, __v=@0x7fdd80c46878) at /usr/include/c++/4.4.7/bits/stl_tree.h:883
#2  0x00007fdd8584a753 in _M_insert_unique (this=0x7463d8, h=0x7fdd4cf55368) at /usr/include/c++/4.4.7/bits/stl_tree.h:1182
#3  insert (this=0x7463d8, h=0x7fdd4cf55368) at /usr/include/c++/4.4.7/bits/stl_set.h:411
#4  qpid::sys::HandleSet::add (this=0x7463d8, h=0x7fdd4cf55368) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/epoll/EpollPoller.cpp:173
#5  0x00007fdd8584a839 in qpid::sys::Poller::registerHandle (this=0x73bec0, handle=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/epoll/EpollPoller.cpp:341
#6  0x00007fdd858ab7df in qpid::sys::DispatchHandle::startWatch (this=0x7fdd4cf55368, poller0=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/DispatchHandle.cpp:55
#7  0x00007fdd85821e8e in qpid::sys::posix::AsynchConnector::start (this=<value optimized out>, poller=<value optimized out>)
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/AsynchIO.cpp:188
#8  0x00007fdd85eb319f in qpid::sys::SocketConnector::connect (this=<value optimized out>, poller=<value optimized out>, name=<value optimized out>, host="test-1", port="5672", fact=
    0x7fdd80c46bb0, failed=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/SocketTransport.cpp:212
#9  0x00007fdd85d6e58c in qpid::broker::Broker::connect (this=<value optimized out>, name="qpid.tcp:test-1:5672", host="test-1", port="5672", transport=<value optimized out>, f=0x745f98, 
    failed=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/Broker.cpp:1367
#10 0x00007fdd85d6e874 in qpid::broker::Broker::connect (this=0x745820, name="qpid.tcp:test-1:5672", host="test-1", port=<value optimized out>, transport=<value optimized out>, 
    failed=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/Broker.cpp:1358
#11 0x00007fdd85e07c00 in qpid::broker::Link::startConnectionLH (this=0x7fdd50028f70) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/Link.cpp:227
#12 0x00007fdd85e087fe in qpid::broker::Link::maintenanceVisit (this=0x7fdd50028f70) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/Link.cpp:518
#13 0x00007fdd85e0d574 in qpid::broker::LinkTimerTask::fire (this=0x7fdd50018d20) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/broker/Link.cpp:70
#14 0x00007fdd858af65b in qpid::sys::Timer::fire (this=<value optimized out>, t=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:222
#15 0x00007fdd858b0c85 in qpid::sys::Timer::run (this=0x7460f0) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:156
#16 0x00007fdd8583f7aa in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/Thread.cpp:35
#17 0x00007fdd85254a51 in start_thread (arg=0x7fdd80c48700) at pthread_create.c:301
#18 0x00007fdd84fa196d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fdd7f82a700 (LWP 26621))]#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
136	2:	movl	%edx, %eax
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007fdd852565a3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007fdd85256487 in __pthread_mutex_lock (mutex=0x7463d8) at pthread_mutex_lock.c:82
#3  0x00007fdd85846905 in qpid::sys::Mutex::lock (this=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/Mutex.h:116
#4  0x00007fdd8584a39b in ScopedLock (this=<value optimized out>, h=0x7fdd4cf17578) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Mutex.h:33
#5  qpid::sys::HandleSet::remove (this=<value optimized out>, h=0x7fdd4cf17578) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/epoll/EpollPoller.cpp:177
#6  0x00007fdd8584a546 in qpid::sys::Poller::unregisterHandle (this=0x73bec0, handle=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/epoll/EpollPoller.cpp:352
#7  0x00007fdd858ab2b1 in qpid::sys::DispatchHandle::stopWatch (this=0x7fdd4cf17578) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/DispatchHandle.cpp:191
#8  0x00007fdd85826922 in qpid::sys::posix::AsynchConnector::connComplete (this=0x7fdd4cf17570, h=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/AsynchIO.cpp:213
#9  0x00007fdd858ac2a3 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#10 0x00007fdd858aabc7 in qpid::sys::DispatchHandle::processEvent (this=0x7fdd4cf17578, type=qpid::sys::Poller::DISCONNECTED)
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/DispatchHandle.cpp:291
#11 0x00007fdd858493e6 in process (this=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Poller.h:131
#12 qpid::sys::HandleSet::cleanup (this=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/epoll/EpollPoller.cpp:187
#13 0x00007fdd8584b3ed in qpid::sys::Poller::run (this=0x73bec0) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/epoll/EpollPoller.cpp:529
#14 0x00007fdd8583f7aa in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/Thread.cpp:35
#15 0x00007fdd85254a51 in start_thread (arg=0x7fdd7f82a700) at pthread_create.c:301
#16 0x00007fdd84fa196d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 1 Pavel Moravec 2019-05-21 05:48:46 UTC
Prospective patch: during shutdown, cancel all timers first?

Comment 3 Gordon Sim 2019-06-10 20:21:18 UTC
(In reply to Pavel Moravec from comment #1)
> Prospective patch: during shutdown, cancel all timers first?

+1, that change is now committed upstream (see QPID-8321). It also fixes https://bugzilla.redhat.com/show_bug.cgi?id=1624896
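
For illustration only, a minimal C++11 sketch of that ordering idea. This is not the QPID-8321 patch; PeriodicTimer and its members are hypothetical names (qpid's real qpid::sys::Timer is more involved). The point is simply that the timer thread is cancelled and joined before the poller's handle set is torn down, so no new registerHandle()/HandleSet::add() can race with HandleSet::cleanup().

#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

class PeriodicTimer {                       // hypothetical simplified timer
    std::function<void()> task;
    std::chrono::milliseconds period;
    std::mutex m;
    std::condition_variable cv;
    bool stopped = false;
    std::thread worker;
public:
    PeriodicTimer(std::function<void()> t, std::chrono::milliseconds p)
        : task(std::move(t)), period(p),
          worker([this] {
              std::unique_lock<std::mutex> l(m);
              // fire task() every 'period' until stop() is called
              while (!cv.wait_for(l, period, [this] { return stopped; })) {
                  l.unlock();
                  task();                   // e.g. the link-maintenance visit
                  l.lock();
              }
          }) {}

    // Cancel and join: once this returns, task() can never fire again.
    void stop() {
        {
            std::lock_guard<std::mutex> l(m);
            stopped = true;
        }
        cv.notify_all();
        if (worker.joinable()) worker.join();
    }

    ~PeriodicTimer() { stop(); }
};

// Shutdown ordering that removes the race (pseudo-usage, hypothetical names):
//   timer.stop();        // 1. nothing can call registerHandle()/add() any more
//   poller shutdown;     // 2. HandleSet::cleanup() now runs with no competing add()

With that ordering, the timer-thread stack from the first backtrace (LinkTimerTask::fire -> Link::maintenanceVisit -> Broker::connect -> Poller::registerHandle) can no longer run by the time the poller thread reaches HandleSet::cleanup.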

Comment 4 Pavel Moravec 2019-06-17 09:43:11 UTC
Testing scratch build http://brew-task-repos.usersys.redhat.com/repos/scratch/mcressma/qpid-cpp/1.36.0/22.el6/qpid-cpp-1.36.0-22.el6-scratch.repo :

no segfault on this build within a 2-hour test.

Comment 9 errata-xmlrpc 2019-07-15 07:54:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1770