Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 568831

Summary: clusters_test fails sporadically with core in client::Connector code.
Product: Red Hat Enterprise MRG Reporter: Alan Conway <aconway>
Component: qpid-cppAssignee: Andrew Stitcher <astitcher>
Status: CLOSED DUPLICATE QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: astitcher, freznice, gsim
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-04-15 22:02:44 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alan Conway 2010-02-26 17:39:48 UTC
cluster_test fails sporadically when run in a loop.  I generally see a
failure in about 20 iterations, but have seen up to 70 repetitions
before a core. The cores are in client connection code run by a broker
sending an update to another broker.

The crash does not appear to be associated with a particular test, crashes
have been seen in at least these tests:
- testReconnectSameSessionName,
- testDequeueWaitingSubscription
- testAcl
- testMessageTimeToLive

So far I have seen 2 different stack traces (below.) The first is very
clearly in UpdateClient as it is closing a connection. The second is
in a Poller thread, it occurs when no other thread is in UpdateClient
code, which suggests that the UpdateClient thread has just terminated
when the crash occurs (there is no other code that could be causing
client::Connector code to run.) I don't know that the two traces are
related but it seems likely.

*** backtrace 1
#0  0x0000003da3430265 in raise () from /lib64/libc.so.6
#1  0x0000003da3431d10 in abort () from /lib64/libc.so.6
#2  0x0000003da34296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002ad609b8bb92 in qpid::sys::Mutex::unlock (this=0x591ad80) at ../../cpp/include/qpid/sys/posix/Mutex.h:120
#4  0x00002ad609b8bbac in ~ScopedLock (this=0x477e5030) at ../../cpp/include/qpid/sys/Mutex.h:34
#5  0x00002ad60a27c009 in qpid::sys::DispatchHandle::rewatchWrite (this=0x591ac78) at ../../cpp/src/qpid/sys/DispatchHandle.cpp:118
#6  0x00002ad60a18c4ca in qpid::sys::posix::AsynchIO::notifyPendingWrite (this=0x591ac70) at ../../cpp/src/qpid/sys/posix/AsynchIO.cpp:346
#7  0x00002ad60acfa202 in qpid::client::TCPConnector::send (this=0x5919f70, frame=@0x477e52a0) at ../../cpp/src/qpid/client/TCPConnector.cpp:203
#8  0x00002ad60acbcefd in boost::_mfi::mf1<void, qpid::client::Connector, qpid::framing::AMQFrame&>::call<boost::scoped_ptr<qpid::client::Connector>, qpid::framing::AMQFrame> (
    this=0x5919600, u=@0x5919d68, b1=@0x477e52a0) at /usr/include/boost/bind/mem_fn_template.hpp:140
#9  0x00002ad60acbcf29 in boost::_mfi::mf1<void, qpid::client::Connector, qpid::framing::AMQFrame&>::operator()<boost::scoped_ptr<qpid::client::Connector> > (this=0x5919600,
    u=@0x5919d68, a1=@0x477e52a0) at /usr/include/boost/bind/mem_fn_template.hpp:154
#10 0x00002ad60acbcf85 in boost::_bi::list2<boost::reference_wrapper<boost::scoped_ptr<qpid::client::Connector> >, boost::arg<1> >::operator()<boost::_mfi::mf1<void, qpid::client::Co\
nnector, qpid::framing::AMQFrame&>, boost::_bi::list1<qpid::framing::AMQFrame&> > (this=0x5919610, f=@0x5919600, a=@0x477e51c0) at /usr/include/boost/bind.hpp:287
#11 0x00002ad60acbcfcc in boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::client::Connector, qpid::framing::AMQFrame&>, boost::_bi::list2<boost::reference_wrapper<boost::scoped\
_ptr<qpid::client::Connector> >, boost::arg<1> > >::operator()<qpid::framing::AMQFrame> (this=0x5919600, a1=@0x477e52a0) at /usr/include/boost/bind/bind_template.hpp:32
#12 0x00002ad60acbcff3 in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::client::Connector, qpid::framing::AMQFrame&>, boos\
t::_bi::list2<boost::reference_wrapper<boost::scoped_ptr<qpid::client::Connector> >, boost::arg<1> > >, void, qpid::framing::AMQFrame&>::invoke (function_obj_ptr=
      {obj_ptr = 0x5919600, const_obj_ptr = 0x5919600, func_ptr = 0x5919600, data = ""}, a0=@0x477e52a0) at /usr/include/boost/function/function_template.hpp:136
#13 0x00002ad60acaf49a in boost::function1<void, qpid::framing::AMQFrame&, std::allocator<void> >::operator() (this=0x5919bc0, a0=@0x477e52a0)
    at /usr/include/boost/function/function_template.hpp:576
#14 0x00002ad60acaf4ca in qpid::client::ConnectionHandler::Adapter::handle (this=0x5919c50, f=@0x477e52a0) at ../../cpp/src/qpid/client/ConnectionHandler.h:60
#15 0x00002ad60a26a4ed in qpid::framing::Proxy::send (this=0x5919c68, b=@0x477e5300) at ../../cpp/src/qpid/framing/Proxy.cpp:37
#16 0x00002ad60a1b1672 in qpid::framing::AMQP_ServerProxy::Connection::close (this=0x5919c68, replyCode=200, replyText=@0x2ad60afeba48) at qpid/framing/AMQP_ServerProxy.cpp:144
#17 0x00002ad60aca95f3 in qpid::client::ConnectionHandler::close (this=0x5919ad8) at ../../cpp/src/qpid/client/ConnectionHandler.cpp:151
#18 0x00002ad60acb5cde in qpid::client::ConnectionImpl::close (this=0x59199f0) at ../../cpp/src/qpid/client/ConnectionImpl.cpp:287
#19 0x00002ad60aca4ee1 in qpid::client::Connection::close (this=0x5713bf8) at ../../cpp/src/qpid/client/Connection.cpp:152
#20 0x00002ad60a6b8ba8 in qpid::cluster::UpdateClient::updateConnection (this=0x5713a80, updateConnection=@0x5713e80) at ../../cpp/src/qpid/cluster/UpdateClient.cpp:358
#21 0x00002ad60a6bc33c in boost::_mfi::mf1<void, qpid::cluster::UpdateClient, boost::intrusive_ptr<qpid::cluster::Connection> const&>::operator() (this=0x477e5a60, p=0x5713a80,
    a1=@0x5713e80) at /usr/include/boost/bind/mem_fn_template.hpp:149
#22 0x00002ad60a6bd25d in boost::_bi::list2<boost::_bi::value<qpid::cluster::UpdateClient*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, qpid::cluster::UpdateClient, boost::in\
trusive_ptr<qpid::cluster::Connection> const&>, boost::_bi::list1<boost::intrusive_ptr<qpid::cluster::Connection>&> > (this=0x477e5a70, f=@0x477e5a60, a=@0x477e5a10)
    at /usr/include/boost/bind.hpp:287
#23 0x00002ad60a6bd2a4 in boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::cluster::UpdateClient, boost::intrusive_ptr<qpid::cluster::Connection> const&>, boost::_bi::list2<boos\
t::_bi::value<qpid::cluster::UpdateClient*>, boost::arg<1> > >::operator()<boost::intrusive_ptr<qpid::cluster::Connection> > (this=0x477e5a60, a1=@0x5713e80)
    at /usr/include/boost/bind/bind_template.hpp:32
#24 0x00002ad60a6bd2d1 in std::for_each<__gnu_cxx::__normal_iterator<boost::intrusive_ptr<qpid::cluster::Connection>*, std::vector<boost::intrusive_ptr<qpid::cluster::Connection>, st\
d::allocator<boost::intrusive_ptr<qpid::cluster::Connection> > > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::cluster::UpdateClient, boost::intrusive_ptr<qpid::cluster::C\
onnection> const&>, boost::_bi::list2<boost::_bi::value<qpid::cluster::UpdateClient*>, boost::arg<1> > > > (__first={_M_current = 0x5713e80}, __last={_M_current = 0x5713e88}, __f=
	{f_ = {f_ = 0x2ad60a6b849c <qpid::cluster::UpdateClient::updateConnection(boost::intrusive_ptr<qpid::cluster::Connection> const&)>}, l_ = {a1_ = {t_ = 0x5713a80}, a2_ = {<No \
data fields>}}}) at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algo.h:159
#25 0x00002ad60a6b751a in qpid::cluster::UpdateClient::update (this=0x5713a80) at ../../cpp/src/qpid/cluster/UpdateClient.cpp:142
#26 0x00002ad60a6b7a92 in qpid::cluster::UpdateClient::run (this=0x5713a80) at ../../cpp/src/qpid/cluster/UpdateClient.cpp:120
#27 0x00002ad60a1951bc in runRunnable (p=0x5713a80) at ../../cpp/src/qpid/sys/posix/Thread.cpp:35
#28 0x0000003da4006617 in start_thread () from /lib64/libpthread.so.0
#29 0x0000003da34d3c2d in clone () from /lib64/libc.so.6

*** backtrace 2
#0  0x00002b0c3aef9eae in qpid::client::TCPConnector::writebuff (this=0x2aaaac0422b0)
    at ../../cpp/src/qpid/client/TCPConnector.cpp:218
#1  0x00002b0c3aeff12a in boost::_mfi::mf1<void, qpid::client::TCPConnector, qpid::sys::AsynchIO&>::operator() (this=0x1961fe40, p=0x2aaaac0422b0, a1=@0x1961fa20)
    at /usr/include/boost/bind/mem_fn_template.hpp:149
#2  0x00002b0c3aeff30f in boost::_bi::list2<boost::_bi::value<qpid::client::TCPConnector*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, qpid::client::TCPConnector, qpid::sys::\
AsynchIO&>, boost::_bi::list1<qpid::sys::AsynchIO&> > (this=0x1961fe50, f=@0x1961fe40,
    a=@0x47fd3930) at /usr/include/boost/bind.hpp:287
#3  0x00002b0c3aeff356 in boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::client::TCPConnector, qpid::sys::AsynchIO&>, boost::_bi::list2<boost::_bi::value<qpid::client::TCPConn\
ector*>, boost::arg<1> > >::operator()<qpid::sys::AsynchIO> (this=0x1961fe40,
    a1=@0x1961fa20) at /usr/include/boost/bind/bind_template.hpp:32
#4  0x00002b0c3aeff37d in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::client::TCPConnector, qpid::sys::AsynchIO&>, boost\
::_bi::list2<boost::_bi::value<qpid::client::TCPConnector*>, boost::arg<1> > >, void, qpid::sys::AsynchIO&>::invoke (function_obj_ptr=
      {obj_ptr = 0x1961fe40, const_obj_ptr = 0x1961fe40, func_ptr = 0x1961fe40, data = "@"}, a0=@0x1961fa20) at /usr/include/boost/function/function_template.hpp:136
#5  0x00002b0c3a391d04 in boost::function1<void, qpid::sys::AsynchIO&, std::allocator<boost::function_base> >::operator() (this=0x1961fbd8, a0=@0x1961fa20)
    at /usr/include/boost/function/function_template.hpp:576
#6  0x00002b0c3a38c8bd in qpid::sys::posix::AsynchIO::writeable (this=0x1961fa20,
    h=@0x1961fa28) at ../../cpp/src/qpid/sys/posix/AsynchIO.cpp:536
#7  0x00002b0c3a390e7a in boost::_mfi::mf1<void, qpid::sys::posix::AsynchIO, qpid::sys::DispatchHandle&>::operator() (this=0x1961fd60, p=0x1961fa20, a1=@0x1961fa28)
    at /usr/include/boost/bind/mem_fn_template.hpp:149
#8  0x00002b0c3a3910f9 in boost::_bi::list2<boost::_bi::value<qpid::sys::posix::AsynchIO*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, qpid::sys::posix::AsynchIO, qpid::sys::\
DispatchHandle&>, boost::_bi::list1<qpid::sys::DispatchHandle&> > (this=0x1961fd70,
    f=@0x1961fd60, a=@0x47fd3c90) at /usr/include/boost/bind.hpp:287
#9  0x00002b0c3a391140 in boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::sys::posix::AsynchIO, qpid::sys::DispatchHandle&>, boost::_bi::list2<boost::_bi::value<qpid::sys::posi\
x::AsynchIO*>, boost::arg<1> > >::operator()<qpid::sys::DispatchHandle> (
    this=0x1961fd60, a1=@0x1961fa28) at /usr/include/boost/bind/bind_template.hpp:32
#10 0x00002b0c3a391167 in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::sys::posix::AsynchIO, qpid::sys::DispatchHandle&>,\
 boost::_bi::list2<boost::_bi::value<qpid::sys::posix::AsynchIO*>, boost::arg<1> > >, void, qpid::sys::DispatchHandle&>::invoke (function_obj_ptr=
      {obj_ptr = 0x1961fd60, const_obj_ptr = 0x1961fd60, func_ptr = 0x1961fd60, data = "`"}, a0=@0x1961fa28) at /usr/include/boost/function/function_template.hpp:136
#11 0x00002b0c3a4815e2 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=0x1961fa50, a0=@0x1961fa28)
    at /usr/include/boost/function/function_template.hpp:576
#12 0x00002b0c3a47b176 in qpid::sys::DispatchHandle::processEvent (this=0x1961fa28,
    type=qpid::sys::Poller::READ_WRITABLE)
    at ../../cpp/src/qpid/sys/DispatchHandle.cpp:285
#13 0x00002b0c3a39cc49 in qpid::sys::Poller::Event::process (this=0x47fd3ed0)
    at ../../cpp/src/qpid/sys/Poller.h:123
#14 0x00002b0c3a39c36a in qpid::sys::Poller::run (this=0x2aaaac041ae0)
    at ../../cpp/src/qpid/sys/epoll/EpollPoller.cpp:483
#15 0x00002b0c3a394ff8 in runRunnable (p=0x2aaaac041ae0)
    at ../../cpp/src/qpid/sys/posix/Thread.cpp:35
#16 0x0000003da4006617 in start_thread () from /lib64/libpthread.so.0
#17 0x0000003da34d3c2d in clone () from /lib64/libc.so.6

Comment 1 Alan Conway 2010-02-26 19:46:28 UTC
This is a regression 

At r901548 (before thread per connection change) I got 305 cluster_test runs with no errors 

At r901550 (thread per connection fix), I can rarely get even one clean cluster_test run, I get the writebuff stack trace right away. Current trunk is better so it appears some races were improved but there's still one left.

Comment 2 Alan Conway 2010-02-26 22:01:42 UTC
A faster reproducer:

 make -k cluster_test 'cluster_test_OBJECTS=unit_test.o ClusterFixture.o ForkedBroker.o ClusterFailover.o'

Then run cluster_test in a loop. It fails in under a minute for me and I've seen it produce both stack traces.

Comment 3 Alan Conway 2010-03-15 20:17:46 UTC
To allow cluster testing to advance I comitted a workaround in r923414, a 100ms sleep at line 166 of cluster/UpdateClient.cpp between calling Connection::close() and ~Connection. This does not address the problem but the reproducer of comment 2 will no longer work unless you remove the sleep.

As to the underlying problem, it appears that the TCPConnector object is deleted by another thread between lines 284 and 285 of DispatchHandle.cpp

The stack trace shows
-> DispatchHandle::processEvent state=DELETING type=READ_WRITABLE, in writeableCallback on line 285
...
-> TCPConnector::writebuff  - invalid vptr, object has been deleted.

I have been able to reproduce this by simply re-starting a 3 member cluster in a loop but its not as reliable as described as comment 2. 

I have not been able to reproduce with a separate client.

Comment 4 Alan Conway 2010-03-23 20:10:47 UTC
I think the problem stems from a change to TCPConnector::closeInternal in r901550. The previous code stopped the connections poller and joined its thread, thereby guaranteeing that it is safe to delete the Connector once close() returns. The new code has no such wait, and a poller thread is calling writebuff after it has been deleted.

Comment 5 Alan Conway 2010-03-24 12:32:47 UTC
From Andrew Stitcher: 

The AsynchIO::queueForDeletion() in closeInternal should ensure that no
more calls ever come up from that aio, it sounds like you've found a
race condition, which stops that working properly.

Comment 6 Alan Conway 2010-04-07 19:30:15 UTC
*** Bug 580138 has been marked as a duplicate of this bug. ***

Comment 7 Gordon Sim 2010-04-15 22:02:44 UTC

*** This bug has been marked as a duplicate of bug 580824 ***