Description of problem: While fencing nodes in a cluster, occasionally encounter an issue where a previously backup broker becomes deadlocked while deleting auto-delete queues. Only noticed the issue because 'qpid-ha promote' hangs attempting to promote a backup to primary. Version-Release number of selected component (if applicable): Qpid 0.18 How reproducible: Rare (race condition) Steps to Reproduce: 1. Start HA-enabled brokers 2. Create tens-of-thousands of auto-delete queues 3. Fence / power-cycle the node hosting the primary broker Actual results: Occasionally the backup broker deadlocks Expected results: The backup broker does not deadlock Additional info:
Selected back trace shows thread 11 owning lock in qpid::ha::QueueReplicator and qpid::broker::LinkRegistry but waiting on lock in qpid::broker:::Link. Thread 4 owns the lock in qpid::broker::Link but is waiting for the lock in qpid::ha::QueueReplicator. Dozens of other threads are waiting on the lock in qpid::broker::LinkRegistry since the LinkRegistry's connection observer is called for each connection to the broker. Thread 4 (Thread 0x7f4db6bfd700 (LWP 11469)): #0 0x00000035b5a0e054 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00000035b5a093a3 in _L_lock_892 () from /lib64/libpthread.so.0 #2 0x00000035b5a09287 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x0000003573374f0a in qpid::sys::Mutex::lock (this=<value optimized out>) at ../include/qpid/sys/posix/Mutex.h:116 #4 0x00007f4dbe9027e1 in ScopedLock (this=0x7f4d893047b0, bridge=..., sessionHandler=...) at ../include/qpid/sys/Mutex.h:33 #5 qpid::ha::QueueReplicator::initializeBridge (this=0x7f4d893047b0, bridge=..., sessionHandler=...) at qpid/ha/QueueReplicator.cpp:166 #6 0x0000003573377dc8 in operator() (this=0x7f4d893038a0, c=...) at /usr/include/boost/function/function_template.hpp:1013 #7 qpid::broker::Bridge::create (this=0x7f4d893038a0, c=...) at qpid/broker/Bridge.cpp:122 #8 0x00000035733e1491 in qpid::broker::Link::ioThreadProcessing (this=0x2527240) at qpid/broker/Link.cpp:484 #9 0x00000035733a35ab in operator() (this=0x7f4d7dd193a0) at /usr/include/boost/function/function_template.hpp:1013 #10 qpid::broker::Connection::doIoCallbacks (this=0x7f4d7dd193a0) at qpid/broker/Connection.cpp:375 #11 0x00000035733a395c in qpid::broker::Connection::received (this=0x7f4d7dd193a0, frame=...) at qpid/broker/Connection.cpp:172 #12 0x000000357336f5ad in qpid::amqp_0_10::Connection::decode (this=0x7f4d7d2401d0, buffer=<value optimized out>, size=<value optimized out>) at qpid/amqp_0_10/Connection.cpp:58 #13 0x0000003572e1e59d in qpid::sys::AsynchIOHandler::readbuff (this=0x7f4d88f07d50, buff=0x7f4d885158a0) at qpid/sys/AsynchIOHandler.cpp:164 #14 0x0000003572d41501 in operator() (this=0x7f4d8885b120, h=...) at /usr/include/boost/function/function_template.hpp:1013 #15 qpid::sys::posix::AsynchIO::readable (this=0x7f4d8885b120, h=...) at qpid/sys/posix/AsynchIO.cpp:462 #16 0x0000003572e24403 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #17 0x0000003572e21606 in qpid::sys::DispatchHandle::processEvent (this=0x7f4d8885b128, type=qpid::sys::Poller::READABLE) at qpid/sys/DispatchHandle.cpp:280 #18 0x0000003572d4e0cd in process (this=0x250ac80) at qpid/sys/Poller.h:131 #19 qpid::sys::Poller::run (this=0x250ac80) at qpid/sys/epoll/EpollPoller.cpp:524 #20 0x0000003572d4616a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #21 0x00000035b5a07851 in start_thread () from /lib64/libpthread.so.0 #22 0x00000035b52e811d in clone () from /lib64/libc.so.6 Thread 11 (Thread 0x7f4db1468700 (LWP 11476)): #0 0x00000035b5a0e054 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00000035b5a093a3 in _L_lock_892 () from /lib64/libpthread.so.0 #2 0x00000035b5a09287 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x0000003573374f0a in qpid::sys::Mutex::lock (this=<value optimized out>) at ../include/qpid/sys/posix/Mutex.h:116 #4 0x00000035733e0dd0 in ScopedLock (this=0x2527240, bridge=...) at ../include/qpid/sys/Mutex.h:33 #5 qpid::broker::Link::cancel (this=0x2527240, bridge=...) at qpid/broker/Link.cpp:430 #6 0x00000035733ec4ba in qpid::broker::LinkRegistry::destroyBridge (this=0x251f080, bridge=0x7f4d893038a0) at qpid/broker/LinkRegistry.cpp:254 #7 0x0000003573375df2 in operator() (this=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #8 qpid::broker::Bridge::close (this=<value optimized out>) at qpid/broker/Bridge.cpp:194 #9 0x00007f4dbe903674 in qpid::ha::QueueReplicator::destroy (this=0x7f4d893047b0) at qpid/ha/QueueReplicator.cpp:156 #10 0x00000035734161b2 in call<boost::shared_ptr<qpid::broker::QueueObserver> const> (this=0x7f4d892ffe00) at /usr/include/boost/bind/mem_fn_template.hpp:40 #11 operator()<boost::shared_ptr<qpid::broker::QueueObserver> > (this=0x7f4d892ffe00) at /usr/include/boost/bind/mem_fn_template.hpp:61 #12 operator()<boost::_mfi::mf0<void, qpid::broker::QueueObserver>, boost::_bi::list1<const boost::shared_ptr<qpid::broker::QueueObserver>&> > (this=0x7f4d892ffe00) at /usr/include/boost/bind/bind.hpp:246 #13 operator()<boost::shared_ptr<qpid::broker::QueueObserver> > (this=0x7f4d892ffe00) at /usr/include/boost/bind/bind_template.hpp:47 #14 for_each<std::_Rb_tree_const_iterator<boost::shared_ptr<qpid::broker::QueueObserver> >, boost::_bi::bind_t<void, boost::_mfi::mf0<void, qpid::broker::QueueObserver>, boost::_bi::list1<boost::arg<1> > > > (this=0x7f4d892ffe00) at /usr/include/c++/4.4.6/bits/stl_algo.h:4200 #15 qpid::broker::Queue::destroyed (this=0x7f4d892ffe00) at qpid/broker/Queue.cpp:1391 #16 0x0000003573386a30 in qpid::broker::Broker::deleteQueue (this=0x251ec10, name="Queue1", userId="@QPID", connectionId="192.168.1.11:26901-system-node1c-cluster:10001", check=<value optimized out>) at qpid/broker/Broker.cpp:1145 #17 0x00007f4dbe8d3fc8 in qpid::ha::BrokerReplicator::deleteQueue (this=0x2529780, name="Queue1", purge=<value optimized out>) at qpid/ha/BrokerReplicator.cpp:782 #18 0x00007f4dbe8d4d91 in qpid::ha::BrokerReplicator::autoDeleteCheck (this=0x2529780, ex=<value optimized out>) at qpid/ha/BrokerReplicator.cpp:869 #19 0x00007f4dbe8e3453 in operator() (__first=<value optimized out>, __last=..., __f=...) at /usr/include/boost/bind/mem_fn_template.hpp:162 #20 operator()<boost::_mfi::mf1<void, qpid::ha::BrokerReplicator, boost::shared_ptr<qpid::broker::Exchange> >, boost::_bi::list1<boost::shared_ptr<qpid::broker::Exchange>&> > ( __first=<value optimized out>, __last=..., __f=...) at /usr/include/boost/bind/bind.hpp:306 ---Type <return> to continue, or q <return> to quit--- #21 operator()<boost::shared_ptr<qpid::broker::Exchange> > (__first=<value optimized out>, __last=..., __f=...) at /usr/include/boost/bind/bind_template.hpp:32 #22 std::for_each<__gnu_cxx::__normal_iterator<boost::shared_ptr<qpid::broker::Exchange>*, std::vector<boost::shared_ptr<qpid::broker::Exchange>, std::allocator<boost::shared_ptr<qpid::broker::Exchange> > > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::ha::BrokerReplicator, boost::shared_ptr<qpid::broker::Exchange> >, boost::_bi::list2<boost::_bi::value<qpid::ha::BrokerReplicator*>, boost::arg<1> > > > (__first=<value optimized out>, __last=..., __f=...) at /usr/include/c++/4.4.6/bits/stl_algo.h:4200 #23 0x00007f4dbe8d3ae6 in qpid::ha::BrokerReplicator::disconnected (this=0x2529780) at qpid/ha/BrokerReplicator.cpp:891 #24 0x000000357339841c in call<boost::shared_ptr<qpid::broker::ConnectionObserver>, qpid::broker::Connection> (this=<value optimized out>, c=...) at /usr/include/boost/bind/mem_fn_template.hpp:153 #25 operator()<boost::shared_ptr<qpid::broker::ConnectionObserver> > (this=<value optimized out>, c=...) at /usr/include/boost/bind/mem_fn_template.hpp:167 #26 operator()<boost::_mfi::mf1<void, qpid::broker::ConnectionObserver, qpid::broker::Connection&>, boost::_bi::list1<boost::shared_ptr<qpid::broker::ConnectionObserver>&> > ( this=<value optimized out>, c=...) at /usr/include/boost/bind/bind.hpp:306 #27 operator()<boost::shared_ptr<qpid::broker::ConnectionObserver> > (this=<value optimized out>, c=...) at /usr/include/boost/bind/bind_template.hpp:32 #28 for_each<__gnu_cxx::__normal_iterator<boost::shared_ptr<qpid::broker::ConnectionObserver>*, std::vector<boost::shared_ptr<qpid::broker::ConnectionObserver>, std::allocator<boost::shared_ptr<qpid::broker::ConnectionObserver> > > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::ConnectionObserver, qpid::broker::Connection&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::broker::Connection> > > > (this=<value optimized out>, c=...) at /usr/include/c++/4.4.6/bits/stl_algo.h:4200 #29 each<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::ConnectionObserver, qpid::broker::Connection&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::broker::Connection> > > > (this=<value optimized out>, c=...) at qpid/broker/Observers.h:63 #30 qpid::broker::ConnectionObservers::closed (this=<value optimized out>, c=...) at qpid/broker/ConnectionObservers.h:49 #31 0x00000035733a3b6a in qpid::broker::Connection::~Connection (this=0x7f4d74006840, __in_chrg=<value optimized out>) at qpid/broker/Connection.cpp:150 #32 0x00000035733a4859 in qpid::broker::Connection::~Connection (this=0x7f4d74006840, __in_chrg=<value optimized out>) at qpid/broker/Connection.cpp:160 #33 0x00000035733720d1 in ~auto_ptr (this=0x7f4d74003780, __in_chrg=<value optimized out>, __vtt_parm=<value optimized out>) at /usr/include/c++/4.4.6/backward/auto_ptr.h:168 #34 qpid::amqp_0_10::Connection::~Connection (this=0x7f4d74003780, __in_chrg=<value optimized out>, __vtt_parm=<value optimized out>) at qpid/amqp_0_10/Connection.h:45 #35 0x00000035733724a9 in qpid::amqp_0_10::Connection::~Connection (this=0x7f4d74003780, __in_chrg=<value optimized out>, __vtt_parm=<value optimized out>) at qpid/amqp_0_10/Connection.h:45 #36 0x00000035734432c4 in ~auto_ptr (this=0x7f4d740027b0, __in_chrg=<value optimized out>) at /usr/include/c++/4.4.6/backward/auto_ptr.h:168 #37 ~SecureConnection (this=0x7f4d740027b0, __in_chrg=<value optimized out>) at qpid/broker/SecureConnection.h:42 #38 qpid::broker::SecureConnection::~SecureConnection (this=0x7f4d740027b0, __in_chrg=<value optimized out>) at qpid/broker/SecureConnection.h:42 #39 0x0000003572e1d9a3 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7f4d7c0010f0, __in_chrg=<value optimized out>) at qpid/sys/AsynchIOHandler.cpp:81 #40 0x0000003572e1dae9 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7f4d7c0010f0, __in_chrg=<value optimized out>) at qpid/sys/AsynchIOHandler.cpp:82 #41 0x0000003572e1f170 in qpid::sys::AsynchIOHandler::closedSocket (this=0x7f4d7c0010f0, s=...) at qpid/sys/AsynchIOHandler.cpp:226 #42 0x0000003572d400e6 in operator() (this=0x7f4d7c002820, h=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #43 qpid::sys::posix::AsynchIO::close (this=0x7f4d7c002820, h=<value optimized out>) at qpid/sys/posix/AsynchIO.cpp:630 #44 0x0000003572d41e44 in qpid::sys::posix::AsynchIO::writeable (this=0x7f4d7c002820, h=...) at qpid/sys/posix/AsynchIO.cpp:589 #45 0x0000003572e24403 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #46 0x0000003572e212ce in qpid::sys::DispatchHandle::processEvent (this=0x7f4d7c002828, type=qpid::sys::Poller::WRITABLE) at qpid/sys/DispatchHandle.cpp:287 #47 0x0000003572d4e0cd in process (this=0x250ac80) at qpid/sys/Poller.h:131 #48 qpid::sys::Poller::run (this=0x250ac80) at qpid/sys/epoll/EpollPoller.cpp:524 #49 0x0000003572d4616a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #50 0x00000035b5a07851 in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #51 0x00000035b52e811d in clone () from /lib64/libc.so.6
Fix on trunk: ------------------------------------------------------------------------ r1476305 | aconway | 2013-04-26 13:28:26 -0400 (Fri, 26 Apr 2013) | 9 lines QPID-4780: Bug 889552 - HA broker deadlock after loss of primary broker. Lock ordering deadlock found by inspection of code and stack trace: - thread 1: Link::ioThreadProcessing(Link:lock)-> QueueReplicator::initializeBridge(QueueReplicator::lock) - thread 2: QueueReplicator::destroy(QueueReplicator::lock)-> Bridge::destroy(Link::lock) This patch breaks the lock by removing locking around Bridge::destroy in QueueReplicator::destroy. ------------------------------------------------------------------------
I am seeing 100% failure in backup brokers with the attached reproducer. ( Following notes are from the README file in the attachment. ) Reproducing 889552 ================================================ 1. Run the script ./start_ha_cluster the cluster will be at ports 5801, 5802, 5803 10 seconds after starting all brokers, this script promotes 5801 to be primary. 2. Run the C++ program many_autodelete_queues/declare_queues To build it, there's a little script called 'm' in that directory. This program will attempt to declare 50,000 autodelete queues. 3. While that program is running, do "ps -aef | grep qpidd" in a loop every minute or so. At some point you will notice that the two backup brokers have gone away. ----------------------------------------------- Result: I am seeing 100% failure rate with these packages: qpid-tools-0.22-6.el6.noarch qpid-cpp-client-devel-0.22-26.el6.x86_64 qpid-qmf-0.22-23.el6.x86_64 python-qpid-qmf-0.22-23.el6.x86_64 python-qpid-0.22-7.el6.noarch qpid-cpp-client-ssl-0.22-26.el6.x86_64 qpid-cpp-server-ha-0.22-26.el6.x86_64 qpid-proton-c-0.5-9.el6.x86_64 qpid-cpp-server-0.22-26.el6.x86_64 qpid-qmf-devel-0.22-23.el6.x86_64 qpid-cpp-client-0.22-26.el6.x86_64 qpid-cpp-server-ssl-0.22-26.el6.x86_64 The output from the failed brokers looks like this: 2013-11-20 10:26:32 [HA] error Backup: Connection error: Backup: Configuration replication failed: Link qpid.ha.link.2da1e206-2080-4ebe-8378-9abaa5e3989a channel pool is empty: while handling: [{_schema_id:{_class_name:queueDeclare, _hash:81c14a6d-4e35-0a72-44db-7f8a09d0646e, _package_name:org.apache.qpid.broker, _type:_event}, _severity:6, _timestamp:1384960846905653034, _values:{altEx:, args:{qpid.ha-uuid:True, qpid.replicate:all}, autoDel:True, disp:created, durable:False, excl:False, qName:q_32778, rhost:qpid.127.0.0.1:5801-127.0.0.1:40713, user:anonymous}}] (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/BrokerReplicator.cpp:492
Created attachment 826842 [details] reproducer
This looks like a different issue (one which needs to be addressed) Creating so many queues is using up all the channel identifiers on the replication link, causing the failure seen. I will start working on this issue, can post the results here or on another BZ as appropriate.
This new issue is fixed on trunk https://svn.apache.org/r1561206 Note the limit on the max number of queues has not been eliminated. It has been increased to 64k and the primary broker raises an exception if a client attempts to declare too many queues. See the comment on https://issues.apache.org/jira/browse/QPID-5513
This trunk patch fixes a link error on windows: Commit 1561754 from Alan Conway in branch 'qpid/trunk' [ https://svn.apache.org/r1561754 ] QPID-5513: HA backup fails if number of replicated queues exceeds number of channels. Fix link error (incorrect libraries listed) in ha_test_max_queues on windows.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1296.html