Bug 889552 - HA broker deadlock after loss of primary broker
Summary: HA broker deadlock after loss of primary broker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: 3.0
: ---
Assignee: Alan Conway
QA Contact: mick
URL:
Whiteboard:
Depends On:
Blocks: 698367
TreeView+ depends on / blocked
 
Reported: 2012-12-21 17:24 UTC by Jason Dillaman
Modified: 2014-09-24 15:05 UTC (History)
5 users (show)

Fixed In Version: qpid-cpp-0.22-35
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-24 15:05:54 UTC
Target Upstream Version:


Attachments (Terms of Use)
reproducer (87.79 KB, application/x-gzip)
2013-11-20 20:32 UTC, mick
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-4780 0 None None None Never
Apache JIRA QPID-5513 0 None None None Never
Red Hat Bugzilla 1076533 0 medium VERIFIED qpidd's ha-replication hardcoded undocumented limit (65434 queues) 2021-03-03 23:10:08 UTC
Red Hat Product Errata RHEA-2014:1296 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 3.0 Release 2014-09-24 19:00:06 UTC

Internal Links: 1076533

Description Jason Dillaman 2012-12-21 17:24:52 UTC
Description of problem:
While fencing nodes in a cluster, occasionally encounter an issue where a previously backup broker becomes deadlocked while deleting auto-delete queues.  Only noticed the issue because 'qpid-ha promote' hangs attempting to promote a backup to primary.

Version-Release number of selected component (if applicable):
Qpid 0.18

How reproducible:
Rare (race condition)

Steps to Reproduce:
1. Start HA-enabled brokers
2. Create tens-of-thousands of auto-delete queues
3. Fence / power-cycle the node hosting the primary broker
  
Actual results:
Occasionally the backup broker deadlocks

Expected results:
The backup broker does not deadlock

Additional info:

Comment 1 Jason Dillaman 2012-12-21 17:29:24 UTC
Selected back trace shows thread 11 owning lock in qpid::ha::QueueReplicator and qpid::broker::LinkRegistry but waiting on lock in qpid::broker:::Link.  Thread 4 owns the lock in qpid::broker::Link but is waiting for the lock in qpid::ha::QueueReplicator.  Dozens of other threads are waiting on the lock in qpid::broker::LinkRegistry since the LinkRegistry's connection observer is called for each connection to the broker.

Thread 4 (Thread 0x7f4db6bfd700 (LWP 11469)):
#0  0x00000035b5a0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000035b5a093a3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00000035b5a09287 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003573374f0a in qpid::sys::Mutex::lock (this=<value optimized out>) at ../include/qpid/sys/posix/Mutex.h:116
#4  0x00007f4dbe9027e1 in ScopedLock (this=0x7f4d893047b0, bridge=..., sessionHandler=...) at ../include/qpid/sys/Mutex.h:33
#5  qpid::ha::QueueReplicator::initializeBridge (this=0x7f4d893047b0, bridge=..., sessionHandler=...) at qpid/ha/QueueReplicator.cpp:166
#6  0x0000003573377dc8 in operator() (this=0x7f4d893038a0, c=...) at /usr/include/boost/function/function_template.hpp:1013
#7  qpid::broker::Bridge::create (this=0x7f4d893038a0, c=...) at qpid/broker/Bridge.cpp:122
#8  0x00000035733e1491 in qpid::broker::Link::ioThreadProcessing (this=0x2527240) at qpid/broker/Link.cpp:484
#9  0x00000035733a35ab in operator() (this=0x7f4d7dd193a0) at /usr/include/boost/function/function_template.hpp:1013
#10 qpid::broker::Connection::doIoCallbacks (this=0x7f4d7dd193a0) at qpid/broker/Connection.cpp:375
#11 0x00000035733a395c in qpid::broker::Connection::received (this=0x7f4d7dd193a0, frame=...) at qpid/broker/Connection.cpp:172
#12 0x000000357336f5ad in qpid::amqp_0_10::Connection::decode (this=0x7f4d7d2401d0, buffer=<value optimized out>, size=<value optimized out>) at qpid/amqp_0_10/Connection.cpp:58
#13 0x0000003572e1e59d in qpid::sys::AsynchIOHandler::readbuff (this=0x7f4d88f07d50, buff=0x7f4d885158a0) at qpid/sys/AsynchIOHandler.cpp:164
#14 0x0000003572d41501 in operator() (this=0x7f4d8885b120, h=...) at /usr/include/boost/function/function_template.hpp:1013
#15 qpid::sys::posix::AsynchIO::readable (this=0x7f4d8885b120, h=...) at qpid/sys/posix/AsynchIO.cpp:462
#16 0x0000003572e24403 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#17 0x0000003572e21606 in qpid::sys::DispatchHandle::processEvent (this=0x7f4d8885b128, type=qpid::sys::Poller::READABLE) at qpid/sys/DispatchHandle.cpp:280
#18 0x0000003572d4e0cd in process (this=0x250ac80) at qpid/sys/Poller.h:131
#19 qpid::sys::Poller::run (this=0x250ac80) at qpid/sys/epoll/EpollPoller.cpp:524
#20 0x0000003572d4616a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#21 0x00000035b5a07851 in start_thread () from /lib64/libpthread.so.0
#22 0x00000035b52e811d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f4db1468700 (LWP 11476)):
#0  0x00000035b5a0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000035b5a093a3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00000035b5a09287 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003573374f0a in qpid::sys::Mutex::lock (this=<value optimized out>) at ../include/qpid/sys/posix/Mutex.h:116
#4  0x00000035733e0dd0 in ScopedLock (this=0x2527240, bridge=...) at ../include/qpid/sys/Mutex.h:33
#5  qpid::broker::Link::cancel (this=0x2527240, bridge=...) at qpid/broker/Link.cpp:430
#6  0x00000035733ec4ba in qpid::broker::LinkRegistry::destroyBridge (this=0x251f080, bridge=0x7f4d893038a0) at qpid/broker/LinkRegistry.cpp:254
#7  0x0000003573375df2 in operator() (this=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#8  qpid::broker::Bridge::close (this=<value optimized out>) at qpid/broker/Bridge.cpp:194
#9  0x00007f4dbe903674 in qpid::ha::QueueReplicator::destroy (this=0x7f4d893047b0) at qpid/ha/QueueReplicator.cpp:156
#10 0x00000035734161b2 in call<boost::shared_ptr<qpid::broker::QueueObserver> const> (this=0x7f4d892ffe00) at /usr/include/boost/bind/mem_fn_template.hpp:40
#11 operator()<boost::shared_ptr<qpid::broker::QueueObserver> > (this=0x7f4d892ffe00) at /usr/include/boost/bind/mem_fn_template.hpp:61
#12 operator()<boost::_mfi::mf0<void, qpid::broker::QueueObserver>, boost::_bi::list1<const boost::shared_ptr<qpid::broker::QueueObserver>&> > (this=0x7f4d892ffe00)
    at /usr/include/boost/bind/bind.hpp:246
#13 operator()<boost::shared_ptr<qpid::broker::QueueObserver> > (this=0x7f4d892ffe00) at /usr/include/boost/bind/bind_template.hpp:47
#14 for_each<std::_Rb_tree_const_iterator<boost::shared_ptr<qpid::broker::QueueObserver> >, boost::_bi::bind_t<void, boost::_mfi::mf0<void, qpid::broker::QueueObserver>, boost::_bi::list1<boost::arg<1> > > > (this=0x7f4d892ffe00) at /usr/include/c++/4.4.6/bits/stl_algo.h:4200
#15 qpid::broker::Queue::destroyed (this=0x7f4d892ffe00) at qpid/broker/Queue.cpp:1391
#16 0x0000003573386a30 in qpid::broker::Broker::deleteQueue (this=0x251ec10, name="Queue1", userId="@QPID", connectionId="192.168.1.11:26901-system-node1c-cluster:10001", 
    check=<value optimized out>) at qpid/broker/Broker.cpp:1145
#17 0x00007f4dbe8d3fc8 in qpid::ha::BrokerReplicator::deleteQueue (this=0x2529780, name="Queue1", purge=<value optimized out>) at qpid/ha/BrokerReplicator.cpp:782
#18 0x00007f4dbe8d4d91 in qpid::ha::BrokerReplicator::autoDeleteCheck (this=0x2529780, ex=<value optimized out>) at qpid/ha/BrokerReplicator.cpp:869
#19 0x00007f4dbe8e3453 in operator() (__first=<value optimized out>, __last=..., __f=...) at /usr/include/boost/bind/mem_fn_template.hpp:162
#20 operator()<boost::_mfi::mf1<void, qpid::ha::BrokerReplicator, boost::shared_ptr<qpid::broker::Exchange> >, boost::_bi::list1<boost::shared_ptr<qpid::broker::Exchange>&> > (
    __first=<value optimized out>, __last=..., __f=...) at /usr/include/boost/bind/bind.hpp:306
---Type <return> to continue, or q <return> to quit---
#21 operator()<boost::shared_ptr<qpid::broker::Exchange> > (__first=<value optimized out>, __last=..., __f=...) at /usr/include/boost/bind/bind_template.hpp:32
#22 std::for_each<__gnu_cxx::__normal_iterator<boost::shared_ptr<qpid::broker::Exchange>*, std::vector<boost::shared_ptr<qpid::broker::Exchange>, std::allocator<boost::shared_ptr<qpid::broker::Exchange> > > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::ha::BrokerReplicator, boost::shared_ptr<qpid::broker::Exchange> >, boost::_bi::list2<boost::_bi::value<qpid::ha::BrokerReplicator*>, boost::arg<1> > > > (__first=<value optimized out>, __last=..., __f=...) at /usr/include/c++/4.4.6/bits/stl_algo.h:4200
#23 0x00007f4dbe8d3ae6 in qpid::ha::BrokerReplicator::disconnected (this=0x2529780) at qpid/ha/BrokerReplicator.cpp:891
#24 0x000000357339841c in call<boost::shared_ptr<qpid::broker::ConnectionObserver>, qpid::broker::Connection> (this=<value optimized out>, c=...) at /usr/include/boost/bind/mem_fn_template.hpp:153
#25 operator()<boost::shared_ptr<qpid::broker::ConnectionObserver> > (this=<value optimized out>, c=...) at /usr/include/boost/bind/mem_fn_template.hpp:167
#26 operator()<boost::_mfi::mf1<void, qpid::broker::ConnectionObserver, qpid::broker::Connection&>, boost::_bi::list1<boost::shared_ptr<qpid::broker::ConnectionObserver>&> > (
    this=<value optimized out>, c=...) at /usr/include/boost/bind/bind.hpp:306
#27 operator()<boost::shared_ptr<qpid::broker::ConnectionObserver> > (this=<value optimized out>, c=...) at /usr/include/boost/bind/bind_template.hpp:32
#28 for_each<__gnu_cxx::__normal_iterator<boost::shared_ptr<qpid::broker::ConnectionObserver>*, std::vector<boost::shared_ptr<qpid::broker::ConnectionObserver>, std::allocator<boost::shared_ptr<qpid::broker::ConnectionObserver> > > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::ConnectionObserver, qpid::broker::Connection&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::broker::Connection> > > > (this=<value optimized out>, c=...) at /usr/include/c++/4.4.6/bits/stl_algo.h:4200
#29 each<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::ConnectionObserver, qpid::broker::Connection&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::broker::Connection> > > > (this=<value optimized out>, c=...) at qpid/broker/Observers.h:63
#30 qpid::broker::ConnectionObservers::closed (this=<value optimized out>, c=...) at qpid/broker/ConnectionObservers.h:49
#31 0x00000035733a3b6a in qpid::broker::Connection::~Connection (this=0x7f4d74006840, __in_chrg=<value optimized out>) at qpid/broker/Connection.cpp:150
#32 0x00000035733a4859 in qpid::broker::Connection::~Connection (this=0x7f4d74006840, __in_chrg=<value optimized out>) at qpid/broker/Connection.cpp:160
#33 0x00000035733720d1 in ~auto_ptr (this=0x7f4d74003780, __in_chrg=<value optimized out>, __vtt_parm=<value optimized out>) at /usr/include/c++/4.4.6/backward/auto_ptr.h:168
#34 qpid::amqp_0_10::Connection::~Connection (this=0x7f4d74003780, __in_chrg=<value optimized out>, __vtt_parm=<value optimized out>) at qpid/amqp_0_10/Connection.h:45
#35 0x00000035733724a9 in qpid::amqp_0_10::Connection::~Connection (this=0x7f4d74003780, __in_chrg=<value optimized out>, __vtt_parm=<value optimized out>) at qpid/amqp_0_10/Connection.h:45
#36 0x00000035734432c4 in ~auto_ptr (this=0x7f4d740027b0, __in_chrg=<value optimized out>) at /usr/include/c++/4.4.6/backward/auto_ptr.h:168
#37 ~SecureConnection (this=0x7f4d740027b0, __in_chrg=<value optimized out>) at qpid/broker/SecureConnection.h:42
#38 qpid::broker::SecureConnection::~SecureConnection (this=0x7f4d740027b0, __in_chrg=<value optimized out>) at qpid/broker/SecureConnection.h:42
#39 0x0000003572e1d9a3 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7f4d7c0010f0, __in_chrg=<value optimized out>) at qpid/sys/AsynchIOHandler.cpp:81
#40 0x0000003572e1dae9 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7f4d7c0010f0, __in_chrg=<value optimized out>) at qpid/sys/AsynchIOHandler.cpp:82
#41 0x0000003572e1f170 in qpid::sys::AsynchIOHandler::closedSocket (this=0x7f4d7c0010f0, s=...) at qpid/sys/AsynchIOHandler.cpp:226
#42 0x0000003572d400e6 in operator() (this=0x7f4d7c002820, h=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#43 qpid::sys::posix::AsynchIO::close (this=0x7f4d7c002820, h=<value optimized out>) at qpid/sys/posix/AsynchIO.cpp:630
#44 0x0000003572d41e44 in qpid::sys::posix::AsynchIO::writeable (this=0x7f4d7c002820, h=...) at qpid/sys/posix/AsynchIO.cpp:589
#45 0x0000003572e24403 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#46 0x0000003572e212ce in qpid::sys::DispatchHandle::processEvent (this=0x7f4d7c002828, type=qpid::sys::Poller::WRITABLE) at qpid/sys/DispatchHandle.cpp:287
#47 0x0000003572d4e0cd in process (this=0x250ac80) at qpid/sys/Poller.h:131
#48 qpid::sys::Poller::run (this=0x250ac80) at qpid/sys/epoll/EpollPoller.cpp:524
#49 0x0000003572d4616a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#50 0x00000035b5a07851 in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#51 0x00000035b52e811d in clone () from /lib64/libc.so.6

Comment 2 Alan Conway 2013-04-26 17:30:46 UTC
Fix on trunk:

------------------------------------------------------------------------
r1476305 | aconway | 2013-04-26 13:28:26 -0400 (Fri, 26 Apr 2013) | 9 lines

QPID-4780: Bug 889552 - HA broker deadlock after loss of primary broker.

Lock ordering deadlock found by inspection of code and stack trace:

- thread 1: Link::ioThreadProcessing(Link:lock)-> QueueReplicator::initializeBridge(QueueReplicator::lock)
- thread 2: QueueReplicator::destroy(QueueReplicator::lock)-> Bridge::destroy(Link::lock)

This patch breaks the lock by removing locking around Bridge::destroy in QueueReplicator::destroy.

------------------------------------------------------------------------

Comment 6 mick 2013-11-20 20:30:32 UTC
I am seeing 100% failure in backup brokers with the attached reproducer.

( Following notes are from the README file in the attachment. )

Reproducing 889552
================================================

1. Run the script  ./start_ha_cluster
   the cluster will be at ports 5801, 5802, 5803
   10 seconds after starting all brokers, this script
   promotes 5801 to be primary.


2. Run the C++ program  many_autodelete_queues/declare_queues
   To build it, there's a little script called 'm' in that directory.
   This program will attempt to declare 50,000 autodelete queues.


3. While that program is running, do "ps -aef | grep qpidd" in a loop
   every minute or so.  At some point you will notice that the two
   backup brokers have gone away.


-----------------------------------------------


Result:

I am seeing 100% failure rate with these packages:

  qpid-tools-0.22-6.el6.noarch
  qpid-cpp-client-devel-0.22-26.el6.x86_64
  qpid-qmf-0.22-23.el6.x86_64
  python-qpid-qmf-0.22-23.el6.x86_64
  python-qpid-0.22-7.el6.noarch
  qpid-cpp-client-ssl-0.22-26.el6.x86_64
  qpid-cpp-server-ha-0.22-26.el6.x86_64
  qpid-proton-c-0.5-9.el6.x86_64
  qpid-cpp-server-0.22-26.el6.x86_64
  qpid-qmf-devel-0.22-23.el6.x86_64
  qpid-cpp-client-0.22-26.el6.x86_64
  qpid-cpp-server-ssl-0.22-26.el6.x86_64
  
  
  
  
  
The output from the failed brokers looks like this:

    2013-11-20 10:26:32 [HA] error Backup: Connection error: Backup: Configuration replication failed: Link qpid.ha.link.2da1e206-2080-4ebe-8378-9abaa5e3989a channel pool is empty: while handling: [{_schema_id:{_class_name:queueDeclare, _hash:81c14a6d-4e35-0a72-44db-7f8a09d0646e, _package_name:org.apache.qpid.broker, _type:_event}, _severity:6, _timestamp:1384960846905653034, _values:{altEx:, args:{qpid.ha-uuid:True, qpid.replicate:all}, autoDel:True, disp:created, durable:False, excl:False, qName:q_32778, rhost:qpid.127.0.0.1:5801-127.0.0.1:40713, user:anonymous}}] (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/BrokerReplicator.cpp:492

Comment 7 mick 2013-11-20 20:32:08 UTC
Created attachment 826842 [details]
reproducer

Comment 8 Alan Conway 2014-01-13 19:24:04 UTC
This looks like a different issue (one which needs to be addressed) Creating so many queues is using up all the channel identifiers on the replication link, causing the failure seen. I will start working on this issue, can post the results here or on another BZ as appropriate.

Comment 9 Alan Conway 2014-01-24 21:59:29 UTC
This new issue is fixed on trunk https://svn.apache.org/r1561206

Note the limit on the max number of queues has not been eliminated. It has been increased to 64k and the primary broker raises an exception if a client attempts to declare too many queues. See the comment on https://issues.apache.org/jira/browse/QPID-5513

Comment 10 Alan Conway 2014-01-27 17:34:01 UTC
This trunk patch fixes a link error on windows:

Commit 1561754 from Alan Conway in branch 'qpid/trunk'
[ https://svn.apache.org/r1561754 ]

QPID-5513: HA backup fails if number of replicated queues exceeds number of channels.

Fix link error (incorrect libraries listed) in ha_test_max_queues on windows.

Comment 14 errata-xmlrpc 2014-09-24 15:05:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1296.html


Note You need to log in before you can comment on or make changes to this bug.