I have a repeatable crash in the primary HA broker, produced by a soak test on TXNs. This is with trunk code as of an hour ago:

URL: https://svn.apache.org/repos/asf/qpid/trunk/qpid/cpp
Repository Root: https://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 1626916
Node Kind: directory
Schedule: normal
Last Changed Author: aconway
Last Changed Rev: 1626887

I did a standard build, first of proton and then of qpidd -- except that I had them install themselves in /usr instead of /usr/local.

Here are the scripts I use.

script 1: start the HA cluster
{
#! /bin/bash

export PYTHONPATH=/home/mick/trunk/qpid/python

QPIDD=/usr/sbin/qpidd
QPID_HA=/home/mick/trunk/qpid/tools/src/py/qpid-ha

# This is where I put the log files.
rm -rf /tmp/mick
mkdir /tmp/mick

for N in 1 2 3
do
  $QPIDD \
    --auth=no \
    --no-module-dir \
    --load-module /usr/lib64/qpid/daemon/ha.so \
    --log-enable debug+:ha:: \
    --ha-cluster=yes \
    --ha-replicate=all \
    --ha-brokers-url=localhost:5801,localhost:5802,localhost:5803 \
    --ha-public-url=localhost:5801,localhost:5802,localhost:5803 \
    -p 580$N \
    --data-dir /tmp/mick/data_$N \
    --log-to-file /tmp/mick/qpidd_$N.log \
    --mgmt-enable=yes \
    -d
  echo "============================================"
  echo "started broker $N from $QPIDD"
  echo "============================================"
  sleep 1
done

# Now promote one broker to primary.
echo "Promoting broker 5801..."
${QPID_HA} promote --cluster-manager -b localhost:5801
echo "done."
}

script 2: create the tx queues, and load the first one with 1000 messages
{
#! /bin/bash

TXTEST2=/usr/libexec/qpid/tests/qpid-txtest2

echo "Loading data to queues..."
${TXTEST2} --init=yes --transfer=no --check=no \
  --port 5801 \
  --total-messages 1000 --connection-options '{reconnect:true}' \
  --messages-per-tx 10 --tx-count 100 \
  --queue-base-name=tx --fetch-timeout=1
}

script 3: now beat the heck out of the TXN code
{
#! /bin/bash

TXTEST2=/usr/libexec/qpid/tests/qpid-txtest2

echo "starting transfers..."
${TXTEST2} --init=no --transfer=yes --check=no \
  --port 5801 \
  --total-messages 5000000 --connection-options '{reconnect:true}' \
  --messages-per-tx 10 --tx-count 500000 \
  --queue-base-name=tx --fetch-timeout=1
}

I do *not* do any failovers -- I just let that TXN-exercising script run until the primary broker dies. It took quite a while: in my most recent test, I got through something like 300,000 transactions (10 messages each) before the broker became brokest. I then tried the same test on a standalone broker and it got all the way through.

Here is the traceback:

#0  0x0000003186a328a5 in raise () from /lib64/libc.so.6
#1  0x0000003186a34085 in abort () from /lib64/libc.so.6
#2  0x0000003186a2ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003186a2bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f6bb72b4f16 in operator-> (this=0x7f6b48378060, sync=<value optimized out>)
    at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:166
#5  qpid::broker::SessionState::IncompleteIngressMsgXfer::completed (this=0x7f6b48378060, sync=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/SessionState.cpp:409
#6  0x00007f6bb726d670 in invokeCallback (this=<value optimized out>, msg=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/AsyncCompletion.h:117
#7  finishCompleter (this=<value optimized out>, msg=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/AsyncCompletion.h:158
#8  enqueueComplete (this=<value optimized out>, msg=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/PersistableMessage.h:76
#9  qpid::broker::NullMessageStore::enqueue (this=<value optimized out>, msg=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/NullMessageStore.cpp:97
#10 0x00007f6bb71f4578 in qpid::broker::Queue::enqueue (this=0x7f6b4801ef90, ctxt=0x7f6b6821bdf0, msg=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/Queue.cpp:910
#11 0x00007f6bb71f46db in qpid::broker::Queue::TxPublish::prepare (this=0x7f6b48435c70, ctxt=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/Queue.cpp:159
#12 0x00007f6bb72c8b3d in qpid::broker::TxBuffer::prepare (this=0x7f6b68549120, ctxt=0x7f6b6821bdf0)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/TxBuffer.cpp:42
#13 0x00007f6bb72c9dbe in qpid::broker::TxBuffer::startCommit (this=0x7f6b68549120, store=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/TxBuffer.cpp:73
#14 0x00007f6bb7298c74 in qpid::broker::SemanticState::commit (this=0x7f6b6c567fb8, store=0x2460440)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/SemanticState.cpp:198
#15 0x00007f6bb6c5886e in invoke<qpid::framing::AMQP_ServerOperations::TxHandler> (this=0x7f6b8bffd4a0, body=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/build/src/qpid/framing/TxCommitBody.h:53
#16 qpid::framing::AMQP_ServerOperations::TxHandler::Invoker::visit (this=0x7f6b8bffd4a0, body=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/build/src/qpid/framing/ServerInvoker.cpp:582
#17 0x00007f6bb6c5cc41 in qpid::framing::AMQP_ServerOperations::Invoker::visit (this=0x7f6b8bffd670, body=...)
    at /home/mick/trunk/qpid/cpp/build/src/qpid/framing/ServerInvoker.cpp:278
#18 0x00007f6bb72b504c in invoke<qpid::broker::SessionAdapter> (this=<value optimized out>, method=0x7f6b68130790)
    at /home/mick/trunk/qpid/cpp/src/qpid/framing/Invoker.h:67
#19 qpid::broker::SessionState::handleCommand (this=<value optimized out>, method=0x7f6b68130790)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/SessionState.cpp:198
#20 0x00007f6bb72b6235 in qpid::broker::SessionState::handleIn (this=0x7f6b6c567df0, frame=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/SessionState.cpp:295
#21 0x00007f6bb6cd5291 in qpid::amqp_0_10::SessionHandler::handleIn (this=0x7f6b6c4e2120, f=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/amqp_0_10/SessionHandler.cpp:93
#22 0x00007f6bb722692b in operator() (this=0x7f6b500ab840, frame=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/framing/Handler.h:39
#23 qpid::broker::ConnectionHandler::handle (this=0x7f6b500ab840, frame=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/ConnectionHandler.cpp:94
#24 0x00007f6bb7221ba8 in qpid::broker::amqp_0_10::Connection::received (this=0x7f6b500ab660, frame=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/amqp_0_10/Connection.cpp:198
#25 0x00007f6bb71aea4d in qpid::amqp_0_10::Connection::decode (this=0x7f6b5005d770, buffer=<value optimized out>, size=<value optimized out>)
failure in 1-cluster! This took about 3 minutes of wall-clock time....

HA 1-cluster start script:
-------------------------------------------------------------
( IS THIS CORRECT ??? )

#! /bin/bash

export PYTHONPATH=/home/mick/trunk/qpid/python

QPIDD=/usr/sbin/qpidd
QPID_HA=/home/mick/trunk/qpid/tools/src/py/qpid-ha

# This is where I put the log files.
rm -rf /tmp/mick
mkdir /tmp/mick

for N in 1
do
  $QPIDD \
    --auth=no \
    --no-module-dir \
    --load-module /usr/lib64/qpid/daemon/ha.so \
    --log-enable debug+:ha:: \
    --ha-cluster=yes \
    --ha-replicate=all \
    --ha-brokers-url=localhost:5801,localhost:5802,localhost:5803 \
    --ha-public-url=localhost:5801,localhost:5802,localhost:5803 \
    -p 580$N \
    --data-dir /tmp/mick/data_$N \
    --log-to-file /tmp/mick/qpidd_$N.log \
    --mgmt-enable=yes \
    -d
  echo "============================================"
  echo "started broker $N from $QPIDD"
  echo "============================================"
  sleep 1
done

# Now promote one broker to primary.
echo "Promoting broker 5801..."
${QPID_HA} promote --cluster-manager -b localhost:5801
echo "done."
client output:
-------------------------------------------------------------
Transaction tx-1>tx-2:63201 of 500000 committed successfully
Transaction tx-1>tx-2:63202 of 500000 committed successfully
Transaction tx-1>tx-2:63203 of 500000 committed successfully
Transaction tx-1>tx-2:63204 of 500000 committed successfully
Transaction tx-1>tx-2:63205 of 500000 committed successfully
Transaction tx-2>tx-1:63134 of 500000 committed successfully
2014-09-23 10:09:18 [Client] warning Connection [[::1]:46065-localhost:5801] closed
2014-09-23 10:09:18 [Client] warning Connection [[::1]:46064-localhost:5801] closed
2014-09-23 10:09:18 [Client] warning Connection [[::1]:46066-localhost:5801] closed
2014-09-23 10:09:18 [Test] error Transfer tx-2 to tx-1 interrupted: Transaction outcome unknown: transport failure
2014-09-23 10:09:18 [Test] error Transfer tx-1 to tx-2 interrupted: Transaction outcome unknown: transport failure
Client shutdown: Transaction outcome unknown: transport failure
Client shutdown: Transaction outcome unknown: transport failure

last output from broker:
----------------------------------------------------------------
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX c3f3a1ac: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:c3f3a1ac-3bbf-4022-979e-e938ed549156
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX a6e7efc5: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:a6e7efc5-eb58-4613-a294-2e78109af92f replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX a6e7efc5: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX a6e7efc5: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:a6e7efc5-eb58-4613-a294-2e78109af92f
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 09719c41: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:09719c41-c36a-43d2-87a3-001fc101da00 replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 09719c41: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 09719c41: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:09719c41-c36a-43d2-87a3-001fc101da00
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 2397b7be: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:2397b7be-2124-4f17-a0e2-3506425a33ba replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 2397b7be: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 2397b7be: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:2397b7be-2124-4f17-a0e2-3506425a33ba
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 6e8ef126: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:6e8ef126-4fd8-4aca-b724-4d6f84dcb178 replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 6e8ef126: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 6e8ef126: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:6e8ef126-4fd8-4aca-b724-4d6f84dcb178
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 77fcacb3: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:77fcacb3-dc08-48c7-a3d6-3ab8b8f97d7c replication: all

Backtrace:
---------------------------------------------------------
#0  0x00007f234aa13396 in std::tr1::__detail::_Map_base<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> >, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> > >, true, std::tr1::_Hashtable<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> > >, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::tr1::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true> >::operator[] (this=0x7f2308021d98, __k="qpid.ha-tx:77fcacb3-dc08-48c7-a3d6-3ab8b8f97d7c")
    at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/hashtable_policy.h:572
#1  0x00007f234aa09774 in qpid::ha::Primary::makeTxObserver (this=0x7f2308021c80, txBuffer=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/ha/Primary.cpp:485
#2  0x00007f234aa0985c in qpid::ha::Primary::startTx (this=<value optimized out>, txBuffer=...)
    at /home/mick/trunk/qpid/cpp/src/qpid/ha/Primary.cpp:490
#3  0x00007f234b5d503a in call<boost::shared_ptr<qpid::broker::BrokerObserver> const, boost::intrusive_ptr<qpid::broker::TxBuffer> const> (this=0x7f233c049da8)
    at /usr/include/boost/bind/mem_fn_template.hpp:153
#4  operator()<boost::shared_ptr<qpid::broker::BrokerObserver> > (this=0x7f233c049da8)
    at /usr/include/boost/bind/mem_fn_template.hpp:174
#5  operator()<boost::_mfi::mf1<void, qpid::broker::BrokerObserver, const boost::intrusive_ptr<qpid::broker::TxBuffer>&>, boost::_bi::list1<const boost::shared_ptr<qpid::broker::BrokerObserver>&> > (this=0x7f233c049da8)
    at /usr/include/boost/bind/bind.hpp:306
#6  operator()<boost::shared_ptr<qpid::broker::BrokerObserver> > (this=0x7f233c049da8)
    at /usr/include/boost/bind/bind_template.hpp:47
#7  for_each<std::_Rb_tree_const_iterator<boost::shared_ptr<qpid::broker::BrokerObserver> >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::BrokerObserver, boost::intrusive_ptr<qpid::broker::TxBuffer> const&>, boost::_bi::list2<boost::arg<1>, boost::_bi::value<boost::intrusive_ptr<qpid::broker::TxBuffer> > > > > (this=0x7f233c049da8)
    at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:4200
#8  each<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::BrokerObserver, boost::intrusive_ptr<qpid::broker::TxBuffer> const&>, boost::_bi::list2<boost::arg<1>, boost::_bi::value<boost::intrusive_ptr<qpid::broker::TxBuffer> > > > > (this=0x7f233c049da8)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/Observers.h:59
#9  each<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::BrokerObserver, boost::intrusive_ptr<qpid::broker::TxBuffer> const&>, boost::_bi::list2<boost::arg<1>, boost::_bi::value<boost::intrusive_ptr<qpid::broker::TxBuffer> > > > > (this=0x7f233c049da8)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/BrokerObservers.h:68
#10 startTx (this=0x7f233c049da8) at /home/mick/trunk/qpid/cpp/src/qpid/broker/BrokerObservers.h:61
#11 qpid::broker::SemanticState::startTx (this=0x7f233c049da8)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/SemanticState.cpp:173
#12 0x00007f234b5d7175 in ~StartTxOnExit (this=0x7f233c049da8, store=<value optimized out>)
    at /home/mick/trunk/qpid/cpp/src/qpid/broker/SemanticState.cpp:181
If I start the 1-cluster properly (see script below) I have NO CRASH after 1 million TXNs, i.e. 10 million messages.

----- script ----------
#! /bin/bash

export PYTHONPATH=/home/mick/trunk/qpid/python

QPIDD=/usr/sbin/qpidd
QPID_HA=/home/mick/trunk/qpid/tools/src/py/qpid-ha

# This is where I put the log files.
rm -rf /tmp/mick
mkdir /tmp/mick

# was:
#    --ha-brokers-url=localhost:5801,localhost:5802,localhost:5803 \
#    --ha-public-url=localhost:5801,localhost:5802,localhost:5803 \

for N in 1
do
  $QPIDD \
    --auth=no \
    --no-module-dir \
    --load-module /usr/lib64/qpid/daemon/ha.so \
    --log-enable debug+:ha:: \
    --ha-cluster=yes \
    --ha-replicate=all \
    --ha-brokers-url=localhost:5801 \
    --ha-public-url=localhost:5801 \
    -p 580$N \
    --data-dir /tmp/mick/data_$N \
    --log-to-file /tmp/mick/qpidd_$N.log \
    --mgmt-enable=yes \
    -d
  echo "============================================"
  echo "started broker $N from $QPIDD"
  echo "============================================"
  sleep 1
done

# Now promote one broker to primary.
echo "Promoting broker 5801..."
${QPID_HA} promote --cluster-manager -b localhost:5801
echo "done."
This issue is easily reproducible on RHEL6.5 on the Vienna packages (qpid-cpp-0.22-49) using mick's scripts.
------------------------------------------------------------------------
r1646618 | aconway | 2014-12-18 22:18:57 -0500 (Thu, 18 Dec 2014) | 26 lines

QPID-6278: HA broker abort in TXN soak test

The crash appears to be a race condition in async completion, exposed by the HA TX code as follows:

1. Message received and placed on the tx-replication queue; completion is delayed till the backups ack. The completion count goes up for each backup, then down as each backup acks.

2. Prepare received; the message is placed on the primary's local persistent queue. The completion count goes up by one, then down by one on local store completion (null store in this case).

The race is something like this:
- the last backup ack arrives (on a backup IO thread) and drops the completion count to 0.
- the prepare arrives (on the client thread): the null store bumps the count to 1, which immediately drops back to 0.
- both threads try to invoke the completion callback; one deletes it while the other is still invoking it.

The old completion logic assumed that only one thread can see the atomic counter go to 0. It does not handle the count going to 0 in one thread while concurrently being increased and decreased back to 0 in another. This case is introduced by HA transactions because the same message is put onto a tx-replication queue and then put again onto another persistent local queue, so there are two cycles of completion.

The new logic fixes this: only one call to the completion callback is possible in all cases.

Also fixed a missing lock in ha/Primary.cpp.
------------------------------------------------------------------------
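The double-zero-crossing race described in the commit message can be sketched in isolation. This is a minimal illustration only, not the actual Qpid AsyncCompletion code; the class and member names here are invented for the example. The point is that checking `count == 0` alone is not enough when two begin/end cycles overlap, so the callback must be guarded by a separate once-only flag:

```cpp
#include <atomic>
#include <cassert>

// Sketch of a completion counter whose count can reach zero more than
// once (one cycle per queue the message is put on). An atomic "invoked"
// flag guarantees the completion callback fires exactly once, no matter
// how many threads observe the count hitting zero.
class Completion {
    std::atomic<int> count{0};
    std::atomic<bool> invoked{false};
    int timesFired = 0;  // written only by the single winning thread
public:
    void begin() { count.fetch_add(1, std::memory_order_relaxed); }
    void end() {
        // fetch_sub returns the previous value; 1 means we dropped it to 0.
        if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Another thread may also have seen the count reach zero in a
            // separate cycle. exchange() lets exactly one thread through.
            if (!invoked.exchange(true, std::memory_order_acq_rel))
                ++timesFired;  // stand-in for the real completion callback
        }
    }
    int fired() const { return timesFired; }
};
```

Two sequential begin/end cycles (as in the replication-queue cycle followed by the local-store cycle) both drive the count to zero, but only the first invokes the callback; without the flag, a naive "count hit zero" check would fire twice, and concurrently.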
The issue has been reproduced on the stable build (qpid-cpp-0.22-49) on RHEL 6.6 x86_64 using mick's uncorrected single-host scenario (3+1 sample). A/B testing is ongoing.
The issue has been reliably fixed, tested on RHEL 6.6 i686/x86_64 and RHEL 7.1 x86_64 on packages:

python-qpid-qmf-0.30-5.el6.x86_64
python-qpid-0.30-5.el6.noarch
qpid-cpp-client-devel-docs-0.30-7.el6.noarch
qpid-cpp-client-devel-0.30-7.el6.x86_64
qpid-cpp-client-rdma-0.30-7.el6.x86_64
qpid-cpp-client-0.30-7.el6.x86_64
qpid-cpp-debuginfo-0.30-7.el6.x86_64
qpid-cpp-server-devel-0.30-7.el6.x86_64
qpid-cpp-server-ha-0.30-7.el6.x86_64
qpid-cpp-server-linearstore-0.30-7.el6.x86_64
qpid-cpp-server-rdma-0.30-7.el6.x86_64
qpid-cpp-server-xml-0.30-7.el6.x86_64
qpid-cpp-server-0.30-7.el6.x86_64
qpid-java-client-0.30-4.el6.noarch
qpid-java-common-0.30-4.el6.noarch
qpid-java-example-0.30-4.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-proton-c-devel-0.7-4.el6.x86_64
qpid-proton-c-0.7-4.el6.x86_64
qpid-proton-debuginfo-0.7-4.el6.x86_64
qpid-qmf-debuginfo-0.30-5.el6.x86_64
qpid-qmf-devel-0.30-5.el6.x86_64
qpid-qmf-0.30-5.el6.x86_64
qpid-tools-0.30-4.el6.noarch
ruby-qpid-0.7.946106-2.el6.x86_64

Extensive testing on three candidate machines and one known-buggy machine proved that the issue is reliably fixed.

-> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0805.html