Bug 1145386
Summary: | HA broker abort in TXN soak test | |
---|---|---|---
Product: | Red Hat Enterprise MRG | Reporter: | mick <mgoulish>
Component: | qpid-cpp | Assignee: | Alan Conway <aconway>
Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice>
Severity: | high | Docs Contact: |
Priority: | high | CC: | aconway, esammons, freznice, iboverma, jross
Version: | Development | Target Milestone: | 3.1
Target Release: | --- | Hardware: | Unspecified
OS: | Unspecified | Fixed In Version: | qpid-cpp-0.30-5
Doc Type: | Bug Fix | Story Points: | ---
Last Closed: | 2015-04-14 13:48:23 UTC | Type: | Bug
Regression: | --- | |
Description
mick
2014-09-23 01:17:46 UTC
failure in 1-cluster! this took about 3 minutes of wall-clock time....

HA 1-cluster start script ( IS THIS CORRECT ??? ):
-------------------------------------------------------------
#! /bin/bash
export PYTHONPATH=/home/mick/trunk/qpid/python

QPIDD=/usr/sbin/qpidd
QPID_HA=/home/mick/trunk/qpid/tools/src/py/qpid-ha

# This is where I put the log files.
rm -rf /tmp/mick
mkdir /tmp/mick

for N in 1
do
  $QPIDD \
    --auth=no \
    --no-module-dir \
    --load-module /usr/lib64/qpid/daemon/ha.so \
    --log-enable debug+:ha:: \
    --ha-cluster=yes \
    --ha-replicate=all \
    --ha-brokers-url=localhost:5801,localhost:5802,localhost:5803 \
    --ha-public-url=localhost:5801,localhost:5802,localhost:5803 \
    -p 580$N \
    --data-dir /tmp/mick/data_$N \
    --log-to-file /tmp/mick/qpidd_$N.log \
    --mgmt-enable=yes \
    -d
  echo "============================================"
  echo "started broker $N from $QPIDD"
  echo "============================================"
  sleep 1
done

# Now promote one broker to primary.
echo "Promoting broker 5801..."
${QPID_HA} promote --cluster-manager -b localhost:5801
echo "done."
client output:
-------------------------------------------------------------
Transaction tx-1>tx-2:63201 of 500000 committed successfully
Transaction tx-1>tx-2:63202 of 500000 committed successfully
Transaction tx-1>tx-2:63203 of 500000 committed successfully
Transaction tx-1>tx-2:63204 of 500000 committed successfully
Transaction tx-1>tx-2:63205 of 500000 committed successfully
Transaction tx-2>tx-1:63134 of 500000 committed successfully
2014-09-23 10:09:18 [Client] warning Connection [[::1]:46065-localhost:5801] closed
2014-09-23 10:09:18 [Client] warning Connection [[::1]:46064-localhost:5801] closed
2014-09-23 10:09:18 [Client] warning Connection [[::1]:46066-localhost:5801] closed
2014-09-23 10:09:18 [Test] error Transfer tx-2 to tx-1 interrupted: Transaction outcome unknown: transport failure
2014-09-23 10:09:18 [Test] error Transfer tx-1 to tx-2 interrupted: Transaction outcome unknown: transport failure
Client shutdown: Transaction outcome unknown: transport failure
Client shutdown: Transaction outcome unknown: transport failure

last output from broker:
----------------------------------------------------------------
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX c3f3a1ac: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:c3f3a1ac-3bbf-4022-979e-e938ed549156
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX a6e7efc5: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:a6e7efc5-eb58-4613-a294-2e78109af92f replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX a6e7efc5: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX a6e7efc5: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:a6e7efc5-eb58-4613-a294-2e78109af92f
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 09719c41: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:09719c41-c36a-43d2-87a3-001fc101da00 replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 09719c41: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 09719c41: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:09719c41-c36a-43d2-87a3-001fc101da00
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 2397b7be: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:2397b7be-2124-4f17-a0e2-3506425a33ba replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 2397b7be: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 2397b7be: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:2397b7be-2124-4f17-a0e2-3506425a33ba
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 6e8ef126: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:6e8ef126-4fd8-4aca-b724-4d6f84dcb178 replication: all
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 6e8ef126: Prepare { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 6e8ef126: Commit
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Destroyed queue qpid.ha-tx:6e8ef126-4fd8-4aca-b724-4d6f84dcb178
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Primary TX 77fcacb3: Started, backups { }
2014-09-23 10:09:13 [HA] debug aec5a8ca(active) Created queue qpid.ha-tx:77fcacb3-dc08-48c7-a3d6-3ab8b8f97d7c replication: all

Backtrace:
---------------------------------------------------------
#0 0x00007f234aa13396 in std::tr1::__detail::_Map_base<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> >, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const,
boost::weak_ptr<qpid::ha::PrimaryTxObserver> > >, true, std::tr1::_Hashtable<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> > >, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::weak_ptr<qpid::ha::PrimaryTxObserver> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::tr1::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true> >::operator[] (this=0x7f2308021d98, __k= "qpid.ha-tx:77fcacb3-dc08-48c7-a3d6-3ab8b8f97d7c") at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/hashtable_policy.h:572 #1 0x00007f234aa09774 in qpid::ha::Primary::makeTxObserver (this=0x7f2308021c80, txBuffer=<value optimized out>) at /home/mick/trunk/qpid/cpp/src/qpid/ha/Primary.cpp:485 #2 0x00007f234aa0985c in qpid::ha::Primary::startTx (this=<value optimized out>, txBuffer=...) 
at /home/mick/trunk/qpid/cpp/src/qpid/ha/Primary.cpp:490 #3 0x00007f234b5d503a in call<boost::shared_ptr<qpid::broker::BrokerObserver> const, boost::intrusive_ptr<qpid::broker::TxBuffer> const> (this=0x7f233c049da8) at /usr/include/boost/bind/mem_fn_template.hpp:153 #4 operator()<boost::shared_ptr<qpid::broker::BrokerObserver> > (this=0x7f233c049da8) at /usr/include/boost/bind/mem_fn_template.hpp:174 #5 operator()<boost::_mfi::mf1<void, qpid::broker::BrokerObserver, const boost::intrusive_ptr<qpid::broker::TxBuffer>&>, boost::_bi::list1<const boost::shared_ptr<qpid::broker::BrokerObserver>&> > ( this=0x7f233c049da8) at /usr/include/boost/bind/bind.hpp:306 #6 operator()<boost::shared_ptr<qpid::broker::BrokerObserver> > (this=0x7f233c049da8) at /usr/include/boost/bind/bind_template.hpp:47 #7 for_each<std::_Rb_tree_const_iterator<boost::shared_ptr<qpid::broker::BrokerObserver> >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::BrokerObserver, boost::intrusive_ptr<qpid::broker::TxBuffer> const&>, boost::_bi::list2<boost::arg<1>, boost::_bi::value<boost::intrusive_ptr<qpid::broker::TxBuffer> > > > > (this=0x7f233c049da8) at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:4200 #8 each<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::BrokerObserver, boost::intrusive_ptr<qpid::broker::TxBuffer> const&>, boost::_bi::list2<boost::arg<1>, boost::_bi::value<boost::intrusive_ptr<qpid::broker::TxBuffer> > > > > (this=0x7f233c049da8) at /home/mick/trunk/qpid/cpp/src/qpid/broker/Observers.h:59 #9 each<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::broker::BrokerObserver, boost::intrusive_ptr<qpid::broker::TxBuffer> const&>, boost::_bi::list2<boost::arg<1>, boost::_bi::value<boost::intrusive_ptr<qpid::broker::TxBuffer> > > > > (this=0x7f233c049da8) at /home/mick/trunk/qpid/cpp/src/qpid/broker/BrokerObservers.h:68 #10 startTx (this=0x7f233c049da8) at /home/mick/trunk/qpid/cpp/src/qpid/broker/BrokerObservers.h:61 
#11 qpid::broker::SemanticState::startTx (this=0x7f233c049da8) at /home/mick/trunk/qpid/cpp/src/qpid/broker/SemanticState.cpp:173
#12 0x00007f234b5d7175 in ~StartTxOnExit (this=0x7f233c049da8, store=<value optimized out>) at /home/mick/trunk/qpid/cpp/src/qpid/broker/SemanticState.cpp:181

If I start the 1-cluster properly (see script below) I have NO CRASH after 1 million TXNs, i.e. 10 million messages.

----- script ----------
#! /bin/bash
export PYTHONPATH=/home/mick/trunk/qpid/python

QPIDD=/usr/sbin/qpidd
QPID_HA=/home/mick/trunk/qpid/tools/src/py/qpid-ha

# This is where I put the log files.
rm -rf /tmp/mick
mkdir /tmp/mick

# was:
#   --ha-brokers-url=localhost:5801,localhost:5802,localhost:5803 \
#   --ha-public-url=localhost:5801,localhost:5802,localhost:5803 \

for N in 1
do
  $QPIDD \
    --auth=no \
    --no-module-dir \
    --load-module /usr/lib64/qpid/daemon/ha.so \
    --log-enable debug+:ha:: \
    --ha-cluster=yes \
    --ha-replicate=all \
    --ha-brokers-url=localhost:5801 \
    --ha-public-url=localhost:5801 \
    -p 580$N \
    --data-dir /tmp/mick/data_$N \
    --log-to-file /tmp/mick/qpidd_$N.log \
    --mgmt-enable=yes \
    -d
  echo "============================================"
  echo "started broker $N from $QPIDD"
  echo "============================================"
  sleep 1
done

# Now promote one broker to primary.
echo "Promoting broker 5801..."
${QPID_HA} promote --cluster-manager -b localhost:5801
echo "done."

This issue is easily reproducible on RHEL6.5 on the Vienna packages (qpid-cpp-0.22-49) using mick's scripts.

------------------------------------------------------------------------
r1646618 | aconway | 2014-12-18 22:18:57 -0500 (Thu, 18 Dec 2014) | 26 lines

QPID-6278: HA broker abort in TXN soak test

The crash appears to be a race condition in async completion, exposed by the HA TX code as follows:

1. Message received and placed on the tx-replication queue; completion delayed till the backups ack. The completion count goes up for each backup, then down as each backup acks.
2.
Prepare received; message placed on the primary's local persistent queue. The completion count goes up by one, then down by one, for local store completion (a null store in this case).

The race is something like this:
- The last backup ack arrives (on a backup IO thread) and drops the completion count to 0.
- The prepare arrives (on a client thread); the null store bumps the count to 1 and immediately drops it back to 0.
- Both threads try to invoke the completion callback; one deletes it while the other is still invoking it.

The old completion logic assumed that only one thread can see the atomic counter go to 0. It did not handle the count going to 0 in one thread while concurrently being increased and decreased back to 0 in another. This case is introduced by HA transactions because the same message is put onto a tx-replication queue and then put again onto another persistent local queue, so there are two cycles of completion.

The new logic fixes this: only one call to the completion callback is possible in all cases. Also fixed a missing lock in ha/Primary.cpp.
------------------------------------------------------------------------

Issue has been reproduced on the stable (qpid-cpp-0.22-49) build on RHEL6.6 x86_64 using mick's not-corrected single-host scenario (3+1 sample). A/B testing is ongoing.
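The invariant described in the commit message can be sketched with a small stand-alone model. This is an illustrative sketch, not the actual qpid-cpp completion code: the class name, members, and scenario below are assumptions. It reproduces the two completion cycles (backup acks, then the prepare's null-store cycle) and shows how a separate once-only guard keeps the callback from being invoked twice even when the counter reaches zero in two different threads:

```cpp
#include <atomic>
#include <thread>

// Hypothetical model of a delayed-completion counter (names are
// illustrative, not the real qpid-cpp classes).
class Completion {
    std::atomic<int> pending{0};      // outstanding async operations
    std::atomic<bool> invoked{false}; // once-only guard (the fix)
    int completions = 0;              // observable effect for the demo
public:
    void begin() { pending.fetch_add(1); }
    void end() {
        if (pending.fetch_sub(1) == 1) {
            // The count hit zero in this thread. Without the exchange()
            // guard, another thread whose begin()/end() cycle also hit
            // zero could invoke (and free) the callback concurrently --
            // the double invocation behind the abort.
            if (!invoked.exchange(true)) onComplete();
        }
    }
    void onComplete() { ++completions; }
    int count() const { return completions; }
};

// Scenario from the commit message: two backup acks race with the
// prepare's null-store begin/end cycle, so the counter can reach zero
// twice; the guard ensures exactly one callback invocation.
inline int runScenario() {
    Completion c;
    c.begin(); c.begin();                            // waiting on two backups
    std::thread backup1([&] { c.end(); });           // backup acks, IO threads
    std::thread backup2([&] { c.end(); });
    std::thread client([&] { c.begin(); c.end(); }); // prepare: null store
    backup1.join(); backup2.join(); client.join();
    return c.count();
}
```

Whatever interleaving occurs, `runScenario()` returns 1: either all three `end()` calls drain the counter once, or the backups drain it first and the client's cycle re-raises and re-drains it, in which case the `exchange()` guard suppresses the second invocation.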
The issue has been reliably fixed; tested on RHEL 6.6 i686/x86_64 and RHEL 7.1 x86_64 with the packages:

python-qpid-qmf-0.30-5.el6.x86_64
python-qpid-0.30-5.el6.noarch
qpid-cpp-client-devel-docs-0.30-7.el6.noarch
qpid-cpp-client-devel-0.30-7.el6.x86_64
qpid-cpp-client-rdma-0.30-7.el6.x86_64
qpid-cpp-client-0.30-7.el6.x86_64
qpid-cpp-debuginfo-0.30-7.el6.x86_64
qpid-cpp-server-devel-0.30-7.el6.x86_64
qpid-cpp-server-ha-0.30-7.el6.x86_64
qpid-cpp-server-linearstore-0.30-7.el6.x86_64
qpid-cpp-server-rdma-0.30-7.el6.x86_64
qpid-cpp-server-xml-0.30-7.el6.x86_64
qpid-cpp-server-0.30-7.el6.x86_64
qpid-java-client-0.30-4.el6.noarch
qpid-java-common-0.30-4.el6.noarch
qpid-java-example-0.30-4.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-proton-c-devel-0.7-4.el6.x86_64
qpid-proton-c-0.7-4.el6.x86_64
qpid-proton-debuginfo-0.7-4.el6.x86_64
qpid-qmf-debuginfo-0.30-5.el6.x86_64
qpid-qmf-devel-0.30-5.el6.x86_64
qpid-qmf-0.30-5.el6.x86_64
qpid-tools-0.30-4.el6.noarch
ruby-qpid-0.7.946106-2.el6.x86_64

Extensive testing on three candidate machines and one buggy machine proved that the issue is reliably fixed. -> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html