Created attachment 925103 [details]
ha script to kill & wait for cluster

Description of problem:
Running qpid-txtest2 in a loop against a 3 node cluster while killing primary brokers to force failover eventually results in the test failing due to extra or missing messages.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
In separate windows:

$ while ha stat; qpid-txtest2 -b 20.0.20.200 --total-messages 100 --connection-options '{reconnect:true}' --messages-per-tx 10 --tx-count 10; do true; done

$ while ha wait -a 2; do sleep .5; ha kill; done
# ha script attached.

Actual result:
Eventually (up to 10 minutes) the test fails with a message like:

The following ids were missing: 'msg-100' 'msg-51' 'msg-52' 'msg-53'

OR

The following extra ids were encountered: 'msg-1' 'msg-1' 'msg-10'

Expected result:
Test runs forever, no failures.
There are two problems here:

1. When the primary is killed the backups may not be at the same point in processing a transaction. E.g. the primary may have committed but the backups not yet, or one backup may have committed but not the other. That creates inconsistencies that result in extra or missing messages.

2. There may not be agreement between the client and the brokers about the transaction outcome. If the client sends commit and the primary is killed before responding, the client will assume the transaction aborted. However, it could be that the primary was killed after it (and the backups) had committed the transaction but before the primary could send the response frame. In that case the client would believe the transaction aborted and replay it, but the brokers would believe it committed, so it would be executed twice.

The proper solution here is a recovery phase like DTX; in fact this really IS DTX.

For 1. we can implement such a phase between the backups since we have full control there (we already have a DTX-like prepare phase), and ensure the backups are always consistent and always commit the transaction if the client *could have* received notice that it committed.

For 2. the TX protocol has no way to let the client check the outcome of in-doubt transactions. The best I can think of is to raise a "TransactionUnknown" exception so at least the user knows there is trouble - however there is not much they can do about it. I'm consulting with others for better solutions.
Regarding issue 2 from comment 1, from discussion with gsim:

----
gsim: Arguably TransactionAborted is being over-specific, but where we can be sure that it has aborted there's no harm in that. However, as you say, once we have actually issued the commit then the name of that error is misleading or even downright inaccurate. So I think either we should add a distinct exception for the 'status unknown' case, or we should change the name of the existing exception to be less definite, e.g. TransactionFailure.
...
Local transactions really just provide atomicity guarantees. They aren't intended to provide exactly-once semantics.
...
Yes, there is ambiguity in the case of a failure (and we could do a better job of signalling what that ambiguity implies, and could certainly avoid being misleading in our nomenclature) and that is an issue for the application to solve, e.g. by examining the state of things to see what it needs to do, or simply being prepared for the results of a duplicated transaction.
----

So for this issue I think it is sufficient to provide a different exception type for the case where we have issued the commit.
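To illustrate the "be prepared for a duplicated transaction" approach mentioned above, here is a minimal sketch using the qpid.messaging python client. The queue name, use of msg.content as a duplicate key and the processing step are illustrative assumptions, not part of any fix: the receiver remembers what it has already processed, so a replayed transaction that re-delivers the same messages does no harm.

# Sketch only: tolerate a duplicated transaction by de-duplicating received messages.
# Queue name and duplicate key are illustrative assumptions.
from qpid.messaging import Connection

connection = Connection("20.0.20.200", reconnect=True)
connection.open()
try:
    session = connection.session(transactional=True)
    receiver = session.receiver("transfer-queue")
    seen = set()                        # content already processed in this run
    for _ in range(100):
        msg = receiver.fetch(timeout=10)
        if msg.content not in seen:     # skip re-deliveries from a replayed TX
            seen.add(msg.content)
            # ... process the message here ...
        session.acknowledge(msg)
        session.commit()
finally:
    connection.close()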
comment 2 is NOTABUG. Although it is true that we don't fully reconcile transactions, that's OK because they are still atomic and we don't guarantee that they are exactly-once anyway. The backups will all reconcile their queues with the new primary so everything will be consistent even if a TX gets duplicated or lost.

New theory: the problem is with backups joining while transactions are in progress. The backups are not aware of the transactions and will sync transactional messages just like ordinary messages. This means that it is possible for a new backup to consider itself "ready" when it has only part of a transactional result.
Created attachment 931976 [details]
Test for transaction atomicity

Test to verify that transactions are atomic in the presence of HA failover. The test is ha/bin/ha-tx-atomic.sh; supporting scripts are included. Add ha/bin to your PATH to run the script.
This bug is partly not-a-bug but it does raise an error handling issue that needs to be fixed, see: https://reviews.apache.org/r/25151/
Created attachment 932450 [details]
Added a second test

Added a second test:

ha-tx-txtest2.sh: Runs qpid-txtest2 with failover only during the transfer phase. Ignores TransactionUnknown errors from the transfer phase but runs the verify phase to ensure no extra/missing messages.

ha-tx-atomic.sh: Runs transactional qpid-send/qpid-receive with failover and verifies that the number of enqueues/dequeues and the message depth are a multiple of the transaction size.
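For reference, the property ha-tx-atomic.sh checks can be stated as a simple predicate. The sketch below (plain python, names hypothetical, not taken from the script) just spells out the invariant: if transactions are atomic, the enqueue count, dequeue count and queue depth are all multiples of the transaction size, even across failover.

# Sketch of the atomicity invariant checked by ha-tx-atomic.sh (names hypothetical).
def atomic_invariant_holds(enqueues, dequeues, depth, tx_size):
    """All counters must correspond to whole transactions."""
    return all(n % tx_size == 0 for n in (enqueues, dequeues, depth))

# With 10 messages per transaction, 250 enqueues would indicate a torn transaction:
assert atomic_invariant_holds(300, 200, 100, 10)
assert not atomic_invariant_holds(250, 200, 50, 10)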
Pushed to mrg repo http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.22-mrg-aconway-bz1128051

QPID-5975: HA extra/missing messages when running qpid-txtest2 in a loop with failover.

This commit fixes a client error handling problem, as explained below:

qpid-txtest2 enqueues initial messages at the start and drains them at the end to verify - this is *not transactional*. It is normal to see duplicates if there is a failover during initialization or draining. Modifying the test to fail over only during the transactional transfer phase gives us failures with "no messages to fetch".

Failover during a transaction raises TransactionAborted. The client assumes the transaction rolled back and replays it. However, if the failover occurs at a critical point *after* the client has sent commit but *before* it has received a response, then the client *does not know* whether the transaction committed or rolled back on the new primary. Replaying in this case repeats the transaction.

Each transaction moves messages from one queue to another. Repeated transactions may try to move more messages than exist on the queue, hence "no messages to fetch". For example, if thread 1 moves N messages from q1 to q2, and thread 2 tries to move N+M messages back, then thread 2 will fail.

To fix this the C++ and python clients now raise the following exceptions:

- TransactionAborted: Transaction definitely rolled back due to connection failure before commit or a broker error (e.g. a store error) during commit. It can safely be replayed.

- TransactionUnknown: Outcome is unknown because the connection failed at the critical time. There's no simple automatic way to know what happened without examining the state of the broker queues.

The JIRA has two reproducers and supporting scripts in ha.tgz:

ha-tx-txtest2.sh: Runs qpid-txtest2 with failover only during the transfer phase. Ignores TransactionUnknown errors from the transfer phase but runs the verify phase to ensure no extra/missing messages.

ha-tx-atomic.sh: Runs transactional qpid-send/qpid-receive with failover and verifies that the number of enqueues/dequeues and the message depth are a multiple of the transaction size.

Note: the original bug also reported missing messages from qpid-txtest2. I don't have a good explanation for that, but since the new reproducers are passing I am letting that go.
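For illustration, a minimal sketch of how a python application might react to the two exceptions described above (assuming the qpid.messaging API; the exception import path, queue name and batch contents are assumptions, not taken from the patch): replay on TransactionAborted, stop and reconcile on TransactionUnknown.

# Sketch only: sender-side handling of the two transaction outcomes.
from qpid.messaging import Connection, Message
from qpid.messaging.exceptions import TransactionAborted, TransactionUnknown

connection = Connection("20.0.20.200", reconnect=True)
connection.open()
session = connection.session(transactional=True)
sender = session.sender("tx-queue")
batch = [Message("msg-%d" % i) for i in range(10)]   # one transaction's worth

while True:
    try:
        for m in batch:
            sender.send(m)
        session.commit()
        break                    # definitely committed, exactly once
    except TransactionAborted:
        continue                 # definitely rolled back: safe to replay the batch
    except TransactionUnknown:
        # Outcome unknown: the commit may or may not have taken effect on the
        # new primary. Examine the broker queues (or tolerate a duplicate)
        # before deciding whether to replay; do not blindly re-send.
        raise
connection.close()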
Created attachment 932785 [details] Updated tests
Doc bug for new TransactionUnknown exception is Bug 1135613 - New exception TransactionUnknown in C++ and python APIs.
The original issue reproduced on -47 using:

N1=192.168.6.189
V=$(get_vip)
LOCAL_ENA=1

# part A
while true; do
  wait_for_ha_states $N1 1 0+
  list_ha_nodes_states $N1
  ./qpid-txtest2 -b $V --total-messages 100 --connection-options '{reconnect:true}' --messages-per-tx 10 --tx-count 10
done

# part B
while wait_for_ha_states $N1 1 1+; do
  usleep $RANDOM
  kill_broker $(get_primary $N1) 9
done
Hi Alan, I'm seeing the original reproducer failing (extra messages, not missing) on both -47 and -48 (comment 0). Is the original reproducer deprecated in favor of the two new ones (ha-tx-txtest2.sh, ha-tx-atomic.sh) in comment 9?
(In reply to Frantisek Reznicek from comment #15)
> Hi Alan, I'm seeing the original reproducer failing (extra messages, not
> missing) on both -47 and -48 (comment 0). Is the original reproducer
> deprecated in favor of the two new ones (ha-tx-txtest2.sh, ha-tx-atomic.sh)
> in comment 9?

Yes. ha-tx-txtest2 is close to the original reproducer but only kills the broker during the transfer stage. It does the init and verify stages separately in a quiet environment: those stages are not transactional, so duplicates are expected if there is a failover during them, which would cause verify to fail. That still verifies that transactions behave atomically.
The issue has been fixed. Retested mainly using the modified ha-tx-txtest2 reproducer, and also tested using ha-tx-atomic with only a few iterations due to the bug 1138706 long catch-up effect.

Tested on two independent RHEL 6.5 i686 / x86_64 3-node clusters using packages:

perl-qpid-0.22-13.el6.i686
perl-qpid-debuginfo-0.22-13.el6.i686
python-qpid-0.22-18.el6.noarch
python-qpid-proton-doc-0.7-3.el6.noarch
python-qpid-qmf-0.22-39.el6.i686
qpid-cpp-client-0.22-48.el6.i686
qpid-cpp-client-devel-0.22-48.el6.i686
qpid-cpp-client-devel-docs-0.22-48.el6.noarch
qpid-cpp-client-rdma-0.22-48.el6.i686
qpid-cpp-debuginfo-0.22-48.el6.i686
qpid-cpp-server-0.22-48.el6.i686
qpid-cpp-server-devel-0.22-48.el6.i686
qpid-cpp-server-ha-0.22-48.el6.i686
qpid-cpp-server-linearstore-0.22-48.el6.i686
qpid-cpp-server-rdma-0.22-48.el6.i686
qpid-cpp-server-xml-0.22-48.el6.i686
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-4.el6.i686
qpid-proton-c-devel-0.7-4.el6.i686
qpid-proton-c-devel-doc-0.7-3.el6.noarch
qpid-proton-debuginfo-0.7-4.el6.i686
qpid-qmf-0.22-39.el6.i686
qpid-qmf-debuginfo-0.22-39.el6.i686
qpid-qmf-devel-0.22-39.el6.i686
qpid-snmpd-1.0.0-16.el6.i686
qpid-snmpd-debuginfo-1.0.0-16.el6.i686
qpid-tests-0.22-16.el6.noarch
qpid-tools-0.22-16.el6.noarch
ruby-qpid-qmf-0.22-39.el6.i686

resp.

perl-qpid-0.22-13.el6.x86_64
perl-qpid-debuginfo-0.22-13.el6.x86_64
python-qpid-0.22-18.el6.noarch
python-qpid-proton-doc-0.7-3.el6.noarch
python-qpid-qmf-0.22-39.el6.x86_64
qpid-cpp-client-0.22-48.el6.x86_64
qpid-cpp-client-devel-0.22-48.el6.x86_64
qpid-cpp-client-devel-docs-0.22-48.el6.noarch
qpid-cpp-client-rdma-0.22-48.el6.x86_64
qpid-cpp-debuginfo-0.22-48.el6.x86_64
qpid-cpp-server-0.22-48.el6.x86_64
qpid-cpp-server-devel-0.22-48.el6.x86_64
qpid-cpp-server-ha-0.22-48.el6.x86_64
qpid-cpp-server-linearstore-0.22-48.el6.x86_64
qpid-cpp-server-rdma-0.22-48.el6.x86_64
qpid-cpp-server-xml-0.22-48.el6.x86_64
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-4.el6.x86_64
qpid-proton-c-devel-0.7-4.el6.x86_64
qpid-proton-c-devel-doc-0.7-3.el6.noarch
qpid-proton-debuginfo-0.7-4.el6.x86_64
qpid-qmf-0.22-39.el6.x86_64
qpid-qmf-debuginfo-0.22-39.el6.x86_64
qpid-qmf-devel-0.22-39.el6.x86_64
qpid-snmpd-1.0.0-16.el6.x86_64
qpid-snmpd-debuginfo-1.0.0-16.el6.x86_64
qpid-tests-0.22-16.el6.noarch
qpid-tools-0.22-16.el6.noarch
ruby-qpid-qmf-0.22-39.el6.x86_64

-> VERIFIED