Bug 1128051 - QPID-5975: HA extra/missing messages when running qpid-txtest2 in a loop with failover.
Summary: QPID-5975: HA extra/missing messages when running qpid-txtest2 in a loop with failover.
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 3.0
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
Blocks: 923985
 
Reported: 2014-08-08 07:37 UTC by Alan Conway
Modified: 2015-01-21 12:55 UTC
CC: 5 users

Fixed In Version: qpid-cpp-0.22-48
Doc Type: Bug Fix
Last Closed: 2015-01-21 12:55:08 UTC


Attachments
- ha script to kill & wait for cluster (2.28 KB, text/plain), 2014-08-08 07:37 UTC, Alan Conway
- Test for transaction atomicity (30.00 KB, application/octet-stream), 2014-08-28 14:12 UTC, Alan Conway
- Added a second test (6.45 KB, application/octet-stream), 2014-08-28 19:00 UTC, Alan Conway
- Updated tests (34.16 KB, application/octet-stream), 2014-08-29 18:32 UTC, Alan Conway


Links
- Apache JIRA QPID-5975
- Red Hat Bugzilla 1089362 (medium, NEW): messages re-apear after failover, xa, non-durable (last updated 2022-04-23 08:29:03 UTC)

Internal Links: 1089362

Description Alan Conway 2014-08-08 07:37:21 UTC
Created attachment 925103 [details]
ha script to kill & wait for cluster

Description of problem:

Running qpid-txtest2 in a loop against a 3 node cluster while killing primary brokers to force failover eventually results in the test failing due to extra or missing messages.

Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:

In separate windows:
$ while ha stat; qpid-txtest2 -b 20.0.20.200 --total-messages 100 --connection-options '{reconnect:true}' --messages-per-tx 10 --tx-count 10; do true; done
$ while ha wait -a 2; do sleep .5; ha kill; done # ha script attached.

Actual result:

Eventually (up to 10 minutes) the test fails with a message like: 

The following ids were missing:
    'msg-100'
    'msg-51'
    'msg-52'
    'msg-53'

OR 

The following extra ids were encountered:
    'msg-1'
    'msg-1'
    'msg-10'


Expected result: Test runs forever, no failures.

Comment 1 Alan Conway 2014-08-08 12:47:31 UTC
There are two problems here:

1. When the primary is killed the backups may not be at the same point in processing a transaction. E.g. the primary may have committed but the backups not yet, or one backup may have committed but not the other. That creates inconsistencies that result in extra or missing messages.

2. There may not be agreement between the client and the brokers about the transaction outcome. If the client sends commit and the primary is killed before responding, then the client will assume the transaction aborted. However, it could be that the primary was killed after it (and the backups) had committed the transaction but before the primary could send the response frame. In that case the client would believe the transaction aborted and replay it, but the brokers would believe it committed, and therefore it would be executed twice.

The proper solution here is a recovery phase like DTX; in fact this really IS DTX.

For 1. we can implement such a phase between the backups since we have full control there (we already have a DTX-like prepare phase), and ensure the backups are always consistent and always commit the transaction if the client *could have* received notice that it committed.

For 2. the TX protocol has no way to let the client check the outcome of in-doubt transactions. The best I can think of is to raise a "TransactionUnknown" exception so at least the user knows there is trouble - however there is not much they can do about it. I'm consulting with others for better solutions.

Comment 2 Alan Conway 2014-08-11 08:01:21 UTC
Regarding issue 2 from comment 1, from discussion with gsim:

---- gsim: 
Arguably TransactionAborted is being over-specific, but where we can be
sure that it has [aborted] there's no harm in that. However, as you say, once we
have actually issued the commit, the name of that error is
misleading or even downright inaccurate.

So I think either we should add a distinct exception for the 'status 
unknown' case, or we should change the name of the existing exception to 
be less definite. E.g. TransactionFailure.
...
Local transactions really just provide atomicity guarantees. They aren't
intended to provide exactly-once semantics. ... Yes, there is ambiguity in the case
of a failure (and we could do a better job of signalling what that
ambiguity implies, and could certainly avoid being misleading in our
nomenclature) and that is an issue for the application to solve, e.g. by
examining the state of things to see what it needs to do, or simply
being prepared for the results of a duplicated transaction.
----

So for this issue I think it is sufficient to provide a different exception type for the case where we have issued the commit.

Comment 3 Alan Conway 2014-08-12 14:16:51 UTC
The issue in comment 2 is NOTABUG. Although it is true that we don't fully reconcile transactions, that's OK: they are still atomic, and we don't guarantee exactly-once delivery anyway. The backups will all reconcile their queues with the new primary, so everything will be consistent even if a TX gets duplicated or lost.

New theory: the problem is with backups joining while transactions are in progress. The backups are not aware of the transactions and will sync transactional messages just like ordinary messages. This means it is possible for a new backup to consider itself "ready" when it holds only part of a transaction's results.

Comment 6 Alan Conway 2014-08-28 14:12:56 UTC
Created attachment 931976 [details]
Test for transaction atomicity

A test to verify that transactions are atomic in the presence of HA failover. The test is ha/bin/ha-tx-atomic.sh; supporting scripts are included. Add ha/bin to your PATH to run the script.

Comment 7 Alan Conway 2014-08-28 14:29:17 UTC
This bug is partly not-a-bug but it does raise an error handling issue that needs to be fixed, see: https://reviews.apache.org/r/25151/

Comment 8 Alan Conway 2014-08-28 19:00:46 UTC
Created attachment 932450 [details]
Added a second test

Added a second test:

    ha-tx-txtest2.sh: Runs qpid-txtest2 with failover only during the transfer
    phase. Ignores TransactionUnknown errors from the transfer phase but runs the
    verify phase to ensure no extra/missing messages.

    ha-tx-atomic.sh: Runs transactional qpid-send/qpid-receive with fail-over and
    verifies that the number of enqueues/dequeues and message depth are a multiple
    of the transaction size.
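
The invariant ha-tx-atomic.sh checks boils down to a few modulo tests. A minimal sketch in Python (illustrative only: the counter values below are made up, and in the real script the counters would be read from the broker, e.g. with qpid-stat):

TX_SIZE = 10  # transaction size, e.g. --messages-per-tx 10

def check_atomic(enqueues, dequeues, depth):
    # If every transaction commits or aborts as a unit, the broker's counters
    # can only ever change in whole multiples of the transaction size.
    assert enqueues % TX_SIZE == 0, "partial transaction enqueued"
    assert dequeues % TX_SIZE == 0, "partial transaction dequeued"
    assert depth % TX_SIZE == 0, "queue depth is not a whole number of transactions"

check_atomic(enqueues=500, dequeues=300, depth=200)  # illustrative values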

Comment 9 Alan Conway 2014-08-28 21:46:01 UTC
Pushed to mrg repo 
http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.22-mrg-aconway-bz1128051

QPID-5975: HA extra/missing messages when running qpid-txtest2 in a loop with failover.

This commit fixes a client error handling problem as explained below:

qpid-txtest2 enqueues initial messages at the start and drains them at the end to
verify - this is *not transactional*. It is normal to see duplicates if there is
a failover during initialization or draining.

Modifying the test to fail-over only during the transactional transfer phase
gives us failures with "no messages to fetch". Failover during a transaction
raises TransactionAborted. The client assumes the transaction rolled back and
re-plays it. However, if the failover occurs at a critical point *after* the
client has sent commit but *before* it has received a response, then the client
*does not know* whether the transaction committed or rolled back on the new
primary.

Re-playing in this case repeats the transaction. Each transaction moves messages
from one queue to another. Repeated transactions may try to move more messages
than exist on the queue, hence "no messages to fetch". For example if thread 1
moves N messages from q1 to q2, and thread 2 tries to move N+M messages back,
then thread 2 will fail.
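
To make that arithmetic concrete, here is a toy model of the replayed move in plain Python (illustrative only; real messages live on broker queues):

# Each transaction moves a batch of messages from q1 to q2. If a transaction
# that actually committed is replayed because the client assumed it aborted,
# the replay may try to fetch messages that are no longer there.
q1 = ["msg-%d" % i for i in range(10)]
q2 = []

def move(n):
    if len(q1) < n:
        raise Exception("no messages to fetch")
    for _ in range(n):
        q2.append(q1.pop(0))

move(10)  # the transaction commits on the broker, but the response is lost...
move(10)  # ...so the client replays it and fails with "no messages to fetch"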

To fix this the C++ and python clients now raise the following exceptions:

- TransactionAborted: Transaction definitely rolled back due to connection
  failure before commit or broker error (e.g. a store error) during commit.
  It can safely be replayed.

- TransactionUnknown: Outcome is unknown because the connection failed at the
  critical time. There's no simple automatic way to know what happened without
  examining the state of the broker queues.
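
A minimal sketch of how a client might handle these two exceptions (Python qpid.messaging; the run_tx/do_tx helpers are hypothetical, and the assumption that the session remains usable after automatic reconnect is mine, not part of the fix):

from qpid.messaging.exceptions import TransactionAborted, TransactionUnknown

def run_tx(session, do_tx):
    # do_tx(session) performs one transaction's sends/fetches (hypothetical).
    while True:
        try:
            do_tx(session)
            session.commit()
            return
        except TransactionAborted:
            # Definitely rolled back on the broker: safe to replay.
            continue
        except TransactionUnknown:
            # Connection failed after commit was sent but before the response
            # arrived; only the application can reconcile broker state, so
            # escalate instead of replaying.
            raise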


The JIRA has two reproducers and supporting scripts in ha.tgz:

ha-tx-txtest2.sh: Runs qpid-txtest2 with failover only during the transfer
phase.  Ignores TransactionUnknown errors from the transfer phase but runs the
verify phase to ensure no extra/missing messages.

ha-tx-atomic.sh: Runs transactional qpid-send/qpid-receive with fail-over and
verifies that the number of enqueues/dequeues and message depth are a multiple
of the transaction size.

Note: the original bug also reported seeing missing messages from
qpid-txtest2. I don't have a good explanation for that, but since the
new reproducers are passing I am letting that go.

Comment 12 Alan Conway 2014-08-29 18:32:23 UTC
Created attachment 932785 [details]
Updated tests

Comment 13 Alan Conway 2014-08-29 19:20:35 UTC
Doc bug for new TransactionUnknown exception is Bug 1135613 - New exception TransactionUnknown in C++ and python APIs.

Comment 14 Frantisek Reznicek 2014-09-04 14:57:57 UTC
The original issue reproduced on -47 using:

N1=192.168.6.189
V=$(get_vip)
LOCAL_ENA=1
# part A
while true; do
  wait_for_ha_states $N1 1 0+
  list_ha_nodes_states $N1
  ./qpid-txtest2 -b $V --total-messages 100 --connection-options '{reconnect:true}' --messages-per-tx 10 --tx-count 10
done

# part B
while wait_for_ha_states $N1 1 1+; do
  usleep $RANDOM
  kill_broker $(get_primary $N1) 9
done

Comment 15 Frantisek Reznicek 2014-09-05 14:03:13 UTC
Hi Alan, I'm seeing the original reproducer failing (extra messages, not missing) on both -47 and -48 (comment 0).
Is the original reproducer deprecated in favor of the two new ones (ha-tx-txtest2.sh, ha-tx-atomic.sh) from comment 9?

Comment 16 Alan Conway 2014-09-05 16:07:55 UTC
(In reply to Frantisek Reznicek from comment #15)
> Hi Alan, I'm seeing the original reproducer failing (extra messages, not
> missing) on both -47 and -48 (comment 0).
> Is the original reproducer deprecated in favor of the two new ones
> (ha-tx-txtest2.sh, ha-tx-atomic.sh) from comment 9?

Yes. ha-tx-txtest2.sh is close to the original reproducer but only kills the broker during the transfer stage. It does the init and verify stages separately, in a quiet environment, since they are not transactional: duplicates are expected if there's a failover during those phases, which would cause verify to fail.

That still verifies that transactions behave atomically.

Comment 17 Frantisek Reznicek 2014-09-05 19:48:24 UTC
The issue has been fixed. Retested mainly using the modified ha-tx-txtest2 reproducer, and also using ha-tx-atomic with only a few iterations due to the long catch-up effect of bug 1138706. Tested on two independent RHEL 6.5 i686 / x86_64 3-node clusters using the following packages:

perl-qpid-0.22-13.el6.i686
perl-qpid-debuginfo-0.22-13.el6.i686
python-qpid-0.22-18.el6.noarch
python-qpid-proton-doc-0.7-3.el6.noarch
python-qpid-qmf-0.22-39.el6.i686
qpid-cpp-client-0.22-48.el6.i686
qpid-cpp-client-devel-0.22-48.el6.i686
qpid-cpp-client-devel-docs-0.22-48.el6.noarch
qpid-cpp-client-rdma-0.22-48.el6.i686
qpid-cpp-debuginfo-0.22-48.el6.i686
qpid-cpp-server-0.22-48.el6.i686
qpid-cpp-server-devel-0.22-48.el6.i686
qpid-cpp-server-ha-0.22-48.el6.i686
qpid-cpp-server-linearstore-0.22-48.el6.i686
qpid-cpp-server-rdma-0.22-48.el6.i686
qpid-cpp-server-xml-0.22-48.el6.i686
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-4.el6.i686
qpid-proton-c-devel-0.7-4.el6.i686
qpid-proton-c-devel-doc-0.7-3.el6.noarch
qpid-proton-debuginfo-0.7-4.el6.i686
qpid-qmf-0.22-39.el6.i686
qpid-qmf-debuginfo-0.22-39.el6.i686
qpid-qmf-devel-0.22-39.el6.i686
qpid-snmpd-1.0.0-16.el6.i686
qpid-snmpd-debuginfo-1.0.0-16.el6.i686
qpid-tests-0.22-16.el6.noarch
qpid-tools-0.22-16.el6.noarch
ruby-qpid-qmf-0.22-39.el6.i686

and, on x86_64:

perl-qpid-0.22-13.el6.x86_64
perl-qpid-debuginfo-0.22-13.el6.x86_64
python-qpid-0.22-18.el6.noarch
python-qpid-proton-doc-0.7-3.el6.noarch
python-qpid-qmf-0.22-39.el6.x86_64
qpid-cpp-client-0.22-48.el6.x86_64
qpid-cpp-client-devel-0.22-48.el6.x86_64
qpid-cpp-client-devel-docs-0.22-48.el6.noarch
qpid-cpp-client-rdma-0.22-48.el6.x86_64
qpid-cpp-debuginfo-0.22-48.el6.x86_64
qpid-cpp-server-0.22-48.el6.x86_64
qpid-cpp-server-devel-0.22-48.el6.x86_64
qpid-cpp-server-ha-0.22-48.el6.x86_64
qpid-cpp-server-linearstore-0.22-48.el6.x86_64
qpid-cpp-server-rdma-0.22-48.el6.x86_64
qpid-cpp-server-xml-0.22-48.el6.x86_64
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-4.el6.x86_64
qpid-proton-c-devel-0.7-4.el6.x86_64
qpid-proton-c-devel-doc-0.7-3.el6.noarch
qpid-proton-debuginfo-0.7-4.el6.x86_64
qpid-qmf-0.22-39.el6.x86_64
qpid-qmf-debuginfo-0.22-39.el6.x86_64
qpid-qmf-devel-0.22-39.el6.x86_64
qpid-snmpd-1.0.0-16.el6.x86_64
qpid-snmpd-debuginfo-1.0.0-16.el6.x86_64
qpid-tests-0.22-16.el6.noarch
qpid-tools-0.22-16.el6.noarch
ruby-qpid-qmf-0.22-39.el6.x86_64


-> VERIFIED

