Bug 1127089

Summary: HA mixing tx enqueue and non-tx dequeue leaves extra messages on backup.
Product: Red Hat Enterprise MRG Reporter: Alan Conway <aconway>
Component: qpid-cppAssignee: Alan Conway <aconway>
Status: CLOSED CURRENTRELEASE QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: high    
Version: 3.0CC: esammons, iboverma, jross, zkraus
Target Milestone: 3.0   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-0.22-46 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Alan Conway 2014-08-06 06:50:16 UTC
Description of problem:

With HA enabled, 2 or more node cluster. If a message is enqueued in a transaction, and then dequeued with no transaction, the message is left on the backup brokers.

Version-Release number of selected component (if applicable):

Up to trunk r1615992, 0.22-mrg f22fae4

How reproducible: 100%

Steps to Reproduce:

qpid-send -a 'q;{create:always}' -b 20.0.20.200 --content-string foo --tx 1
qpid-receive -a 'q;{create:always}' -b 20.0.20.200

Then on each node:
qpid-stat -q --ha-admin

Actual results:

Primary q is empty, backup q still has 1 message
Message is enqueued twice on backup and dequeued only once.

Expected results:

Primary and backup q empty, message enqueued and dequeued once on primary and backup

Additional info:

With a tx dequeue (add --tx 1 to qpid-receive) the message is removed completely on the backup so at first glance things appear OK. However qpid-stat shows it is enqueued and dequeued _twice_ which is incorrect.

Comment 1 Alan Conway 2014-08-06 07:34:39 UTC
Using the python client there is still a message left on the backup, but there is only 1 enqueue and no dequeue, rather than 2 enqueues and 1 dequeue.

Comment 2 Alan Conway 2014-08-07 17:52:07 UTC
http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.22-mrg-aconway-bz1127089&id=4d4091e1b3ea961bb140ac0f8b88195372d9903c

Bug 1127089 - QPID-5966: HA mixing tx enqueue and non-tx dequeue leaves extra messages on backup.0.22-mrg-aconway-bz1127089
There were several problems:

1. Positions of transactionally enqueued messages not known to QueueReplicator, so not dequeued
   on backup if dequeued outside a TX on primary.

2. Race condition if tx created immediately after queue could cause duplication of TX message.

3. Replication IDs were not being set during recovery from store (regression, store change?)

Fix:

1. Update positions QueueReplicator positions via QueueObserver::enqueued to see all enqueues.
2. Check for duplicate replication-ids on backup in QueueReplicator::route.
3. Set replication-id in publish() if not already set in record().

Comment 3 Frantisek Reznicek 2014-08-19 20:12:38 UTC
The issue has been fixed, tested on qpid HAs on RHEL 6.5 i686 / x86_64 using packages:

perl-qpid-debuginfo-0.22-13.el6.x86_64
python-qpid-0.22-17.el6.noarch
python-qpid-qmf-0.22-37.el6.x86_64
qpid-cpp-client-0.22-46.el6.x86_64
qpid-cpp-client-devel-0.22-46.el6.x86_64
qpid-cpp-client-devel-docs-0.22-46.el6.noarch
qpid-cpp-client-rdma-0.22-46.el6.x86_64
qpid-cpp-debuginfo-0.22-46.el6.x86_64
qpid-cpp-server-0.22-46.el6.x86_64
qpid-cpp-server-devel-0.22-46.el6.x86_64
qpid-cpp-server-ha-0.22-46.el6.x86_64
qpid-cpp-server-linearstore-0.22-46.el6.x86_64
qpid-cpp-server-rdma-0.22-46.el6.x86_64
qpid-cpp-server-xml-0.22-46.el6.x86_64
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-3.el6.x86_64
qpid-proton-c-devel-0.7-3.el6.x86_64
qpid-proton-debuginfo-0.7-3.el6.x86_64
qpid-qmf-0.22-37.el6.x86_64
qpid-qmf-debuginfo-0.22-37.el6.x86_64
qpid-qmf-devel-0.22-37.el6.x86_64
qpid-snmpd-1.0.0-16.el6.x86_64
qpid-snmpd-debuginfo-1.0.0-16.el6.x86_64
qpid-tests-0.22-16.el6.noarch
qpid-tools-0.22-14.el6.noarch

perl-qpid-debuginfo-0.22-13.el6.i686
python-qpid-0.22-17.el6.noarch
python-qpid-qmf-0.22-37.el6.i686
qpid-cpp-client-0.22-46.el6.i686
qpid-cpp-client-devel-0.22-46.el6.i686
qpid-cpp-client-devel-docs-0.22-46.el6.noarch
qpid-cpp-client-rdma-0.22-46.el6.i686
qpid-cpp-debuginfo-0.22-46.el6.i686
qpid-cpp-server-0.22-46.el6.i686
qpid-cpp-server-devel-0.22-46.el6.i686
qpid-cpp-server-ha-0.22-46.el6.i686
qpid-cpp-server-linearstore-0.22-46.el6.i686
qpid-cpp-server-rdma-0.22-46.el6.i686
qpid-cpp-server-xml-0.22-46.el6.i686
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-3.el6.i686
qpid-proton-c-devel-0.7-3.el6.i686
qpid-proton-debuginfo-0.7-3.el6.i686
qpid-qmf-0.22-37.el6.i686
qpid-qmf-debuginfo-0.22-37.el6.i686
qpid-qmf-devel-0.22-37.el6.i686
qpid-snmpd-1.0.0-16.el6.i686
qpid-snmpd-debuginfo-1.0.0-16.el6.i686
qpid-tests-0.22-16.el6.noarch
qpid-tools-0.22-14.el6.noarch

Non-transaction dequeue removs message from primary and all backups.


-> VERIFIED


Repro steps:
# common variables
M=192.168.6.251 # virt IP
Q=BZ1127089

# on one node (one message sent, one received, in c++|python)
/var/dtests/node_data/clients/qc2_spout    -b $M  --log-msgs dict --content foo --tx-batch-size 1 -c 1  --tx-action commit   "${Q}_cpp;{create:always}"
/var/dtests/node_data/clients/qc2_drain    -b $M  --log-msgs dict                                 -c 1                       "${Q}_cpp;{create:always}"
/var/dtests/node_data/clients/qc2_spout.py -b $M  --log-msgs dict --content foo --tx-batch-size 1 -c 1  --tx-action commit   "${Q}_py;{create:always}"
/var/dtests/node_data/clients/qc2_drain.py -b $M  --log-msgs dict                                 -c 1                       "${Q}_py;{create:always}"

# on all nodes (queue depth 0 "0 1 1")
qpid-stat -q --ha-admin | grep "${Q}_"