Description of problem:
Atomicity of messages sent under XA may be lost on failover.

Version-Release number of selected component (if applicable):
qpid-java-client-0.18-2.el5
qpid-java-common-0.18-2.el5
qpid-java-example-0.18-2.el5
qpid-jca-0.18-2.el5
qpid-jca-xarecovery-0.18-2.el

How reproducible:
100%

Steps to Reproduce:
1. Send a message under an XA transaction to a cluster.
2. Commit the transaction.
3. Kill the node the client is connected to, triggering failover.

Actual results:
The message that was sent under a successfully completed XA transaction is redelivered on reconnect.

Expected results:
No message redelivery for committed sends, as this violates atomicity.

Additional info:
See https://issues.apache.org/jira/browse/QPID-2994, which was resolved for non-XA transactions but, from what I can make out, does not address the case where XA transactions are used.
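To make the expected result concrete, here is a toy model in Java. It assumes, purely for illustration, that a client keeps a list of sent-but-unsettled messages to resend after failover. The class and method names (`ReplayBufferSketch`, `Session`, `onCommitCompleted`, `replayOnFailover`) are hypothetical and are not taken from the Qpid client code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the Qpid client's actual data structures.
class ReplayBufferSketch {

    /** Hypothetical session that remembers sent messages for replay after failover. */
    static final class Session {
        private final List<String> unsettled = new ArrayList<>();

        /** Records a message as sent but not yet settled. */
        void send(String message) {
            unsettled.add(message);
        }

        /** Called once the enclosing transaction (XA or local) has committed. */
        void onCommitCompleted() {
            // Committed sends must never be resent, so forget them here.
            unsettled.clear();
        }

        /** Returns a copy of what would be resent to the new node after failover. */
        List<String> replayOnFailover() {
            return new ArrayList<>(unsettled);
        }
    }

    public static void main(String[] args) {
        Session session = new Session();
        session.send("m1");
        session.onCommitCompleted();
        System.out.println("replayed after commit: " + session.replayOnFailover());
    }
}
```

In this model the reported behaviour would correspond to `replayOnFailover()` still returning `m1` after the commit has completed.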
Weston, please assess.
Currently reviewing. At the very least, we need more testing in this area to reproduce the issue consistently. That said, I agree with Gordon's assessment: most likely something in the JMS client is not being handled correctly.
Note: one blocker on this is that Gordon is on vacation, and he is the 'owner' of, or at least the expert on, the DTX code.
My environment:

Broker OS:
Linux carthage 3.6.7-4.fc16.x86_64 #1 SMP Tue Nov 20 20:33:31 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Broker Build:
[wmprice@carthage ~]$ qpid-install/sbin/qpidd -v
qpidd (qpidc) version 0.18
built from 0.18-mrg branch in internal git repo on mrg1

Store Build:
[wmprice@carthage qpid-store]$ svn info
Path: .
URL: http://anonsvn.jboss.org/repos/rhmessaging/store/branches/qpid-0.18
Repository Root: http://anonsvn.jboss.org/repos/rhmessaging
Repository UUID: 06e15bec-b515-0410-bef0-cc27a458cf48
Revision: 4530
Node Kind: directory
Schedule: normal
Last Changed Author: mcressman
Last Changed Rev: 4527
Last Changed Date: 2013-01-02 15:15:03 -0500 (Wed, 02 Jan 2013)

Qpid JMS/JCA Build:
0.18-mrg branch from our internal git repository

JEE Server:
EAP 5.1

In my setup, I am running two brokers on the same OS instance with different ports. Each broker has its own data directory, and they do not share a store. The app server is running on a separate OS (OSX), independent of the broker hosts. I am using the 0.18 version of the JCA adapter, deploying the examples and running them within EAP.

Currently, when running in a cluster with XA, I am unable to reproduce this issue. However, this isn't saying much, as no DTX* information is printed to the logs, which is confusing because in the debugger I can see the XA transaction complete successfully. The client does fail over properly, but the messages sent to the previous node are not replayed. Again, I don't really trust this result, as I can't see any XA/DTX information in the logs at all, so I am a bit miffed at this point.

At any rate, I have a repeatable environment that is automated to set up and run this scenario when Gordon returns.
Adjusted the log settings; DTX info now shows up correctly and the issue becomes apparent right away.
Actually, I am only seeing the following type of info in the logs:

2013-01-16 15:18:58 [Broker] debug preparing: {Xid: format=131075; global-id=1--3f57fe9c:f13b:50f70b08:63; branch-id=-3f57fe9c:f13b:50f70b08:65; }
2013-01-16 15:19:04 [Broker] debug committing: {Xid: format=131075; global-id=1--3f57fe9c:f13b:50f70b08:63; branch-id=-3f57fe9c:f13b:50f70b08:65; }

I am not seeing any DtxSelect/DtxBegin/DtxEnd etc. I am not sure whether something has changed in the Broker logging or whether my settings are wrong. I am using:

--log-enable trace+:Dtx --log-enable trace+:Protocol

I have tried various options to no avail. At any rate, I have also noticed that this issue seems to occur only when multiple XA resources are used within the same XA transaction. I am reviewing this further.
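For reference, the case of multiple XA resources in one XA transaction can be sketched with only the JDK's `javax.transaction.xa` interfaces. This is a minimal sketch of the generic two-phase commit sequence, not the JCA adapter's or the transaction manager's code. `SimpleXid`, `RecordingResource`, `twoPhaseCommit` and `demo` are hypothetical names, the resources are in-memory stand-ins, and the Xid values are made up apart from the format id seen in the log.

```java
import java.util.ArrayList;
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

class TwoResourceXaSketch {

    /** Minimal Xid; the global and branch ids are illustrative only. */
    static final class SimpleXid implements Xid {
        private final byte[] global;
        private final byte[] branch;
        SimpleXid(String global, String branch) {
            this.global = global.getBytes();
            this.branch = branch.getBytes();
        }
        public int getFormatId() { return 131075; } // format id as seen in the broker log
        public byte[] getGlobalTransactionId() { return global.clone(); }
        public byte[] getBranchQualifier() { return branch.clone(); }
    }

    /** Hypothetical in-memory resource that only records the calls made on it. */
    static final class RecordingResource implements XAResource {
        final List<String> calls = new ArrayList<>();
        private final boolean failPrepare;
        RecordingResource(boolean failPrepare) { this.failPrepare = failPrepare; }
        public void start(Xid xid, int flags) { calls.add("start"); }
        public void end(Xid xid, int flags) { calls.add("end"); }
        public int prepare(Xid xid) throws XAException {
            calls.add("prepare");
            if (failPrepare) throw new XAException(XAException.XA_RBROLLBACK);
            return XA_OK;
        }
        public void commit(Xid xid, boolean onePhase) { calls.add("commit"); }
        public void rollback(Xid xid) { calls.add("rollback"); }
        public void forget(Xid xid) { }
        public Xid[] recover(int flag) { return new Xid[0]; }
        public boolean isSameRM(XAResource other) { return other == this; }
        public int getTransactionTimeout() { return 0; }
        public boolean setTransactionTimeout(int seconds) { return false; }
    }

    /** Drives both branches through start/end, prepare, then commit or rollback. */
    static boolean twoPhaseCommit(XAResource[] resources, Xid[] branches) {
        try {
            for (int i = 0; i < resources.length; i++) {
                resources[i].start(branches[i], XAResource.TMNOFLAGS);
                // Transactional work (for example, a message send) would happen here.
                resources[i].end(branches[i], XAResource.TMSUCCESS);
            }
            for (int i = 0; i < resources.length; i++) {
                resources[i].prepare(branches[i]); // phase 1
            }
            for (int i = 0; i < resources.length; i++) {
                resources[i].commit(branches[i], false); // phase 2
            }
            return true;
        } catch (XAException e) {
            // A failure before commit rolls back every branch in this sketch.
            for (int i = 0; i < resources.length; i++) {
                try { resources[i].rollback(branches[i]); } catch (XAException ignored) { }
            }
            return false;
        }
    }

    /** Runs one transaction over two resources and reports the outcome and recorded calls. */
    static String demo(boolean failSecondPrepare) {
        RecordingResource first = new RecordingResource(false);
        RecordingResource second = new RecordingResource(failSecondPrepare);
        Xid[] branches = { new SimpleXid("gtx-1", "branch-1"), new SimpleXid("gtx-1", "branch-2") };
        boolean committed = twoPhaseCommit(new XAResource[] { first, second }, branches);
        return committed + " " + first.calls + " " + second.calls;
    }

    public static void main(String[] args) {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```

With two resources enlisted, the coordinator has to go through a separate prepare phase before commit, which matches the separate "preparing" and "committing" lines in the log above.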
Thanks to Rajith, we have a patch. I applied and tested the fix on both trunk and our internal 0.18 branch. One minor modification was required to build against 0.18, so I am submitting a modified version of Rajith's patch in case we need it. I will simply attach it to the BZ. All tests (unit, system and XA/HA failover with JCA) look good.
Created attachment 682763: Patch for XA/HA failover
Patch for XA/HA failover issue for the 0.18-mrg internal branch.
VERIFIED

qpid-java-client-0.18-6.el6.noarch
qpid-java-common-0.18-6.el6.noarch
qpid-java-example-0.18-6.el6.noarch
qpid-jca-0.18-7.el6.noarch
qpid-jca-xarecovery-0.18-7.el6.noarch
qpid-jca-zip-0.18-7.el6.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0561.html