Bug 486159 - Java client failover test failure
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-java
Version: 1.1
Platform: All Linux
Priority: medium  Severity: medium
Target Milestone: 1.1.1
Target Release: ---
Assigned To: Rajith Attapattu
QA Contact: Frantisek Reznicek
Keywords: Reopened
Depends On:
Blocks:
Reported: 2009-02-18 12:33 EST by Rajith Attapattu
Modified: 2015-11-15 19:06 EST (History)
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-07-12 11:03:29 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Test case patch (12.40 KB, patch)
2009-02-18 23:15 EST, Rajith Attapattu
Broker + client log (218.51 KB, application/octet-stream)
2009-02-18 23:17 EST, Rajith Attapattu

Description Rajith Attapattu 2009-02-18 12:33:18 EST
The following test case (not committed to the source tree yet) is failing. The following possibilities exist, and I haven't ruled out any of them:

1. The test is wrong
2. The JMS client is at fault
3. The C++ broker is at fault

The test case is as follows.

A cluster consisting of two brokers is created. The JMS client connects to one broker, sends x messages, and consumes y messages. Then we shut down the primary broker and allow the Java client to fail over to the other broker.
We then send another p messages and try to consume (x + p - y) messages.
We also check the order of the messages.
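The expected accounting can be sketched as a broker-independent toy model (Python; the function name and message naming are illustrative, not part of the actual test):

```python
# Toy model of the failover test's expected accounting.
# x messages sent, y consumed before failover; p sent after.
# The client should then receive exactly (x + p - y) messages,
# in order, with no redelivery of already-acked messages.

def expected_after_failover(x, y, p):
    sent = [f"message{i}" for i in range(x)]             # pre-failover sends
    consumed, pending = sent[:y], sent[y:]               # y consumed, rest pending
    pending += [f"message{i}" for i in range(x, x + p)]  # post-failover sends
    assert len(pending) == x + p - y
    return pending

# The hard-coded settings from comment 1: x=5, y=3, p=5 -> 7 remaining,
# starting at message3.
print(expected_after_failover(5, 3, 5))
```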
Comment 1 Rajith Attapattu 2009-02-18 23:10:40 EST
After more investigation it seems the error is on the broker side. I have attached a patch that can be applied and run as follows.

ant build test -Dprofile=cpp.cluster -Dtest=FailoverTest

For simplicity I have commented out all tests except testP2PFailoverWithMessagesLeftToConsumeAndProduce (which is the test case described above). Also, I have hard-coded the following settings.

The test sends 5 messages, consumes 3, then shuts down the broker and triggers failover. Then it sends 5 more messages and tries to consume the remaining 7 messages.

I have added log messages (that are prominently marked) to indicate messages sent or received. Message numbers start with zero with message0 being the first message.

If you look at the attached log you can clearly see that the client sends the 5 messages (message0 to message4) and consumes 3 (message0 to message2) as expected. Then failover happens. (You can also see that the client has acked the messages and the broker has received them.)

Then you can see it sends 5 messages (message5 to message9). Then it consumes message3 and message4 as expected. **BUT then the client receives message0 again!!!**.

So it appears that the acks are not replicated properly.
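The suspected failure mode can be illustrated with a toy model (Python; this is a hypothetical simulation of the behavior seen in the log, not broker code):

```python
# Toy model of the suspected failure: the primary applies accepts for
# transfer ranges [0,0]..[2,2], but the replica sees MessageAccept frames
# with EMPTY transfer ranges, so nothing is dequeued on the backup and the
# old messages are redelivered after failover.

def apply_accepts(queue, accepted_ranges):
    # Remove every message whose position falls inside an accepted range
    # (positions stand in for transfer ids in this toy model).
    acked = {i for lo, hi in accepted_ranges for i in range(lo, hi + 1)}
    return [m for idx, m in enumerate(queue) if idx not in acked]

queue = [f"message{i}" for i in range(5)]

primary = apply_accepts(queue, [(0, 0), (1, 1), (2, 2)])  # acks applied
replica = apply_accepts(queue, [])                        # empty transfers: nothing acked

print(primary[0])  # message3 -- the correct next delivery
print(replica[0])  # message0 -- redelivered after failover, as seen in the test
```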

Below is a grep from the attached log showing that the client is sending the acks and the broker is receiving them.


======================== Client sending acks ==================================
main 2009-02-18 17:21:35,082 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=0 id=14 MessageAccept(transfers={[0, 0]})
main 2009-02-18 17:21:35,088 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=0 id=15 MessageAccept(transfers={[1, 1]})
main 2009-02-18 17:21:35,092 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=0 id=16 MessageAccept(transfers={[2, 2]})

------- the above 3 show the acks for the first 3 messages ----------

IoReceiver - localhost/127.0.0.1:5673 2009-02-18 17:21:35,188 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=3 id=14 MessageAccept(transfers={})
IoReceiver - localhost/127.0.0.1:5673 2009-02-18 17:21:35,188 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=3 id=15 MessageAccept(transfers={})
IoReceiver - localhost/127.0.0.1:5673 2009-02-18 17:21:35,188 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=3 id=16 MessageAccept(transfers={})

-----------------------------------------------------------------------------
Below you can see the acks being sent for the 3 messages consumed after failover.
The 3rd message received is message0; it should have been message5.
-----------------------------------------------------------------------------
main 2009-02-18 17:21:35,231 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=3 id=28 MessageAccept(transfers={[0, 0]})
main 2009-02-18 17:21:35,272 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=3 id=29 MessageAccept(transfers={[1, 1]})
main 2009-02-18 17:21:35,279 DEBUG [apache.qpid.transport.Connection] SEND: [conn:7a5b7f] ch=3 id=30 MessageAccept(transfers={[2, 2]})
=================================================================================


============= Broker receiving acks =========================================
2009-feb-18 17:21:35 trace 10.15.16.125:8541(READY) DLVR: 10.15.16.125:8541-0x86340c0/32 Frame[BEbe; channel=0; {MessageAcceptBody: transfers={ [0,0] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 14: {MessageAcceptBody: transfers={ [0,0] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8541-0x86340c0/32 Frame[BEbe; channel=0; {MessageAcceptBody: transfers={ [0,0] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 14: {MessageAcceptBody: transfers={ [0,0] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8541(READY) DLVR: 10.15.16.125:8541-0x86340c0/34 Frame[BEbe; channel=0; {MessageAcceptBody: transfers={ [1,1] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 15: {MessageAcceptBody: transfers={ [1,1] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8541-0x86340c0/34 Frame[BEbe; channel=0; {MessageAcceptBody: transfers={ [1,1] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 15: {MessageAcceptBody: transfers={ [1,1] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8541(READY) DLVR: 10.15.16.125:8541-0x86340c0/36 Frame[BEbe; channel=0; {MessageAcceptBody: transfers={ [2,2] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 16: {MessageAcceptBody: transfers={ [2,2] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8541-0x86340c0/36 Frame[BEbe; channel=0; {MessageAcceptBody: transfers={ [2,2] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 16: {MessageAcceptBody: transfers={ [2,2] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8519-0x8482f70/41 Frame[BEbe; channel=3; {MessageAcceptBody: transfers={ }; }] rc=0

------- above shows the acks for the first 3 messages being received --------------
It shows 2 entries per ack received. I assume the second is the replication message.
---------------------------------------------------------------------------
Below it shows the acks being received for 3 messages consumed after failover.

2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 14: {MessageAcceptBody: transfers={ }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8519-0x8482f70/41 Frame[BEbe; channel=3; {MessageAcceptBody: transfers={ }; }] rc=0
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 15: {MessageAcceptBody: transfers={ }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8519-0x8482f70/41 Frame[BEbe; channel=3; {MessageAcceptBody: transfers={ }; }] rc=0
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 16: {MessageAcceptBody: transfers={ }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8519-0x8482f70/53 Frame[BEbe; channel=3; {MessageAcceptBody: transfers={ [0,0] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 28: {MessageAcceptBody: transfers={ [0,0] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8519-0x8482f70/55 Frame[BEbe; channel=3; {MessageAcceptBody: transfers={ [1,1] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 29: {MessageAcceptBody: transfers={ [1,1] }; }
2009-feb-18 17:21:35 trace 10.15.16.125:8519(READY) DLVR: 10.15.16.125:8519-0x8482f70/57 Frame[BEbe; channel=3; {MessageAcceptBody: transfers={ [2,2] }; }] rc=1
2009-feb-18 17:21:35 debug 73f67273-21df-466b-bf25-790f72a1d44c@guest@QPID: recv cmd 30: {MessageAcceptBody: transfers={ [2,2] }; }
==============================================================================
Comment 2 Rajith Attapattu 2009-02-18 23:15:35 EST
Created attachment 332496 [details]
Test case patch

Apply it inside the java folder.
Then run "ant build test -Dprofile=cpp.cluster -Dtest=FailoverTest".
Comment 3 Rajith Attapattu 2009-02-18 23:17:58 EST
Created attachment 332497 [details]
Broker + client log

If you run the test case, you can find the log at the following location:

java/build/results/systests/TEST-org.apache.qpid.test.client.failover.FailoverTest.testP2PFailoverWithMessagesLeftToConsumeAndProduce.out
Comment 4 Gordon Sim 2009-02-20 05:48:07 EST
Test was making false assumptions. As the messages were resent on failover, the redelivery was to be expected.
Comment 5 Rajith Attapattu 2009-02-24 09:14:14 EST
Per the JMS spec, exactly-once semantics must be provided when using AUTO_ACK mode. To facilitate this, the following changes are needed.

a) The ability to ack synchronously.
b) The ability to publish synchronously (currently it's only possible to do this with persistent messaging).

I have introduced the following system/connection URL properties

a) sync_ack=true - acks synchronously when using AUTO_ACK.

b) sync_publish={persistent|all} - syncs after every publish, depending on the type specified (persistent messages only, or all messages).

Ideally you should be able to configure (b) at the destination level. This can be looked at when revamping the binding URL.

By default, AUTO_ACK and DUPS_OK will behave the same. If a customer wants to use AUTO_ACK with strict JMS compliance, they need to set sync_ack=true and sync_publish=all.
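As a sketch, the options would be appended to the client's connection URL, along these lines (the exact placement and quoting of the options, and whether they belong at the connection or broker-list level, may vary by release; this example is an assumption, not confirmed syntax):

```
amqp://guest:guest@clientid/test?brokerlist='tcp://localhost:5672'&sync_ack='true'&sync_publish='all'
```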

I have reopened the JIRA, but downgraded to medium. A fix will be committed shortly. I will also update the bz with the Qpid JIRA numbers for the above features.
Comment 6 Rajith Attapattu 2009-03-09 15:46:50 EDT
This issue has been corrected with the help of sync_ack=true and sync_publish=true.
Comment 7 Justin Ross 2011-06-27 16:52:28 EDT
Rajith, what revision or release did this get into?
Comment 8 Petr Matousek 2011-07-12 10:58:05 EDT
The issue has been fixed.

The test passes repeatedly.

Verified on RHEL5.6 and RHEL6.1, architectures: i386, x86_64

packages installed:
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-0.10-8.el5
qpid-cpp-client-devel-0.10-8.el5
qpid-cpp-client-devel-docs-0.10-8.el5
qpid-cpp-client-ssl-0.10-8.el5
qpid-cpp-server-0.10-8.el5
qpid-cpp-server-cluster-0.10-8.el5
qpid-cpp-server-devel-0.10-8.el5
qpid-cpp-server-ssl-0.10-8.el5
qpid-cpp-server-store-0.10-8.el5
qpid-cpp-server-xml-0.10-8.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-qmf-devel-0.10-10.el5
qpid-tools-0.10-6.el5

-> VERIFIED
