Bug 1095849
| Summary: | JAVA Client Can not recieve message with qpid ha cluster "Session exception occured while trying to commit" | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | masterQi | ||||||||||
| Component: | qpid-cpp | Assignee: | messaging-bugs <messaging-bugs> | ||||||||||
| Status: | CLOSED UPSTREAM | QA Contact: | Michal Toth <mtoth> | ||||||||||
| Severity: | high | Docs Contact: | |||||||||||
| Priority: | high | ||||||||||||
| Version: | Development | CC: | jross, masterQi, mtoth, zkraus | ||||||||||
| Target Milestone: | 3.2 | ||||||||||||
| Target Release: | --- | ||||||||||||
| Hardware: | x86_64 | ||||||||||||
| OS: | Linux | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | qpid-cpp-0.34-4 | Doc Type: | Bug Fix | ||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2025-02-10 03:35:48 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Attachments: |
|
||||||||||||
|
Description
masterQi
2014-05-08 16:31:28 UTC
there are manay qpid-ha tx queue
qpid: list
Summary of Objects by Type:
Package Class Active Deleted
============================================================
org.apache.qpid.broker binding 17 0
org.apache.qpid.broker broker 1 0
org.apache.qpid.broker memory 1 0
org.apache.qpid.broker system 1 0
org.apache.qpid.ha habroker 1 0
org.apache.qpid.legacystore store 1 0
org.apache.qpid.broker subscription 8 0
org.apache.qpid.broker connection 3 0
org.apache.qpid.legacystore journal 2 0
org.apache.qpid.broker session 4 0
org.apache.qpid.broker queue 42 0
org.apache.qpid.broker exchange 9 0
org.apache.qpid.broker vhost 1 0
qpid: list queue
Object Summary:
ID Created Destroyed Index
==================================================================================================================================================================================
132 16:01:41 - org.apache.qpid.broker:queue:log
133 16:01:41 - org.apache.qpid.broker:queue:meizu.game.order.notify
134 16:34:38 - org.apache.qpid.broker:queue:qmfc-v2-GZ-GAME-19.65.19797.1
135 16:34:38 - org.apache.qpid.broker:queue:qmfc-v2-hb-GZ-GAME-19.65.19797.1
136 16:34:38 - org.apache.qpid.broker:queue:qmfc-v2-ui-GZ-GAME-19.65.19797.1
137 16:01:52 - org.apache.qpid.broker:queue:qpid.bridge_queue_qpid.broker-replicator.bridge.55f21fec-fb5d-49e2-92e3-45c3b8c6a31f_58b2be25-5a48-4691-8dcb-2f05e22af240
138 16:25:18 - org.apache.qpid.broker:queue:qpid.ha-tx:050dee8e-eb28-43ae-8853-6b8774ec467a
139 16:31:23 - org.apache.qpid.broker:queue:qpid.ha-tx:16df28dd-47f0-4a39-a96f-c7e07fd79c05
140 16:28:18 - org.apache.qpid.broker:queue:qpid.ha-tx:1b2d1353-8c42-484b-b231-25964a5f6453
141 16:16:18 - org.apache.qpid.broker:queue:qpid.ha-tx:26239421-986e-4dc3-b22a-243e44d36a40
142 16:01:53 - org.apache.qpid.broker:queue:qpid.ha-tx:2dfbb304-2020-464a-a14f-c87418956867
143 16:01:53 - org.apache.qpid.broker:queue:qpid.ha-tx:2e54c358-c200-4d4e-b4ac-ef8edb7c6b9d
144 16:16:18 - org.apache.qpid.broker:queue:qpid.ha-tx:30815be3-714d-4e4b-9733-b61359120ac3
145 16:14:12 - org.apache.qpid.broker:queue:qpid.ha-tx:31521fd7-5308-4a08-8882-bd8aefa0050e
146 16:25:23 - org.apache.qpid.broker:queue:qpid.ha-tx:34fe2710-9243-4884-8abb-85ab54685b1e
147 16:34:19 - org.apache.qpid.broker:queue:qpid.ha-tx:4c4dcb65-7a2c-4ee0-92a1-567da5ae5aba
148 16:01:53 - org.apache.qpid.broker:queue:qpid.ha-tx:5b0fa3f5-264c-4477-b6a0-46c0b39a6a79
149 16:19:18 - org.apache.qpid.broker:queue:qpid.ha-tx:5b8eb722-1a3d-4c46-97b1-26b108703044
150 16:12:06 - org.apache.qpid.broker:queue:qpid.ha-tx:6457ac02-fcb7-4144-b0c2-3339cfa2cf4f
151 16:04:54 - org.apache.qpid.broker:queue:qpid.ha-tx:701ac0c7-b3fd-41da-8832-70dc700cf0f3
152 16:31:18 - org.apache.qpid.broker:queue:qpid.ha-tx:70b0f4a0-2de0-49f7-ad6b-08506dbd211f
153 16:09:06 - org.apache.qpid.broker:queue:qpid.ha-tx:7b171b6a-8f17-4519-9fe5-204fc5626529
154 16:34:25 - org.apache.qpid.broker:queue:qpid.ha-tx:7c003fc5-9986-4598-8950-c409ff86b709
155 16:09:06 - org.apache.qpid.broker:queue:qpid.ha-tx:7c6c8f9a-8eab-4446-9bc2-50deaf66d627
156 16:15:06 - org.apache.qpid.broker:queue:qpid.ha-tx:80c0e04c-5f80-4f75-95ba-81091292cd4d
157 16:12:06 - org.apache.qpid.broker:queue:qpid.ha-tx:8c3aea7e-f333-48c8-829b-564f89d600ba
158 16:22:23 - org.apache.qpid.broker:queue:qpid.ha-tx:a11e7cbd-d5c0-4a50-b94d-d6bd8f36d8bc
159 16:17:12 - org.apache.qpid.broker:queue:qpid.ha-tx:ad1ab060-2a6f-41e8-80c2-16aabdfcf8de
160 16:28:23 - org.apache.qpid.broker:queue:qpid.ha-tx:b0454339-4759-44de-906f-d1bf3465a2cb
161 16:10:00 - org.apache.qpid.broker:queue:qpid.ha-tx:b4c5ad63-ec07-4619-b45c-35593894e072
162 16:04:54 - org.apache.qpid.broker:queue:qpid.ha-tx:d57c3e52-8f65-44e4-9595-6e8440d39978
163 16:19:23 - org.apache.qpid.broker:queue:qpid.ha-tx:da34735a-6e9c-44c0-bd8c-c8368ab74912
164 16:22:18 - org.apache.qpid.broker:queue:qpid.ha-tx:dad4938e-7308-4c9e-b648-49494da41682
165 16:04:54 - org.apache.qpid.broker:queue:qpid.ha-tx:e82f5575-9e2b-4567-be77-fb2a766fa74b
166 16:07:54 - org.apache.qpid.broker:queue:qpid.ha-tx:ead7752b-54f0-4e5b-a389-5d2b46b8949a
167 16:34:23 - org.apache.qpid.broker:queue:qpid.ha-tx:ecd4b7e9-1199-44be-927a-b5230ed8aa3c
168 16:07:00 - org.apache.qpid.broker:queue:qpid.ha-tx:ed388289-24b8-4415-9b3e-3568874c6f39
169 16:34:19 - org.apache.qpid.broker:queue:qpid.ha-tx:ed4417bf-ee74-4928-ad2d-989b3ed6b44a
170 16:12:06 - org.apache.qpid.broker:queue:qpid.ha-tx:f64dd47f-c174-40a9-aba9-f39c4ce9f5c6
171 16:01:53 - org.apache.qpid.broker:queue:qpid.ha-tx:fc85c5e7-be8e-44fd-8521-cf99dcfc3f96
172 16:34:38 - org.apache.qpid.broker:queue:reply-GZ-GAME-19.65.19797.1
173 16:34:38 - org.apache.qpid.broker:queue:topic-GZ-GAME-19.65.19797.1
195 16:34:53 16:34:53 org.apache.qpid.broker:queue:64fdbc38-22a9-4950-91d6-28bdc4e9cf8d:0.0
broker: qpid-cpp-0.26 client version: qpid-client-0.26.jar qpid-common-0.26.jar when i set "sessionTransacted" false the it's OK. Created attachment 911563 [details]
Simple reproducer
The attached modified version of the Spout example reproduces the problem quickly. I have a 3 broker local cluster (same machine) with configuration:
auth=no
log-enable=info+
log-enable=debug+:ha
ha-cluster=yes
ha-replicate=all
ha-brokers-url=localhost:6000,localhost:6001,localhost:6002
Run the attached as follows:
java org/apache/qpid/example/Spout -b x:y@localhost:6000 --content FOOBAR -c 100000 'q;{create:always}'
In a few seconds it hangs, then after another minute it exits with
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 107, point = 109
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3465)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:843)
at org.apache.qpid.example.Spout.<init>(Spout.java:102)
at org.apache.qpid.example.Spout.main(Spout.java:149)
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 107, point = 109
at org.apache.qpid.transport.Session.sync(Session.java:867)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1037)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:830)
... 2 more
The stack trace matches that from the bug report
I have a tentative fix but I want to consult a bit to see if there's a better one before I claim to have fixed this. The fix, with reproducer, is up for review at https://reviews.apache.org/r/23096/ (In reply to Alan Conway from comment #7) > I have a tentative fix but I want to consult a bit to see if there's a > better one before I claim to have fixed this. > > The fix, with reproducer, is up for review at > https://reviews.apache.org/r/23096/ This fix is not acceptable. It masks the problem by making the client force an extra sync on the broker but this should not be necessary so we need a broker side fix. There appears to be a race condition where the broker sometimes fails to send a completion for the transaction commit command. *** Bug 1117863 has been marked as a duplicate of this bug. *** Bug 1117863 shows that this problem occurs with C++ client also which confirms it is a broker bug. It also shows that in the C++ case the completion IS sent eventually (c++ client is more patient) and client can continue. That should help track down the problem. *** Bug 1083676 has been marked as a duplicate of this bug. *** Updated review board with broker-side fix, I think this solves the problem but will wait for review feedback. https://reviews.apache.org/r/23096 ------------------------------------------------------------------------ r1619816 | aconway | 2014-08-22 10:13:14 -0400 (Fri, 22 Aug 2014) | 15 lines QPID-5855: JAVA Client Can not recieve message with qpid ha cluster "Session exception occured while trying to commit" The problem: the java client sets the sync flag on tx.commit and then waits for completion of the entire transaction. According to the 0-10 spec, this is correct, the commit (or rollback) will not complete until all of the transactional commands have completed. However the C++ broker was sometimes completing a commit *before* one of the the corresponding enqueues. It issued the completions up to the commit (because the commit is makred sync) but there is a "hole" for the incomplete enqueue. The enqueue is not marked sync so when this hole is filled no completion is sent and the client hangs. Fix: make tx.commit a "sync point", that is it behaves like execution.sync and is not completed till all preceeding commands are complete. Note tx.rollback does not need modification as it is never completed asynchronously. ------------------------------------------------------------------------ Created attachment 929663 [details]
Final reproducer with script to measure tx times.
This fix causes a regression: failure of qpid-txtest2 when run with linear store. Will post additional fix shortly. Fix requires the following additional commit: ------------------------------------------------------------------------ r1621368 | aconway | 2014-08-29 15:29:06 -0400 (Fri, 29 Aug 2014) | 12 lines QPID-5855: Fix to JAVA Client Can not recieve message with qpid ha cluster. The original fix for this introduced a regression, running the qpid-txttest2 test against a cluster with the linear store failed. This fixes the fix. - Run transaction commit logic when the commit completes. Report completion to the user only when all prior commands have completed (sync point) - Fix missing initializer in client/amqp0_10/SessionImpl.cpp for transaction committing flag. - Remove annoying log messages from IdSetter.h - Skip transactional messages in prepare, don't wait till commit. - Added fetch-timeout option to qpid-txtest2 ------------------------------------------------------------------------ Fixed on branch 0.30-mrg-acownway-bz1095849 Created attachment 1007187 [details]
Qpid broker ha logs
Same error showed up, when sending a large amount of messages, using provided "simple reproducer" - modified Spout.java from example package, using default qpid.sync_op_timeout=60000.
Test was conducted on 3x rhel6x VMs set up for qpid-ha cluster.
Same result with "qpid.sync_op_timeout=5000" (error showed up sooner).
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 2893, point = 2895
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3439)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:843)
at org.apache.qpid.example.Spout.<init>(Spout.java:102)
at org.apache.qpid.example.Spout.main(Spout.java:150)
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 2893, point = 2895
at org.apache.qpid.transport.Session.sync(Session.java:867)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1022)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:830)
... 2 more
Following log data is from "qpid.sync_op_timeout=60000".
./run_example.sh org.apache.qpid.example.Spout -b guest/guest.6.251:5672 --content FOOBAR -c 1000000 --broker-option=sasl_mechs=ANONYMOUS 'q;{create:always}'
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 472555, point = 472557
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3439)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:843)
at org.apache.qpid.example.Spout.<init>(Spout.java:102)
at org.apache.qpid.example.Spout.main(Spout.java:149)
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 472555, point = 472557
at org.apache.qpid.transport.Session.sync(Session.java:867)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1022)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:830)
... 2 more
---------------------------
qpid-stat -q -b 192.168.6.251
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
===========================================================================================================================================================================================================
1937b4b8-9381-40f1-9455-4b41622be0bf:0.0 Y Y 0 0 0 0 0 0 1 2
fb9b1ef0-f159-4fd7-8265-10828e012fe8:0.0 Y Y 0 0 0 0 0 0 1 2
q 256k 256k 0 33.1m 33.1m 0 0 1
qpid.bridge_queue_qpid.broker-replicator.bridge.431bb174-7fac-4390-abcd-efd1f368eb8d_aac99a43-3c08-4a10-98cc-1075c683f31f Y Y 0 311 311 0 224k 224k 1 3
qpid.bridge_queue_qpid.broker-replicator.bridge.87e4e771-de29-49ab-a27b-c652d711f537_9c39b6c7-73b2-488b-a39d-f335a64bcbd6 Y Y 0 301 301 0 221k 221k 1 3
--------------------------
Tested on packages
qpid-cpp-server-0.30-7.el6.x86_64
qpid-cpp-server-ha-0.30-7.el6.x86_64
Provided attachment 1007187 [details] are qpid-logs from all nodes.
Moving back to assigned.
Created attachment 1008641 [details]
Updated reproducer, not able to reproduce the issue.
I have not been able to reproduce the problem, have been running all day.
My current reproducer is attached, please compare to yours and let me know of any differences.
How long does it take for you to reproduce the failure (time or message count)?
I'm running on 3 RHEL 6.4 bare metal boxes, with a 0.30-mrg-checkpoint7 release build which should correspond to yours.
Can you think of any other details of your environment that I might need to reproduce? Is there any possibility you could give me access to your VMs so I can investigate directly there?
[0,root@mrg-qe-26 ~/bz1095849]$ ./runme ./runme: line 18: /root/rh-qpid/release/src/tests/test_env.sh: No such file or directory [1,root@mrg-qe-26 ~/bz1095849]$ ls qpid-brk-ha-logs.tar qpidd.log_1_primary qpidd.log_2 qpidd.log_3 runme Spout.java So I am back to using only provided Spout.java from example package, with "transacted session" & "ssn.commit()" after each message. [0,root@mrg-qe-26 ~/bz1095849]$ rpm -qa qpid-java* qpid-java-client-0.30-4.el6.noarch qpid-java-common-0.30-4.el6.noarch qpid-java-example-0.30-4.el6.noarch [0,root@mrg-qe-26 /usr/share/doc/qpid-java-0.30/examples]$ ./run_example.sh org.apache.qpid.example.Spout -b=guest:guest.6.251:5672 -c 1000000 "test;{create:always}" {b=guest:guest.6.251:5672, c=1000000} Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 1040079, point = 1040081 at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3688) at org.apache.qpid.client.AMQSession.commit(AMQSession.java:971) at org.apache.qpid.example.Spout.<init>(Spout.java:102) at org.apache.qpid.example.Spout.main(Spout.java:148) Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 1040079, point = 1040081 at org.apache.qpid.transport.Session.sync(Session.java:867) at org.apache.qpid.transport.Session.sync(Session.java:837) at org.apache.qpid.transport.Session.invoke(Session.java:789) at org.apache.qpid.transport.Session.invoke(Session.java:627) at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145) at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1033) at org.apache.qpid.client.AMQSession.commit(AMQSession.java:958) ... 2 more 130root@ha_rhel6i_1:~# rpm -qa |grep qpid-java qpid-java-client-0.30-4.el6.noarch qpid-java-common-0.30-4.el6.noarch qpid-java-example-0.30-4.el6.noarch root@ha_rhel6i_1:~# rpm -qa |grep qpid-cpp qpid-cpp-client-devel-docs-0.30-7.el6.noarch qpid-cpp-client-rdma-0.30-7.el6.i686 qpid-cpp-server-0.30-7.el6.i686 qpid-cpp-client-devel-0.30-7.el6.i686 qpid-cpp-server-xml-0.30-7.el6.i686 qpid-cpp-server-rdma-0.30-7.el6.i686 qpid-cpp-client-0.30-7.el6.i686 qpid-cpp-server-devel-0.30-7.el6.i686 qpid-cpp-server-linearstore-0.30-7.el6.i686 qpid-cpp-server-ha-0.30-7.el6.i686 root@ha_rhel6i_1:~# ip r 192.168.6.0/24 dev eth1 proto kernel scope link src 192.168.6.11 ... root@ha_rhel6i_2:~# qpid-stat -q -b 192.168.6.251:5672 Queues queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind =========================================================================================================================================================================================================== 1c8b3172-a86e-4223-9c65-f9cd1267e0e5:0.0 Y Y 0 0 0 0 0 0 1 2 qpid.bridge_queue_qpid.broker-replicator.bridge.7b55a587-aa72-4780-bff3-0d5201f406e0_2983b6bd-f92b-4ee4-9f36-c5a3d5a3b406 Y Y 0 49 49 0 46.3k 46.3k 1 3 test 520k 520k 0 68.1m 68.1m 0 0 1 ------------------------------------------------------------------------------------------------------------------------------------------------------ 2nd try added following line to Spout client (3 changes in total to example Spout.java client) System.setProperty("qpid.sync_op_timeout", "5000"); [0,root@mrg-qe-26 /usr/share/doc/qpid-java-0.30/examples]$ ./run_example.sh org.apache.qpid.example.Spout -b=guest:guest.6.251:5672 -c 1000000 "test2;{create:always}" {b=guest:guest.6.251:5672, c=1000000} Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 818269, point = 818271 at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3688) at org.apache.qpid.client.AMQSession.commit(AMQSession.java:971) at org.apache.qpid.example.Spout.<init>(Spout.java:102) at org.apache.qpid.example.Spout.main(Spout.java:148) Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 818269, point = 818271 at org.apache.qpid.transport.Session.sync(Session.java:867) at org.apache.qpid.transport.Session.sync(Session.java:837) at org.apache.qpid.transport.Session.invoke(Session.java:789) at org.apache.qpid.transport.Session.invoke(Session.java:627) at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145) at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1033) at org.apache.qpid.client.AMQSession.commit(AMQSession.java:958) ... 2 more root@ha_rhel6i_2:~# clustat Cluster Status for dtests_ha @ Tue Mar 31 12:01:56 2015 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ 192.168.6.11 1 Online, rgmanager 192.168.6.12 2 Online, Local, rgmanager 192.168.6.13 3 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:qpidd_1 192.168.6.11 started service:qpidd_2 (192.168.6.12) failed service:qpidd_3 192.168.6.13 started service:qpidd_primary 192.168.6.11 started root@ha_rhel6i_2:~# qpid-stat -q -b 192.168.6.251:5672 Queues queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind =========================================================================================================================================================================================================== 79bc2e25-c996-44e8-987b-c4c81f9dc37e:0.0 Y Y 0 0 0 0 0 0 1 2 qpid.bridge_queue_qpid.broker-replicator.bridge.ef461632-409f-4f1d-9382-ba00697112a2_172066a3-252f-4c93-89f5-f91f84adaa02 Y Y 0 3.19k 3.19k 0 2.18m 2.18m 1 3 qpid.ha-tx:805d9081-0cfb-46dc-bc5d-68ef02595b97 Y 0 0 0 0 0 0 1 1 test 520k 520k 0 45.8m 45.8m 0 0 1 test2 409k 409k 0 54.0m 54.0m 0 0 1 I have also noticed, that both times one of the cluster nodes died. I _think_ that in previous runs, the brokers did not died after error reported from client. I will make another few runs and look into broker debug logs for possible detailed errors... I will provide you credentials/ips information on machines via email. Yes but only on Michal Toth's VMs and it is very slow to reproduce. I will get to the bottom of it but no idea how long it will take. Root cause appears to be mis-handling of backup connections that are timed out due to heartbeat, resulting in primary believing the backups are still connected but backups not receiving data. This needs to be fixed. Not clear why heartbeat timeout is being triggered, that also needs to be understood - are they simply set too low or is something holding up processing. I believe I finally understand the cause of this bug and have a fix. The fix is not trivial but it is isolated to transactions under HA, it will simplify the code and I think will give a substantial performance improvement. I hope to have it implemented by end of the week. http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.30-mrg-aconwy-bz1095849 Bug 1095849 - JAVA Client Can not recieve message with qpid ha cluster "Session exception occured while trying to commit"0.30-mrg-aconwy-bz1095849 Removed complex and incorrect HA+TX logic, reverted to the following limitation: You can use transactions in a HA cluster, but there are limitations on the transactional guarantees. Transactions function normally with the *primary* broker but replication to the backups is not coverted by the atomic guarantee. The following situations are all safe: - Client rolls back a transaction. - Client successfully commits a transaction. - Primary fails during a transaction *before* the client sends a commit. - Transaction contains only one message. The problem case is when all of the following occur: - transaction contains multiple actions (enqueues or dequeues) - primary fails between client sending commit and receiving commit-complete. In this case it is possible that only part of the transaction was replicated to the backups, so on fail-over partial transaction results may be visible. Rebased to 0.34-mrg branch http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.34-mrg-aconway-bz1095849 Also on trunk as r1701109 | QPID-5855 - Simplified HA transaction logic. I regrettably report, that I have been able to reproduce the issue on the latest packages (0.34-4 qpid-cpp-server).
Client received either the same, or similar exception. I have been testing on rhel6 32 and 64 bit architectures and on two different
physical machines hosting 2 different HA clusters. Minority of test runs (1 million of messages) have passed. Most of them failed with either
Exception [1],[2] or [3]. None of the HA nodes have been killed or failed.
[1] -
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 341865, point = 341867
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3688)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:971)
at org.apache.qpid.example.Spout.<init>(Spout.java:99)
at org.apache.qpid.example.Spout.main(Spout.java:147)
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 341865, point = 341867
at org.apache.qpid.transport.Session.sync(Session.java:867)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1033)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:958)
[2] -
IoReceiver - /192.168.6.251:5672 2015-09-15 16:40:03,743 ERROR [apache.qpid.client.AMQConnection] Throwable Received but no listener set: org.apache.qpid.AMQException: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=492129, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={}) [error code 541: internal error]
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=492129, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={})
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3688)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:971)
at org.apache.qpid.example.Spout.<init>(Spout.java:99)
at org.apache.qpid.example.Spout.main(Spout.java:147)
Caused by: org.apache.qpid.transport.SessionException: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=492129, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={})
at org.apache.qpid.transport.Session.sync(Session.java:876)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1033)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:958)
[3] -
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: Session sync was interrupted by failover.
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3688)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:971)
at org.apache.qpid.example.Spout.<init>(Spout.java:99)
at org.apache.qpid.example.Spout.main(Spout.java:147)
Caused by: org.apache.qpid.transport.SessionException: Session sync was interrupted by failover.
at org.apache.qpid.transport.Session.checkFailoverRequired(Session.java:826)
at org.apache.qpid.transport.Session.sync(Session.java:855)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1033)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:958)
----
Reproducer is fairly easy. Execute commit on session after each sent message.
I think that setting property "qpid.sync_op_timeout" to "5000" does not make difference.
Here are some of the executions:
(Upstream spout example client committing after each sent message)
$ ./run_example.sh org.apache.qpid.example.Spout -b guest:guest.6.251:5672 -c 1000000 "test;{create:always}"
{b=guest:guest.6.251:5672, c=1000000}
0
1000
2000
...
170000
Exception in thread "main" javax.jms.JMSException: Session exception occured while trying to commit: timed out waiting for sync: complete = 341865, point = 341867
at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3688)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:971)
at org.apache.qpid.example.Spout.<init>(Spout.java:99)
at org.apache.qpid.example.Spout.main(Spout.java:147)
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 341865, point = 341867
at org.apache.qpid.transport.Session.sync(Session.java:867)
at org.apache.qpid.transport.Session.sync(Session.java:837)
at org.apache.qpid.transport.Session.invoke(Session.java:789)
at org.apache.qpid.transport.Session.invoke(Session.java:627)
at org.apache.qpid.transport.SessionInvoker.txCommit(SessionInvoker.java:145)
at org.apache.qpid.client.AMQSession_0_10.commitImpl(AMQSession_0_10.java:1033)
at org.apache.qpid.client.AMQSession.commit(AMQSession.java:958)
... 2 more
qpid-cpp-server-0.34-4.el6.i686
qpid-cpp-server-ha-0.34-4.el6.i686
qpid-java-client-0.30-7.el6.noarch
qpid-java-common-0.30-7.el6.noarch
==> Not OK -> bug is still present in the latest packages
root@ha_rhel6i_1:/var/dtests/node_data/clients# ./qc2_spout.java.sh -b guest/guest.6.251:5672 -c 1000000 --tx-batch-size 1 "test;{create:always}"
main 2015-09-16 09:40:02,336 WARN [apache.qpid.example.CoreClient] Connection=amqp://guest:guest@test/test?sync_ack='true'&sync_publish='all'&brokerlist='tcp://192.168.6.251:5672'
main 2015-09-16 09:40:02,336 WARN [apache.qpid.example.CoreClient] Connection=amqp://guest:guest@test/test?sync_ack='true'&sync_publish='all'&brokerlist='tcp://192.168.6.251:5672'
IoReceiver - /192.168.6.251:5672 2015-09-16 09:52:05,525 ERROR [apache.qpid.example.MessagingExceptionListener] Error
javax.jms.JMSException: Exception thrown against AMQConnection:
Host: 192.168.6.251
Port: 5672
Virtual Host: test
Client ID: test
Active session count: 0: org.apache.qpid.AMQException: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=1735025, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={}) [error code 541: internal error]
at org.apache.qpid.client.AMQConnection.exceptionReceived(AMQConnection.java:1297)
at org.apache.qpid.client.AMQSession_0_10.setCurrentException(AMQSession_0_10.java:1097)
at org.apache.qpid.client.AMQSession_0_10.exception(AMQSession_0_10.java:935)
at org.apache.qpid.transport.SessionDelegate.executionException(SessionDelegate.java:182)
at org.apache.qpid.transport.SessionDelegate.executionException(SessionDelegate.java:32)
at org.apache.qpid.transport.ExecutionException.dispatch(ExecutionException.java:103)
at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:55)
at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:50)
at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:32)
at org.apache.qpid.transport.Method.delegate(Method.java:159)
at org.apache.qpid.transport.Session.received(Session.java:596)
at org.apache.qpid.transport.Connection.dispatch(Connection.java:451)
at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
at org.apache.qpid.transport.MethodDelegate.executionException(MethodDelegate.java:110)
at org.apache.qpid.transport.ExecutionException.dispatch(ExecutionException.java:103)
at org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:54)
at org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:40)
at org.apache.qpid.transport.Method.delegate(Method.java:159)
at org.apache.qpid.transport.Connection.received(Connection.java:404)
at org.apache.qpid.transport.Connection.received(Connection.java:69)
at org.apache.qpid.transport.network.Assembler.emit(Assembler.java:97)
at org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:198)
at org.apache.qpid.transport.network.Assembler.frame(Assembler.java:131)
at org.apache.qpid.transport.network.Frame.delegate(Frame.java:128)
at org.apache.qpid.transport.network.Assembler.received(Assembler.java:102)
at org.apache.qpid.transport.network.Assembler.received(Assembler.java:44)
at org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:199)
at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:114)
at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:46)
at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:161)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.qpid.AMQException: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=1735025, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={}) [error code 541: internal error]
at org.apache.qpid.client.AMQSession_0_10.setCurrentException(AMQSession_0_10.java:1081)
... 30 more
main 2015-09-16 09:52:05,541 ERROR [apache.qpid.example.CoreClient] Error while sending message!
javax.jms.JMSException: Session exception occured while trying to commit: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=1735025, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={})
main 2015-09-16 09:52:05,541 ERROR [apache.qpid.example.CoreClient] Error while sending message!
javax.jms.JMSException: Session exception occured while trying to commit: ch=0 id=1 ExecutionException(errorCode=INTERNAL_ERROR, commandId=1735025, classCode=5, commandCode=2, fieldIndex=0, description=internal-error: Transaction prepare failed., errorInfo={})
This product has been discontinued or is no longer tracked in Red Hat Bugzilla. |