Bug 620808
Summary: failover does not work properly
Product: Red Hat Enterprise MRG
Component: qpid-java
Version: beta
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: 1.3
Target Release: ---
Reporter: Jiri Kolar <jkolar>
Assignee: Rajith Attapattu <rattapat+nobody>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: gsim, rattapat+nobody, rmusil, tross
Doc Type: Bug Fix
Last Closed: 2010-10-20 11:30:28 UTC
Created attachment 436289 [details]
Producer class
Created attachment 436290 [details]
Producer class source
Jiri, I have run your script (with modifications only for paths) against the MRG 1.3 release branch and it works as expected. Can you also check out a new machine from Beaker, deploy the latest rpms, and run your script? Rajith

After more investigation and modifying the original script, I was able to reproduce the issue. The problem only occurs if the producer (or consumer) is using an address string; if you run the same example using the BURL syntax, there is no issue.

(*) The resume method in o.a.q.transport.Connection takes a lock when it enters, and in turn calls the "resumed" method in the AMQSession_0_10 class.
(*) That method essentially recreates the producers and consumers, which causes the address string to be re-evaluated. In doing so, a synchronous ExchangeQuery is made.
(*) The IoReceiver thread tries to dispatch the reply but is unable to, as it cannot grab the lock, which is already held by the failover thread. The failover thread does not release the lock, because it is waiting for the response to the exchange query.
(*) Hence the client appears to be stuck until the wait for the response times out, and the failover operation fails.

Following are two snippets from a thread dump that illustrate the issue.
"IoReceiver - /10.16.44.241:5673" daemon prio=10 tid=0x000000000cdf0800 nid=0x1afe waiting for monitor entry [0x0000000041c1a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.qpid.transport.Connection.getSession(Connection.java:440)
        - waiting to lock <0x00002aab6075e0e8> (a java.lang.Object)
        at org.apache.qpid.transport.Connection.dispatch(Connection.java:380)
        at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
        at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
        at org.apache.qpid.transport.MethodDelegate.sessionAttached(MethodDelegate.java:68)
        at org.apache.qpid.transport.SessionAttached.dispatch(SessionAttached.java:91)
        at org.apache.qpid.transport.ConnectionDelegate.control(ConnectionDelegate.java:49)
        at org.apache.qpid.transport.ConnectionDelegate.control(ConnectionDelegate.java:40)
        at org.apache.qpid.transport.Method.delegate(Method.java:163)
        at org.apache.qpid.transport.Connection.received(Connection.java:348)
        at org.apache.qpid.transport.Connection.received(Connection.java:55)
        at org.apache.qpid.transport.network.Assembler.emit(Assembler.java:98)
        at org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:184)
        at org.apache.qpid.transport.network.Assembler.frame(Assembler.java:132)
        at org.apache.qpid.transport.network.Frame.delegate(Frame.java:133)
        at org.apache.qpid.transport.network.Assembler.received(Assembler.java:103)
        at org.apache.qpid.transport.network.Assembler.received(Assembler.java:48)
        at org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:187)
        at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:103)
        at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:42)
        at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:128)
        at java.lang.Thread.run(Thread.java:636)

"IoReceiver - localhost/127.0.0.1:5672" daemon prio=10 tid=0x00002aabbc1a8000 nid=0x1aee in Object.wait() [0x0000000040bee000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aab6155be60> (a org.apache.qpid.transport.Session$ResultFuture)
        at org.apache.qpid.transport.util.Waiter.await(Waiter.java:54)
        at org.apache.qpid.transport.Session$ResultFuture.get(Session.java:837)
        - locked <0x00002aab6155be60> (a org.apache.qpid.transport.Session$ResultFuture)
        at org.apache.qpid.transport.Session$ResultFuture.get(Session.java:859)
        at org.apache.qpid.client.AMQSession_0_10.resolveAddressType(AMQSession_0_10.java:1219)
        at org.apache.qpid.client.AMQSession_0_10.handleAddressBasedDestination(AMQSession_0_10.java:1160)
        at org.apache.qpid.client.BasicMessageProducer_0_10.declareDestination(BasicMessageProducer_0_10.java:81)
        at org.apache.qpid.client.BasicMessageProducer.resubscribe(BasicMessageProducer.java:174)
        at org.apache.qpid.client.AMQSession.resubscribeProducers(AMQSession.java:2782)
        at org.apache.qpid.client.AMQSession.resubscribe(AMQSession.java:2038)
        at org.apache.qpid.client.AMQSession_0_10.resumed(AMQSession_0_10.java:871)
        at org.apache.qpid.transport.Session.resume(Session.java:265)
        - locked <0x00002aab61761350> (a [Lorg.apache.qpid.transport.Method;)
        at org.apache.qpid.transport.Connection.resume(Connection.java:452)
        - locked <0x00002aab6075e0e8> (a java.lang.Object)
        at org.apache.qpid.client.AMQConnectionDelegate_0_10.closed(AMQConnectionDelegate_0_10.java:243)
        at org.apache.qpid.transport.Connection.closed(Connection.java:530)
        at org.apache.qpid.transport.network.Assembler.closed(Assembler.java:113)
        at org.apache.qpid.transport.network.InputHandler.closed(InputHandler.java:202)
        at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:150)
        at java.lang.Thread.run(Thread.java:636)

This issue is tracked upstream via QPID-2808. It was decided not to fix the locking issue at this time; instead, a workaround was put in place to avoid the code path that could trigger it.
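The lock inversion visible in the two dumps can be sketched in plain Java. This is a simplified model, not the actual qpid code: the names connectionLock, exchangeQueryResult, and simulate are hypothetical stand-ins for the Connection monitor, the Session$ResultFuture, and the resume path. The "failover" thread holds the connection lock while blocking on the query future; the "IoReceiver" thread, which would complete the future, needs the same lock to dispatch, so the wait can only end by timing out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailoverDeadlockSketch {

    /** Runs the two-thread scenario and returns how it ended. */
    static String simulate() throws InterruptedException {
        final Object connectionLock = new Object();                  // models Connection's monitor
        final CompletableFuture<String> exchangeQueryResult = new CompletableFuture<>();
        final StringBuilder outcome = new StringBuilder();

        // "Failover" thread: resume() takes the connection lock, then the
        // resumed() path re-evaluates the address string, issuing a
        // synchronous exchange query that blocks on the result future.
        Thread failover = new Thread(() -> {
            synchronized (connectionLock) {
                try {
                    exchangeQueryResult.get(500, TimeUnit.MILLISECONDS);
                    outcome.append("failover completed");
                } catch (TimeoutException e) {
                    outcome.append("failover stuck: reply never dispatched");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });

        // "IoReceiver" thread: the broker's reply arrives, but dispatching it
        // requires the connection lock, which the failover thread still holds.
        Thread receiver = new Thread(() -> {
            synchronized (connectionLock) {          // blocks until failover gives up
                exchangeQueryResult.complete("exchange-query-reply");
            }
        });

        failover.start();
        Thread.sleep(100);   // let the failover thread grab the lock first
        receiver.start();
        failover.join();
        receiver.join();
        return outcome.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulate());   // prints the "stuck" outcome
    }
}
```

For context, as I understand the two syntaxes: the trigger is the address-string form (e.g. "my-queue; {create: always}"), which requires an exchange query to resolve the node type; the BURL form (e.g. "BURL:direct://amq.direct//my-queue") encodes the type up front and does not hit this path.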
This workaround is tracked upstream via QPID-2809. In addition to being a workaround, the fix also improves the code by caching the results of a destination lookup, eliminating duplicate work. I have ported the upstream fixes to the internal git repo using the following commits:

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=f3b4a96b13883746d0b94b9cf31baea988c99864
http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=3279fe59f00163a5574b424b05b69b336005fbab
http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=e03ad451b8037c60d036d4260fe79f0ca34bd445

fixed in qpid-java-client-0.7.946106-10.el5

validated on RHEL 5.5 i386 / x86_64

packages:
# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.6-16.el5_5.7
openais-devel-0.80.6-16.el5_5.7
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-client-devel-0.7.946106-17.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-14.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-java-common-0.7.946106-10.el5
qpid-tools-0.7.946106-11.el5
rhm-docs-0.7.946106-5.el5

test package: rh-tests-distribution-MRG-Messaging-qpid_common-1.6-53

-> VERIFIED
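The caching half of the QPID-2809 fix can be sketched generically. This is a hypothetical illustration, not the actual qpid code: the DestinationCache name and the resolver function are stand-ins for the real destination-resolution logic. The idea is to resolve each address string at most once and memoize the result, so resubscribing producers and consumers after failover does not repeat the synchronous exchange query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class DestinationCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;   // models the exchange query
    final AtomicInteger lookups = new AtomicInteger(); // counts real broker round trips

    DestinationCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Resolve an address string, hitting the broker at most once per address. */
    String resolve(String address) {
        return cache.computeIfAbsent(address, a -> {
            lookups.incrementAndGet();
            return resolver.apply(a);    // the expensive synchronous query
        });
    }

    public static void main(String[] args) {
        DestinationCache c = new DestinationCache(a -> "type-of:" + a);
        c.resolve("my-queue; {create: always}");
        c.resolve("my-queue; {create: always}");   // served from the cache
        System.out.println("lookups=" + c.lookups.get());   // prints lookups=1
    }
}
```

In the real fix the cached value would be the resolved node type rather than a string, but the effect is the same: re-subscription after failover no longer triggers the query that could hit the lock.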
Created attachment 436288 [details]
my reproducer with described steps as a bash script

Description of problem:
When a 2-node cluster is started, a Java client is connected in failover_exchange mode and sending messages, and the node the client is connected to is killed, the client does not fail over properly.

Version-Release number of selected component (if applicable):
1.3 beta4

How reproducible:
always

Steps to Reproduce:
1. Start a 2-node cluster.
2. Compile and start the attached Producer (broker URL and port are hardcoded in the Producer).
3. Kill the node the Producer is connected to.

Actual results:
The client fails over to another node but does not continue sending messages.

Expected results:
The client should continue sending messages.

Additional info:
I discussed this with Rajith and he promised to take a look, but I am filing a bug anyway to make sure the problem is known. There may be something wrong with my reproducer as well, but I have no clue what.
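The expected behaviour in the reproducer — keep sending after the connected node dies — amounts to retrying the send against the remaining brokers in the cluster. A minimal, hypothetical sketch of that retry loop (the broker names and the trySend callback are illustrative; the real client drives this through its failover machinery, not application code):

```java
import java.util.List;
import java.util.function.Predicate;

public class FailoverSendSketch {

    /**
     * Try each broker in order until one accepts the message.
     * Returns the broker that succeeded, or null if all failed.
     */
    static String sendWithFailover(List<String> brokers, Predicate<String> trySend) {
        for (String broker : brokers) {
            if (trySend.test(broker)) {
                return broker;       // message delivered
            }
            // connection refused / node killed: fall through to the next broker
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulate the reproducer: node1 has been killed, node2 is still alive.
        List<String> cluster = List.of("tcp://node1:5672", "tcp://node2:5672");
        String used = sendWithFailover(cluster, b -> b.contains("node2"));
        System.out.println("sent via " + used);   // prints sent via tcp://node2:5672
    }
}
```

The bug is that, with an address-string destination, the client's equivalent of this loop stalls inside resubscription instead of reaching the live node.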