Bug 620808 - failover does not work properly
Summary: failover does not work properly
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-java
Version: beta
Hardware: All
OS: Linux
Target Milestone: 1.3
: ---
Assignee: Rajith Attapattu
QA Contact: MRG Quality Engineering
Depends On:
TreeView+ depends on / blocked
Reported: 2010-08-03 14:35 UTC by Jiri Kolar
Modified: 2010-10-20 11:30 UTC (History)
4 users (show)

Clone Of:
Last Closed: 2010-10-20 11:30:28 UTC

Attachments (Terms of Use)
my reproducer with described steps as a bash script (1.66 KB, text/x-sh)
2010-08-03 14:35 UTC, Jiri Kolar
no flags Details
Producer class (5.60 KB, text/x-java)
2010-08-03 14:36 UTC, Jiri Kolar
no flags Details
Producer class source (5.60 KB, text/x-java)
2010-08-03 14:37 UTC, Jiri Kolar
no flags Details

Description Jiri Kolar 2010-08-03 14:35:36 UTC
Created attachment 436288 [details]
my reproducer with described steps as a bash script

Description of problem:
When 2node cluster is started and java client connected in failover_exchange mode, messages being send and then node where client is connected to killed, client does not failover properly.

Version-Release number of selected component (if applicable):
1.3 beta4

How reproducible:

Steps to Reproduce:
1.start 2 node cluster

2.Compile and start attached producer (URL of broker and port hardcoded in Producer)

3.Kill node where Producer is connected
Actual results:
Client failover to another node but does not continue sending messages

Expected results:
Client should continue sending messages

Additional info:
I discussed this with Rajith and he promissed to take a look, but i fill a bug anyway to be sure problem is known. There may be something wrong with my Reproducer as well, but i have no clue what

Comment 1 Jiri Kolar 2010-08-03 14:36:26 UTC
Created attachment 436289 [details]
Producer class

Comment 2 Jiri Kolar 2010-08-03 14:37:29 UTC
Created attachment 436290 [details]
Producer class source

Comment 3 Rajith Attapattu 2010-08-09 23:53:54 UTC

I have run your script (with modifications only for paths) against the mrg 1.3 release branch and it works as expected.

Can you also checkout a new machine from beaker, deploy the latest rpms and run your script?


Comment 4 Rajith Attapattu 2010-08-17 02:44:53 UTC
After more investigation and modifying the original script, I was able to reproduce the issue.

The problem only occurs if the producer (or consumer) is using a an addressing string. If you run the same example using the BURL syntax then there is no issue.

(*)The resume method in o.a.q.transport.Connection takes a lock when in enters the resume method and in turn it tries to call the "resumed" methods in the AMQSession_0_10 class. 

(*)That method will essentially recreate the producers and consumers which will cause the address string to be re-evaluated. In doing so a synchronous ExchangeQuery is made.

(*) The IoReceiverThread tries to dispatch it, but unable to do as it cannot grab the 'lock' as it is already held by the failover thread.
The failover thread is not releasing the lock as it is waiting for the response from the exchange query method.

(*) Hence the client appears to be stuck until the wait for the response times out and the failover operation fails :)

Comment 5 Rajith Attapattu 2010-08-17 02:54:54 UTC
Following are two snippets from a thread dump that illustrates the issue.

"IoReceiver - /" daemon prio=10 tid=0x000000000cdf0800 nid=0x1afe waiting for monitor entry [0x0000000041c1a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.qpid.transport.Connection.getSession(Connection.java:440)
	- waiting to lock <0x00002aab6075e0e8> (a java.lang.Object)
	at org.apache.qpid.transport.Connection.dispatch(Connection.java:380)
	at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
	at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
	at org.apache.qpid.transport.MethodDelegate.sessionAttached(MethodDelegate.java:68)
	at org.apache.qpid.transport.SessionAttached.dispatch(SessionAttached.java:91)
	at org.apache.qpid.transport.ConnectionDelegate.control(ConnectionDelegate.java:49)
	at org.apache.qpid.transport.ConnectionDelegate.control(ConnectionDelegate.java:40)
	at org.apache.qpid.transport.Method.delegate(Method.java:163)
	at org.apache.qpid.transport.Connection.received(Connection.java:348)
	at org.apache.qpid.transport.Connection.received(Connection.java:55)
	at org.apache.qpid.transport.network.Assembler.emit(Assembler.java:98)
	at org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:184)
	at org.apache.qpid.transport.network.Assembler.frame(Assembler.java:132)
	at org.apache.qpid.transport.network.Frame.delegate(Frame.java:133)
	at org.apache.qpid.transport.network.Assembler.received(Assembler.java:103)
	at org.apache.qpid.transport.network.Assembler.received(Assembler.java:48)
	at org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:187)
	at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:103)
	at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:42)
	at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:128)
	at java.lang.Thread.run(Thread.java:636)

IoReceiver - localhost/" daemon prio=10 tid=0x00002aabbc1a8000 nid=0x1aee in Object.wait() [0x0000000040bee000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00002aab6155be60> (a org.apache.qpid.transport.Session$ResultFuture)
	at org.apache.qpid.transport.util.Waiter.await(Waiter.java:54)
	at org.apache.qpid.transport.Session$ResultFuture.get(Session.java:837)
	- locked <0x00002aab6155be60> (a org.apache.qpid.transport.Session$ResultFuture)
	at org.apache.qpid.transport.Session$ResultFuture.get(Session.java:859)
	at org.apache.qpid.client.AMQSession_0_10.resolveAddressType(AMQSession_0_10.java:1219)
	at org.apache.qpid.client.AMQSession_0_10.handleAddressBasedDestination(AMQSession_0_10.java:1160)
	at org.apache.qpid.client.BasicMessageProducer_0_10.declareDestination(BasicMessageProducer_0_10.java:81)
	at org.apache.qpid.client.BasicMessageProducer.resubscribe(BasicMessageProducer.java:174)
	at org.apache.qpid.client.AMQSession.resubscribeProducers(AMQSession.java:2782)
	at org.apache.qpid.client.AMQSession.resubscribe(AMQSession.java:2038)
	at org.apache.qpid.client.AMQSession_0_10.resumed(AMQSession_0_10.java:871)
	at org.apache.qpid.transport.Session.resume(Session.java:265)
	- locked <0x00002aab61761350> (a [Lorg.apache.qpid.transport.Method;)
	at org.apache.qpid.transport.Connection.resume(Connection.java:452)
	- locked <0x00002aab6075e0e8> (a java.lang.Object)
	at org.apache.qpid.client.AMQConnectionDelegate_0_10.closed(AMQConnectionDelegate_0_10.java:243)
	at org.apache.qpid.transport.Connection.closed(Connection.java:530)
	at org.apache.qpid.transport.network.Assembler.closed(Assembler.java:113)
	at org.apache.qpid.transport.network.InputHandler.closed(InputHandler.java:202)
	at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:150)
	at java.lang.Thread.run(Thread.java:636)

Comment 6 Rajith Attapattu 2010-08-31 22:06:36 UTC
This issue is tracked in upstream via QPID-2808
It was decided not to fix the locking issue at this time.
Instead a workaround was put in place to avoid the codepath that could cause the locking issue.
This workaround is tracked in upstream via QPID-2809. In addition to being a workaround this fix also improves the code to eliminate duplicate work by caching the results of a destination lookup.

I have ported the upstream fixes to the internal git repo using the following commits.




Comment 7 Jiri Kolar 2010-09-29 15:04:25 UTC
fixed in qpid-java-client-0.7.946106-10.el5

validated on RHEL5.5  i386 / x86_64  

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u



Note You need to log in before you can comment on or make changes to this bug.