Bug 620808

Summary: failover does not work properly
Product: Red Hat Enterprise MRG Reporter: Jiri Kolar <jkolar>
Component: qpid-javaAssignee: Rajith Attapattu <rattapat+nobody>
Status: CLOSED ERRATA QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: urgent Docs Contact:
Priority: urgent    
Version: betaCC: gsim, rattapat+nobody, rmusil, tross
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-20 11:30:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
my reproducer with described steps as a bash script
none
Producer class
none
Producer class source none

Description Jiri Kolar 2010-08-03 14:35:36 UTC
Created attachment 436288 [details]
my reproducer with described steps as a bash script

Description of problem:
When 2node cluster is started and java client connected in failover_exchange mode, messages being send and then node where client is connected to killed, client does not failover properly.

Version-Release number of selected component (if applicable):
1.3 beta4

How reproducible:
always

Steps to Reproduce:
1.start 2 node cluster

2.Compile and start attached producer (URL of broker and port hardcoded in Producer)

3.Kill node where Producer is connected
  
Actual results:
Client failover to another node but does not continue sending messages

Expected results:
Client should continue sending messages

Additional info:
I discussed this with Rajith and he promissed to take a look, but i fill a bug anyway to be sure problem is known. There may be something wrong with my Reproducer as well, but i have no clue what

Comment 1 Jiri Kolar 2010-08-03 14:36:26 UTC
Created attachment 436289 [details]
Producer class

Comment 2 Jiri Kolar 2010-08-03 14:37:29 UTC
Created attachment 436290 [details]
Producer class source

Comment 3 Rajith Attapattu 2010-08-09 23:53:54 UTC
Jiri,

I have run your script (with modifications only for paths) against the mrg 1.3 release branch and it works as expected.

Can you also checkout a new machine from beaker, deploy the latest rpms and run your script?

Rajith

Comment 4 Rajith Attapattu 2010-08-17 02:44:53 UTC
After more investigation and modifying the original script, I was able to reproduce the issue.

The problem only occurs if the producer (or consumer) is using a an addressing string. If you run the same example using the BURL syntax then there is no issue.

(*)The resume method in o.a.q.transport.Connection takes a lock when in enters the resume method and in turn it tries to call the "resumed" methods in the AMQSession_0_10 class. 

(*)That method will essentially recreate the producers and consumers which will cause the address string to be re-evaluated. In doing so a synchronous ExchangeQuery is made.

(*) The IoReceiverThread tries to dispatch it, but unable to do as it cannot grab the 'lock' as it is already held by the failover thread.
The failover thread is not releasing the lock as it is waiting for the response from the exchange query method.

(*) Hence the client appears to be stuck until the wait for the response times out and the failover operation fails :)

Comment 5 Rajith Attapattu 2010-08-17 02:54:54 UTC
Following are two snippets from a thread dump that illustrates the issue.

"IoReceiver - /10.16.44.241:5673" daemon prio=10 tid=0x000000000cdf0800 nid=0x1afe waiting for monitor entry [0x0000000041c1a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.qpid.transport.Connection.getSession(Connection.java:440)
	- waiting to lock <0x00002aab6075e0e8> (a java.lang.Object)
	at org.apache.qpid.transport.Connection.dispatch(Connection.java:380)
	at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
	at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
	at org.apache.qpid.transport.MethodDelegate.sessionAttached(MethodDelegate.java:68)
	at org.apache.qpid.transport.SessionAttached.dispatch(SessionAttached.java:91)
	at org.apache.qpid.transport.ConnectionDelegate.control(ConnectionDelegate.java:49)
	at org.apache.qpid.transport.ConnectionDelegate.control(ConnectionDelegate.java:40)
	at org.apache.qpid.transport.Method.delegate(Method.java:163)
	at org.apache.qpid.transport.Connection.received(Connection.java:348)
	at org.apache.qpid.transport.Connection.received(Connection.java:55)
	at org.apache.qpid.transport.network.Assembler.emit(Assembler.java:98)
	at org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:184)
	at org.apache.qpid.transport.network.Assembler.frame(Assembler.java:132)
	at org.apache.qpid.transport.network.Frame.delegate(Frame.java:133)
	at org.apache.qpid.transport.network.Assembler.received(Assembler.java:103)
	at org.apache.qpid.transport.network.Assembler.received(Assembler.java:48)
	at org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:187)
	at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:103)
	at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:42)
	at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:128)
	at java.lang.Thread.run(Thread.java:636)


IoReceiver - localhost/127.0.0.1:5672" daemon prio=10 tid=0x00002aabbc1a8000 nid=0x1aee in Object.wait() [0x0000000040bee000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00002aab6155be60> (a org.apache.qpid.transport.Session$ResultFuture)
	at org.apache.qpid.transport.util.Waiter.await(Waiter.java:54)
	at org.apache.qpid.transport.Session$ResultFuture.get(Session.java:837)
	- locked <0x00002aab6155be60> (a org.apache.qpid.transport.Session$ResultFuture)
	at org.apache.qpid.transport.Session$ResultFuture.get(Session.java:859)
	at org.apache.qpid.client.AMQSession_0_10.resolveAddressType(AMQSession_0_10.java:1219)
	at org.apache.qpid.client.AMQSession_0_10.handleAddressBasedDestination(AMQSession_0_10.java:1160)
	at org.apache.qpid.client.BasicMessageProducer_0_10.declareDestination(BasicMessageProducer_0_10.java:81)
	at org.apache.qpid.client.BasicMessageProducer.resubscribe(BasicMessageProducer.java:174)
	at org.apache.qpid.client.AMQSession.resubscribeProducers(AMQSession.java:2782)
	at org.apache.qpid.client.AMQSession.resubscribe(AMQSession.java:2038)
	at org.apache.qpid.client.AMQSession_0_10.resumed(AMQSession_0_10.java:871)
	at org.apache.qpid.transport.Session.resume(Session.java:265)
	- locked <0x00002aab61761350> (a [Lorg.apache.qpid.transport.Method;)
	at org.apache.qpid.transport.Connection.resume(Connection.java:452)
	- locked <0x00002aab6075e0e8> (a java.lang.Object)
	at org.apache.qpid.client.AMQConnectionDelegate_0_10.closed(AMQConnectionDelegate_0_10.java:243)
	at org.apache.qpid.transport.Connection.closed(Connection.java:530)
	at org.apache.qpid.transport.network.Assembler.closed(Assembler.java:113)
	at org.apache.qpid.transport.network.InputHandler.closed(InputHandler.java:202)
	at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:150)
	at java.lang.Thread.run(Thread.java:636)

Comment 6 Rajith Attapattu 2010-08-31 22:06:36 UTC
This issue is tracked in upstream via QPID-2808
It was decided not to fix the locking issue at this time.
Instead a workaround was put in place to avoid the codepath that could cause the locking issue.
This workaround is tracked in upstream via QPID-2809. In addition to being a workaround this fix also improves the code to eliminate duplicate work by caching the results of a destination lookup.

I have ported the upstream fixes to the internal git repo using the following commits.

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=f3b4a96b13883746d0b94b9cf31baea988c99864

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=3279fe59f00163a5574b424b05b69b336005fbab

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=e03ad451b8037c60d036d4260fe79f0ca34bd445

Comment 7 Jiri Kolar 2010-09-29 15:04:25 UTC
fixed in qpid-java-client-0.7.946106-10.el5

validated on RHEL5.5  i386 / x86_64  

packages:
# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u

openais-0.80.6-16.el5_5.7
openais-devel-0.80.6-16.el5_5.7
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-client-devel-0.7.946106-17.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-14.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-java-common-0.7.946106-10.el5
qpid-tools-0.7.946106-11.el5
rhm-docs-0.7.946106-5.el5
rh-tests-distribution-MRG-Messaging-qpid_common-1.6-53


->VERIFIED