Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 634794

Summary: JMS client failover failures
Product: Red Hat Enterprise MRG Reporter: Rajith Attapattu <rattapat+nobody>
Component: qpid-javaAssignee: Rajith Attapattu <rattapat+nobody>
Status: CLOSED CURRENTRELEASE QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: urgent Docs Contact:
Priority: urgent    
Version: DevelopmentCC: gsim, jross, tross
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-11 18:58:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
client side log for case 1 in comment #3
none
section of broker log for case1 in comment #3
none
client log for case 2 in comment #3
none
section of broker log for case2 in comment #3
none
Proposed fix as a patch against the 1.3.x release branch none

Description Rajith Attapattu 2010-09-17 02:37:11 UTC
Description of problem:
The c++ broker could resend commands after failover that is no longer valid on the client side.

The commands maybe invalid for two reasons,

1. These commands could have been acted on the client side but were unable to confirm to the broker before failover happened.

2. It could also that the c++ broker is actually sending commands that were already acked due to a bug on the broker side.

So far I haven't been able to conclusively determine which of the above is causing the duplicate commands. Since this is only surfaced very recently I am speculating that it is the later as if it was the former then it would have been detected a lot earlier as the JMS client lacked any error handling in this area.

It could also be that the broker is replaying only responses to certain commands like ExchangeQuery QueueQuery etc.. These commands are sent by the JMS client only if the new addressing scheme is used, hence the reason why this issue can be reproduced only the new addressing syntax is used.

Irrespective of whether it's a broker bug or not the Java client should be able to handle this error. 

For examples the c++ broker seems to resend an ExchangeQueryResult after failover, that had actually been sent to the client before failover. This causes a NPE on the JMS client which doesn't get handled properly and eventually kills the connection. This causes the client to retry the connection which causes errors like Session already attached.

2010-09-16 17:16:05 error Channel exception: session-busy: Session already attached: guest (qpid/broker/SessionManager.cpp:55)
2010-09-16 17:16:05 error Channel exception: not-attached: Channel 4 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)


Version-Release number of selected component (if applicable):
7.946106-5 onwards

How reproducible:
Always

Steps to Reproduce:
1. Install the following packages,
     qpid-java-client/common 7.946106-5
     qpid-cpp-server and qpid-cpp-server-cluster 0.7.946106-14.el5
2. Start a two node cluster.
3. Run the attached reproducer.
4. Observe that failover does not happen as intended.
  
Actual results:
The client does not failover.

Expected results:
Should failover without any error

Comment 1 Rajith Attapattu 2010-09-17 02:40:29 UTC
*** Bug 633942 has been marked as a duplicate of this bug. ***

Comment 2 Gordon Sim 2010-09-17 08:08:26 UTC
The broker does not replay any command responses at all on failover. I believe there must be some other cause behind the symptoms here.

Comment 3 Gordon Sim 2010-09-17 11:57:52 UTC
I see two distinct types of problem in running the reproducer for Bug 633942.

Case 1:

With the normal dest_syntax (i.e. new addressing), the producer does not failover correctly even on the first attempt. I.e. start two node cluster, run reproducer class, kill first broker then observe that the subscriber successfully fails over, but does not receive any new messages. Eventually the producer appears to timeout.

The broker logs show that the producers session reattaches ok on failover (the same session name is used, but this does not cause any problem), it then replays some in-doubt messages but then proceeds to issue a further session-attach for the same session id on a different channel. This results - as expected - in a session-busy exception from the broker.

Will attach client log and a portion of broker log for the producers session showing the re-issuing of the session-attach.

This case does not occur if the dest_syntax is set to BURL indicating it may be due to new addressing changes.

Case 2:

Using BURL as the default dest_syntax, the same repoducer fails in a different way. Here the first failover proceeds without issue. However if the first broker is restarted, then the second broker is killed to trigger a second failover, the producer appears to failover successfully but the subscriber fails with a resource-locked exception.

The broker logs show that the client appears to reattach successfully on this second failover attempt. It then issues a declare, bind, subscribe sequence which do not cause any errors. Then follw a sequence of execution-sync calls after which the same declare-bind-subscribe sequence follows. The subscribes in the sequence are exclusive, and as the first one was not cancelled the second results in a resource-locked exception as expected.

Will attach the client log and a portion of broker log for the subscribers session showing the duplicate subscribes being issued (the second of which fails).

Note that the same pattern also occurs for the failover exchange subscriptions.

Comment 4 Gordon Sim 2010-09-17 11:59:38 UTC
Created attachment 447980 [details]
client side log for case 1 in comment #3

org.apache.qpid.transport.ConnectionException
	at org.apache.qpid.transport.Connection.exception(Connection.java:479)
	at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
	at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:197)
	at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:145)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
	at org.apache.qpid.transport.Session$ResultFuture.access$100(Session.java:809)
	at org.apache.qpid.transport.Session.result(Session.java:762)
	at org.apache.qpid.transport.SessionDelegate.executionResult(SessionDelegate.java:136)
	at org.apache.qpid.transport.SessionDelegate.executionResult(SessionDelegate.java:32)
	at org.apache.qpid.transport.ExecutionResult.dispatch(ExecutionResult.java:93)
	at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:46)
	at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:32)
	at org.apache.qpid.transport.Method.delegate(Method.java:159)
	at org.apache.qpid.transport.Session.received(Session.java:487)
	at org.apache.qpid.transport.Connection.dispatch(Connection.java:382)
	at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
	at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
	at org.apache.qpid.transport.MethodDelegate.executionResult(MethodDelegate.java:107)
	at org.apache.qpid.transport.ExecutionResult.dispatch(ExecutionResult.java:93)
	at org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:54)
	at org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:40)
	at org.apache.qpid.transport.Method.delegate(Method.java:159)
	at org.apache.qpid.transport.Connection.received(Connection.java:347)
	at org.apache.qpid.transport.Connection.received(Connection.java:55)
	at org.apache.qpid.transport.network.Assembler.emit(Assembler.java:98)
	at org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:199)
	at org.apache.qpid.transport.network.Assembler.frame(Assembler.java:132)
	at org.apache.qpid.transport.network.Frame.delegate(Frame.java:133)
	at org.apache.qpid.transport.network.Assembler.received(Assembler.java:103)
	at org.apache.qpid.transport.network.Assembler.received(Assembler.java:48)
	at org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:187)
	at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:103)
	at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:42)
	at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:128)
	... 1 more

seemed perhaps relevant?

Comment 5 Gordon Sim 2010-09-17 12:00:44 UTC
Created attachment 447981 [details]
section of broker log for case1 in comment #3

Here you can see the re-issuing of a session-attach on a different channel by the client while the first session is still active.

Comment 6 Gordon Sim 2010-09-17 12:01:15 UTC
Created attachment 447982 [details]
client log for case 2 in comment #3

Comment 7 Gordon Sim 2010-09-17 12:03:29 UTC
Created attachment 447983 [details]
section of broker log for case2 in comment #3

Here you can see a duplicated declare-bind-subscribe sequence on the same session. As the subscribes are both exclusive, the second fails as expected. The sequence of 15 or so execution-sync commands appears a little strange.

Comment 9 Rajith Attapattu 2010-09-21 03:06:51 UTC
This issue is happening due to the same root cause as Bug 633969.
This is tracked in upstream via QPID-2876.

In this particular case, the Java client has ExchangeQuery commands in it's command array that gets replayed after failover. The broker then responds with a ExchangeQueryResult, but when the client tries to process the result, it cannot find a corresponding command causing the NPE.

The fix for Bug 633969 should eliminate this issue as well. However it's good to have the Null check as a safety mechanism.

Following is the proposed patch for adding a Null check

--- qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java (original)
+++ qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java
@@ -767,7 +767,17 @@ public class Session extends SessionInvo
        {
            future = results.remove(command);
        }
-        future.set(result);
+
+        if (future != null)
+        {
+            future.set(result);
+        }
+        else
+        {
+            log.warn("Received a response to a command" +
+                       " that's no longer valid on the client side." +
+                       " [ command id : %s , result : %s ]",command, result);
+        }
    }

    void setException(ExecutionException exc)

Comment 10 Rajith Attapattu 2010-09-21 18:25:00 UTC
Created attachment 448767 [details]
Proposed fix as a patch against the 1.3.x release branch

The proposed patch contains the following changes

Session.java
============
Changes :

A null check is added before setting the result on a future. If the command had already received a response then the future will not be present in the map which could cause a NPE. 

The fix for Bug 633969 will eliminate the reason that caused this bug.
However the simple null check is good to have.

The second patch merely corrects the log message as the previous patch incorrectly implies that it is due to a broker bug.

Risk:

It's a very low risk patch.

Comment 11 Rajith Attapattu 2010-09-23 13:15:20 UTC
The patch was committed to the 1.3.x branch in the internal git repo.

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=4e1f175743eb39e7141fae9394f7895ccb45765b

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=1b39b02d42dd252e13feb8fab676f6f9f9a28e42

These changes are included in the 7.946106-10 package set.