Bug 634794
| Summary: | JMS client failover failures | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Rajith Attapattu <rattapat+nobody> |
| Component: | qpid-java | Assignee: | Rajith Attapattu <rattapat+nobody> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | Development | CC: | gsim, jross, tross |
| Target Milestone: | 1.3 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2012-12-11 18:58:32 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
*** Bug 633942 has been marked as a duplicate of this bug. *** The broker does not replay any command responses at all on failover. I believe there must be some other cause behind the symptoms here. I see two distinct types of problem in running the reproducer for Bug 633942. Case 1: With the normal dest_syntax (i.e. new addressing), the producer does not failover correctly even on the first attempt. I.e. start two node cluster, run reproducer class, kill first broker then observe that the subscriber successfully fails over, but does not receive any new messages. Eventually the producer appears to timeout. The broker logs show that the producers session reattaches ok on failover (the same session name is used, but this does not cause any problem), it then replays some in-doubt messages but then proceeds to issue a further session-attach for the same session id on a different channel. This results - as expected - in a session-busy exception from the broker. Will attach client log and a portion of broker log for the producers session showing the re-issuing of the session-attach. This case does not occur if the dest_syntax is set to BURL indicating it may be due to new addressing changes. Case 2: Using BURL as the default dest_syntax, the same repoducer fails in a different way. Here the first failover proceeds without issue. However if the first broker is restarted, then the second broker is killed to trigger a second failover, the producer appears to failover successfully but the subscriber fails with a resource-locked exception. The broker logs show that the client appears to reattach successfully on this second failover attempt. It then issues a declare, bind, subscribe sequence which do not cause any errors. Then follw a sequence of execution-sync calls after which the same declare-bind-subscribe sequence follows. The subscribes in the sequence are exclusive, and as the first one was not cancelled the second results in a resource-locked exception as expected. Will attach the client log and a portion of broker log for the subscribers session showing the duplicate subscribes being issued (the second of which fails). Note that the same pattern also occurs for the failover exchange subscriptions. Created attachment 447980 [details] client side log for case 1 in comment #3 org.apache.qpid.transport.ConnectionException at org.apache.qpid.transport.Connection.exception(Connection.java:479) at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108) at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:197) at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:145) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.NullPointerException at org.apache.qpid.transport.Session$ResultFuture.access$100(Session.java:809) at org.apache.qpid.transport.Session.result(Session.java:762) at org.apache.qpid.transport.SessionDelegate.executionResult(SessionDelegate.java:136) at org.apache.qpid.transport.SessionDelegate.executionResult(SessionDelegate.java:32) at org.apache.qpid.transport.ExecutionResult.dispatch(ExecutionResult.java:93) at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:46) at org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:32) at org.apache.qpid.transport.Method.delegate(Method.java:159) at org.apache.qpid.transport.Session.received(Session.java:487) at org.apache.qpid.transport.Connection.dispatch(Connection.java:382) at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64) at org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40) at org.apache.qpid.transport.MethodDelegate.executionResult(MethodDelegate.java:107) at org.apache.qpid.transport.ExecutionResult.dispatch(ExecutionResult.java:93) at org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:54) at org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:40) at org.apache.qpid.transport.Method.delegate(Method.java:159) at org.apache.qpid.transport.Connection.received(Connection.java:347) at org.apache.qpid.transport.Connection.received(Connection.java:55) at org.apache.qpid.transport.network.Assembler.emit(Assembler.java:98) at org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:199) at org.apache.qpid.transport.network.Assembler.frame(Assembler.java:132) at org.apache.qpid.transport.network.Frame.delegate(Frame.java:133) at org.apache.qpid.transport.network.Assembler.received(Assembler.java:103) at org.apache.qpid.transport.network.Assembler.received(Assembler.java:48) at org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:187) at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:103) at org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:42) at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:128) ... 1 more seemed perhaps relevant? Created attachment 447981 [details] section of broker log for case1 in comment #3 Here you can see the re-issuing of a session-attach on a different channel by the client while the first session is still active. Created attachment 447982 [details] client log for case 2 in comment #3 Created attachment 447983 [details] section of broker log for case2 in comment #3 Here you can see a duplicated declare-bind-subscribe sequence on the same session. As the subscribes are both exclusive, the second fails as expected. The sequence of 15 or so execution-sync commands appears a little strange. This issue is happening due to the same root cause as Bug 633969. This is tracked in upstream via QPID-2876. In this particular case, the Java client has ExchangeQuery commands in it's command array that gets replayed after failover. The broker then responds with a ExchangeQueryResult, but when the client tries to process the result, it cannot find a corresponding command causing the NPE. The fix for Bug 633969 should eliminate this issue as well. However it's good to have the Null check as a safety mechanism. Following is the proposed patch for adding a Null check --- qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java (original) +++ qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java @@ -767,7 +767,17 @@ public class Session extends SessionInvo { future = results.remove(command); } - future.set(result); + + if (future != null) + { + future.set(result); + } + else + { + log.warn("Received a response to a command" + + " that's no longer valid on the client side." + + " [ command id : %s , result : %s ]",command, result); + } } void setException(ExecutionException exc) Created attachment 448767 [details] Proposed fix as a patch against the 1.3.x release branch The proposed patch contains the following changes Session.java ============ Changes : A null check is added before setting the result on a future. If the command had already received a response then the future will not be present in the map which could cause a NPE. The fix for Bug 633969 will eliminate the reason that caused this bug. However the simple null check is good to have. The second patch merely corrects the log message as the previous patch incorrectly implies that it is due to a broker bug. Risk: It's a very low risk patch. The patch was committed to the 1.3.x branch in the internal git repo. http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=4e1f175743eb39e7141fae9394f7895ccb45765b http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=1b39b02d42dd252e13feb8fab676f6f9f9a28e42 These changes are included in the 7.946106-10 package set. |
Description of problem: The c++ broker could resend commands after failover that is no longer valid on the client side. The commands maybe invalid for two reasons, 1. These commands could have been acted on the client side but were unable to confirm to the broker before failover happened. 2. It could also that the c++ broker is actually sending commands that were already acked due to a bug on the broker side. So far I haven't been able to conclusively determine which of the above is causing the duplicate commands. Since this is only surfaced very recently I am speculating that it is the later as if it was the former then it would have been detected a lot earlier as the JMS client lacked any error handling in this area. It could also be that the broker is replaying only responses to certain commands like ExchangeQuery QueueQuery etc.. These commands are sent by the JMS client only if the new addressing scheme is used, hence the reason why this issue can be reproduced only the new addressing syntax is used. Irrespective of whether it's a broker bug or not the Java client should be able to handle this error. For examples the c++ broker seems to resend an ExchangeQueryResult after failover, that had actually been sent to the client before failover. This causes a NPE on the JMS client which doesn't get handled properly and eventually kills the connection. This causes the client to retry the connection which causes errors like Session already attached. 2010-09-16 17:16:05 error Channel exception: session-busy: Session already attached: guest (qpid/broker/SessionManager.cpp:55) 2010-09-16 17:16:05 error Channel exception: not-attached: Channel 4 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39) Version-Release number of selected component (if applicable): 7.946106-5 onwards How reproducible: Always Steps to Reproduce: 1. Install the following packages, qpid-java-client/common 7.946106-5 qpid-cpp-server and qpid-cpp-server-cluster 0.7.946106-14.el5 2. Start a two node cluster. 3. Run the attached reproducer. 4. Observe that failover does not happen as intended. Actual results: The client does not failover. Expected results: Should failover without any error