Description of problem:
A two node cluster with JMS clients experienced an error of the form:

Execution exception: invalid-argument: guest: confirmed < (2+0) but only sent < (1+0) (qpid/SessionState.cpp:163)

This only occurred on one of the nodes (the shadow for that connection), which then shut itself down.

Version-Release number of selected component (if applicable):
qpidd-cluster-0.5.752581-26.el5
openais-0.80.3-22.el5_3.8

How reproducible:
Has occurred at least twice.

Steps to Reproduce:
Not yet known.
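For context on what the error string means: the broker's session layer tracks the command ids it has sent to a peer, and raises invalid-argument when the peer confirms (via session-completed) a command id beyond what was actually sent. The following is a minimal Python sketch of that invariant, not the actual C++ code in qpid/SessionState.cpp; the class and method names are hypothetical.

```python
# Hypothetical model of the invariant behind the error reported at
# qpid/SessionState.cpp:163 -- illustration only, not the broker code.

class SessionStateError(Exception):
    pass

class SessionSendState:
    """Tracks command ids sent to a peer and confirmations received back."""
    def __init__(self):
        self.sent = 0       # commands 0..sent-1 have been sent
        self.confirmed = 0  # peer has confirmed all commands below this id

    def send(self):
        self.sent += 1

    def peer_confirmed(self, command_id):
        # The peer may only confirm commands that were actually sent.
        if command_id > self.sent:
            raise SessionStateError(
                "invalid-argument: confirmed < (%d+0) but only sent < (%d+0)"
                % (command_id, self.sent))
        self.confirmed = max(self.confirmed, command_id)

# On the directly connected broker the message transfer went out, so two
# commands were sent and a completion up to id 2 is valid; on the shadow
# only one command went out, so the same completion trips the check.
direct = SessionSendState()
direct.send(); direct.send()
direct.peer_confirmed(2)          # fine: both commands were sent

shadow = SessionSendState()
shadow.send()
try:
    shadow.peer_confirmed(2)
except SessionStateError as e:
    print(e)                      # confirmed < (2+0) but only sent < (1+0)
```

This matches the symptom in the report: the same client completion is valid on the broker that sent the message and invalid on the shadow that did not.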
Created attachment 356864 [details]
Log file (from failed node) of another occurrence with tracing on
Next time we reproduce this, it would be good to get the logs from both nodes to see where the discrepancy arises between the two.
There are three subscribers on a particular queue[1], each using a timed (30sec) 'pull' style receiver (with a flush)[2]. All of these are on connections created on the failing node as part of the update protocol.

Just after a message becomes available[3], one of these subscribers sends a session-completed (most likely followed by a message-accept, as that is the pattern shown for every other session-completed). However, this session-completed appears to indicate an invalid command id (i.e. a command that has not yet been issued).

So I think either the state transfer during the update leaves the shadow with incorrect state[4] (i.e. the command counter is somehow off by one after the update), or the dispatch is inconsistent between the two nodes and the shadow does not actually dispatch the message to the same subscriber as the other node does[5].

[1] Plus lots of connections and subscribers on other queues.

[2] The pattern is to issue infinite byte credit, then credit for 1 message, then wait for 30 secs, flushing if no message has been received, then repeat. I.e. in the log the pattern appears as:

MessageFlow(allocate credit for unlimited bytes)
MessageFlow(allocate credit for 1 message)
DeliverDoOutput
....
MessageFlush (comes just before the cycle starts again)

[3] The message that seems to trigger the error is the only one enqueued during the period the log represents. None of the other subscribers on that queue send a session-completed or message-accept at any time. In other words, I believe this is the first delivery during the logged period, which means that 110 commands had been sent to the client at the point the new node joined.

[4] Not sure whether this is relevant, but the channel used by the failing session is 0.
[5] The command pattern mentioned above repeats regularly, but on the failing session it ends in a slightly irregular form (which may be entirely irrelevant):

MessageFlow(allocate credit for unlimited bytes)
DeliverDoOutput
MessageFlow(allocate credit for 1 message)
DeliverDoOutput
DeliverDoOutput
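The timed pull pattern in footnote [2] can be summarized with a small Python sketch. This is purely illustrative (the function name and structure are invented, not the JMS client's API): it just emits the command trace one receiver cycle would produce, with the flush issued only when the 30 second wait expires without a delivery.

```python
# Illustrative model of the timed 'pull' receiver cycle from footnote [2].
# Hypothetical helper -- not the actual JMS client implementation.

def timed_pull_cycle(message_arrived):
    """Return the commands one receiver cycle issues, per the logged pattern."""
    trace = [
        "MessageFlow(allocate credit for unlimited bytes)",
        "MessageFlow(allocate credit for 1 message)",
        # (the broker logs DeliverDoOutput here when it attempts dispatch)
    ]
    if not message_arrived:
        # 30s elapsed with no delivery: flush outstanding credit before the
        # next cycle re-issues it.
        trace.append("MessageFlush")
    return trace

print(timed_pull_cycle(message_arrived=False))
```

When a message does arrive within the window, no MessageFlush appears, which is why the flush lines in the log mark the idle cycles.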
At the end of the log we can see that there is:
- a message received & enqueued
- 2 deliver-do-output commands
- no messages sent
- a session-completed that triggers the error

The session-completed can only have been triggered by a message being sent, so it looks like the other broker sent a message while the failed broker did not. I don't know why the failed broker did not send, however, since it appears from the logs that the session is subscribed and has 1 message credit.
Created attachment 357364 [details]
Reproducer

The attachment contains a JMS based reproducer. Just untar the package and run the scramble_brokers.sh script. It basically starts a JMS producer and a JMS consumer that uses ** sync_ack ** in the background, then changes the 4 node cluster membership rapidly to force failover.

I tried with a 2 node cluster to keep things simple, but the probability of the error happening was pretty low. Also, in that case it was hitting a known issue in the JMS client's FailoverExchangeMethod.

The script runs the java clients with the log level at WARN. You can easily change that in the script to debug, etc. You could also get the brokers to log to a file. Feel free to modify the tests as you see fit. Please ping me if you make any improvements to the test script so I can incorporate those changes into my nightly runs.
There appear to be two bugs here:

1. The original bug has the error "confirmed N but only sent N-1", which is only raised on some cluster members, causing them to shut down as inconsistent.
2. Rajith's reproducer above gives the error "confirmed N but only sent 0", which is raised consistently on all cluster members active at the time.

I've raised bug 519476 for 2. We still don't have a reproducer for 1.
Created attachment 359418 [details]
Test program

The attached program can be used to emulate the pattern from the logs and reproduce the error (it's a modification to the existing receiver.cpp test).

Steps to reproduce:
1. start one node of the cluster
2. qpid-config add queue test-queue
3. start several retriever processes (code attached), e.g. for i in `seq 1 5`; do ./retriever & done
4. send messages to the queue with sleeps between them (the sleeps appear to be important to triggering the error), e.g. for i in `seq 1 10000`; do echo "message $i"; sleep 5; done | sender, where sender is as checked in to qpid/src/tests
5. after a short while, start another node for the cluster

The new node then exits (reliably for me using the above steps) with:

2009-sep-01 12:16:43 notice Journal "TplStore": Created
2009-sep-01 12:16:43 notice Store module initialized; dir=/home/gordon/qpid/cpp/test-cluster-temp-data-5673
2009-sep-01 12:16:43 notice Recovering from cluster, no recovery from local journal
2009-sep-01 12:16:43 notice SASL disabled: No Authentication Performed
2009-sep-01 12:16:43 notice Listening on TCP port 5673
2009-sep-01 12:16:43 notice 100.0.50.15:15648(INIT) joining cluster grs-mrg15-test-cluster with url=amqp:tcp:10.16.44.222:5673,tcp:100.0.50.15:5673,tcp:30.0.10.15:5673,tcp:30.0.20.15:5673
2009-sep-01 12:16:43 notice Rdma: Listening on RDMA port 5673
2009-sep-01 12:16:43 notice Broker running
2009-sep-01 12:16:44 notice 100.0.50.15:15648(READY) caught up, active cluster member
2009-sep-01 12:16:45 error Execution exception: invalid-argument: anonymous.532d7f87-796a-418a-ab03-b9e49f89a781: confirmed < (17+0) but only sent < (16+0) (qpid/SessionState.cpp:163)
2009-sep-01 12:16:45 error 100.0.50.15:15648(READY/error) channel error 1196 on 100.0.50.15:12625-2(shadow): invalid-argument: anonymous.532d7f87-796a-418a-ab03-b9e49f89a781: confirmed < (17+0) but only sent < (16+0) (qpid/SessionState.cpp:163) (unresolved: 100.0.50.15:12625 100.0.50.15:15648 )
2009-sep-01 12:16:45 critical 100.0.50.15:15648(READY/error) error 1196 did not occur on 100.0.50.15:12625
2009-sep-01 12:16:45 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-sep-01 12:16:45 notice 100.0.50.15:15648(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-sep-01 12:16:45 notice Shut down
Created attachment 359431 [details]
Logs from failed run

The attached logs are for a two node cluster where the error was observed. The node on port 5672 was started with the 5 receivers connected to it. Then some messages were sent and another node on port 5673 was started. Then another couple of messages were sent, and the second node exited.

The problem seems to be message 'fff', which was sent out on 5672 in response to the do-output delivered as event 475. On node 5673 the same send was not triggered, and consequently that node raised the error on receiving the completion.
Further information:

(1) I was running against packages (qpidd-0.5.752581-26.el5 & qpidd-cluster-0.5.752581-26.el5). I can't reproduce against qpid trunk, so changes there may have fixed the issue.

(2) You don't actually even need the attached test case. Follow the steps as listed in comment #7, but simply start 2 receivers in step 3 (instead of the retriever program) and the same error can be observed.
From gsim: The good news is that it appears to be 'fixed' on trunk - certainly I cannot reproduce it there. The bad news is that it seems very easy to reproduce on the -26 packages: start one cluster node, create a queue, start two receivers subscribed to the queue, send one message to the queue, start another node, send another message to the queue, and the second node exits. Any ideas on the change that may have fixed this on trunk?

aconway: Here's a list of candidate fixes on trunk; none seem to match exactly. The next step is to identify which fix resolves the issue. That will help us understand what's going wrong and evaluate whether the issue is fully resolved on trunk.

commit c3c922d82e71b7d191c469cf2bc77e9478e7faae
Author: Alan Conway <aconway>
Date: Fri Jul 31 18:40:26 2009 +0000

    Fix race condition in cluster error handling. If different errors occured almost simultaneously on two different nodes in a cluster, there was a race condition that could cause the cluster to hang.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@799687 13f79535-47bb-0310-9956-ffa450edef68

commit 2dced0fb971f24cf61f6c0965e868ce29f09328b
Author: Alan Conway <aconway>
Date: Fri Jul 10 15:42:36 2009 +0000

    Fix cluster handling of multiple errors. If an error occured while there were frames on the error queue from a previous error, the enqueued frames were not being processed for the new error, which could lead to error-check or config-change frames being missed.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@792991 13f79535-47bb-0310-9956-ffa450edef68

commit 155482f43f33c6aa5bd44e8ebd767017433d8988
Author: Alan Conway <aconway>
Date: Wed Jul 1 21:58:40 2009 +0000

    Fix members joining cluster while cluster is handling client errors. Completes the previous incomplete fix in r790163.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@790397 13f79535-47bb-0310-9956-ffa450edef68

commit 62b700b251111d7c0cb09d2b99ec0bfaa180b54e
Author: Alan Conway <aconway>
Date: Wed Jul 1 13:47:04 2009 +0000

    Fix members joining cluster while cluster is handling client errors. Previously cluster members could abort if a new member joins while existing members are handling a client error. Now if an update offer arrives while an error is in progress, the offering broker retracts the offer and the newcomer must try again.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@790163 13f79535-47bb-0310-9956-ffa450edef68

commit 6b9decf55d9aa235bfbcca645501165d08b61749
Author: Alan Conway <aconway>
Date: Tue Jun 30 20:51:38 2009 +0000

    Fix cluster race condition with connections closed by broker while in use. If a client is using a connection that is closed at the broker end because of an error, there is a race condition that allows the connection to be incorrectly re-created on replica brokers which can cause those brokers to exit with an error that does not occur on the directly connected broker. The fix: explicitly announce new connections, shadow connections are no longer implicitly created on first use. Make error-check a cluster control so it can be handled independently of the lifecycle of the connection where an error initially occured.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@789947 13f79535-47bb-0310-9956-ffa450edef68
This bug was fixed by r794736.
Back-ported & committed to the release repository:
http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=0595ecfd2b8d7cc0274754f391f747ace8daaee5
Tested: on -26 the bug appears; on -28 it has been fixed.

Validated on RHEL5-Server-U4 i386 / x86_64 with packages:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.6-8.el5
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-28.el5
qpidc-debuginfo-0.5.752581-28.el5
qpidc-devel-0.5.752581-28.el5
qpidc-rdma-0.5.752581-28.el5
qpidc-ssl-0.5.752581-28.el5
qpidd-0.5.752581-28.el5
qpidd-acl-0.5.752581-28.el5
qpidd-cluster-0.5.752581-28.el5
qpidd-devel-0.5.752581-28.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-28.el5
qpidd-ssl-0.5.752581-28.el5
qpidd-xml-0.5.752581-28.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5
rhm-0.5.3206-14.el5
rhm-docs-0.5.756148-1.el5
rh-tests-distribution-MRG-Messaging-qpid_common-1.5-15

-> VERIFIED
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Resolved session-completed inconsistency error on two node cluster with JMS clients (516501)
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-Resolved session-completed inconsistency error on two node cluster with JMS clients (516501)
+Messaging Bug Fix
+
+C: Running a cluster with two JMS client nodes, and one issues a session-completed command that is interpreted as being invalid.
+C: The node that produced the command would return an error and shut down, while the other ran as expected.
+F:
+R: The error no longer occurs
+
+NEED FURTHER INFO FOR RELNOTE
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,6 @@
 Messaging Bug Fix

-C: Running a cluster with two JMS client nodes, and one issues a session-completed command that is interpreted as being invalid.
+C: Where a queue has multiple consumers at the point a new member is added to the cluster, an inconsistent allocation of messages to consumers was possible, meaning that one node delivered a message of which the shadow session(s) were unaware.
-C: The node that produced the command would return an error and shut down, while the other ran as expected.
+C: When a session-completed command was then sent acknowledging the message, an error occurred on some but not all nodes, and this caused those nodes on which the error did occur to shut down on detecting the inconsistency.
-F:
+F: The update process was fixed such that it correctly handles the case where multiple subscribers exist on a queue (specifically, the allocation order is now identical across the cluster).
-R: The error no longer occurs
+R: The inconsistent delivery no longer occurs, thus no inconsistency errors are raised in this case and all nodes remain in the cluster as expected.
-
-NEED FURTHER INFO FOR RELNOTE
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -3,4 +3,6 @@
 C: Where a queue has multiple consumers at the point a new member is added to the cluster, an inconsistent allocation of messages to consumers was possible, meaning that one node delivered a message of which the shadow session(s) were unaware.
 C: When a session-completed command was then sent acknowledging the message, an error occurred on some but not all nodes, and this caused those nodes on which the error did occur to shut down on detecting the inconsistency.
 F: The update process was fixed such that it correctly handles the case where multiple subscribers exist on a queue (specifically, the allocation order is now identical across the cluster).
-R: The inconsistent delivery no longer occurs, thus no inconsistency errors are raised in this case and all nodes remain in the cluster as expected.
+R: The inconsistent delivery no longer occurs, thus no inconsistency errors are raised in this case and all nodes remain in the cluster as expected.
+
+When a queue had multiple consumers at the time that a new member was added to the cluster, messages could be allocated to consumers inconsistently. This could result in one node delivering a message of which the shadow session(s) were unaware. Some nodes would then record an error and shut down on detecting the inconsistency. The update process was corrected so that the inconsistent delivery no longer occurs. This prevents the errors, and all nodes remain in the cluster as expected.
An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html