Description of problem:
A two node cluster with JMS clients experienced an error of the form:

Execution exception: invalid-argument: guest: confirmed < (2+0) but only sent < (1+0) (qpid/SessionState.cpp:163)

This only occurred on one of the nodes (the shadow for that connection), which then shut itself down.

Version-Release number of selected component (if applicable):
qpidd-cluster-0.5.752581-26.el5
openais-0.80.3-22.el5_3.8

How reproducible:
Has occurred at least twice.

Steps to Reproduce:
Not yet known.
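For context on what the error string means: the broker's session layer tracks the command ids it has sent to a peer, and raises invalid-argument when the peer confirms (via session-completed) a command id beyond what was actually sent. The following is a minimal Python sketch of that invariant, not the actual C++ code in qpid/SessionState.cpp; the class and method names are hypothetical.

```python
# Hypothetical model of the invariant behind the error reported at
# qpid/SessionState.cpp:163 -- illustration only, not the broker code.

class SessionStateError(Exception):
    pass

class SessionSendState:
    """Tracks command ids sent to a peer and confirmations received back."""
    def __init__(self):
        self.sent = 0       # commands 0..sent-1 have been sent
        self.confirmed = 0  # peer has confirmed all commands below this id

    def send(self):
        self.sent += 1

    def peer_confirmed(self, command_id):
        # The peer may only confirm commands that were actually sent.
        if command_id > self.sent:
            raise SessionStateError(
                "invalid-argument: confirmed < (%d+0) but only sent < (%d+0)"
                % (command_id, self.sent))
        self.confirmed = max(self.confirmed, command_id)

# On the directly connected broker the message transfer went out, so two
# commands were sent and a completion up to id 2 is valid; on the shadow
# only one command went out, so the same completion trips the check.
direct = SessionSendState()
direct.send(); direct.send()
direct.peer_confirmed(2)          # fine: both commands were sent

shadow = SessionSendState()
shadow.send()
try:
    shadow.peer_confirmed(2)
except SessionStateError as e:
    print(e)                      # confirmed < (2+0) but only sent < (1+0)
```

This matches the symptom in the report: the same client completion is valid on the broker that sent the message and invalid on the shadow that did not.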
Created attachment 356864 [details]
Log file (from failed node) of another occurrence with tracing on
Next time we reproduce this, it would be good to get the logs from both nodes to see where the discrepancy arises between the two.
There are three subscribers on a particular queue[1], each using a timed (30sec) 'pull' style receiver (with a flush)[2]. All of these are on connections created on the failing node as part of the update protocol.

Just after a message becomes available[3], one of these subscribers sends a session-completed (most likely followed by a message-accept, as that is the pattern shown for every other session-completed). However, this session-completed appears to indicate an invalid command id (i.e. a command that has not yet been issued).

So I think either the state transfer during the update leaves the shadow with incorrect state[4] (i.e. the command counter is somehow off by one after the update), or the dispatch is inconsistent between the two nodes and the shadow does not actually dispatch the message to the same subscriber as the other node does[5].

[1] Plus lots of connections and subscribers on other queues.

[2] The pattern is to issue infinite byte credit, then credit for 1 message, then wait for 30 secs, flushing if no message has been received, then repeat. I.e. in the log the pattern appears as:

MessageFlow(allocate credit for unlimited bytes)
MessageFlow(allocate credit for 1 message)
DeliverDoOutput
....
MessageFlush (comes just before the cycle starts again)

[3] The message that seems to trigger the error is the only one enqueued during the period the log represents. None of the other subscribers on that queue send a session-completed or message-accept at any time. In other words, I believe this is the first delivery during the logged period, which means that 110 commands had been sent to the client at the point the new node joined.

[4] Not sure whether this is relevant, but the channel used by the failing session is 0.
[5] The command pattern mentioned above repeats regularly, but on the failing session it ends in a slightly irregular form (which may be entirely irrelevant):

MessageFlow(allocate credit for unlimited bytes)
DeliverDoOutput
MessageFlow(allocate credit for 1 message)
DeliverDoOutput
DeliverDoOutput
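The timed pull pattern in footnote [2] can be summarized with a small Python sketch. This is purely illustrative (the function name and structure are invented, not the JMS client's API): it just emits the command trace one receiver cycle would produce, with the flush issued only when the 30 second wait expires without a delivery.

```python
# Illustrative model of the timed 'pull' receiver cycle from footnote [2].
# Hypothetical helper -- not the actual JMS client implementation.

def timed_pull_cycle(message_arrived):
    """Return the commands one receiver cycle issues, per the logged pattern."""
    trace = [
        "MessageFlow(allocate credit for unlimited bytes)",
        "MessageFlow(allocate credit for 1 message)",
        # (the broker logs DeliverDoOutput here when it attempts dispatch)
    ]
    if not message_arrived:
        # 30s elapsed with no delivery: flush outstanding credit before the
        # next cycle re-issues it.
        trace.append("MessageFlush")
    return trace

print(timed_pull_cycle(message_arrived=False))
```

When a message does arrive within the window, no MessageFlush appears, which is why the flush lines in the log mark the idle cycles.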
At the end of the log we can see that there is:
- a message received & enqueued
- 2 deliver-do-output commands
- no messages sent
- a session-completed that triggers the error

The session-completed can only have been triggered by a message being sent, so it looks like the other broker sent a message while the failed broker did not. I don't know why the failed broker did not send, however, since it appears from the logs that the session is subscribed and has 1 message credit.
Created attachment 357364 [details]
Reproducer

The attachment contains a JMS based reproducer. Just untar the package and run the scramble_brokers.sh script. It basically starts a JMS producer and a JMS consumer that uses ** sync_ack ** in the background, then changes the 4 node cluster membership rapidly to force failover.

I tried with a 2 node cluster to keep things simple, but the probability of the error happening was pretty low. Also, in that case it was hitting a known issue in the JMS client's FailoverExchangeMethod.

The script runs the java clients with the log level at WARN. You can easily change that in the script to debug, etc. You could also get the brokers to log to a file. Feel free to modify the tests as you see fit. Please ping me if you make any improvements to the test script so I can incorporate those changes into my nightly runs.
There appear to be two bugs here:

1. The original bug has the error "confirmed N but only sent N-1", which is only raised on some cluster members, causing them to shut down as inconsistent.
2. Rajith's reproducer above gives the error "confirmed N but only sent 0", which is raised consistently on all cluster members active at the time.

I've raised bug 519476 for 2. We still don't have a reproducer for 1.
Created attachment 359418 [details]
Test program

The attached program can be used to emulate the pattern from the logs and reproduce the error (it's a modification to the existing receiver.cpp test).

Steps to reproduce:
1. start one node of the cluster
2. qpid-config add queue test-queue
3. start several retriever processes (code attached), e.g. for i in `seq 1 5`; do ./retriever & done
4. send messages to the queue with sleeps between them (the sleeps appear to be important to triggering the error), e.g. for i in `seq 1 10000`; do echo "message $i"; sleep 5; done | sender, where sender is as checked in to qpid/src/tests
5. after a short while, start another node for the cluster

The new node then exits (reliably for me using the above steps) with:

2009-sep-01 12:16:43 notice Journal "TplStore": Created
2009-sep-01 12:16:43 notice Store module initialized; dir=/home/gordon/qpid/cpp/test-cluster-temp-data-5673
2009-sep-01 12:16:43 notice Recovering from cluster, no recovery from local journal
2009-sep-01 12:16:43 notice SASL disabled: No Authentication Performed
2009-sep-01 12:16:43 notice Listening on TCP port 5673
2009-sep-01 12:16:43 notice 100.0.50.15:15648(INIT) joining cluster grs-mrg15-test-cluster with url=amqp:tcp:10.16.44.222:5673,tcp:100.0.50.15:5673,tcp:30.0.10.15:5673,tcp:30.0.20.15:5673
2009-sep-01 12:16:43 notice Rdma: Listening on RDMA port 5673
2009-sep-01 12:16:43 notice Broker running
2009-sep-01 12:16:44 notice 100.0.50.15:15648(READY) caught up, active cluster member
2009-sep-01 12:16:45 error Execution exception: invalid-argument: anonymous.532d7f87-796a-418a-ab03-b9e49f89a781: confirmed < (17+0) but only sent < (16+0) (qpid/SessionState.cpp:163)
2009-sep-01 12:16:45 error 100.0.50.15:15648(READY/error) channel error 1196 on 100.0.50.15:12625-2(shadow): invalid-argument: anonymous.532d7f87-796a-418a-ab03-b9e49f89a781: confirmed < (17+0) but only sent < (16+0) (qpid/SessionState.cpp:163) (unresolved: 100.0.50.15:12625 100.0.50.15:15648 )
2009-sep-01 12:16:45 critical 100.0.50.15:15648(READY/error) error 1196 did not occur on 100.0.50.15:12625
2009-sep-01 12:16:45 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-sep-01 12:16:45 notice 100.0.50.15:15648(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-sep-01 12:16:45 notice Shut down
Created attachment 359431 [details]
Logs from failed run

The attached logs are for a two node cluster where the error was observed. The node on port 5672 was started with the 5 receivers connected to it. Then some messages were sent and another node on port 5673 was started. Then another couple of messages were sent, and the second node exited.

The problem seems to be message 'fff', which was sent out on 5672 in response to the do-output delivered as event 475. On node 5673 the same send was not triggered, and consequently that node raised the error on receiving the completion.
Further information:

(1) I was running against packages (qpidd-0.5.752581-26.el5 & qpidd-cluster-0.5.752581-26.el5). I can't reproduce against qpid trunk, so changes there may have fixed the issue.

(2) You don't actually even need the attached test case. Follow the steps as listed in comment #7, but simply start 2 receivers in step 3 (instead of the retriever program) and the same error can be observed.
From gsim: The good news is that it appears to be 'fixed' on trunk - certainly I cannot reproduce it there. The bad news is that it seems very easy to reproduce on the -26 packages: start one cluster node, create a queue, start two receivers subscribed to the queue, send one message to the queue, start another node, send another message to the queue, and the second node exits. Any ideas on the change that may have fixed this on trunk?

aconway: Here's a list of candidate fixes on trunk; none seem to match exactly. The next step is to identify which fix resolves the issue. That will help us understand what's going wrong and evaluate whether the issue is fully resolved on trunk.

commit c3c922d82e71b7d191c469cf2bc77e9478e7faae
Author: Alan Conway <aconway>
Date: Fri Jul 31 18:40:26 2009 +0000

    Fix race condition in cluster error handling. If different errors occured almost simultaneously on two different nodes in a cluster, there was a race condition that could cause the cluster to hang.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@799687 13f79535-47bb-0310-9956-ffa450edef68

commit 2dced0fb971f24cf61f6c0965e868ce29f09328b
Author: Alan Conway <aconway>
Date: Fri Jul 10 15:42:36 2009 +0000

    Fix cluster handling of multiple errors. If an error occured while there were frames on the error queue from a previous error, the enqueued frames were not being processed for the new error, which could lead to error-check or config-change frames being missed.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@792991 13f79535-47bb-0310-9956-ffa450edef68

commit 155482f43f33c6aa5bd44e8ebd767017433d8988
Author: Alan Conway <aconway>
Date: Wed Jul 1 21:58:40 2009 +0000

    Fix members joining cluster while cluster is handling client errors. Completes the previous incomplete fix in r790163.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@790397 13f79535-47bb-0310-9956-ffa450edef68

commit 62b700b251111d7c0cb09d2b99ec0bfaa180b54e
Author: Alan Conway <aconway>
Date: Wed Jul 1 13:47:04 2009 +0000

    Fix members joining cluster while cluster is handling client errors. Previously cluster members could abort if a new member joins while existing members are handling a client error. Now if an update offer arrives while an error is in progress, the offering broker retracts the offer and the newcomer must try again.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@790163 13f79535-47bb-0310-9956-ffa450edef68

commit 6b9decf55d9aa235bfbcca645501165d08b61749
Author: Alan Conway <aconway>
Date: Tue Jun 30 20:51:38 2009 +0000

    Fix cluster race condition with connections closed by broker while in use. If a client is using a connection that is closed at the broker end because of an error, there is a race condition that allows the connection to be incorrectly re-created on replica brokers which can cause those brokers to exit with an error that does not occur on the directly connected broker. The fix: explicitly announce new connections, shadow connections are no longer implicitly created on first use. Make error-check a cluster control so it can be handled independently of the lifecycle of the connection where an error initially occured.

    git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@789947 13f79535-47bb-0310-9956-ffa450edef68
This bug was fixed by r794736.
Back-ported & committed to the release repository:
http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=0595ecfd2b8d7cc0274754f391f747ace8daaee5
Tested: on -26 the bug appears; on -28 it has been fixed.

Validated on RHEL5-Server-U4 i386 / x86_64 with packages:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.6-8.el5
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-28.el5
qpidc-debuginfo-0.5.752581-28.el5
qpidc-devel-0.5.752581-28.el5
qpidc-rdma-0.5.752581-28.el5
qpidc-ssl-0.5.752581-28.el5
qpidd-0.5.752581-28.el5
qpidd-acl-0.5.752581-28.el5
qpidd-cluster-0.5.752581-28.el5
qpidd-devel-0.5.752581-28.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-28.el5
qpidd-ssl-0.5.752581-28.el5
qpidd-xml-0.5.752581-28.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5
rhm-0.5.3206-14.el5
rhm-docs-0.5.756148-1.el5
rh-tests-distribution-MRG-Messaging-qpid_common-1.5-15

-> VERIFIED
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Resolved session-completed inconsistency error on two node cluster with JMS clients (516501)
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-Resolved session-completed inconsistency error on two node cluster with JMS clients (516501)
+Messaging Bug Fix
+
+C: Running a cluster with two JMS client nodes, and one issues a session-completed command that is interpreted as being invalid.
+C: The node that produced the command would return an error and shut down, while the other ran as expected.
+F:
+R: The error no longer occurs
+
+NEED FURTHER INFO FOR RELNOTE
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,6 @@
 Messaging Bug Fix

-C: Running a cluster with two JMS client nodes, and one issues a session-completed command that is interpreted as being invalid.
+C: Where a queue has multiple consumers at the point a new member is added to the cluster, an inconsistent allocation of messages to consumers was possible, meaning that one node delivered a message of which the shadow session(s) were unaware.
-C: The node that produced the command would return an error and shut down, while the other ran as expected.
+C: When a session-completed command was then sent acknowledging the message, an error occurred on some but not all nodes, and this caused those nodes on which the error did occur to shut down on detecting the inconsistency.
-F:
+F: The update process was fixed such that it correctly handles the case where multiple subscribers exist on a queue (specifically, the allocation order is now identical across the cluster).
-R: The error no longer occurs
+R: The inconsistent delivery no longer occurs, thus no inconsistency errors are raised in this case and all nodes remain in the cluster as expected.
-
-NEED FURTHER INFO FOR RELNOTE
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -3,4 +3,6 @@
 C: Where a queue has multiple consumers at the point a new member is added to the cluster, an inconsistent allocation of messages to consumers was possible, meaning that one node delivered a message of which the shadow session(s) were unaware.
 C: When a session-completed command was then sent acknowledging the message, an error occurred on some but not all nodes, and this caused those nodes on which the error did occur to shut down on detecting the inconsistency.
 F: The update process was fixed such that it correctly handles the case where multiple subscribers exist on a queue (specifically, the allocation order is now identical across the cluster).
-R: The inconsistent delivery no longer occurs, thus no inconsistency errors are raised in this case and all nodes remain in the cluster as expected.
+R: The inconsistent delivery no longer occurs, thus no inconsistency errors are raised in this case and all nodes remain in the cluster as expected.
+
+When a queue had multiple consumers at the time that a new member was added to the cluster, messages could be allocated to consumers inconsistently. This could result in one node delivering a message of which the shadow session(s) were unaware. Some nodes would then record an error and shut down on detecting the inconsistency. The update process was corrected so that the inconsistent delivery no longer occurs. This prevents the errors, and all nodes remain in the cluster as expected.
An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html