Description of problem:
Using a two-node cluster, a multi-queue durable perftest that hits the journal capacity can reliably hang the cluster.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-26.el5
rhm-0.5.3206-9.el5

How reproducible:
So far 100% for me.

Steps to Reproduce:
1. Start a two-node cluster.
2. Run: perftest --durable true --qt 4 --size 8 --count 100000 (see the command sketch after this report)
3. The test hits capacity errors as expected, however...

Actual results:
...the cluster subsequently hangs and all attempts to connect time out.

Expected results:
The capacity error should either be handled by all nodes, or nodes should be shut down if the error occurs inconsistently across the cluster. The cluster as a whole should not hang, however.

Additional info:
Killing one of the nodes seems to resolve the issue and frees the cluster up again.
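A minimal sketch of this reproduction, assuming two hosts (node1 and node2) with a working openais/corosync setup; the cluster name and data directory below are placeholders, and the perftest invocation is the one from the report (perftest connects to the local broker by default, so run it on one of the nodes):

# on each node: start a clustered, durable broker
qpidd --cluster-name capacity-test --data-dir /var/lib/qpidd --auth no --daemon

# on one node: drive durable traffic across 4 queues until the journal
# hits its capacity limit
perftest --durable true --qt 4 --size 8 --count 100000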
Created attachment 355524 [details] log from first node
Created attachment 355525 [details] Log for second node
If different errors occurred almost simultaneously on two different nodes in a cluster, a race condition could cause the cluster to hang. Fixed in revision 799687.
Back-ported and committed to the release repository: http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=b9385c7aeee5683b4e9b26c9b4c3c75f4ca56bb6
rhm-0.5.3206-14.el5
qpidd-cluster-0.5.752581-28.el5

On i386 everything verifies fine. On an up-to-date RHEL 5.4 x86_64 box I have a simple two-node cluster, and while one qpidd stays alive even after that perftest, this is how the other ends (with -t trace):

-------------------------------------------------------------------------
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.63:3086-0 Frame[BEbe; channel=0; {ClusterErrorCheckBody: type=2; frame-seq=824510; }]]
2009-oct-13 20:11:31 debug 10.34.33.63:3086(READY/error) local close of replicated connection 10.34.33.63:3086-5(local)
2009-oct-13 20:11:31 trace MCAST 10.34.33.63:3086-5: {ClusterConnectionDeliverCloseBody: }
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.63:3086-5 Frame[BEbe; channel=0; {ClusterConnectionDeliverCloseBody: }]]
2009-oct-13 20:11:31 info 10.34.33.63:3086(READY/error) error 824510 resolved with 10.34.33.63:3086
2009-oct-13 20:11:31 info 10.34.33.63:3086(READY/error) error 824510 must be resolved with 10.34.33.62:3346
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.62:3346-0 Frame[BEbe; channel=0; {ClusterErrorCheckBody: type=2; frame-seq=824534; }]]
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.62:3346-0 Frame[BEbe; channel=0; {ClusterErrorCheckBody: type=0; frame-seq=824510; }]]
2009-oct-13 20:11:31 critical 10.34.33.63:3086(READY/error) error 824510 did not occur on 10.34.33.62:3346
2009-oct-13 20:11:31 debug Exception constructed: Error 824510 did not occur on all members (qpid/cluster/ErrorCheck.cpp:90)
2009-oct-13 20:11:31 error Error delivering frames: Error 824510 did not occur on all members (qpid/cluster/ErrorCheck.cpp:90)
2009-oct-13 20:11:31 notice 10.34.33.63:3086(LEFT/error) leaving cluster jasan
2009-oct-13 20:11:31 debug 10.34.33.63:3086(LEFT/error) deleted connection: 10.34.33.63:3086-5(local)
2009-oct-13 20:11:31 debug Shutting down CPG
2009-oct-13 20:11:31 notice Shut down
2009-oct-13 20:11:31 debug Journal "TplStore": Destroyed
The log indicates that the broker experienced an error that did not occur on at least one other member. Do you still have the logs? There will be an error message the first time "Error 824510" is mentioned. Is it possible that this member is running out of space in the store before others?
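For anyone checking the attached logs, a quick way to locate that first error message (the file names below are placeholders for the two attached logs):

# find the first mentions of error 824510 on each node; the earliest hit
# should carry the original error text, e.g. a store/journal capacity failure
grep -n "824510" node1.log | head
grep -n "824510" node2.log | head

# look for store capacity messages around those timestamps
grep -in "capacity" node1.log | head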
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cluster no longer hangs when capacity errors occur almost simultaneously on different nodes in a cluster (514487)
Actually this is an expected result. The expected outcomes are:

1. Both brokers hit the capacity error at exactly the same point: both brokers close the connection that caused the error and continue running.
2. One broker hits the capacity error before the other: the first broker to hit the error shuts down, as it can no longer keep up, and the second broker continues running.

So after the test either one or both brokers should still be running. The important point is that the surviving broker(s) do not hang; they can still respond to requests (see the smoke test below).
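A simple smoke test for that last point, assuming the nodes are reachable over ssh and that perftest's default of connecting to the local broker still holds for this build:

# a small, non-durable run against each node; before the fix these
# connection attempts timed out once the cluster had hung
ssh node1 'perftest --count 1000 --size 8'
ssh node2 'perftest --count 1000 --size 8'

A node that shut itself down on the capacity error will refuse the connection outright rather than hang, which is consistent with outcome 2 above.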
In that case I consider it verified, because the tests on both architectures match the behaviour described above.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-Cluster no longer hangs when capacity errors occur almost simultaneously on different nodes in a cluster (514487)
+Messaging bug fix
+
+C: When capacity errors occur almost simultaneously on different nodes in a cluster
+C: A race condition would cause the cluster as a whole hangs
+F: The capacity errors are now handled more effectively by individual nodes in a cluster
+R: The node that experiences the error will shut down, without hanging the entire cluster.
+
+When capacity errors occur almost simultaneously on different nodes in a cluster, a race condition would cause the cluster as a whole to hang. The capacity errors are now handled more effectively by individual nodes in a cluster, and the node that experiences the error will shut down, without hanging the entire cluster.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,6 @@
 Messaging bug fix
 
 C: When capacity errors occur almost simultaneously on different nodes in a cluster
-C: A race condition would cause the cluster as a whole hangs
+C: A race condition would cause the cluster as a whole to hang
-F: The capacity errors are now handled more effectively by individual nodes in a cluster
+F: The capacity errors are now handled correctly
-R: The node that experiences the error will shut down, without hanging the entire cluster.
+R: The node that experiences the error will shut down, without hanging the entire cluster.
-
-When capacity errors occur almost simultaneously on different nodes in a cluster, a race condition would cause the cluster as a whole to hang. The capacity errors are now handled more effectively by individual nodes in a cluster, and the node that experiences the error will shut down, without hanging the entire cluster.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html