Bug 514487 - Store capacity errors result in hung cluster
Summary: Store capacity errors result in hung cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.2
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: 1.2
Assignee: Alan Conway
QA Contact: Jan Sarenik
URL:
Whiteboard:
Depends On:
Blocks: 527551
 
Reported: 2009-07-29 10:46 UTC by Gordon Sim
Modified: 2009-12-03 09:16 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Messaging bug fix
C: When capacity errors occur almost simultaneously on different nodes in a cluster
C: A race condition would cause the cluster as a whole to hang
F: The capacity errors are now handled correctly
R: The node that experiences the error will shut down, without hanging the entire cluster.
Clone Of:
Environment:
Last Closed: 2009-12-03 09:16:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
log from first node (15.22 KB, text/x-log)
2009-07-29 10:49 UTC, Gordon Sim
Log for second node (11.03 KB, text/x-log)
2009-07-29 10:49 UTC, Gordon Sim


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:1633 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid Version 1.2 2009-12-03 09:15:33 UTC

Description Gordon Sim 2009-07-29 10:46:40 UTC
Description of problem:

Using a two-node cluster, a multi-queue durable perftest that hits the journal capacity can reliably hang the cluster.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-26.el5
rhm-0.5.3206-9.el5

How reproducible:

So far 100% for me.

Steps to Reproduce:
1. start two node cluster
2. run: perftest --durable true --qt 4 --size 8 --count 100000
3. test hits capacity errors as expected, however...
  
Actual results:

...the cluster subsequently hangs and all attempts to connect time out.

Expected results:

The capacity error should either be handled by all nodes, or nodes should be shut down if the error occurs inconsistently across the cluster. The cluster as a whole should not hang, however.

Additional info:

Killing one of the nodes seems to resolve the issue and frees the cluster up again.
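
For reference, the two-node setup behind the steps above can be stood up roughly as follows on a single test box (a sketch only: the cluster name, ports and data directories are illustrative, and it assumes openais/corosync is running and the cluster and store modules are loaded):

# node 1
qpidd --cluster-name test-cluster --port 5672 --data-dir /tmp/node1 --auth no --daemon

# node 2: same cluster name, separate port and data dir
qpidd --cluster-name test-cluster --port 5673 --data-dir /tmp/node2 --auth no --daemon

# drive enough durable traffic to exhaust the journal capacity
perftest --durable true --qt 4 --size 8 --count 100000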

Comment 1 Gordon Sim 2009-07-29 10:49:03 UTC
Created attachment 355524 [details]
log from first node

Comment 2 Gordon Sim 2009-07-29 10:49:33 UTC
Created attachment 355525 [details]
Log for second node

Comment 3 Alan Conway 2009-07-31 18:41:33 UTC
If different errors occurred almost simultaneously on two different nodes in a cluster, there was a race condition that could cause the cluster to hang.

Fixed in revision 799687.

Comment 4 Alan Conway 2009-09-17 18:41:50 UTC
Back-ported and committed to the release repository:

http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=b9385c7aeee5683b4e9b26c9b4c3c75f4ca56bb6

Comment 5 Jan Sarenik 2009-10-14 00:25:08 UTC
rhm-0.5.3206-14.el5
qpidd-cluster-0.5.752581-28.el5

On i386 everything seems to verify fine.

On up-to-date RHEL 5.4, x86_64, I have a simple two-node cluster,
and while one qpidd stays alive even after that perftest, this
is how the other ends (with -t trace):

-------------------------------------------------------------------------
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.63:3086-0 Frame[BEbe; channel=0; {ClusterErrorCheckBody: type=2; frame-seq=824510; }]]
2009-oct-13 20:11:31 debug 10.34.33.63:3086(READY/error) local close of replicated connection 10.34.33.63:3086-5(local)
2009-oct-13 20:11:31 trace MCAST 10.34.33.63:3086-5: {ClusterConnectionDeliverCloseBody: }
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.63:3086-5 Frame[BEbe; channel=0; {ClusterConnectionDeliverCloseBody: }]]
2009-oct-13 20:11:31 info 10.34.33.63:3086(READY/error) error 824510 resolved with 10.34.33.63:3086
2009-oct-13 20:11:31 info 10.34.33.63:3086(READY/error) error 824510 must be resolved with 10.34.33.62:3346 
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.62:3346-0 Frame[BEbe; channel=0; {ClusterErrorCheckBody: type=2; frame-seq=824534; }]]
2009-oct-13 20:11:31 trace 10.34.33.63:3086(READY/error) DLVR: Event[10.34.33.62:3346-0 Frame[BEbe; channel=0; {ClusterErrorCheckBody: type=0; frame-seq=824510; }]]
2009-oct-13 20:11:31 critical 10.34.33.63:3086(READY/error) error 824510 did not occur on 10.34.33.62:3346
2009-oct-13 20:11:31 debug Exception constructed: Error 824510 did not occur on all members (qpid/cluster/ErrorCheck.cpp:90)
2009-oct-13 20:11:31 error Error delivering frames: Error 824510 did not occur on all members (qpid/cluster/ErrorCheck.cpp:90)
2009-oct-13 20:11:31 notice 10.34.33.63:3086(LEFT/error) leaving cluster jasan
2009-oct-13 20:11:31 debug 10.34.33.63:3086(LEFT/error) deleted connection: 10.34.33.63:3086-5(local)
2009-oct-13 20:11:31 debug Shutting down CPG
2009-oct-13 20:11:31 notice Shut down
2009-oct-13 20:11:31 debug Journal "TplStore": Destroyed

Comment 7 Alan Conway 2009-10-21 18:34:27 UTC
The log indicates that the broker experienced an error that did not occur on at least one other member. Do you still have the logs? There will be an error message the first time "Error 824510" is mentioned. Is it possible that this member is running out of space in the store before others?
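
If the logs are still around, something like the following should turn up that first mention and the underlying error text (sketch only; node1-trace.log and node2-trace.log stand in for whatever the actual trace log files are called):

# earliest mention of the failing frame-seq on each node; the first hit on the
# node that raised it should carry the underlying store/journal error message
grep -in "error 824510" node1-trace.log | head -n 1
grep -in "error 824510" node2-trace.log | head -n 1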

Comment 8 Irina Boverman 2009-10-22 17:26:45 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cluster no longer hangs when capacity errors occur almost simultaneously on different nodes in a cluster (514487)

Comment 10 Alan Conway 2009-10-23 12:40:36 UTC
Actually this is an expected result. The expected outcomes are:

1. both brokers hit a capacity error at exactly the same point - both brokers close the connection that caused the error and continue running.

2. one broker hits the capacity error before the other. The first broker to hit the error shuts down as it can no longer keep up; the second broker continues running.

So after the test, either one or both brokers should still be running. The important point is that the surviving broker(s) do not hang; they can still respond to requests.
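
As a quick sanity check of that last point, a small follow-up run against the surviving broker should complete promptly rather than timing out (sketch only; this assumes the survivor is reachable at the client's default broker address, otherwise point perftest at the right host/port):

# small, non-durable run: finishes in seconds against a healthy broker,
# but hangs/times out if the cluster is wedged as in the original report
perftest --qt 1 --size 8 --count 1000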

Comment 11 Jan Sarenik 2009-10-26 08:29:28 UTC
In that case I consider it verified, because
the tests on both architectures match the behaviour
described above.

Comment 13 Lana Brindley 2009-12-02 00:58:47 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-Cluster no longer hangs when capacity errors occur almost simultaneously on different nodes in a cluster (514487)
+Messaging bug fix
+
+C: When capacity errors occur almost simultaneously on different nodes in a cluster
+C: A race condition would cause the cluster as a whole hangs 
+F: The capacity errors are now handled more effectively by individual nodes in a cluster
+R: The node that experiences the error will shut down, without hanging the entire cluster.
+
+When capacity errors occur almost simultaneously on different nodes in a cluster, a race condition would cause the cluster as a whole to hang. The capacity errors are now handled more effectively by individual nodes in a cluster, and the node that experiences the error will shut down, without hanging the entire cluster.

Comment 14 Alan Conway 2009-12-02 13:26:27 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,6 @@
 Messaging bug fix
 
 C: When capacity errors occur almost simultaneously on different nodes in a cluster
-C: A race condition would cause the cluster as a whole hangs 
+C: A race condition would cause the cluster as a whole to hang 
-F: The capacity errors are now handled more effectively by individual nodes in a cluster
+F: The capacity errors are now handled correctly
-R: The node that experiences the error will shut down, without hanging the entire cluster.
+R: The node that experiences the error will shut down, without hanging the entire cluster.
-
-When capacity errors occur almost simultaneously on different nodes in a cluster, a race condition would cause the cluster as a whole to hang. The capacity errors are now handled more effectively by individual nodes in a cluster, and the node that experiences the error will shut down, without hanging the entire cluster.

Comment 15 errata-xmlrpc 2009-12-03 09:16:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html

