Bug 501305 - Cluster node gets stuck as updatee and 'hangs' cluster
Summary: Cluster node gets stuck as updatee and 'hangs' cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.1
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 1.3
Assignee: Alan Conway
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On: 639994
Blocks:
 
Reported: 2009-05-18 13:37 UTC by Gordon Sim
Modified: 2010-10-14 16:02 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Rejoining a cluster after a broker restart no longer causes the cluster to stop responding.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:02:03 UTC
Target Upstream Version:
Embargoed:


Attachments
log file for updatee (239.59 KB, application/x-compressed-tar)
2009-05-18 13:37 UTC, Gordon Sim


Links
Red Hat Product Errata RHSA-2010:0773 (priority normal, status SHIPPED_LIVE): Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3. Last updated 2010-10-14 15:56:44 UTC.

Description Gordon Sim 2009-05-18 13:37:22 UTC
Created attachment 344437 [details]
log file for updatee

Description of problem:

A two-node cluster was in use; one node was killed and then restarted. When it attempted to rejoin, the whole cluster appeared to hang.
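
A sketch of the sequence (the exact kill method and timing are assumptions; the service and tool names are those used in the verification comment below):

# On the node being restarted, while the other broker stays up:
service qpidd stop     # or kill the qpidd process outright
service qpidd start    # the broker rejoins the cluster as updatee
qpid-cluster           # membership query; this is where the hang shows up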

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-5.el5

How reproducible:

Not sure; there are as yet no steps to reproduce.

Additional info:

From the log file for the updatee, it appears that ClusterConnectionShadowReadyBody does not get sent for one connection being updated:

[gordon@thinkpad Desktop]$ grep ClusterConnectionSessionStateBody msg_trc_rejoin
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-2(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionSessionStateBody: replay-start=0; command-point=8345; sent-incomplete={ [0,8344] }; expected=18; received=18; unknown-completed={ [0,17] }; received-incomplete={ }; }]
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-3(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionSessionStateBody: replay-start=0; command-point=2269; sent-incomplete={ [0,2268] }; expected=20; received=20; unknown-completed={ [0,19] }; received-incomplete={ }; }]
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-4(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionSessionStateBody: replay-start=0; command-point=2; sent-incomplete={ }; expected=15; received=15; unknown-completed={ [1,14] }; received-incomplete={ }; }]


[gordon@thinkpad Desktop]$ grep ClusterConnectionShadowReadyBody msg_trc_rejoin
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-2(local,catchup): Frame[BEbe; channel=0; {ClusterConnectionShadowReadyBody: member-id=1592059894620515759; connection-id=1; user-name=guest@QPID; fragment=; }]
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-3(local,catchup): Frame[BEbe; channel=0; {ClusterConnectionShadowReadyBody: member-id=1592059894620515759; connection-id=2; user-name=guest@QPID; fragment=; }]

Note the inconsistent-looking state (i.e. command-point < received) in the session state update for 10.34.24.21:32029-4 (which appears never to be marked ready, preventing the update from completing and leaving the cluster in an unusable state):

{ClusterConnectionSessionStateBody: replay-start=0; command-point=2; sent-incomplete={ }; expected=15; received=15; unknown-completed={ [1,14] }; received-incomplete={ }; }]
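
One way to mechanically spot a connection that received session state but was never marked ready is to diff the connection fields of the two greps above (a sketch; the awk field index assumes the trace line layout shown here):

#!/bin/bash
# Print connections with a ClusterConnectionSessionStateBody but no matching
# ClusterConnectionShadowReadyBody; field 6 is the connection id in the
# trace lines above.
LOG=msg_trc_rejoin
comm -23 \
  <(grep ClusterConnectionSessionStateBody "$LOG" | awk '{print $6}' | sort -u) \
  <(grep ClusterConnectionShadowReadyBody "$LOG" | awk '{print $6}' | sort -u)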

Comment 1 Gordon Sim 2009-05-18 14:00:46 UTC
Ignore the comment about inconsistent session state above; the command-point tracks sent commands and is of course independent of the received commands! If there is anything noteworthy about the last session state, it's simply that, unlike the earlier two, there are no in-doubt sent commands.

Comment 3 Alan Conway 2009-11-25 21:36:39 UTC
The join/update protocol has been re-worked to be more robust in commits up to r883999. Since this issue is not reproducible, I'm assuming it is fixed by those changes.

Comment 5 Tomas Rusnak 2010-10-06 08:33:04 UTC
Retested with 2 nodes; one was rejoined every 10 seconds for ~12 hours.

/etc/qpidd.conf on both nodes:
cluster-mechanism=ANONYMOUS
cluster-name=pinola2
log-enable=trace+
log-to-file=/tmp/qpidd.log
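
The trace+ level is what emits the ClusterConnection*Body frames grepped for in the original report; a quick check that the updatee reached the ready stage after a rejoin (log path taken from the config above, frame name from the report):

# The count should grow with each successful rejoin; a count that lags the
# ClusterConnectionSessionStateBody count points to a stuck updatee.
grep -c ClusterConnectionShadowReadyBody /tmp/qpidd.log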

Reproducer:

#!/bin/bash
# Repeatedly restart the broker so it rejoins the cluster as an updatee,
# checking cluster membership after each transition.
while true; do
  echo "Starting qpidd"
  service qpidd start
  sleep 5
  qpid-cluster    # confirm the restarted broker rejoined
  echo "Stopping qpidd"
  service qpidd stop
  sleep 5
  qpid-cluster    # confirm the remaining node still responds
done
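
Running this loop on one node while the second broker stays up exercises the rejoin path continuously; had the original hang still been present, one of the qpid-cluster calls would eventually have blocked or shown a stuck member.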

Packages:

qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-client-devel-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpidc-debuginfo-0.5.752581-33.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-dotnet-0.4.738274-2.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5

No regression found.

>>> VERIFIED

Comment 6 Jaromir Hradilek 2010-10-07 15:40:25 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Rejoining a cluster after a broker restart no longer causes the cluster to stop responding.

Comment 8 errata-xmlrpc 2010-10-14 16:02:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

