Created attachment 344437
log file for updatee

Description of problem:
A two node cluster is in use and one node is killed and then restarted. On attempting to rejoin, the whole cluster appeared to hang.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-5.el5

How reproducible:
Not sure, and I don't as yet have steps to reproduce.

Additional info:
From the log file for the updatee, it appears that ClusterConnectionShadowReadyBody does not get sent for one connection being updated:

[gordon@thinkpad Desktop]$ grep ClusterConnectionSessionStateBody msg_trc_rejoin
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-2(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionSessionStateBody: replay-start=0; command-point=8345; sent-incomplete={ [0,8344] }; expected=18; received=18; unknown-completed={ [0,17] }; received-incomplete={ }; }]
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-3(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionSessionStateBody: replay-start=0; command-point=2269; sent-incomplete={ [0,2268] }; expected=20; received=20; unknown-completed={ [0,19] }; received-incomplete={ }; }]
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-4(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionSessionStateBody: replay-start=0; command-point=2; sent-incomplete={ }; expected=15; received=15; unknown-completed={ [1,14] }; received-incomplete={ }; }]

[gordon@thinkpad Desktop]$ grep ClusterConnectionShadowReadyBody msg_trc_rejoin
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-2(local,catchup): Frame[BEbe; channel=0; {ClusterConnectionShadowReadyBody: member-id=1592059894620515759; connection-id=1; user-name=guest@QPID; fragment=; }]
2009-may-15 18:44:27 trace 10.34.24.21:32029(UPDATEE) RECV 10.34.24.21:32029-3(local,catchup): Frame[BEbe; channel=0; {ClusterConnectionShadowReadyBody: member-id=1592059894620515759; connection-id=2; user-name=guest@QPID; fragment=; }]

Note the inconsistent-looking state (i.e. command-point < received) in the session state update for 10.34.24.21:32029-4, which appears never to be marked ready, preventing the update from completing and leaving the cluster in an unusable state:

{ClusterConnectionSessionStateBody: replay-start=0; command-point=2; sent-incomplete={ }; expected=15; received=15; unknown-completed={ [1,14] }; received-incomplete={ }; }]
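For anyone wanting to repeat the grep comparison above on a larger trace log, something along these lines might help. It is only a rough sketch: the function name unready_connections is made up for illustration, and it assumes each trace entry sits on a single line with the connection identifier as the token right after "RECV", as in the excerpts above. It prints every connection that received a ClusterConnectionSessionStateBody but no ClusterConnectionShadowReadyBody.

```shell
#!/bin/bash
# Hypothetical helper (not part of qpidd or its tools): list connections
# that got a session-state frame but never a shadow-ready frame.
# Assumption: one trace entry per line, connection id follows "RECV".
unready_connections() {
    awk '
        {
            # Find the token after RECV, e.g. 10.34.24.21:32029-4(local,catchup):
            conn = ""
            for (i = 1; i < NF; i++) {
                if ($i == "RECV") { conn = $(i + 1); break }
            }
            if (conn == "") next
            # Drop the trailing "(local,catchup):" qualifier.
            sub(/\(.*$/, "", conn)
        }
        /ClusterConnectionSessionStateBody/ { state[conn] = 1 }
        /ClusterConnectionShadowReadyBody/  { ready[conn] = 1 }
        END {
            for (c in state) if (!(c in ready)) print c
        }
    ' "$1" | sort
}
```

Usage would be `unready_connections msg_trc_rejoin`. Going by the grep output quoted above, the only connection it should flag in that log is 10.34.24.21:32029-4.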
Ignore the comment about inconsistent session state above; the command point tracks sent commands and is of course independent of the received commands! If there is anything noteworthy about the last session state, it's simply that, unlike the earlier two, there are no in-doubt sent commands.
The join/update protocol has been re-worked to be more robust in commits up to r883999. Since this issue is not reproducible I'm assuming it is fixed by those changes.
Retested with 2 nodes; one was rejoined every 10 seconds for ~12 hours.

/etc/qpidd.conf on both nodes:
cluster-mechanism=ANONYMOUS
cluster-name=pinola2
log-enable=trace+
log-to-file=/tmp/qpidd.log

Reproducer:
#!/bin/bash
while true; do
    echo "Starting qpidd"
    service qpidd start
    sleep 5
    qpid-cluster
    echo "Stopping qpidd"
    service qpidd stop
    sleep 5
    qpid-cluster
done

Packages:
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-client-devel-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpidc-debuginfo-0.5.752581-33.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-dotnet-0.4.738274-2.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5

No regression found.

>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Rejoining a cluster after a broker restart no longer causes the cluster to stop responding.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html