Bug 801605 - Non-responsive peer in federated link can result in entire cluster shutdown
Summary: Non-responsive peer in federated link can result in entire cluster shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.0
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: 2.3
: ---
Assignee: Alan Conway
QA Contact: Leonid Zhaldybin
URL:
Whiteboard:
Depends On:
Blocks: 698367
TreeView+ depends on / blocked
 
Reported: 2012-03-08 23:34 UTC by Jason Dillaman
Modified: 2014-11-09 22:38 UTC (History)
4 users (show)

Fixed In Version: qpid-0.18
Doc Type: Bug Fix
Doc Text:
Cause: Federated links can issue cluster events prior to the link connection being fully established (protocol handshake complete). Consequence: Cluster members receive event for unknown federated link connection, which results in the members leaving the cluster. Fix: Delay federated link IO processing until after the connection is fully established. Result: Cluster members do not leave the cluster when federated to a non-responsive peer.
Clone Of:
Environment:
Last Closed: 2013-03-06 18:55:19 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Backtrace from first broker to crash (10.17 KB, text/x-log)
2012-03-08 23:36 UTC, Jason Dillaman
no flags Details
Backtrace from second broker to crash (3.72 KB, text/x-log)
2012-03-08 23:37 UTC, Jason Dillaman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-4133 0 None None None 2012-07-12 18:17:44 UTC
Red Hat Product Errata RHSA-2013:0561 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging 2.3 security update 2013-03-06 23:48:13 UTC

Description Jason Dillaman 2012-03-08 23:34:23 UTC
Description of problem:
Repeatedly while running performance longevity tests, all brokers within a cluster will crash.  Logs indicate that the crash occurs shortly after the loss and re-establishment of a federated link.

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_3.el6.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Configure a federated, clustered collection of brokers
2. Send and receive messages at a high throughput
  
Actual results:
Broker cluster crashes

Expected results:
Broker cluster does not crash

Comment 2 Jason Dillaman 2012-03-08 23:36:01 UTC
Created attachment 568768 [details]
Backtrace from first broker to crash

Comment 3 Jason Dillaman 2012-03-08 23:37:05 UTC
Created attachment 568769 [details]
Backtrace from second broker to crash

Comment 5 Jason Dillaman 2012-06-27 14:10:47 UTC
Issue was recently repeated repeatedly in a client environment where debug-level logs were available.  The chain of events appears to start with a non-responsive federated link.

Sequence of Events:

1) Cluster elder establishes inter-broker link

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug Inter-broker link connecting to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Set TCP_NODELAY on connection to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Inter-broker link established to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug cluster(2.0.0.0:22138 READY) local connection HOST1:54581-HOST2:10000(2.0.0.0:22138-14 local)

2) Cluster elder sends AMQP init frame

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug SENT [HOST1:54581-HOST2:10000] INIT(0-10)

3) Cluster elder never receives any data from federated peer / connection never announced

4) Link ioThreadProcessing fires and attempts to create bridges over federated link

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Link::ioThreadProcessing()
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: 0x7fad0006ff10
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE1 to EXCHANGE1
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: 0x7fad00094ce0
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE2 to EXCHANGE2

5) Cluster shuts down because connection was never announced

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Exception constructed: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 critical Error delivering frames: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 notice cluster(2.0.0.0:22138 LEFT) leaving cluster CLUSTER_NAME

Comment 7 Jason Dillaman 2012-07-12 15:23:29 UTC
Simplified Steps to Reproduce:

1. Configure a clustered broker
2. Start netcat in listen mode (nc -l 6000)
3. Add a route between the clustered broker and nc (qpid-route queue add localhost:5672 localhost:6000 amq.fanout foo)
4. Clustered broker shuts down

Comment 8 Justin Ross 2012-11-14 20:54:42 UTC
Merged to 0.18 r1362653

Comment 11 Leonid Zhaldybin 2013-01-08 09:55:51 UTC
Tested on RHEL5.9 and RHEL6.3 (both i386 and x86_64). This issue has been fixed.

Packages used for testing:

RHEL5.9
qpid-cpp-client-0.18-13.el5
qpid-cpp-client-devel-0.18-13.el5
qpid-cpp-client-devel-docs-0.18-13.el5
qpid-cpp-client-ssl-0.18-13.el5
qpid-cpp-server-0.18-13.el5
qpid-cpp-server-cluster-0.18-13.el5
qpid-cpp-server-devel-0.18-13.el5
qpid-cpp-server-ssl-0.18-13.el5
qpid-cpp-server-store-0.18-13.el5
qpid-cpp-server-xml-0.18-13.el5
qpid-java-client-0.18-6.el5
qpid-java-common-0.18-6.el5
qpid-java-example-0.18-6.el5
qpid-qmf-0.18-13.el5
qpid-qmf-devel-0.18-13.el5
qpid-tools-0.18-7.el5

RHEL6.3
qpid-cpp-client-0.18-13.el6
qpid-cpp-client-devel-0.18-13.el6
qpid-cpp-client-devel-docs-0.18-13.el6
qpid-cpp-server-0.18-13.el6
qpid-cpp-server-cluster-0.18-13.el6
qpid-cpp-server-devel-0.18-13.el6
qpid-cpp-server-store-0.18-13.el6
qpid-cpp-server-xml-0.18-13.el6
qpid-java-client-0.18-6.el6
qpid-java-common-0.18-6.el6
qpid-java-example-0.18-6.el6
qpid-qmf-0.18-13.el6
qpid-tools-0.18-7.el6_3

-> VERIFIED

Comment 13 errata-xmlrpc 2013-03-06 18:55:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html


Note You need to log in before you can comment on or make changes to this bug.