This service will be undergoing maintenance at 00:00 UTC, 2016-08-01. It is expected to last about 1 hours
Bug 801605 - Non-responsive peer in federated link can result in entire cluster shutdown
Non-responsive peer in federated link can result in entire cluster shutdown
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp (Show other bugs)
2.0
Unspecified Unspecified
high Severity unspecified
: 2.3
: ---
Assigned To: Alan Conway
Leonid Zhaldybin
: Patch
Depends On:
Blocks: 698367
  Show dependency treegraph
 
Reported: 2012-03-08 18:34 EST by Jason Dillaman
Modified: 2014-11-09 17:38 EST (History)
4 users (show)

See Also:
Fixed In Version: qpid-0.18
Doc Type: Bug Fix
Doc Text:
Cause: Federated links can issue cluster events prior to the link connection being fully established (protocol handshake complete). Consequence: Cluster members receive event for unknown federated link connection, which results in the members leaving the cluster. Fix: Delay federated link IO processing until after the connection is fully established. Result: Cluster members do not leave the cluster when federated to a non-responsive peer.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-06 13:55:19 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
Backtrace from first broker to crash (10.17 KB, text/x-log)
2012-03-08 18:36 EST, Jason Dillaman
no flags Details
Backtrace from second broker to crash (3.72 KB, text/x-log)
2012-03-08 18:37 EST, Jason Dillaman
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Apache JIRA QPID-4133 None None None 2012-07-12 14:17:44 EDT

  None (edit)
Description Jason Dillaman 2012-03-08 18:34:23 EST
Description of problem:
Repeatedly while running performance longevity tests, all brokers within a cluster will crash.  Logs indicate that the crash occurs shortly after the loss and re-establishment of a federated link.

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_3.el6.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Configure a federated, clustered collection of brokers
2. Send and receive messages at a high throughput
  
Actual results:
Broker cluster crashes

Expected results:
Broker cluster does not crash
Comment 2 Jason Dillaman 2012-03-08 18:36:01 EST
Created attachment 568768 [details]
Backtrace from first broker to crash
Comment 3 Jason Dillaman 2012-03-08 18:37:05 EST
Created attachment 568769 [details]
Backtrace from second broker to crash
Comment 5 Jason Dillaman 2012-06-27 10:10:47 EDT
Issue was recently repeated repeatedly in a client environment where debug-level logs were available.  The chain of events appears to start with a non-responsive federated link.

Sequence of Events:

1) Cluster elder establishes inter-broker link

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug Inter-broker link connecting to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Set TCP_NODELAY on connection to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Inter-broker link established to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug cluster(2.0.0.0:22138 READY) local connection HOST1:54581-HOST2:10000(2.0.0.0:22138-14 local)

2) Cluster elder sends AMQP init frame

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug SENT [HOST1:54581-HOST2:10000] INIT(0-10)

3) Cluster elder never receives any data from federated peer / connection never announced

4) Link ioThreadProcessing fires and attempts to create bridges over federated link

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Link::ioThreadProcessing()
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: 0x7fad0006ff10
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=@QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE1 to EXCHANGE1
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: 0x7fad00094ce0
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=@QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE2 to EXCHANGE2

5) Cluster shuts down because connection was never announced

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Exception constructed: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 critical Error delivering frames: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 notice cluster(2.0.0.0:22138 LEFT) leaving cluster CLUSTER_NAME
Comment 7 Jason Dillaman 2012-07-12 11:23:29 EDT
Simplified Steps to Reproduce:

1. Configure a clustered broker
2. Start netcat in listen mode (nc -l 6000)
3. Add a route between the clustered broker and nc (qpid-route queue add localhost:5672 localhost:6000 amq.fanout foo)
4. Clustered broker shuts down
Comment 8 Justin Ross 2012-11-14 15:54:42 EST
Merged to 0.18 r1362653
Comment 11 Leonid Zhaldybin 2013-01-08 04:55:51 EST
Tested on RHEL5.9 and RHEL6.3 (both i386 and x86_64). This issue has been fixed.

Packages used for testing:

RHEL5.9
qpid-cpp-client-0.18-13.el5
qpid-cpp-client-devel-0.18-13.el5
qpid-cpp-client-devel-docs-0.18-13.el5
qpid-cpp-client-ssl-0.18-13.el5
qpid-cpp-server-0.18-13.el5
qpid-cpp-server-cluster-0.18-13.el5
qpid-cpp-server-devel-0.18-13.el5
qpid-cpp-server-ssl-0.18-13.el5
qpid-cpp-server-store-0.18-13.el5
qpid-cpp-server-xml-0.18-13.el5
qpid-java-client-0.18-6.el5
qpid-java-common-0.18-6.el5
qpid-java-example-0.18-6.el5
qpid-qmf-0.18-13.el5
qpid-qmf-devel-0.18-13.el5
qpid-tools-0.18-7.el5

RHEL6.3
qpid-cpp-client-0.18-13.el6
qpid-cpp-client-devel-0.18-13.el6
qpid-cpp-client-devel-docs-0.18-13.el6
qpid-cpp-server-0.18-13.el6
qpid-cpp-server-cluster-0.18-13.el6
qpid-cpp-server-devel-0.18-13.el6
qpid-cpp-server-store-0.18-13.el6
qpid-cpp-server-xml-0.18-13.el6
qpid-java-client-0.18-6.el6
qpid-java-common-0.18-6.el6
qpid-java-example-0.18-6.el6
qpid-qmf-0.18-13.el6
qpid-tools-0.18-7.el6_3

-> VERIFIED
Comment 13 errata-xmlrpc 2013-03-06 13:55:19 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html

Note You need to log in before you can comment on or make changes to this bug.