Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 801605

Summary: Non-responsive peer in federated link can result in entire cluster shutdown
Product: Red Hat Enterprise MRG
Reporter: Jason Dillaman <jdillama>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: Leonid Zhaldybin <lzhaldyb>
Severity: unspecified
Docs Contact:
Priority: high
Version: 2.0
CC: esammons, jross, lzhaldyb, tross
Target Milestone: 2.3
Keywords: Patch
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qpid-0.18
Doc Type: Bug Fix
Doc Text:
Cause: Federated links can issue cluster events before the link connection is fully established (protocol handshake complete).
Consequence: Cluster members receive an event for an unknown federated link connection, which results in the members leaving the cluster.
Fix: Delay federated link IO processing until after the connection is fully established.
Result: Cluster members no longer leave the cluster when federated to a non-responsive peer.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-06 18:55:19 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 698367    
Attachments:
  Backtrace from first broker to crash (flags: none)
  Backtrace from second broker to crash (flags: none)

Description Jason Dillaman 2012-03-08 23:34:23 UTC
Description of problem:
While running performance longevity tests, all brokers within a cluster repeatedly crash.  Logs indicate that the crash occurs shortly after the loss and re-establishment of a federated link.

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_3.el6.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Configure a federated, clustered collection of brokers
2. Send and receive messages at a high throughput
  
Actual results:
Broker cluster crashes

Expected results:
Broker cluster does not crash

Comment 2 Jason Dillaman 2012-03-08 23:36:01 UTC
Created attachment 568768 [details]
Backtrace from first broker to crash

Comment 3 Jason Dillaman 2012-03-08 23:37:05 UTC
Created attachment 568769 [details]
Backtrace from second broker to crash

Comment 5 Jason Dillaman 2012-06-27 14:10:47 UTC
The issue was recently reproduced repeatedly in a client environment where debug-level logs were available.  The chain of events appears to start with a non-responsive federated link.

Sequence of Events:

1) Cluster elder establishes inter-broker link

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug Inter-broker link connecting to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Set TCP_NODELAY on connection to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Inter-broker link established to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug cluster(2.0.0.0:22138 READY) local connection HOST1:54581-HOST2:10000(2.0.0.0:22138-14 local)

2) Cluster elder sends AMQP init frame

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug SENT [HOST1:54581-HOST2:10000] INIT(0-10)

3) Cluster elder never receives any data from federated peer / connection never announced

4) Link ioThreadProcessing fires and attempts to create bridges over federated link

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Link::ioThreadProcessing()
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: 0x7fad0006ff10
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE1 to EXCHANGE1
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: 0x7fad00094ce0
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE2 to EXCHANGE2

5) Cluster shuts down because the connection was never announced (see the sketch after this sequence)

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Exception constructed: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 critical Error delivering frames: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 notice cluster(2.0.0.0:22138 LEFT) leaving cluster CLUSTER_NAME
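
For context, here is a minimal illustration of why an event for an unannounced connection is fatal to the member. This is a simplified sketch with hypothetical names, not the actual code at qpid/cluster/Cluster.cpp:542:

  // Illustrative sketch only: cluster events are keyed by a connection id that
  // every member learns via an announce event. If an event arrives for an id
  // that was never announced, the lookup fails and the error policy is to
  // leave the cluster.
  #include <map>
  #include <memory>
  #include <stdexcept>
  #include <string>

  struct Connection {};  // stand-in for a cluster-side connection record

  class ClusterSketch {
      std::map<std::string, std::shared_ptr<Connection>> connections;  // announced connections only
  public:
      void announce(const std::string& id) {
          connections[id] = std::make_shared<Connection>();
      }
      void deliverEvent(const std::string& id) {
          auto it = connections.find(id);
          if (it == connections.end())
              // In the real broker this surfaces as "Error delivering frames:
              // Unknown connection ..." and the member leaves the cluster.
              throw std::runtime_error("Unknown connection: " + id);
          // ... apply the event to it->second ...
      }
  };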

Comment 7 Jason Dillaman 2012-07-12 15:23:29 UTC
Simplified Steps to Reproduce:

1. Configure a clustered broker
2. Start netcat in listen mode (nc -l 6000)
3. Add a route between the clustered broker and nc (qpid-route queue add localhost:5672 localhost:6000 amq.fanout foo)
4. Clustered broker shuts down

Comment 8 Justin Ross 2012-11-14 20:54:42 UTC
Merged to 0.18 r1362653
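
As described in the Doc Text above, the fix delays federated link IO processing until the connection is fully established. A minimal sketch of that idea follows; the class and method names are simplified and hypothetical, not the actual upstream change merged in r1362653:

  // Sketch of the guard only, not the real qpid-cpp classes: the link's
  // periodic IO work (bridge creation, session attach) is skipped until a
  // flag set by a handshake-complete callback marks the connection as
  // fully established.
  #include <atomic>

  class FederatedLinkSketch {
      std::atomic<bool> established{false};  // set once the AMQP handshake completes
  public:
      void connectionEstablished() { established = true; }  // hypothetical handshake callback

      void ioThreadProcessing() {
          if (!established)
              return;  // before the fix, bridge creation could run this early
          createPendingBridges();
      }
  private:
      void createPendingBridges() { /* attach sessions, activate routes */ }
  };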

Comment 11 Leonid Zhaldybin 2013-01-08 09:55:51 UTC
Tested on RHEL5.9 and RHEL6.3 (both i386 and x86_64). This issue has been fixed.

Packages used for testing:

RHEL5.9
qpid-cpp-client-0.18-13.el5
qpid-cpp-client-devel-0.18-13.el5
qpid-cpp-client-devel-docs-0.18-13.el5
qpid-cpp-client-ssl-0.18-13.el5
qpid-cpp-server-0.18-13.el5
qpid-cpp-server-cluster-0.18-13.el5
qpid-cpp-server-devel-0.18-13.el5
qpid-cpp-server-ssl-0.18-13.el5
qpid-cpp-server-store-0.18-13.el5
qpid-cpp-server-xml-0.18-13.el5
qpid-java-client-0.18-6.el5
qpid-java-common-0.18-6.el5
qpid-java-example-0.18-6.el5
qpid-qmf-0.18-13.el5
qpid-qmf-devel-0.18-13.el5
qpid-tools-0.18-7.el5

RHEL6.3
qpid-cpp-client-0.18-13.el6
qpid-cpp-client-devel-0.18-13.el6
qpid-cpp-client-devel-docs-0.18-13.el6
qpid-cpp-server-0.18-13.el6
qpid-cpp-server-cluster-0.18-13.el6
qpid-cpp-server-devel-0.18-13.el6
qpid-cpp-server-store-0.18-13.el6
qpid-cpp-server-xml-0.18-13.el6
qpid-java-client-0.18-6.el6
qpid-java-common-0.18-6.el6
qpid-java-example-0.18-6.el6
qpid-qmf-0.18-13.el6
qpid-tools-0.18-7.el6_3

-> VERIFIED

Comment 13 errata-xmlrpc 2013-03-06 18:55:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html