
Bug 801605

Summary:

Non-responsive peer in federated link can result in entire cluster shutdown

Product:

Red Hat Enterprise MRG

Reporter:

Jason Dillaman <jdillama>

Component:

qpid-cpp

Assignee:

Alan Conway <aconway>

Status:

CLOSED ERRATA

QA Contact:

Leonid Zhaldybin <lzhaldyb>

Severity:

unspecified

Docs Contact:

Priority:

high

Version:

2.0

CC:

esammons, jross, lzhaldyb, tross

Target Milestone:

2.3

Keywords:

Patch

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

qpid-0.18

Doc Type:

Bug Fix

Doc Text:

Cause: Federated links can issue cluster events before the link connection is fully established (that is, before the protocol handshake is complete).
Consequence: Cluster members receive an event for an unknown federated link connection, which causes the members to leave the cluster.
Fix: Federated link IO processing is delayed until after the connection is fully established.
Result: Cluster members do not leave the cluster when federated to a non-responsive peer.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-03-06 18:55:19 UTC

Bug Blocks:

698367

Attachments:

Description	Flags
Backtrace from first broker to crash	none
Backtrace from second broker to crash	none

Description Jason Dillaman 2012-03-08 23:34:23 UTC

Description of problem:
While running performance longevity tests, all brokers within a cluster repeatedly crash.  Logs indicate that each crash occurs shortly after the loss and re-establishment of a federated link.

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_3.el6.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Configure a federated, clustered collection of brokers
2. Send and receive messages at a high throughput
  
Actual results:
Broker cluster crashes

Expected results:
Broker cluster does not crash
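
The timing relationship noted under "Description of problem" (crash shortly after a federated link is re-established) can be checked mechanically against a broker log. The following is a minimal sketch, not part of qpid: it assumes syslog-prefixed qpidd lines whose body has the form `YYYY-MM-DD HH:MM:SS level message`, as in the debug logs quoted in Comment 5, and the function name `seconds_from_link_to_crash` is hypothetical.

```python
import re
from datetime import datetime

# Body of a qpidd log line: "<YYYY-MM-DD HH:MM:SS> <level> <message>".
LINE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)")


def seconds_from_link_to_crash(lines):
    """Return the seconds between the most recent 'link established' line and
    the first critical 'Unknown connection' line after it, or None."""
    established = None
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue  # not a qpidd line in the assumed format
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        level, message = m.group(2), m.group(3)
        if message.startswith("Inter-broker link established"):
            established = stamp
        elif level == "critical" and "Unknown connection" in message:
            if established is not None:
                return (stamp - established).total_seconds()
    return None


# Two lines taken from the log excerpt in Comment 5.
sample = [
    "Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info "
    "Inter-broker link established to HOST2:10000",
    "Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 critical "
    "Error delivering frames: Unknown connection: Frame[BEbe; channel=0; "
    "{ClusterConnectionDeliverDoOutputBody: limit=2048; }] control "
    "2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)",
]
print(seconds_from_link_to_crash(sample))  # prints 2.0
```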

Comment 2 Jason Dillaman 2012-03-08 23:36:01 UTC

Created attachment 568768
Backtrace from first broker to crash

Comment 3 Jason Dillaman 2012-03-08 23:37:05 UTC

Created attachment 568769
Backtrace from second broker to crash

Comment 5 Jason Dillaman 2012-06-27 14:10:47 UTC

The issue was recently reproduced repeatedly in a client environment where debug-level logs were available.  The chain of events appears to start with a non-responsive federated link.

Sequence of Events:

1) Cluster elder establishes inter-broker link

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug Inter-broker link connecting to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Set TCP_NODELAY on connection to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Inter-broker link established to HOST2:10000
  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug cluster(2.0.0.0:22138 READY) local connection HOST1:54581-HOST2:10000(2.0.0.0:22138-14 local)

2) Cluster elder sends AMQP init frame

  Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug SENT [HOST1:54581-HOST2:10000] INIT(0-10)

3) Cluster elder never receives any data from the federated peer; the connection is never announced

4) Link ioThreadProcessing fires and attempts to create bridges over federated link

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Link::ioThreadProcessing()
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: 0x7fad0006ff10
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE1 to EXCHANGE1
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: 0x7fad00094ce0
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: attached on broker.
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE2 to EXCHANGE2

5) Cluster shuts down because connection was never announced

  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Exception constructed: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 critical Error delivering frames: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
  Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 notice cluster(2.0.0.0:22138 LEFT) leaving cluster CLUSTER_NAME
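
The sequence above can be modelled in a few lines. The following is a toy sketch in Python, not qpid code: every class and method name in it (`ToyCluster`, `ToyLink`, `announce`, `deliver_do_output`, `io_thread_processing`, `handshake_complete`, `wait_for_handshake`) is hypothetical and only mirrors the roles seen in the logs. It illustrates how a link that starts IO processing before its connection has been announced triggers an "Unknown connection" error, and how deferring that processing until the handshake completes avoids the error.

```python
class UnknownConnection(Exception):
    """Raised when a cluster event names a connection no member knows about."""


class ToyCluster:
    def __init__(self):
        self.known = set()  # connection ids that have been announced

    def announce(self, conn_id):
        self.known.add(conn_id)

    def deliver_do_output(self, conn_id):
        # Mirrors step 5: an event arrives for an unannounced connection.
        if conn_id not in self.known:
            raise UnknownConnection(conn_id)
        return "ok"


class ToyLink:
    def __init__(self, cluster, conn_id, wait_for_handshake):
        self.cluster = cluster
        self.conn_id = conn_id
        self.wait_for_handshake = wait_for_handshake
        self.established = False

    def handshake_complete(self):
        # Only reached once the peer has answered the protocol handshake.
        self.established = True
        self.cluster.announce(self.conn_id)

    def io_thread_processing(self):
        # With the guard enabled, do nothing until the handshake is done.
        if self.wait_for_handshake and not self.established:
            return "deferred"
        return self.cluster.deliver_do_output(self.conn_id)


def run(wait_for_handshake):
    """Simulate a peer that never responds, as in step 3."""
    link = ToyLink(ToyCluster(), "2.0.0.0:22138-14", wait_for_handshake)
    try:
        return link.io_thread_processing()
    except UnknownConnection:
        return "unknown connection"


print(run(False))  # prints "unknown connection"
print(run(True))   # prints "deferred"
```

In this model the guard in `io_thread_processing` plays the part of the "delay until fully established" behaviour; how the real broker implements that delay is not shown here.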

Comment 7 Jason Dillaman 2012-07-12 15:23:29 UTC

Simplified Steps to Reproduce:

1. Configure a clustered broker
2. Start netcat in listen mode (nc -l 6000)
3. Add a route between the clustered broker and nc (qpid-route queue add localhost:5672 localhost:6000 amq.fanout foo)
4. Clustered broker shuts down
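
Where netcat is not available, the non-responsive peer in step 2 can be approximated with a small listener that accepts connections and never replies. This is a minimal sketch using only the Python standard library; the function name `start_silent_peer` is hypothetical, the default port simply matches the one used in the steps above, and the script has not been tested against a broker.

```python
import socket
import threading


def start_silent_peer(host="127.0.0.1", port=6000):
    """Listen on host:port, accept connections, discard input, never reply.

    Returns the listening socket so the caller can read the bound address.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(5)

    def drain(conn):
        # Swallow whatever the other side sends (e.g. an AMQP init frame).
        with conn:
            try:
                while conn.recv(4096):
                    pass
            except OSError:
                pass

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:  # listening socket is no longer usable
                return
            threading.Thread(target=drain, args=(conn,), daemon=True).start()

    # Daemon threads: the peer stops when the main program exits.
    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

Calling `start_silent_peer()` and then keeping the process alive (for example by waiting on `input()`) stands in for `nc -l 6000` in step 2.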

Comment 8 Justin Ross 2012-11-14 20:54:42 UTC

Merged to 0.18 r1362653

Comment 11 Leonid Zhaldybin 2013-01-08 09:55:51 UTC

Tested on RHEL5.9 and RHEL6.3 (both i386 and x86_64). This issue has been fixed.

Packages used for testing:

RHEL5.9
qpid-cpp-client-0.18-13.el5
qpid-cpp-client-devel-0.18-13.el5
qpid-cpp-client-devel-docs-0.18-13.el5
qpid-cpp-client-ssl-0.18-13.el5
qpid-cpp-server-0.18-13.el5
qpid-cpp-server-cluster-0.18-13.el5
qpid-cpp-server-devel-0.18-13.el5
qpid-cpp-server-ssl-0.18-13.el5
qpid-cpp-server-store-0.18-13.el5
qpid-cpp-server-xml-0.18-13.el5
qpid-java-client-0.18-6.el5
qpid-java-common-0.18-6.el5
qpid-java-example-0.18-6.el5
qpid-qmf-0.18-13.el5
qpid-qmf-devel-0.18-13.el5
qpid-tools-0.18-7.el5

RHEL6.3
qpid-cpp-client-0.18-13.el6
qpid-cpp-client-devel-0.18-13.el6
qpid-cpp-client-devel-docs-0.18-13.el6
qpid-cpp-server-0.18-13.el6
qpid-cpp-server-cluster-0.18-13.el6
qpid-cpp-server-devel-0.18-13.el6
qpid-cpp-server-store-0.18-13.el6
qpid-cpp-server-xml-0.18-13.el6
qpid-java-client-0.18-6.el6
qpid-java-common-0.18-6.el6
qpid-java-example-0.18-6.el6
qpid-qmf-0.18-13.el6
qpid-tools-0.18-7.el6_3

-> VERIFIED

Comment 13 errata-xmlrpc 2013-03-06 18:55:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html