Bug 801605
| Summary: | Non-responsive peer in federated link can result in entire cluster shutdown | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Jason Dillaman <jdillama> | ||||||
| Component: | qpid-cpp | Assignee: | Alan Conway <aconway> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Leonid Zhaldybin <lzhaldyb> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 2.0 | CC: | esammons, jross, lzhaldyb, tross | ||||||
| Target Milestone: | 2.3 | Keywords: | Patch | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | qpid-0.18 | Doc Type: | Bug Fix | ||||||
| Doc Text: |
Cause: Federated links can issue cluster events prior to the link connection being fully established (protocol handshake complete).
Consequence: Cluster members receive an event for an unknown federated link connection, which causes the members to leave the cluster.
Fix: Delay federated link IO processing until after the connection is fully established.
Result: Cluster members do not leave the cluster when federated to a non-responsive peer.
|
Story Points: | --- | ||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2013-03-06 18:55:19 UTC | Type: | --- | ||||||
| Bug Blocks: | 698367 | ||||||||
Description
Jason Dillaman
2012-03-08 23:34:23 UTC
Created attachment 568768 [details]
Backtrace from first broker to crash
Created attachment 568769 [details]
Backtrace from second broker to crash
Issue was recently reproduced repeatedly in a client environment where debug-level logs were available. The chain of events appears to start with a non-responsive federated link.
Sequence of Events:
1) Cluster elder establishes inter-broker link
Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug Inter-broker link connecting to HOST2:10000
Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Set TCP_NODELAY on connection to HOST2:10000
Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 info Inter-broker link established to HOST2:10000
Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug cluster(2.0.0.0:22138 READY) local connection HOST1:54581-HOST2:10000(2.0.0.0:22138-14 local)
2) Cluster elder sends AMQP init frame
Jun 26 17:41:01 HOST1 qpidd[22138]: 2012-06-26 17:41:01 debug SENT [HOST1:54581-HOST2:10000] INIT(0-10)
3) Cluster elder never receives any data from federated peer / connection never announced
4) Link ioThreadProcessing fires and attempts to create bridges over federated link
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Link::ioThreadProcessing()
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: 0x7fad0006ff10
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.0748c28e-995a-47f3-9216-b1d0b2c54b10: attached on broker.
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE1 to EXCHANGE1
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionState::SessionState @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: 0x7fad00094ce0
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug @QPID.98811980-fc85-4ad8-b662-1a52ad7f3b25: attached on broker.
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug SessionHandler::sendAttach attach id=
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Activated route from queue QUEUE2 to EXCHANGE2
5) Cluster shuts down because connection was never announced
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 debug Exception constructed: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 critical Error delivering frames: Unknown connection: Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=2048; }] control 2.0.0.0:22138-14 (qpid/cluster/Cluster.cpp:542)
Jun 26 17:41:03 HOST1 qpidd[22138]: 2012-06-26 17:41:03 notice cluster(2.0.0.0:22138 LEFT) leaving cluster CLUSTER_NAME
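The sequence above suggests a guard of the following shape: link IO processing does nothing until the handshake with the peer has completed, which is what the Doc Text describes as the fix. The sketch below is a minimal, self-contained C++ model of that idea, not the actual qpid-cpp change. `FederatedLink`, `LinkState`, and every method name other than `ioThreadProcessing` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model only: these are NOT the qpid-cpp classes.
enum class LinkState { Connecting, HeaderSent, Established };

class FederatedLink {
public:
    // Queue a bridge (route) to be created once the link is usable.
    void addBridge(const std::string& name) { pending_.push_back(name); }

    // TCP connection is up and our protocol header has been sent (step 2).
    void onHeaderSent() { state_ = LinkState::HeaderSent; }

    // The peer has answered and the protocol handshake is complete.
    void onHandshakeComplete() {
        assert(state_ == LinkState::HeaderSent);  // header must go out first
        state_ = LinkState::Established;
    }

    // Periodic IO-thread hook. Returns the number of bridges activated.
    // The guard is the point of the sketch: before Established nothing is
    // activated, so no bridge traffic is produced for a connection that
    // the rest of the cluster has not yet been told about.
    std::size_t ioThreadProcessing() {
        if (state_ != LinkState::Established) return 0;
        const std::size_t activated = pending_.size();
        for (const auto& bridge : pending_) active_.push_back(bridge);
        pending_.clear();
        return activated;
    }

    const std::vector<std::string>& activeBridges() const { return active_; }

private:
    LinkState state_ = LinkState::Connecting;
    std::vector<std::string> pending_;
    std::vector<std::string> active_;
};
```

With a peer that never answers (step 3), `onHandshakeComplete()` is never called in this model, so `ioThreadProcessing()` keeps returning 0 and the queued bridges stay pending instead of being activated as in step 4.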
Simplified Steps to Reproduce:
1. Configure a clustered broker
2. Start netcat in listen mode (nc -l 6000)
3. Add a route between the clustered broker and nc (qpid-route queue add localhost:5672 localhost:6000 amq.fanout foo)
4. Clustered broker shuts down

Merged to 0.18 r1362653

Tested on RHEL5.9 and RHEL6.3 (both i386 and x86_64). This issue has been fixed.
Packages used for testing:
RHEL5.9
qpid-cpp-client-0.18-13.el5
qpid-cpp-client-devel-0.18-13.el5
qpid-cpp-client-devel-docs-0.18-13.el5
qpid-cpp-client-ssl-0.18-13.el5
qpid-cpp-server-0.18-13.el5
qpid-cpp-server-cluster-0.18-13.el5
qpid-cpp-server-devel-0.18-13.el5
qpid-cpp-server-ssl-0.18-13.el5
qpid-cpp-server-store-0.18-13.el5
qpid-cpp-server-xml-0.18-13.el5
qpid-java-client-0.18-6.el5
qpid-java-common-0.18-6.el5
qpid-java-example-0.18-6.el5
qpid-qmf-0.18-13.el5
qpid-qmf-devel-0.18-13.el5
qpid-tools-0.18-7.el5
RHEL6.3
qpid-cpp-client-0.18-13.el6
qpid-cpp-client-devel-0.18-13.el6
qpid-cpp-client-devel-docs-0.18-13.el6
qpid-cpp-server-0.18-13.el6
qpid-cpp-server-cluster-0.18-13.el6
qpid-cpp-server-devel-0.18-13.el6
qpid-cpp-server-store-0.18-13.el6
qpid-cpp-server-xml-0.18-13.el6
qpid-java-client-0.18-6.el6
qpid-java-common-0.18-6.el6
qpid-java-example-0.18-6.el6
qpid-qmf-0.18-13.el6
qpid-tools-0.18-7.el6_3
-> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2013-0561.html
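In the Simplified Steps to Reproduce, `nc -l 6000` plays the non-responsive peer: it accepts the TCP connection but never sends anything back. The following is a rough C++ sketch, using plain POSIX sockets on loopback, of what that looks like from the connecting side. The helper names (`openListener`, `connectWithTimeout`, `peerStayedSilent`) are hypothetical and unrelated to qpid-cpp; the sketch does not speak AMQP and is not a substitute for the `qpid-route` step.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdint>

// Open a TCP listener on 127.0.0.1. Pass *port == 0 to let the kernel pick
// a free port. Returns the listening fd (or -1) and writes back the port.
int openListener(std::uint16_t* port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(*port);
    socklen_t len = sizeof(addr);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), len) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

// Connect to 127.0.0.1:port with a receive timeout, so that a silent peer
// shows up as a timed-out read instead of blocking forever.
int connectWithTimeout(std::uint16_t port, int timeoutMs) {
    assert(timeoutMs > 0);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    timeval tv{};
    tv.tv_sec = timeoutMs / 1000;
    tv.tv_usec = (timeoutMs % 1000) * 1000;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
        connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// True if the read timed out, i.e. the peer sent nothing within the timeout.
bool peerStayedSilent(int fd) {
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    return n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```

A caller would open the listener, connect to it, write its opening bytes, and then see `peerStayedSilent()` return true because the listening side never writes a reply. That is the situation in which the connection exists at the TCP level but the protocol handshake never completes.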