| Summary: | clustered qpidd node leaves the cluster with critical Error delivering frames: Unknown connection: Frame[Bbe;... | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Frantisek Reznicek <freznice> |
| Component: | qpid-cpp | Assignee: | mick <mgoulish> |
| Status: | NEW --- | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | Development | CC: | iboverma, jross, pmoravec |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-01-25 09:50:06 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Issue seen on a RHEL 5.7 cluster; not yet seen on a RHEL 6.2 cluster, although that one is currently stressed more. I have not been able to reproduce this case again. I will double-check soon whether it was caused by the corosync and openais clusters simultaneously sharing the same multicast address space.

The bug above was indeed triggered in a configuration where the same multicast address and port were used by two different clusters, a RHEL 5 (openais) one and a RHEL 6 (corosync) one (2+2 nodes). The test was launched as follows:
1] the RHEL 5 cluster test was launched
2] the RHEL 6 cluster test was launched
3] the RHEL 5 cluster test exited with an error as soon as the first of the RHEL 6 corosync daemons started up

Although the corosync.conf man page states that corosync can operate in a compatibility mode ('compatibility: whitetank', which was used for the test), reality is different and mixing openais and corosync cluster configurations is not officially supported. I am closing this defect as NOT-A-BUG based on the cluster misconfiguration explained above.

-> CLOSED / NOT-A-BUG

Very recently I was able to trigger this issue again in a proper configuration, so I am reopening the report.
The cluster configuration:
The virtualization host runs 4 VM machines: r5i, r5x, r6i and r6x.
Each of those guests has two NICs: eth0 is bridged to the network through the virtualization host NIC, and eth1 is statically configured as follows:
r5i and r5x use the 192.168.1.0/24 network, clustering multicast 226.94.1.1:5405
r6i and r6x use the 192.168.6.0/24 network, clustering multicast 226.94.1.6:5405
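For reference, a minimal sketch of how the multicast separation above maps onto the totem configuration of the two clusters. The field names are standard corosync/openais totem options, but the exact files and any settings not stated in the report (secauth, logging, remaining sections) are assumptions:

  # RHEL 6 nodes r6i/r6x -- sketch only; other corosync.conf sections are assumed defaults.
  # (In the earlier, mixed-cluster test 'compatibility: whitetank' was also set here.)
  cat > /etc/corosync/corosync.conf <<'EOF'
  totem {
      version: 2
      secauth: off
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.6.0   # eth1 network of r6i/r6x
          mcastaddr: 226.94.1.6      # clustering multicast from the report
          mcastport: 5405
      }
  }
  EOF
  # The RHEL 5 nodes r5i/r5x use the same totem block in /etc/ais/openais.conf,
  # with bindnetaddr 192.168.1.0 and mcastaddr 226.94.1.1 (port 5405).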
Two tests (with the same cluster-name id) were run concurrently; the RHEL 5 one failed after 48 minutes while the RHEL 6 one passed.
At the end, one of the machines in the RHEL 5 cluster (r5i ~ RHEL 5.7 i386) left the cluster with this message:
2012-02-16 12:38:35 info Queue "test_cluster_message_ttl_performance_qn": Policy created: type=reject; maxCount=0; maxSize=37773952
2012-02-16 12:38:35 info Queue "test_cluster_message_ttl_performance_qn": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=30219162, flowResumeSize=26441766
2012-02-16 12:38:38 info ClusterClock task late 1 times by 61ms on average.
2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) configuration change: 192.168.1.5:23794
2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) Members left: 192.168.1.6:24694
2012-02-16 12:38:42 info cluster(192.168.1.5:23794 READY) member update: 192.168.1.5:23794(member) frameSeq=13137710
2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY)Sole member of cluster, marking store clean.
2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) last broker standing, update queue policies
2012-02-16 12:38:43 notice cluster(192.168.1.5:23794 READY) configuration change: 192.168.1.5:23794 192.168.1.6:24694
2012-02-16 12:38:43 notice cluster(192.168.1.5:23794 READY) Members joined: 192.168.1.6:24694
2012-02-16 12:38:44 critical Error delivering frames: Unknown connection: Frame[Bbe; channel=1; {MessageTransferBody: destination=qmf.default.topic; accept-mode=1; acquire-mode=0; }] data 192.168.1.6:24694-8 (qpid/cluster/Cluster.cpp:551)
2012-02-16 12:38:44 notice cluster(192.168.1.5:23794 LEFT) leaving cluster mycluster
2012-02-16 12:38:44 notice Shut down
-> ASSIGNED
Description of problem:
A clustered qpidd node leaves the cluster with a critical error:

  Error delivering frames: Unknown connection: Frame[Bbe; channel=1; {MessageTransferBody: destination=qmf.default.topic; accept-mode=1; acquire-mode=0; }] data 192.168.1.5:3455-1 (qpid/cluster/Cluster.cpp:551)

This is observed on a two-node RHEL 5.7 i386+x86_64 cluster running the latest 0.14 bits. The testing scenario is a distributed transaction test (test_cluster_dtx_phased_nonlocal) using qpid-txtest as follows:

dhcp-26-114: qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915 --durable no --size 1024 --broker=dhcp-26-112
  -> all ok
dhcp-26-112: qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915 --durable no --size 1024 --broker=dhcp-26-114
  -> resulting in connection closed
dhcp-26-114: qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915 --durable no --size 1024 --broker=dhcp-26-112
  -> resulting in 'The following extra ids were encountered:'

The dhcp-26-114 broker dies with:

2011-12-20 11:22:10 info ManagementAgent::periodicProcessing task late 1 times by 2937ms on average.
2011-12-20 11:22:11 info Activating links
2011-12-20 11:29:24 notice cluster(192.168.1.6:3642 READY) configuration change: 192.168.1.5:3455 192.168.1.6:3642
2011-12-20 11:29:24 notice cluster(192.168.1.6:3642 READY) Members joined: 192.168.1.5:3455
2011-12-20 11:29:27 critical Error delivering frames: Unknown connection: Frame[Bbe; channel=1; {MessageTransferBody: destination=qmf.default.topic; accept-mode=1; acquire-mode=0; }] data 192.168.1.5:3455-1 (qpid/cluster/Cluster.cpp:551)
2011-12-20 11:29:27 notice cluster(192.168.1.6:3642 LEFT) leaving cluster mycluster
2011-12-20 11:29:27 notice Shut down

Version-Release number of selected component (if applicable):
cman-2.0.115-85.el5_7.2
openais-0.80.6-30.el5_7.4
python-qpid-0.14-1.el5
python-qpid-qmf-0.14-2.el5
python-saslwrapper-0.10-4.el5
qpid-cpp-client-0.14-3.el5
qpid-cpp-client-devel-0.14-3.el5
qpid-cpp-client-devel-docs-0.14-3.el5
qpid-cpp-client-rdma-0.14-3.el5
qpid-cpp-client-ssl-0.14-3.el5
qpid-cpp-mrg-debuginfo-0.14-3.el5
qpid-cpp-server-0.14-3.el5
qpid-cpp-server-cluster-0.14-3.el5
qpid-cpp-server-devel-0.14-3.el5
qpid-cpp-server-rdma-0.14-3.el5
qpid-cpp-server-ssl-0.14-3.el5
qpid-cpp-server-store-0.14-3.el5
qpid-cpp-server-xml-0.14-3.el5
qpid-java-client-0.14-1.el5
qpid-java-common-0.14-1.el5
qpid-java-example-0.14-1.el5
qpid-qmf-0.14-2.el5
qpid-qmf-debuginfo-0.14-2.el5
qpid-qmf-devel-0.14-2.el5
qpid-tests-0.14-1.el5
qpid-tools-0.14-1.el5
rh-qpid-cpp-tests-0.14-3.el5
ruby-qpid-qmf-0.14-2.el5
ruby-saslwrapper-0.10-4.el5
saslwrapper-0.10-4.el5
saslwrapper-debuginfo-0.10-4.el5
saslwrapper-devel-0.10-4.el5
sesame-1.0-2.el5
sesame-debuginfo-1.0-2.el5

How reproducible:
30%

Steps to Reproduce:
See above or below.

Actual results:
Node leaves the cluster with a critical error.

Expected results:
Node should not leave the cluster with a critical error.
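As a convenience, a hedged end-to-end sketch of the reproduction sequence described above. The qpid-txtest invocations are copied from the report, while the qpidd startup options (--cluster-name, --auth, --daemon) are assumptions about how the two brokers were started:

  # Start a clustered broker on each node (dhcp-26-112 and dhcp-26-114); startup
  # options are assumed, 'mycluster' matches the cluster name seen in the broker log.
  qpidd --cluster-name=mycluster --auth no --daemon

  # Phase 1 (on dhcp-26-114): initialise the queues -- reported as 'all ok'
  qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915 --durable no --size 1024 --broker=dhcp-26-112

  # Phase 2 (on dhcp-26-112): transfer -- reported as 'connection closed'
  qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915 --durable no --size 1024 --broker=dhcp-26-114

  # Phase 3 (on dhcp-26-114): check -- reported as 'The following extra ids were encountered:'
  qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915 --durable no --size 1024 --broker=dhcp-26-112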