Bug 769247

Summary: clustered qpidd node leaves the cluster with critical Error delivering frames: Unknown connection: Frame[Bbe;...
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: Development
Status: NEW
Severity: high
Priority: high
Hardware: Unspecified
OS: Unspecified
Reporter: Frantisek Reznicek <freznice>
Assignee: mick <mgoulish>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: iboverma, jross, pmoravec
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2012-01-25 09:50:06 UTC

Description Frantisek Reznicek 2011-12-20 11:18:28 UTC
Description of problem:

clustered qpidd node leaves the cluster with critical Error delivering frames: Unknown connection: Frame[Bbe; channel=1; {MessageTransferBody: destination=qmf.default.topic; accept-mode=1; acquire-mode=0; }] data 192.168.1.5:3455-1 (qpid/cluster/Cluster.cpp:551)

This is observed on a two-node rhel5.7 i386+x86_64 cluster running the latest 0.14 bits.

The test scenario is a distributed transaction test (test_cluster_dtx_phased_nonlocal) driven by qpid-txtest in the following way (a consolidated sketch of the three phases follows the commands):

dhcp-26-114: qpid-txtest --dtx yes --init 1 --transfer 0 --check 0 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915   --durable no --size 1024 --broker=dhcp-26-112
 -> all ok
dhcp-26-112: qpid-txtest --dtx yes --init 0 --transfer 1 --check 0 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915   --durable no --size 1024 --broker=dhcp-26-114
 -> resulting in connection closed

dhcp-26-114: qpid-txtest --dtx yes --init 0 --transfer 0 --check 1 --queues=12 --messages-per-tx=5 --tx-count=8 --total-messages=915   --durable no --size 1024 --broker=dhcp-26-112
 -> resulting in 'The following extra ids were encountered:'
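
For convenience, the three phases can also be written down as a single shell sketch. The option set is copied verbatim from the commands above; the script is illustrative only, not part of the original test harness, and each phase is issued from the node named in its comment:

  #!/bin/sh
  # Illustrative reproduction sketch for test_cluster_dtx_phased_nonlocal.
  # Each phase is run from the node named in its comment, pointing at the
  # opposite cluster member, exactly as in the commands listed above.
  OPTS="--dtx yes --queues=12 --messages-per-tx=5 --tx-count=8 \
        --total-messages=915 --durable no --size 1024"

  # Phase 1 (run on dhcp-26-114): initialize the queues via dhcp-26-112
  qpid-txtest $OPTS --init 1 --transfer 0 --check 0 --broker=dhcp-26-112

  # Phase 2 (run on dhcp-26-112): perform the transacted transfers via dhcp-26-114
  qpid-txtest $OPTS --init 0 --transfer 1 --check 0 --broker=dhcp-26-114

  # Phase 3 (run on dhcp-26-114): check the received message ids via dhcp-26-112
  qpid-txtest $OPTS --init 0 --transfer 0 --check 1 --broker=dhcp-26-112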

dhcp-26-114 broker dies with:
  2011-12-20 11:22:10 info ManagementAgent::periodicProcessing task late 1 times by 2937ms on average.
  2011-12-20 11:22:11 info Activating links
  2011-12-20 11:29:24 notice cluster(192.168.1.6:3642 READY) configuration change: 192.168.1.5:3455 192.168.1.6:3642
  2011-12-20 11:29:24 notice cluster(192.168.1.6:3642 READY) Members joined: 192.168.1.5:3455
  2011-12-20 11:29:27 critical Error delivering frames: Unknown connection: Frame[Bbe; channel=1; {MessageTransferBody: destination=qmf.default.topic; accept-mode=1; acquire-mode=0; }] data 192.168.1.5:3455-1 (qpid/cluster/Cluster.cpp:551)
  2011-12-20 11:29:27 notice cluster(192.168.1.6:3642 LEFT) leaving cluster mycluster
  2011-12-20 11:29:27 notice Shut down

Version-Release number of selected component (if applicable):
  cman-2.0.115-85.el5_7.2
  openais-0.80.6-30.el5_7.4
  python-qpid-0.14-1.el5
  python-qpid-qmf-0.14-2.el5
  python-saslwrapper-0.10-4.el5
  qpid-cpp-client-0.14-3.el5
  qpid-cpp-client-devel-0.14-3.el5
  qpid-cpp-client-devel-docs-0.14-3.el5
  qpid-cpp-client-rdma-0.14-3.el5
  qpid-cpp-client-ssl-0.14-3.el5
  qpid-cpp-mrg-debuginfo-0.14-3.el5
  qpid-cpp-server-0.14-3.el5
  qpid-cpp-server-cluster-0.14-3.el5
  qpid-cpp-server-devel-0.14-3.el5
  qpid-cpp-server-rdma-0.14-3.el5
  qpid-cpp-server-ssl-0.14-3.el5
  qpid-cpp-server-store-0.14-3.el5
  qpid-cpp-server-xml-0.14-3.el5
  qpid-java-client-0.14-1.el5
  qpid-java-common-0.14-1.el5
  qpid-java-example-0.14-1.el5
  qpid-qmf-0.14-2.el5
  qpid-qmf-debuginfo-0.14-2.el5
  qpid-qmf-devel-0.14-2.el5
  qpid-tests-0.14-1.el5
  qpid-tools-0.14-1.el5
  rh-qpid-cpp-tests-0.14-3.el5
  ruby-qpid-qmf-0.14-2.el5
  ruby-saslwrapper-0.10-4.el5
  saslwrapper-0.10-4.el5
  saslwrapper-debuginfo-0.10-4.el5
  saslwrapper-devel-0.10-4.el5
  sesame-1.0-2.el5
  sesame-debuginfo-1.0-2.el5


How reproducible:
30%

Steps to Reproduce:
See above or below
  
Actual results:
Node leaves cluster with critical error.

Expected results:
Node should not leave cluster with critical error.

Comment 2 Frantisek Reznicek 2011-12-20 11:24:54 UTC
The issue is seen on the RHEL5.7 cluster; it has not yet been seen on the RHEL6.2 cluster, although that one is currently being stressed more.

Comment 4 Frantisek Reznicek 2012-01-11 09:06:07 UTC
I have not been able to reproduce this case again. I will soon double-check whether it was caused by the corosync and openais clusters simultaneously sharing the same multicast address space.
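
As a quick cross-check (a sketch only, assuming the default configuration paths of each stack), the multicast address and port each cluster actually uses can be read from its totem settings:

  # RHEL5 (openais) nodes -- assuming the default /etc/ais/openais.conf
  grep -E 'mcastaddr|mcastport' /etc/ais/openais.conf

  # RHEL6 (corosync) nodes -- same keys, assuming the default /etc/corosync/corosync.conf
  grep -E 'mcastaddr|mcastport' /etc/corosync/corosync.conf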

Comment 5 Frantisek Reznicek 2012-01-25 09:50:06 UTC
The bug above was indeed triggered in a configuration where the same multicast address and port were used for two different rhel5 (openais) and rhel6 (corosync) clusters (2+2 nodes).

The test was launched as follows:
1] rhel5 cluster test launched
2] rhel6 cluster test launched
3] the rhel5 cluster test exited with an error as soon as the first of the rhel6 corosync daemons started up

Although the corosync.conf man page states that corosync can operate in compatibility mode 'compatibility: whitetank' (which was used for the test), in practice a mixed openais+corosync cluster configuration is not officially supported.
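
For illustration, the settings that must differ between the two clusters live in the totem/interface section. Below is a minimal sketch of a corosync.conf for the rhel6 side; the values are examples rather than the original files, and the openais side takes the same keys in its own configuration file:

  # /etc/corosync/corosync.conf (sketch, example values)
  compatibility: whitetank         # compatibility mode mentioned above

  totem {
      version: 2
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.6.0   # cluster interconnect subnet
          mcastaddr: 226.94.1.6      # must not collide with the other cluster's group
          mcastport: 5405
      }
  }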

I'm closing this defect as NOT-A-BUG based on the cluster misconfiguration explained above.

-> CLOSED / NOT-A-BUG

Comment 6 Frantisek Reznicek 2012-02-16 13:14:45 UTC
Very recently I was able to trigger this issue again in a proper configuration, so I'm reopening the report.

The cluster configuration description:

The virtualization host runs four VMs: r5i, r5x, r6i and r6x.
Each of those guests has two NICs: eth0 is bridged to the network through the virtualization host's NIC, and eth1 is statically configured as follows (an illustrative ifcfg sketch follows the subnet list):

r5i and r5x using 192.168.1.0/24 network, clustering multicast 226.94.1.1:5405
r6i and r6x using 192.168.6.0/24 network, clustering multicast 226.94.1.6:5405
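
For reference, the static eth1 setup on each guest boils down to a few lines in /etc/sysconfig/network-scripts/ifcfg-eth1. The sketch below is hypothetical for r5i, using 192.168.1.5 (an address that appears in the broker log later) as an example; the per-host addresses are otherwise not spelled out above:

  # /etc/sysconfig/network-scripts/ifcfg-eth1 on r5i (sketch, example address
  # on the 192.168.1.0/24 interconnect used by the rhel5 cluster)
  DEVICE=eth1
  BOOTPROTO=static
  IPADDR=192.168.1.5
  NETMASK=255.255.255.0
  ONBOOT=yes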


Two tests (using the same cluster-name id) were launched concurrently; the rhel5 one failed after 48 minutes, while the rhel6 one passed.

In the end, one of the machines in the rhel5 cluster (r5i ~ rhel5.7 i386) left the cluster with the message:

  2012-02-16 12:38:35 info Queue "test_cluster_message_ttl_performance_qn": Policy created: type=reject; maxCount=0; maxSize=37773952
  2012-02-16 12:38:35 info Queue "test_cluster_message_ttl_performance_qn": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=30219162, flowResumeSize=26441766
  2012-02-16 12:38:38 info ClusterClock task late 1 times by 61ms on average.
  2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) configuration change: 192.168.1.5:23794
  2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) Members left: 192.168.1.6:24694
  2012-02-16 12:38:42 info cluster(192.168.1.5:23794 READY) member update: 192.168.1.5:23794(member) frameSeq=13137710
  2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) Sole member of cluster, marking store clean.
  2012-02-16 12:38:42 notice cluster(192.168.1.5:23794 READY) last broker standing, update queue policies
  2012-02-16 12:38:43 notice cluster(192.168.1.5:23794 READY) configuration change: 192.168.1.5:23794 192.168.1.6:24694
  2012-02-16 12:38:43 notice cluster(192.168.1.5:23794 READY) Members joined: 192.168.1.6:24694
  2012-02-16 12:38:44 critical Error delivering frames: Unknown connection: Frame[Bbe; channel=1; {MessageTransferBody: destination=qmf.default.topic; accept-mode=1; acquire-mode=0; }] data 192.168.1.6:24694-8 (qpid/cluster/Cluster.cpp:551)
  2012-02-16 12:38:44 notice cluster(192.168.1.5:23794 LEFT) leaving cluster mycluster
  2012-02-16 12:38:44 notice Shut down



-> ASSIGNED