Description of problem: I was able to achieve situation that qpidd cluster is up and running and after applying the load by perftest and tsxtest I end up with still working cluster, but different nodes are reporting different cluster widths: root@mrg-qe-12:~/tools# for i in 9 10 11 12; do qpid-cluster 192.168.157.${i}; done Cluster Name: mrg_pepe_test_cluster Cluster Status: ACTIVE Cluster Size: 4 Members: ID=192.168.157.9:29712 URL=amqp:tcp:10.34.33.62:5672,tcp:192.168.157.9:5672 : ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672 : ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672 : ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672 Cluster Name: mrg_pepe_test_cluster Cluster Status: ACTIVE Cluster Size: 3 Members: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672 : ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672 : ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672 Cluster Name: mrg_pepe_test_cluster Cluster Status: ACTIVE Cluster Size: 4 Members: ID=192.168.157.9:29712 URL=amqp:tcp:10.34.33.62:5672,tcp:192.168.157.9:5672 : ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672 : ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672 : ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672 Cluster Name: mrg_pepe_test_cluster Cluster Status: ACTIVE Cluster Size: 3 Members: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672 : ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672 : ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672 This might be issue of qpidd-cluster but I believe that management gets out of synch, because cluster is replicating well. I also checked the openais config and log and everything working fine, no issues seen on openais level. Version-Release number of selected component (if applicable): MRG 1.2 packages: root@mrg-qe-12:~/tools# rpm -qa | grep -E '(qpid|qmf|openais)' | sort -u openais-0.80.6-8.el5_4.1 openais-debuginfo-0.80.6-8.el5_4.1 python-qpid-0.5.752581-4.el5 qmf-0.5.752581-34.el5 qmf-devel-0.5.752581-34.el5 qpidc-0.5.752581-34.el5 qpidc-debuginfo-0.5.752581-34.el5 qpidc-devel-0.5.752581-34.el5 qpidc-rdma-0.5.752581-34.el5 qpidc-ssl-0.5.752581-34.el5 qpidd-0.5.752581-34.el5 qpidd-acl-0.5.752581-34.el5 qpidd-cluster-0.5.752581-34.el5 qpidd-devel-0.5.752581-34.el5 qpid-dotnet-0.4.738274-2.el5 qpidd-rdma-0.5.752581-34.el5 qpidd-ssl-0.5.752581-34.el5 qpidd-xml-0.5.752581-34.el5 qpid-java-client-0.5.751061-9.el5 qpid-java-common-0.5.751061-9.el5 How reproducible: <10% on mrg-qe-09...12 cluster Steps to Reproduce: 1. start all nodes with service qpidd stop ; service openais stop rm -rf /var/lib/qpidd/rhm /var/lib/qpidd/cluster_bak/ /tmp/*.log service openais start ; service qpidd start netstat -tlnp | grep qpidd 2. run info thread by while true; do qpid-cluster;usleep ${RANDOM};qpid-stat -b;usleep ${RANDOM};qpid-config queues; usleep ${RANDOM}; done 3. run perftest thread on one of nodes err_cnt=0;cnt=0 while true; do perftest_opts="" [ $((${RANDOM}%2)) == 1 ] && perftest_opts="--durable yes" ts=${SECONDS} perftest --user=guest --password=guest --count $(( (${RANDOM}+60000) / 2 )) -s --unique-data yes ${perftest_opts} ecode=$? usleep ${RANDOM} [ ${ecode} != 0 ] && let "err_cnt+" echo "ecode:${ecode}, dur:$((${SECONDS}-${ts})), tst_cnt:${cnt}, err_cnt:${err_cnt} (${perftest_opts})" let "cnt++" done 4. in another window launch tsxtest err_cnt=0;cnt=0 while true; do rnd_bit=$(($RANDOM % 2)) host_a=192.168.157.1${rnd_bit} if [ "${rnd_bit}" == "0" ]; then host_b=192.168.157.11 else host_b=192.168.157.10 fi tsxtest_opts="" [ $((${RANDOM}%2)) == 1 ] && tsxtest_opts="--durable_msg" ts=${SECONDS} ./tsxtest --user guest --password guest -h ${host_a} -H ${host_b} --supress --TCs 20 --report ${tsxtest_opts} | tail ecode=$? [ ${ecode} != 0 ] && let "err_cnt+" echo "ecode:${ecode}, dur:$((${SECONDS}-${ts})), tst_cnt:${cnt}, err_cnt:${err_cnt} (${host_a}<>${host_b},${tsxtest_opts})" let "cnt++" done 4. Watch info thread for cluster widths, or do this on another node: for i in 9 10 11 12; do qpid-cluster 192.168.157.${i}; done Actual results: both qpid-cluster and qpid-stat -b reports same but different cluster width for all nodes (see above) Expected results: both qpid-cluster and qpid-stat -b should report the same cluster width on all cluster nodes Additional info: There are couple of following messages in node which does not show all cluster mates (it might be unrelated): 2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN 2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.9:29712-8168(shadow) 2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: LOGIN 2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID 2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN 2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.9:29712-8169(shadow) 2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: LOGIN 2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID 2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN 2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.12:30629-2028(shadow) 2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: PLAIN 2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8169 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8169(shadow) 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8167 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8167(shadow) 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8168 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8168(shadow) 2009-dec-15 11:13:43 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) new connection: 192.168.157.10:6022-12153(local) 2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) add local connection 192.168.157.10:6022-12153 2009-dec-15 11:13:43 info SASL: Starting authentication with mechanism: PLAIN
Created attachment 378550 [details] qpidd cluster config and logs package
Note that when this had occurred, creating a queue when connected to one node that reported only three members was nevertheless propagated to all four nodes.
This was fixed as a side effect of r948143, ported to mrg_1.3.x branch: http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=87a548b4b5e525bd887fd486469e63ed9bd79c03
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Nodes occasionally reported different cluster widths in a qpidd cluster.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -Nodes occasionally reported different cluster widths in a qpidd cluster.+Prior to this update, nodes occasionally reported different cluster widths in a qpidd cluster. However, consequent to an independent change to the underlying cluster code, this error no longer occurs, and all nodes now report correct values.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html