Bug 547769

Summary: clustered qpidd: qpid-cluster/qpid-stat -b reports different widths on different nodes while replication is working well on all nodes
Product: Red Hat Enterprise MRG
Component: python-qpid
Version: 1.2
Reporter: Frantisek Reznicek <freznice>
Assignee: Alan Conway <aconway>
QA Contact: Frantisek Reznicek <freznice>
CC: esammons, gsim
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
Prior to this update, nodes occasionally reported different cluster widths in a qpidd cluster. However, consequent to an independent change to the underlying cluster code, this error no longer occurs, and all nodes now report correct values.
Last Closed: 2010-10-14 16:11:21 UTC
Attachments: qpidd cluster config and logs package (see comment 1)

Description Frantisek Reznicek 2009-12-15 16:38:10 UTC
Description of problem:

I was able to reach a state where the qpidd cluster is up and running and, after applying load with perftest and tsxtest, the cluster still works but different nodes report different cluster widths:

  root@mrg-qe-12:~/tools# for i in 9 10 11 12; do qpid-cluster 192.168.157.${i}; done
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 4
	 Members: ID=192.168.157.9:29712 URL=amqp:tcp:10.34.33.62:5672,tcp:192.168.157.9:5672
        	: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 3
	 Members: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 4
	 Members: ID=192.168.157.9:29712 URL=amqp:tcp:10.34.33.62:5672,tcp:192.168.157.9:5672
        	: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 3
	 Members: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672

This might be an issue in qpidd-cluster itself, but I believe the management layer is getting out of sync, because the cluster is replicating correctly.

I also checked the openais configuration and logs; everything is working fine and no issues are seen at the openais level.
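
One quick way to see the discrepancy is to compare, for every node, the membership view reported by management with the replicated state actually visible on that node. A minimal sketch (node addresses are the ones from this cluster; it assumes qpid-config accepts -a/--broker-addr to select the broker):

  for i in 9 10 11 12; do
    host=192.168.157.${i}
    echo "=== ${host} ==="
    # Management view: reported status and cluster size on this node
    qpid-cluster ${host} | grep -E 'Cluster (Status|Size)'
    # Replicated state: queue listing as seen from this node
    # (line count is only a rough consistency check)
    echo "queue listing lines: $(qpid-config -a ${host} queues | wc -l)"
  done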


Version-Release number of selected component (if applicable):
MRG 1.2 packages:
root@mrg-qe-12:~/tools# rpm -qa | grep -E '(qpid|qmf|openais)' | sort -u
openais-0.80.6-8.el5_4.1
openais-debuginfo-0.80.6-8.el5_4.1
python-qpid-0.5.752581-4.el5
qmf-0.5.752581-34.el5
qmf-devel-0.5.752581-34.el5
qpidc-0.5.752581-34.el5
qpidc-debuginfo-0.5.752581-34.el5
qpidc-devel-0.5.752581-34.el5
qpidc-rdma-0.5.752581-34.el5
qpidc-ssl-0.5.752581-34.el5
qpidd-0.5.752581-34.el5
qpidd-acl-0.5.752581-34.el5
qpidd-cluster-0.5.752581-34.el5
qpidd-devel-0.5.752581-34.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-34.el5
qpidd-ssl-0.5.752581-34.el5
qpidd-xml-0.5.752581-34.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5


How reproducible:
Less than 10% of attempts on the mrg-qe-09...12 cluster

Steps to Reproduce:
1. Start all nodes with:
   service qpidd stop ; service openais stop
   rm -rf /var/lib/qpidd/rhm /var/lib/qpidd/cluster_bak/  /tmp/*.log
   service openais start ; service qpidd start
   netstat -tlnp | grep qpidd
2. Run an info thread:
   while true; do qpid-cluster;usleep ${RANDOM};qpid-stat -b;usleep ${RANDOM};qpid-config queues; usleep ${RANDOM}; done

3. Run a perftest thread on one of the nodes:
  err_cnt=0; cnt=0
  while true; do
    perftest_opts=""
    [ $((${RANDOM}%2)) == 1 ] && perftest_opts="--durable yes"
    ts=${SECONDS}
    perftest --user=guest --password=guest --count $(( (${RANDOM}+60000) / 2 )) -s --unique-data yes ${perftest_opts}
    ecode=$?
    usleep ${RANDOM}
    [ ${ecode} != 0 ] && let "err_cnt++"
    echo "ecode:${ecode}, dur:$((${SECONDS}-${ts})), tst_cnt:${cnt}, err_cnt:${err_cnt} (${perftest_opts})"
    let "cnt++"
  done
4. In another window, launch a tsxtest thread:
  err_cnt=0; cnt=0
  while true; do
    rnd_bit=$(($RANDOM % 2))
    host_a=192.168.157.1${rnd_bit}
    if [ "${rnd_bit}" == "0" ]; then
      host_b=192.168.157.11
    else
      host_b=192.168.157.10
    fi
    tsxtest_opts=""
    [ $((${RANDOM}%2)) == 1 ] && tsxtest_opts="--durable_msg"
    ts=${SECONDS}
    ./tsxtest --user guest --password guest -h ${host_a} -H ${host_b} --supress --TCs 20 --report ${tsxtest_opts} | tail
    ecode=$?
    [ ${ecode} != 0 ] && let "err_cnt++"
    echo "ecode:${ecode}, dur:$((${SECONDS}-${ts})), tst_cnt:${cnt}, err_cnt:${err_cnt} (${host_a}<>${host_b},${tsxtest_opts})"
    let "cnt++"
  done
5. Watch the info thread for the reported cluster widths, or run this from another node:
  for i in 9 10 11 12; do qpid-cluster 192.168.157.${i}; done

  
Actual results:
On any given node, qpid-cluster and qpid-stat -b agree with each other, but different nodes report different cluster widths (see above).


Expected results:
Both qpid-cluster and qpid-stat -b should report the same cluster width on every cluster node.
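
As an illustration of that invariant, a sketch like the following (using the node addresses from the reproducer above) extracts the reported cluster size from each node and flags any disagreement:

  sizes=""
  for i in 9 10 11 12; do
    # Cluster size according to this node's management view
    size=$(qpid-cluster 192.168.157.${i} | awk '/Cluster Size:/ {print $3}')
    echo "node 192.168.157.${i}: size=${size}"
    sizes="${sizes} ${size}"
  done
  # Every node should report one and the same value
  if [ $(echo ${sizes} | tr ' ' '\n' | sort -u | wc -l) -eq 1 ]; then
    echo "OK: all nodes agree on the cluster size"
  else
    echo "MISMATCH: nodes report different cluster sizes"
  fi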

Additional info:

There are a number of messages like the following on the node that does not show all cluster members (this may be unrelated):

2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.9:29712-8168(shadow)
2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: LOGIN
2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID
2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.9:29712-8169(shadow)
2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: LOGIN
2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID
2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.12:30629-2028(shadow)
2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: PLAIN
2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8169
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8169(shadow)
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8167
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8167(shadow)
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8168
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8168(shadow)
2009-dec-15 11:13:43 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) new connection: 192.168.157.10:6022-12153(local)
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) add local connection 192.168.157.10:6022-12153
2009-dec-15 11:13:43 info SASL: Starting authentication with mechanism: PLAIN

Comment 1 Frantisek Reznicek 2009-12-15 16:41:16 UTC
Created attachment 378550 [details]
qpidd cluster config and logs package

Comment 2 Gordon Sim 2009-12-15 17:51:04 UTC
Note that while this was occurring, a queue created through the node that reported only three members was nevertheless propagated to all four nodes.
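
A minimal sketch of that kind of check (the queue name and node addresses are only examples, and it assumes qpid-config accepts -a/--broker-addr to select the broker):

  # Create a queue through the node that reports only three members...
  qpid-config -a 192.168.157.10 add queue propagation-check
  # ...then confirm the queue is visible from all four nodes
  for i in 9 10 11 12; do
    echo -n "192.168.157.${i}: "
    qpid-config -a 192.168.157.${i} queues | grep -c propagation-check
  done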

Comment 3 Alan Conway 2010-05-31 15:39:07 UTC
This was fixed as a side effect of r948143, ported to the mrg_1.3.x branch:

http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=87a548b4b5e525bd887fd486469e63ed9bd79c03

Comment 5 Douglas Silas 2010-10-07 14:37:33 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Nodes occasionally reported different cluster widths in a qpidd cluster.

Comment 6 Jaromir Hradilek 2010-10-13 16:28:13 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Nodes occasionally reported different cluster widths in a qpidd cluster.
+Prior to this update, nodes occasionally reported different cluster widths in a qpidd cluster. However, consequent to an independent change to the underlying cluster code, this error no longer occurs, and all nodes now report correct values.

Comment 8 errata-xmlrpc 2010-10-14 16:11:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html