Bug 547769

Summary: clustered qpidd: qpid-cluster/qpid-stat -b reports different widths on different nodes while replication is working well on all nodes
Product: Red Hat Enterprise MRG
Component: python-qpid
Version: 1.2
Reporter: Frantisek Reznicek <freznice>
Assignee: Alan Conway <aconway>
QA Contact: Frantisek Reznicek <freznice>
CC: esammons, gsim
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
Prior to this update, nodes occasionally reported different cluster widths in a qpidd cluster. However, consequent to an independent change to the underlying cluster code, this error no longer occurs, and all nodes now report correct values.
Last Closed: 2010-10-14 16:11:21 UTC
Attachments: qpidd cluster config and logs package (see comment 1)

Description Frantisek Reznicek 2009-12-15 16:38:10 UTC
Description of problem:

I was able to reach a state where the qpidd cluster is up and running and, after applying load with perftest and tsxtest, the cluster still works but different nodes report different cluster widths:

  root@mrg-qe-12:~/tools# for i in 9 10 11 12; do qpid-cluster 192.168.157.${i}; done
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 4
	 Members: ID=192.168.157.9:29712 URL=amqp:tcp:10.34.33.62:5672,tcp:192.168.157.9:5672
        	: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 3
	 Members: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 4
	 Members: ID=192.168.157.9:29712 URL=amqp:tcp:10.34.33.62:5672,tcp:192.168.157.9:5672
        	: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672
    Cluster Name: mrg_pepe_test_cluster
  Cluster Status: ACTIVE
    Cluster Size: 3
	 Members: ID=192.168.157.10:6022 URL=amqp:tcp:10.34.33.63:5672,tcp:192.168.157.10:5672
        	: ID=192.168.157.11:7800 URL=amqp:tcp:10.34.33.64:5672,tcp:192.168.157.11:5672
        	: ID=192.168.157.12:30629 URL=amqp:tcp:10.34.33.65:5672,tcp:192.168.157.12:5672

This might be an issue in qpidd-cluster itself, but I believe the management layer is getting out of sync, because the cluster is replicating correctly.

I also checked the openais configuration and logs; everything is working fine and no issues are seen at the openais level.
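
One quick way to see the discrepancy is to compare, for every node, the membership view reported by management with the replicated state actually visible on that node. A minimal sketch (node addresses are the ones from this cluster; it assumes qpid-config accepts -a/--broker-addr to select the broker):

  for i in 9 10 11 12; do
    host=192.168.157.${i}
    echo "=== ${host} ==="
    # Management view: reported status and cluster size on this node
    qpid-cluster ${host} | grep -E 'Cluster (Status|Size)'
    # Replicated state: queue listing as seen from this node
    # (line count is only a rough consistency check)
    echo "queue listing lines: $(qpid-config -a ${host} queues | wc -l)"
  done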


Version-Release number of selected component (if applicable):
MRG 1.2 packages:
root@mrg-qe-12:~/tools# rpm -qa | grep -E '(qpid|qmf|openais)' | sort -u
openais-0.80.6-8.el5_4.1
openais-debuginfo-0.80.6-8.el5_4.1
python-qpid-0.5.752581-4.el5
qmf-0.5.752581-34.el5
qmf-devel-0.5.752581-34.el5
qpidc-0.5.752581-34.el5
qpidc-debuginfo-0.5.752581-34.el5
qpidc-devel-0.5.752581-34.el5
qpidc-rdma-0.5.752581-34.el5
qpidc-ssl-0.5.752581-34.el5
qpidd-0.5.752581-34.el5
qpidd-acl-0.5.752581-34.el5
qpidd-cluster-0.5.752581-34.el5
qpidd-devel-0.5.752581-34.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-34.el5
qpidd-ssl-0.5.752581-34.el5
qpidd-xml-0.5.752581-34.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5


How reproducible:
Less than 10% of attempts on the mrg-qe-09...12 cluster

Steps to Reproduce:
1. Start all nodes with:
   service qpidd stop ; service openais stop
   rm -rf /var/lib/qpidd/rhm /var/lib/qpidd/cluster_bak/  /tmp/*.log
   service openais start ; service qpidd start
   netstat -tlnp | grep qpidd
2. Run an info thread:
   while true; do qpid-cluster;usleep ${RANDOM};qpid-stat -b;usleep ${RANDOM};qpid-config queues; usleep ${RANDOM}; done

3. Run a perftest thread on one of the nodes:
  err_cnt=0; cnt=0
  while true; do
    perftest_opts=""
    [ $((${RANDOM}%2)) == 1 ] && perftest_opts="--durable yes"
    ts=${SECONDS}
    perftest --user=guest --password=guest --count $(( (${RANDOM}+60000) / 2 )) -s --unique-data yes ${perftest_opts}
    ecode=$?
    usleep ${RANDOM}
    [ ${ecode} != 0 ] && let "err_cnt++"
    echo "ecode:${ecode}, dur:$((${SECONDS}-${ts})), tst_cnt:${cnt}, err_cnt:${err_cnt} (${perftest_opts})"
    let "cnt++"
  done
4. In another window, launch a tsxtest thread:
  err_cnt=0; cnt=0
  while true; do
    rnd_bit=$(($RANDOM % 2))
    host_a=192.168.157.1${rnd_bit}
    if [ "${rnd_bit}" == "0" ]; then
      host_b=192.168.157.11
    else
      host_b=192.168.157.10
    fi
    tsxtest_opts=""
    [ $((${RANDOM}%2)) == 1 ] && tsxtest_opts="--durable_msg"
    ts=${SECONDS}
    ./tsxtest --user guest --password guest -h ${host_a} -H ${host_b} --supress --TCs 20 --report ${tsxtest_opts} | tail
    ecode=$?
    [ ${ecode} != 0 ] && let "err_cnt++"
    echo "ecode:${ecode}, dur:$((${SECONDS}-${ts})), tst_cnt:${cnt}, err_cnt:${err_cnt} (${host_a}<>${host_b},${tsxtest_opts})"
    let "cnt++"
  done
5. Watch the info thread for the reported cluster widths, or run this from another node:
  for i in 9 10 11 12; do qpid-cluster 192.168.157.${i}; done

  
Actual results:
On any given node, qpid-cluster and qpid-stat -b agree with each other, but different nodes report different cluster widths (see above).


Expected results:
Both qpid-cluster and qpid-stat -b should report the same cluster width on every cluster node.
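
As an illustration of that invariant, a sketch like the following (using the node addresses from the reproducer above) extracts the reported cluster size from each node and flags any disagreement:

  sizes=""
  for i in 9 10 11 12; do
    # Cluster size according to this node's management view
    size=$(qpid-cluster 192.168.157.${i} | awk '/Cluster Size:/ {print $3}')
    echo "node 192.168.157.${i}: size=${size}"
    sizes="${sizes} ${size}"
  done
  # Every node should report one and the same value
  if [ $(echo ${sizes} | tr ' ' '\n' | sort -u | wc -l) -eq 1 ]; then
    echo "OK: all nodes agree on the cluster size"
  else
    echo "MISMATCH: nodes report different cluster sizes"
  fi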

Additional info:

There are a number of messages like the following on the node that does not show all cluster members (this may be unrelated):

2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.9:29712-8168(shadow)
2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: LOGIN
2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID
2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.9:29712-8169(shadow)
2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: LOGIN
2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID
2009-dec-15 11:13:42 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:42 debug 192.168.157.10:6022(READY) new connection: 192.168.157.12:30629-2028(shadow)
2009-dec-15 11:13:42 info SASL: Starting authentication with mechanism: PLAIN
2009-dec-15 11:13:42 info SASL: Authentication succeeded for: guest@QPID
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8169
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8169(shadow)
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8167
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8167(shadow)
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) erasing connection 192.168.157.9:29712-8168
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) deleted connection: 192.168.157.9:29712-8168(shadow)
2009-dec-15 11:13:43 info SASL: Mechanism list: PLAIN ANONYMOUS LOGIN
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) new connection: 192.168.157.10:6022-12153(local)
2009-dec-15 11:13:43 debug 192.168.157.10:6022(READY) add local connection 192.168.157.10:6022-12153
2009-dec-15 11:13:43 info SASL: Starting authentication with mechanism: PLAIN

Comment 1 Frantisek Reznicek 2009-12-15 16:41:16 UTC
Created attachment 378550 [details]
qpidd cluster config and logs package

Comment 2 Gordon Sim 2009-12-15 17:51:04 UTC
Note that while this was occurring, a queue created through the node that reported only three members was nevertheless propagated to all four nodes.
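
A minimal sketch of that kind of check (the queue name and node addresses are only examples, and it assumes qpid-config accepts -a/--broker-addr to select the broker):

  # Create a queue through the node that reports only three members...
  qpid-config -a 192.168.157.10 add queue propagation-check
  # ...then confirm the queue is visible from all four nodes
  for i in 9 10 11 12; do
    echo -n "192.168.157.${i}: "
    qpid-config -a 192.168.157.${i} queues | grep -c propagation-check
  done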

Comment 3 Alan Conway 2010-05-31 15:39:07 UTC
This was fixed as a side effect of r948143, ported to the mrg_1.3.x branch:

http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=87a548b4b5e525bd887fd486469e63ed9bd79c03

Comment 5 Douglas Silas 2010-10-07 14:37:33 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Nodes occasionally reported different cluster widths in a qpidd cluster.

Comment 6 Jaromir Hradilek 2010-10-13 16:28:13 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Nodes occasionally reported different cluster widths in a qpidd cluster.
+Prior to this update, nodes occasionally reported different cluster widths in a qpidd cluster. However, consequent to an independent change to the underlying cluster code, this error no longer occurs, and all nodes now report correct values.

Comment 8 errata-xmlrpc 2010-10-14 16:11:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html