Bug 726494

Summary: clustered broker leaves cluster with 'catch-up connection closed prematurely'
Product: Red Hat Enterprise MRG
Reporter: Frantisek Reznicek <freznice>
Component: qpid-cpp
Assignee: messaging-bugs <messaging-bugs>
Status: CLOSED NOTABUG
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Docs Contact:
Priority: unspecified
Version: 1.3
CC: aconway, esammons, jross
Target Milestone: ---
Target Release: ---
Hardware: athlon
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-08-01 15:05:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
  The last hundreds of log lines from every broker (debug+) (flags: none)
  The automated test for reproducing the issue (flags: none)

Description Frantisek Reznicek 2011-07-28 19:44:32 UTC
Description of problem:

A long-term cluster test uncovered that a node may rarely and unexpectedly leave the cluster with the following messages:
 2011-07-28 18:32:40 debug DISCONNECTED [10.34.33.125:5672-10.34.33.236:44944]
 2011-07-28 18:32:40 critical cluster(192.168.5.2:12582 UPDATEE) catch-up connection closed prematurely 10.34.33.125:5672-10.34.33.236:44944(192.168.5.2:12582-1 local,catchup)
 2011-07-28 18:32:40 notice cluster(192.168.5.2:12582 LEFT) leaving cluster mycluster

This behavior is observed when the cluster test repeatedly performs the cluster toggle test, which changes the cluster configuration. See the additional details below.


Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5_6.4
openais-0.80.6-28.el5_6.1
python-qmf-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
python-saslwrapper-0.1.934605-2.el5
qmf-0.7.946106-33.el5
qmf-devel-0.7.946106-33.el5
qpid-cpp-client-0.7.946106-33.el5
qpid-cpp-client-devel-0.7.946106-33.el5
qpid-cpp-client-devel-docs-0.7.946106-33.el5
qpid-cpp-client-ssl-0.7.946106-33.el5
qpid-cpp-server-0.7.946106-33.el5
qpid-cpp-server-cluster-0.7.946106-33.el5
qpid-cpp-server-devel-0.7.946106-33.el5
qpid-cpp-server-ssl-0.7.946106-33.el5
qpid-cpp-server-store-0.7.946106-33.el5
qpid-cpp-server-xml-0.7.946106-33.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5
ruby-saslwrapper-0.1.934605-2.el5
saslwrapper-0.1.934605-2.el5
saslwrapper-devel-0.1.934605-2.el5
sesame-0.7.4297-6.el5


How reproducible:
~90%; the time it takes to run into trouble varies.

Steps to Reproduce:
1. ./ctests.py --broker-maximize-uptime --broker-options=mgmt-pub-interval~1,log-enable~debug+ --log-to-file=a.log --qmf-data-timeout=120 --stop-on-error --exec-tests=testset_cluster_toggle:16h host_a host_b host_c host_d   
  
Actual results:
The cluster test running cluster toggle tests fails.

Expected results:
The cluster test running cluster toggle tests should not fail.

Additional info:

Rough test log:
  
  [18:28:25] DEBUG:TESTSET 'testset_cluster_toggle' loop 3/28 (57600.0) finished. [tst_cnt:1305, err_cnt:0]
  [18:28:25] DEBUG:TESTSET 'testset_cluster_toggle' loop 4/37 (57600.0) started. [tst_cnt:1305, err_cnt:0]
  [18:28:28] DEBUG:Cluster state: 1 of 4 up and running
  [18:28:28] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 started. [tst_cnt:1305, err_cnt:0]
  [18:28:30] DEBUG:Cluster state: 1 of 4 up and running
  [18:28:30] DEBUG:Go to cluster state: 15 [True, True, True, True]
  [18:28:59] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:29:02] INFO:localhost:'Cluster state: 4 (exp. 4) of 4 up and running', result:True, dur.:-1
  [18:29:08] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:4.68771
  [18:29:08] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:10] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.09119
  [18:29:10] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:13] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.30942
  [18:29:13] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:16] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.33577
  [18:29:16] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:16] DEBUG:Go to cluster state: 14 [True, True, True, False]
  [18:29:28] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:29:30] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
  [18:29:32] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.314
  [18:29:32] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:29:35] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.05801
  [18:29:35] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:29:38] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.25643
  [18:29:38] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:29:38] DEBUG:Go to cluster state: 13 [True, True, False, True]
  [18:29:56] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:29:58] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
  [18:30:01] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.19658
  [18:30:01] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:30:04] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.09778
  [18:30:04] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:30:07] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.26122
  [18:30:07] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:30:07] DEBUG:Go to cluster state: 12 [True, True, False, False]
  [18:30:19] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:30:21] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
  [18:30:24] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.24335
  [18:30:24] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:30:27] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.30828
  [18:30:27] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:30:28] DEBUG:Go to cluster state: 11 [True, False, True, True]
  [18:30:55] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:30:57] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
  [18:31:00] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.37627
  [18:31:00] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:31:03] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.18624
  [18:31:03] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:31:06] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.16036
  [18:31:06] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:31:06] DEBUG:Go to cluster state: 10 [True, False, True, False]
  [18:31:17] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:31:19] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
  [18:31:22] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.28387
  [18:31:22] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:25] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.21398
  [18:31:25] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:26] DEBUG:Go to cluster state: 9 [True, False, False, True]
  [18:31:43] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:31:45] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
  [18:31:48] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.256
  [18:31:48] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:52] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.08481
  [18:31:52] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:52] DEBUG:Go to cluster state: 8 [True, False, False, False]
  [18:32:04] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:32:06] INFO:localhost:'Cluster state: 1 (exp. 1) of 4 up and running', result:True, dur.:-1
  [18:32:09] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.16295
  [18:32:09] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [18:32:10] DEBUG:Go to cluster state: 7 [False, True, True, True]
  [18:32:44] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:32:46] ERROR:localhost:'Cluster state: 2 (exp. 3) of 4 up and running', result:False, dur.:-1

Comment 1 Frantisek Reznicek 2011-07-28 19:46:48 UTC
Created attachment 515788 [details]
The last hundreds of log lines from every broker (debug+)

Comment 2 Frantisek Reznicek 2011-07-28 19:56:57 UTC
Created attachment 515789 [details]
The automated test for reproducing the issue

The test can be run as:
./ctests.py --broker-maximize-uptime \
            --broker-options=mgmt-pub-interval~1,log-enable~debug+ \
            --log-to-file=110728_16h.log \
            --qmf-data-timeout=120 \
            --stop-on-error\
            --exec-tests=testset_cluster_toggle:16h \
            host_a host_b host_c host_d

If you do not set up passwordless access to host_a, host_b, host_c and host_d, then you have to pass --ssh-pass=... as well.

Comment 4 Frantisek Reznicek 2011-07-29 13:33:25 UTC
The following scenario seems to trigger the issue more easily:

A  PPPP !
B   SSSSS!
C   SSSSS!
D   SSSSS!

where 'PPPP' is the progress of a perftest client (client A against a broker chosen at random from A, B, C, D), 'SSSSS' is cluster synchronization, and '!' is a cluster reconfiguration (toggling nodes).

Other broker options which increase the reproducibility:
mgmt-pub-interval=1

The issue was observed on RHEL 5.6 x86_64 with the -33 packages.

Comment 5 Alan Conway 2011-07-29 15:30:33 UTC
This is the expected behaviour if the broker giving the update is killed before it completes. When you are changing configuration, there's a chance that after starting a broker you proceed to kill the broker providing the update.

You can fix this by connecting a client to the broker you have just started. It doesn't have to do anything; the client's connection won't complete until the broker is ready. E.g.:

            from qpid import messaging
            c = messaging.Connection.establish(...broker address...)
            try: c.session()
            finally: c.close()

Comment 6 Frantisek Reznicek 2011-08-01 14:05:13 UTC
Hi Alan,
let me describe the testing scenario in more detail so we can judge precisely whether the behavior I see is correct.

My last run on RHEL 5.7 i386 & x86_64, without loading the cluster, failed the same way.

The test sub-procedure was:
  (cluster of two nodes: rhel57i ~ dhcp-26-112 ~ A, rhel57x ~ dhcp-26-114 ~ B)
  0. initial cluster config, no node up
  1. all nodes up (first A, then B)
  2. some test actions ... (not important at the moment)
  3. reconfigure the cluster so that A is up and running and B is down
  4. the test attempts to reconfigure the cluster so that A is down and B is up

At step 4, the test detects that bringing down A first would cause a problem, because B's store is dirty, so it performs the following operations:
  4-1] on B: service qpidd start
  after 4-1 finishes:
  4-2] on A: service qpidd stop
After 4-2 is done, the cluster state is checked and the test ends with an error, because node B did not start; the cluster configuration at the end of 4-2] is therefore: A and B both down.


Now to the question of whether qpidd or the test is at fault.

The test expects that when the service qpidd operation (4-1]) returns, the broker is settled and synchronized to the cluster, so the next operation (4-2]) may safely shut down all other cluster nodes.

Alan,
Could you please comment on the above scenario?

Raising NEEDINFO.


Test transcript from 2-node cluster (rhel5.7 i/x):
  [14:32:36] DEBUG:TESTSET 'testset_cluster_toggle' loop 10/32 (10800.0) finished. [tst_cnt:759, err_cnt:0]
  [14:32:36] DEBUG:TESTSET 'testset_cluster_toggle' loop 11/36 (10800.0) started. [tst_cnt:759, err_cnt:0]
  [14:32:37] DEBUG:Cluster state: 1 of 2 up and running
  [14:32:37] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 started. [tst_cnt:759, err_cnt:0]
  [14:32:38] DEBUG:Cluster state: 1 of 2 up and running
  [14:32:38] DEBUG:Go to cluster state: 3 [True, True]
  DEBUG:ccfg:[False, True] -> tcfg:[True, True]
  DEBUG:start(A ~ 0)
  [14:32:48] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:32:49] INFO:localhost:'Cluster state: 2 (exp. 2) of 2 up and running', result:True, dur.:-1
  [14:32:51] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.46567
  [14:32:51] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:32:53] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.64677
  [14:32:53] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:32:53] DEBUG:Go to cluster state: 2 [True, False]
  DEBUG:ccfg:[True, True] -> tcfg:[True, False]
  DEBUG:stop(B ~ 1)
  [14:32:58] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:32:59] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
  [14:33:01] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.4745
  [14:33:01] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [14:33:01] DEBUG:Go to cluster state: 1 [False, True]
  DEBUG:ccfg:[True, False] -> tcfg:[False, True]
  DEBUG:start(B ~ 1)
  DEBUG:stop(A ~ 0)
  [14:33:11] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:33:12] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
  [14:33:14] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.38322
  [14:33:14] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [14:33:15] DEBUG:Cluster state: 1 of 2 up and running
  [14:33:15] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 finished. [tst_cnt:773, err_cnt:0]
  [14:33:15] DEBUG:TEST 'test_cluster_toggle_no_load_w_sesame' loop 1/1 started. [tst_cnt:773, err_cnt:0]
  [14:33:16] DEBUG:Cluster state: 1 of 2 up and running
  [14:33:16] DEBUG:Go to cluster state: 3 [True, True]
  DEBUG:ccfg:[False, True] -> tcfg:[True, True]
  DEBUG:start(A ~ 0)
  [14:33:26] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:33:27] INFO:localhost:'Cluster state: 2 (exp. 2) of 2 up and running', result:True, dur.:-1
  [14:33:29] INFO:dhcp-26-112:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:0.98156
  [14:33:33] INFO:dhcp-26-114:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:3.58063
  [14:33:35] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.53743
  [14:33:35] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:33:37] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.46806
  [14:33:37] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:33:37] DEBUG:Go to cluster state: 2 [True, False]
  DEBUG:ccfg:[True, True] -> tcfg:[True, False]
  DEBUG:stop(B ~ 1)
  [14:33:42] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:33:43] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
  [14:33:47] INFO:dhcp-26-112:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:3.75164
  [14:33:48] INFO:dhcp-26-114:'service sesame status && service sesame stop || service sesame start', result:0, dur.:0.475331
  [14:33:50] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.52838
  [14:33:50] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [14:33:50] DEBUG:Go to cluster state: 1 [False, True]
  DEBUG:ccfg:[True, False] -> tcfg:[False, True]
  DEBUG:start(B ~ 1)
  DEBUG:stop(A ~ 0)
  [14:34:04] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:34:05] ERROR:localhost:'Cluster state: 0 (exp. 1) of 2 up and running', result:False, dur.:-1
  Traceback (most recent call last):
    File "./ctests.py", line 239, in main
      i_method[1]();
    File "./ctests.py", line 406, in testset_cluster_toggle
      self.test_cluster_toggle_no_load_w_sesame();
    File "./ctests.py", line 893, in test_cluster_toggle_no_load_w_sesame
      self.report();
    File "/home/freznice/prjs/rh/rhts_tests/tests/distribution/_others/cluster_test/testlib.py", line 1430, in report
      raise test_error("An error ocurred!");
  test_error: 'An error ocurred!'
  [14:34:05] INFO:
  [14:34:05] INFO:Execution summary: TST_CNT:789, ERR_CNT:1, WRN_CNT:0, dur.:3418.27
  [14:34:05] INFO:
  
  
  dhcp-26-112:
  2011-08-01 14:33:23 notice cluster(192.168.1.5:26282 READY) caught up.
  2011-08-01 14:33:26 info Queue "qmfagent-20e864fc-5a8b-4f9e-bf7d-61fe2cd0ecda": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:30 info Queue "qmfagent-4acadabb-d7e9-4544-be78-266a7ff80ce2": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "reply-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "topic-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "qmfc-v2-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "qmfc-v2-ui-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "qmfc-v2-hb-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "reply-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "topic-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "qmfc-v2-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "qmfc-v2-ui-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "qmfc-v2-hb-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) configuration change: 192.168.1.5:26282
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) Members left: 192.168.1.6:27764
  2011-08-01 14:33:35 info cluster(192.168.1.5:26282 READY) member update: 192.168.1.5:26282(member) frameSeq=21988358
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY)Sole member of cluster, marking store clean.
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) last broker standing, update queue policies
  2011-08-01 14:33:35 info cluster(192.168.1.5:26282 READY) became the elder, active for links.
  2011-08-01 14:33:36 info Activating links
  2011-08-01 14:33:45 info Queue "qmfagent-89a21650-abc9-461f-9f9f-ef61a27691cf": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "reply-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "topic-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "qmfc-v2-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "qmfc-v2-ui-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "qmfc-v2-hb-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:51 notice cluster(192.168.1.5:26282 READY) configuration change: 192.168.1.5:26282 192.168.1.6:28490
  2011-08-01 14:33:51 notice cluster(192.168.1.5:26282 READY) Members joined: 192.168.1.6:28490
  2011-08-01 14:33:52 info cluster(192.168.1.5:26282 OFFER) send update-offer to 192.168.1.6:28490
  2011-08-01 14:33:52 info cluster(192.168.1.5:26282 OFFER) stall for update offer from 192.168.1.5:26282 to 192.168.1.6:28490
  2011-08-01 14:33:52 notice cluster(192.168.1.5:26282 UPDATER) sending update to 192.168.1.6:28490 at amqp:tcp:10.34.26.114:5672,tcp:192.168.1.6:5672
  2011-08-01 14:33:52 info SSL connector not enabled, you must set QPID_SSL_CERT_DB to enable it.
  2011-08-01 14:33:52 info Loaded Module: /usr/lib/qpid/client/sslconnector.so
  2011-08-01 14:33:52 info Connection  connected to tcp:10.34.26.114:5672
  
  
  dhcp-26-114:
  2011-08-01 14:33:35 notice Shut down
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
  2011-08-01 14:33:51 info No message store configured, persistence is disabled.
  2011-08-01 14:33:51 info Management enabled
  2011-08-01 14:33:51 info ManagementAgent restored broker ID: 96cf1637-7a1b-45f2-93a0-5b41bfa2ebee
  2011-08-01 14:33:51 notice Journal "TplStore": Created
  2011-08-01 14:33:51 notice Store module initialized; store-dir=/var/lib/qpidd
  2011-08-01 14:33:51 info > Default files per journal: 8
  2011-08-01 14:33:51 info > Default journal file size: 24 (wpgs)
  2011-08-01 14:33:51 info > Default write cache page size: 32 (Kib)
  2011-08-01 14:33:51 info > Default number of write cache pages: 32
  2011-08-01 14:33:51 info > TPL files per journal: 8
  2011-08-01 14:33:51 info > TPL journal file size: 24 (wpgs)
  2011-08-01 14:33:51 info > TPL write cache page size: 4 (Kib)
  2011-08-01 14:33:51 info > TPL number of write cache pages: 64
  2011-08-01 14:33:51 info Registered xml exchange
  2011-08-01 14:33:51 info Registered replication exchange
  2011-08-01 14:33:51 notice Initializing CPG
  2011-08-01 14:33:51 notice Cluster store state: dirty cluster-id=a114fb06-15d8-42dd-897f-cf2b8cd2d125
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 PRE_INIT) configuration change: 192.168.1.5:26282 192.168.1.6:28490
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 PRE_INIT) Members joined: 192.168.1.6:28490
  2011-08-01 14:33:51 notice Cluster recovery: recovered journal data discarded and journal files pushed down
  2011-08-01 14:33:51 notice Store directory /var/lib/qpidd/rhm was pushed down (saved) into directory /var/lib/qpidd/_cluster.bak.0020.
  2011-08-01 14:33:51 notice Journal "TplStore": Created
  2011-08-01 14:33:51 notice SASL disabled: No Authentication Performed
  2011-08-01 14:33:51 notice Listening on TCP port 5672
  2011-08-01 14:33:51 info SSL plugin not enabled, you must set --ssl-cert-db to enable it.
  2011-08-01 14:33:51 info Policy file not specified. ACL Disabled, no ACL checking being done!
  2011-08-01 14:33:51 info cluster(192.168.1.6:28490 INIT) not active for links.
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 INIT) cluster-uuid = a114fb06-15d8-42dd-897f-cf2b8cd2d125
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 JOINER) joining cluster mycluster
  2011-08-01 14:33:51 notice Broker running
  2011-08-01 14:33:51 info cluster(192.168.1.6:28490 JOINER) stall for update offer from 192.168.1.5:26282 to 192.168.1.6:28490
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 UPDATEE) receiving update from 192.168.1.5:26282
  2011-08-01 14:33:51 notice Journal "test_cluster_toggle_load_w_sesame-00": Created
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 notice Journal "test_cluster_toggle_load_w_sesame-10": Created
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 notice Journal "test_cluster_toggle_load_w_sesame-40": Created
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 notice Journal "test_cluster_toggle_load_w_sesame-60": Created
  2011-08-01 14:33:53 info Passivating links
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 notice Journal "test_cluster_toggle_load_w_sesame-70": Created
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 notice Journal "test_cluster_toggle_load_w_sesame-80": Created
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 notice Journal "test_cluster_toggle_load_w_sesame-90": Created
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "qpid.cluster-update": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 critical cluster(192.168.1.6:28490 UPDATEE) catch-up connection closed prematurely 10.34.26.114:5672-10.34.26.112:36610(192.168.1.6:28490-1 local,catchup)
  2011-08-01 14:33:59 notice cluster(192.168.1.6:28490 LEFT) leaving cluster mycluster
  2011-08-01 14:33:59 notice Shut down

Comment 7 Alan Conway 2011-08-01 14:18:43 UTC
This assumption in comment 6 is incorrect:

"Test expects that when service qpidd operation (4-1]) ends - ensures that
broker is settled and synchronized to the cluster and thus next operation
(4-2]) may safely shutdown all other cluster nodes."

When qpidd -d returns, the broker has done an initial handshake with the cluster but has not yet received the full state update. To wait for the full update to complete, connect a client to the newly started broker. The connection will not complete until the broker is fully ready. Any client will do; it doesn't have to do anything other than establish a connection.
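
As an illustration, a test harness such as ctests.py could wait for the newcomer between steps 4-1] and 4-2] with something like the following minimal sketch. The helper name, the timeout, and the retry loop are assumptions; the connection pattern is the qpid.messaging one from comment 5.

    import time
    from qpid import messaging

    def wait_until_caught_up(broker_address, timeout=120):
        # Block until a client connection to the newly started broker
        # completes. Per comments 5 and 7, the connection does not complete
        # until the broker has received the full cluster state update, so
        # returning from here means the newcomer has caught up and the other
        # nodes may safely be stopped.
        deadline = time.time() + timeout
        while True:
            try:
                c = messaging.Connection.establish(broker_address)  # e.g. "dhcp-26-114:5672"
                try:
                    c.session()   # no work needed; connecting is the whole point
                    return
                finally:
                    c.close()
            except Exception:     # broker not listening / not ready yet
                if time.time() > deadline:
                    raise
                time.sleep(1)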

Comment 8 Frantisek Reznicek 2011-08-01 15:05:30 UTC
Thanks for the clarification, Alan.
The observed behavior is expected; the test code has to take into account that the newbie's synchronization takes additional time.

-> CLOSED / NOT-A-BUG

Comment 9 Alan Conway 2011-08-01 19:59:59 UTC
A note for testing: you can ensure the newbie is ready without sleeping by just making a connection with any client you like, e.g. qpid-perftest --count 1. The connection will complete when the newbie is up to date, at which point it is safe to proceed with the test.
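
As an illustration, the probe could be wired into a test step along these lines (a sketch only; the qpid-perftest broker and port options are assumed from the qpid client test tools and may differ by version):

    import subprocess

    def newbie_ready(broker_host, port=5672):
        # qpid-perftest --count 1 pushes a single message through the broker;
        # the command can only succeed once its connection completes, i.e.
        # once the newly started node is up to date.
        cmd = ["qpid-perftest", "--count", "1",
               "-b", broker_host, "-p", str(port)]
        return subprocess.call(cmd) == 0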