Bug 726494 - clustered broker leaves cluster with 'catch-up connection closed prematurely'
Summary: clustered broker leaves cluster with 'catch-up connection closed prematurely'
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: athlon
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: messaging-bugs
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-07-28 19:44 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:13 UTC
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-01 15:05:30 UTC
Target Upstream Version:


Attachments
The last hundreds of log lines from every broker (debug+) (38.05 KB, text/plain)
2011-07-28 19:46 UTC, Frantisek Reznicek
The automated test for reproducing the issue (41.52 KB, application/x-tbz)
2011-07-28 19:56 UTC, Frantisek Reznicek

Description Frantisek Reznicek 2011-07-28 19:44:32 UTC
Description of problem:

A long-running cluster test uncovered that a node may rarely and unexpectedly leave the cluster with the following message:
 2011-07-28 18:32:40 debug DISCONNECTED [10.34.33.125:5672-10.34.33.236:44944]
 2011-07-28 18:32:40 critical cluster(192.168.5.2:12582 UPDATEE) catch-up connection closed prematurely 10.34.33.125:5672-10.34.33.236:44944(192.168.5.2:12582-1 local,catchup)
 2011-07-28 18:32:40 notice cluster(192.168.5.2:12582 LEFT) leaving cluster mycluster

This behavior is observed when the cluster test repeatedly performs the cluster toggle test, which changes the cluster configuration. See the further description below.


Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5_6.4
openais-0.80.6-28.el5_6.1
python-qmf-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
python-saslwrapper-0.1.934605-2.el5
qmf-0.7.946106-33.el5
qmf-devel-0.7.946106-33.el5
qpid-cpp-client-0.7.946106-33.el5
qpid-cpp-client-devel-0.7.946106-33.el5
qpid-cpp-client-devel-docs-0.7.946106-33.el5
qpid-cpp-client-ssl-0.7.946106-33.el5
qpid-cpp-server-0.7.946106-33.el5
qpid-cpp-server-cluster-0.7.946106-33.el5
qpid-cpp-server-devel-0.7.946106-33.el5
qpid-cpp-server-ssl-0.7.946106-33.el5
qpid-cpp-server-store-0.7.946106-33.el5
qpid-cpp-server-xml-0.7.946106-33.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5
ruby-saslwrapper-0.1.934605-2.el5
saslwrapper-0.1.934605-2.el5
saslwrapper-devel-0.1.934605-2.el5
sesame-0.7.4297-6.el5


How reproducible:
~90%; the time it takes to run into trouble varies

Steps to Reproduce:
1. ./ctests.py --broker-maximize-uptime --broker-options=mgmt-pub-interval~1,log-enable~debug+ --log-to-file=a.log --qmf-data-timeout=120 --stop-on-error --exec-tests=testset_cluster_toggle:16h host_a host_b host_c host_d   
  
Actual results:
The cluster test running the cluster toggle tests fails.

Expected results:
The cluster test running the cluster toggle tests should not fail.

Additional info:

Rough test log:
  
  [18:28:25] DEBUG:TESTSET 'testset_cluster_toggle' loop 3/28 (57600.0) finished. [tst_cnt:1305, err_cnt:0]
  [18:28:25] DEBUG:TESTSET 'testset_cluster_toggle' loop 4/37 (57600.0) started. [tst_cnt:1305, err_cnt:0]
  [18:28:28] DEBUG:Cluster state: 1 of 4 up and running
  [18:28:28] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 started. [tst_cnt:1305, err_cnt:0]
  [18:28:30] DEBUG:Cluster state: 1 of 4 up and running
  [18:28:30] DEBUG:Go to cluster state: 15 [True, True, True, True]
  [18:28:59] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:29:02] INFO:localhost:'Cluster state: 4 (exp. 4) of 4 up and running', result:True, dur.:-1
  [18:29:08] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:4.68771
  [18:29:08] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:10] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.09119
  [18:29:10] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:13] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.30942
  [18:29:13] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:16] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.33577
  [18:29:16] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
  [18:29:16] DEBUG:Go to cluster state: 14 [True, True, True, False]
  [18:29:28] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:29:30] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
  [18:29:32] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.314
  [18:29:32] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:29:35] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.05801
  [18:29:35] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:29:38] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.25643
  [18:29:38] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:29:38] DEBUG:Go to cluster state: 13 [True, True, False, True]
  [18:29:56] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:29:58] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
  [18:30:01] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.19658
  [18:30:01] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:30:04] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.09778
  [18:30:04] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:30:07] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.26122
  [18:30:07] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:30:07] DEBUG:Go to cluster state: 12 [True, True, False, False]
  [18:30:19] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:30:21] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
  [18:30:24] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.24335
  [18:30:24] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:30:27] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.30828
  [18:30:27] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:30:28] DEBUG:Go to cluster state: 11 [True, False, True, True]
  [18:30:55] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:30:57] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
  [18:31:00] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.37627
  [18:31:00] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:31:03] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.18624
  [18:31:03] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:31:06] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.16036
  [18:31:06] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
  [18:31:06] DEBUG:Go to cluster state: 10 [True, False, True, False]
  [18:31:17] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:31:19] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
  [18:31:22] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.28387
  [18:31:22] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:25] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.21398
  [18:31:25] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:26] DEBUG:Go to cluster state: 9 [True, False, False, True]
  [18:31:43] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:31:45] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
  [18:31:48] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.256
  [18:31:48] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:52] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.08481
  [18:31:52] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [18:31:52] DEBUG:Go to cluster state: 8 [True, False, False, False]
  [18:32:04] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:32:06] INFO:localhost:'Cluster state: 1 (exp. 1) of 4 up and running', result:True, dur.:-1
  [18:32:09] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.16295
  [18:32:09] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [18:32:10] DEBUG:Go to cluster state: 7 [False, True, True, True]
  [18:32:44] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [18:32:46] ERROR:localhost:'Cluster state: 2 (exp. 3) of 4 up and running', result:False, dur.:-1

Comment 1 Frantisek Reznicek 2011-07-28 19:46:48 UTC
Created attachment 515788 [details]
The last hundreds of log lines from every broker (debug+)

Comment 2 Frantisek Reznicek 2011-07-28 19:56:57 UTC
Created attachment 515789 [details]
The automated test for reproducing the issue

The test can be run as:
./ctests.py --broker-maximize-uptime \
            --broker-options=mgmt-pub-interval~1,log-enable~debug+ \
            --log-to-file=110728_16h.log \
            --qmf-data-timeout=120 \
            --stop-on-error \
            --exec-tests=testset_cluster_toggle:16h \
            host_a host_b host_c host_d

If you have not set up passwordless SSH access to host_a, host_b, host_c and host_d, you also have to pass --ssh-pass=...

Comment 4 Frantisek Reznicek 2011-07-29 13:33:25 UTC
The following scenario seems to trigger the issue more easily:

A  PPPP !
B   SSSSS!
C   SSSSS!
D   SSSSS!

where 'PPPP' is the progress of a perftest client (client on A against a broker on a random node from A, B, C, D), 'SSSSS' is cluster synchronization, and '!' is a cluster reconfiguration (toggling nodes).

Another broker option which increases reproducibility:
mgmt-pub-interval=1

The issue was observed on RHEL 5.6 x86_64 with the -33 packages.

Comment 5 Alan Conway 2011-07-29 15:30:33 UTC
This is the expected behaviour if the broker giving the update is killed before it completes. When you are changing configuration, there's a chance that after starting a broker you proceed to kill the broker providing the update.

You can fix this by connecting a client to the broker you have just started. It doesn't have to do anything; the client's connection won't complete until the broker is ready. E.g.

            c = messaging.Connection.establish(...broker address...)
            try: c.session()
            finally: c.close()
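
For reference, a minimal self-contained sketch of this pattern (not part of the original comment), using the Python qpid.messaging API; the broker address below is a placeholder for the node that was just started:

            from qpid import messaging

            # Placeholder address -- point this at the newly started broker.
            broker = "localhost:5672"

            # Per comments 5 and 7, the connection does not complete until the
            # broker has finished its catch-up, so returning from these calls
            # means the node is ready.
            c = messaging.Connection.establish(broker)
            try:
                c.session()
            finally:
                c.close()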

Comment 6 Frantisek Reznicek 2011-08-01 14:05:13 UTC
Hi Alan,
let me clarify the testing scenario in more detail so we can judge precisely whether the behavior I see is correct.

My last run on RHEL 5.7 i386 & x86_64, without load on the cluster, failed the same way.

The test sub procedure was:
  (cluster of two nodes [rhel57i~dhcp-26-112~A, rhel57x~dhcp-26-114~B])
  0. initial cluster config, no node up
  1. all nodes up (first A, then B)
  2. test actions ... (not important at the moment)
  3. reconfigure cluster to: A is up and running, B is down
  4. test attempts to reconfig cluster to configuration:
     - A is down, B is up

At step 4 the test detects that bringing down A would result in a problem, as B's store is dirty, and thus performs the following operations:
  4-1] at B : service qpidd start
  after 4-1 finishes:
  4-2] at A : service qpidd stop
  after 4-2 is done, the cluster state is checked and the test ends with an error because node B did not start; thus the cluster configuration at the end of 4-2] is: A and B both down.


Now to the question of whether qpidd or the test is at fault.

The test expects that when the 'service qpidd' operation (4-1]) ends, the broker is settled and synchronized with the cluster, and thus the next operation (4-2]) may safely shut down all other cluster nodes.

Alan,
Could you possibly comment on the above scenario, please?

Raising NEEDINFO.


Test transcript from 2-node cluster (rhel5.7 i/x):
  [14:32:36] DEBUG:TESTSET 'testset_cluster_toggle' loop 10/32 (10800.0) finished. [tst_cnt:759, err_cnt:0]
  [14:32:36] DEBUG:TESTSET 'testset_cluster_toggle' loop 11/36 (10800.0) started. [tst_cnt:759, err_cnt:0]
  [14:32:37] DEBUG:Cluster state: 1 of 2 up and running
  [14:32:37] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 started. [tst_cnt:759, err_cnt:0]
  [14:32:38] DEBUG:Cluster state: 1 of 2 up and running
  [14:32:38] DEBUG:Go to cluster state: 3 [True, True]
  DEBUG:ccfg:[False, True] -> tcfg:[True, True]
  DEBUG:start(A ~ 0)
  [14:32:48] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:32:49] INFO:localhost:'Cluster state: 2 (exp. 2) of 2 up and running', result:True, dur.:-1
  [14:32:51] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.46567
  [14:32:51] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:32:53] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.64677
  [14:32:53] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:32:53] DEBUG:Go to cluster state: 2 [True, False]
  DEBUG:ccfg:[True, True] -> tcfg:[True, False]
  DEBUG:stop(B ~ 1)
  [14:32:58] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:32:59] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
  [14:33:01] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.4745
  [14:33:01] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [14:33:01] DEBUG:Go to cluster state: 1 [False, True]
  DEBUG:ccfg:[True, False] -> tcfg:[False, True]
  DEBUG:start(B ~ 1)
  DEBUG:stop(A ~ 0)
  [14:33:11] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:33:12] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
  [14:33:14] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.38322
  [14:33:14] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [14:33:15] DEBUG:Cluster state: 1 of 2 up and running
  [14:33:15] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 finished. [tst_cnt:773, err_cnt:0]
  [14:33:15] DEBUG:TEST 'test_cluster_toggle_no_load_w_sesame' loop 1/1 started. [tst_cnt:773, err_cnt:0]
  [14:33:16] DEBUG:Cluster state: 1 of 2 up and running
  [14:33:16] DEBUG:Go to cluster state: 3 [True, True]
  DEBUG:ccfg:[False, True] -> tcfg:[True, True]
  DEBUG:start(A ~ 0)
  [14:33:26] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:33:27] INFO:localhost:'Cluster state: 2 (exp. 2) of 2 up and running', result:True, dur.:-1
  [14:33:29] INFO:dhcp-26-112:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:0.98156
  [14:33:33] INFO:dhcp-26-114:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:3.58063
  [14:33:35] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.53743
  [14:33:35] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:33:37] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.46806
  [14:33:37] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
  [14:33:37] DEBUG:Go to cluster state: 2 [True, False]
  DEBUG:ccfg:[True, True] -> tcfg:[True, False]
  DEBUG:stop(B ~ 1)
  [14:33:42] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:33:43] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
  [14:33:47] INFO:dhcp-26-112:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:3.75164
  [14:33:48] INFO:dhcp-26-114:'service sesame status && service sesame stop || service sesame start', result:0, dur.:0.475331
  [14:33:50] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.52838
  [14:33:50] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
  [14:33:50] DEBUG:Go to cluster state: 1 [False, True]
  DEBUG:ccfg:[True, False] -> tcfg:[False, True]
  DEBUG:start(B ~ 1)
  DEBUG:stop(A ~ 0)
  [14:34:04] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
  [14:34:05] ERROR:localhost:'Cluster state: 0 (exp. 1) of 2 up and running', result:False, dur.:-1
  Traceback (most recent call last):
    File "./ctests.py", line 239, in main
      i_method[1]();
    File "./ctests.py", line 406, in testset_cluster_toggle
      self.test_cluster_toggle_no_load_w_sesame();
    File "./ctests.py", line 893, in test_cluster_toggle_no_load_w_sesame
      self.report();
    File "/home/freznice/prjs/rh/rhts_tests/tests/distribution/_others/cluster_test/testlib.py", line 1430, in report
      raise test_error("An error ocurred!");
  test_error: 'An error ocurred!'
  [14:34:05] INFO:
  [14:34:05] INFO:Execution summary: TST_CNT:789, ERR_CNT:1, WRN_CNT:0, dur.:3418.27
  [14:34:05] INFO:
  
  
  dhcp-26-112:
  2011-08-01 14:33:23 notice cluster(192.168.1.5:26282 READY) caught up.
  2011-08-01 14:33:26 info Queue "qmfagent-20e864fc-5a8b-4f9e-bf7d-61fe2cd0ecda": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:30 info Queue "qmfagent-4acadabb-d7e9-4544-be78-266a7ff80ce2": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "reply-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "topic-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "qmfc-v2-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "qmfc-v2-ui-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:31 info Queue "qmfc-v2-hb-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "reply-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "topic-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "qmfc-v2-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "qmfc-v2-ui-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:33 info Queue "qmfc-v2-hb-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) configuration change: 192.168.1.5:26282
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) Members left: 192.168.1.6:27764
  2011-08-01 14:33:35 info cluster(192.168.1.5:26282 READY) member update: 192.168.1.5:26282(member) frameSeq=21988358
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY)Sole member of cluster, marking store clean.
  2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) last broker standing, update queue policies
  2011-08-01 14:33:35 info cluster(192.168.1.5:26282 READY) became the elder, active for links.
  2011-08-01 14:33:36 info Activating links
  2011-08-01 14:33:45 info Queue "qmfagent-89a21650-abc9-461f-9f9f-ef61a27691cf": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "reply-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "topic-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "qmfc-v2-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "qmfc-v2-ui-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:47 info Queue "qmfc-v2-hb-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
  2011-08-01 14:33:51 notice cluster(192.168.1.5:26282 READY) configuration change: 192.168.1.5:26282 192.168.1.6:28490
  2011-08-01 14:33:51 notice cluster(192.168.1.5:26282 READY) Members joined: 192.168.1.6:28490
  2011-08-01 14:33:52 info cluster(192.168.1.5:26282 OFFER) send update-offer to 192.168.1.6:28490
  2011-08-01 14:33:52 info cluster(192.168.1.5:26282 OFFER) stall for update offer from 192.168.1.5:26282 to 192.168.1.6:28490
  2011-08-01 14:33:52 notice cluster(192.168.1.5:26282 UPDATER) sending update to 192.168.1.6:28490 at amqp:tcp:10.34.26.114:5672,tcp:192.168.1.6:5672
  2011-08-01 14:33:52 info SSL connector not enabled, you must set QPID_SSL_CERT_DB to enable it.
  2011-08-01 14:33:52 info Loaded Module: /usr/lib/qpid/client/sslconnector.so
  2011-08-01 14:33:52 info Connection  connected to tcp:10.34.26.114:5672
  
  
  dhcp-26-114:
  2011-08-01 14:33:35 notice Shut down
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
  2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
  2011-08-01 14:33:51 info No message store configured, persistence is disabled.
  2011-08-01 14:33:51 info Management enabled
  2011-08-01 14:33:51 info ManagementAgent restored broker ID: 96cf1637-7a1b-45f2-93a0-5b41bfa2ebee
  2011-08-01 14:33:51 notice Journal "TplStore": Created
  2011-08-01 14:33:51 notice Store module initialized; store-dir=/var/lib/qpidd
  2011-08-01 14:33:51 info > Default files per journal: 8
  2011-08-01 14:33:51 info > Default journal file size: 24 (wpgs)
  2011-08-01 14:33:51 info > Default write cache page size: 32 (Kib)
  2011-08-01 14:33:51 info > Default number of write cache pages: 32
  2011-08-01 14:33:51 info > TPL files per journal: 8
  2011-08-01 14:33:51 info > TPL journal file size: 24 (wpgs)
  2011-08-01 14:33:51 info > TPL write cache page size: 4 (Kib)
  2011-08-01 14:33:51 info > TPL number of write cache pages: 64
  2011-08-01 14:33:51 info Registered xml exchange
  2011-08-01 14:33:51 info Registered replication exchange
  2011-08-01 14:33:51 notice Initializing CPG
  2011-08-01 14:33:51 notice Cluster store state: dirty cluster-id=a114fb06-15d8-42dd-897f-cf2b8cd2d125
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 PRE_INIT) configuration change: 192.168.1.5:26282 192.168.1.6:28490
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 PRE_INIT) Members joined: 192.168.1.6:28490
  2011-08-01 14:33:51 notice Cluster recovery: recovered journal data discarded and journal files pushed down
  2011-08-01 14:33:51 notice Store directory /var/lib/qpidd/rhm was pushed down (saved) into directory /var/lib/qpidd/_cluster.bak.0020.
  2011-08-01 14:33:51 notice Journal "TplStore": Created
  2011-08-01 14:33:51 notice SASL disabled: No Authentication Performed
  2011-08-01 14:33:51 notice Listening on TCP port 5672
  2011-08-01 14:33:51 info SSL plugin not enabled, you must set --ssl-cert-db to enable it.
  2011-08-01 14:33:51 info Policy file not specified. ACL Disabled, no ACL checking being done!
  2011-08-01 14:33:51 info cluster(192.168.1.6:28490 INIT) not active for links.
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 INIT) cluster-uuid = a114fb06-15d8-42dd-897f-cf2b8cd2d125
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 JOINER) joining cluster mycluster
  2011-08-01 14:33:51 notice Broker running
  2011-08-01 14:33:51 info cluster(192.168.1.6:28490 JOINER) stall for update offer from 192.168.1.5:26282 to 192.168.1.6:28490
  2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 UPDATEE) receiving update from 192.168.1.5:26282
  2011-08-01 14:33:51 notice Journal "test_cluster_toggle_load_w_sesame-00": Created
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 notice Journal "test_cluster_toggle_load_w_sesame-10": Created
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
  2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:52 notice Journal "test_cluster_toggle_load_w_sesame-40": Created
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:53 notice Journal "test_cluster_toggle_load_w_sesame-60": Created
  2011-08-01 14:33:53 info Passivating links
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:54 notice Journal "test_cluster_toggle_load_w_sesame-70": Created
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:55 notice Journal "test_cluster_toggle_load_w_sesame-80": Created
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 notice Journal "test_cluster_toggle_load_w_sesame-90": Created
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 info Queue "qpid.cluster-update": Policy created: type=reject; maxCount=0; maxSize=104857600
  2011-08-01 14:33:59 critical cluster(192.168.1.6:28490 UPDATEE) catch-up connection closed prematurely 10.34.26.114:5672-10.34.26.112:36610(192.168.1.6:28490-1 local,catchup)
  2011-08-01 14:33:59 notice cluster(192.168.1.6:28490 LEFT) leaving cluster mycluster
  2011-08-01 14:33:59 notice Shut down

Comment 7 Alan Conway 2011-08-01 14:18:43 UTC
This assumption in comment 6 is incorrect:

"Test expects that when service qpidd operation (4-1]) ends - ensures that
broker is settled and synchronized to the cluster and thus next operation
(4-2]) may safely shutdown all other cluster nodes."

When qpidd -d returns, the broker has done an initial handshake with the cluster but has not yet received the full state update. To wait for the full update to complete, connect a client to the newly started broker. The connection will not complete until the broker is fully ready. Any client will do; it doesn't have to do anything other than establish a connection.
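
As an illustration only, a hypothetical helper the test could call between step 4-1] (start B) and step 4-2] (stop A) to wait for the newly started broker to finish its catch-up; the function name and arguments are assumptions, not part of the test suite:

    from qpid import messaging

    def wait_until_caught_up(host, port=5672):
        # Hypothetical helper. Per comment 7, a client connection to the new
        # broker does not complete until it has received the full state
        # update, so this call returning means the node has caught up.
        c = messaging.Connection.establish("%s:%d" % (host, port))
        try:
            c.session()
        finally:
            c.close()

    # e.g. after 'service qpidd start' on node B:
    # wait_until_caught_up("dhcp-26-114")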

Comment 8 Frantisek Reznicek 2011-08-01 15:05:30 UTC
Thanks for clarification Alan.
The observed behavior is expected, and the test code has to take into account that synchronization of a newly joined node takes longer.

-> CLOSED / NOT-A-BUG

Comment 9 Alan Conway 2011-08-01 19:59:59 UTC
A note for testing: you can ensure the newbie is ready, without sleeping, by just making a connection with any client you like, e.g. qpid-perftest --count 1. The connection will complete when the newbie is up to date, at which point it is safe to proceed with the test.

