Description of problem:
A long-running cluster test uncovered that a node may, rarely and unexpectedly, leave the cluster with the following messages:

2011-07-28 18:32:40 debug DISCONNECTED [10.34.33.125:5672-10.34.33.236:44944]
2011-07-28 18:32:40 critical cluster(192.168.5.2:12582 UPDATEE) catch-up connection closed prematurely 10.34.33.125:5672-10.34.33.236:44944(192.168.5.2:12582-1 local,catchup)
2011-07-28 18:32:40 notice cluster(192.168.5.2:12582 LEFT) leaving cluster mycluster

This behavior is observed when the cluster test repeatedly performs the cluster toggle test, which changes the cluster configuration. See the additional info below for more detail.

Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5_6.4
openais-0.80.6-28.el5_6.1
python-qmf-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
python-saslwrapper-0.1.934605-2.el5
qmf-0.7.946106-33.el5
qmf-devel-0.7.946106-33.el5
qpid-cpp-client-0.7.946106-33.el5
qpid-cpp-client-devel-0.7.946106-33.el5
qpid-cpp-client-devel-docs-0.7.946106-33.el5
qpid-cpp-client-ssl-0.7.946106-33.el5
qpid-cpp-server-0.7.946106-33.el5
qpid-cpp-server-cluster-0.7.946106-33.el5
qpid-cpp-server-devel-0.7.946106-33.el5
qpid-cpp-server-ssl-0.7.946106-33.el5
qpid-cpp-server-store-0.7.946106-33.el5
qpid-cpp-server-xml-0.7.946106-33.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5
ruby-saslwrapper-0.1.934605-2.el5
saslwrapper-0.1.934605-2.el5
saslwrapper-devel-0.1.934605-2.el5
sesame-0.7.4297-6.el5

How reproducible:
90%; only the time it takes to run into trouble varies.

Steps to Reproduce:
1. ./ctests.py --broker-maximize-uptime --broker-options=mgmt-pub-interval~1,log-enable~debug+ --log-to-file=a.log --qmf-data-timeout=120 --stop-on-error --exec-tests=testset_cluster_toggle:16h host_a host_b host_c host_d

Actual results:
The cluster test running the cluster toggle tests fails.

Expected results:
The cluster test running the cluster toggle tests should not fail.

Additional info:
Rough test log:

[18:28:25] DEBUG:TESTSET 'testset_cluster_toggle' loop 3/28 (57600.0) finished. [tst_cnt:1305, err_cnt:0]
[18:28:25] DEBUG:TESTSET 'testset_cluster_toggle' loop 4/37 (57600.0) started. [tst_cnt:1305, err_cnt:0]
[18:28:28] DEBUG:Cluster state: 1 of 4 up and running
[18:28:28] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 started. [tst_cnt:1305, err_cnt:0]
[18:28:30] DEBUG:Cluster state: 1 of 4 up and running
[18:28:30] DEBUG:Go to cluster state: 15 [True, True, True, True]
[18:28:59] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:29:02] INFO:localhost:'Cluster state: 4 (exp. 4) of 4 up and running', result:True, dur.:-1
[18:29:08] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:4.68771
[18:29:08] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
[18:29:10] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.09119
[18:29:10] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
[18:29:13] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.30942
[18:29:13] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
[18:29:16] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.33577
[18:29:16] INFO:localhost:'qpid-cluster cluster size check (exp. 4)', result:True, dur.:-1
[18:29:16] DEBUG:Go to cluster state: 14 [True, True, True, False]
[18:29:28] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:29:30] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
[18:29:32] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.314
[18:29:32] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:29:35] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.05801
[18:29:35] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:29:38] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.25643
[18:29:38] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:29:38] DEBUG:Go to cluster state: 13 [True, True, False, True]
[18:29:56] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:29:58] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
[18:30:01] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.19658
[18:30:01] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:30:04] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.09778
[18:30:04] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:30:07] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.26122
[18:30:07] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:30:07] DEBUG:Go to cluster state: 12 [True, True, False, False]
[18:30:19] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:30:21] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
[18:30:24] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.24335
[18:30:24] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[18:30:27] INFO:dhcp-lab-125.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.30828
[18:30:27] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[18:30:28] DEBUG:Go to cluster state: 11 [True, False, True, True]
[18:30:55] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:30:57] INFO:localhost:'Cluster state: 3 (exp. 3) of 4 up and running', result:True, dur.:-1
[18:31:00] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.37627
[18:31:00] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:31:03] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.18624
[18:31:03] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:31:06] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.16036
[18:31:06] INFO:localhost:'qpid-cluster cluster size check (exp. 3)', result:True, dur.:-1
[18:31:06] DEBUG:Go to cluster state: 10 [True, False, True, False]
[18:31:17] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:31:19] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
[18:31:22] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.28387
[18:31:22] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[18:31:25] INFO:dhcp-lab-123.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.21398
[18:31:25] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[18:31:26] DEBUG:Go to cluster state: 9 [True, False, False, True]
[18:31:43] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:31:45] INFO:localhost:'Cluster state: 2 (exp. 2) of 4 up and running', result:True, dur.:-1
[18:31:48] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.256
[18:31:48] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[18:31:52] INFO:dhcp-lab-122.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.08481
[18:31:52] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[18:31:52] DEBUG:Go to cluster state: 8 [True, False, False, False]
[18:32:04] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:32:06] INFO:localhost:'Cluster state: 1 (exp. 1) of 4 up and running', result:True, dur.:-1
[18:32:09] INFO:dhcp-lab-236.englab.brq.redhat.com:'qpid-cluster', result:0, dur.:2.16295
[18:32:09] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
[18:32:10] DEBUG:Go to cluster state: 7 [False, True, True, True]
[18:32:44] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[18:32:46] ERROR:localhost:'Cluster state: 2 (exp. 3) of 4 up and running', result:False, dur.:-1
Created attachment 515788 [details]
The last few hundred log lines from every broker (debug+)
Created attachment 515789 [details]
The automated test for reproducing the issue

The test can be run as:

./ctests.py --broker-maximize-uptime \
  --broker-options=mgmt-pub-interval~1,log-enable~debug+ \
  --log-to-file=110728_16h.log \
  --qmf-data-timeout=120 \
  --stop-on-error \
  --exec-tests=testset_cluster_toggle:16h \
  host_a host_b host_c host_d

If you have not set up passwordless access to host_a host_b host_c host_d, you have to pass --ssh-pass=... as well.
The following scenario seems to trigger the issue more easily:

A PPPP !
B SSSSS!
C SSSSS!
D SSSSS!

where 'PPPP' is the progress of a perftest client (client A against a broker on rand(A, B, C, D)), 'SSSSS' is cluster synchronization, and '!' is a cluster reconfiguration/toggling of nodes.

Another broker option that increases reproducibility: mgmt-pub-interval=1

The issue was observed on RHEL5.6 x86_64 with the -33 packages.
This is the expected behaviour if the broker giving the update is killed before the update completes. When you are changing the configuration, there is a chance that after starting a broker you proceed to kill the broker providing its update. You can fix this by connecting a client to the broker you have just started. The client doesn't have to do anything; its connection won't complete until the broker is ready. E.g.:

    from qpid import messaging  # python-qpid client

    c = messaging.Connection.establish(...broker address...)
    try:
        c.session()
    finally:
        c.close()
Hi Alan, let me clarify the testing scenario in more detail so we can judge precisely whether the behavior I see is correct. My last run on RHEL5.7 i386 & x86_64, without any load on the cluster, failed the same way.

The test sub-procedure was (cluster of two nodes: rhel57i~dhcp-26-112~A, rhel57x~dhcp-26-114~B):
0. initial cluster config, no node up
1. all nodes up (first A, then B)
2. some test actions ... (not important at the moment)
3. reconfigure the cluster to: A up and running, B down
4. the test attempts to reconfigure the cluster to: A down, B up

At point 4 the test detects that bringing down A first would cause a problem because B's store is dirty, so it performs the following operations:
4-1] at B: service qpidd start
after 4-1 finishes:
4-2] at A: service qpidd stop

After 4-2 is done, the cluster state is checked and the test ends with an error because node B did not start; the cluster configuration at the end of 4-2] is: A and B both down.

Now to the question of whether qpidd or the test is at fault. The test expects that when the service qpidd operation (4-1]) ends, the broker is settled and synchronized to the cluster, and thus the next operation (4-2]) may safely shut down all other cluster nodes.

Alan, could you please comment on the above scenario? Raising NEEDINFO.

Test transcript from the 2-node cluster (rhel5.7 i/x):

[14:32:36] DEBUG:TESTSET 'testset_cluster_toggle' loop 10/32 (10800.0) finished. [tst_cnt:759, err_cnt:0]
[14:32:36] DEBUG:TESTSET 'testset_cluster_toggle' loop 11/36 (10800.0) started. [tst_cnt:759, err_cnt:0]
[14:32:37] DEBUG:Cluster state: 1 of 2 up and running
[14:32:37] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 started. [tst_cnt:759, err_cnt:0]
[14:32:38] DEBUG:Cluster state: 1 of 2 up and running
[14:32:38] DEBUG:Go to cluster state: 3 [True, True]
DEBUG:ccfg:[False, True] -> tcfg:[True, True]
DEBUG:start(A ~ 0)
[14:32:48] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[14:32:49] INFO:localhost:'Cluster state: 2 (exp. 2) of 2 up and running', result:True, dur.:-1
[14:32:51] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.46567
[14:32:51] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[14:32:53] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.64677
[14:32:53] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[14:32:53] DEBUG:Go to cluster state: 2 [True, False]
DEBUG:ccfg:[True, True] -> tcfg:[True, False]
DEBUG:stop(B ~ 1)
[14:32:58] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[14:32:59] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
[14:33:01] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.4745
[14:33:01] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
[14:33:01] DEBUG:Go to cluster state: 1 [False, True]
DEBUG:ccfg:[True, False] -> tcfg:[False, True]
DEBUG:start(B ~ 1)
DEBUG:stop(A ~ 0)
[14:33:11] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[14:33:12] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
[14:33:14] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.38322
[14:33:14] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
[14:33:15] DEBUG:Cluster state: 1 of 2 up and running
[14:33:15] DEBUG:TEST 'test_cluster_toggle_no_load' loop 1/1 finished. [tst_cnt:773, err_cnt:0]
[14:33:15] DEBUG:TEST 'test_cluster_toggle_no_load_w_sesame' loop 1/1 started. [tst_cnt:773, err_cnt:0]
[14:33:16] DEBUG:Cluster state: 1 of 2 up and running
[14:33:16] DEBUG:Go to cluster state: 3 [True, True]
DEBUG:ccfg:[False, True] -> tcfg:[True, True]
DEBUG:start(A ~ 0)
[14:33:26] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[14:33:27] INFO:localhost:'Cluster state: 2 (exp. 2) of 2 up and running', result:True, dur.:-1
[14:33:29] INFO:dhcp-26-112:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:0.98156
[14:33:33] INFO:dhcp-26-114:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:3.58063
[14:33:35] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.53743
[14:33:35] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[14:33:37] INFO:dhcp-26-114:'qpid-cluster', result:0, dur.:1.46806
[14:33:37] INFO:localhost:'qpid-cluster cluster size check (exp. 2)', result:True, dur.:-1
[14:33:37] DEBUG:Go to cluster state: 2 [True, False]
DEBUG:ccfg:[True, True] -> tcfg:[True, False]
DEBUG:stop(B ~ 1)
[14:33:42] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[14:33:43] INFO:localhost:'Cluster state: 1 (exp. 1) of 2 up and running', result:True, dur.:-1
[14:33:47] INFO:dhcp-26-112:'service sesame status && service sesame restart || service sesame start ; service sesame status', result:0, dur.:3.75164
[14:33:48] INFO:dhcp-26-114:'service sesame status && service sesame stop || service sesame start', result:0, dur.:0.475331
[14:33:50] INFO:dhcp-26-112:'qpid-cluster', result:0, dur.:1.52838
[14:33:50] INFO:localhost:'qpid-cluster cluster size check (exp. 1)', result:True, dur.:-1
[14:33:50] DEBUG:Go to cluster state: 1 [False, True]
DEBUG:ccfg:[True, False] -> tcfg:[False, True]
DEBUG:start(B ~ 1)
DEBUG:stop(A ~ 0)
[14:34:04] INFO:localhost:'cluster topology change (ecode: True exp. True)', result:True, dur.:-1
[14:34:05] ERROR:localhost:'Cluster state: 0 (exp. 1) of 2 up and running', result:False, dur.:-1
Traceback (most recent call last):
  File "./ctests.py", line 239, in main
    i_method[1]();
  File "./ctests.py", line 406, in testset_cluster_toggle
    self.test_cluster_toggle_no_load_w_sesame();
  File "./ctests.py", line 893, in test_cluster_toggle_no_load_w_sesame
    self.report();
  File "/home/freznice/prjs/rh/rhts_tests/tests/distribution/_others/cluster_test/testlib.py", line 1430, in report
    raise test_error("An error ocurred!");
test_error: 'An error ocurred!'
[14:34:05] INFO:
[14:34:05] INFO:Execution summary: TST_CNT:789, ERR_CNT:1, WRN_CNT:0, dur.:3418.27
[14:34:05] INFO:

dhcp-26-112:
2011-08-01 14:33:23 notice cluster(192.168.1.5:26282 READY) caught up.
2011-08-01 14:33:26 info Queue "qmfagent-20e864fc-5a8b-4f9e-bf7d-61fe2cd0ecda": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:30 info Queue "qmfagent-4acadabb-d7e9-4544-be78-266a7ff80ce2": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:31 info Queue "reply-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:31 info Queue "topic-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:31 info Queue "qmfc-v2-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:31 info Queue "qmfc-v2-ui-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:31 info Queue "qmfc-v2-hb-dhcp-26-112.brq.redhat.com.26476.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:33 info Queue "reply-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:33 info Queue "topic-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:33 info Queue "qmfc-v2-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:33 info Queue "qmfc-v2-ui-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:33 info Queue "qmfc-v2-hb-dhcp-26-114.brq.redhat.com.28112.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) configuration change: 192.168.1.5:26282
2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) Members left: 192.168.1.6:27764
2011-08-01 14:33:35 info cluster(192.168.1.5:26282 READY) member update: 192.168.1.5:26282(member) frameSeq=21988358
2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY)Sole member of cluster, marking store clean.
2011-08-01 14:33:35 notice cluster(192.168.1.5:26282 READY) last broker standing, update queue policies
2011-08-01 14:33:35 info cluster(192.168.1.5:26282 READY) became the elder, active for links.
2011-08-01 14:33:36 info Activating links
2011-08-01 14:33:45 info Queue "qmfagent-89a21650-abc9-461f-9f9f-ef61a27691cf": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:47 info Queue "reply-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:47 info Queue "topic-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:47 info Queue "qmfc-v2-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:47 info Queue "qmfc-v2-ui-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:47 info Queue "qmfc-v2-hb-dhcp-26-112.brq.redhat.com.26665.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-08-01 14:33:51 notice cluster(192.168.1.5:26282 READY) configuration change: 192.168.1.5:26282 192.168.1.6:28490
2011-08-01 14:33:51 notice cluster(192.168.1.5:26282 READY) Members joined: 192.168.1.6:28490
2011-08-01 14:33:52 info cluster(192.168.1.5:26282 OFFER) send update-offer to 192.168.1.6:28490
2011-08-01 14:33:52 info cluster(192.168.1.5:26282 OFFER) stall for update offer from 192.168.1.5:26282 to 192.168.1.6:28490
2011-08-01 14:33:52 notice cluster(192.168.1.5:26282 UPDATER) sending update to 192.168.1.6:28490 at amqp:tcp:10.34.26.114:5672,tcp:192.168.1.6:5672
2011-08-01 14:33:52 info SSL connector not enabled, you must set QPID_SSL_CERT_DB to enable it.
2011-08-01 14:33:52 info Loaded Module: /usr/lib/qpid/client/sslconnector.so
2011-08-01 14:33:52 info Connection connected to tcp:10.34.26.114:5672

dhcp-26-114:
2011-08-01 14:33:35 notice Shut down
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2011-08-01 14:33:51 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2011-08-01 14:33:51 info No message store configured, persistence is disabled.
2011-08-01 14:33:51 info Management enabled
2011-08-01 14:33:51 info ManagementAgent restored broker ID: 96cf1637-7a1b-45f2-93a0-5b41bfa2ebee
2011-08-01 14:33:51 notice Journal "TplStore": Created
2011-08-01 14:33:51 notice Store module initialized; store-dir=/var/lib/qpidd
2011-08-01 14:33:51 info > Default files per journal: 8
2011-08-01 14:33:51 info > Default journal file size: 24 (wpgs)
2011-08-01 14:33:51 info > Default write cache page size: 32 (Kib)
2011-08-01 14:33:51 info > Default number of write cache pages: 32
2011-08-01 14:33:51 info > TPL files per journal: 8
2011-08-01 14:33:51 info > TPL journal file size: 24 (wpgs)
2011-08-01 14:33:51 info > TPL write cache page size: 4 (Kib)
2011-08-01 14:33:51 info > TPL number of write cache pages: 64
2011-08-01 14:33:51 info Registered xml exchange
2011-08-01 14:33:51 info Registered replication exchange
2011-08-01 14:33:51 notice Initializing CPG
2011-08-01 14:33:51 notice Cluster store state: dirty cluster-id=a114fb06-15d8-42dd-897f-cf2b8cd2d125
2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 PRE_INIT) configuration change: 192.168.1.5:26282 192.168.1.6:28490
2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 PRE_INIT) Members joined: 192.168.1.6:28490
2011-08-01 14:33:51 notice Cluster recovery: recovered journal data discarded and journal files pushed down
2011-08-01 14:33:51 notice Store directory /var/lib/qpidd/rhm was pushed down (saved) into directory /var/lib/qpidd/_cluster.bak.0020.
2011-08-01 14:33:51 notice Journal "TplStore": Created
2011-08-01 14:33:51 notice SASL disabled: No Authentication Performed
2011-08-01 14:33:51 notice Listening on TCP port 5672
2011-08-01 14:33:51 info SSL plugin not enabled, you must set --ssl-cert-db to enable it.
2011-08-01 14:33:51 info Policy file not specified. ACL Disabled, no ACL checking being done!
2011-08-01 14:33:51 info cluster(192.168.1.6:28490 INIT) not active for links.
2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 INIT) cluster-uuid = a114fb06-15d8-42dd-897f-cf2b8cd2d125
2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 JOINER) joining cluster mycluster
2011-08-01 14:33:51 notice Broker running
2011-08-01 14:33:51 info cluster(192.168.1.6:28490 JOINER) stall for update offer from 192.168.1.5:26282 to 192.168.1.6:28490
2011-08-01 14:33:51 notice cluster(192.168.1.6:28490 UPDATEE) receiving update from 192.168.1.5:26282
2011-08-01 14:33:51 notice Journal "test_cluster_toggle_load_w_sesame-00": Created
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-0_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 notice Journal "test_cluster_toggle_load_w_sesame-10": Created
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2011-08-01 14:33:52 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-1_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-2_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 info Queue "test_cluster_toggle_load_w_sesame-3_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:52 notice Journal "test_cluster_toggle_load_w_sesame-40": Created
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-4_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 info Queue "test_cluster_toggle_load_w_sesame-5_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:53 notice Journal "test_cluster_toggle_load_w_sesame-60": Created
2011-08-01 14:33:53 info Passivating links
2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:54 info Queue "test_cluster_toggle_load_w_sesame-6_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:54 notice Journal "test_cluster_toggle_load_w_sesame-70": Created
2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:55 info Queue "test_cluster_toggle_load_w_sesame-7_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:55 notice Journal "test_cluster_toggle_load_w_sesame-80": Created
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-8_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 notice Journal "test_cluster_toggle_load_w_sesame-90": Created
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_pub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_pub_start": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_sub_done": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "test_cluster_toggle_load_w_sesame-9_sub_ready": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 info Queue "qpid.cluster-update": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-08-01 14:33:59 critical cluster(192.168.1.6:28490 UPDATEE) catch-up connection closed prematurely 10.34.26.114:5672-10.34.26.112:36610(192.168.1.6:28490-1 local,catchup)
2011-08-01 14:33:59 notice cluster(192.168.1.6:28490 LEFT) leaving cluster mycluster
2011-08-01 14:33:59 notice Shut down
This assumption in comment 6 is incorrect: "The test expects that when the service qpidd operation (4-1]) ends, the broker is settled and synchronized to the cluster, and thus the next operation (4-2]) may safely shut down all other cluster nodes."

When qpidd -d returns, the broker has done an initial handshake with the cluster but has not yet received the full state update. To wait for the full update to complete, connect a client to the newly started broker. The connection will not complete until the broker is fully ready. Any client will do; it doesn't have to do anything other than establish a connection.
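A minimal sketch of such a readiness barrier in the test's own Python, reusing the python-qpid snippet Alan posted above (the function name and broker address here are illustrative only, not part of ctests.py):

    from qpid.messaging import Connection

    def wait_until_caught_up(broker_url):
        # The newly started broker stalls incoming client connections
        # until it has received the full cluster update, so a successful
        # connect doubles as a synchronization barrier.
        c = Connection.establish(broker_url)  # blocks until the broker is ready
        try:
            c.session()                       # no traffic needed
        finally:
            c.close()

    wait_until_caught_up("dhcp-26-114:5672")  # e.g. node B from comment 6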
Thanks for the clarification, Alan. The observed issue is expected behavior; the test code has to account for the fact that a newbie's synchronization takes additional time. -> CLOSED / NOT-A-BUG
A note for testing: you can ensure the newbie is ready, without sleeping, by just making a connection with any client you like, e.g. qpid-perftest --count 1. The connection will complete when the newbie is up to date, at which point it is safe to proceed with the test.
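Applied to the failing sequence from comment 6, the toggle step would become: 4-1] at B: service qpidd start; 4-1b] run a throwaway client against B (e.g. qpid-perftest --count 1) and wait for it to exit; 4-2] at A: service qpidd stop, which is only now safe.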