Description of problem:

When testing bug 667428, I found that brokers sometimes refuse to start, or stop after some time of operation, when the cluster width changes. I triggered the issue about five times, the last time with a very easy reproducer:

[root@dhcp-26-233 bz667428]# service qpidd start
Starting Qpid AMQP daemon: 2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
Daemon startup failed: Cannot load cluster store status: basic_ios::clear (qpid/cluster/StoreStatus.cpp:100)

The environment I'm using has this configuration file:

cluster-mechanism=ANONYMOUS
cluster-name=freznice-virt-cluster
log-enable=info+
log-to-file=/tmp/qpidd.log
mgmt-pub-interval=1

The bug 667428 test launches the brokers in the background with separate parameters, like:

qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+ &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+ &> ./qpidd.10000.log &

Version-Release number of selected component (if applicable):

qpid-cpp-client-devel-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-java-common-0.7.946106-14.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-tools-0.7.946106-12.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-java-example-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5

How reproducible:

Seems quite frequent. This is a brand-new issue, so reproducibility is still unclear.

Steps to Reproduce:
1. Configure the qpidd.conf file as above.
2. Start the broker in a cluster (width=1):
   service qpidd start
3. Stop it:
   service qpidd stop
4. Start brokers with another cluster name and another data directory:
   qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+ &> ./qpidd.5672.log &
   qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+ &> ./qpidd.10000.log &
5. Stop all running brokers:
   killall qpidd
6. Start the original broker as a service:
   service qpidd start
7. You should now see the problem described above (a consolidated script sketch of these steps follows below).

Actual results:

The broker refuses to start or shuts down unexpectedly.

Expected results:

The broker should neither refuse to start nor shut down unexpectedly.

Additional info:

Please treat this defect as an early report; I'll add more details once I know more.
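For convenience, the whole reproduction sequence can be scripted. The following is a sketch only: the commands are taken verbatim from the steps above, but the sleep intervals between them are my assumption.

#!/bin/bash
# Sketch: reproduce the "Cannot load cluster store status" startup failure.
service qpidd start      # step 2: width=1 cluster using qpidd.conf above
sleep 5
service qpidd stop       # step 3

# Step 4: two brokers under a different cluster name, each with its own data dir.
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+ &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+ &> ./qpidd.10000.log &
sleep 5

killall qpidd            # step 5: stop all running brokers
sleep 5
service qpidd start      # step 6: should now fail as shown above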
Created attachment 474954: the issue reproducer

This is the script which reproduces the problem.

Dependencies:
- openais configured and running
- java-openjdk 1.6 installed
- main + devel qpid-java-* pkgs installed

How to run:
./bz667428 3

Then look into qpidd*.log and you should see the exception message there.
I think this bug is iatrogenic. (Adjective: of or relating to illness caused by medical examination or treatment.) When I put the killall command into the run_the_batch fn, the problem goes away.

As it stands now, here is what I believe is happening:
1. Loop 1 always works fine. The error does not occur when we kill the Tx client, which is what I believe the test intends to exercise.
2. When we come to loop 2, we have *not* killed the broker on port 10000.
3. But during loop 2, we delete its data dir out from under it.
4. The original broker on port 10000 then drops dead, with the critical error message "I can't write my store state any more."
5. The test then starts a new broker on port 10000, and the process repeats.
6. Because of this, the critical error message always shows up N-1 times, where N is the number of loops.
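To make the suggested fix concrete, here is a minimal sketch. Only run_the_batch, killall qpidd, and the per-port data dirs appear in this report; the loop structure and the placement of the cleanup are my assumptions about the attached script.

run_the_batch() {
    # ... one test iteration: start the brokers, run and kill the Tx
    # client, check the logs ... (placeholder for the real steps)

    # Fix: kill every broker before the iteration ends, so the next
    # iteration never deletes ./dd10000 out from under a broker that
    # is still running on port 10000.
    killall qpidd
}

# Hypothetical driver: "./bz667428 3" would run three iterations.
for i in $(seq 1 "$1"); do
    rm -rf ./dd5672 ./dd10000    # assumed: data dirs recreated each loop
    run_the_batch
done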