Bug 672219

Summary: clustered broker ends with exception 'Error delivering frames: Cannot save cluster store status: basic_ios::clear (qpid/cluster/StoreStatus.cpp:113)'
Product: Red Hat Enterprise MRG
Reporter: Frantisek Reznicek <freznice>
Component: qpid-cpp
Assignee: Frantisek Reznicek <freznice>
Status: CLOSED NOTABUG
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: urgent
Version: 1.3
CC: esammons, gsim
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-01-27 14:41:58 UTC
Attachments: The issue reproducer

Description Frantisek Reznicek 2011-01-24 12:58:46 UTC
Description of problem:

When testing bug 667428, I found that brokers sometimes refuse to start, or stop after some period of operation, when the cluster width changes.

I triggered the issue about five times, most recently with a very simple reproducer:
[root@dhcp-26-233 bz667428]# service qpidd start
Starting Qpid AMQP daemon: 2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
Daemon startup failed: Cannot load cluster store status: basic_ios::clear (qpid/cluster/StoreStatus.cpp:100)
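
The "basic_ios::clear" text is just the what() of a C++ iostream failure: the broker could not read (on load, StoreStatus.cpp:100) or write (on save, StoreStatus.cpp:113) its cluster store status under the data directory. A quick way to inspect that state, assuming the init script's default data directory of /var/lib/qpidd (the exact file layout underneath it is also an assumption):

ls -lR /var/lib/qpidd      # look for the cluster store status files
ls -ld /var/lib/qpidd      # must be readable and writable by the qpidd user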

The environment I'm using has this configuration file:
cluster-mechanism=ANONYMOUS
cluster-name=freznice-virt-cluster
log-enable=info+
log-to-file=/tmp/qpidd.log
mgmt-pub-interval=1

The bug 667428 test launches the brokers in the background with separate parameters, e.g.:
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+  &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+  &> ./qpidd.10000.log &


Version-Release number of selected component (if applicable):
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-java-common-0.7.946106-14.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-tools-0.7.946106-12.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-java-example-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5

How reproducible:
Seems quite frequent; this is a brand new issue, so reproducibility is still unclear.

Steps to Reproduce:
1. configure the qpidd.conf file shown above
2. start the broker in a cluster (width=1): service qpidd start
3. stop it: service qpidd stop
4. start brokers with another cluster name and separate data directories:
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+  &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+  &> ./qpidd.10000.log &
5. stop all running brokers: killall qpidd
6. start the original broker as a service: service qpidd start
7. the problem above should now appear (see the consolidated script sketch after these steps)
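
A minimal end-to-end sketch of the steps as one shell script; the sleep intervals are an assumption, everything else is taken from the steps above:

#!/bin/bash
# Reproducer sketch: change the cluster setup around a service-managed broker.
service qpidd start        # steps 1-2: broker in a cluster of width 1
service qpidd stop         # step 3

# step 4: two brokers under another cluster name with their own data dirs
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+ &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+ &> ./qpidd.10000.log &
sleep 5

killall qpidd              # step 5: stop all running brokers
sleep 2

service qpidd start        # steps 6-7: restart the original broker;
                           # startup should now fail with the StoreStatus error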

Actual results:
The broker refuses to start or shuts down unexpectedly.

Expected results:
The broker should start normally and should not shut down unexpectedly.


Additional info:

Please treat this defect as an early report. I'll add more information once I know more details.

Comment 1 Frantisek Reznicek 2011-01-24 13:04:32 UTC
Created attachment 474954 [details]
The issue reproducer

This is the script which reproduces the problem.

Dependencies:
  openais configured and running
  java-openjdk 1.6 installed (main + devel packages)
  qpid-java-* packages installed
How to run:
  ./bz667428 3

Then look at qpidd*.log; you should see the exception message there.
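
For example, a quick check (the pattern is taken from the bug summary above):

grep -n 'Cannot save cluster store status' qpidd*.log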

Comment 3 mick 2011-01-26 17:19:25 UTC
I think this bug is iatrogenic.

( Adjective: Of or relating to illness caused by medical examination or treatment. )


When I put the killall command into the run_the_batch fn, the problem goes away.

As it is now, here's what I believe is happening:

  1. Loop 1 always works fine. The error does not occur when we kill the Tx client, which is what I believe the test is intending to look at.

  2. When we come to loop 2, we have *not* killed the broker on port 10000.

  3. But during loop two, we delete its data dir out from under it.

  4. The original broker on port 10000 then drops dead, with the critical error message "I can't write my store state any more."

  5. The test then starts a new broker on port 10000, and the process repeats.

  6. Because of this, the critical error message always shows up N-1 times, where there are N loops (see the sketch below).
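
A minimal sketch of that failure mode, and of the killall fix mentioned above; the port, cluster name, and sleep intervals are placeholders taken from the reproducer:

# Failure mode: delete a live broker's data dir out from under it.
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+ &> ./qpidd.10000.log &
sleep 5
rm -rf ./dd10000     # the broker can no longer save its cluster store status
                     # and dies with the basic_ios::clear exception

# Fix (putting killall into run_the_batch): stop brokers before cleanup.
killall qpidd
sleep 2
rm -rf ./dd10000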