Bug 672219 - clustered broker ends with exception 'Error delivering frames: Cannot save cluster store status: basic_ios::clear (qpid/cluster/StoreStatus.cpp:113)'
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: Frantisek Reznicek
QA Contact: MRG Quality Engineering
 
Reported: 2011-01-24 12:58 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:13 UTC

Doc Type: Bug Fix
Last Closed: 2011-01-27 14:41:58 UTC


Attachments
The issue reproducer (5.27 KB, application/x-tbz)
2011-01-24 13:04 UTC, Frantisek Reznicek

Description Frantisek Reznicek 2011-01-24 12:58:46 UTC
Description of problem:

While testing bug 667428, I found that brokers sometimes refuse to start, or stop after some time of operation, when the cluster width changes.

I have triggered the issue about five times, most recently with a very simple reproducer:
[root@dhcp-26-233 bz667428]# service qpidd start
Starting Qpid AMQP daemon: 2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2011-01-24 13:49:20 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
Daemon startup failed: Cannot load cluster store status: basic_ios::clear (qpid/cluster/StoreStatus.cpp:100)

My environment uses this configuration file:
cluster-mechanism=ANONYMOUS
cluster-name=freznice-virt-cluster
log-enable=info+
log-to-file=/tmp/qpidd.log
mgmt-pub-interval=1

The bug 667428 test launches the brokers in the background with separate parameters, for example:
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+  &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+  &> ./qpidd.10000.log &


Version-Release number of selected component (if applicable):
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-java-common-0.7.946106-14.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-tools-0.7.946106-12.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-java-example-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5

How reproducible:
Seems quite frequent; this is a brand-new issue and reproducibility is still unclear.

Steps to Reproduce:
1. Configure the qpidd.conf file.
2. Start the broker in a cluster (width=1): service qpidd start
3. service qpidd stop
4. Start brokers with another cluster name and separate data directories:
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 --data-dir=./dd5672 --log-enable=info+  &> ./qpidd.5672.log &
qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 --data-dir=./dd10000 --log-enable=info+  &> ./qpidd.10000.log &
5. Stop all running brokers: killall qpidd
6. Start the original broker as a service: service qpidd start
7. The problem above should now appear.
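The steps above can be collected into a single script. This is a minimal sketch: the commands, ports, and cluster name come from the report, but the `reproduce` wrapper name is a hypothetical addition, and the function is only defined, not invoked, since it requires a host with qpidd and openais configured.

```shell
#!/bin/bash
# Minimal sketch of the reproduction steps above.
# "reproduce" is a hypothetical wrapper name; it is defined but not
# called here because it needs a configured qpidd/openais host.
reproduce() {
    service qpidd start        # step 2: cluster of width 1 via qpidd.conf
    service qpidd stop         # step 3

    # step 4: brokers with another cluster name and separate data dirs
    qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 5672 \
          --data-dir=./dd5672 --log-enable=info+ &> ./qpidd.5672.log &
    qpidd --auth=no --cluster-name=bz667428_at_$(hostname) -p 10000 \
          --data-dir=./dd10000 --log-enable=info+ &> ./qpidd.10000.log &

    killall qpidd              # step 5: stop all running brokers
    service qpidd start        # step 6: fails with the StoreStatus exception
}
```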

Actual results:
The broker refuses to start or shuts down unexpectedly.

Expected results:
The broker should not refuse to start or shut down unexpectedly.


Additional info:

Please treat this defect as an early report. I'll add more details once I know more.

Comment 1 Frantisek Reznicek 2011-01-24 13:04:32 UTC
Created attachment 474954 [details]
The issue reproducer

This is the script which reproduces the problem.

Dependencies:
  openais configured and running
  java-openjdk 1.6 installed (main + devel packages)
  qpid-java-* packages installed
How to run:
  ./bz667428 3

Then look into qpidd*.log; you should see the exception message there.

Comment 3 mick 2011-01-26 17:19:25 UTC
I think this bug is iatrogenic.

( Adjective: Of or relating to illness caused by medical examination or treatment. )


When I put the killall command into the run_the_batch function, the problem goes away.

As it is now, here's what I believe is happening:

  1. Loop 1 always works fine. The error does not occur when we kill the Tx client, which is what I believe the test is intended to exercise.

  2. When we come to loop 2, we have *not* killed the broker on port 10000.

  3. But during loop 2, we delete its data dir out from under it.

  4. The original broker on port 10000 then drops dead, with the critical error message "I can't write my store state any more."

  5. The test then starts a new broker on port 10000, and the process repeats.

  6. Because of this, the critical error message always shows up N-1 times, where N is the number of loops.
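The fix described above can be sketched as follows. This is a hypothetical reconstruction: `run_the_batch` and its body are assumptions based on the comment, not the actual contents of the attached reproducer; only the placement of the `killall` reflects what the comment states.

```shell
# Hypothetical sketch of the fix: kill all brokers at the end of each
# iteration so no broker survives into the next loop only to have its
# data directory deleted out from under it. The real run_the_batch body
# lives in the attached reproducer and is not reproduced here.
run_the_batch() {
    # ... start brokers, run the Tx client, kill the Tx client ...
    killall qpidd 2>/dev/null   # added: no broker outlives its data dir
    sleep 1                     # let brokers exit before the dirs are removed
}
```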

