Bug 506758 - Error inconsistencies when starting cluster
Error inconsistencies when starting cluster
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp (Show other bugs)
All Linux
urgent Severity urgent
: 1.1.6
: ---
Assigned To: Alan Conway
Frantisek Reznicek
Depends On:
  Show dependency treegraph
Reported: 2009-06-18 11:27 EDT by Gordon Sim
Modified: 2015-11-15 19:07 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-07-14 13:31:51 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
test program (2.91 KB, text/x-c++src)
2009-06-18 11:28 EDT, Gordon Sim
no flags Details
test program that closes connections (3.04 KB, text/x-c++src)
2009-06-18 17:05 EDT, Alan Conway
no flags Details
patch to fix the issue (8.26 KB, patch)
2009-06-18 17:25 EDT, Alan Conway
no flags Details | Diff
Patch that really fixes the issue (32.09 KB, patch)
2009-07-01 17:33 EDT, Alan Conway
no flags Details | Diff
corrected patch, previous patch was missing new files. (36.06 KB, patch)
2009-07-02 12:10 EDT, Alan Conway
no flags Details | Diff

  None (edit)
Description Gordon Sim 2009-06-18 11:27:46 EDT
Description of problem:

When a series of clients are trying to subscribe to a non-existent queue and a 4 node cluster is then started, some of the nodes fail due to inconsistent errors.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. have clients trying to connect and subscribe to a non-existent queue
   (e.g. using the attached test prog I run:

    for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:,tcp:,tcp:,tcp: & done

2. start 4 cluster nodes for the clients to connect to
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
3. check each node is alive
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
   tail test-cluster-qpidd.567*.log
Actual results:

2 or 3 nodes fail with:

2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical Error 4009 did not occur on
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down

Expected results:

All nodes remain up and running.

Additional info:

I had to put a short sleep between starting the nodes or the surviving nodes appear to get stuck trying to join.
Comment 1 Gordon Sim 2009-06-18 11:28:46 EDT
Created attachment 348500 [details]
test program
Comment 2 Gordon Sim 2009-06-18 12:11:00 EDT
In attached program, --retry-interval 0 --url amqp:tcp: makes it easier to reproduce on some boxes. Reproduced with qpidd-0.5.752581-18.el5 also.
Comment 3 Alan Conway 2009-06-18 17:05:12 EDT
Created attachment 348555 [details]
 test program  that closes connections

The original runs out of file descriptors if you're not sharp about starting the cluster.
Comment 4 Alan Conway 2009-06-18 17:25:42 EDT
Created attachment 348561 [details]
patch to fix the issue

Fix also commited to trun r786294

This fix includes XML changes so code needs to be re-generated.
Comment 5 Gordon Sim 2009-06-23 05:08:42 EDT
Fixed in qpidd-0.5.752581-20.
Comment 6 Gordon Sim 2009-06-23 06:05:47 EDT
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors) the cluster can become unresponsive for long periods (indefinitely?). 

I also saw a node exit on one occasion with:

2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical Error 4723 did not occur on
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas
Comment 7 Gordon Sim 2009-06-23 06:25:47 EDT
Update on previous comment - cluster is not indefinitely unresponsive, it became usable for me in my latest test after 30 mins. If this is expected it should be noted in the documentation/release notes.
Comment 8 Alan Conway 2009-07-01 09:47:55 EDT
Fixed in r790163

If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.
Comment 9 Alan Conway 2009-07-01 17:33:59 EDT
Created attachment 350203 [details]
Patch that really fixes the issue

Patch based on qpidd-0.5.752581-20 + patch from bug 508917 (apply that first)
Comment 10 Alan Conway 2009-07-01 17:59:26 EDT
Comitted to trunk r790397
Comment 11 Alan Conway 2009-07-02 12:10:42 EDT
Created attachment 350309 [details]
corrected patch, previous patch was missing new files.
Comment 12 Gordon Sim 2009-07-03 03:28:33 EDT
Fixed in qpidd-0.5.752581-22.
Comment 13 Gordon Sim 2009-07-03 05:31:21 EDT
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward, last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining cluster or of being the first member).
Comment 14 Alan Conway 2009-07-07 09:16:28 EDT
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393, I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster you will see all members in the JOINER state.

Its quite rare and not an issue that would cause data loss in a running cluster so I think its reasonable to leave it for 1.2
Comment 15 Gordon Sim 2009-07-07 13:39:35 EDT
Believed fixed in qpidd-0.5.752581-22; issues seen likely to be over-aggressive test.
Comment 16 Frantisek Reznicek 2009-07-09 10:44:06 EDT
The issue has been fixed validated on RHEL 5.3 i386 / x86_64 on packages:

[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u

Comment 18 errata-xmlrpc 2009-07-14 13:31:51 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.