Red Hat Bugzilla – Bug 506758
Error inconsistencies when starting cluster
Last modified: 2015-11-15 19:07:19 EST
Description of problem:
When a number of clients are repeatedly trying to subscribe to non-existent queues and a 4-node cluster is then started, some of the nodes fail because the resulting errors do not occur consistently on all replicas.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. have clients trying to connect and subscribe to a non-existent queue
(e.g. using the attached test program I run:
for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:127.0.0.1:5672,tcp:127.0.0.1:5673,tcp:127.0.0.1:5674,tcp:127.0.0.1:5675 & done)
2. start 4 cluster nodes for the clients to connect to
for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
3. check each node is alive
for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
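The three steps above can be combined into one reproducer sketch. This is an illustrative script, not part of the report; it assumes qpidd and the attached subscribe program are in the paths shown in the steps, and the broker/client launch lines are left commented so only the URL construction runs:

```shell
#!/bin/sh
# Hypothetical reproducer combining steps 1-3 of the report.
PORTS="5672 5673 5674 5675"

# Build the failover URL from the port list.
URL=""
for p in $PORTS; do URL="$URL,tcp:127.0.0.1:$p"; done
URL="amqp:${URL#,}"                 # drop leading comma, add scheme
echo "$URL"

# Step 1: clients subscribing to non-existent queues (uncomment to run).
# for q in $(seq 1 50); do ./subscribe --queue "queue-$q" --url "$URL" & done

# Step 2: start the 4 cluster nodes with a short sleep between each.
# for p in $PORTS; do
#   /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster \
#     --data-dir "$(pwd)/test-cluster-temp-data-$p" --port "$p" \
#     --log-to-file "test-cluster-qpidd.$p.log" --log-to-syslog false --daemon
#   sleep 2
# done

# Step 3: check each node is alive.
# for p in $PORTS; do /usr/sbin/qpidd --check --port "$p"; done
```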
Actual results:
2 or 3 nodes fail with:
2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical 22.214.171.124:14363(CATCHUP/error) Error 4009 did not occur on 126.96.36.199:14545
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice 188.8.131.52:14363(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down
Expected results:
All nodes remain up and running.
I had to put a short sleep between starting the nodes or the surviving nodes appear to get stuck trying to join.
Created attachment 348500 [details]
With the attached program, passing --retry-interval 0 --url amqp:tcp:127.0.0.1 makes it easier to reproduce on some boxes. Reproduced with qpidd-0.5.752581-18.el5 also.
Created attachment 348555 [details]
test program that closes connections
The original runs out of file descriptors if you're not sharp about starting the cluster.
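A precaution when using the original program: check the soft descriptor limit before launching the 50 clients, since rapid retries against a not-yet-started cluster can exhaust a low limit. A minimal sketch; the headroom value WANT is an illustrative assumption, not from the report:

```shell
#!/bin/sh
# Sketch: warn if the soft file-descriptor limit looks too low for the test.
# The original test program does not close failed connections, so rapid
# retries can exhaust a low limit (the replacement attachment closes them).
WANT=4096                           # illustrative headroom, not a measured value
LIMIT=$(ulimit -n)
if [ "$LIMIT" != "unlimited" ] && [ "$LIMIT" -lt "$WANT" ]; then
    echo "descriptor limit $LIMIT is below $WANT; consider 'ulimit -n $WANT'" >&2
fi
echo "$LIMIT"
```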
Created attachment 348561 [details]
patch to fix the issue
Fix also committed to trunk as r786294.
This fix includes XML changes so code needs to be re-generated.
Fixed in qpidd-0.5.752581-20.
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors) the cluster can become unresponsive for long periods (indefinitely?).
I also saw a node exit on one occasion with:
2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical 184.108.40.206:20677(CATCHUP/error) Error 4723 did not occur on 220.127.116.11:20317
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas
Update on previous comment - cluster is not indefinitely unresponsive, it became usable for me in my latest test after 30 mins. If this is expected it should be noted in the documentation/release notes.
Fixed in r790163
If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.
Created attachment 350203 [details]
Patch that really fixes the issue
Patch based on qpidd-0.5.752581-20 + patch from bug 508917 (apply that first)
Committed to trunk as r790397.
Created attachment 350309 [details]
Corrected patch; the previous patch was missing new files.
Fixed in qpidd-0.5.752581-22.
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward; the last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining the cluster or of being the first member).
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393, I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster you will see all members in the JOINER state.
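To confirm the stuck-in-JOINER diagnosis from the debug logs, the per-node log files named in the reproduction steps can be scanned for each node's last reported cluster state. A sketch; the state-name extraction pattern is an assumption about the log format, not taken from the report:

```shell
#!/bin/sh
# Sketch: after starting the brokers with --log-enable=debug+:cluster, report
# the last cluster state each node logged. Every node showing JOINER matches
# the hang described above. Log file names follow the reproduction steps.
for p in 5672 5673 5674 5675; do
    log="test-cluster-qpidd.$p.log"
    state=$(grep -Eo 'JOINER|CATCHUP|READY' "$log" 2>/dev/null | tail -1)
    echo "port $p: ${state:-no cluster state logged}"
done
```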
It's quite rare and not an issue that would cause data loss in a running cluster, so I think it's reasonable to leave it for 1.2.
Believed fixed in qpidd-0.5.752581-22; the remaining issues seen are likely due to an over-aggressive test.
The fix has been validated on RHEL 5.3 i386 / x86_64 with the following packages:
[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.