Description of problem:
When a number of clients repeatedly try to subscribe to a non-existent queue and a 4-node cluster is then started, some of the nodes fail due to inconsistent errors.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-17.el5

How reproducible:
Reliably

Steps to Reproduce:
1. Have clients trying to connect and subscribe to a non-existent queue, e.g. using the attached test program:
   for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:127.0.0.1:5672,tcp:127.0.0.1:5673,tcp:127.0.0.1:5674,tcp:127.0.0.1:5675 & done
2. Start 4 cluster nodes for the clients to connect to:
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
3. Check that each node is alive:
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
   tail test-cluster-qpidd.567*.log

Actual results:
2 or 3 nodes fail with:
2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical 100.0.50.15:14363(CATCHUP/error) Error 4009 did not occur on 100.0.50.15:14545
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice 100.0.50.15:14363(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down

Expected results:
All nodes remain up and running.

Additional info:
I had to put a short sleep between starting the nodes; otherwise the surviving nodes appear to get stuck trying to join.
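For anyone unfamiliar with the "did not occur on all replicas" failure above: a clustered broker requires every replica to hit the same errors at the same points in the event stream, and a node whose local error is not confirmed by its peers must leave the cluster. The rule can be pictured with a minimal sketch like the following (hypothetical names; this is an illustration of the consistency rule, not qpid's actual cluster code):

```python
# Sketch of the replica error-consistency rule (hypothetical API, for
# illustration only). Each node records the error ids it saw locally;
# a node that saw an error its peers did not see is inconsistent with
# the cluster and must shut itself down, as in the log above.

def nodes_that_must_leave(errors_by_node):
    """errors_by_node maps node id -> set of error ids seen locally.

    Returns the set of nodes whose local errors did not occur on all
    replicas; those nodes abort the offending frames and leave."""
    leaving = set()
    for node, errors in errors_by_node.items():
        for other, other_errors in errors_by_node.items():
            if errors - other_errors:
                # This node saw an error that 'other' never saw.
                leaving.add(node)
    return leaving
```

In the log above, node 100.0.50.15:14363 hit error 4009 but peer 100.0.50.15:14545 did not, so 14363 logs "Aborted by local failure that did not occur on all replicas" and leaves the cluster.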
Created attachment 348500 [details] test program
In the attached program, --retry-interval 0 --url amqp:tcp:127.0.0.1 makes it easier to reproduce on some boxes. Also reproduced with qpidd-0.5.752581-18.el5.
Created attachment 348555 [details] test program that closes connections

The original test program runs out of file descriptors if you're not quick about starting the cluster.
Created attachment 348561 [details] patch to fix the issue

Fix also committed to trunk as r786294. This fix includes XML changes, so the generated code needs to be regenerated.
Fixed in qpidd-0.5.752581-20.
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors), the cluster can become unresponsive for long periods (indefinitely?). I also saw a node exit on one occasion with:

2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical 100.0.50.15:20677(CATCHUP/error) Error 4723 did not occur on 100.0.50.15:20317
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas
Update on the previous comment: the cluster is not indefinitely unresponsive; it became usable for me in my latest test after 30 minutes. If this is expected, it should be noted in the documentation/release notes.
Fixed in r790163. If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.
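The offer/retract behaviour described above can be sketched as a simple retry loop (made-up names and a deliberately simplified model; this is not the actual qpid cluster implementation):

```python
# Illustrative sketch of the fix in r790163: an update offer made to a
# newcomer is retracted if it arrives while an error is being resolved,
# and the newcomer simply offers again. Hypothetical names throughout.

class Cluster:
    def __init__(self):
        self.resolving_error = False  # True while an error is being resolved

    def handle_offer(self, newcomer):
        """Accept an update offer unless error resolution is in progress,
        in which case the offer is retracted."""
        if self.resolving_error:
            return "retracted"
        return "accepted"

def join(cluster, newcomer, max_attempts=10):
    """Newcomer keeps re-offering until an offer is accepted.

    Returns the attempt number on which the offer was accepted."""
    for attempt in range(1, max_attempts + 1):
        if cluster.handle_offer(newcomer) == "accepted":
            return attempt
        # Offer retracted mid-error-resolution; in the real broker the
        # error eventually resolves, which we model by clearing the flag.
        cluster.resolving_error = False
    raise RuntimeError("newcomer could not join")
```

The point of the retraction is that an update taken while an error is outstanding could leave the newcomer inconsistent with the rest of the cluster, so it is cheaper and safer to have the newcomer simply retry once the error has been resolved.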
Created attachment 350203 [details] Patch that really fixes the issue

Patch based on qpidd-0.5.752581-20 plus the patch from bug 508917 (apply that one first).
Committed to trunk as r790397.
Created attachment 350309 [details] corrected patch

The previous patch was missing the new files.
Fixed in qpidd-0.5.752581-22.
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward; the last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining the cluster or of being the first member).
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393; I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster, you will see all members in the JOINER state. It's quite rare and not an issue that would cause data loss in a running cluster, so I think it's reasonable to leave it for 1.2.
Believed fixed in qpidd-0.5.752581-22; the remaining issues seen are likely due to an over-aggressive test.
The fix has been validated on RHEL 5.3 i386 / x86_64 with the following packages:

[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2009-1153.html