Description of problem:
The 'cluster-durable' mode is supposed to force transient messages to be persistent when cluster membership drops to one node. However, if a queue contains more messages than can fit in the journal when this happens, that last node also exits.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-22.el5
rhm-0.5.3206-5.el5

How reproducible:
100%

Steps to Reproduce:
1. start two node cluster
2. create queue with cluster-durability enabled
   qpid-config add queue test-queue --durable --cluster-durable
3. fill queue with large number of transient messages
   for i in `seq 1 300000`; do echo "Message$i"; done | sender
4. kill one of the cluster nodes

Actual results:
The other node (not the one killed) exits with:
2009-jul-06 06:13:31 notice 10.16.44.221:26093(READY) last broker standing, update queue policies
2009-jul-06 06:13:31 warning Journal "test-queue": Enqueue capacity threshold exceeded on queue "test-queue".
2009-jul-06 06:13:31 error Error delivering frames: Enqueue capacity threshold exceeded on queue "test-queue". (JournalImpl.cpp:576)
2009-jul-06 06:13:31 notice 10.16.44.221:26093(LEFT) leaving cluster grs-mrg14-test-cluster
2009-jul-06 06:13:31 notice Shut down

Expected results:
Should not exit. Probably should just print an error indicating that not all messages could be persisted.

Additional info:
I believe that the solution is to add exception handling in or around Queue::setLastNodeFailure(). This is the only place where there is sufficient context to know how to handle the error and log an appropriate error message.
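To illustrate the idea, here is a minimal, self-contained sketch of catching the journal error inside a setLastNodeFailure()-style method. It is not the actual qpid code or the committed fix: the Message, Journal and Queue classes below are hypothetical stand-ins, and the capacity check and return value are assumptions made for the example.

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the real qpid types.
struct Message {
    std::string body;
    bool persistent;
};

// Toy journal with a fixed capacity; enqueue() throws once it is full.
class Journal {
public:
    explicit Journal(std::size_t capacity) : capacity_(capacity), used_(0) {}

    void enqueue(const Message&) {
        if (used_ >= capacity_)
            throw std::runtime_error("Enqueue capacity threshold exceeded");
        ++used_;
    }

private:
    std::size_t capacity_;
    std::size_t used_;
};

class Queue {
public:
    Queue(std::string name, std::size_t journalCapacity)
        : name_(std::move(name)), journal_(journalCapacity) {}

    // Adds a transient (non-persistent) message to the queue.
    void push(const std::string& body) {
        messages_.push_back(Message{body, false});
    }

    // Called when this broker becomes the last node standing.
    // Tries to persist every transient message. A journal error is caught
    // here, where the context is known, instead of propagating upwards and
    // taking the broker down. Returns how many messages were NOT persisted.
    std::size_t setLastNodeFailure() {
        std::size_t failed = 0;
        for (Message& m : messages_) {
            if (m.persistent) continue;
            try {
                journal_.enqueue(m);
                m.persistent = true;
            } catch (const std::exception& e) {
                if (failed == 0)
                    std::cerr << "error: queue \"" << name_ << "\": " << e.what() << '\n';
                ++failed;
            }
        }
        if (failed > 0)
            std::cerr << "error: queue \"" << name_ << "\": " << failed
                      << " message(s) could not be persisted\n";
        return failed;
    }

    // Number of messages currently marked persistent.
    std::size_t persistedCount() const {
        std::size_t n = 0;
        for (const Message& m : messages_)
            if (m.persistent) ++n;
        return n;
    }

private:
    std::string name_;
    Journal journal_;
    std::vector<Message> messages_;
};
```

In this sketch a queue holding three messages with a journal capacity of two persists the first two, logs the failure once, and returns 1 rather than letting the exception escape.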
Fixed with unit test.

Transmitting file data ..
Committed revision 799658.

Still needs system test before it can be marked modified.
509800 Tested: the bug appears on 752581 and does not appear on 946106.

The fix has been validated on RHEL 5.5 i386 / x86_64. It was not validated on RHEL4 because there are no clustering packages.

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
python-qpid-0.7.946106-1.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-1.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rhm-docs-0.7.946106-1.el5

->VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When the "--cluster-durable" mode was enabled, exceeding the journal capacity caused the last node to exit with the following error:

Error delivering frames: Enqueue capacity threshold exceeded on queue "queue-name". (JournalImpl.cpp:576)

With this update, the last node no longer shuts down when the journal capacity is exceeded.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html