Description of problem:
If a message is routed by an exchange to more than one queue with the cluster-durable property enabled, it will only become persistent on the first of those queues should the cluster-durable functionality be invoked.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-22.el5

How reproducible:
100%

Steps to Reproduce:
1. Start a two-node cluster.
2. Create some queues with cluster durability enabled, e.g.:
   for q in `seq 1 10`; do qpid-config add queue queue-$q --durable --cluster-durable; done
3. Bind those queues to some exchange such that they can be addressed as a group, e.g.:
   for q in `seq 1 10`; do qpid-config bind amq.fanout queue-$q; done
4. Send some messages to that exchange matching this binding, e.g.:
   for i in `seq 1 10`; do echo "Message$i"; done | sender --exchange amq.fanout --send-eos 1
5. Kill one node of the cluster.
6. Stop and recover the other cluster node.
7. Check that each queue has the expected messages recovered.

Actual results:
Only the first queue has the messages.

Expected results:
All queues have the messages.
Do we know if the data has been written to the journal correctly for all the queues? That side looks correct in the code, so I'm wondering whether the patch above for 509803 might also be the issue here for recovery.
Created attachment 350679 [details]
candidate fix for issue

The issue is that getPersistentID() was being used in Queue::setLastNodeFailure() to decide whether to enqueue the message to the store. Because the persistent ID gets set when the first queue stores the message, the remaining queues are skipped. The patch above corrects this logic; a test is still needed.
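For illustration, here is a minimal standalone C++ sketch of the flawed control flow. The Message/Queue types and the store-enqueue stand-in below are simplified assumptions for this model, not the actual qpid broker classes:

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for the broker classes -- not the real qpid API.
struct Message {
    std::string persistentId;                      // set once the message reaches the store
    bool hasPersistentId() const { return !persistentId.empty(); }
};

struct Queue {
    std::string name;
    std::vector<std::shared_ptr<Message>> messages;

    // Buggy guard: the persistent ID is a per-message property shared by
    // every queue the fanout delivered to, so once the first queue stores
    // the message, every later queue skips its own enqueue.
    void setLastNodeFailure() {
        for (auto& m : messages) {
            if (m->hasPersistentId()) continue;    // wrong: already set by the first queue
            m->persistentId = name + "-rid-1";     // stand-in for the store enqueue
            std::cout << name << ": message forced to store\n";
        }
    }
};

int main() {
    auto shared = std::make_shared<Message>();     // fanout: one message on all queues
    Queue q1{"queue-1", {shared}};
    Queue q2{"queue-2", {shared}};
    q1.setLastNodeFailure();                       // stores the message
    q2.setLastNodeFailure();                       // silently skips it -> lost on recovery
}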
Fix and unit test committed to trunk. Committed revision 791672. Confirmed the patch (id=350679) is a valid fix.
The proposed patch introduces another, arguably worse, issue: it results in duplicate attempts to enqueue the same message if the last-man-standing mode is ever invoked again while one or more messages that were previously 'forced persistent' are still on the queue. The last man standing then dies with:

2009-07-07 09:07:05 error Error delivering frames: Queue test-queue: store() failed: jexception 0x0b00 enq_map::insert_pfid() threw JERR_MAP_DUPLICATE: Attempted to insert record into map using duplicate key. (rid=0x1 pfid=0x0) (MessageStoreImpl.cpp:1485)
2009-07-07 09:07:05 notice 192.168.0.2:5985(LEFT) leaving cluster grs
2009-07-07 09:07:05 notice Shut down
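The guard therefore needs to be per queue rather than per message, and it must survive a second invocation of last-man-standing mode. A minimal sketch of that shape, again using hypothetical simplified types (Store below simulates the journal's enq_map duplicate-key behaviour; none of these names are the real qpid/rhm API):

#include <iostream>
#include <memory>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Message { int rid; };                        // rid: journal record id

// Simulates the journal's enq_map: inserting the same (queue, rid)
// pair twice is fatal, mirroring JERR_MAP_DUPLICATE in the log above.
struct Store {
    std::set<std::pair<std::string, int>> enqueued;
    void enqueue(const std::string& queue, const Message& m) {
        if (!enqueued.insert({queue, m.rid}).second)
            throw std::runtime_error("JERR_MAP_DUPLICATE (simulated)");
    }
};

struct Queue {
    std::string name;
    std::vector<std::shared_ptr<Message>> messages;
    std::set<int> storedHere;                       // what THIS queue already stored

    void setLastNodeFailure(Store& store) {
        for (auto& m : messages) {
            if (storedHere.count(m->rid)) continue; // already forced persistent here
            store.enqueue(name, *m);
            storedHere.insert(m->rid);
        }
    }
};

int main() {
    Store store;
    auto shared = std::make_shared<Message>(Message{1});
    Queue q{"test-queue", {shared}, {}};
    q.setLastNodeFailure(store);                    // first failover: stored once
    q.setLastNodeFailure(store);                    // second failover: skipped, no duplicate
    std::cout << "no duplicate enqueue\n";
}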
The above case has been corrected on trunk, with tests: Committed revision 791858.
Created attachment 350819 [details]
fix and unit tests for issue
Created attachment 350962 [details]
fix and unit tests for issue

This patch also corrects the requeue() case for acquired messages, which the last patch regressed.
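To sketch why requeue() matters here: a message a consumer had acquired and then releases back onto the queue re-enters the delivery path, so it must pass through the same already-stored check, or it is either stored twice (the duplicate-key failure above) or never force-persisted at all. Continuing the simplified Queue/Store model from the previous sketch (hypothetical names, not the qpid code):

// Continuing the Queue/Store model from the previous sketch.
// requeue(): a released (previously acquired) message goes through the
// same per-queue guard, so it is stored exactly once no matter how it
// re-enters the queue.
void requeue(Queue& q, Store& store, const std::shared_ptr<Message>& m,
             bool lastNodeMode) {
    q.messages.push_back(m);
    if (lastNodeMode && !q.storedHere.count(m->rid)) {
        store.enqueue(q.name, *m);
        q.storedHere.insert(m->rid);
    }
}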
Created attachment 350963 [details]
patch for issue

Removed duplicate patch detail from the other BZ.
Created attachment 350990 [details]
Updated fix
Fixed in qpidd-0.5.752581-25.el5
Tested: the bug appears on -22 and is fixed on -25.

Validated on RHEL 5.3 i386 / x86_64 with packages:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-22.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html