Description of problem:
If a message is routed by an exchange to more than one queue with the cluster-durable property enabled, it will only become persistent on the first of those queues should the cluster-durable functionality be invoked.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-22.el5

How reproducible:
100%

Steps to Reproduce:
1. Start a two-node cluster.
2. Create some queues with cluster durability enabled, e.g.:
   for q in `seq 1 10`; do qpid-config add queue queue-$q --durable --cluster-durable; done
3. Bind those queues to some exchange such that they can be addressed as a group, e.g.:
   for q in `seq 1 10`; do qpid-config bind amq.fanout queue-$q; done
4. Send some messages to that exchange matching this binding, e.g.:
   for i in `seq 1 10`; do echo "Message$i"; done | sender --exchange amq.fanout --send-eos 1
5. Kill one node of the cluster.
6. Stop and recover the other cluster node.
7. Check that each queue has the expected messages recovered.

Actual results:
Only the first queue has the messages.

Expected results:
All queues have the messages.
Do we know if the data has been written to the journal correctly for all the queues? That side looks correct in the code, so I'm wondering whether the patch above for 509803 might also be the issue here for recovery.
Created attachment 350679 [details]
candidate fix for issue

The issue is that getPersistentID() was being used in Queue::setLastNodeFailure() to decide whether to enqueue the message to the store. Because the persistent ID gets set when the first queue stores the message, the remaining queues are skipped. The patch above corrects this logic; a test is still needed.
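For illustration, here is a minimal standalone C++ sketch of the flawed control flow. The Message/Queue types and the store-enqueue stand-in below are simplified assumptions for this model, not the actual qpid broker classes:

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for the broker classes -- not the real qpid API.
struct Message {
    std::string persistentId;                      // set once the message reaches the store
    bool hasPersistentId() const { return !persistentId.empty(); }
};

struct Queue {
    std::string name;
    std::vector<std::shared_ptr<Message>> messages;

    // Buggy guard: the persistent ID is a per-message property shared by
    // every queue the fanout delivered to, so once the first queue stores
    // the message, every later queue skips its own enqueue.
    void setLastNodeFailure() {
        for (auto& m : messages) {
            if (m->hasPersistentId()) continue;    // wrong: already set by the first queue
            m->persistentId = name + "-rid-1";     // stand-in for the store enqueue
            std::cout << name << ": message forced to store\n";
        }
    }
};

int main() {
    auto shared = std::make_shared<Message>();     // fanout: one message on all queues
    Queue q1{"queue-1", {shared}};
    Queue q2{"queue-2", {shared}};
    q1.setLastNodeFailure();                       // stores the message
    q2.setLastNodeFailure();                       // silently skips it -> lost on recovery
}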
Fix and unit test committed to trunk. Committed revision 791672. Confirmed the patch (id=350679) is a valid fix.
The proposed patch introduces another, arguably worse, issue: it results in duplicate attempts to enqueue the same message if the last-man-standing mode is ever invoked again while one or more messages that were previously 'forced persistent' are still on the queue. The last man standing then dies with:

2009-07-07 09:07:05 error Error delivering frames: Queue test-queue: store() failed: jexception 0x0b00 enq_map::insert_pfid() threw JERR_MAP_DUPLICATE: Attempted to insert record into map using duplicate key. (rid=0x1 pfid=0x0) (MessageStoreImpl.cpp:1485)
2009-07-07 09:07:05 notice 192.168.0.2:5985(LEFT) leaving cluster grs
2009-07-07 09:07:05 notice Shut down
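The guard therefore needs to be per queue rather than per message, and it must survive a second invocation of last-man-standing mode. A minimal sketch of that shape, again using hypothetical simplified types (Store below simulates the journal's enq_map duplicate-key behaviour; none of these names are the real qpid/rhm API):

#include <iostream>
#include <memory>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Message { int rid; };                        // rid: journal record id

// Simulates the journal's enq_map: inserting the same (queue, rid)
// pair twice is fatal, mirroring JERR_MAP_DUPLICATE in the log above.
struct Store {
    std::set<std::pair<std::string, int>> enqueued;
    void enqueue(const std::string& queue, const Message& m) {
        if (!enqueued.insert({queue, m.rid}).second)
            throw std::runtime_error("JERR_MAP_DUPLICATE (simulated)");
    }
};

struct Queue {
    std::string name;
    std::vector<std::shared_ptr<Message>> messages;
    std::set<int> storedHere;                       // what THIS queue already stored

    void setLastNodeFailure(Store& store) {
        for (auto& m : messages) {
            if (storedHere.count(m->rid)) continue; // already forced persistent here
            store.enqueue(name, *m);
            storedHere.insert(m->rid);
        }
    }
};

int main() {
    Store store;
    auto shared = std::make_shared<Message>(Message{1});
    Queue q{"test-queue", {shared}, {}};
    q.setLastNodeFailure(store);                    // first failover: stored once
    q.setLastNodeFailure(store);                    // second failover: skipped, no duplicate
    std::cout << "no duplicate enqueue\n";
}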
The above case has been corrected on trunk, with tests: Committed revision 791858.
Created attachment 350819 [details]
fix and unit tests for issue
Created attachment 350962 [details]
fix and unit tests for issue

This patch also corrects the requeue() case for acquired messages, which the last patch regressed.
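To sketch why requeue() matters here: a message a consumer had acquired and then releases back onto the queue re-enters the delivery path, so it must pass through the same already-stored check, or it is either stored twice (the duplicate-key failure above) or never force-persisted at all. Continuing the simplified Queue/Store model from the previous sketch (hypothetical names, not the qpid code):

// Continuing the Queue/Store model from the previous sketch.
// requeue(): a released (previously acquired) message goes through the
// same per-queue guard, so it is stored exactly once no matter how it
// re-enters the queue.
void requeue(Queue& q, Store& store, const std::shared_ptr<Message>& m,
             bool lastNodeMode) {
    q.messages.push_back(m);
    if (lastNodeMode && !q.storedHere.count(m->rid)) {
        store.enqueue(q.name, *m);
        q.storedHere.insert(m->rid);
    }
}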
Created attachment 350963 [details]
patch for issue

Removed duplicate patch detail from the other BZ.
Created attachment 350990 [details]
Updated fix
Fixed in qpidd-0.5.752581-25.el5
Tested: the bug appears on -22 and is fixed on -25.

Validated on RHEL 5.3 i386 / x86_64 with packages:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-22.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html