509800 – If journal capacity is exceeded as a result of cluster-durable mode being invoked, last man standing exits

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1.1
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	1.3
Target Release:	---
Assignee:	Carl Trieloff
QA Contact:	Jiri Kolar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-07-06 10:20 UTC by Gordon Sim
Modified:	2010-10-14 15:59 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	When the "--cluster-durable" mode was enabled, exceeding the journal capacity caused the last node to exit with the following error: Error delivering frames: Enqueue capacity threshold exceeded on queue "queue-name". (JournalImpl.cpp:576) With this update, the last node no longer shuts down when the journal capacity is exceeded.
Clone Of:
Environment:
Last Closed:	2010-10-14 15:59:31 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0773	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3	2010-10-14 15:56:44 UTC

Description Gordon Sim 2009-07-06 10:20:01 UTC

Description of problem:

The 'cluster-durable' mode is supposed to force transient messages to be persistent when cluster membership drops to one node. However, if a queue contains more messages than can fit in the journal when this happens, that last node will also exit at this point.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-22.el5
rhm-0.5.3206-5.el5

How reproducible:

100%

Steps to Reproduce:
1. start two node cluster
2. create queue with cluster-durability enabled

  qpid-config add queue test-queue --durable --cluster-durable
  
3. fill queue with large number of transient messages

  for i in `seq 1 300000`; do echo "Message$i"; done | sender

4. kill one of the cluster nodes
  
Actual results:

The other node (not the one killed) exits with:

2009-jul-06 06:13:31 notice 10.16.44.221:26093(READY) last broker standing, update queue policies
2009-jul-06 06:13:31 warning Journal "test-queue": Enqueue capacity threshold exceeded on queue "test-queue".
2009-jul-06 06:13:31 error Error delivering frames: Enqueue capacity threshold exceeded on queue "test-queue". (JournalImpl.cpp:576)
2009-jul-06 06:13:31 notice 10.16.44.221:26093(LEFT) leaving cluster grs-mrg14-test-cluster
2009-jul-06 06:13:31 notice Shut down

Expected results:

Should not exit. Probably should just print an error indicating that not all messages could be persisted.

Additional info:

Comment 2 Gordon Sim 2009-07-31 15:26:20 UTC

I believe that the solution is to add exception handling in or around Queue::setLastNodeFailure(). This is the only place where there is sufficient context to know how to handle the error and log an appropriate error message.
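
A minimal sketch of that idea, under stated assumptions: the Journal and Queue classes below are simplified, hypothetical stand-ins and not the actual qpid-cpp classes; only the method name setLastNodeFailure() comes from this report. The sketch shows the capacity error being caught where the queue context is known, so the caller gets a count of unpersisted messages instead of an exception that takes the broker down.

```cpp
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the store journal: accepts a fixed number of enqueues.
class Journal {
public:
    explicit Journal(std::size_t capacity) : capacity_(capacity), used_(0) {}

    // Throws once the (assumed) capacity threshold is reached.
    void enqueue(const std::string& /*msg*/) {
        if (used_ >= capacity_)
            throw std::runtime_error("Enqueue capacity threshold exceeded");
        ++used_;
    }

    std::size_t used() const { return used_; }

private:
    std::size_t capacity_;
    std::size_t used_;
};

// Hypothetical stand-in for a broker queue holding transient messages.
class Queue {
public:
    Queue(std::string name, std::size_t journalCapacity)
        : name_(std::move(name)), journal_(journalCapacity) {}

    void push(const std::string& msg) { messages_.push_back(msg); }

    std::size_t persisted() const { return journal_.used(); }

    // Called when this broker becomes the last node standing.
    // Tries to persist every held message; on a journal error it logs
    // and stops instead of letting the exception escape.
    // Returns the number of messages that could not be persisted.
    std::size_t setLastNodeFailure() {
        std::size_t failed = 0;
        for (std::size_t i = 0; i < messages_.size(); ++i) {
            try {
                journal_.enqueue(messages_[i]);
            } catch (const std::exception& e) {
                failed = messages_.size() - i;
                std::cerr << "error: queue \"" << name_ << "\": " << e.what()
                          << "; " << failed
                          << " message(s) could not be persisted\n";
                break;  // the journal is full, further attempts would also fail
            }
        }
        return failed;
    }

private:
    std::string name_;
    Journal journal_;
    std::vector<std::string> messages_;
};
```

With a journal capacity of 3 and 5 queued messages, setLastNodeFailure() in this sketch persists 3, logs one error, and returns 2; nothing propagates to the caller. Whether the real fix stops at the first failure or keeps trying is not stated in this report.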

Comment 3 Carl Trieloff 2009-07-31 16:53:49 UTC

Fixed with unit test

Transmitting file data ..
Committed revision 799658.


Still needs system test before it can be marked modified.

Comment 5 Jiri Kolar 2010-06-18 13:54:54 UTC

509800

Tested:
on 752581 the bug appears
on 946106 it does not; it has been fixed

Validated on RHEL 5.5 i386 / x86_64; not on RHEL 4, because it has no clustering.

packages:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u

openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
python-qpid-0.7.946106-1.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-1.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5  
rhm-docs-0.7.946106-1.el5

->VERIFIED

Comment 6 Jaromir Hradilek 2010-10-08 09:57:01 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When the "--cluster-durable" mode was enabled, exceeding the journal capacity caused the last node to exit with the following error:

  Error delivering frames: Enqueue capacity threshold exceeded on queue "queue-name". (JournalImpl.cpp:576)

With this update, the last node no longer shuts down when the journal capacity is exceeded.

Comment 8 errata-xmlrpc 2010-10-14 15:59:31 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html
