| Summary: | Cancelling undelivered messages during shutdown leads to message loss | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise SOA Platform 5 | Reporter: | Kevin Conner <kevin.conner> |
| Component: | JBoss Messaging | Assignee: | trev <tkirby> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.1.0.ER4 | CC: | jbertram |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | http://jira.jboss.org/jira/browse/SOA-2593 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-11-19 09:34:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Kevin Conner
2010-11-18 12:27:08 UTC
This can be reproduced simply by creating a clustered queue over two nodes and delaying the delivery via the message sucker. The second node should contain the only consumer, with all messages being delivered to the first. The following byteman script will delay the delivery, allowing the messages to buffer; during that delay a clean shutdown of the first server will result in the failure ([ServerSessionEndpoint] No DLQ has been specified so the message will be removed).

```
RULE delay delivery
CLASS org.jboss.messaging.core.impl.clusterconnection.MessageSucker
METHOD onMessage
AT CALL send
IF true
DO Thread.sleep(1000)
ENDRULE
```

The bug appears to be in ServerSessionEndpoint.cancelDeliveryInternal:
```java
boolean reachedMaxDeliveryAttempts =
    cancel.isReachedMaxDeliveryAttempts() || cancel.getDeliveryCount() >= rec.maxDeliveryAttempts;

// Observed values at the point of failure:
// cancel.isReachedMaxDeliveryAttempts() == false
// cancel.getDeliveryCount()             == 0
// rec.maxDeliveryAttempts               == -1
```
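With those values the expression evaluates to `false || (0 >= -1)`, i.e. true, so the cancelled delivery is treated as having exhausted its delivery attempts and, with no DLQ configured, the message is dropped. Below is a minimal, self-contained sketch of that evaluation; the variable names mirror the snippet above, and the guarded variant is only an illustration of the kind of check a fix would need, not the actual JBM patch.

```java
public class CancelDeliveryCheck {
    public static void main(String[] args) {
        boolean isReachedMaxDeliveryAttempts = false; // observed value
        int deliveryCount = 0;                        // observed value
        int maxDeliveryAttempts = -1;                 // -1 is commonly used to mean "unlimited"

        // The condition as written in cancelDeliveryInternal:
        boolean reachedMax =
            isReachedMaxDeliveryAttempts || deliveryCount >= maxDeliveryAttempts;
        System.out.println("buggy check:   " + reachedMax); // true, because 0 >= -1

        // Hypothetical guard treating -1 as "no limit" (illustration only):
        boolean guarded =
            isReachedMaxDeliveryAttempts
                || (maxDeliveryAttempts != -1 && deliveryCount >= maxDeliveryAttempts);
        System.out.println("guarded check: " + guarded); // false: the delivery is cancelled, not dropped
    }
}
```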
Link Added: This issue depends on JBPAPP-5429

Thanks Kevin. I'm having trouble reproducing it. Here is what I did:

1. Start a cluster of two nodes, node0 and node1.
2. Send a message to node0, but receive it on node1.
3. In MessageSucker.onMessage(), I put a 20 sec sleep before the send() call.
4. When the message is sucked from node0 to node1, the onMessage() method is called. During the sleep I shut down node0 (control-c).
5. I observe the message is still received by the consumer on node1.

Am I missing some step?

Howard

You need to send multiple messages to the first node so that the deliveries buffer up in the consumer associated with the MessageSucker. It is the buffered messages which are cancelled and then lost. Also, use the above byteman script to stall the delivery to the local queue on the second node.

Ignore the part about byteman; it looks like you have modified the code directly to introduce the delay. The key is sending multiple messages so that they buffer and must be cancelled (see the client sketch at the end of this report).

Thanks Kevin. This time I sent 3 messages, but I still didn't reproduce it. I'm not familiar with byteman but I'll try tomorrow. Just to confirm with you:

1. Messages are sent to the first node and then sucked to the second node (the one the only consumer connects to).
2. The sleep happens at the second node, before the send() call in onMessage().
3. Shut down the second node so the first node will cancel the messages.

These are the steps I did. If the steps are correct, then I guess this issue does not always happen. Thanks

This looks a lot like JBMESSAGING-1774. Can anyone confirm?

Assigning to Trevor, since when this is fixed it will involve a build update of EAP.

Hi Justin, once again you saved me. :) I can pretty much confirm this is JBMESSAGING-1774, as I have looked at the code and maxDeliveryAttempts is nowhere set to -1 in Branch_1_4 (the JBM dev branch). I had been wondering whether I had missed some hidden code. Thanks Justin. Can you deliver a patch to Kevin and let him confirm? :)

Eagle eye Justin.

Howard

Rejecting this for SOA 5.1 as the fix was made elsewhere in the JBM codebase and I didn't pick up on that. The fix was done for JBM 1.4.7 GA, which is the version we currently use.
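For reference, here is a minimal JMS client sketch of the reproduction described above: several messages are sent to the first node so that they buffer in the consumer associated with the MessageSucker while delivery on the second node is stalled. The JNDI names (`ConnectionFactory`, `queue/testDistributedQueue`), the provider URL, and the message count are assumptions for illustration and will differ per installation.

```java
import java.util.Properties;
import javax.jms.*;
import javax.naming.InitialContext;

public class ClusteredQueueRepro {
    public static void main(String[] args) throws Exception {
        Properties env = new Properties();
        env.put("java.naming.factory.initial", "org.jnp.interfaces.NamingContextFactory");
        env.put("java.naming.provider.url", "jnp://node0:1099"); // first node (assumed host/port)

        InitialContext ctx = new InitialContext(env);
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("queue/testDistributedQueue"); // assumed clustered queue name

        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);

            // Send several messages so that deliveries buffer up in the consumer
            // associated with the MessageSucker on the second node.
            for (int i = 0; i < 10; i++) {
                producer.send(session.createTextMessage("message-" + i));
            }
        } finally {
            connection.close();
        }
        // With delivery stalled on the second node (byteman sleep in MessageSucker.onMessage),
        // cleanly shut down the first node while these messages are still buffered.
    }
}
```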