Bug 780210 (SOA-2593) - Cancelling undelivered messages during shutdown leads to message loss
Summary: Cancelling undelivered messages during shutdown leads to message loss
Keywords:
Status: CLOSED NOTABUG
Alias: SOA-2593
Product: JBoss Enterprise SOA Platform 5
Classification: JBoss
Component: JBoss Messaging
Version: 5.1.0.ER4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: trev
QA Contact:
URL: http://jira.jboss.org/jira/browse/SOA...
Whiteboard:
Depends On:
Blocks:
Reported: 2010-11-18 12:27 UTC by Kevin Conner
Modified: 2010-11-19 09:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-11-19 09:34:01 UTC
Type: Bug



Links
System                 ID        Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  SOA-2593  0        None      None    None     Never

Description Kevin Conner 2010-11-18 12:27:08 UTC
project_key: SOA

The message sucker attaches from one node to another, pulling excess messages for delivery locally.  The client consumer for the message sucker (and presumably those of other direct clients) then buffers up the messages for delivery at a later point in time.

If the originating server shuts down then the second node receives a closing notification, the result of which is an attempt to cancel the delivery of the outstanding messages.

The originating server receives the cancel request but, erroneously, deletes the message from the database under the misguided belief that it has failed delivery and has no DLQ, resulting in message loss.

Comment 1 Kevin Conner 2010-11-18 12:30:24 UTC
This can be reproduced simply, by creating a clustered queue over two nodes and delaying the delivery via the message sucker.

The second node should contain the only consumer, with all messages being sent to the first node.

The following byteman script will delay the delivery, allowing the messages to buffer, during which a clean shutdown of the first server will result in the failure

([ServerSessionEndpoint] No DLQ has been specified so the message will be removed)



RULE delay delivery
CLASS org.jboss.messaging.core.impl.clusterconnection.MessageSucker
METHOD onMessage
AT CALL send
IF true
DO Thread.sleep(1000)
ENDRULE



Comment 2 Kevin Conner 2010-11-18 12:33:46 UTC
The bug appears to be in ServerSessionEndpoint.cancelDeliveryInternal

      boolean reachedMaxDeliveryAttempts =
         cancel.isReachedMaxDeliveryAttempts() || cancel.getDeliveryCount() >= rec.maxDeliveryAttempts;

cancel.isReachedMaxDeliveryAttempts() == false
cancel.getDeliveryCount() == 0
rec.maxDeliveryAttempts == -1
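
As a minimal sketch of why these values take the removal path, the expression can be evaluated in isolation. The class and method names below are illustrative and are not JBM code. The guarded variant assumes that a non-positive limit means no limit was configured; it is a hypothetical illustration and not necessarily the fix made in the JBM codebase.

```java
public class CancelCheckSketch {

    // Same shape as the expression quoted from cancelDeliveryInternal.
    // Parameter names are illustrative stand-ins for the cancel/rec fields.
    public static boolean reachedMaxDeliveryAttempts(boolean flaggedByCancel,
                                                     int deliveryCount,
                                                     int maxDeliveryAttempts) {
        return flaggedByCancel || deliveryCount >= maxDeliveryAttempts;
    }

    // Hypothetical guard (an assumption, not the actual JBM change):
    // only compare against the limit when the limit is positive.
    public static boolean reachedMaxWithGuard(boolean flaggedByCancel,
                                              int deliveryCount,
                                              int maxDeliveryAttempts) {
        return flaggedByCancel
            || (maxDeliveryAttempts > 0 && deliveryCount >= maxDeliveryAttempts);
    }

    public static void main(String[] args) {
        // Values observed in this report: false, 0, -1.
        // 0 >= -1 holds, so the unguarded check reports the limit as reached.
        System.out.println(reachedMaxDeliveryAttempts(false, 0, -1)); // true
        System.out.println(reachedMaxWithGuard(false, 0, -1));        // false
    }
}
```

With the limit at -1, any delivery count of zero or more satisfies the comparison, so a message that was never delivered is treated as having exhausted its attempts.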



Comment 3 Kevin Conner 2010-11-18 12:37:20 UTC
Link added: This issue depends on JBPAPP-5429


Comment 4 Yong Hao Gao 2010-11-18 14:40:25 UTC
Thanks Kevin. I'm having trouble reproducing it. Here is what I did:

1. start a cluster of two nodes, node0 and node1
2. send a message to node0, but receive it on node1
3. in MessageSucker.onMessage(), I put a 20 sec sleep before the send() call.
4. when the message is sucked from node0 to node1, the onMessage() method is called. During the sleep I shut down node0 (control-c).
5. I observe that the message is still received by the consumer on node1.

Am I missing some step?

Howard


Comment 5 Kevin Conner 2010-11-18 16:11:10 UTC
You need to send multiple messages to the first node so that the deliveries buffer up in the consumer associated with the MessageSucker.  It is the buffered messages which are cancelled and then lost.

Also, use the above byteman script to stall the delivery to the local queue on the second node.
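
The effect of the message count can be sketched with a toy model of the consumer buffer. This is a simplification under stated assumptions, not JBM code: all sent messages are assumed to reach the buffer before shutdown, the first one is assumed to be held inside the stalled onMessage() call, and only the messages still waiting in the buffer are cancelled. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class BufferedCancelSketch {

    // Returns the ids of the messages that would be cancelled back to the
    // originating node if the consumer closes while the first message is
    // still inside the stalled onMessage() call.
    public static List<Integer> cancelledOnClose(int messagesSent) {
        Deque<Integer> buffer = new ArrayDeque<>();
        for (int id = 1; id <= messagesSent; id++) {
            buffer.add(id); // assumption: every message is buffered before shutdown
        }
        buffer.poll();      // the first message is handed to onMessage(), which stalls
        return new ArrayList<>(buffer); // whatever is still buffered gets cancelled
    }

    public static void main(String[] args) {
        System.out.println(cancelledOnClose(1)); // []
        System.out.println(cancelledOnClose(3)); // [2, 3]
    }
}
```

In this model a single message leaves nothing in the buffer to cancel, which is consistent with the first reproduction attempt showing no loss.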

Comment 6 Kevin Conner 2010-11-18 16:48:29 UTC
Ignore the part about byteman; it looks like you have modified the code directly to introduce the delay.  The key is sending multiple messages so that they buffer and must be cancelled.

Comment 7 Yong Hao Gao 2010-11-18 17:01:24 UTC
Thanks Kevin. 

This time I sent 3 messages, but I still didn't reproduce it. I'm not familiar with byteman but I'll try tomorrow. Just to confirm with you:

1. messages are sent to the first node and are then sucked to the second (the one the only consumer connects to).
2. the sleep happens at the second node before the send() call in onMessage()
3. shut down the second node so the first node will cancel the messages.

These are the steps I followed. If they are correct, then I guess this issue does not always happen.

Thanks


Comment 8 Justin Bertram 2010-11-18 17:10:36 UTC
This looks a lot like JBMESSAGING-1774.  Can anyone confirm?

Comment 9 John Graham 2010-11-18 17:20:52 UTC
Assigning to Trevor, since when this is fixed, it will involve a build update of EAP.

Comment 10 Yong Hao Gao 2010-11-19 00:29:49 UTC
hi Justin,

Once again, you saved me. :)

I can pretty much confirm it hits 1774, as I have looked at the code and there is nowhere that maxDeliveryAttempts can be -1 in Branch_1_4 (the JBM dev branch). I had been wondering whether I had missed some hidden code.
Thanks Justin. Can you deliver a patch to Kevin and let him confirm?

:) Eagle eye Justin. 

Howard


Comment 11 Kevin Conner 2010-11-19 09:34:01 UTC
Rejecting this for SOA 5.1 as the fix was made elsewhere in the JBM codebase and I didn't pick up on that.  The fix was done for JBM 1.4.7 GA which is the version we currently use.

