project_key: SOA

The message sucker attaches from one node to another, pulling excess messages for local delivery. The client consumer used by the message sucker (presumably like other direct clients) then buffers the messages for delivery at a later point in time. If the originating server shuts down, the second node receives a closing notification and, in response, attempts to cancel delivery of the outstanding messages. The originating server receives the cancel request but erroneously deletes the messages from the database, under the misguided belief that delivery has failed and there is no DLQ, resulting in message loss.
This can be reproduced simply, by creating a clustered queue over two nodes and delaying delivery via the message sucker. The second node should contain the only consumer, with all messages being delivered to the first. The following Byteman script will delay the delivery, allowing the messages to buffer; during the delay, a clean shutdown of the first server will result in the failure ([ServerSessionEndpoint] No DLQ has been specified so the message will be removed):

RULE delay delivery
CLASS org.jboss.messaging.core.impl.clusterconnection.MessageSucker
METHOD onMessage
AT CALL send
IF true
DO Thread.sleep(1000)
ENDRULE
The bug appears to be in ServerSessionEndpoint.cancelDeliveryInternal:

boolean reachedMaxDeliveryAttempts =
    cancel.isReachedMaxDeliveryAttempts() ||
    cancel.getDeliveryCount() >= rec.maxDeliveryAttempts;

In the failing case:

cancel.isReachedMaxDeliveryAttempts() == false
cancel.getDeliveryCount() == 0
rec.maxDeliveryAttempts == -1

Since 0 >= -1, reachedMaxDeliveryAttempts evaluates to true even though no delivery attempt has actually failed.
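The effect of those values can be demonstrated in isolation. The sketch below is hypothetical (the class, method names, and the guarded variant are illustrative, not JBM code); only the boolean expression mirrors the snippet above:

```java
// Minimal, self-contained sketch of the faulty check. Class and method
// names are hypothetical; only the boolean expression mirrors
// ServerSessionEndpoint.cancelDeliveryInternal.
public class CancelCheckDemo {
    // Mirrors: reachedMaxDeliveryAttempts = cancel.isReachedMaxDeliveryAttempts()
    //          || cancel.getDeliveryCount() >= rec.maxDeliveryAttempts;
    static boolean reachedMax(boolean flag, int deliveryCount, int maxDeliveryAttempts) {
        return flag || deliveryCount >= maxDeliveryAttempts;
    }

    // A plausible guard (not necessarily the actual JBM fix): a negative
    // maxDeliveryAttempts is not a real limit, so it must not be compared
    // directly against the delivery count.
    static boolean reachedMaxGuarded(boolean flag, int deliveryCount, int maxDeliveryAttempts) {
        return flag || (maxDeliveryAttempts >= 0 && deliveryCount >= maxDeliveryAttempts);
    }

    public static void main(String[] args) {
        // Failing case from the analysis above: count 0, limit -1.
        System.out.println(reachedMax(false, 0, -1));        // prints "true": 0 >= -1
        System.out.println(reachedMaxGuarded(false, 0, -1)); // prints "false"
    }
}
```

With the unguarded check, the cancelled-but-undelivered messages look as if they have exhausted their delivery attempts, and since no DLQ is configured they are removed rather than returned to the queue.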
Link added: this issue depends on JBPAPP-5429.
Thanks Kevin. I'm having trouble reproducing it. Here is what I did:
1. Start a cluster of two nodes, node0 and node1.
2. Send a message to node0, but receive it on node1.
3. In MessageSucker.onMessage(), I put a 20 sec sleep before the send() call.
4. When the message is sucked from node0 to node1, onMessage() is called. During the sleep I shut down node0 (control-c).
5. I observe the message is still received by the consumer on node1.
Am I missing some step? Howard
You need to send multiple messages to the first node so that the deliveries buffer up in the consumer associated with the MessageSucker. It is the buffered messages which are cancelled and then lost. Also, use the above byteman script to stall the delivery to the local queue on the second node.
Ignore the part about byteman, it looks like you have modified the code directly to introduce the delay. The key is sending multiple messages so that they buffer and must be cancelled.
Thanks Kevin. This time I sent 3 messages, but I still didn't reproduce it. I'm not familiar with Byteman but I'll try tomorrow. Just to confirm with you:
1. Messages are sent to the first node and then sucked to the second (the only node a consumer connects to).
2. The sleep happens at the second node, before the send() call in onMessage().
3. Shut down the second node so the first node will cancel the messages.
These are the steps I did. If the steps are correct, then I guess this issue doesn't always happen. Thanks
This looks a lot like JBMESSAGING-1774. Can anyone confirm?
Assigning to Trevor, since when this is fixed, it will involve a build update of EAP.
hi Justin, Again, for another time, you saved me. :) I can pretty much confirm it hits 1774, as I have looked at the code and maxDeliveryAttempts is nowhere set to -1 in Branch_1_4 (the JBM dev branch). I had been wondering whether I had missed some hidden code. Thanks Justin. Can you deliver a patch to Kevin and let him confirm? :) Eagle eye Justin. Howard
Rejecting this for SOA 5.1, as the fix was made elsewhere in the JBM codebase and I didn't pick up on that. The fix was done for JBM 1.4.7 GA, which is the version we currently use.