Bug 1023325 - Lost messages after fail in cluster with message grouping
Summary: Lost messages after fail in cluster with message grouping
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: CR1
Target Release: EAP 6.2.0
Assignee: Andy Taylor
QA Contact: Miroslav Novak
Docs Contact: Russell Dickenson
URL: https://c.na7.visual.force.com/apex/C...
Whiteboard:
Depends On: 1026553
Blocks:
 
Reported: 2013-10-25 08:17 UTC by Miroslav Novak
Modified: 2013-12-15 16:14 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In a clustered HornetQ environment, a failure could occur in which the node with the local grouping handler went offline before other nodes. When the node with the local grouping handler came back online, it could attempt to distribute messages to nodes that were still starting up or were offline.
Consequence: This situation could result in lost messages.
Fix:
* On restart or failover, new checks have been added to make sure the cluster has had time to initialize. The wait time before bindings are removed is configurable on the grouping handler via the 'timeout' property.
* There is now a 'reaper' which runs on the local grouping handler node and periodically cleans out the bindings journal. It is configurable on the local grouping handler with the following properties:
** `group-timeout`: the time in milliseconds that a group ID will be bound to a node (default -1, never).
** `reaper-period`: how often in milliseconds the reaper thread should run (default 30000).
Result: These fixes ensure that messages will not be lost following a failure involving the local grouping handler in a clustered HornetQ environment.
Clone Of:
Environment:
Last Closed: 2013-12-15 16:14:50 UTC
Type: Bug
Embargoed:
rdickens: needinfo-



Description Miroslav Novak 2013-10-25 08:17:30 UTC
Description of problem:

This issue is based on a customer ticket, hit in production.

Restarting a server in the cluster leads to:

2013-10-18 15:50:45,460 ERROR [org.hornetq.core.server] (Thread-783 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=15ab993a-37e6-11e3-80d3-296cad79330b-1108351598-792153508)) HQ224016: Caught exception: HornetQException[errorType=QUEUE_DOES_NOT_EXIST message=HQ119016: queue jms.queue.queue/PdpDplReplication-MB.31301201a0151253-2db6-11e3-8f59-db5b5bcca1fe has been removed cannot deliver message, queues should not be removed when grouping is used]
	at org.hornetq.core.postoffice.impl.BindingsImpl.routeUsingStrictOrdering(BindingsImpl.java:504) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.BindingsImpl.route(BindingsImpl.java:278) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:633) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:593) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1590) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.send(ServerSessionImpl.java:1278) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]

The customer is using message grouping and a replicated journal for HA (not sure whether dedicated or colocated topology). In their case a queue (most likely a topic subscription) was lost after the server restart.
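For context, message grouping pins every message that shares a JMSXGroupID value to a single consumer, and the grouping handler decides which cluster node owns each group. A minimal JMS producer sketch follows; the JNDI lookup names and message counts are placeholder assumptions, not taken from the customer's setup:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class GroupedProducer {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext(); // JNDI provider settings assumed in jndi.properties
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // placeholder name
        Queue queue = (Queue) ctx.lookup("jms/queue/testQueue"); // placeholder name
        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            for (int i = 0; i < 100; i++) {
                TextMessage message = session.createTextMessage("message " + i);
                // All messages carrying the same JMSXGroupID are routed to a single consumer;
                // in a cluster, the grouping handler records which node owns the group.
                message.setStringProperty("JMSXGroupID", "group-" + (i % 10));
                producer.send(message);
            }
        } finally {
            connection.close();
        }
    }
}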

Comment 1 Andy Taylor 2013-10-29 11:24:18 UTC
This has been fixed in https://github.com/hornetq/hornetq/pull/1353 and https://github.com/hornetq/hornetq/pull/1352. There are two parts to the fix.

1. On restart (or failover) I have added a check to make sure the cluster has had time to form. If, after this configurable timeout, the bindings haven't been added, then we assume the remote queues haven't been created and no longer exist, and we remove the group ID bindings. The wait time is configurable on the grouping handler via the 'timeout' property.

2. Remote group ID bindings were never removed, so if you used a lot of groups that were only used for a short period, the bindings journal would grow indefinitely. I've added a reaper that runs on the local grouping handler node; this again is configurable on the grouping handler via the following properties:

a) 'group-timeout': the time in milliseconds that a group ID will be bound to a node (default -1, never).
b) 'reaper-period': how often in milliseconds the reaper thread should run (default 30000).
c) 'reaper-priority': the thread priority of the reaper thread (default 3).

This will all need documenting; a configuration sketch follows below.
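For orientation, here is a minimal embedded-broker sketch of where these properties live, using the HornetQ core API. The six-argument GroupingHandlerConfiguration constructor taking the new group-timeout and reaper-period values is an assumption based on the description above, and the handler name, address, and values are illustrative; in EAP the grouping handler is normally declared in the messaging subsystem XML, and clustering configuration is omitted here for brevity:

import org.hornetq.api.core.SimpleString;
import org.hornetq.core.config.impl.ConfigurationImpl;
import org.hornetq.core.server.HornetQServer;
import org.hornetq.core.server.HornetQServers;
import org.hornetq.core.server.group.impl.GroupingHandlerConfiguration;

public class GroupingHandlerSketch {
    public static void main(String[] args) throws Exception {
        ConfigurationImpl config = new ConfigurationImpl();
        config.setPersistenceEnabled(false); // demo only; real setups persist the bindings journal
        config.setSecurityEnabled(false);
        // LOCAL grouping handler for this node. The constructor overload with
        // group-timeout and reaper-period is assumed from the fix description.
        config.setGroupingHandlerConfiguration(new GroupingHandlerConfiguration(
                new SimpleString("my-grouping-handler"),
                GroupingHandlerConfiguration.TYPE.LOCAL,
                new SimpleString("jms"),
                5000,     // timeout: ms to wait for the cluster to form before removing bindings
                -1,       // group-timeout: ms a group ID stays bound to a node (-1 = never)
                30000));  // reaper-period: ms between reaper runs
        HornetQServer server = HornetQServers.newHornetQServer(config);
        server.start();
    }
}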

Comment 2 Andy Taylor 2013-10-29 11:57:05 UTC
Actually it's only a) and b); ignore 'reaper-priority' as I removed it.

Comment 3 Justin Bertram 2013-10-29 19:24:34 UTC
FYI - this fix will require an update of the messaging subsystem to support these new configuration elements.

Comment 4 Clebert Suconic 2013-10-29 20:26:23 UTC
For that reason this won't be able to make 6.2.0.

I will probably revert this fix if I have to make a 6.2.0... we can make one in 6.3.0... it could be a one-off for customers who need this, or maybe play with default values?

Comment 7 Miroslav Novak 2013-11-11 14:53:54 UTC
As mentioned above, there is still a need to update the messaging subsystem to support these new configuration elements. Moving back to ASSIGNED. (Checked in EAP 6.2.0.CR1.)

Comment 8 Clebert Suconic 2013-11-11 15:16:45 UTC
Miroslav: We decided to use default values, and use system properties.


If you really want the updated schema, you could open a BZ to update the schema as the fix is working fine here.


I don't agree with the FailedQA.

Comment 9 Miroslav Novak 2013-11-11 17:10:42 UTC
Ok, I have nothing against using system properties, but it must be documented. Setting requires_doc_text to '?'. After reading the related support case I'm still not able to understand the test scenario. Could you describe it here, please?

Comment 10 Andy Taylor 2013-11-12 15:28:01 UTC
Flags: needinfo?(mnovak)

Comment 11 Miroslav Novak 2013-11-12 16:04:36 UTC
I would like to reproduce the issue and verify the fix manually, but I'm still not able to understand the test scenario. Can you help me with it, please?

Thanks,
Mirek

Comment 12 Andy Taylor 2013-11-13 08:01:27 UTC
To verify:
1. Start the two servers and send some messages.
2. Kill the node with the local grouping handler.
3. Kill the other node.
4. Restart the node with the grouping handler.

The bug was that it would try to distribute messages to the node that had disappeared; a client-side check is sketched below.
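A sketch of the check after step 4, reusing the placeholder JNDI names from the producer sketch above: after restarting the grouping-handler node, drain the queue and compare the received count with what was sent before the kills.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

public class LostMessageCheck {
    public static void main(String[] args) throws Exception {
        int expected = Integer.parseInt(args[0]); // number of messages sent before the kills
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // placeholder name
        Queue queue = (Queue) ctx.lookup("jms/queue/testQueue"); // placeholder name
        Connection connection = cf.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(queue);
            int received = 0;
            // Drain until the queue stays empty for five seconds.
            while (consumer.receive(5000) != null) {
                received++;
            }
            System.out.println("expected=" + expected + " received=" + received
                    + (received == expected ? " -> OK" : " -> MESSAGES LOST"));
        } finally {
            connection.close();
        }
    }
}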

Comment 13 Miroslav Novak 2013-11-13 12:17:33 UTC
Thanks, Andy, for the help. I've managed to reproduce the issue with EAP 6.2.0.ER7. When I tried with EAP 6.2.0.CR1, no message was lost and there was no exception as described in the customer ticket. Nice work!

Setting as verified for EAP 6.2.0.CR1.

Comment 15 Russell Dickenson 2013-12-03 13:42:44 UTC
`Requires doc text` flag cleared as it's too late to include this in JBoss EAP 6.2.0 Release Notes.

