Description of problem: This issue is based on a customer's ticket, hit in production. Restarting a server in the cluster leads to:

2013-10-18 15:50:45,460 ERROR [org.hornetq.core.server] (Thread-783 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=15ab993a-37e6-11e3-80d3-296cad79330b-1108351598-792153508)) HQ224016: Caught exception: HornetQException[errorType=QUEUE_DOES_NOT_EXIST message=HQ119016: queue jms.queue.queue/PdpDplReplication-MB.31301201a0151253-2db6-11e3-8f59-db5b5bcca1fe has been removed cannot deliver message, queues should not be removed when grouping is used]
	at org.hornetq.core.postoffice.impl.BindingsImpl.routeUsingStrictOrdering(BindingsImpl.java:504) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.BindingsImpl.route(BindingsImpl.java:278) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:633) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:593) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1590) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.send(ServerSessionImpl.java:1278) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]

The customer is using message grouping and a replicated journal for HA (not sure whether dedicated or colocated topology). In their case a queue (most likely a subscription to a topic) was lost after the server restart.
This has been fixed in https://github.com/hornetq/hornetq/pull/1353 and https://github.com/hornetq/hornetq/pull/1352. There are 2 parts to the fix:
1. On restart (or failover) I have added a check to make sure the cluster has had time to form. If the bindings haven't been added after this configurable timeout, we assume the remote queues haven't been created and no longer exist, and we remove the group-id bindings. The wait time is configurable on the grouping handler via the 'timeout' property.
2. Remote group-id bindings were never removed, so if you used a lot of groups that were each only used for a short period, the bindings journal would grow indefinitely. I've added a reaper that runs on the node with the local grouping handler. This again is configurable on the grouping handler via the following properties:
   a) 'group-timeout': the time in milliseconds that a group id will stay bound to a node (default -1, never expire).
   b) 'reaper-period': how often in milliseconds the reaper thread should run (default 30000).
   c) 'reaper-priority': the thread priority of the reaper thread (default 3).
This will all need documenting.
Actually it's only a) and b); ignore 'reaper-priority' as I removed it.
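For reference, the two remaining properties plus the existing 'timeout' would be set on the grouping handler in the HornetQ configuration. A minimal sketch, assuming a LOCAL grouping handler; the handler name, address, and the numeric values shown are illustrative, not from the ticket:

```xml
<!-- Sketch only: a LOCAL grouping handler with the new reaper properties. -->
<grouping-handler name="my-grouping-handler">
   <type>LOCAL</type>
   <address>jms</address>
   <!-- ms to wait for the cluster to form on restart/failover before
        stale remote group-id bindings are removed -->
   <timeout>5000</timeout>
   <!-- ms that a group id stays bound to a node; -1 (default) = never expire -->
   <group-timeout>600000</group-timeout>
   <!-- ms between runs of the reaper thread (default 30000) -->
   <reaper-period>30000</reaper-period>
</grouping-handler>
```

Note that, per the comments below, the EAP messaging subsystem schema did not yet expose these elements at the time, so this reflects the underlying HornetQ configuration rather than the EAP subsystem XML.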
FYI - this fix will require an update of the messaging subsystem to support these new configuration elements.
For that reason this won't be able to make 6.2.0. I will probably revert this fix if I have to cut a 6.2.0 build; we can get it into 6.3.0. It could be a one-off patch for customers who need this, or maybe we play with the default values?
As mentioned above, an update of the messaging subsystem is still needed to support these new configuration elements. Moving back to ASSIGNED. (Checked in EAP 6.2.0.CR1.)
Miroslav: We decided to use the default values and to use system properties. If you really want the updated schema, you could open a BZ to update the schema, as the fix is working fine here. I don't agree with the FailedQA.
OK, I have nothing against using system properties, but it must be documented. Setting requires_doc_text to ?. After reading the related support case I'm still not able to understand the test scenario. Could you describe it here, please?
Flags: needinfo?(mnovak)
I would like to reproduce the issue and verify the fix manually, but I'm still not able to understand the test scenario. Can you help me with it, please? Thanks, Mirek
To verify:
1. Start the 2 servers and send some messages.
2. Kill the node with the local grouping handler.
3. Kill the other node.
4. Restart the node with the grouping handler.
The bug was that it would try to distribute messages to the node that had disappeared.
Thanks, Andy, for the help. I've managed to reproduce the issue with EAP 6.2.0.ER7. When I tried with EAP 6.2.0.CR1, no message was lost and there was no exception as described in the customer ticket. Nice work! Setting as verified for EAP 6.2.0.CR1.
`Requires doc text` flag cleared as it's too late to include this in JBoss EAP 6.2.0 Release Notes.