Bug 1023325 - Lost messages after fail in cluster with message grouping
Summary: Lost messages after fail in cluster with message grouping
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: CR1
Target Release: EAP 6.2.0
Assignee: Andy Taylor
QA Contact: Miroslav Novak
Docs Contact: Russell Dickenson
URL: https://c.na7.visual.force.com/apex/C...
Whiteboard:
Depends On: 1026553
Blocks:
 
Reported: 2013-10-25 08:17 UTC by Miroslav Novak
Modified: 2013-12-15 16:14 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In a clustered HornetQ environment, a failure could occur in which the node with the local grouping handler went offline before other nodes. When the node with the local grouping handler came back online, it could attempt to distribute messages to nodes that were still starting up or were offline.
Consequence: This situation could result in lost messages.
Fix:
* On restart or failover, new checks have been added to make sure the cluster has had time to initialize. The wait time before bindings are removed is configurable on the grouping handler via the 'timeout' property.
* There is now a 'reaper' which runs on the local grouping handler node and periodically cleans out the bindings journal. It is configurable on the local grouping handler with the following properties:
** `group-timeout`: the time in milliseconds that a group ID will be bound to a node (default -1, never).
** `reaper-period`: how often in milliseconds the reaper thread should run (default 30000).
Result: These fixes ensure that messages will not be lost following a failure involving the local grouping handler in a clustered HornetQ environment.
Clone Of:
Environment:
Last Closed: 2013-12-15 16:14:50 UTC
Type: Bug
Embargoed:
rdickens: needinfo-



Description Miroslav Novak 2013-10-25 08:17:30 UTC
Description of problem:

This issue is based on a customer ticket, hit in production.

Restarting a server in the cluster leads to:

2013-10-18 15:50:45,460 ERROR [org.hornetq.core.server] (Thread-783 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=15ab993a-37e6-11e3-80d3-296cad79330b-1108351598-792153508)) HQ224016: Caught exception: HornetQException[errorType=QUEUE_DOES_NOT_EXIST message=HQ119016: queue jms.queue.queue/PdpDplReplication-MB.31301201a0151253-2db6-11e3-8f59-db5b5bcca1fe has been removed cannot deliver message, queues should not be removed when grouping is used]
	at org.hornetq.core.postoffice.impl.BindingsImpl.routeUsingStrictOrdering(BindingsImpl.java:504) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.BindingsImpl.route(BindingsImpl.java:278) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:633) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:593) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1590) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.send(ServerSessionImpl.java:1278) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]

The customer is using message grouping and a replicated journal for HA (not sure whether dedicated or colocated topology). In their case a queue (most likely a topic subscription) was lost after the server restart.
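For context, message grouping pins every message that shares a JMSXGroupID value to a single consumer, and the grouping handler decides which cluster node owns each group. A minimal JMS producer sketch follows; the JNDI lookup names and message counts are placeholder assumptions, not taken from the customer's setup:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class GroupedProducer {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext(); // JNDI provider settings assumed in jndi.properties
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // placeholder name
        Queue queue = (Queue) ctx.lookup("jms/queue/testQueue"); // placeholder name
        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            for (int i = 0; i < 100; i++) {
                TextMessage message = session.createTextMessage("message " + i);
                // All messages carrying the same JMSXGroupID are routed to a single consumer;
                // in a cluster, the grouping handler records which node owns the group.
                message.setStringProperty("JMSXGroupID", "group-" + (i % 10));
                producer.send(message);
            }
        } finally {
            connection.close();
        }
    }
}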

Comment 1 Andy Taylor 2013-10-29 11:24:18 UTC
This has been fixed in https://github.com/hornetq/hornetq/pull/1353 and https://github.com/hornetq/hornetq/pull/1352. There are two parts to the fix.

1. On restart (or failover) I have added a check to make sure the cluster has had time to form. If, after this configurable timeout, the bindings haven't been added, then we assume the remote queues haven't been created and no longer exist, and we remove the group ID bindings. The wait time is configurable on the grouping handler via the 'timeout' property.

2. Remote group ID bindings were never removed, so if you used a lot of groups that were only used for a short period, the bindings journal would grow indefinitely. I've added a reaper that runs on the local grouping handler node; this again is configurable on the grouping handler via the following properties:

a) 'group-timeout': the time in milliseconds that a group ID will be bound to a node (default -1, never).
b) 'reaper-period': how often in milliseconds the reaper thread should run (default 30000).
c) 'reaper-priority': the thread priority of the reaper thread (default 3).

This will all need documenting; a configuration sketch follows below.
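For orientation, here is a minimal embedded-broker sketch of where these properties live, using the HornetQ core API. The six-argument GroupingHandlerConfiguration constructor taking the new group-timeout and reaper-period values is an assumption based on the description above, and the handler name, address, and values are illustrative; in EAP the grouping handler is normally declared in the messaging subsystem XML, and clustering configuration is omitted here for brevity:

import org.hornetq.api.core.SimpleString;
import org.hornetq.core.config.impl.ConfigurationImpl;
import org.hornetq.core.server.HornetQServer;
import org.hornetq.core.server.HornetQServers;
import org.hornetq.core.server.group.impl.GroupingHandlerConfiguration;

public class GroupingHandlerSketch {
    public static void main(String[] args) throws Exception {
        ConfigurationImpl config = new ConfigurationImpl();
        config.setPersistenceEnabled(false); // demo only; real setups persist the bindings journal
        config.setSecurityEnabled(false);
        // LOCAL grouping handler for this node. The constructor overload with
        // group-timeout and reaper-period is assumed from the fix description.
        config.setGroupingHandlerConfiguration(new GroupingHandlerConfiguration(
                new SimpleString("my-grouping-handler"),
                GroupingHandlerConfiguration.TYPE.LOCAL,
                new SimpleString("jms"),
                5000,     // timeout: ms to wait for the cluster to form before removing bindings
                -1,       // group-timeout: ms a group ID stays bound to a node (-1 = never)
                30000));  // reaper-period: ms between reaper runs
        HornetQServer server = HornetQServers.newHornetQServer(config);
        server.start();
    }
}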

Comment 2 Andy Taylor 2013-10-29 11:57:05 UTC
Actually it's only a) and b); ignore 'reaper-priority' as I removed it.

Comment 3 Justin Bertram 2013-10-29 19:24:34 UTC
FYI - this fix will require an update of the messaging subsystem to support these new configuration elements.

Comment 4 Clebert Suconic 2013-10-29 20:26:23 UTC
For that reason this won't be able to make 6.2.0.

I will probably revert this fix if I have to make a 6.2.0... we can make one in 6.3.0... it could be a one-off for customers who need this, or maybe play with default values?

Comment 7 Miroslav Novak 2013-11-11 14:53:54 UTC
As mentioned above, there is still a need to update the messaging subsystem to support these new configuration elements. Moving back to ASSIGNED. (Checked in EAP 6.2.0.CR1.)

Comment 8 Clebert Suconic 2013-11-11 15:16:45 UTC
Miroslav: We decided to use default values, and use system properties.


If you really want the updated schema, you could open a BZ to update the schema as the fix is working fine here.


I don't agree with the FailedQA.

Comment 9 Miroslav Novak 2013-11-11 17:10:42 UTC
Ok, I have nothing against using system properties, but it must be documented. Setting requires_doc_text to '?'. After reading the related support case I'm still not able to understand the test scenario. Could you describe it here, please?

Comment 10 Andy Taylor 2013-11-12 15:28:01 UTC
Flags: needinfo?(mnovak)

Comment 11 Miroslav Novak 2013-11-12 16:04:36 UTC
I would like to reproduce the issue and verify the fix manually, but I'm still not able to understand the test scenario. Can you help me with it, please?

Thanks,
Mirek

Comment 12 Andy Taylor 2013-11-13 08:01:27 UTC
To verify:
1. Start the two servers and send some messages.
2. Kill the node with the local grouping handler.
3. Kill the other node.
4. Restart the node with the grouping handler.

The bug was that it would try to distribute messages to the node that had disappeared; a client-side check is sketched below.
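A sketch of the check after step 4, reusing the placeholder JNDI names from the producer sketch above: after restarting the grouping-handler node, drain the queue and compare the received count with what was sent before the kills.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

public class LostMessageCheck {
    public static void main(String[] args) throws Exception {
        int expected = Integer.parseInt(args[0]); // number of messages sent before the kills
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // placeholder name
        Queue queue = (Queue) ctx.lookup("jms/queue/testQueue"); // placeholder name
        Connection connection = cf.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(queue);
            int received = 0;
            // Drain until the queue stays empty for five seconds.
            while (consumer.receive(5000) != null) {
                received++;
            }
            System.out.println("expected=" + expected + " received=" + received
                    + (received == expected ? " -> OK" : " -> MESSAGES LOST"));
        } finally {
            connection.close();
        }
    }
}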

Comment 13 Miroslav Novak 2013-11-13 12:17:33 UTC
Thanks, Andy, for the help. I've managed to reproduce the issue with EAP 6.2.0.ER7. When I tried with EAP 6.2.0.CR1, no message was lost and there was no exception as described in the customer ticket. Nice work!

Setting as verified for EAP 6.2.0.CR1.

Comment 15 Russell Dickenson 2013-12-03 13:42:44 UTC
`Requires doc text` flag cleared as it's too late to include this in JBoss EAP 6.2.0 Release Notes.

