1139690 – Message grouping with failover on consumer does not work,

Summary: Message grouping with failover on consumer does not work,

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	HornetQ
Sub Component:
Version:	6.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	EAP 6.4.0
Assignee:	Clebert Suconic
QA Contact:	Miroslav Novak
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-09-09 13:02 UTC by Ondřej Kalman
Modified:	2019-08-19 12:48 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-08-19 12:48:25 UTC
Type:	Bug
Embargoed:

Attachments
node-1 config (33.45 KB, text/plain) 2014-09-09 13:02 UTC, Ondřej Kalman
node-2 config (33.65 KB, text/plain) 2014-09-09 13:02 UTC, Ondřej Kalman

Description Ondřej Kalman 2014-09-09 13:02:14 UTC

Created attachment 935685
node-1 config

There are two servers in a colocated HA topology. Node-1 and its backup are configured with the local grouping handler. Node-2 and its backup are configured with the remote grouping handler.
A consumer is connected to Node-2. A producer sends messages to a queue on Node-1.
The consumer starts receiving these messages, and at this point everything is OK. But when Node-2 is killed and the consumer fails over to the backup, it stops receiving messages. The problem is that messages which the producer still sends are not delivered to the backup server, so the consumer cannot read them.

Comment 1 Ondřej Kalman 2014-09-09 13:02:52 UTC

Created attachment 935686
node-2 config

Comment 2 Andy Taylor 2014-09-09 14:01:43 UTC

When you say "Problem is that messages which producer still sends are not delivered to backup server", what do you actually mean?

Comment 3 Ondřej Kalman 2014-09-10 06:12:14 UTC

I mean that messages are not routed to the remote backup (which is live after failover).

Comment 4 Ondřej Kalman 2014-09-10 07:32:14 UTC

Here is a reproducer.
Clone our testsuite from git:
git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git

Run the Groovy script PrepareServers.groovy with the -DEAP_VERSION=6.3.0 parameter.
The script will prepare 4 servers in the directories server1 to server4 under your current working directory.

Then export the paths to the server directories, e.g.:
JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap

Finally, go to jboss-hornetq-testsuite/ in our testsuite and run
mvn install -Dtest=ColocatedClusterFailoverTestCase#testGroupingFailoverNodeTwoDown

The test fails in 70% of runs and passes in the remaining 30%.

Comment 5 Andy Taylor 2014-09-10 11:18:02 UTC

When I run this I get:
Caught: java.lang.IllegalArgumentException: eapZipUrl cannot be empty or null
java.lang.IllegalArgumentException: eapZipUrl cannot be empty or null
	at PrepareServers.prepareServer(PrepareServers.groovy:128)
	at PrepareServers$prepareServer.call(Unknown Source)
	at PrepareServers.main(PrepareServers.groovy:396)


Any chance of a standalone reproducer, similar to what Miro usually provides, that also doesn't use Groovy?

Comment 8 Andy Taylor 2014-09-17 13:16:08 UTC

I've got to the bottom of what's happening, and currently this works as expected. A grouping is only ever held for another node in the cluster for the life span of the target server. When the bridge disconnects, it removes itself from the post office so that it doesn't keep receiving messages; this is to stop messages becoming marooned if the bridge never reconnects. It's at the point of disconnect that the grouping is also removed, to avoid the same situation.

There was an issue, https://issues.jboss.org/browse/HORNETQ-1362, that has been fixed on master so that the bindings are never removed, but this is a major change to the routing functionality and only makes sense because we now also have the ability to scale down from store-and-forward queues in master.

I would suggest documenting that groupings are removed when a server disconnects, and marking this as to be fixed in a later version of EAP.
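
For readers following along, the behaviour described in this comment can be illustrated with a small toy model in plain Java. This is only a sketch under stated assumptions, not HornetQ code: the class GroupingSketch and its methods (bridgeConnected, bridgeDisconnected, route) are hypothetical names, and the "post office" is reduced to a set of node names.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Toy model of group routing on the node with the local handler. Hypothetical, not HornetQ code. */
final class GroupingSketch {
    // Remote nodes that currently have a binding in this node's (simplified) post office.
    private final Set<String> bindings = new LinkedHashSet<>();
    // Group id -> node the group is currently pinned to.
    private final Map<String, String> groupToNode = new HashMap<>();

    void bridgeConnected(String node) {
        bindings.add(node);
    }

    /** On disconnect, drop the binding and every grouping pinned to that node. */
    void bridgeDisconnected(String node) {
        bindings.remove(node);
        groupToNode.values().removeIf(node::equals);
    }

    /** Returns the node a grouped message is routed to, or empty if no binding exists. */
    Optional<String> route(String groupId) {
        String pinned = groupToNode.get(groupId);
        if (pinned != null) {
            return Optional.of(pinned);
        }
        // No pin yet: choose the first available binding and remember the choice.
        Optional<String> chosen = bindings.stream().findFirst();
        chosen.ifPresent(node -> groupToNode.put(groupId, node));
        return chosen;
    }

    public static void main(String[] args) {
        GroupingSketch node1 = new GroupingSketch();
        node1.bridgeConnected("node-2");
        System.out.println("before kill: " + node1.route("group-A"));
        node1.bridgeDisconnected("node-2"); // node-2 is killed
        System.out.println("after kill:  " + node1.route("group-A"));
    }
}
```

In this model the group is pinned to node-2 only while its binding exists. Once the bridge disconnects, both the binding and the pin are gone, so the group has no remote target until some bridge connects again.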

Comment 9 Miroslav Novak 2014-09-23 08:43:03 UTC

Hi Andy,

Is there a way to configure the local and remote grouping handlers so that messages are redistributed to the backup with the remote grouping handler?

Thanks,
Mirek

Comment 10 Andy Taylor 2014-09-23 08:59:46 UTC

Mirek,

Can you explain more? I'm not sure what you're asking.

Andy

Comment 11 Miroslav Novak 2014-09-23 09:08:33 UTC

If there is a live-backup pair with a REMOTE grouping handler, is there a way to configure it so that a consumer which previously consumed "grouped" messages from this live server will also consume them on the backup after failover?

Comment 12 Andy Taylor 2014-09-23 09:19:19 UTC

Nope. Like I said, as soon as the live server goes down the binding is removed.

Comment 13 Miroslav Novak 2014-09-23 09:57:20 UTC

This is rather confusing. In the documentation we encourage using a backup for the server with the LOCAL grouping handler because it's a single point of failure. But a crash of a server with the REMOTE grouping handler that has a backup breaks this too.

Would it be possible to check whether a backup is configured before the binding is removed, and send messages to the backup?

Comment 14 Andy Taylor 2014-09-23 10:20:20 UTC

Having a backup makes sense: if the local handler fails then things carry on as normal.

If a remote node fails then there is no way of knowing whether a backup will eventually come up, or how long it will take if it does. During this time messages can still be forwarded to this node (into the SnF queue), so we need to remove the binding.

Comment 15 Miroslav Novak 2014-09-24 07:14:07 UTC

What about, instead of simply removing this binding, waiting for group-timeout for the backup to activate and then replacing the binding with a new one on the backup?

Comment 16 Andy Taylor 2014-09-24 08:01:22 UTC

You would still have the problem where messages routed before the timeout end up marooned in the SnF queue if the backup never returns.

Also, the actual routing binding is removed, so it wouldn't exist anyway, and routing would fail when trying to locate the binding to route to. As I said above, this would require https://issues.jboss.org/browse/HORNETQ-1362 to be backported, but this is a *major* change to the routing functionality and shouldn't go into a stable release.
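
The marooning concern can be sketched with another toy model in plain Java. It assumes, hypothetically, that the binding were kept during the group-timeout window, so routed messages keep landing in the store-and-forward queue. The names SnfQueueSketch, connect, disconnect and route are made up for illustration and do not correspond to HornetQ classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of a store-and-forward (SnF) queue feeding a bridge. Hypothetical, not HornetQ code. */
final class SnfQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final List<String> delivered = new ArrayList<>();
    private boolean connected;

    void connect() {
        connected = true;
        flush();
    }

    void disconnect() {
        connected = false;
    }

    /** Routing always lands in the SnF queue; forwarding only happens while connected. */
    void route(String message) {
        pending.add(message);
        flush();
    }

    private void flush() {
        while (connected && !pending.isEmpty()) {
            delivered.add(pending.poll());
        }
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        SnfQueueSketch snf = new SnfQueueSketch();
        snf.connect();
        snf.route("m1");      // forwarded to the live target
        snf.disconnect();     // target node is killed
        snf.route("m2");      // routed during the assumed timeout window
        snf.route("m3");
        // If the backup never connects, m2 and m3 stay in the SnF queue.
        System.out.println("delivered=" + snf.delivered() + " pending=" + snf.pendingCount());
    }
}
```

If connect() is never called again, nothing ever drains the pending messages. That is the situation the binding removal on disconnect is meant to avoid.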

Comment 17 Miroslav Novak 2014-09-24 11:32:04 UTC

OK, I understand. So this BZ should be planned for EAP 7.
