Created attachment 935685 [details] node-1 config

There are two servers in a colocated HA topology. Node-1 and its backup use a local grouping handler. Node-2 and its backup use a remote grouping handler. A consumer is connected to Node-2. A producer sends messages to a queue on Node-1, and the consumer starts receiving them. At this point everything is OK. But when Node-2 is killed and the consumer fails over to the backup, it stops receiving messages. The problem is that messages which the producer still sends are not delivered to the backup server, so the consumer cannot read them.
Created attachment 935686 [details] node-2 config
when you say "Problem is that messages which producer still sends are not delivered to backup server" what do you actually mean?
I mean that messages are not routed to the backup with the REMOTE grouping handler (which is live after failover).
Here is a reproducer:

1. Clone our testsuite from git: git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
2. Run the groovy script PrepareServers.groovy with the -DEAP_VERSION=6.3.0 parameter. The script prepares 4 servers in the directories server1 to server4 under your current working directory.
3. Export the paths to the server directories, e.g.:
   JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
   JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
   JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
   JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
4. Finally, go to jboss-hornetq-testsuite/ in our testsuite and run:
   mvn install -Dtest=ColocatedClusterFailoverTestCase#testGroupingFailoverNodeTwoDown

The test fails in about 70% of runs and passes in the other 30%.
When I run this I get:

Caught: java.lang.IllegalArgumentException: eapZipUrl cannot be empty or null
java.lang.IllegalArgumentException: eapZipUrl cannot be empty or null
	at PrepareServers.prepareServer(PrepareServers.groovy:128)
	at PrepareServers$prepareServer.call(Unknown Source)
	at PrepareServers.main(PrepareServers.groovy:396)

Any chance of a standalone reproducer, similar to what Miro usually provides, that does not use groovy?
I've got to the bottom of what's happening, and this currently works as expected. A grouping is only ever held for another node in the cluster for the lifespan of the target server. When the bridge disconnects, it removes itself from the post office so that it doesn't keep receiving messages; this stops messages becoming marooned if the bridge never reconnects. It's at the point of disconnect that the grouping is also removed, to avoid the same situation. There was an issue, https://issues.jboss.org/browse/HORNETQ-1362, that has been fixed on master so that the bindings are never removed, but this is a major change to the routing functionality and only makes sense because we now also have the ability to scale down from store-and-forward queues in master. I would suggest documenting that groupings are removed when a server disconnects, and marking this to be fixed in a later version of EAP.
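To make the behaviour described above concrete, here is a minimal toy model in Java. It is only a sketch: `ToyPostOffice` and all of its methods are hypothetical names made up for illustration and are not HornetQ's API. It also assumes, for simplicity, that a group with no grouping gets pinned to the local node, which is a simplification of how a handler would really choose.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, NOT HornetQ's API.
class ToyPostOffice {
    private final String localNode;
    // Remote nodes that currently have a routing binding (a connected bridge).
    private final Set<String> remoteBindings = new LinkedHashSet<>();
    // Group id -> node the group is currently pinned to.
    private final Map<String, String> groupings = new HashMap<>();

    ToyPostOffice(String localNode) {
        this.localNode = localNode;
    }

    void bridgeConnected(String node) {
        remoteBindings.add(node);
    }

    // As described in the comment: on disconnect the binding is removed,
    // and every grouping pinned to that node is removed with it.
    void bridgeDisconnected(String node) {
        remoteBindings.remove(node);
        groupings.values().removeIf(node::equals);
    }

    void pinGroup(String groupId, String node) {
        if (!node.equals(localNode) && !remoteBindings.contains(node)) {
            throw new IllegalStateException("no binding for " + node);
        }
        groupings.put(groupId, node);
    }

    // Returns the node that would receive the next message of this group.
    // Assumption: an unpinned group is pinned to the local node.
    String route(String groupId) {
        return groupings.computeIfAbsent(groupId, g -> localNode);
    }

    public static void main(String[] args) {
        ToyPostOffice node1 = new ToyPostOffice("node-1");
        node1.bridgeConnected("node-2");
        node1.pinGroup("group-A", "node-2");
        System.out.println(node1.route("group-A")); // prints node-2
        node1.bridgeDisconnected("node-2");
        System.out.println(node1.route("group-A")); // prints node-1
    }
}
```

In this model, once node-2 disconnects the group is no longer pinned to it, so a consumer that fails over to node-2's backup would not see further messages of that group.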
Hi Andy, is there a way to configure the local and remote grouping handlers so that messages are redistributed to the backup with the remote grouping handler? Thanks, Mirek
Mirek, can you explain more? I'm not sure what you're asking. Andy
If there is a live-backup pair with a REMOTE grouping handler, is there a way to configure it so that a consumer which previously consumed "grouped" messages from this live server will also consume them on the backup after failover?
Nope. Like I said, as soon as the live server goes down the binding is removed.
This is kind of confusing. In the documentation we encourage using a backup for the server with the LOCAL grouping handler because it's a single point of failure. But a crash of a server with a REMOTE grouping handler that has a backup breaks grouping too. Would it be possible to check whether a backup is configured before the binding is removed, and send messages to the backup?
Having a backup makes sense: if the local handler fails, then things carry on as normal. If a remote node fails, there is no way of knowing whether a backup will eventually come up, or how long it will take if it does. During this time messages could still be forwarded to this node's SnF queue, so we need to remove the binding.
Instead of simply removing this binding, what about waiting for group-timeout for the backup to activate, and then replacing the binding with a new one on the backup?
You would still have the problem where messages routed before the timeout end up marooned in the SnF queue if the backup never returns. Also, the actual routing binding is removed, so it wouldn't exist anyway, and routing would fail when trying to locate the binding to route to. As I said above, this would require https://issues.jboss.org/browse/HORNETQ-1362 to be backported, but this is a *major* change to the routing functionality and shouldn't go into a stable release.
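The marooning concern can be illustrated with another small toy model. Again this is only a sketch under stated assumptions: `ToySnfBridge` and its methods are hypothetical and are not HornetQ classes. The model assumes the binding is kept while the target is down, so routed messages pile up in the store-and-forward (SnF) queue until a reconnect that may never happen.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only: hypothetical names, NOT HornetQ's API.
class ToySnfBridge {
    // Messages waiting to be forwarded to the target node.
    private final Queue<String> snfQueue = new ArrayDeque<>();
    // Messages that actually reached the target node.
    private final List<String> deliveredToTarget = new ArrayList<>();
    private boolean connected = true;

    void disconnect() {
        connected = false;
    }

    // On reconnect, everything held in the SnF queue is forwarded.
    void reconnect() {
        connected = true;
        while (!snfQueue.isEmpty()) {
            deliveredToTarget.add(snfQueue.poll());
        }
    }

    // Assumption: the binding is kept, so messages are still routed here
    // while the target is down and are parked in the SnF queue.
    void route(String message) {
        if (connected) {
            deliveredToTarget.add(message);
        } else {
            snfQueue.add(message);
        }
    }

    int marooned() {
        return snfQueue.size();
    }

    int delivered() {
        return deliveredToTarget.size();
    }

    public static void main(String[] args) {
        ToySnfBridge bridge = new ToySnfBridge();
        bridge.route("m1");
        bridge.disconnect();
        bridge.route("m2");
        bridge.route("m3");
        // If the backup never comes up, m2 and m3 stay parked here.
        System.out.println("marooned: " + bridge.marooned()); // prints marooned: 2
    }
}
```

If `reconnect()` is never called, the two parked messages are never consumed, which is the situation that removing the binding on disconnect is meant to avoid.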
OK, I understand. So this BZ should be planned for EAP 7.