Bug 1278341

Summary:	[QE] (6.4.z) Messages are not load balanced in HornetQ cluster
Product:	[JBoss] JBoss Enterprise Application Platform 6	Reporter:	Miroslav Novak <mnovak>
Component:	HornetQ	Assignee:	Dmitrii Tikhomirov <dtikhomi>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Miroslav Novak <mnovak>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	6.4.5	CC:	bmaxwell, cdewolf, csuconic, dtikhomi, jbertram, mnovak, msvehla, okalman, vtunka
Target Milestone:	CR2	Keywords:	Regression
Target Release:	EAP 6.4.5
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-01-17 11:44:35 UTC	Type:	Bug
Bug Blocks:	1235745, 1279528

Description Miroslav Novak 2015-11-05 10:12:28 UTC

Description of problem:

We found a regression in a customer use case (LODH customer) with the EAP 6.4.5.CP.CR1 release. Messages are not balanced across the HornetQ cluster, and a single server of the cluster ends up holding all messages. The regression is related to the change described in BZ#1222900 [1]. That BZ introduced an optimization for message load balancing in a cluster; however, messages are not balanced at all when the customer uses core bridges for transporting messages between two clusters of HornetQ servers.

From the QE point of view, this is a regression in a customer use case and should be fixed.

Test scenario:
    - start 2 EAP 6 servers (servers 1 and 2) with InQueue deployed, in one HornetQ cluster
    - start another 2 EAP 6 servers (servers 3 and 4) with OutQueue deployed, in a different HornetQ cluster
    - set up 2 HornetQ core bridges, deployed on servers 1 and 2
    - the core bridges resend messages from InQueue to OutQueue in the 2nd cluster (1->3, 2->4)
    - start a producer which sends messages to InQueue on server 1 and a consumer which reads messages from OutQueue on server 4
    - while the bridges are processing messages, cleanly shut down server 3 and restart it after a while; the producer keeps running and stays connected to server 1
    - stop the producer and verify that the consumer received all messages on server 4

In EAP 6.4.4.CP the consumer received messages while server 3 was down. With EAP 6.4.5.CP.CR1, once server 3 is shut down, the consumer on server 4 does not get any new messages until server 3 is restarted.
Investigation showed that no messages are load balanced to server 2, so core bridge 2->4 has nothing to send and the consumer starves while server 3 is unavailable.
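
The starvation described above can be illustrated with a toy model. This is a plain Python sketch, not HornetQ code: the function name `simulate` and the `load_balance` flag are made up for illustration. It assumes the producer sends only to server 1, that load balancing (when active) alternates messages between InQueue on servers 1 and 2, and that each bridge forwards only from its local InQueue. Redistribution inside the second cluster is not modeled.

```python
def simulate(messages, load_balance, server3_up):
    """Toy model of the 4-server scenario (hypothetical, not HornetQ logic).

    Returns (delivered_to_server4, stuck_on_server1).
    """
    in_queue = {1: [], 2: []}
    for i, msg in enumerate(messages):
        # Producer is connected to server 1; with load balancing the
        # cluster alternates between server 1 and server 2.
        target = 1 + (i % 2) if load_balance else 1
        in_queue[target].append(msg)

    out_queue = {3: [], 4: []}
    if server3_up:
        # Bridge 1->3 can only forward while its target is available.
        out_queue[3].extend(in_queue[1])
        in_queue[1] = []
    # Bridge 2->4 forwards whatever reached InQueue on server 2.
    out_queue[4].extend(in_queue[2])
    in_queue[2] = []
    return out_queue[4], in_queue[1]


messages = list(range(6))
# Load balancing active (as reported for EAP 6.4.4.CP), server 3 down:
print(simulate(messages, load_balance=True, server3_up=False))
# No load balancing (as reported for EAP 6.4.5.CP.CR1), server 3 down:
print(simulate(messages, load_balance=False, server3_up=False))
```

In the second call nothing reaches server 4 and every message stays on server 1, which matches the symptom reported here.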

Version-Release number of selected component (if applicable):
EAP 6.4.5.CP.CR1 (HornetQ 2.3.25.SP5)

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1222900

Comment 1 Clebert Suconic 2015-11-05 19:09:48 UTC

This issue contradicts BZ#1222900, and I don't think it is an issue at all.

If you always want load balancing, set forward-when-no-consumers=true and messages should be load balanced, at least AFAIK.

Comment 2 Justin Bertram 2015-11-05 20:10:47 UTC

I agree with Clebert. The semantic change from BZ-1222900 now requires that if you want load-balancing you must set forward-when-no-consumers to true. If forward-when-no-consumers is false then no load-balancing takes place, whether or not there are consumers. It appears that LODH's use-case was relying on the non-intuitive behavior that BZ-1222900 changed, so either the configuration for the use-case needs to change or BZ-1222900 needs to be reverted.

To be clear, BZ-1222900 was only applied to the '2.4.x' and 'master' branches at first. However, Tom Ross later back-ported the change to '2.3.x' and '2.3.25.x' which is how it ended up in EAP 6.x.
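
The semantic change described in this comment can be written down as a small decision function. This is only a sketch: `forwards_to_remote` and its parameters are hypothetical names, the "after" branch follows the wording of this comment, and the "before" branch reflects my reading of the HornetQ clustering documentation (forward only to nodes that have consumers), which is an assumption rather than something confirmed from the source code.

```python
def forwards_to_remote(forward_when_no_consumers, remote_has_consumers,
                       after_bz1222900):
    """Return True if a message may be load balanced to a remote node.

    Hypothetical model of the semantics discussed in this bug.
    """
    if forward_when_no_consumers:
        # Round-robin to other nodes regardless of their consumers.
        return True
    if after_bz1222900:
        # Per this comment: with the flag false there is no load
        # balancing at all, whether or not there are consumers.
        return False
    # Assumed earlier behavior: forward only to nodes with consumers.
    return remote_has_consumers


# In the reported scenario the bridge on server 2 acts as a consumer
# on InQueue and the flag is false.
print(forwards_to_remote(False, True, after_bz1222900=False))
print(forwards_to_remote(False, True, after_bz1222900=True))
```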

Comment 5 Miroslav Novak 2015-11-06 11:07:43 UTC

Setting forward-when-no-consumers=true requires a restart and an outage in production.
Another problem is that setting this to true disables redistribution of messages between servers 3 and 4, which means that messages will be stuck on server 3.

Comment 7 Miroslav Novak 2015-11-06 12:42:41 UTC

I've checked that the HornetQ core bridge on server 2 does not trigger redistribution of messages from server 1 to server 2. If it did, no configuration changes or restarts would be needed.

@Justin
Do you think there is a reason why the core bridge does not trigger redistribution?

Comment 8 Justin Bertram 2015-11-06 14:26:15 UTC

Redistribution only occurs when there are no local consumers on the queue.  Server 1 has a consumer on InQueue (i.e. the bridge) so no messages will ever be redistributed from Server 1 to Server 2 (and vice-versa).
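
As a rough illustration of the rule above, the sketch below lists the servers from which redistribution could happen at all. The function name and the dictionary layout are hypothetical; the only rule encoded is the one stated in this comment, that a queue with a local consumer (a bridge counts) is never a redistribution source.

```python
def redistribution_sources(consumers_per_server):
    """Return the servers whose queue has no local consumers.

    Hypothetical helper: under the rule in this comment these are the
    only servers messages could be redistributed away from.
    """
    return sorted(server
                  for server, consumers in consumers_per_server.items()
                  if not consumers)


# Original scenario: both InQueue instances have a bridge as consumer.
print(redistribution_sources({1: ["bridge 1->3"], 2: ["bridge 2->4"]}))
# Without the bridge on server 1, server 1 becomes a possible source.
print(redistribution_sources({1: [], 2: ["bridge 2->4"]}))
```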

Comment 9 Justin Bertram 2015-11-06 15:03:21 UTC

Re: Setting forward-when-no-consumers=true requires restart and outage in production.

Won't updating the version of EAP also require a restart and outage in production?


Re: Other problem is that setting this to true disables redistribution of messages between servers 3 and 4. Which means that messages will be stuck on server 3.

According to your original description of the environment both Server 3 and Server 4 have local consumers which means there would never be any redistribution anyway.


Either way you look at it we have a tough choice to make. Either we force existing customers to change their configuration or we revert the change from BZ#1222900 in which case a customer won't have the functionality they are looking for. Personally I think that BZ#1222900 should have been treated more like a feature request than a bug because of the semantic changes.

To be clear, we removed the 'forward-when-no-consumers' configuration element in Artemis because it has confused users for so long. We now have the 'message-load-balancing' configuration element with 3 choices rather than a boolean.
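
For readers comparing the two settings, here is a small sketch of how the boolean relates to the three-valued option. The value names OFF, STRICT and ON_DEMAND are the ones I recall from the Artemis clustering documentation; the mapping function is a hypothetical helper based on that documentation's description of STRICT and ON_DEMAND as the counterparts of true and false, so treat it as an assumption.

```python
from enum import Enum


class MessageLoadBalancing(Enum):
    """Three choices of the Artemis 'message-load-balancing' element."""
    OFF = "OFF"              # never forward messages to other nodes
    STRICT = "STRICT"        # forward even if other nodes have no consumers
    ON_DEMAND = "ON_DEMAND"  # forward only to nodes that have consumers


def from_forward_when_no_consumers(flag):
    """Map the legacy boolean to its assumed Artemis counterpart."""
    if flag:
        return MessageLoadBalancing.STRICT
    return MessageLoadBalancing.ON_DEMAND


print(from_forward_when_no_consumers(True).value)
print(from_forward_when_no_consumers(False).value)
```

Note that the boolean has no counterpart for OFF, which is part of why a three-valued setting is easier to reason about.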

Comment 10 Miroslav Novak 2015-11-09 14:01:34 UTC

> Redistribution only occurs when there are no local consumers on the queue. 
> Server 1 has a consumer on InQueue (i.e. the bridge) so no messages will
> ever be redistributed from Server 1 to Server 2 (and vice-versa).

I've verified that the core bridge does not trigger message redistribution by modifying the test scenario in the following way:
 - I've undeployed the core bridge from server 1.
 - Started servers 1, 2 and 4 (server 3 is not started for the whole duration of the test).
 - The core bridge is configured only on server 2 and sends messages from InQueue on server 2 to OutQueue on server 4.
 - Started a producer which sends messages to InQueue on server 1 and a consumer which receives messages from OutQueue on server 4.

Result: The consumer did not receive any message from server 4. All messages stayed in InQueue on server 1, which means that the core bridge on server 2 did not trigger message redistribution.

> According to your original description of the environment both Server 3 and
> Server 4 have local consumers which means there would never be any
> redistribution anyway.

In our scenario there is no consumer on server 3. This scenario was based on feedback from the customer.

> Won't updating the version of EAP also require a restart and outage in
> production?

Modifying the configuration is not part of a CP update.

> Either way you look at it we have a tough choice to make. Either we force
> existing customers to change their configuration or we revert the change
> from BZ#1222900 in which case a customer won't have the functionality they
> are looking for. Personally I think that BZ#1222900 should have been treated
> more like a feature request than a bug because of the semantic changes.

Message redistribution should work for core bridges.

Comment 11 Justin Bertram 2015-11-09 15:21:00 UTC

Did you adjust the redistribution-delay to be >= 0?

Comment 12 Miroslav Novak 2015-11-09 15:58:18 UTC

redistribution-delay is set to 0. If a consumer tries to receive messages from InQueue on server 2, messages are redistributed to it from server 1.
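
The observations so far can be summarized in a small sketch. It only records what was reported in this thread and is not HornetQ logic: the function name and the "client" / "bridge" labels are hypothetical, and the treatment of a negative redistribution-delay as "disabled" is my understanding of the documented default of -1.

```python
def redistribution_observed(redistribution_delay_ms, local_consumers,
                            remote_consumer_kind):
    """Summarize when redistribution was seen in this bug's tests.

    remote_consumer_kind is a made-up label: "client" for a regular
    consumer on the remote node, "bridge" for a core bridge.
    """
    if redistribution_delay_ms < 0:
        # Assumed: a negative delay means redistribution is disabled.
        return False
    if local_consumers > 0:
        # Messages stay put while the local queue has consumers.
        return False
    # Reported here: a regular consumer triggers redistribution,
    # a core bridge on the remote node does not.
    return remote_consumer_kind == "client"


print(redistribution_observed(0, 0, "client"))
print(redistribution_observed(0, 0, "bridge"))
```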

Comment 14 Clebert Suconic 2015-11-09 22:36:43 UTC

Revert this commit as requested on the SP6 tag:

https://github.com/hornetq/hornetq/commits/HornetQ_2_3_25_SP6

Comment 17 Miroslav Novak 2015-11-13 09:22:10 UTC

Verified in EAP 6.4.5.CP.CR2.

Comment 18 Petr Penicka 2017-01-17 11:44:35 UTC

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.