Bug 1033943

Summary: RFE: Allow setting of DCAwareRoundRobinPolicy for supporting multiple datacenters
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Storage Node
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.9
CC: jsanda
Target Milestone: ---   
Target Release: RHQ 4.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-23 12:31:54 UTC
Type: Bug
Attachments:
Description                   Flags
Patch to master               none
Patch to master c39bcd8571    none

Description Elias Ross 2013-11-24 15:21:38 UTC
Description of problem:

The default load balancing policy is RoundRobinPolicy, which does not work well when using multiple datacenters.

This is a request to support DCAwareRoundRobinPolicy, selected by setting the following system property:

+    private static final String DC_PROP = "rhq.storage.dc";

Version-Release number of selected component (if applicable): 4.9

Comment 1 Elias Ross 2013-12-02 19:17:51 UTC
Created attachment 831749 [details]
Patch to master

Applied to master branch.

I've tested this on RHQ 4.9 only. I'm not certain whether it works with the settings changes that are in master.

Another enhancement that may make sense is specifying the policy class instead. For example, there is this policy, which may work better as a default:

http://www.datastax.com/drivers/java/1.0/apidocs/com/datastax/driver/core/policies/LatencyAwarePolicy.html

(I'm not sure which version this is in, but it is not part of the driver RHQ uses.)

Comment 2 Elias Ross 2013-12-02 19:20:16 UTC
Created attachment 831750 [details]
Patch to master c39bcd8571

Comment 3 John Sanda 2014-01-09 19:51:37 UTC
Elias, what other changes did you make for multi-data center support? It probably makes sense to track whatever additional work is needed to fully support multiple data centers in a separate bug, but it would be great to hear about what you have done.

Comment 4 Elias Ross 2014-01-09 21:18:11 UTC
For multi-DC work, the other issues are:
1032199 - Storage Node - RHQ may overwrite storage node (Cassandra) replication settings
1032192 - Core Server, Performance - RFE: Optimize RHQ server for remote database connection (~100ms latency)

There are some manual steps I did to install the new Cassandra node due to a problem adding more than one node at a time. I used Puppet to create the cassandra.yaml. But this then led me to find this problem:

1032308 - Storage node should output a warning from RhqInternodeAuthenticator if node not found

Currently, I can't get the RHQ 4.9 UI to show the storage nodes in the other data center. This may be due to network ACLs. It was very slow before anyway, with just 4 nodes. (I noticed some fixes in master for this.) I may retry with 4.10 at some point.

I did have RHQ running in the secondary datacenter for about a week, but again due to network ACLs, there was no way for the secondary DC RHQ to talk to agents in the other DC, although it does work the other way currently. This goes back to the issue of how RHQ must connect outbound to all agents. Obviously getting the storage nodes installed requires network access as well.

Comment 5 John Sanda 2014-02-11 03:14:01 UTC
Changes have been committed to master. This adds some initial support for configuring the load balancing policy. Define the rhq.storage.client.load-balancing property in rhq-server.properties. It currently recognizes two values - RoundRobin and DCAwareRoundRobin. If the latter is specified, then the rhq.storage.dc property also needs to be set; otherwise, RoundRobinPolicy will be used.
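
As a rough illustration of the selection rule described above, here is a minimal sketch. It is not the code from the commit: the class LoadBalancingSelector, its methods, and the datacenter name DC1 are hypothetical, and it returns the policy name as a string rather than constructing a driver object. Treating an unrecognized value as RoundRobin is also an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not RHQ's actual class. It only models the rule
// "DCAwareRoundRobin needs rhq.storage.dc, otherwise RoundRobinPolicy".
class LoadBalancingSelector {
    static final String POLICY_PROP = "rhq.storage.client.load-balancing";
    static final String DC_PROP = "rhq.storage.dc";

    // Returns a description of the policy that would be chosen.
    static String select(Properties props) {
        String policy = props.getProperty(POLICY_PROP, "RoundRobin").trim();
        String dc = props.getProperty(DC_PROP);
        if ("DCAwareRoundRobin".equals(policy) && dc != null && !dc.trim().isEmpty()) {
            return "DCAwareRoundRobinPolicy(" + dc.trim() + ")";
        }
        // Covers RoundRobin, DCAwareRoundRobin without a datacenter, and
        // (by assumption) any unrecognized value.
        return "RoundRobinPolicy";
    }

    // Parses text in the same key=value format as rhq-server.properties.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // "DC1" is a made-up datacenter name used only for this example.
        String config = "rhq.storage.client.load-balancing=DCAwareRoundRobin\n"
                      + "rhq.storage.dc=DC1\n";
        System.out.println(select(parse(config))); // prints DCAwareRoundRobinPolicy(DC1)
    }
}
```

With rhq.storage.dc left out of the same configuration, the sketch falls back to RoundRobinPolicy, matching the behavior described above.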

I would like to add support for the token aware and latency aware policies as well. They both wrap another load balancing policy which will make configuring them more difficult since the configuration right now is pretty simple using system properties.
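
To make the difficulty concrete: a wrapping policy needs a child policy, so a single property value is no longer enough. One possible approach, sketched below purely as an assumption, is a comma-separated chain in which the wrappers come first and the base policy comes last. The chain syntax, the class PolicyChainSketch, and the datacenter name DC1 are all hypothetical. Nothing here is implemented in RHQ, and the sketch builds a descriptive string instead of real driver objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a chained policy setting such as
// "TokenAware,DCAwareRoundRobin". Not part of RHQ or the driver.
class PolicyChainSketch {

    // Builds a nested description, wrapping from the innermost (last) entry outward.
    static String describe(String chain, String dc) {
        List<String> names = Arrays.asList(chain.trim().split("\\s*,\\s*"));
        String base = names.get(names.size() - 1);
        String result;
        if ("DCAwareRoundRobin".equals(base) && dc != null && !dc.trim().isEmpty()) {
            result = "DCAwareRoundRobinPolicy(" + dc.trim() + ")";
        } else if ("DCAwareRoundRobin".equals(base) || "RoundRobin".equals(base)) {
            // Same fallback as today: no datacenter means plain round robin.
            result = "RoundRobinPolicy";
        } else {
            throw new IllegalArgumentException("Unknown base policy: " + base);
        }
        // Apply the wrappers right to left so the first entry ends up outermost.
        for (int i = names.size() - 2; i >= 0; i--) {
            String wrapper = names.get(i);
            if ("TokenAware".equals(wrapper)) {
                result = "TokenAwarePolicy(" + result + ")";
            } else if ("LatencyAware".equals(wrapper)) {
                result = "LatencyAwarePolicy(" + result + ")";
            } else {
                throw new IllegalArgumentException("Unknown wrapping policy: " + wrapper);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints TokenAwarePolicy(DCAwareRoundRobinPolicy(DC1))
        System.out.println(describe("TokenAware,DCAwareRoundRobin", "DC1"));
    }
}
```

A chain keeps the configuration in plain properties, at the cost of having to validate the order and reject wrappers used as the base policy.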

master commit hash: a78ef9f

Comment 6 Heiko W. Rupp 2014-04-23 12:31:54 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.