1009945 – Metrics aggregation is slow

Bug 1009945 - Metrics aggregation is slow

Summary: Metrics aggregation is slow

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server, Storage Node
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	GA
Target Release:	RHQ 4.10
Assignee:	John Sanda
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1011084 951619
TreeView+	depends on / blocked

Reported:	2013-09-19 14:38 UTC by John Sanda
Modified:	2014-04-23 12:30 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-04-23 12:30:38 UTC
Embargoed:

Attachments	(Terms of Use)
metrics simulator configuration (198 bytes, application/json) 2013-12-19 19:33 UTC, John Sanda	no flags	Details
results of async simulation run (139.40 KB, text/plain) 2013-12-19 19:34 UTC, John Sanda	no flags	Details
results of sync simulation run (143.61 KB, text/plain) 2013-12-19 19:35 UTC, John Sanda	no flags	Details
View All

Description John Sanda 2013-09-19 14:38:47 UTC

Description of problem:
The metrics aggregation is very slow right now due a suboptimal implementation. Here are some stats from an RHQ 4.9 test environment:

01:10:58,733 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [raw_metrics]2770900 ms
01:14:32,498 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [one_hour_metrics] 184424 ms
01:17:35,831 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [six_hour_metrics] 154837 ms
01:17:35,833 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Measurement data compression completed in [3168002]ms

Aggregating raw data took about 46 minutes and the total overall time for metrics aggregation was 52.8 minutes. That's way too long considering the data purge job runs hourly. And this was from a server with a small to medium size inventory. There are a total of 3579 resources with only a couple agents. There are 61210 schedules that are enabled and the server is inserting about 37k metrics/minute.

The problem is actually pretty simple. Aggregation is done serially for each schedule (that has data to be aggregated). The CQL driver is asynchronous which makes it easier to compute aggregates for multiple schedules concurrently. This should produce a dramatic speed up.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 John Sanda 2013-11-27 20:17:34 UTC

I want to provide some additional details from further analysis I have done in order to better frame the issue. I set up a 4.10-SNAPSHOT build that initially consisted of a single server and two storage nodes. Here are some specs on the environment.

* The server was co-located with the postgresql server
* The storage nodes were their own, separate machines 
* Each storage node was configured with a 2 GB heap
* There were over 5,000 resources in inventory
* There were roughly 85,800 numeric, measurement schedules enabled
* Each machine is running OpenJDK 1.6.0_24, 64 bit
* The metric insertion rate was around 150k/minute

Once memtables started getting flushed to disk ticked up to around 45 minutes on the low end and as high as 2 hours on the high end. I wound up expanding the cluster to 6 nodes, all having the same specs in terms of Java version, heap size, and disk. It did not make an appreciable difference in the overall execution time for metrics aggregation. It was still falling in that 45 minute to 2 hour range.

Some of the larger spikes in execution time that I observed occurred when nodes were performing long-running compactions and also when 1 hour and 6 data is aggregated. Read performance can degrade substantially while compaction runs, and we perform lots of reads during aggregation. Deploying more nodes helps reduce the overall frequency of compactions, but not to the extent to make enough of a difference for aggregation execution times. I believe this is due in large part to metric data being aggregated serially.

Comment 2 John Sanda 2013-12-19 19:27:23 UTC

The work for this was done in the jsanda/aggregation branch. I did a squash commit of the branch into master.

master commit hash - 3b192452e

The old version of the aggregation code is still intact for further testing. Switching between impls can be accomplished by setting a system property which is discussed more below.

The new, async aggregation code is enabled by default. To disable the async version, (re)start the RHQ server with the following system property in rhq-server.properties,

-Drhq.metrics.aggregation.async=false

Prior to these changes only the insertion of raw metrics was throttled since it was the only place where async requests were used. The throttling has been changed a good bit. RateLimiter is used instead of Semaphore. Separate RateLimiters are used for reads and for writes. The limits are configurable via the following system properties (that would be added in rhq-server.properties),

rhq.storage.read-limit # defaults to 1000
rhq.storage.write-limit # defaults to 2500

There are more changes coming to the throttling. In particular, if the client starts going too, generating timeout exceptions, then I think we ought to automatically lower the limits to prevent more timeouts. I will track this work under a separate bug. Increasing the limits will definitely increase throughput resulting in faster aggregation times.

If you increase the limits, you may also want to increase the driver's default connection pool sizes. This also is currently configurable only through the following system properties,

rhq.storage.client.local-connections # defaults to 24
rhq.storage.client.remote-connections # defaults to 16
rhq.storage.client.max-local-connections # defaults to 32
rhq.storage.client.max-remote-connections # defaults to 24

It probably makes sense to be able to manage/monitor this stuff via the agent plugin for the RHQ server. I will create a separate bug to track that effort.

Comment 3 John Sanda 2013-12-19 19:33:12 UTC

Created attachment 839223 [details]
metrics simulator configuration

Comment 4 John Sanda 2013-12-19 19:34:42 UTC

Created attachment 839228 [details]
results of async simulation run

Comment 5 John Sanda 2013-12-19 19:35:08 UTC

Created attachment 839229 [details]
results of sync simulation run

Comment 6 John Sanda 2013-12-19 19:45:59 UTC

I have attached some results of test runs with the metrics simulator that show *some* of the performance gain with the new async aggregation. Several metrics are reported in the results, but the one of interest is org.rhq.metrics.simulator.MeasurementAggregator.totalAggregationTime. It is reported as a histogram. Unfortunately, the values are a bit skewed due to the simulator. Time intervals are compressed with the simulator such that it generated  about 31 days of metric data in 78 minutes. Aggregation cannot keep up and so aggregation tasks start getting queued up. The simulator does not exit until all of the tasks are finished. Once the simulation ends, the aggregation task queue starts to drain and tasks complete much faster since there is no other load. Because of all this, we need to look at the max value reported in the histogram. For the async results, it is about 17 minutes. For the sync results, it is about 54 minutes. That is a speed up of roughly 68%. The simulation was inserting 40k metrics/minute. As the number of metrics/minute increases and the longer Cassandra is running, the greater the performance increase there will be.

Comment 7 John Sanda 2014-03-11 14:22:33 UTC

This can be moved to ON_QA. The work has been in master for a little while now. There is more work being done, but it is targeted for 4.11.

Comment 8 Heiko W. Rupp 2014-04-23 12:30:38 UTC

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.

Note You need to log in before you can comment on or make changes to this bug.