Description of problem:
The metrics aggregation is very slow right now due a suboptimal implementation. Here are some stats from an RHQ 4.9 test environment:
01:10:58,733 INFO [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [raw_metrics]2770900 ms
01:14:32,498 INFO [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [one_hour_metrics] 184424 ms
01:17:35,831 INFO [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-1) Finished computing aggregates for table [six_hour_metrics] 154837 ms
01:17:35,833 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Measurement data compression completed in ms
Aggregating raw data took about 46 minutes and the total overall time for metrics aggregation was 52.8 minutes. That's way too long considering the data purge job runs hourly. And this was from a server with a small to medium size inventory. There are a total of 3579 resources with only a couple agents. There are 61210 schedules that are enabled and the server is inserting about 37k metrics/minute.
The problem is actually pretty simple. Aggregation is done serially for each schedule (that has data to be aggregated). The CQL driver is asynchronous which makes it easier to compute aggregates for multiple schedules concurrently. This should produce a dramatic speed up.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I want to provide some additional details from further analysis I have done in order to better frame the issue. I set up a 4.10-SNAPSHOT build that initially consisted of a single server and two storage nodes. Here are some specs on the environment.
* The server was co-located with the postgresql server
* The storage nodes were their own, separate machines
* Each storage node was configured with a 2 GB heap
* There were over 5,000 resources in inventory
* There were roughly 85,800 numeric, measurement schedules enabled
* Each machine is running OpenJDK 1.6.0_24, 64 bit
* The metric insertion rate was around 150k/minute
Once memtables started getting flushed to disk ticked up to around 45 minutes on the low end and as high as 2 hours on the high end. I wound up expanding the cluster to 6 nodes, all having the same specs in terms of Java version, heap size, and disk. It did not make an appreciable difference in the overall execution time for metrics aggregation. It was still falling in that 45 minute to 2 hour range.
Some of the larger spikes in execution time that I observed occurred when nodes were performing long-running compactions and also when 1 hour and 6 data is aggregated. Read performance can degrade substantially while compaction runs, and we perform lots of reads during aggregation. Deploying more nodes helps reduce the overall frequency of compactions, but not to the extent to make enough of a difference for aggregation execution times. I believe this is due in large part to metric data being aggregated serially.
The work for this was done in the jsanda/aggregation branch. I did a squash commit of the branch into master.
master commit hash - 3b192452e
The old version of the aggregation code is still intact for further testing. Switching between impls can be accomplished by setting a system property which is discussed more below.
The new, async aggregation code is enabled by default. To disable the async version, (re)start the RHQ server with the following system property in rhq-server.properties,
Prior to these changes only the insertion of raw metrics was throttled since it was the only place where async requests were used. The throttling has been changed a good bit. RateLimiter is used instead of Semaphore. Separate RateLimiters are used for reads and for writes. The limits are configurable via the following system properties (that would be added in rhq-server.properties),
rhq.storage.read-limit # defaults to 1000
rhq.storage.write-limit # defaults to 2500
There are more changes coming to the throttling. In particular, if the client starts going too, generating timeout exceptions, then I think we ought to automatically lower the limits to prevent more timeouts. I will track this work under a separate bug. Increasing the limits will definitely increase throughput resulting in faster aggregation times.
If you increase the limits, you may also want to increase the driver's default connection pool sizes. This also is currently configurable only through the following system properties,
rhq.storage.client.local-connections # defaults to 24
rhq.storage.client.remote-connections # defaults to 16
rhq.storage.client.max-local-connections # defaults to 32
rhq.storage.client.max-remote-connections # defaults to 24
It probably makes sense to be able to manage/monitor this stuff via the agent plugin for the RHQ server. I will create a separate bug to track that effort.
Created attachment 839223 [details]
metrics simulator configuration
Created attachment 839228 [details]
results of async simulation run
Created attachment 839229 [details]
results of sync simulation run
I have attached some results of test runs with the metrics simulator that show *some* of the performance gain with the new async aggregation. Several metrics are reported in the results, but the one of interest is org.rhq.metrics.simulator.MeasurementAggregator.totalAggregationTime. It is reported as a histogram. Unfortunately, the values are a bit skewed due to the simulator. Time intervals are compressed with the simulator such that it generated about 31 days of metric data in 78 minutes. Aggregation cannot keep up and so aggregation tasks start getting queued up. The simulator does not exit until all of the tasks are finished. Once the simulation ends, the aggregation task queue starts to drain and tasks complete much faster since there is no other load. Because of all this, we need to look at the max value reported in the histogram. For the async results, it is about 17 minutes. For the sync results, it is about 54 minutes. That is a speed up of roughly 68%. The simulation was inserting 40k metrics/minute. As the number of metrics/minute increases and the longer Cassandra is running, the greater the performance increase there will be.
This can be moved to ON_QA. The work has been in master for a little while now. There is more work being done, but it is targeted for 4.11.
Bulk closing of 4.10 issues.
If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.