| Summary: | Metrics aggregation is slow | | |
|---|---|---|---|
| Product: | [Other] RHQ Project | Reporter: | John Sanda <jsanda> |
| Component: | Core Server, Storage Node | Assignee: | John Sanda <jsanda> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | hrupp |
| Target Milestone: | GA | | |
| Target Release: | RHQ 4.10 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-04-23 12:30:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1011084, 951619 | | |
| Attachments: | | | |

Description
John Sanda
2013-09-19 14:38:47 UTC
I want to provide some additional details from further analysis I have done in order to better frame the issue. I set up a 4.10-SNAPSHOT build that initially consisted of a single server and two storage nodes. Here are some specs on the environment.

* The server was co-located with the PostgreSQL server
* The storage nodes were on their own, separate machines
* Each storage node was configured with a 2 GB heap
* There were over 5,000 resources in inventory
* There were roughly 85,800 numeric measurement schedules enabled
* Each machine was running OpenJDK 1.6.0_24, 64 bit
* The metric insertion rate was around 150k/minute

Once memtables started getting flushed to disk, aggregation times ticked up to around 45 minutes on the low end and as high as 2 hours on the high end. I wound up expanding the cluster to 6 nodes, all having the same specs in terms of Java version, heap size, and disk. It did not make an appreciable difference in the overall execution time for metrics aggregation. It was still falling in that 45 minute to 2 hour range.

Some of the larger spikes in execution time that I observed occurred when nodes were performing long-running compactions and also when 1 hour and 6 hour data is aggregated. Read performance can degrade substantially while compaction runs, and we perform lots of reads during aggregation. Deploying more nodes helps reduce the overall frequency of compactions, but not enough to make a real difference in aggregation execution times. I believe this is due in large part to metric data being aggregated serially.

The work for this was done in the jsanda/aggregation branch. I did a squash commit of the branch into master.

master commit hash - 3b192452e

The old version of the aggregation code is still intact for further testing. Switching between implementations can be accomplished by setting a system property, which is discussed more below. The new, async aggregation code is enabled by default.
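The difference between serial and async aggregation mentioned above can be sketched in a few lines of Python. This is only an illustration, not the RHQ implementation: `fetch_raw`, `summarize`, and the 10 ms simulated read latency are hypothetical stand-ins for a storage read and the min/max/average roll-up. The point is that a serial loop pays each read latency in turn, while concurrent requests let the latencies overlap.

```python
import asyncio
import time


async def fetch_raw(schedule_id):
    """Hypothetical storage read; the sleep simulates I/O latency."""
    await asyncio.sleep(0.01)
    return [float(schedule_id), float(schedule_id) + 2.0]


def summarize(values):
    """Roll raw values up into (min, max, average)."""
    return (min(values), max(values), sum(values) / len(values))


async def aggregate_serial(schedule_ids):
    # One read at a time: total time is roughly the sum of all latencies.
    results = {}
    for sid in schedule_ids:
        results[sid] = summarize(await fetch_raw(sid))
    return results


async def aggregate_concurrent(schedule_ids):
    # All reads in flight together: latencies overlap.
    rows = await asyncio.gather(*(fetch_raw(sid) for sid in schedule_ids))
    return {sid: summarize(vals) for sid, vals in zip(schedule_ids, rows)}


def timed(coro_fn, schedule_ids):
    """Run an aggregation coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(schedule_ids))
    return result, time.perf_counter() - start
```

Calling `timed(aggregate_serial, ids)` and `timed(aggregate_concurrent, ids)` with the same ids returns identical aggregates, with the concurrent version finishing sooner. Unbounded concurrency like this would overload a real cluster, which is why the async code path is throttled.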
To disable the async version, (re)start the RHQ server with the following system property in rhq-server.properties,

    -Drhq.metrics.aggregation.async=false

Prior to these changes only the insertion of raw metrics was throttled, since it was the only place where async requests were used. The throttling has been changed a good bit. RateLimiter is used instead of Semaphore. Separate RateLimiters are used for reads and for writes. The limits are configurable via the following system properties (that would be added in rhq-server.properties),

    rhq.storage.read-limit   # defaults to 1000
    rhq.storage.write-limit  # defaults to 2500

There are more changes coming to the throttling. In particular, if the client starts going too fast, generating timeout exceptions, then I think we ought to automatically lower the limits to prevent more timeouts. I will track this work under a separate bug.

Increasing the limits will definitely increase throughput, resulting in faster aggregation times. If you increase the limits, you may also want to increase the driver's default connection pool sizes. These are also currently configurable only through the following system properties,

    rhq.storage.client.local-connections       # defaults to 24
    rhq.storage.client.remote-connections      # defaults to 16
    rhq.storage.client.max-local-connections   # defaults to 32
    rhq.storage.client.max-remote-connections  # defaults to 24

It probably makes sense to be able to manage/monitor this stuff via the agent plugin for the RHQ server. I will create a separate bug to track that effort.

Created attachment 839223: metrics simulator configuration
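For readers unfamiliar with the distinction behind the throttling change: a semaphore caps how many requests are in flight at once, while a rate limiter caps how many requests start per unit of time. The following is a minimal token-bucket sketch of the second idea in Python. It is a simplified stand-in, not Guava's RateLimiter and not the RHQ code; the `TokenBucket` class, its `clock` parameter, and the assumption that the limits are permits per second are all illustrative.

```python
import time


class TokenBucket:
    """Simplified permits-per-second limiter (illustrative only)."""

    def __init__(self, permits_per_second, clock=time.monotonic):
        self.rate = float(permits_per_second)
        self.clock = clock
        # Start full: allow a burst of up to one second's worth of permits.
        self.tokens = self.rate
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at rate.
        now = self.clock()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self):
        """Take one permit if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under these assumptions, separate read and write limiters would be created as `TokenBucket(1000)` and `TokenBucket(2500)`, mirroring the default limits listed above. Lowering the limits automatically on timeouts, as proposed, would amount to reducing `rate` at runtime.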
Created attachment 839228: results of async simulation run

Created attachment 839229: results of sync simulation run

I have attached some results of test runs with the metrics simulator that show *some* of the performance gain with the new async aggregation. Several metrics are reported in the results, but the one of interest is org.rhq.metrics.simulator.MeasurementAggregator.totalAggregationTime. It is reported as a histogram.

Unfortunately, the values are a bit skewed due to the simulator. Time intervals are compressed with the simulator such that it generated about 31 days of metric data in 78 minutes. Aggregation cannot keep up, and so aggregation tasks start getting queued up. The simulator does not exit until all of the tasks are finished. Once the simulation ends, the aggregation task queue starts to drain and tasks complete much faster since there is no other load.

Because of all this, we need to look at the max value reported in the histogram. For the async results, it is about 17 minutes. For the sync results, it is about 54 minutes. That is a reduction of roughly 68% in the maximum aggregation time. The simulation was inserting 40k metrics/minute. The greater the number of metrics/minute and the longer Cassandra has been running, the greater the performance increase will be.

This can be moved to ON_QA. The work has been in master for a little while now. There is more work being done, but it is targeted for 4.11.

Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
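As a check on the simulator figures quoted in the results comment above, the arithmetic can be reproduced with a short Python snippet. The function names are made up for illustration; the inputs (54 and 17 minute maximums, 31 days of data in 78 minutes) are the approximate values stated in the comment.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


def speedup_factor(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


def compression_ratio(simulated_days, wall_clock_minutes):
    """Simulated minutes of metric data per wall-clock minute."""
    return simulated_days * 24 * 60 / wall_clock_minutes


sync_max_minutes = 54.0   # approximate max totalAggregationTime, sync run
async_max_minutes = 17.0  # approximate max totalAggregationTime, async run

print(f"reduction: {percent_reduction(sync_max_minutes, async_max_minutes):.1f}%")
print(f"speedup:   {speedup_factor(sync_max_minutes, async_max_minutes):.2f}x")
print(f"time compression: {compression_ratio(31, 78):.1f}x")
```

This gives a reduction of about 68.5% (roughly 3.2 times faster) and a time compression of about 572 to 1, which is why aggregation tasks queue up during the simulation.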