Red Hat Bugzilla – Bug 1011107
Baseline calculations are slow
Last modified: 2014-07-07 12:56:21 EDT
Description of problem:
Baseline calculations can take a long time when there are a large number of schedules that need baselines. Here are some stats from a 4.10-SNAPSHOT environment:
18:07:29,675 INFO [org.rhq.enterprise.server.measurement.MeasurementBaselineManagerBean] (RHQScheduler_Worker-4) Calculated and inserted  new baselines. (1716527)ms
18:07:29,688 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of baselines completed in ms
18:07:29,688 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of OOBs starting
18:07:29,761 INFO [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Removed  outdated OOBs
18:07:29,905 INFO [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Computing OOBs
18:07:47,902 INFO [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] (EJB default - 3) jsanda-dev03.bc.jonqe.lab.eng.bos.redhat.com took ms to reload cache for 2 agents
18:08:55,892 INFO [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Finished calculating 82 OOBs in 85987 ms
18:08:55,892 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of OOBs completed in ms
18:08:55,892 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Data Purge Job FINISHED ms
From the above server.log output, we can see that baseline calculations took 28.6 minutes (1716527 ms). This is the same environment that was used for bug 1009945, so it is not an overly large environment. As with the aggregation, the issue is straightforward: the calculations for each schedule are done serially. Calculating baselines for multiple schedules concurrently should yield a dramatic improvement.
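To illustrate the idea, here is a minimal sketch of computing baselines for several schedules concurrently with a fixed thread pool. It is not RHQ code: the class ConcurrentBaselineSketch, its Baseline type, and the assumption that a baseline is just the min/max/mean of a schedule's one hour values are all hypothetical, and the real calculation would read from the database instead of from in-memory arrays.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not RHQ code: all names here are illustrative.
class ConcurrentBaselineSketch {

    // Assumption: a baseline is the min, max, and mean of the one hour values.
    static final class Baseline {
        final double min;
        final double max;
        final double mean;

        Baseline(double min, double max, double mean) {
            this.min = min;
            this.max = max;
            this.mean = mean;
        }
    }

    // Computes one schedule's baseline; expects at least one value.
    static Baseline computeBaseline(double[] oneHourValues) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (double v : oneHourValues) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        return new Baseline(min, max, sum / oneHourValues.length);
    }

    // Submits one task per schedule so the schedules are no longer processed serially.
    static Map<Integer, Baseline> computeAll(Map<Integer, double[]> dataBySchedule, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<Integer, Future<Baseline>> pending = new LinkedHashMap<>();
            for (Map.Entry<Integer, double[]> e : dataBySchedule.entrySet()) {
                final double[] values = e.getValue();
                Callable<Baseline> task = () -> computeBaseline(values);
                pending.put(e.getKey(), pool.submit(task));
            }
            // Collect results in submission order; get() blocks until each task finishes.
            Map<Integer, Baseline> result = new LinkedHashMap<>();
            for (Map.Entry<Integer, Future<Baseline>> e : pending.entrySet()) {
                try {
                    result.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while computing baselines", ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException("Baseline calculation failed", ex.getCause());
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count is left as a parameter because the right value would depend on how much load the database can take, which this sketch does not model.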
I think we can do even better than simply calculating multiple baselines concurrently, though. We can in effect create a pipeline for the calculations that need to be done. Once the one hour data for a schedule is calculated, we can go ahead and generate the baseline, and then we can do the OOBs. Right now, we first do all the compression, then we do all the baselines, and then we do the OOBs.
raw data --> 1 hr data --> 6 hr data --> 24 hr data
                  |
                  +--> baseline (if necessary) --> OOBs
The above diagram shows what the pipeline would look like. For a given schedule, once we calculate the 1 hr data, we can go ahead and calculate the baselines (if necessary) and then do the OOB calculations. We can generate the 6 hr and 24 hr data in parallel with the baseline and OOB calculations. Right now we see a big memory spike during the data purge job because the generated 1 hr data is passed to MeasurementOOBManagerLocal.computeOOBsForLastHour. This could be a sizable amount of memory depending on the number of raw metrics that are being aggregated. The pipeline is a more iterative approach where we could keep the increase in memory usage fixed regardless of the number of schedules being aggregated.
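A rough sketch of the per-schedule pipeline described above, using CompletableFuture to branch after the 1 hr stage. Everything in it is an assumption made for illustration and none of it is RHQ code: the SchedulePipelineSketch class, the Input and Output types, treating the baseline as the min/max of earlier 1 hr values, and flagging an OOB when the new 1 hr value falls outside that range. The 24 hr rollup is omitted for brevity. Schedules are processed in fixed-size batches, so only one batch of intermediate data is held at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not RHQ code: every class, field, and method name is illustrative.
class SchedulePipelineSketch {

    static final class Input {
        final int scheduleId;
        final double[] rawLastHour;   // raw samples from the last hour (non-empty)
        final double[] priorOneHour;  // earlier 1 hr values (may be empty)

        Input(int scheduleId, double[] rawLastHour, double[] priorOneHour) {
            this.scheduleId = scheduleId;
            this.rawLastHour = rawLastHour;
            this.priorOneHour = priorOneHour;
        }
    }

    static final class Output {
        final int scheduleId;
        final double oneHour;
        final double sixHour;
        final boolean outOfBounds;

        Output(int scheduleId, double oneHour, double sixHour, boolean outOfBounds) {
            this.scheduleId = scheduleId;
            this.oneHour = oneHour;
            this.sixHour = sixHour;
            this.outOfBounds = outOfBounds;
        }
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Stand-in for the 6 hr rollup: average of the earlier 1 hr values plus the new one.
    static double rollUp(double[] prior, double latest) {
        double sum = latest;
        for (double v : prior) sum += v;
        return sum / (prior.length + 1);
    }

    // Stand-in for baseline + OOB: the baseline is the min/max of the earlier 1 hr values.
    static boolean isOutOfBounds(double[] prior, double latest) {
        if (prior.length == 0) return false; // no baseline yet, so nothing to compare against
        double min = prior[0];
        double max = prior[0];
        for (double v : prior) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return latest < min || latest > max;
    }

    // One schedule's pipeline: raw --> 1 hr, then two branches that run independently.
    static CompletableFuture<Output> run(Input in, ExecutorService pool) {
        CompletableFuture<Double> oneHour =
            CompletableFuture.supplyAsync(() -> mean(in.rawLastHour), pool);
        CompletableFuture<Double> sixHour =
            oneHour.thenApplyAsync(h -> rollUp(in.priorOneHour, h), pool);
        CompletableFuture<Boolean> oob =
            oneHour.thenApplyAsync(h -> isOutOfBounds(in.priorOneHour, h), pool);
        // oneHour has already completed once both branches are done, so join() does not block.
        return sixHour.thenCombine(oob,
            (six, flag) -> new Output(in.scheduleId, oneHour.join(), six, flag));
    }

    // Runs the pipelines batch by batch; each batch finishes before the next one starts.
    static List<Output> runAll(List<Input> inputs, int threads, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be at least 1");
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Output> results = new ArrayList<>();
            for (int start = 0; start < inputs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, inputs.size());
                List<CompletableFuture<Output>> batch = new ArrayList<>();
                for (Input in : inputs.subList(start, end)) batch.add(run(in, pool));
                for (CompletableFuture<Output> f : batch) results.add(f.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real implementation the inputs would presumably be loaded per batch rather than passed in as one list, which is what would actually keep the memory increase fixed; the sketch only shows the shape of the stages.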
*** Bug 1011114 has been marked as a duplicate of this bug. ***
I have created bug 1066515 for moving baselines into Cassandra. I think it makes sense to hold off on making any big performance enhancements until that migration effort is done. Retargeting for RHQ 4.11.
Bump the target version now that 4.11 is out.
Bumping to 4.13 due to time constraints