1011107 – Baseline calculations is slow

Keywords:
Status:	NEW
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server, Storage Node
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	ER03
Target Release:	RHQ 4.13
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1011114 (view as bug list)
Depends On:
Blocks:	1011084 951619

Reported:	2013-09-23 15:54 UTC by John Sanda
Modified:	2022-03-31 04:28 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Embargoed:

Description John Sanda 2013-09-23 15:54:47 UTC

Description of problem:
Baseline calculations can take a long time when there are a large number of schedules that need baselines. Here are some stats from a 4.10-SNAPSHOT environment:

18:07:29,675 INFO  [org.rhq.enterprise.server.measurement.MeasurementBaselineManagerBean] (RHQScheduler_Worker-4) Calculated and inserted [41831] new baselines. (1716527)ms
18:07:29,688 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of baselines completed in [1716690]ms
18:07:29,688 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of OOBs starting
18:07:29,761 INFO  [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Removed [21772] outdated OOBs
18:07:29,905 INFO  [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Computing OOBs
18:07:47,902 INFO  [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] (EJB default - 3) jsanda-dev03.bc.jonqe.lab.eng.bos.redhat.com took [283]ms to reload cache for 2 agents
18:08:55,892 INFO  [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Finished calculating 82 OOBs in 85987 ms
18:08:55,892 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of OOBs completed in [86204]ms
18:08:55,892 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Data Purge Job FINISHED [6064715]ms


From the above server.log output, we can see that baseline calculations took 28.6 minutes (1716527 ms). This is the same environment that was used for bug 1009945, so it is not an overly large environment. As with aggregation, the issue is straightforward: the calculations for each schedule are done serially. Calculating baselines for multiple schedules concurrently should yield a dramatic improvement.
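To illustrate the concurrent approach, here is a minimal sketch, not the RHQ implementation. The class ConcurrentBaselines, the Baseline type, and computeBaseline are hypothetical names, and the min/max/mean calculation is only a stand-in for the real baseline logic. It also assumes Java 8 or later (for the lambda) and that every schedule has at least one 1 hr value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentBaselines {

    /** Hypothetical stand-in for a computed baseline. */
    public static final class Baseline {
        public final double min;
        public final double max;
        public final double mean;

        public Baseline(double min, double max, double mean) {
            this.min = min;
            this.max = max;
            this.mean = mean;
        }
    }

    /** Hypothetical: derives a baseline from the 1 hr values of one schedule (assumed non-empty). */
    static Baseline computeBaseline(double[] oneHourValues) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double v : oneHourValues) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        return new Baseline(min, max, sum / oneHourValues.length);
    }

    /** Computes one baseline per schedule on a fixed-size thread pool instead of serially. */
    public static Map<Integer, Baseline> computeAll(Map<Integer, double[]> dataBySchedule, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per schedule; the pool runs up to 'threads' of them at a time.
            Map<Integer, Future<Baseline>> futures = new HashMap<>();
            for (Map.Entry<Integer, double[]> entry : dataBySchedule.entrySet()) {
                final double[] values = entry.getValue();
                futures.put(entry.getKey(), pool.submit(() -> computeBaseline(values)));
            }
            // Collect the results; get() rethrows any failure from a task.
            Map<Integer, Baseline> baselines = new HashMap<>();
            for (Map.Entry<Integer, Future<Baseline>> entry : futures.entrySet()) {
                baselines.put(entry.getKey(), entry.getValue().get());
            }
            return baselines;
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count would need tuning against the database, since each real baseline calculation issues queries and the connection pool becomes the limiting factor.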

I think we can do even better, though, than simply calculating multiple baselines concurrently. We can in effect create a pipeline for the calculations that need to be done. Once the one hour data for a schedule is calculated, we can go ahead and generate the baseline, and then we can do the OOBs. Right now, we first do all the compression, then we do all the baselines, and then we do the OOBs.

raw data --> 1hr data --> 6 hr data --> 24 hr data
             \
              \
               --> baseline (if necessary) --> OOBs


The above diagram shows what the pipeline would look like. For a given schedule, once we calculate the 1 hr data, we can go ahead and calculate the baselines (if necessary) and then do the OOB calculations. We can generate the 6 hr and 24 hr data in parallel with the baseline and OOB calculations. Right now we see a big memory spike during the data purge job because the generated 1 hr data is passed to MeasurementOOBManagerLocal.computeOOBsForLastHour. This could be a sizable amount of memory depending on the number of raw metrics that are being aggregated. The pipeline is a more iterative approach where we could keep the increase in memory usage fixed regardless of the number of schedules being aggregated.
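The pipeline in the diagram could be wired roughly as follows. This is a sketch under stated assumptions, not existing RHQ code: SchedulePipeline, the Stages interface, and all of its methods are hypothetical, each schedule's 1 hr data is reduced to a single double for brevity, and Java 8 or later is assumed for CompletableFuture. A semaphore caps how many schedules are in flight, which is what keeps the amount of 1 hr data held in memory fixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class SchedulePipeline {

    /** Hypothetical stage callbacks; none of these are RHQ APIs. */
    public interface Stages {
        double aggregateOneHour(int scheduleId);             // raw data -> 1 hr data
        void rollUp(int scheduleId, double oneHour);          // 1 hr -> 6 hr -> 24 hr data
        double baselineFor(int scheduleId, double oneHour);   // calculates the baseline if necessary
        boolean isOutOfBounds(double oneHour, double baseline);
    }

    /** Runs the pipeline per schedule and returns the ids of schedules found out of bounds. */
    public static Set<Integer> run(List<Integer> scheduleIds, Stages stages, int threads, int maxInFlight)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Semaphore permits = new Semaphore(maxInFlight);
        Set<Integer> oobSchedules = ConcurrentHashMap.newKeySet();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            for (int id : scheduleIds) {
                // Blocks the submitting thread once maxInFlight schedules hold 1 hr data.
                permits.acquire();
                CompletableFuture<Double> oneHour =
                    CompletableFuture.supplyAsync(() -> stages.aggregateOneHour(id), pool);
                // Branch 1: further roll-ups, independent of baselines and OOBs.
                CompletableFuture<Void> rollUp =
                    oneHour.thenAcceptAsync(v -> stages.rollUp(id, v), pool);
                // Branch 2: baseline (if necessary), then the OOB check.
                CompletableFuture<Void> oob = oneHour.thenAcceptAsync(v -> {
                    double baseline = stages.baselineFor(id, v);
                    if (stages.isOutOfBounds(v, baseline)) {
                        oobSchedules.add(id);
                    }
                }, pool);
                // Release the permit when both branches finish, whether or not they succeeded.
                pending.add(CompletableFuture.allOf(rollUp, oob)
                    .whenComplete((result, error) -> permits.release()));
            }
            // join() rethrows the first stage failure as a CompletionException.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
            return oobSchedules;
        } finally {
            pool.shutdown();
        }
    }
}
```

Without the semaphore, a FIFO thread pool would tend to run every 1 hr aggregation before any dependent stage, which would reproduce the memory spike described above. The list of pending futures still grows with the number of schedules, but it holds only lightweight handles, not the aggregated data.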

Comment 1 Heiko W. Rupp 2013-09-23 19:05:09 UTC

*** Bug 1011114 has been marked as a duplicate of this bug. ***

Comment 2 John Sanda 2014-02-18 15:09:57 UTC

I have created bug 1066515 for moving baselines into Cassandra. I think it makes sense to try and hold off making any big performance enhancements until that migration effort is done. Retargeting for RHQ 4.11.

Comment 3 Heiko W. Rupp 2014-05-08 14:43:01 UTC

Bump the target version now that 4.11 is out.

Comment 4 Jay Shaughnessy 2014-07-07 16:56:21 UTC

Bumping to 4.13 due to time constraints
