Bug 1114202 - Data aggregation should be fault tolerant
Summary: Data aggregation should be fault tolerant
Status: ON_QA
Alias: None
Product: RHQ Project
Classification: Other
Component: Core Server, Storage Node
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHQ 4.13
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
Depends On: 1114199
Blocks: 1133605 1114203
Reported: 2014-06-28 15:21 UTC by John Sanda
Modified: 2019-03-08 22:42 UTC (History)

Clone Of:
Clones: 1114203
Last Closed:


Description John Sanda 2014-06-28 15:21:52 UTC
Description of problem:
In RHQ 4.9, if an error occurs during aggregation, the entire job is essentially aborted and the remaining data is not aggregated. The aggregation code was re-implemented in RHQ 4.10 so that data is processed in batches. We fetch the data for 5 schedules in parallel (that number is configurable), and then perform the aggregation for multiple batches concurrently. If an exception occurs, the aggregation for that batch is aborted, but we continue aggregating data for the other batches. In terms of fault tolerance, this is an improvement over the implementation in 4.9; however, we do not retry the aggregation for a batch that fails.
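
To make the 4.10 behavior concrete, here is a minimal sketch in Python (not the actual RHQ code, which is Java). It assumes a hypothetical `aggregate_batch` callable and illustrative names throughout. It shows a failure aborting only its own batch, with the failed batches collected so that a caller could retry them.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(schedule_ids, batch_size=5):
    """Split schedule ids into batches of at most batch_size (5 is the default described above)."""
    return [schedule_ids[i:i + batch_size]
            for i in range(0, len(schedule_ids), batch_size)]


def aggregate_all(schedule_ids, aggregate_batch, batch_size=5, max_workers=4):
    """Aggregate all batches concurrently.

    A failure aborts only its own batch. Returns (succeeded, failed),
    each a list of batches in submission order. aggregate_batch is a
    hypothetical callable standing in for the real aggregation step.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        submitted = [(batch, pool.submit(aggregate_batch, batch))
                     for batch in make_batches(schedule_ids, batch_size)]
        for batch, future in submitted:
            try:
                future.result()  # re-raises any exception from the worker
            except Exception:
                failed.append(batch)  # remembered so it can be retried later
            else:
                succeeded.append(batch)
    return succeeded, failed
```

The `failed` list is the part missing from the behavior described above: without it, there is nothing to drive a retry on a later run.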

Work has already been done in master to address all failures; https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes describes the changes. The work done for bug 1114199 also addresses these failures. I decided to open a separate BZ, though, because there are different scenarios to test.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 John Sanda 2014-08-25 15:01:06 UTC
I am re-targeting this for RHQ 4.13 due to issues found in 4.12.

Comment 2 John Sanda 2014-09-09 17:36:03 UTC
I am moving this back to ASSIGNED because I realize that there are a couple of problems with the solution implemented in 4.12. It addresses failures in raw data aggregation but only partially addresses failures in 1 hour or 6 hour data aggregation. When there is a failure, the corresponding schedule ids are not deleted from the metrics index table. On a subsequent run of the data aggregation job, we will query the raw data index for the prior time slice and attempt to aggregate the data again. If the 6 hour time slice has passed, we will also recompute the 6 hour data. And if the 24 hour time slice has passed, we will also recompute the 24 hour data.
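
The index-driven retry described above can be sketched as follows. This is a simplified Python model, not the RHQ implementation: the index is assumed to be a plain dict mapping a time slice to a set of schedule ids, and `aggregate` is a hypothetical callable. An id is removed from the index only after it has been aggregated successfully, so a later run picks up whatever is left.

```python
def run_aggregation(index, time_slice, aggregate):
    """Aggregate every schedule id indexed for time_slice.

    index is a hypothetical in-memory stand-in for the metrics index
    table: {time_slice: set_of_schedule_ids}. Returns the ids that are
    still pending after this run.
    """
    pending = index.get(time_slice, set())
    for schedule_id in sorted(pending):
        try:
            aggregate(schedule_id, time_slice)
        except Exception:
            continue  # on failure the id stays in the index
        pending.discard(schedule_id)  # on success the index entry is removed
    if not pending:
        index.pop(time_slice, None)  # nothing left to do for this slice
    return sorted(pending)
```

Calling `run_aggregation` again for the same time slice on the next job run retries only the ids that failed previously.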

Now suppose it is 12:00, and the data aggregation job runs. We will compute 1 hour data for the 11:00 - 12:00 hour. We will also compute 6 hour metrics for the 06:00 - 12:00 time slice. If there is an error aggregating the 1 hour data, we will not attempt to recompute the 6 hour metrics (assuming there are no errors computing raw data during the 06:00 - 12:00 time slice) since we only look at the raw data index for previous time slices. We need to look at the 1 hour and 6 hour data indexes as well when looking at past time slices for any metrics that need to be recomputed.
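
The 12:00 scenario above can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the index names (`raw`, `one_hour`, `six_hour`), the dict layout, and the alignment of 6 hour slices to 00:00, 06:00, 12:00, and 18:00. It computes which time slices end at a given run time, then shows that consulting only the raw data index misses a leftover entry in the 1 hour data index.

```python
from datetime import datetime, timedelta


def finished_slices(run_time):
    """Return {bucket: (start, end)} for the time slices that end at run_time.

    Assumes 6 hour slices are aligned to 00:00/06:00/12:00/18:00 and
    24 hour slices end at midnight.
    """
    end = run_time.replace(minute=0, second=0, microsecond=0)
    slices = {"1h": (end - timedelta(hours=1), end)}
    if end.hour % 6 == 0:
        slices["6h"] = (end - timedelta(hours=6), end)
    if end.hour == 0:
        slices["24h"] = (end - timedelta(hours=24), end)
    return slices


def pending_past_slices(indexes, consulted):
    """List (index_name, slice_start, schedule_ids) entries still awaiting aggregation.

    indexes is a hypothetical {index_name: {slice_start: set_of_ids}}
    mapping; only the index names listed in consulted are examined.
    """
    found = []
    for name in consulted:
        for slice_start, ids in sorted(indexes.get(name, {}).items()):
            if ids:
                found.append((name, slice_start, sorted(ids)))
    return found
```

With a leftover entry only in the hypothetical `one_hour` index, `pending_past_slices(indexes, ["raw"])` finds nothing, while `pending_past_slices(indexes, ["raw", "one_hour", "six_hour"])` reports the slice that still needs its 6 hour data recomputed.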

Comment 3 John Sanda 2014-09-26 20:43:13 UTC
Changes have been pushed to master.

commit hash: 574393c12f2a
