Description of problem:
In RHQ 4.9 if an error occurs during aggregation, then entire job is essentially aborted and the remaining data is not aggregated. The aggregation code was re-implemented in RHQ 4.10. Data is processed in batches. We fetch the data for 5 (that number is configurable) schedules in parallel, and then perform the aggregation for multiple batches concurrently. If an exception occurs, the aggregation for that batch is aborted, but we will continue aggregating data for other batches. In terms of fault tolerance, it is an improvement from the implementation in 4.9; however, for each batch that fails, we do not retry the aggregation.
Work has already been done in master to address all failures. https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes describes the changes. The work done for bug 1114199 cover address failures. I decided to open a separate BZ though because there are different scenarios to test.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I am re-targeting this for RHQ 4.13 due to issues found in 4.12.
I am moving this back to ASSIGNED because I realize that there are a couple problems with the solution implemented in 4.12. It addresses failures in raw data aggregation and only partially in 1 hour or 6 hour data aggregation. When there is a failure, the corresponding schedule ids are not deleted from the metrics index table. On a subsequent run of the data aggregation job, we will query the raw data index for the prior time slice and attempt to aggregate the data again. If the 6 hour time slice has passed, we will also recompute the 6 hour data. And if the 24 hour time slice has passed, we will also recompute the 24 hour data.
Now suppose it is 12:00, and the data aggregation job runs. We will compute 1 hour data for the 11:00 - 12:00 hour. We will also compute 6 hour metrics for the 06:00 - 12:00 time slice. If there is an error aggregating the 1 hour data, we will not attempt to recompute the 6 hour metrics (assuming there are no errors computing raw data during the 06:00 - 12:00 time slice) since we only look at the raw data index for previous time slices. We need to look at the1 hour and 6 hour data indexes as well when looking at past time slices for any metrics that need to be recomputed.
Changes have been pushed to master.
commit hash: 574393c12f2a