+++ This bug was initially created as a clone of Bug #1114202 +++

Description of problem:
In RHQ 4.9, if an error occurs during aggregation, the entire job is essentially aborted and the remaining data is not aggregated. The aggregation code was re-implemented in RHQ 4.10 so that data is processed in batches. We fetch the data for 5 schedules in parallel (that number is configurable), and then perform the aggregation for multiple batches concurrently. If an exception occurs, the aggregation for that batch is aborted, but we continue aggregating data for the other batches. In terms of fault tolerance this is an improvement over the implementation in 4.9; however, we do not retry the aggregation for a batch that fails.

Work has already been done in master to address all failures. https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes describes the changes. The work done for bug 1114199 covers these failures. I decided to open a separate BZ, though, because there are different scenarios to test.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
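For anyone testing the different scenarios, the batch behaviour described in this report can be sketched roughly as follows. This is only an illustration under stated assumptions, not the RHQ implementation: the class `BatchAggregationSketch`, its methods, and the `maxAttempts` and `threads` parameters are all hypothetical names, schedules are reduced to plain integer ids, and the schema changes made in master are not modelled. It shows a failing batch being isolated from the other batches and then retried, which is the part 4.10 lacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from the RHQ code base.
class BatchAggregationSketch {

    /** Splits schedule ids into batches of at most batchSize (5 in the report). */
    static List<List<Integer>> partition(List<Integer> scheduleIds, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < scheduleIds.size(); i += batchSize) {
            int end = Math.min(i + batchSize, scheduleIds.size());
            batches.add(new ArrayList<>(scheduleIds.subList(i, end)));
        }
        return batches;
    }

    /**
     * Runs the aggregator over all batches concurrently. An exception in one
     * batch does not stop the others. Failed batches are resubmitted until
     * maxAttempts rounds have been run. Returns the batches that still failed.
     */
    static List<List<Integer>> aggregateAll(List<List<Integer>> batches,
                                            Consumer<List<Integer>> aggregator,
                                            int maxAttempts,
                                            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<List<Integer>> pending = batches;
            for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
                // Submit every pending batch before waiting on any of them.
                List<Future<?>> futures = new ArrayList<>();
                for (List<Integer> batch : pending) {
                    futures.add(pool.submit(() -> aggregator.accept(batch)));
                }
                // Collect the batches whose aggregation threw an exception.
                List<List<Integer>> failed = new ArrayList<>();
                for (int i = 0; i < futures.size(); i++) {
                    try {
                        futures.get(i).get();
                    } catch (ExecutionException e) {
                        failed.add(pending.get(i));
                    }
                }
                pending = failed;
            }
            return pending;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("aggregation interrupted", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With `maxAttempts` set to 1 the sketch behaves like the 4.10 description (a failed batch is simply reported back), and with a higher value a transient failure in one batch is retried without affecting the batches that already succeeded.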
I want to point out that we cannot backport the changes to 3.2.x because they are substantial and include schema changes. If we want to do this in 3.2.x, then we need a separate BZ to track that effort.
Changes have been pushed to the release/jon3.3.x branch. See bug 1114202 for details. commit hashes: 2ee9abb58 05dbaec9b db066d9863 874addb583 dff81ed514
Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=385149
Verified in JON 3.3 ER04.