Bug 1114202

Summary: Data aggregation should be fault tolerant
Product: [Other] RHQ Project
Reporter: John Sanda <jsanda>
Component: Core Server, Storage Node
Assignee: Nobody <nobody>
Status: ON_QA
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.11
CC: hrupp
Target Milestone: ---
Target Release: RHQ 4.13
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1114203 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Bug Depends On: 1114199    
Bug Blocks: 1133605, 1114203    

Description John Sanda 2014-06-28 15:21:52 UTC
Description of problem:
In RHQ 4.9, if an error occurs during aggregation, the entire job is essentially aborted and the remaining data is not aggregated. The aggregation code was re-implemented in RHQ 4.10, and data is now processed in batches. We fetch the data for 5 schedules at a time (that number is configurable) and then perform the aggregation for multiple batches concurrently. If an exception occurs, the aggregation for that batch is aborted, but we continue aggregating data in the other batches. In terms of fault tolerance this is an improvement over the 4.9 implementation; however, for each batch that fails, we do not retry the aggregation.
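The per-batch fault isolation described above can be sketched as follows. This is an illustrative model only, assuming hypothetical names (aggregateBatch, BatchAggregation, BATCH_SIZE), not RHQ's actual API: each batch of schedule ids is aggregated on its own task, a failure aborts only that batch, and failed batches are collected so a later run could retry them.

```java
import java.util.List;
import java.util.concurrent.*;

// Hypothetical sketch of per-batch fault isolation; names are
// illustrative, not RHQ's actual classes or methods.
public class BatchAggregation {
    static final int BATCH_SIZE = 5; // configurable in RHQ

    // Simulated per-batch aggregation; throws for a "poison" batch.
    static void aggregateBatch(List<Integer> scheduleIds) {
        if (scheduleIds.contains(-1)) {
            throw new RuntimeException("storage error for batch " + scheduleIds);
        }
    }

    // Runs every batch concurrently; a failure aborts only that batch
    // and is recorded so a later job run can retry it.
    public static List<List<Integer>> run(List<List<Integer>> batches)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<List<Integer>> failed = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(batches.size());
        for (List<Integer> batch : batches) {
            pool.submit(() -> {
                try {
                    aggregateBatch(batch);
                } catch (RuntimeException e) {
                    failed.add(batch); // remember for retry, keep going
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return failed;
    }
}
```

With batches {1,2}, {-1,3}, {4,5}, only the middle batch fails and the other two still complete, which is the 4.10 behavior; the missing piece this BZ tracks is retrying the failed batch on a subsequent run.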

Work has already been done in master to address all failures. The changes are described at https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes. The work done for bug 1114199 addresses these failures. I decided to open a separate BZ though because there are different scenarios to test.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 John Sanda 2014-08-25 15:01:06 UTC
I am re-targeting this for RHQ 4.13 due to issues found in 4.12.

Comment 2 John Sanda 2014-09-09 17:36:03 UTC
I am moving this back to ASSIGNED because I realized that there are a couple of problems with the solution implemented in 4.12. It addresses failures in raw data aggregation, but only partially addresses failures in 1 hour and 6 hour data aggregation. When there is a failure, the corresponding schedule ids are not deleted from the metrics index table. On a subsequent run of the data aggregation job, we will query the raw data index for the prior time slice and attempt to aggregate the data again. If the 6 hour time slice has passed, we will also recompute the 6 hour data. And if the 24 hour time slice has passed, we will also recompute the 24 hour data.
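The retry mechanism described above hinges on deleting index entries only after a successful aggregation. A minimal in-memory model of that rule, assuming a hypothetical MetricsIndex class and markAggregated method (not RHQ's actual schema or API):

```java
import java.util.*;

// Illustrative model of index-driven retry: a metrics_index-style map
// from time slice to the schedule ids that still have data to
// aggregate. Names here are hypothetical, not RHQ's actual code.
public class MetricsIndex {
    private final Map<Long, Set<Integer>> index = new HashMap<>();

    public void add(long timeSlice, int scheduleId) {
        index.computeIfAbsent(timeSlice, t -> new HashSet<>()).add(scheduleId);
    }

    // Delete the index entry only when aggregation succeeded; on
    // failure the schedule id stays, so the next job run retries it.
    public void markAggregated(long timeSlice, int scheduleId, boolean succeeded) {
        if (succeeded) {
            Set<Integer> ids = index.get(timeSlice);
            if (ids != null) {
                ids.remove(scheduleId);
                if (ids.isEmpty()) {
                    index.remove(timeSlice);
                }
            }
        }
    }

    // Schedule ids a later run still needs to (re)aggregate.
    public Set<Integer> pending(long timeSlice) {
        return index.getOrDefault(timeSlice, Set.of());
    }
}
```

The bug in 4.12 is that this rule was applied to the raw data index but not consistently to the 1 hour and 6 hour indexes, so failed 1 hour and 6 hour aggregations left no pending entries to drive a retry.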

Now suppose it is 12:00 and the data aggregation job runs. We will compute 1 hour data for the 11:00 - 12:00 hour. We will also compute 6 hour metrics for the 06:00 - 12:00 time slice. If there is an error aggregating the 1 hour data, we will not attempt to recompute the 6 hour metrics (assuming there are no errors computing raw data during the 06:00 - 12:00 time slice), since we only look at the raw data index for previous time slices. We need to look at the 1 hour and 6 hour data indexes as well when looking at past time slices for any metrics that need to be recomputed.
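The time-slice arithmetic in the 12:00 example can be sketched as plain epoch-millisecond math. This is an illustration of the slice boundaries only, with hypothetical helper names, not RHQ's scheduling code:

```java
// Sketch of the time-slice boundaries from the example above: the
// start of the 1 hour and 6 hour slices that a job running at a given
// time rolls up. Pure arithmetic on epoch millis; not RHQ code.
public class TimeSlices {
    static final long HOUR = 3_600_000L;

    // Start of the most recently completed 1 hour slice: at 12:00 the
    // job aggregates raw data for the 11:00 - 12:00 hour.
    public static long oneHourSliceStart(long now) {
        return (now / HOUR) * HOUR - HOUR;
    }

    // Start of the 6 hour slice that just ended when `now` falls on a
    // 6 hour boundary: at 12:00 this is 06:00 (the 06:00 - 12:00 slice).
    public static long sixHourSliceStart(long now) {
        long boundary = (now / (6 * HOUR)) * (6 * HOUR);
        return boundary - 6 * HOUR;
    }
}
```

For a job running at 12:00, oneHourSliceStart yields 11:00 and sixHourSliceStart yields 06:00, matching the scenario above where a failed 11:00 - 12:00 rollup should force the 06:00 - 12:00 slice to be recomputed as well.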

Comment 3 John Sanda 2014-09-26 20:43:13 UTC
Changes have been pushed to master.

commit hash: 574393c12f2a