534550 – (RHQ-1336) Purges from RHQ_MEASUREMENT_DATA_NUM_* are not robust in the face of server outage


Keywords:
Status:	CLOSED NOTABUG
Alias:	RHQ-1336
Product:	RHQ Project
Classification:	Other
Component:	Monitoring
Sub Component:
Version:	unspecified
Hardware:	All
OS:	All
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:
Docs Contact:
URL:	http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:

Reported:	2009-01-09 02:23 UTC by Charles Crouch
Modified:	2015-02-01 23:24 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Description Charles Crouch 2009-01-09 02:23:00 UTC

The data purges from the RHQ_MEASUREMENT_DATA_NUM_* tables are of the form
   "delete where timestamp < (now - X)"
where X is 14, 30, 365 days, etc.
The purge job runs on one server in the cloud every hour, so when everything is working normally, data builds up for, say, 14 days, and then each hour the oldest hour of data is deleted.
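
For illustration, a minimal sketch of this single-statement, age-based purge, using Python's sqlite3 module as a stand-in database. The table name data_1h, the column time_stamp and the function purge_older_than are assumptions made for the example, not RHQ's actual schema or code.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


def purge_older_than(conn, table, now_ms, retention_ms):
    """Delete every row older than (now - retention) in ONE statement.

    `table` is interpolated into the SQL, so it must be a trusted name.
    """
    cutoff = now_ms - retention_ms
    cur = conn.execute(f"DELETE FROM {table} WHERE time_stamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of rows removed by this one delete


# Hypothetical data: one row per hour for 20 days, purged with X = 14 days.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_1h (time_stamp INTEGER, schedule_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO data_1h VALUES (?, 1, 0.0)",
    [(h * HOUR_MS,) for h in range(20 * 24)],
)
now_ms = 20 * DAY_MS
deleted = purge_older_than(conn, "data_1h", now_ms, 14 * DAY_MS)
print(deleted)
```

Because the purge was never run during those 20 days, the single delete has to remove everything older than the cutoff at once, rather than one hour's worth.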

The problem with this approach from a performance perspective is that it's possible for one delete statement to end up trying to delete a huge amount of data. For example, if a table already holds more data than its purge window (more than 14 days), then each day the purge doesn't run adds another day's worth of data to be deleted the very next time the purge runs. Ultimately, if no servers are running for 14 days, the first purge job to run afterwards will attempt to delete the entire 1H table. This can be such a large amount of data (113m rows in our perf env) that the delete statement doesn't complete in any reasonable timeframe.

Fortunately this situation shouldn't occur very frequently since it requires all the servers in the cloud to be off for a long time, in an environment that has previously generated a large volume of data.

The solution would be to purge data based on number of rows in each slice to be deleted rather than just the age of the data.
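
A rough sketch of what a row-count-bounded purge could look like, again using Python's sqlite3 as a stand-in. The names data_1h, time_stamp, purge_in_batches and batch_rows are assumptions for the example. The rowid subquery is SQLite-specific; other databases would need their own way of limiting a delete.

```python
import sqlite3


def purge_in_batches(conn, table, cutoff_ms, batch_rows):
    """Delete rows older than cutoff_ms, at most batch_rows per statement.

    Returns the list of per-statement row counts. `table` is interpolated
    into the SQL, so it must be a trusted name.
    """
    sizes = []
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE time_stamp < ? LIMIT ?)",
            (cutoff_ms, batch_rows),
        )
        conn.commit()  # each batch is its own short transaction
        if cur.rowcount == 0:
            break
        sizes.append(cur.rowcount)
    return sizes


# Hypothetical data: 1000 expired rows and 50 rows that must be kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_1h (time_stamp INTEGER, schedule_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO data_1h VALUES (?, 1, 0.0)", [(t,) for t in range(1050)]
)
batch_sizes = purge_in_batches(conn, "data_1h", cutoff_ms=1000, batch_rows=300)
print(batch_sizes)
```

With this shape, no single delete touches more than batch_rows rows, however long the purge was not running.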

Comment 1 Charles Crouch 2009-01-12 21:11:52 UTC

An alternative proposed by Joseph would be to have the purge jobs stick to the same "delete in one hour chunks" regardless of whether this is the first time the purge has run after a long outage or not. This should ensure the amount each purge "bites" off is proportional to the amount of data written in one hour, rather than proportional to the length of the server outage. This would help even after a relatively brief JON server outage, e.g. just 12 hours. Obviously, if "too much" data is written in any one-hour slot then this won't help. Your only options are then to calculate a timestamp which leaves you a reasonable number of rows to delete, e.g. maybe just 15 minutes' worth, or to try to avoid writing that much data into the DB in the first place.
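
One possible reading of the "one hour chunks" idea, sketched with Python's sqlite3 as a stand-in database. The names data_1h, time_stamp and purge_hourly_chunks are assumptions for the example, and this is not the code that was eventually used in RHQ.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000


def purge_hourly_chunks(conn, table, cutoff_ms, chunk_ms=HOUR_MS):
    """Delete rows older than cutoff_ms, one chunk_ms-wide slice at a time.

    Starts at the oldest row and walks forward, so each delete covers at
    most one hour of data however long the purge has not run. Returns the
    per-statement row counts. `table` must be a trusted name.
    """
    oldest = conn.execute(f"SELECT MIN(time_stamp) FROM {table}").fetchone()[0]
    sizes = []
    if oldest is None or oldest >= cutoff_ms:
        return sizes  # nothing is old enough to purge
    # Align the end of the first slice to a chunk boundary.
    chunk_end = oldest - (oldest % chunk_ms) + chunk_ms
    while True:
        upper = min(chunk_end, cutoff_ms)
        cur = conn.execute(
            f"DELETE FROM {table} WHERE time_stamp < ?", (upper,)
        )
        conn.commit()
        sizes.append(cur.rowcount)
        if upper >= cutoff_ms:
            break
        chunk_end += chunk_ms
    return sizes


# Hypothetical data: 10 rows per hour for 48 hours, purge the first 36 hours.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_1h (time_stamp INTEGER, schedule_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO data_1h VALUES (?, 1, 0.0)",
    [(h * HOUR_MS + i * 1000,) for h in range(48) for i in range(10)],
)
chunk_sizes = purge_hourly_chunks(conn, "data_1h", cutoff_ms=36 * HOUR_MS)
print(len(chunk_sizes), max(chunk_sizes))
```

Passing a smaller chunk_ms (for example 15 minutes) would correspond to the fallback mentioned above for hours that hold too much data.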

Comment 2 Joseph Marques 2009-09-04 22:06:49 UTC

dup of RHQ-2372, which is already resolved.

Comment 3 Red Hat Bugzilla 2009-11-10 20:30:50 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1336
This bug relates to RHQ-1354
This bug relates to RHQ-1355
This bug relates to RHQ-1703

Comment 4 wes hayutin 2010-02-16 21:10:12 UTC

Mass move to component = Monitoring
