Bug 534550 (RHQ-1336)

Summary: Purges from RHQ_MEASUREMENT_DATA_NUM_* are not robust in the face of server outage
Product: [Other] RHQ Project
Reporter: Charles Crouch <ccrouch>
Component: Monitoring
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED NOTABUG
Severity: medium
Priority: high
Version: unspecified
CC: hbrock
Target Milestone: ---
Keywords: Improvement
Target Release: ---
Hardware: All
OS: All
URL: http://jira.rhq-project.org/browse/RHQ-1336
Doc Type: Enhancement

Description Charles Crouch 2009-01-09 02:23:00 UTC
The data purges from the RHQ_MEASUREMENT_DATA_NUM_* tables are of the form 
   "delete where timestamp < (now - X)" 
where X is 14, 30, 365 days, etc.
The purge job runs on one server in the cloud every hour, so when everything is running normally, data builds up for, say, 14 days, and then each hour the oldest hour of data is deleted.

The problem with this approach from a performance perspective is that it's possible for one delete statement to end up trying to delete a huge amount of data. For example, if a table has already passed its purge date (i.e. it holds more than 14 days of data), then each day the purge doesn't run adds another day's worth of data to what will be deleted the very next time the purge runs. Ultimately, if no servers are running for 14 days, then the first purge job to run afterwards will attempt to delete the entire 1H table. This can be such a large amount of data (113m rows in our perf env) that the delete statement doesn't actually complete in any reasonable timeframe.
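
To make the growth concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name and parameters are hypothetical, and it assumes a uniform write rate and a table that held exactly one retention period of data when the purges stopped.

```python
def rows_in_next_purge(hours_since_last_purge, retention_hours, rows_per_hour):
    """Estimate how many rows the next age-based purge will try to delete.

    Assumptions (not measurements): data was written at a constant
    rows_per_hour, and the table held exactly retention_hours of data
    when the last purge ran.
    """
    # The cutoff (now - X) moves forward by the time since the last purge,
    # so that many hours of data expire at once, capped at the whole table.
    expired_hours = min(hours_since_last_purge, retention_hours)
    return expired_hours * rows_per_hour


# Steady state: one hour of data per purge.
print(rows_in_next_purge(1, 14 * 24, 1000))
# After a 14-day outage: the whole 14-day table in one statement.
print(rows_in_next_purge(14 * 24, 14 * 24, 1000))
```

The point of the model is that the size of the delete is proportional to the length of the gap between purges, not to the hourly write rate alone.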

Fortunately this situation shouldn't occur very frequently since it requires all the servers in the cloud to be off for a long time, in an environment that has previously generated a large volume of data.

The solution would be to purge data based on the number of rows in each slice to be deleted rather than just the age of the data.
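
A minimal sketch of a row-count-based purge, written in Python against SQLite purely for illustration. The table name `measurement_data`, the column `time_stamp`, and the function name are assumptions, not the RHQ schema or code. Each statement deletes at most `batch_size` expired rows, so no single delete grows with the length of an outage.

```python
import sqlite3


def purge_in_batches(conn, cutoff, batch_size):
    """Delete rows older than cutoff, at most batch_size rows per statement.

    Table and column names are hypothetical. Returns the total rows deleted.
    """
    total = 0
    while True:
        # Pick a bounded set of expired rows, then delete only those.
        cur = conn.execute(
            "DELETE FROM measurement_data WHERE rowid IN ("
            " SELECT rowid FROM measurement_data"
            " WHERE time_stamp < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Committing after every batch keeps each transaction bounded, at the cost of the purge no longer being a single atomic operation.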

Comment 1 Charles Crouch 2009-01-12 21:11:52 UTC
An alternative proposed by Joseph would be to have the purge jobs stick to the same "delete in one hour chunks" approach regardless of whether or not this is the first time the purge has run after a long outage. This should ensure the amount each purge "bites" off is proportional to the amount of data written in one hour, rather than to the length of the server outage. This would help even in times of relatively brief JON server outage, e.g. just 12 hours. Obviously, if "too much" data is written in any one hour slot then this won't help. Your only option then is to calculate a timestamp which leaves you a reasonable number of rows to delete, e.g. maybe just 15 minutes' worth, or to try to avoid writing that much data into the DB in the first place.
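
The chunked approach could look roughly like the following Python/SQLite sketch. The table name `measurement_data`, the column `time_stamp` (epoch milliseconds), and the function name are assumptions for illustration only and are not taken from the RHQ code. The `chunk_ms` parameter can be lowered (e.g. to 15 minutes) when an hour holds too many rows.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000


def purge_in_time_chunks(conn, cutoff_ms, chunk_ms=HOUR_MS):
    """Delete rows older than cutoff_ms, one time slice per statement.

    Table and column names are hypothetical. Returns a list with the
    number of rows deleted by each statement.
    """
    oldest = conn.execute(
        "SELECT MIN(time_stamp) FROM measurement_data"
    ).fetchone()[0]
    deleted_per_chunk = []
    if oldest is None:  # empty table, nothing to purge
        return deleted_per_chunk
    start = oldest
    while start < cutoff_ms:
        end = min(start + chunk_ms, cutoff_ms)
        cur = conn.execute(
            "DELETE FROM measurement_data"
            " WHERE time_stamp >= ? AND time_stamp < ?",
            (start, end),
        )
        conn.commit()  # one small transaction per slice
        deleted_per_chunk.append(cur.rowcount)
        start = end
    return deleted_per_chunk
```

After a long outage this issues many small deletes instead of one huge one, so each statement stays proportional to the data written in one slice.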

Comment 2 Joseph Marques 2009-09-04 22:06:49 UTC
dup of RHQ-2372, which is already resolved.

Comment 3 Red Hat Bugzilla 2009-11-10 20:30:50 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1336
This bug relates to RHQ-1354
This bug relates to RHQ-1355
This bug relates to RHQ-1703


Comment 4 wes hayutin 2010-02-16 21:10:12 UTC
Mass move to component = Monitoring