The data purges from the RHQ_MEASUREMENT_DATA_NUM_* tables are of the form "delete where timestamp < (now - X)", where X is 14, 30, 365 days, etc. The purge job runs on one server in the cloud every hour, so when everything is running normally, data builds up for, say, 14 days, and then each hour the oldest hour of data is deleted. The problem with this approach from a performance perspective is that a single delete statement can end up trying to delete a huge amount of data. If a table has already passed its purge date (e.g. it holds more than 14 days of data), then each day the purge doesn't run adds another day's worth of data to be deleted the very next time the purge runs. Ultimately, if no servers are running for 14 days, the first purge job to run afterwards will attempt to delete the entire 1H table. This can be such a large amount of data (113 million rows in our perf env) that the delete statement doesn't complete in any reasonable timeframe. Fortunately this situation shouldn't occur very frequently, since it requires all the servers in the cloud to be off for a long time in an environment that has previously generated a large volume of data. The solution would be to purge data based on the number of rows in each slice to be deleted rather than just the age of the data.
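To make the row-count idea concrete, here is a minimal sketch in Python against an in-memory SQLite database. It is not the RHQ implementation: the table name `measurement_data`, the column name `time_stamp`, and the functions `bounded_cutoff` and `purge_in_batches` are all hypothetical stand-ins. The idea is to look up the timestamp of the row just past the batch limit and use it as the delete cutoff, so no single delete touches more than `max_rows` rows, except when many rows share one timestamp.

```python
import sqlite3


def bounded_cutoff(conn, purge_before, max_rows):
    """Return a cutoff <= purge_before such that at most max_rows rows are older.

    Table and column names are hypothetical, not the real RHQ schema.
    """
    row = conn.execute(
        "SELECT time_stamp FROM measurement_data "
        "WHERE time_stamp < ? ORDER BY time_stamp LIMIT 1 OFFSET ?",
        (purge_before, max_rows),
    ).fetchone()
    # Fewer than max_rows + 1 eligible rows: the normal purge cutoff is safe.
    return purge_before if row is None else row[0]


def purge_in_batches(conn, purge_before, max_rows):
    """Delete everything older than purge_before in row-bounded batches.

    Returns the number of rows removed by each delete statement.
    """
    batches = []
    while True:
        cutoff = bounded_cutoff(conn, purge_before, max_rows)
        cur = conn.execute(
            "DELETE FROM measurement_data WHERE time_stamp < ?", (cutoff,)
        )
        if cur.rowcount == 0 and cutoff < purge_before:
            # More than max_rows rows share the oldest timestamp, so the
            # bounded delete made no progress; remove that one timestamp
            # in full (this batch may exceed max_rows).
            cur = conn.execute(
                "DELETE FROM measurement_data WHERE time_stamp = ?", (cutoff,)
            )
        conn.commit()  # commit per batch so each transaction stays small
        batches.append(cur.rowcount)
        if cutoff >= purge_before:
            return batches
```

Committing after each batch keeps every transaction small, which is the point of the change: the cost of one delete is tied to `max_rows` rather than to how long the servers were down.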
An alternative proposed by Joseph would be to have the purge jobs stick to the same "delete in one hour chunks" approach regardless of whether this is the first time the purge has run after a long outage or not. This should ensure the amount each purge "bites" off is proportional to the amount of data written in one hour rather than to the length of the server outage. This would help even after a relatively brief JON server outage, e.g. just 12 hours. Obviously, if "too much" data is written in any one-hour slot then this won't help. Your only option then is to calculate a timestamp that leaves a reasonable number of rows to delete, e.g. maybe just 15 minutes' worth, or to try to avoid writing that much data into the DB in the first place.
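As a rough illustration of the "one hour chunks" proposal, the sketch below computes the sequence of cutoffs a purge job could step through after an outage. The function name `hourly_purge_cutoffs` and its parameters are hypothetical, and timestamps are assumed to be epoch milliseconds; each yielded value would be used in its own "delete where timestamp < cutoff" statement.

```python
HOUR_MS = 60 * 60 * 1000  # one hour in milliseconds (assumed timestamp unit)


def hourly_purge_cutoffs(oldest_ts_ms, purge_before_ms, chunk_ms=HOUR_MS):
    """Yield successive delete cutoffs, each covering at most one chunk of time.

    oldest_ts_ms    -- timestamp of the oldest row still in the table
    purge_before_ms -- the normal purge threshold, i.e. now - X
    chunk_ms        -- slice width; pass a smaller value (e.g. 15 minutes)
                       if a single hour holds too much data
    """
    if oldest_ts_ms >= purge_before_ms:
        return  # nothing is old enough to purge

    # Start at the end of the chunk that contains the oldest row.
    cutoff = oldest_ts_ms - (oldest_ts_ms % chunk_ms) + chunk_ms
    while cutoff < purge_before_ms:
        yield cutoff
        cutoff += chunk_ms
    # The final, possibly partial, chunk stops exactly at the purge threshold.
    yield purge_before_ms
```

With this shape, a 14 day outage turns into a few hundred small deletes rather than one very large one, and the same function covers the 15 minute variant by passing a smaller `chunk_ms`.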
dup of RHQ-2372, which is already resolved.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1336
This bug relates to RHQ-1354
This bug relates to RHQ-1355
This bug relates to RHQ-1703
Mass move to component = Monitoring