Bug 1115146

Summary: Purge (delete) data using smaller batches
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Core Server
Assignee: Thomas Segismont <tsegismo>
Status: ON_QA
Severity: unspecified
Priority: unspecified
Version: 4.9
CC: genman, hrupp, tsegismo
Target Milestone: ---
Target Release: RHQ 4.13
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1139186    

Description Elias Ross 2014-07-01 17:29:32 UTC
Description of problem:

When purging a large number of rows from large tables, e.g.

rhq_availability

trait and event data, etc.

RHQ typically deletes many rows in a single transaction, e.g.

  delete from X where some_time < Y

When the tables are being written to, this can cause a large amount of contention. First, the number of rows involved is large. Second, selects and inserts can block; for example, inserting a trait value requires checking whether an existing trait row is already there.

Some of this can be avoided by moving the data to Cassandra (e.g. Bug 1093948).

Anyway, the solution is typically to do something like:

  select from X where some_time < Y

then delete the data in batches of 100 or so, each in a separate transaction.
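
For illustration only, a minimal JDBC sketch of that idea (the table and column names, and the connection/cutoffTime variables, are placeholders rather than the actual RHQ schema or code):

    // Illustrative sketch only: collect the candidate keys first, then delete
    // them in small batches, each batch in its own transaction.
    List<Long> ids = new ArrayList<>();
    try (PreparedStatement select = connection.prepareStatement(
            "SELECT id FROM some_table WHERE some_time < ?")) {
        select.setLong(1, cutoffTime);
        try (ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                ids.add(rs.getLong(1));
            }
        }
    }

    connection.setAutoCommit(false);
    try (PreparedStatement delete = connection.prepareStatement(
            "DELETE FROM some_table WHERE id = ?")) {
        int inBatch = 0;
        for (Long id : ids) {
            delete.setLong(1, id);
            delete.addBatch();
            if (++inBatch == 100) {        // batches of 100 or so
                delete.executeBatch();
                connection.commit();       // separate transaction per batch
                inBatch = 0;
            }
        }
        if (inBatch > 0) {
            delete.executeBatch();
            connection.commit();
        }
    }

Keeping each batch in its own short transaction limits how long any single lock is held, at the cost of more commits.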

Comment 1 Elias Ross 2014-07-01 17:31:05 UTC
Note the tables are (from most to least active):

rhq_availability
rhq_measurement_data_trait
rhq_event
rhq_calltime_data_value (and _key)
rhq_alert

There may be more.

Comment 2 Elias Ross 2014-07-08 19:25:55 UTC
Also the RHQ_MEASUREMENT_OOB table.

Comment 3 Thomas Segismont 2014-07-11 22:33:49 UTC
Started pull request:
https://github.com/rhq-project/rhq/pull/81

Comment 4 Thomas Segismont 2014-07-23 16:07:53 UTC
Fixed in master

commit df5ddda18ac57f59690aa83026f534715a4753aa
Author: Thomas Segismont <tsegismo>
Date:   Sat Jul 12 00:28:13 2014 +0200

    Bug 1115146 - Purge (delete) data using smaller batches
    
    Introduced JDBCPurgeTemplate
    
    Update purge process for:
    * availabilities
    * traits
    * events
    * calltime data values
    * alerts (and notif log and condition log)
    * OOB
    
    Batch size:
    * is shared across implementation of the template
    * is configurable through a sys prop
    * value is unbounded
    * defaults to 3000
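
For illustration only, a hypothetical outline of what such a batch-purge template could look like; the class name, system property and SQL below are placeholders, not the actual RHQ JDBCPurgeTemplate:

    // Illustrative only -- not the actual RHQ JDBCPurgeTemplate.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public abstract class BatchPurgeTemplate {

        // Shared batch size; the system property name is a placeholder,
        // not the real RHQ property. Unbounded, defaults to 3000.
        private static final int BATCH_SIZE =
            Integer.getInteger("example.purge.batch.size", 3000);

        // e.g. "DELETE FROM rhq_event WHERE id = ?"
        protected abstract String getDeleteStatement();

        // Finds the primary keys of the rows that are old enough to purge.
        protected abstract List<Long> findRowsToPurge(Connection connection, long cutoff)
                throws SQLException;

        public int purge(Connection connection, long cutoff) throws SQLException {
            List<Long> ids = findRowsToPurge(connection, cutoff);
            int purged = 0;
            connection.setAutoCommit(false);
            try (PreparedStatement delete = connection.prepareStatement(getDeleteStatement())) {
                for (int from = 0; from < ids.size(); from += BATCH_SIZE) {
                    int to = Math.min(from + BATCH_SIZE, ids.size());
                    for (Long id : ids.subList(from, to)) {
                        delete.setLong(1, id);
                        delete.addBatch();
                    }
                    delete.executeBatch();
                    connection.commit();   // one transaction per batch
                    purged += to - from;
                }
            }
            return purged;
        }
    }

A table-specific subclass would then supply the SELECT and DELETE statements for, e.g., rhq_availability or rhq_event.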

Comment 5 Thomas Segismont 2014-07-24 13:06:03 UTC
Additional commit in master

commit c66a4510f639981c330ab009609b7ce11294f310
Author: Thomas Segismont <tsegismo>
Date:   Thu Jul 24 15:03:48 2014 +0200

Fixed a mistake on Log type

Comment 6 Elias Ross 2014-08-01 19:48:25 UTC
Because of worries about deadlocks (see below), I decreased the batch size to 500 (from 3000).

With 500, I experimented with deleting a few months of data (by decreasing the retention time for valid entries). I have 74,000,000 rows in my trait database and, with the fix, I don't seem to be making much of a dent in it.

14:54:17,734 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Traits data purged [98400] - completed in [22855573]ms

This is six hours to delete 98,000 rows, or 15,000 rows per hour. Deletion was interrupted with:

14:54:11,933 INFO  [org.rhq.enterprise.server.measurement.MeasurementServerServiceImpl] (http-/0.0.0.0:7080-710) Performance: measurement merge [65] timing (60147)ms
14:54:12,359 WARN  [org.rhq.enterprise.server.measurement.MeasurementDataManagerBean] (http-/0.0.0.0:7080-58) Failure saving measurement trait data:
java.sql.BatchUpdateException:ORA-02049: timeout: distributed transaction waiting for lock
[SQLException=ORA-02049: timeout: distributed transaction waiting for lock
 -> ORA-02049: timeout: distributed transaction waiting for lock
(error-code=2049,sql-state=42000)]

I added an index on the time_stamp column to try to help.

I'm also increasing the batch size to see if it helps.

I worry that performance is worse than before and I will have to come up with my own purge process.

Comment 7 Thomas Segismont 2014-08-05 08:53:29 UTC
I'm OOO right now and will be back on the 19th. I thought the bottleneck would be the number of transactions to execute in a row. But it sounds like the problem comes from reading the table to purge: a full scan, searching for the keys of the rows to delete.

I'm not sure adding an index is a good idea; it means more work for the database every time we try to insert something (in particular if we do this for all the tables to purge).

When I'm back, I'll try to enhance the process this way (roughly sketched below):
1. full scan the table to purge (only once), and put the keys of the rows to delete in a *_purge table
2. delete the rows by primary key in batches
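
Purely as a sketch of that plan (rhq_event_purge is a hypothetical staging table, and connection/cutoff are assumed):

    // Rough sketch only; rhq_event_purge is a hypothetical staging table and
    // the column names are illustrative.
    try (PreparedStatement stage = connection.prepareStatement(
            "INSERT INTO rhq_event_purge (id) " +
            "SELECT id FROM rhq_event WHERE timestamp < ?")) {
        stage.setLong(1, cutoff);
        stage.executeUpdate();   // step 1: a single full scan of the table
    }
    // step 2: read the keys back from rhq_event_purge and delete from
    // rhq_event by primary key in small batches, one transaction per batch.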

In the meantime, would you mind posting here the row counts of all the tables involved in the purge process? Do you know how the ages of the rows are distributed? Ideally, how much data would you like to keep? A month? A year?

Comment 8 Elias Ross 2014-08-05 16:27:28 UTC
To answer your questions:
Age of rows: For traits, I'm guessing the data is evenly distributed. I suspect there are a large number of traits that change very frequently.
Data retention: There are no business requirements, but a year seems reasonable, provided there is at least adequate space.

In regard to the approach:
I agree with your suggested improvements. I agree that adding an index doesn't really make much sense; I just added it to get the process to complete. Increasing to 30,000 rows per delete did work much better, which isn't what I expected.

I think what needs to happen isn't putting the data into a separate table, but simply querying for all the rows, then deleting in batches. It seems unlikely that even 70,000,000 integers in memory would be much of a memory issue, although maybe you can simply cap things at 1,000,000 or even 100,000.

Since what I was doing was probably an unusual case (one normally doesn't reduce the data retention like that), I'm not sure you need to fully solve that problem.

Comment 9 Thomas Segismont 2014-09-08 10:38:48 UTC
Pushed to master

commit 29c77743dd94b768bb60b0d1e546fe52179b303d
Author: Thomas Segismont <tsegismo>
Date:   Mon Sep 8 12:37:55 2014 +0200
    
    1. full scan (only once) the table or index, and put the keys of the rows to delete in a temporary file
    2. delete the rows by primary key in batches
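
As an illustration of this scheme only, not the actual RHQ code (the table and column names and the connection/cutoff variables are assumed):

    // Illustrative sketch of the key-spooling idea; not the actual RHQ code.
    Path keyFile = Files.createTempFile("purge-keys", ".txt");
    try (PreparedStatement select = connection.prepareStatement(
             "SELECT id FROM rhq_event WHERE timestamp < ?");
         BufferedWriter out = Files.newBufferedWriter(keyFile)) {
        // 1. scan the table (or index) only once, spooling candidate keys to disk
        select.setLong(1, cutoff);
        try (ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                out.write(Long.toString(rs.getLong(1)));
                out.newLine();
            }
        }
    }
    // 2. read the keys back and delete by primary key in batches, one
    //    transaction per batch (as in the earlier sketches), then delete
    //    the temporary file.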