Bug 1139186 - Purge (delete) data using smaller batches
Summary: Purge (delete) data using smaller batches
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server
Version: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ER03
: JON 3.3.0
Assignee: Thomas Segismont
QA Contact: Armine Hovsepyan
URL:
Whiteboard:
Depends On: 1115146
Blocks: 1079303
 
Reported: 2014-09-08 10:40 UTC by Thomas Segismont
Modified: 2015-09-03 00:03 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1115146
Environment:
Last Closed: 2014-12-11 14:01:01 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
bundle_res_dep_hist (20.36 KB, image/png)
2014-09-24 08:30 UTC, Armine Hovsepyan
no flags
orphaned_bundle_del_batches (212.52 KB, image/png)
2014-09-24 08:31 UTC, Armine Hovsepyan
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1079303 0 unspecified CLOSED Excessive database size when using bundles causing UI to fail 2021-02-22 00:41:40 UTC

Internal Links: 1079303

Description Thomas Segismont 2014-09-08 10:40:14 UTC
+++ This bug was initially created as a clone of Bug #1115146 +++

Description of problem:


When purging a large number of rows from large tables, e.g.

rhq_availability

trait and event data, etc.

RHQ typically deletes multiple rows in a single transaction, e.g.

  delete from X where some_time < Y

When tables are being written to, this can cause a large amount of contention. First, the number of rows deleted is large. Second, selects and inserts can block: for example, inserting a trait value requires knowing whether an existing trait is already there.

Some of this can be avoided by moving the data to Cassandra. E.g. Bug 1093948

Anyway, the solution is typically to do something like:

  select from X where some_time < Y

then in batches of 100 or so, delete the data in separate transactions.
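The pattern described above can be sketched like this (a minimal illustration in Python with sqlite3 standing in for the RHQ database; the table/column names and the use of sqlite's implicit rowid as a primary key are assumptions for the example, not the actual RHQ schema or code):

```python
import sqlite3

def purge_in_batches(conn, table, time_column, cutoff, batch_size=100):
    """Delete rows older than cutoff in small per-transaction batches."""
    # Collect the primary keys of the doomed rows up front (the "select"
    # step), so each delete touches only a bounded number of rows.
    keys = [row[0] for row in conn.execute(
        f"SELECT rowid FROM {table} WHERE {time_column} < ?", (cutoff,))]
    deleted = 0
    for i in range(0, len(keys), batch_size):
        batch = keys[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        conn.execute(f"DELETE FROM {table} WHERE rowid IN ({placeholders})", batch)
        conn.commit()  # one short transaction per batch
        deleted += len(batch)
    return deleted
```

Each commit releases locks, so concurrent selects and inserts only ever wait on one small batch rather than on a single huge delete.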

--- Additional comment from Elias Ross on 2014-07-01 19:31:05 CEST ---

Note the tables are: (from active to less so)

rhq_availability
rhq_measurement_data_trait
rhq_event
rhq_calltime_data_value (and _key)
rhq_alert

There may be more.

--- Additional comment from Elias Ross on 2014-07-08 21:25:55 CEST ---

also RHQ_MEASUREMENT_OOB table.

--- Additional comment from Thomas Segismont on 2014-07-12 00:33:49 CEST ---

Started pull request:
https://github.com/rhq-project/rhq/pull/81

--- Additional comment from Thomas Segismont on 2014-07-23 18:07:53 CEST ---

Fixed in master

commit df5ddda18ac57f59690aa83026f534715a4753aa
Author: Thomas Segismont <tsegismo>
Date:   Sat Jul 12 00:28:13 2014 +0200

    Bug 1115146 - Purge (delete) data using smaller batches
    
    Introduced JDBCPurgeTemplate
    
    Update purge process for:
    * availabilities
    * traits
    * events
    * calltime data values
    * alerts (and notif log and condition log)
    * OOB
    
    Batch size:
    * is shared across implementations of the template
    * is configurable through a sys prop
    * value is unbounded
    * defaults to 3000
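The batch-size behavior in the commit message can be pictured as follows (a hypothetical sketch, not the actual JDBCPurgeTemplate code; RHQ reads a Java system property, modeled here with an environment variable, and the setting name RHQ_PURGE_BATCH_SIZE is invented for the example):

```python
import os

DEFAULT_BATCH_SIZE = 3000

def purge_batch_size():
    """Batch size shared by all purge template implementations.

    Overridable through a single setting (stand-in for the sys prop
    mentioned in the commit); the value is unbounded and defaults
    to 3000.
    """
    raw = os.environ.get("RHQ_PURGE_BATCH_SIZE")  # hypothetical name
    return int(raw) if raw else DEFAULT_BATCH_SIZE
```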

--- Additional comment from Thomas Segismont on 2014-07-24 15:06:03 CEST ---

Additional commit in master

commit c66a4510f639981c330ab009609b7ce11294f310
Author: Thomas Segismont <tsegismo>
Date:   Thu Jul 24 15:03:48 2014 +0200

Fixed a mistake on Log type

--- Additional comment from Elias Ross on 2014-08-01 21:48:25 CEST ---


With worries of deadlocks, see below, I decreased the batch size to 500 (from 3000).

With 500, I experimented with trying to delete a few months of data (by decreasing the time for valid entries). I have 74,000,000 rows in my trait database and with the fix, I don't seem to be making much of a dent in it.

14:54:17,734 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Traits data purged [98400] - completed in [22855573]ms

This is over six hours to delete 98,000 rows, or about 15,000 rows per hour. Deletion was interrupted with:

14:54:11,933 INFO  [org.rhq.enterprise.server.measurement.MeasurementServerServiceImpl] (http-/0.0.0.0:7080-710) Performance: measurement merge [65] timing (60147)ms
14:54:12,359 WARN  [org.rhq.enterprise.server.measurement.MeasurementDataManagerBean] (http-/0.0.0.0:7080-58) Failure saving measurement trait data:
java.sql.BatchUpdateException:ORA-02049: timeout: distributed transaction waiting for lock
[SQLException=ORA-02049: timeout: distributed transaction waiting for lock
 -> ORA-02049: timeout: distributed transaction waiting for lock
(error-code=2049,sql-state=42000)]

I added an index on the time_stamp column to try to help.

I'm also increasing the batch size to see if it helps.

I worry that performance is worse than before and I will have to come up with my own purge process.

--- Additional comment from Thomas Segismont on 2014-08-05 10:53:29 CEST ---

I'm OOO right now and will be back on the 19th. I thought the bottleneck would be the number of transactions executed in a row. But it sounds like the problem comes from reading the table to purge: a full scan, searching for the keys of the rows to delete.

I'm not sure adding an index is a good idea; it means more work for the database on every insert (in particular if we do this for all the tables involved in the purge).

When I'm back, I'll try to enhance the process this way:
1. full scan (only once) the table to purge, and put the keys of the rows to delete in a *_purge table
2. delete the rows by primary key in batches
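The two steps above could look roughly like this (an illustrative sketch only, again using Python/sqlite3 and rowid as a stand-in primary key; the *_purge staging table name and the function are hypothetical, not RHQ code):

```python
import sqlite3

def purge_via_staging(conn, table, time_column, cutoff, batch_size=3000):
    """Two-phase purge: stage keys once, then delete by primary key in batches."""
    staging = f"{table}_purge"  # hypothetical *_purge staging table
    conn.execute(f"CREATE TABLE IF NOT EXISTS {staging} (id INTEGER PRIMARY KEY)")
    # Phase 1: a single full scan records the keys of the rows to delete.
    conn.execute(
        f"INSERT INTO {staging} SELECT rowid FROM {table} WHERE {time_column} < ?",
        (cutoff,))
    conn.commit()
    # Phase 2: delete by primary key, one bounded batch per transaction.
    while True:
        batch = [r[0] for r in conn.execute(
            f"SELECT id FROM {staging} LIMIT ?", (batch_size,))]
        if not batch:
            break
        ph = ",".join("?" * len(batch))
        conn.execute(f"DELETE FROM {table} WHERE rowid IN ({ph})", batch)
        conn.execute(f"DELETE FROM {staging} WHERE id IN ({ph})", batch)
        conn.commit()
    conn.execute(f"DROP TABLE {staging}")
    conn.commit()
```

Deleting by primary key avoids re-scanning the large table on every batch, which is the suspected bottleneck above.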

In the meantime, would you mind posting here the row counts of all your tables involved in the purge process? Do you know how the age of the rows is distributed? Ideally, how much data would you like to keep? A month? A year?

--- Additional comment from Elias Ross on 2014-08-05 18:27:28 CEST ---


To answer your questions:
Age of rows: For traits, I'm guessing the data is evenly distributed. I'm guessing there are a large number of traits that change very frequently.
Data retention: There are no business requirements, but a year seems reasonable given there is adequate space at least.

In regard to the approach:
I agree with your suggested improvements. I agree that adding an index doesn't really make much sense; I just added it to get the process to complete. It did work much better to increase to 30,000 rows per delete, which isn't what I expected.

I think what needs to happen isn't putting the data into a separate table, but simply querying for all rows, then deleting in batches. It seems unlikely that even 70,000,000 integers in memory would be much of a memory issue, although maybe you can simply cap things at 1,000,000 or even 100,000.

As what I was doing was probably an unusual case (one normally doesn't reduce the data retention like that), I'm not sure you need to fully solve that problem.

--- Additional comment from Thomas Segismont on 2014-09-08 12:38:48 CEST ---

Pushed to master

commit 29c77743dd94b768bb60b0d1e546fe52179b303d
Author: Thomas Segismont <tsegismo>
Date:   Mon Sep 8 12:37:55 2014 +0200
    
    1. full scan (only once) the table or index, and put the keys of the rows to delete in a temporary file
    2. delete the rows by primary key in batches
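The temporary-file variant committed here can be sketched as follows (an illustration under the same assumptions as before: Python/sqlite3 stand-in, rowid as primary key, hypothetical function names; not the actual commit's Java code):

```python
import sqlite3
import tempfile

def purge_via_temp_file(conn, table, time_column, cutoff, batch_size=3000):
    """Scan once, spool the doomed keys to a temporary file, delete by key."""
    deleted = 0
    with tempfile.TemporaryFile(mode="w+") as spool:
        # Phase 1: a single scan of the table (or index); keys go to disk
        # instead of being held in memory.
        for (key,) in conn.execute(
                f"SELECT rowid FROM {table} WHERE {time_column} < ?", (cutoff,)):
            spool.write(f"{key}\n")
        spool.seek(0)
        # Phase 2: read the file back and delete in batches by primary key.
        batch = []
        for line in spool:
            batch.append(int(line))
            if len(batch) == batch_size:
                deleted += _delete_batch(conn, table, batch)
                batch = []
        if batch:
            deleted += _delete_batch(conn, table, batch)
    return deleted

def _delete_batch(conn, table, keys):
    ph = ",".join("?" * len(keys))
    cur = conn.execute(f"DELETE FROM {table} WHERE rowid IN ({ph})", keys)
    conn.commit()  # one short transaction per batch
    return cur.rowcount
```

Spooling to a file rather than a staging table sidesteps the memory-cap concern raised in the earlier comment while keeping the single-scan property.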

Comment 2 Jay Shaughnessy 2014-09-10 00:41:08 UTC
Release/jon3.3.x commits:

commit 79456213b90e2b62e48f0a4aafb1f142012f3474
Author: Thomas Segismont <tsegismo>
Date:   Sat Jul 12 00:28:13 2014 +0200


    (cherry picked from commit df5ddda18ac57f59690aa83026f534715a4753aa)
    Signed-off-by: Jay Shaughnessy <jshaughn>

    Conflicts:
modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/scheduler/jobs/DataPurgeJob.java


commit e58c2b13354fa328c3cea2dbea9fa21f366cd8d4
Author: Thomas Segismont <tsegismo>
Date:   Thu Jul 24 15:03:48 2014 +0200

    (cherry picked from commit c66a4510f639981c330ab009609b7ce11294f310)
    Signed-off-by: Jay Shaughnessy <jshaughn>


commit 0ffe33a32bc1ebfd99487d43fc114c9239b14c73
Author: Thomas Segismont <tsegismo>
Date:   Mon Sep 8 12:37:55 2014 +0200

    (cherry picked from commit 29c77743dd94b768bb60b0d1e546fe52179b303d)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Comment 3 Simeon Pinder 2014-09-17 02:49:21 UTC
Moving to ON_QA as available for test with the following brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=385149

Comment 5 Armine Hovsepyan 2014-09-22 16:30:34 UTC
Created attachment 940078 [details]
purge_btn

Comment 7 Armine Hovsepyan 2014-09-24 08:30:44 UTC
Created attachment 940700 [details]
bundle_res_dep_hist

Comment 8 Armine Hovsepyan 2014-09-24 08:31:12 UTC
Created attachment 940701 [details]
orphaned_bundle_del_batches

Comment 9 Armine Hovsepyan 2014-09-24 08:32:21 UTC
Verified: orphaned bundle resource deployment history is deleted in 30k batches during the data purge job (JON 3.3 ER03). Screenshots attached.

