+++ This bug was initially created as a clone of Bug #1115146 +++

Description of problem:

When purging a large number of rows from large tables (e.g. rhq_availability, trait and event data, etc.), RHQ typically deletes multiple rows in a single transaction, e.g.:

delete from X where some_time < Y

When the tables are being written to, this can cause a large amount of contention. First, the number of rows is large. Second, selects and inserts can block. For example, when inserting a trait value, RHQ needs to know whether an existing trait is already there or not. Some of this can be avoided by moving the data to Cassandra, e.g. Bug 1093948.

In any case, the solution is typically to do something like:

select from X where some_time < Y

then delete the data in batches of 100 or so, in separate transactions.

--- Additional comment from Elias Ross on 2014-07-01 19:31:05 CEST ---

Note the tables are (from most active to least):

rhq_availability
rhq_measurement_data_trait
rhq_event
rhq_calltime_data_value (and _key)
rhq_alert

There may be more.

--- Additional comment from Elias Ross on 2014-07-08 21:25:55 CEST ---

Also the RHQ_MEASUREMENT_OOB table.
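The batched approach described above (select the candidate keys once, then delete them in small per-transaction batches) can be sketched as follows. This is a minimal illustration of the partitioning step only; the class and method names are hypothetical, not RHQ's actual code, and a real implementation would issue a DELETE per batch in its own transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: instead of one big "DELETE FROM X WHERE some_time < Y", collect
// the candidate row keys first, then delete them in fixed-size batches,
// each batch in a separate transaction to keep locks short-lived.
// Names are illustrative, not RHQ's actual implementation.
public class BatchedPurge {

    // Split the full key list into batches of at most batchSize keys.
    public static List<List<Long>> toBatches(List<Long> keys, int batchSize) {
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            batches.add(keys.subList(i, Math.min(i + batchSize, keys.size())));
        }
        return batches;
    }
}
```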
--- Additional comment from Thomas Segismont on 2014-07-12 00:33:49 CEST ---

Started pull request: https://github.com/rhq-project/rhq/pull/81

--- Additional comment from Thomas Segismont on 2014-07-23 18:07:53 CEST ---

Fixed in master

commit df5ddda18ac57f59690aa83026f534715a4753aa
Author: Thomas Segismont <tsegismo>
Date: Sat Jul 12 00:28:13 2014 +0200

    Bug 1115146 - Purge (delete) data using smaller batches

    Introduced JDBCPurgeTemplate

    Updated the purge process for:
    * availabilities
    * traits
    * events
    * calltime data values
    * alerts (and notification log and condition log)
    * OOB

    Batch size:
    * is shared across implementations of the template
    * is configurable through a system property
    * value is unbounded
    * defaults to 3000

--- Additional comment from Thomas Segismont on 2014-07-24 15:06:03 CEST ---

Additional commit in master

commit c66a4510f639981c330ab009609b7ce11294f310
Author: Thomas Segismont <tsegismo>
Date: Thu Jul 24 15:03:48 2014 +0200

    Fixed a mistake on Log type

--- Additional comment from Elias Ross on 2014-08-01 21:48:25 CEST ---

Worried about deadlocks (see below), I decreased the batch size to 500 (from 3000). With 500, I experimented with deleting a few months of data (by decreasing the retention time for valid entries). I have 74,000,000 rows in my trait table, and with the fix, I don't seem to be making much of a dent in it:

14:54:17,734 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Traits data purged [98400] - completed in [22855573]ms

That is roughly six hours to delete 98,000 rows, or about 15,000 rows per hour.
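The configurable batch size from the commit above (shared across template implementations, settable via a system property, unbounded, defaulting to 3000, and tuned to 500 and later 30,000 in the comments) could be resolved along these lines. This is only a sketch; the parsing logic and fallback behavior are assumptions for illustration, not necessarily what JDBCPurgeTemplate does.

```java
// Sketch of resolving a purge batch size from a system-property value,
// falling back to the documented default of 3000 when the property is
// absent or malformed. Illustrative only, not RHQ's actual code.
public class PurgeBatchSize {
    static final int DEFAULT_BATCH_SIZE = 3000;

    public static int resolve(String propertyValue) {
        if (propertyValue == null || propertyValue.isEmpty()) {
            return DEFAULT_BATCH_SIZE;
        }
        try {
            // Per the commit message, the configured value is unbounded.
            return Integer.parseInt(propertyValue.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_BATCH_SIZE;
        }
    }
}
```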
Deletion was interrupted with:

14:54:11,933 INFO [org.rhq.enterprise.server.measurement.MeasurementServerServiceImpl] (http-/0.0.0.0:7080-710) Performance: measurement merge [65] timing (60147)ms
14:54:12,359 WARN [org.rhq.enterprise.server.measurement.MeasurementDataManagerBean] (http-/0.0.0.0:7080-58) Failure saving measurement trait data: java.sql.BatchUpdateException: ORA-02049: timeout: distributed transaction waiting for lock [SQLException=ORA-02049: timeout: distributed transaction waiting for lock -> ORA-02049: timeout: distributed transaction waiting for lock (error-code=2049,sql-state=42000)]

I added an index on the time_stamp column to try to help. I'm also increasing the batch size to see if it helps. I worry that performance is worse than before and that I will have to come up with my own purge process.

--- Additional comment from Thomas Segismont on 2014-08-05 10:53:29 CEST ---

I'm OOO right now and will be back on the 19th.

I thought the bottleneck would be the number of transactions executed in a row. But it sounds like the problem comes from reading the table to purge: a full scan searching for the keys of rows to delete. I'm not sure adding an index is a good idea; it means more work for the database on every insert (in particular if we do this for all the tables involved in the purge).

When I'm back, I'll try to enhance the process this way:
1. full scan (only once) the table to purge, and put the keys of the rows to delete in a *_purge table
2. delete the rows by primary key in batches

In the meantime, would you mind posting the number of rows in all your tables involved in the purge process? Do you know how the age of the rows is distributed? Ideally, how much data would you like to keep? A month? A year?

--- Additional comment from Elias Ross on 2014-08-05 18:27:28 CEST ---

To answer your questions:

Age of rows: For traits, I'm guessing the data is evenly distributed. I'm guessing there are a large number of traits that change very frequently.
Data retention: There are no business requirements, but a year seems reasonable, given that there is adequate space at least.

In regard to the approach: I agree with your suggested improvements. I agree that adding an index doesn't really make much sense; I only added it to get the process to complete. It did work much better to increase to 30,000 rows per delete, which isn't what I expected.

I think what needs to happen isn't putting the data into a separate table, but simply querying for all rows, then deleting in batches. It seems unlikely that even 70,000,000 integers in memory would be much of a memory issue, although maybe you can simply cap things at 1,000,000 or even 100,000. Since what I was doing was probably an unusual case (one normally doesn't reduce the data retention like that), I'm not sure you need to fully solve that problem.

--- Additional comment from Thomas Segismont on 2014-09-08 12:38:48 CEST ---

Pushed to master

commit 29c77743dd94b768bb60b0d1e546fe52179b303d
Author: Thomas Segismont <tsegismo>
Date: Mon Sep 8 12:37:55 2014 +0200

    1. full scan (only once) the table or index, and put the keys of the rows to delete in a temporary file
    2. delete the rows by primary key in batches
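The two-phase purge described in the commit above can be sketched as follows: phase 1 spools the primary keys of rows to delete into a temporary file during a single scan, and phase 2 streams the keys back and groups them into batches. This is a hypothetical sketch, not RHQ's implementation; the "delete" step here only builds the batches, where real code would issue one DELETE ... WHERE id IN (...) per batch in its own transaction.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-phase purge: (1) spool keys to a temp file during the
// single full scan, (2) read them back in fixed-size batches for per-batch
// deletes. Class and method names are illustrative, not RHQ's actual code.
public class TwoPhasePurge {

    // Phase 1: write one primary key per line to a temporary file.
    public static Path spoolKeys(List<Long> keys) throws IOException {
        Path file = Files.createTempFile("purge-keys", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (Long k : keys) {
                w.write(Long.toString(k));
                w.newLine();
            }
        }
        return file;
    }

    // Phase 2: stream keys back, grouping them into batches of batchSize.
    public static List<List<Long>> readBatches(Path file, int batchSize) throws IOException {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(file)) {
            String line;
            while ((line = r.readLine()) != null) {
                current.add(Long.parseLong(line));
                if (current.size() == batchSize) {
                    batches.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Spooling keys to a file keeps memory use flat regardless of table size, which addresses the concern above about holding tens of millions of keys in memory.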
Release/jon3.3.x commits:

commit 79456213b90e2b62e48f0a4aafb1f142012f3474
Author: Thomas Segismont <tsegismo>
Date: Sat Jul 12 00:28:13 2014 +0200

    (cherry picked from commit df5ddda18ac57f59690aa83026f534715a4753aa)
    Signed-off-by: Jay Shaughnessy <jshaughn>

    Conflicts:
        modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/scheduler/jobs/DataPurgeJob.java

commit e58c2b13354fa328c3cea2dbea9fa21f366cd8d4
Author: Thomas Segismont <tsegismo>
Date: Thu Jul 24 15:03:48 2014 +0200

    (cherry picked from commit c66a4510f639981c330ab009609b7ce11294f310)
    Signed-off-by: Jay Shaughnessy <jshaughn>

commit 0ffe33a32bc1ebfd99487d43fc114c9239b14c73
Author: Thomas Segismont <tsegismo>
Date: Mon Sep 8 12:37:55 2014 +0200

    (cherry picked from commit 29c77743dd94b768bb60b0d1e546fe52179b303d)
    Signed-off-by: Jay Shaughnessy <jshaughn>
Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=385149
Created attachment 940078 [details] purge_btn
Created attachment 940700 [details] bundle_res_dep_hist
Created attachment 940701 [details] orphaned_bundle_del_batches
Verified on orphaned bundle resource deployment history -- deleted in 30k batches during the data purge job -- JON 3.3 ER03 -- screenshots attached