Bug 1115146
Summary: | Purge (delete) data using smaller batches | |||
---|---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Elias Ross <genman> | |
Component: | Core Server | Assignee: | Thomas Segismont <tsegismo> | |
Status: | ON_QA --- | QA Contact: | ||
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 4.9 | CC: | genman, hrupp, tsegismo | |
Target Milestone: | --- | |||
Target Release: | RHQ 4.13 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1139186 (view as bug list) | Environment: | ||
Last Closed: | Type: | Bug | ||
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1139186 |
Description
Elias Ross
2014-07-01 17:29:32 UTC
Note the tables are: (from active to less so) rhq_availability rhq_measurement_data_trait rhq_event rhq_calltime_data_value (and _key) rhq_alert There may be more. also RHQ_MEASUREMENT_OOB table. Started pull request: https://github.com/rhq-project/rhq/pull/81 Fixed in master commit df5ddda18ac57f59690aa83026f534715a4753aa Author: Thomas Segismont <tsegismo> Date: Sat Jul 12 00:28:13 2014 +0200 Bug 1115146 - Purge (delete) data using smaller batches Introduced JDBCPurgeTemplate Update purge process for: * availabilities * traits * events * calltime data values * alerts (and notif log and condition log) * OOB Batch size: * is shared across implementation of the template * is configurable through a sys prop * value is unbounded * defaults to 3000 Additional commit in master commit c66a4510f639981c330ab009609b7ce11294f310 Author: Thomas Segismont <tsegismo> Date: Thu Jul 24 15:03:48 2014 +0200 Fixed a mistake on Log type With worries of deadlocks, see below, I decreased the batch size to 500 (from 3000). With 500, I experimented with trying to delete a few months of data (by decreasing the time for valid entries). I have 74,000,000 rows in my trait database and with the fix, I don't seem to making much of a dent in it. 14:54:17,734 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Traits data purged [98400] - completed in [22855573]ms This is six hours to delete 98,000 rows, or 15,000 rows per hour. Deletion was interrupted with: 14:54:11,933 INFO [org.rhq.enterprise.server.measurement.MeasurementServerServiceImpl] (http-/0.0.0.0:7080-710) Performance: measurement merge [65] timing (60147)ms 14:54:12,359 WARN [org.rhq.enterprise.server.measurement.MeasurementDataManagerBean] (http-/0.0.0.0:7080-58) Failure saving measurement trait data: java.sql.BatchUpdateException:ORA-02049: timeout: distributed transaction waiting for lock [SQLException=ORA-02049: timeout: distributed transaction waiting for lock -> ORA-02049: timeout: distributed transaction waiting for lock (error-code=2049,sql-state=42000)] I added an index on the time_stamp column to try to help. I'm also increasing the batch size to see if it helps. I worry that performance is worse than before and I will have to come up with my own purge process. I'm OOO right now and will be back on the 19th. I thought the bottleneck would be the number of transactions to execute in a row. But it sounds like the problem comes from reading the table to purge, full scan, searching for keys of row to delete. I'm not sure adding an index is a good idea, it's more work for the database (in particular if we do this for all the tables to purge) every time we try to insert something. When I'm back, I'll try to enhance the process this way: 1. full scan (only once) the table to purge, and put the keys of the rows to delete in a *_purge table 2. delete the rows by primary key in batches In the meantime, would you mind to post here the number of lines of all your tables involved in the purge process? Do you know how the age of lines is distributed? Ideally, how much data would you like to keep? A month? A year? To answer your questions: Age of lines: For traits, I'm guessing the data is evenly distributed. I'm guessing there are a large number of traits that change very frequently. Data retention: There are no business requirements, but a year seems reasonable given there is adequate space at least. In regard to the approach: I agree with your suggested improvements. I agree that adding an index doesn't really make much sense, I just added it to get the process to complete. It did work much better to increase to 30,000 rows per delete, which isn't what I expected. I think what needs to happen isn't put the data into a separate table, but simply query for all rows, then delete in batches. It seems unlikely that even 70,000,000 integers in memory could be much of a memory issue, although maybe you can simply cap things at 1,000,000 or 100,000 even. As what I was doing was probably an unusual case (one normally doesn't reduce the data retention like that), I'm not sure you need to fully solve that problem. Pushed to master commit 29c77743dd94b768bb60b0d1e546fe52179b303d Author: Thomas Segismont <tsegismo> Date: Mon Sep 8 12:37:55 2014 +0200 1. full scan (only once) the table or index, and put the keys of the rows to delete in a temporary file 2. delete the rows by primary key in batches |