The database needs an impossibly long time to clean up its call-time data tables once they get big (more than 500,000 entries). We observed a delete statement that took 20 hours to complete! What's more, it practically disabled the machine, using up 100% CPU, and caused the RHQ server to produce all kinds of DB errors (failed inserts, deletes, timed-out transactions, etc.). Our solution is to install the following indexes:

CREATE INDEX PERMEO.CALL_DAT_END ON PERMEO.RHQ_CALLTIME_DATA_VALUE (END_TIME) COMPRESS 1 TABLESPACE BASIDX;
CREATE INDEX PERMEO.CALL_KEY_BEG ON PERMEO.RHQ_CALLTIME_DATA_VALUE (KEY_ID, BEGIN_TIME) COMPRESS 1 TABLESPACE BASIDX ONLINE;

In our case everything works like a charm now (same test case, but CPU < 5%). Please include this fix in the next version of RHQ! Also note that I have seen a similar thing happen using Postgres. I suggest including similar indexes for Postgres as well.
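For reference, a minimal Postgres sketch of equivalent indexes (the index names here are made up, and the Oracle-specific COMPRESS/TABLESPACE clauses are dropped; CONCURRENTLY is the rough Postgres analogue of Oracle's ONLINE, avoiding long write locks while the index builds):

```sql
-- Hypothetical Postgres equivalents of the Oracle indexes above.
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY rhq_calltime_data_value_end_idx
    ON rhq_calltime_data_value (end_time);

CREATE INDEX CONCURRENTLY rhq_calltime_data_value_key_beg_idx
    ON rhq_calltime_data_value (key_id, begin_time);
```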
Frank - what was the size of your calltime data? Row counts and all that.
and adding those indexes - did you notice any degradation of insert performance? If so, can you quantify that?
I think at the time we added the indexes we had about 900,000 rows in the call-time value table. Now, after some more test runs, we have 4,843,626 rows. Actually, we didn't check the performance of the insert statement before and after. I remember that we noticed the insert statement dominated the resource usage statistics in Oracle after adding the indexes. But that is no wonder, as I was running a test case for data collection and my RHQ plugin basically consists of call-time data metrics. We didn't investigate further, as the box was near idle and thus the performance was satisfactory for us. I will ask a colleague to have a look at the Oracle performance tool. We can give you some stats for the insert statement execution after the indexes were added.
Ok, here are some DB stats for you. Formatting in pure text is hard, but I'll try my best.

First, the delete statement:

#### Query:

DELETE FROM RHQ_CALLTIME_DATA_VALUE
 WHERE key_id = (SELECT id FROM RHQ_CALLTIME_DATA_KEY
                  WHERE schedule_id = :1 AND call_destination = :2)
   AND begin_time = :3

#BEFORE:

Plan:
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT             |                           |       |       |     2 (100)|          |
|   1 |  DELETE                      | RHQ_CALLTIME_DATA_VALUE   |       |       |            |          |
|   2 |   TABLE ACCESS FULL          | RHQ_CALLTIME_DATA_VALUE   |     1 |    39 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS BY INDEX ROWID| RHQ_CALLTIME_DATA_KEY     |     1 |  2028 |     0   (0)|          |
|   4 |    INDEX UNIQUE SCAN         | RHQ_CALLTIME_DATA_KEY_IDX |     1 |       |     0   (0)|          |
-----------------------------------------------------------------------------------------------------------

Please note that this involves a full table scan!
Stats, i.e. resource usage for the query:

                      Total         Per Execution     Per Row
Executions                3                  1           0.00
CPU Time (sec)      7066.83            2355.61           1.79
Buffer Gets       377280523       125760174.33       95538.24
Disk Reads             1815             605.00           0.46
Direct Writes             0               0.00           0.00
Rows                   3949            1316.33           1
Fetches                   0               0.00           0.00

#AFTER

Plan:
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT             |                           |       |       |     4 (100)|          |
|   1 |  DELETE                      | RHQ_CALLTIME_DATA_VALUE   |       |       |            |          |
|   2 |   INDEX RANGE SCAN           | CALL_KEY_BEG              |     1 |    29 |     3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS BY INDEX ROWID| RHQ_CALLTIME_DATA_KEY     |     1 |    80 |     1   (0)| 00:00:01 |
|   4 |    INDEX UNIQUE SCAN         | RHQ_CALLTIME_DATA_KEY_IDX |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Stats:

                      Total         Per Execution     Per Row
Executions              104                  1           3.85
CPU Time (sec)        13.08               0.13           0.48
Buffer Gets         2697502           25937.52       99907.48
Disk Reads               19               0.18           0.70
Direct Writes             0               0.00           0.00
Rows                     27               0.26           1
Fetches                   0               0.00           0.00

### As you can see, adding the indexes yielded a lot less CPU time and far fewer buffer get operations. We are talking about at least an order of magnitude! For the insert statement I only have stats confirming that the performance *after* adding the indexes is satisfactory. But as you will see, the execution plan for the query does not involve the new indexes. So I think the indexes will not affect the speed of the insert query as such, but rather overall DB performance.
Here come the stats:

### Query:

INSERT INTO RHQ_CALLTIME_DATA_VALUE(id, key_id, begin_time, end_time, minimum, maximum, total, count)
SELECT RHQ_calltime_data_value_id_seq.nextval, key.id, :1, :2, :3, :4, :5, :6
  FROM RHQ_numbers num, RHQ_calltime_data_key key
 WHERE num.i = 42
   AND key.id = (SELECT id FROM RHQ_CALLTIME_DATA_KEY
                  WHERE schedule_id = :7 AND call_destination = :8)

Plan:
------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                           | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT             |                                |       |       |     1 (100)|
|   1 |  SEQUENCE                    | RHQ_CALLTIME_DATA_VALUE_ID_SEQ |       |       |            |
|   2 |   NESTED LOOPS               |                                |     1 |    15 |     0   (0)|
|   3 |    INDEX UNIQUE SCAN         | SYS_C0029575                   |     1 |     2 |     0   (0)|
|   4 |    INDEX UNIQUE SCAN         | SYS_C0029545                   |     1 |    13 |     0   (0)|
|   5 |   TABLE ACCESS BY INDEX ROWID| RHQ_CALLTIME_DATA_KEY          |     1 |  2028 |     0   (0)|
|   6 |    INDEX UNIQUE SCAN         | RHQ_CALLTIME_DATA_KEY_IDX      |     1 |       |     0   (0)|
------------------------------------------------------------------------------------------------------

Stats:

                      Total         Per Execution     Per Row
Executions               21                  1           0.00
CPU Time (sec)         4.63               0.22           0.00
Buffer Gets          633018           30143.71          20.75
Disk Reads              123               5.86           0.00
Direct Writes             0               0.00           0.00
Rows                  30503            1452.52           1
Fetches                   0               0.00           0.00

###
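For anyone wanting to reproduce plans like the ones above on their own Oracle instance, a minimal sketch (the bind values are placeholders; assumes the session has the usual EXPLAIN PLAN privileges and a PLAN_TABLE):

```sql
-- Capture the optimizer plan for the purge delete, then display it.
EXPLAIN PLAN FOR
  DELETE FROM RHQ_CALLTIME_DATA_VALUE
   WHERE key_id = (SELECT id FROM RHQ_CALLTIME_DATA_KEY
                    WHERE schedule_id = 1 AND call_destination = 'dummy')
     AND begin_time = 0;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```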
Target at 1.3 for triage
Moving features/improvements to 1.4
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-2356
mass add of key word FutureFeature to help track
Index on end_time added in 1d53a6b on release-3.0.0.

On PG 8.4 (OS X laptop) I've seen the insert rate drop from ~1,000,000/h to ~600,000/h, which is still blazingly fast. Postgres 'explain' figures also show much better values for the delete after the addition of the index. Postgres' explain functionality suggests that just putting an index on end_time will reduce the delete cost:

from 0...317232 units to 0...60 for removal of ~1300 old rows
from 0...317232 to 0...61926 for 1.5 million rows (data older than 12h)
from 0...317232 to 249341 for 6 million rows (data older than 6h)
for 10.8 million rows (data older than 4h) PG switches to direct table access

In reality, with an insert rate of e.g. 100,000/h one can expect that the purge also deletes 100,000 entries each hour. With 31 days of data retention this would be less than 1% of the data, so just using the right data from the index is favorable.
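The Postgres cost figures above can be obtained along these lines (a sketch; the cutoff timestamp is a placeholder for "now minus the retention period" in epoch milliseconds):

```sql
-- With an index on end_time, EXPLAIN should show an Index Scan
-- for the purge delete instead of a Seq Scan over the whole table.
EXPLAIN
DELETE FROM rhq_calltime_data_value
 WHERE end_time < 1265000000000;  -- placeholder retention cutoff
```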
After the end_time index addition: 08:00:27,997 INFO [DataPurgeJob] Calltime purged [1156544] - completed in [25926]ms, with around 13 million rows before the purge.
QA Closing - perf env code change.
Mass-closure of verified bugs against JON.