Bug 494064
| Summary: | Satellite: performance issue in procedure remove_server_entitlement (seen on satellite with lots of monitoring data) | | |
|---|---|---|---|
| Product: | Red Hat Satellite 5 | Reporter: | Issue Tracker <tao> |
| Component: | Server | Assignee: | Jan Pazdziora <jpazdziora> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Milan Zázrivec <mzazrivec> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 406 | CC: | cperry, dgoodwin, jbrownin, mmraka, mzazrivec, tao, xdmoon |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | sat530 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 497477 (view as bug list) | Environment: | |
| Last Closed: | 2009-09-10 18:38:04 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 463877, 488327 | | |
Description
Issue Tracker
2009-04-03 23:39:22 UTC
Public Bug Summary:

History/background info (for the impatient, jump to the "Main performance bottleneck" section): RHEL 4 i386 4.0.6 Satellite.

This all started when the customer got certificate expiration emails and tried to activate a new certificate, upon which rhn-satellite-activate appeared to "hang", i.e. it was unresponsive for more than 15 minutes and eventually failed with the following traceback, as seen in rhn_server_satellite.log upon satellite-sync:

    Exception Handler Information
    Traceback (most recent call last):
      File "/usr/bin/satellite-sync", line 134, in main
        return satsync.Runner().main()
      File "/usr/share/rhn/satellite_tools/satsync.py", line 197, in main
        ret = method()
      File "/usr/share/rhn/satellite_tools/satsync.py", line 266, in _step_channel_families
        self.syncer.syncCert()
      File "/usr/share/rhn/satellite_tools/satsync.py", line 497, in syncCert
        return self._process_cert(cert)
      File "/usr/share/rhn/satellite_tools/satsync.py", line 532, in _process_cert
        satCerts.storeRhnCert(cert)
      File "/usr/share/rhn/satellite_tools/satCerts.py", line 227, in storeRhnCert
        set_slots_from_cert(sc)
      File "/usr/share/rhn/satellite_tools/satCerts.py", line 135, in set_slots_from_cert
        update_counts_proc(org_id, 'S', slot_type_id, quantity)
      File "/usr/share/rhn/server/rhnSQL/int_oracle.py", line 26, in __call__
        retval = self._call_proc(args)
      File "/usr/share/rhn/server/rhnSQL/driver_cx_Oracle.py", line 251, in _call_proc
        return self._call_proc_ret(args, ret_type=None)
      File "/usr/share/rhn/server/rhnSQL/driver_cx_Oracle.py", line 272, in _call_proc_ret
        raise sql_base.SQLSchemaError(e.code, e.message, e.context)
    SQLSchemaError: (8102, 'ORA-08102: index key not found, obj# 7388, dba 62940916 (2)', 'ORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 240\nORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 538\nORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 577\nORA-06512: at line 1\n', 'Cursor_InternalExecute()')

This looked to indicate an inconsistency in the table index and possible corruption (see http://ora-08102.ora-code.com/), possibly at the index, table, or data block level. The object is 7388, which is a Monitoring index:

    SQL> select object_name from all_objects where object_id = 7388;

    OBJECT_NAME
    ------------------------------
    TIME_SERIES_OID_ENTRY_IDX

So first we attempted to rebuild the index, but saw that temp_tbs (the temporary tablespace) was full, even after we added another temp data file:

    SQL> alter index TIME_SERIES_OID_ENTRY_IDX rebuild tablespace DATA_TBS;
    alter index TIME_SERIES_OID_ENTRY_IDX rebuild tablespace DATA_TBS
    *
    ERROR at line 1:
    ORA-01652: unable to extend temp segment by 128 in tablespace TEMP_TBS

    SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER;

    TABLESPACE_NAME                BYTES_USED BYTES_FREE
    ------------------------------ ---------- ----------
    TEMP_TBS                        262144000          0
    TEMP_TBS                        262144000          0

At this point we got a db dump to reproduce and test in-house. After adding more temporary datafiles for temp_tbs, the index could be recreated; of course, this itself indicates a problem, because the operation most likely should not require that much temporary space:

    SQL> drop index time_series_oid_entry_idx;

    Index dropped.

    SQL> create index time_series_oid_entry_idx
      2    on time_series(o_id, entry_time)
      3    tablespace DATA_TBS
      4    storage ( freelists 16 )
      5    initrans 32;

    Index created.
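For reference, the additional temporary datafiles mentioned above were added along these lines. This is a minimal sketch only: the datafile path and size are illustrative assumptions, not values taken from the affected system.

    -- Path and size are hypothetical; match them to the local database layout.
    -- 250M corresponds to the ~262144000-byte tempfiles seen in the queries here.
    alter tablespace TEMP_TBS
        add tempfile '/rhnsat/data/temp_02.dbf' size 250M;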
After the recreation, V$TEMP_SPACE_HEADER showed the grown temporary tablespace (note the additional tempfiles):

    SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER;

    TABLESPACE_NAME                BYTES_USED BYTES_FREE
    ------------------------------ ---------- ----------
    TEMP_TBS                        210763776   51380224
    TEMP_TBS                        210763776   51380224
    TEMP_TBS                        210763776   51380224
    TEMP_TBS                        209715200   52428800
    TEMP_TBS                        209715200   52428800
    TEMP_TBS                        262144000          0
    TEMP_TBS                        262144000          0
    TEMP_TBS                        262144000          0

However, after this, attempting to activate the cert again still did not return after 4+ hours, and top showed high CPU usage from oracle; it also ran out of undo space, so I had to raise the UNDO_RETENTION setting. In the end, based on the oracle trace/dump and the statspack report, the problem is located in the "DELETE FROM time_series" section of the remove_server_entitlement procedure in PL/SQL. This was most likely called because the cert had fewer entitlements than the satellite was already using, so some systems were being unentitled from Monitoring. A temporary workaround is to activate with a cert that has more slots, so that no system will be unentitled.

Main performance bottleneck:

There appears to be a performance issue in the "delete from time_series" section of the remove_server_entitlement procedure, which is called by prune_group, which is in turn called by set_group_count:

    ...
    procedure remove_server_entitlement (
        server_id_in in number,
        type_label_in in varchar2 := 'sw_mgr_entitled'
    ) is
        group_id number;
    ...
        if type_label_in = 'monitoring_entitled' then
            DELETE FROM state_change
             WHERE o_id IN (SELECT probe_id
                              FROM rhn_check_probe
                             WHERE host_id = server_id_in);

            DELETE FROM time_series
             WHERE SUBSTR(o_id, INSTR(o_id, '-') + 1,
                          (INSTR(o_id, '-', INSTR(o_id, '-') + 1) - INSTR(o_id, '-')) - 1)
                IN (SELECT probe_id
                      FROM rhn_check_probe
                     WHERE host_id = server_id_in);

            DELETE FROM rhn_probe
             WHERE recid IN (SELECT probe_id
                               FROM rhn_check_probe
                              WHERE host_id = server_id_in);
        end if;
    ...

This manifests on a satellite with lots of Monitoring data. For example, the table sizes show time_series at 1.7G out of the 3.3G total for all tables, with 43,099,025 rows:

    TIME_SERIES                         1.7G
    --------------------------------  -------
    Total                               3.3G

    SQL> select count(*) from time_series;

      COUNT(*)
    ----------
      43099025

I've looked at the 5.2 satellite schema, and this section looks to be the same as in the 4.0.6 satellite, so it is still worth investigating for sat 5.3.0.

Snippets from the oracle dump file and the statspack report follow.

From the oracle dump file (see Comment #12):

    ...
    *** 2009-04-02 17:04:18.988
    ksedmp: internal or fatal error
    Current SQL statement for this session:
    DELETE FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1, (INSTR(O_ID,
    '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1) IN (SELECT PROBE_ID FROM
    RHN_CHECK_PROBE WHERE HOST_ID = :B1 )
    ----- PL/SQL Call Stack -----
      object      line  object
      handle    number  name
    0x59893f74     240  package body RHNSAT.RHN_ENTITLEMENTS
    0x59893f74     538  package body RHNSAT.RHN_ENTITLEMENTS
    0x59893f74     577  package body RHNSAT.RHN_ENTITLEMENTS
    0x5989bf08       1  anonymous block
    ...

From the statspack report (attached):

    ...
                                                         CPU      Elapsd
      Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
    --------------- ------------ -------------- ------ -------- --------- ----------
        523,168,052            2  261,584,026.0   97.7 ######## 18971.85  912455645
    Module: python@ (TNS V1-V3)
    begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;

        517,001,292           23   22,478,317.0   96.5 ######## 18164.38 3663609765
    Module: python@ (TNS V1-V3)
    DELETE FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1, (INSTR(O_ID,
    '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1 ) IN (SELECT PROBE_ID FROM
    RHN_CHECK_PROBE WHERE HOST_ID = :B1 )
    ...
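The buffer-get counts above reflect repeated full scans: the SUBSTR/INSTR expression in the DELETE cannot use the plain (o_id, entry_time) index on time_series, so each probe removal reads the whole 43-million-row table. The fix referenced below adds a functional index; purely as a hedged sketch, with a hypothetical index name, an index matching that expression could look like this:

    -- Sketch only: the index name is illustrative, and the actual
    -- spacewalk-schema change from bug 497477 may differ in detail.
    create index time_series_probe_idx
        on time_series (SUBSTR(o_id, INSTR(o_id, '-') + 1,
            (INSTR(o_id, '-', INSTR(o_id, '-') + 1) - INSTR(o_id, '-')) - 1))
        tablespace DATA_TBS;

With such an index in place, the optimizer can resolve the probe ids from the index instead of evaluating the expression against every row of the table.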
Addressed in the Spacewalk repo via bug 497477. The fix made it to spacewalk-schema-0.5.20-9.el?sat, which is on the Satellite-5.3.0-RHEL?-re20090501.1 composes. Moving ON_QA.

The test plan would be:

1) Have a satellite with some machines registered, monitoring enabled, and a huge number of monitoring probe runs, so that time_series holds at least a couple of dozen million records. Instead of running the probes, you can probably just insert the records in sqlplus (see the sketch at the end of this report).

2) Run rhn-satellite-activate with a certificate that has fewer monitoring entitlements than the number of monitoring-entitled machines in the satellite. During rhn-satellite-activate, the monitoring entitlements will be revoked from some of the machines. Without the fix, the process should take a significant amount of time; with the fix, it should be faster by roughly 50 percent.

So two scenarios:

1. Deleting ~4 million rows (for one probe) from a time_series with > 40 million rows (4 million rows roughly corresponds to a probe that has been 20-40 years in production, so this is very unrealistic).
2. Deleting ~100,000 rows (for one probe) from a time_series with > 40 million rows (100,000 rows is approximately the result of one year of probe usage).

* Satellite 5.2.0 (no functional index):
  Scenario 1 -- ~38 minutes
  Scenario 2 -- ~8 minutes

* Satellite 5.3.0 (with functional index):
  Scenario 1 -- ~52 minutes
  Scenario 2 -- ~5 minutes

Hence VERIFIED (although the functional index is just a MacGyver solution; it does not solve the real problem of the TIME_SERIES table, but it helps with performance in the most common scenarios).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1434.html
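For step 1 of the test plan above, seeding time_series directly in sqlplus could look like the following. This is a hedged sketch only: it assumes an (o_id, entry_time, data) column layout and an o_id of the form '<scout>-<probe_id>-<metric>'; only the middle token being the probe id is established by the DELETE quoted earlier, so adjust the format and volumes to the actual schema, and use probe ids that belong to probes of monitoring-entitled machines so the revocation actually deletes the rows.

    -- Sketch: bulk-insert synthetic monitoring data for 100 hypothetical
    -- probes; the column layout and o_id format are assumptions (see above).
    begin
        for i in 1 .. 1000000 loop
            insert into time_series (o_id, entry_time, data)
            values ('1-' || to_char(mod(i, 100) + 1) || '-latency',
                    1230000000 + i,
                    to_char(round(dbms_random.value(0, 100), 2)));
            if mod(i, 10000) = 0 then
                commit;  -- commit in batches to keep undo usage bounded
            end if;
        end loop;
        commit;
    end;
    /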