Bug 494064

Summary: Satellite: performance issue in procedure remove_server_entitlement (seen on satellite with lots of monitoring data)
Product: Red Hat Satellite 5
Component: Server
Version: 406
Hardware: All
OS: Linux
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Fixed In Version: sat530
Doc Type: Bug Fix
Reporter: Issue Tracker <tao>
Assignee: Jan Pazdziora <jpazdziora>
QA Contact: Milan Zázrivec <mzazrivec>
CC: cperry, dgoodwin, jbrownin, mmraka, mzazrivec, tao, xdmoon
Cloned to: 497477
Bug Blocks: 463877, 488327
Last Closed: 2009-09-10 18:38:04 UTC

Description Issue Tracker 2009-04-03 23:39:22 UTC
Escalated to Bugzilla from IssueTracker

Comment 19 Xixi 2009-04-04 00:18:17 UTC
Public Bug Summary:

History/background info (for the impatient, jump to "Main performance bottleneck" section):

RHEL 4 i386 4.0.6 Satellite

This all started when the customer got cert expiration emails and tried to activate a new cert, at which point rhn-satellite-activate appeared to "hang", i.e. was unresponsive for >15 minutes and eventually failed with this traceback, as seen in rhn_server_satellite.log during satellite-sync -

Exception Handler Information
Traceback (most recent call last):
  File "/usr/bin/satellite-sync", line 134, in main
    return satsync.Runner().main()
  File "/usr/share/rhn/satellite_tools/satsync.py", line 197, in main
    ret = method()
  File "/usr/share/rhn/satellite_tools/satsync.py", line 266, in _step_channel_families
    self.syncer.syncCert()
  File "/usr/share/rhn/satellite_tools/satsync.py", line 497, in syncCert
    return self._process_cert(cert)
  File "/usr/share/rhn/satellite_tools/satsync.py", line 532, in _process_cert
    satCerts.storeRhnCert(cert)
  File "/usr/share/rhn/satellite_tools/satCerts.py", line 227, in storeRhnCert
    set_slots_from_cert(sc)
  File "/usr/share/rhn/satellite_tools/satCerts.py", line 135, in set_slots_from_cert
    update_counts_proc(org_id, 'S', slot_type_id, quantity)
  File "/usr/share/rhn/server/rhnSQL/int_oracle.py", line 26, in __call__
    retval = self._call_proc(args)
  File "/usr/share/rhn/server/rhnSQL/driver_cx_Oracle.py", line 251, in _call_proc
    return self._call_proc_ret(args, ret_type=None)
  File "/usr/share/rhn/server/rhnSQL/driver_cx_Oracle.py", line 272, in _call_proc_ret
    raise sql_base.SQLSchemaError(e.code, e.message, e.context)
SQLSchemaError: (8102, 'ORA-08102: index key not found, obj# 7388, dba 62940916 (2)', 'ORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 240\nORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 538\nORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 577\nORA-06512: at line 1\n', 'Cursor_InternalExecute()')

This looked to indicate an inconsistency in the index and possible corruption (see http://ora-08102.ora-code.com/) at the index, table, or data block level. The object is 7388, which is a Monitoring index -
SQL> select object_name from all_objects where object_id=7388;
OBJECT_NAME
------------------------------
TIME_SERIES_OID_ENTRY_IDX

So first, we attempted to rebuild the index, but found that temp_tbs (the temporary tablespace) was full, even after we had added another temp data file -

SQL> alter index TIME_SERIES_OID_ENTRY_IDX rebuild tablespace DATA_TBS;
alter index TIME_SERIES_OID_ENTRY_IDX rebuild tablespace DATA_TBS
*
ERROR at line 1:
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP_TBS

SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER;

TABLESPACE_NAME                BYTES_USED BYTES_FREE
------------------------------ ---------- ----------
TEMP_TBS                        262144000          0
TEMP_TBS                        262144000          0
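
For reference, adding more temporary space before retrying the rebuild would look roughly like the sketch below; the datafile path and size are placeholders, and depending on how TEMP_TBS was created the ADD DATAFILE form may be needed instead of ADD TEMPFILE.

-- Sketch only: grow the temporary tablespace so the index rebuild has room
-- to sort. Path and size are examples, not recommendations.
ALTER TABLESPACE temp_tbs
  ADD TEMPFILE '/rhnsat/data/rhnsat/temp_02.dbf' SIZE 512M AUTOEXTEND ON;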

At this point we got a db dump to reproduce and test in-house. After adding more temporary datafiles for temp_tbs, the index could be recreated - of course, this in itself indicates a problem, because the operation most likely should not require that much temporary space -

SQL> drop index time_series_oid_entry_idx;

Index dropped.

SQL> create index time_series_oid_entry_idx
    on time_series(o_id, entry_time)
    tablespace DATA_TBS
    storage ( freelists 16 )
    initrans 32;

Index created.

SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from
V$TEMP_SPACE_HEADER;

TABLESPACE_NAME                BYTES_USED BYTES_FREE
------------------------------ ---------- ----------
TEMP_TBS                        210763776   51380224
TEMP_TBS                        210763776   51380224
TEMP_TBS                        210763776   51380224
TEMP_TBS                        209715200   52428800
TEMP_TBS                        209715200   52428800
TEMP_TBS                        262144000          0
TEMP_TBS                        262144000          0
TEMP_TBS                        262144000          0

However, after this, attempting to activate the cert again still did not return after 4+ hours, and top showed high CPU usage from oracle; it also ran out of undo space, so I had to increase the UNDO_RETENTION setting.
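
For completeness, the undo adjustment was simply a larger UNDO_RETENTION; a sketch is below, where the value is only an example, not a recommendation.

-- Sketch only: keep undo long enough for the large delete to complete.
SHOW PARAMETER undo_retention
ALTER SYSTEM SET undo_retention = 14400;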

In the end, based on the oracle trace/dump and the statspack report, the problem location is the "DELETE FROM time_series" section of the remove_server_entitlement procedure in PL/SQL. This was most likely called because the cert had fewer entitlements than the satellite was already using, so some systems were being unentitled for Monitoring. A temporary workaround is to activate with a cert that has more slots, so that no system gets unentitled.

Main performance bottleneck:

There appears to be a performance issue in the "delete from time_series" section of the remove_server_entitlement procedure, which is called by prune_group, which is in turn called by set_group_count -
...
        procedure remove_server_entitlement (
                server_id_in in number,
                type_label_in in varchar2 := 'sw_mgr_entitled'
        ) is
                group_id number;
...
         if type_label_in = 'monitoring_entitled' then
           DELETE
             FROM state_change
            WHERE o_id IN (SELECT probe_id
                             FROM rhn_check_probe
                            WHERE host_id = server_id_in);
           DELETE
             FROM time_series
            WHERE SUBSTR(o_id, INSTR(o_id, '-') + 1,
                         (INSTR(o_id, '-', INSTR(o_id, '-') + 1) -
                          INSTR(o_id, '-')) - 1)
              IN (SELECT probe_id
                    FROM rhn_check_probe
                   WHERE host_id = server_id_in);
           DELETE
             FROM rhn_probe
            WHERE recid IN (SELECT probe_id
                              FROM rhn_check_probe
                             WHERE host_id = server_id_in);
         end if;
...

This manifests on a satellite with lots of Monitoring data; for example, here the table-size report shows time_series at 1.7G out of the 3.3G total for all tables, and the table has 43,099,025 rows -

TIME_SERIES                         1.7G
-------------------------------- -------
Total                               3.3G

SQL> select count(*) from time_series;
  COUNT(*)
----------
  43099025
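
To get a feel for how much work one such delete has to do, a rough query like the one below (with a placeholder server id) counts the time_series rows belonging to a single server's probes -

-- Sketch only: rows the DELETE would have to remove for one server;
-- 1000010000 is a placeholder host_id.
SELECT COUNT(*)
  FROM time_series
 WHERE SUBSTR(o_id, INSTR(o_id, '-') + 1,
              (INSTR(o_id, '-', INSTR(o_id, '-') + 1) - INSTR(o_id, '-')) - 1)
    IN (SELECT probe_id FROM rhn_check_probe WHERE host_id = 1000010000);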

I've looked at the 5.2 satellite schema and this section looks to be the same as in the 4.0.6 satellite, so it's still worth investigating for sat 5.3.0.

Snippets from oracle dump file and stats pack report -

From oracle dump file (see Comment #12)
...
*** 2009-04-02 17:04:18.988
ksedmp: internal or fatal error
Current SQL statement for this session:
DELETE FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1, (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-'))
- 1) IN (SELECT PROBE_ID FROM RHN_CHECK_PROBE WHERE HOST_ID = :B1 )
----- PL/SQL Call Stack -----
 object      line  object
 handle    number  name
0x59893f74       240  package body RHNSAT.RHN_ENTITLEMENTS
0x59893f74       538  package body RHNSAT.RHN_ENTITLEMENTS
0x59893f74       577  package body RHNSAT.RHN_ENTITLEMENTS
0x5989bf08         1  anonymous block
...

From statspack report (attached) -
...
                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    523,168,052            2  261,584,026.0   97.7 ########  18971.85  912455645
Module: python@ (TNS V1-V3)
begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;

    517,001,292           23   22,478,317.0   96.5 ########  18164.38 3663609765
Module: python@ (TNS V1-V3)
DELETE FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1,
 (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1
) IN (SELECT PROBE_ID FROM RHN_CHECK_PROBE WHERE HOST_ID = :B1 )
...
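
Since the predicate wraps o_id in SUBSTR/INSTR, the existing (o_id, entry_time) index cannot be used and every execution scans the whole table. One way to address that is a function-based index on the same expression; a sketch is below (the index name is made up here, and the actual definition shipped in spacewalk-schema may differ) -

-- Sketch only: function-based index matching the DELETE predicate so the
-- probe-id lookup no longer requires a full scan of time_series.
CREATE INDEX time_series_probe_idx
    ON time_series (SUBSTR(o_id, INSTR(o_id, '-') + 1,
                           (INSTR(o_id, '-', INSTR(o_id, '-') + 1) -
                            INSTR(o_id, '-')) - 1))
    TABLESPACE data_tbs;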

Comment 28 Jan Pazdziora 2009-04-24 07:50:44 UTC
Addressed in Spacewalk repo via bug 497477.

Comment 29 Jan Pazdziora 2009-05-04 15:21:31 UTC
The fix made it to spacewalk-schema-0.5.20-9.el?sat, which is in the Satellite-5.3.0-RHEL?-re20090501.1 composes.

Moving ON_QA.

Comment 31 Jan Pazdziora 2009-06-11 09:07:45 UTC
The test plan would be:

1) Have a satellite with some machines registered, monitoring enabled, and a huge number of monitoring probe runs, so that time_series contains at least a couple of dozen million records. Instead of running the probes, you can probably just insert the records in sqlplus (see the sketch at the end of this comment).
2) Run rhn-satellite-activate with a certificate that has fewer monitoring entitlements than the number of monitoring-entitled machines on the satellite.

During rhn-satellite-activate, the monitoring entitlements will be revoked from some of the machines.

Without the fix, the process should take a significant amount of time.

With the fix, the process should be roughly 50 percent faster.
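
A rough way to generate the test data for step 1 is a PL/SQL loop in sqlplus. This is only a sketch: it assumes time_series has (o_id, entry_time, data) columns and that o_id follows the '<org_id>-<probe_id>-<metric>' pattern implied by the DELETE above - adjust to the real schema and to probe ids that actually exist on the box.

-- Sketch only: bulk-insert synthetic monitoring samples for probe id 42.
BEGIN
    FOR i IN 1 .. 1000000 LOOP
        INSERT INTO time_series (o_id, entry_time, data)
        VALUES ('1-42-latency', 1230000000 + i * 300, TO_CHAR(MOD(i, 100)));
        IF MOD(i, 10000) = 0 THEN
            COMMIT;
        END IF;
    END LOOP;
    COMMIT;
END;
/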

Comment 32 Milan Zázrivec 2009-07-01 16:56:34 UTC
So two scenarios:
1. Deleting ~ 4 million rows (for one probe) from time_series
which has > 40 million rows (4 million rows roughly corresponds to a probe
that has been in production for 20 - 40 years -> very unrealistic).
2. Deleting ~ 100,000 rows (for one probe) from time_series which has > 40
million rows (100,000 rows is approximately the result of one year of probe usage).

* Satellite 5.2.0 (no functional index):
Scenario 1 -- ~ 38 minutes
Scenario 2 -- ~  8 minutes

* Satellite 5.3.0 (with functional index):
Scenario 1 -- ~ 52 minutes
Scenario 2 -- ~  5 minutes

Hence VERIFIED (although the functional index is just a MacGyver solution;
it does not solve the real problem of the TIME_SERIES table, but it helps with
performance for the most common scenarios).

Comment 33 Brandon Perkins 2009-09-10 18:38:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1434.html