Bug 497477

Summary:	Satellite: performance issue in procedure remove_server_entitlement (seen on satellite with lots of monitoring data)
Product:	[Community] Spacewalk	Reporter:	Jan Pazdziora (Red Hat) <jpazdziora>
Component:	Server	Assignee:	Jan Pazdziora (Red Hat) <jpazdziora>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Red Hat Satellite QA List <satqe-list>
Severity:	high	Docs Contact:
Priority:	high
Version:	0.5	CC:	cperry, mzazrivec, tao, xdmoon
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	494064	Environment:
Last Closed:	2009-09-10 12:05:59 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	456554

Description Jan Pazdziora (Red Hat) 2009-04-24 06:21:47 UTC

+++ This bug was initially created as a clone of Bug #494064 +++

From bug 494064 comment 19 by Xixi:

--- Additional comment from xdmoon on 2009-04-03 20:18:17 EDT ---

Public Bug Summary:

History/background info (for the impatient, jump to "Main performance
bottleneck" section):

RHEL 4 i386 4.0.6 Satellite

This all started when customer got cert expiration emails and tried to activate a new cert, upon which rhn-satellite-activate appeared to "hang" i.e. unresponsive for >15 minutes, eventually failing with this trace as seen from rhn_server_satellite.log upon satellite-sync -

Exception Handler Information
Traceback (most recent call last):
  File "/usr/bin/satellite-sync", line 134, in main
    return satsync.Runner().main()
  File "/usr/share/rhn/satellite_tools/satsync.py", line 197, in main
    ret = method()
  File "/usr/share/rhn/satellite_tools/satsync.py", line 266, in
_step_channel_families
    self.syncer.syncCert()
  File "/usr/share/rhn/satellite_tools/satsync.py", line 497, in
syncCert
    return self._process_cert(cert)
  File "/usr/share/rhn/satellite_tools/satsync.py", line 532, in
_process_cert
    satCerts.storeRhnCert(cert)
  File "/usr/share/rhn/satellite_tools/satCerts.py", line 227, in
storeRhnCert
    set_slots_from_cert(sc)
  File "/usr/share/rhn/satellite_tools/satCerts.py", line 135, in
set_slots_from_cert
    update_counts_proc(org_id, 'S', slot_type_id, quantity)
  File "/usr/share/rhn/server/rhnSQL/int_oracle.py", line 26, in
__call__
    retval = self._call_proc(args)
  File "/usr/share/rhn/server/rhnSQL/driver_cx_Oracle.py", line 251, in
_call_proc
    return self._call_proc_ret(args, ret_type=None)
  File "/usr/share/rhn/server/rhnSQL/driver_cx_Oracle.py", line 272, in
_call_proc_ret
    raise sql_base.SQLSchemaError(e.code, e.message, e.context)
SQLSchemaError: (8102, 'ORA-08102: index key not found, obj# 7388, dba
62940916 (2)', 'ORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line
240\nORA-06512: at "RHNSAT.RHN_ENTITLEMENTS", line 538\nORA-06512: at
"RHNSAT.RHN_ENTITLEMENTS", line 577\nORA-06512: at line 1\n',
'Cursor_InternalExecute()')

This looked to indicate an inconsistency in the table index and a possible
corruption (see http://ora-08102.ora-code.com/), possibly at index, table, or data block level.  The object is 7388, which is a Monitoring index - 
SQL> select object_name from all_objects where object_id=7388;
OBJECT_NAME
------------------------------
TIME_SERIES_OID_ENTRY_IDX

So first, we attempted to rebuild the index, but saw temp_tbs (the temporary tablespace) was full, even after we added another temp data file -

SQL> alter index TIME_SERIES_OID_ENTRY_IDX rebuild tablespace DATA_TBS;
alter index TIME_SERIES_OID_ENTRY_IDX rebuild tablespace DATA_TBS
*
ERROR at line 1:
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP_TBS

SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER;

TABLESPACE_NAME                BYTES_USED BYTES_FREE
------------------------------ ---------- ----------
TEMP_TBS                        262144000          0
TEMP_TBS                        262144000          0

At this point we got a db dump to reproduce and test in-house, after adding more temporary datafiles for temp_tbs, the index was able to be recreated - of course, this itself indicates a problem b/c the operation itself most likely should not require that much temporary space - 

SQL> drop index time_series_oid_entry_idx;

Index dropped.

SQL> create index time_series_oid_entry_idx
    on time_series(o_id, entry_time)
    tablespace DATA_TBS
    storage ( freelists 16 )
    initrans 32;
  2    3    4    5

Index created.

SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from
V$TEMP_SPACE_HEADER;

TABLESPACE_NAME                BYTES_USED BYTES_FREE
------------------------------ ---------- ----------
TEMP_TBS                        210763776   51380224
TEMP_TBS                        210763776   51380224
TEMP_TBS                        210763776   51380224
TEMP_TBS                        209715200   52428800
TEMP_TBS                        209715200   52428800
TEMP_TBS                        262144000          0
TEMP_TBS                        262144000          0
TEMP_TBS                        262144000          0

However, after this, attempting to activate the cert again still does not return after 4+ hours and top shows high cpu on oracle, it also ran out of undo space so I had to up the UNDO_RETENTION setting. 

In the end, based on the oracle trace/dump, and stats pack, the problem location is the "DELETE FROM time_series" section of remove_server_entitlement procedure in PL/SQL.  This was called most likely b/c the cert had fewer entitlements than satellite is already using, so some systems were being unentitled for Monitoring.  Temporary workaround is to activate with a cert with more slots so no system will be unentitled. 

Main performance bottleneck:

There appears to be a performance issue in the "delete from time_series" section of the remove_server_entitlement procuder, which is called by prune_group, which is called by set_group_count -
...
        procedure remove_server_entitlement (
                server_id_in in number,
                type_label_in in varchar2 := 'sw_mgr_entitled'
        ) is
                group_id number;
...
         if type_label_in = 'monitoring_entitled' then
           DELETE
             FROM state_change
            WHERE o_id IN (SELECT probe_id
                             FROM rhn_check_probe
                            WHERE host_id = server_id_in);
           DELETE
             FROM time_series
            WHERE SUBSTR(o_id, INSTR(o_id, '-') + 1,
                               (INSTR(o_id, '-', INSTR(o_id, '-') + 1) -
INSTR(o_id, '-')) - 1)
              IN (SELECT probe_id
                    FROM rhn_check_probe
                   WHERE host_id = server_id_in);
           DELETE
             FROM rhn_probe
            WHERE recid IN (SELECT probe_id
                              FROM rhn_check_probe
                             WHERE host_id = server_id_in);
         end if;
...

This manifests on a satellite with lots of Monitoring data, for example, tablesizes show time_series size = 1.7G out of the total 3.3G for all tables, and it has 43099025 rows -

TIME_SERIES                         1.7G
-------------------------------- -------
Total                               3.3G

SQL> select count(*) from time_series;
  COUNT(*)
----------
  43099025

I've looked at 5.2 satellite schema and this section looks to be the same as from 4.0.6 satellite, so it's still worth investigating for sat 5.3.0.

Snippets from oracle dump file and stats pack report -

From oracle dump file (see Comment #12)
...
*** 2009-04-02 17:04:18.988
ksedmp: internal or fatal error
Current SQL statement for this session:
DELETE FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1, (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-'))
- 1) IN (SELECT PROBE_ID FROM RHN_CHECK_PROBE WHERE HOST_ID = :B1 )
----- PL/SQL Call Stack -----
 object      line  object
 handle    number  name
0x59893f74       240  package body RHNSAT.RHN_ENTITLEMENTS
0x59893f74       538  package body RHNSAT.RHN_ENTITLEMENTS
0x59893f74       577  package body RHNSAT.RHN_ENTITLEMENTS
0x5989bf08         1  anonymous block
...

From statspack report (attached) -
...
                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    523,168,052            2  261,584,026.0   97.7 ########  18971.85  912455645
Module: python@ (TNS V1-V3)
begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;

    517,001,292           23   22,478,317.0   96.5 ########  18164.38 3663609765
Module: python@ (TNS V1-V3)
DELETE FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1,
 (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1
) IN (SELECT PROBE_ID FROM RHN_CHECK_PROBE WHERE HOST_ID = :B1 )
...

From bug 494064 comment 19 by Xixi:

--- Additional comment from jpazdziora on 2009-04-10 04:05:51 EDT ---

On the reproducer rlx-1-22 machine, select

select count(*)
FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1,
 (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1
) IN (SELECT PROBE_ID FROM RHN_CHECK_PROBE WHERE HOST_ID = 1000010060 );

uses full table access and takes 22 minutes.

When I create function-based index

create index TIME_SERIES_PROBE_ID_IDX
on TIME_SERIES_PROBE(SUBSTR(O_ID, INSTR(O_ID, '-') + 1,
 (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1
)) nologging;

and use

select /*+index(TIME_SERIES TIME_SERIES_PROBE_ID_IDX)*/ count(*)
FROM TIME_SERIES WHERE SUBSTR(O_ID, INSTR(O_ID, '-') + 1,
 (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1
) IN (SELECT to_char(PROBE_ID) FROM RHN_CHECK_PROBE WHERE HOST_ID = 1000010060 );

the select uses the index and takes 24 seconds.

Of course, this bugzilla is very related to bug 461704, "Monitoring probes data in time_series table are not cleaned up when probes/system profiles are deleted". So better fix might be to change the time_series table which has that column o_id which is not in the first normal form into table (with some compatibility view on top of it), and have proper indexes and foreign keys (maybe with cascade on delete) on it.

--- Additional comment from jpazdziora on 2009-04-24 02:08:01 EDT ---

The above should have read

create index TIME_SERIES_PROBE_ID_IDX
on TIME_SERIES(SUBSTR(O_ID, INSTR(O_ID, '-') + 1,
 (INSTR(O_ID, '-', INSTR(O_ID, '-') + 1) - INSTR(O_ID, '-')) - 1
)) nologging;

Comment 1 Jan Pazdziora (Red Hat) 2009-04-24 07:49:39 UTC

Function based index and hint added.

Master: 41da35eae9d2636d8fbf4e0a2f110d40c0349e4f and ff66fe94a48182d5c84ab269af9da6ccc3d94e03.

VADER: d8eda972965aa909e77acd183092779b4c032ecd and 07ca6b9ca231d9d6ca1085db67f09ae6b8b51ca2.

Comment 2 Jan Pazdziora (Red Hat) 2009-04-24 07:52:54 UTC

Tagged as spacewalk-schema-0.6.4-1 (for Spacewalk).

Comment 3 Miroslav Suchý 2009-09-10 12:05:59 UTC

Spacewalk 0.6 released