559447 – Unable to activate Satellite cert (oracle "hangs" with high cpu/mem consumption) - performance issue with update_family_counts() proc

Bug 559447 - Unable to activate Satellite cert (oracle "hangs" with high cpu/mem consumption) - performance issue with update_family_counts() proc

Summary: Unable to activate Satellite cert (oracle "hangs" with high cpu/mem consumpti...

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Satellite 5
Classification:	Red Hat
Component:	Server
Sub Component:
Version:	423
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Michael Mráka
QA Contact:	Red Hat Satellite QA List
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	158643
TreeView+	depends on / blocked

Reported:	2010-01-28 06:06 UTC by Xixi
Modified:	2018-10-27 15:44 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	561140 (view as bug list)
Environment:
Last Closed:	2011-01-05 22:32:12 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Xixi 2010-01-28 06:06:15 UTC

Description of problem:
Customer received a new RHN Satellite certificate, but is unable to activate. It does the activation on the RHN Hosted side just fine, but hangs when attempting to apply it to the local Satellite (whether through rhn-satellite-activate or through satellite-sync to pull down from Hosted, all attempts seem to "hang").

This is a external-database Satellite, customer notes that the load/cpu is minimal on the satellite itself, but the database box is spinning at 100%. The DBA observers the following query hanging for quite some time:

UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = (   SELECT COUNT(SC.SERVER_ID)  FROM RHNSERVERCHANNEL SC, RHNCHANNELFAMILYMEMBERS CFM, RHNSERVER S   WHERE S.ORG_ID = :B2   AND S.ID = SC.SERVER_ID   AND CFM.CHANNEL_FAMILY_ID = :B1   AND CFM.CHANNEL_ID = SC.CHANNEL_ID )   WHERE ORG_ID = :B2   AND CHANNEL_FAMILY_ID = :B1

Version-Release number of selected component (if applicable):
Red Hat Network (RHN) Satellite 4.2.3

System architecture(s):
RHEL 5 x86

How reproducible:
Always.

Steps to Reproduce:
1. Have a satellite with a medium-to-large deployment - for example, customer's db has:

SQL> select count(*) from rhnChannel;
  COUNT(*)
----------
       503
SQL> select count(*) from rhnServer;
 COUNT(*)
----------
     8257
SQL> select count(*) from rhnServerGroup;
 COUNT(*)
----------
      145
SQL> select count (*) from rhnServerChannel;
 COUNT(*)
----------
    18693
SQL> select count(*) from rhnChannelFamilyMembers;
 COUNT(*)
----------
      503

(an internal database with customer's dump has been set up as a reproducer, information to follow)

2. Activate a new satellite cert (sample attached) - 
rhn-satellite-activate --rhn-cert=/root/1-21-bac.cert
3. Wait.

Actual results:
Activation appears to "hang" with high cpu/mem usage on Oracle -
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
9230 oracle    25   0  236m  60m  58m R 99.9  4.1  21:28.36 oracle
Internal reproducer is a beefy satellite with embedded db and it took more than 11 hours to activate the certificate (crashing satellite services in the process).

Expected results:
Cert activation completes fast, as normal. 

Additional info:
Already discussed with engineering.

Comment 4 Xixi 2010-01-28 06:22:03 UTC

Statspack reports taken on reproducer show high CPU Time and Time for rhn_entitlements.set_group_count -

sp_4_5.lst:

SQL ordered by Reads for DB: RHNSAT  Instance: rhnsat  Snaps: 4 -5
-> End Disk Reads Threshold:      1000

                                                    CPU      Elapsd
Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        10,458           10        1,045.8   68.5 ########  38946.47  912455645
Module: python@...com (TNS V1-V3)
begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;

sp_1_2.lst
                                                    CPU      Elapsd
Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        10,694           10        1,069.4   66.5 ########  39461.51  912455645
Module: python@...com (TNS V1-V3)
begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;

And customer's DBA had reported seeing "this query hanging for quite some time:
UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = (   SELECT COUNT(SC.SERVER_ID)  FROM RHNSERVERCHANNEL SC, RHNCHANNELFAMILYMEMBERS CFM, RHNSERVER S   WHERE S.ORG_ID = :B2   AND S.ID = SC.SERVER_ID   AND CFM.CHANNEL_FAMILY_ID = :B1   AND CFM.CHANNEL_ID = SC.CHANNEL_ID ) WHERE ORG_ID = :B2   AND CHANNEL_FAMILY_ID = :B1"
...which is part of update_family_counts procedure, so one possible path would be: set_group_count calls prune_server calls clear_subscriptions calls update_family_counts, and the bottleneck is there. 

Relevant SQL:
...
       procedure set_group_count (
               customer_id_in in number,
               type_in in char,
               group_type_in in number,
               quantity_in in number
       ) is
               group_id number;
               quantity number;
       begin
               quantity := quantity_in;
               if quantity is not null and quantity < 0 then
                       quantity := 0;
               end if;
               if type_in = 'U' then
                       select  rug.id
                       into    group_id
                       from    rhnUserGroup rug
                       where   1=1
                               and rug.org_id = customer_id_in
                               and rug.group_type = group_type_in;
               elsif type_in = 'S' then
                       select  rsg.id
                       into    group_id
                       from    rhnServerGroup rsg
                       where   1=1
                               and rsg.org_id = customer_id_in
                               and rsg.group_type = group_type_in;
               end if;
               rhn_entitlements.prune_group(
                       group_id,
                       type_in,
                       quantity
               );
       exception
               when no_data_found then
                       if type_in = 'U' then
                               insert into rhnUserGroup (
                                               id, name, description, max_members, current_members,
                                               group_type, org_id, created, modified
                                       ) (
                                               select  rhn_user_group_id_seq.nextval, name, name,
                                                               quantity, 0, id, customer_id_in,
                                                               sysdate, sysdate
                                               from    rhnUserGroupType
                                               where   id = group_type_in
                               );
                       elsif type_in = 'S' then
                               insert into rhnServerGroup (
                                               id, name, description, max_members, current_members,
                                               group_type, org_id, created, modified
                                       ) (
                                               select  rhn_server_group_id_seq.nextval, name, name,
                                                               quantity, 0, id, customer_id_in,
                                                               sysdate, sysdate
                                               from    rhnServerGroupType
                                               where   id = group_type_in
                               );
                       end if;
       end set_group_count;
...
   PROCEDURE update_family_counts(channel_family_id_in IN NUMBER, org_id_in IN NUMBER)
   IS
   BEGIN
               update rhnPrivateChannelFamily
               set current_members = (
                               select  count(sc.server_id)
                               from    rhnServerChannel sc,
                                               rhnChannelFamilyMembers cfm,
                                               rhnServer s
                               where   s.org_id = org_id_in
                                       and s.id = sc.server_id
                                       and cfm.channel_family_id = channel_family_id_in
                                       and cfm.channel_id = sc.channel_id
                               )
                       where org_id = org_id_in
                               and channel_family_id = channel_family_id_in;
   END update_family_counts;
...

Comment 6 Michael Mráka 2010-01-28 20:11:16 UTC

Investigation summary:

Simplified call stack is:
rhn-satellite-activation 
  -> activate_system_entitlement(org_id_in, group_label_in)
       -> set_group_count(org_id_in, group_type)
            -> prune_group(group_id)
                  for sg in servergroups loop
                     -> rhn_channel.clear_subscriptions(sg.server_id);
                            for channel in server_channels(server_id_in) loop
                                 rhn_channel.update_family_counts(channel.channel_family_id)  
                                           UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS...

So update_family_counts() is called once per subscribed channel per server in group
which means we call it thousand times over and over per same family(!).
I'm not sure if calling update_family_counts() during activation isn't  
a non-sense and shouldn't be removed completely. Anyway... at least we should call it only
once per family which is also my proposed way to fix it.

Same call stack can be seen on satellite 4.2 and 5.3.

Comment 7 Michael Mráka 2010-01-29 13:50:32 UTC

The issue has been fixed in spacewalk.git

commit 9376fa4c50709439d71a5dc7071b22f552021820
    559447 - speed up rhn-satelite-activate
    
        we recount family group members during activation; the trouble is we do
        it in clear_subscriptions() which means we recount every family
        numerous times (once per every server subscribed to channel belonging to
        the family which sometimes means up to hundred to thousand times)
    
        so this commit added update_family_countsYN parameter to clear_subscriptions()
        which inhibits recounting inside and moved it to the
        update_group_family_counts() which does bulk update once per family

Comment 14 Xixi 2010-02-02 20:16:39 UTC

This BZ is for rhn42maint tracking. BZ for Satellite 6.0.0 fix tracking has been cloned to bug 561140.

Comment 19 Clifford Perry 2011-01-05 22:32:12 UTC

The RHN Satellite 4.x and RHN Proxy 4.x products have reached their end of life. Please see:

Satellite 4 EOL Errata
 - https://rhn.redhat.com/errata/RHSA-2011-0001.html

RHN Proxy 4 EOL Errata
 - https://rhn.redhat.com/errata/RHSA-2011-0002.html

This bugzilla was reported for a product which is no longer supported. As such we are closing this bugzilla out. If you feel this bug report should be reviewed again since it is valid for a currently supported product version, then please feel free to re-open this bug report. 

Regards,
Clifford

Note You need to log in before you can comment on or make changes to this bug.