Bug 559447
| Summary: | Unable to activate Satellite cert (oracle "hangs" with high cpu/mem consumption) - performance issue with update_family_counts() proc | |||
|---|---|---|---|---|
| Product: | Red Hat Satellite 5 | Reporter: | Xixi <xdmoon> | |
| Component: | Server | Assignee: | Michael Mráka <mmraka> | |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Satellite QA List <satqe-list> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 423 | CC: | cperry, pbat, tao, xdmoon | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 561140 (view as bug list) | Environment: | ||
| Last Closed: | 2011-01-05 22:32:12 UTC | Type: | --- | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 158643 | |||
Statspack reports taken on reproducer show high CPU Time and Time for rhn_entitlements.set_group_count -
sp_4_5.lst:
SQL ordered by Reads for DB: RHNSAT Instance: rhnsat Snaps: 4 -5
-> End Disk Reads Threshold: 1000
CPU Elapsd
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
10,458 10 1,045.8 68.5 ######## 38946.47 912455645
Module: python@...com (TNS V1-V3)
begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;
sp_1_2.lst
CPU Elapsd
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
10,694 10 1,069.4 66.5 ######## 39461.51 912455645
Module: python@...com (TNS V1-V3)
begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end;
And customer's DBA had reported seeing "this query hanging for quite some time:
UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = ( SELECT COUNT(SC.SERVER_ID) FROM RHNSERVERCHANNEL SC, RHNCHANNELFAMILYMEMBERS CFM, RHNSERVER S WHERE S.ORG_ID = :B2 AND S.ID = SC.SERVER_ID AND CFM.CHANNEL_FAMILY_ID = :B1 AND CFM.CHANNEL_ID = SC.CHANNEL_ID ) WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1"
...which is part of update_family_counts procedure, so one possible path would be: set_group_count calls prune_server calls clear_subscriptions calls update_family_counts, and the bottleneck is there.
Relevant SQL:
...
procedure set_group_count (
customer_id_in in number,
type_in in char,
group_type_in in number,
quantity_in in number
) is
group_id number;
quantity number;
begin
quantity := quantity_in;
if quantity is not null and quantity < 0 then
quantity := 0;
end if;
if type_in = 'U' then
select rug.id
into group_id
from rhnUserGroup rug
where 1=1
and rug.org_id = customer_id_in
and rug.group_type = group_type_in;
elsif type_in = 'S' then
select rsg.id
into group_id
from rhnServerGroup rsg
where 1=1
and rsg.org_id = customer_id_in
and rsg.group_type = group_type_in;
end if;
rhn_entitlements.prune_group(
group_id,
type_in,
quantity
);
exception
when no_data_found then
if type_in = 'U' then
insert into rhnUserGroup (
id, name, description, max_members, current_members,
group_type, org_id, created, modified
) (
select rhn_user_group_id_seq.nextval, name, name,
quantity, 0, id, customer_id_in,
sysdate, sysdate
from rhnUserGroupType
where id = group_type_in
);
elsif type_in = 'S' then
insert into rhnServerGroup (
id, name, description, max_members, current_members,
group_type, org_id, created, modified
) (
select rhn_server_group_id_seq.nextval, name, name,
quantity, 0, id, customer_id_in,
sysdate, sysdate
from rhnServerGroupType
where id = group_type_in
);
end if;
end set_group_count;
...
PROCEDURE update_family_counts(channel_family_id_in IN NUMBER, org_id_in IN NUMBER)
IS
BEGIN
update rhnPrivateChannelFamily
set current_members = (
select count(sc.server_id)
from rhnServerChannel sc,
rhnChannelFamilyMembers cfm,
rhnServer s
where s.org_id = org_id_in
and s.id = sc.server_id
and cfm.channel_family_id = channel_family_id_in
and cfm.channel_id = sc.channel_id
)
where org_id = org_id_in
and channel_family_id = channel_family_id_in;
END update_family_counts;
...
Investigation summary:
Simplified call stack is:
rhn-satellite-activation
-> activate_system_entitlement(org_id_in, group_label_in)
-> set_group_count(org_id_in, group_type)
-> prune_group(group_id)
for sg in servergroups loop
-> rhn_channel.clear_subscriptions(sg.server_id);
for channel in server_channels(server_id_in) loop
rhn_channel.update_family_counts(channel.channel_family_id)
UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS...
So update_family_counts() is called once per subscribed channel per server in group
which means we call it thousand times over and over per same family(!).
I'm not sure if calling update_family_counts() during activation isn't
a non-sense and shouldn't be removed completely. Anyway... at least we should call it only
once per family which is also my proposed way to fix it.
Same call stack can be seen on satellite 4.2 and 5.3.
The issue has been fixed in spacewalk.git
commit 9376fa4c50709439d71a5dc7071b22f552021820
559447 - speed up rhn-satelite-activate
we recount family group members during activation; the trouble is we do
it in clear_subscriptions() which means we recount every family
numerous times (once per every server subscribed to channel belonging to
the family which sometimes means up to hundred to thousand times)
so this commit added update_family_countsYN parameter to clear_subscriptions()
which inhibits recounting inside and moved it to the
update_group_family_counts() which does bulk update once per family
This BZ is for rhn42maint tracking. BZ for Satellite 6.0.0 fix tracking has been cloned to bug 561140. The RHN Satellite 4.x and RHN Proxy 4.x products have reached their end of life. Please see: Satellite 4 EOL Errata - https://rhn.redhat.com/errata/RHSA-2011-0001.html RHN Proxy 4 EOL Errata - https://rhn.redhat.com/errata/RHSA-2011-0002.html This bugzilla was reported for a product which is no longer supported. As such we are closing this bugzilla out. If you feel this bug report should be reviewed again since it is valid for a currently supported product version, then please feel free to re-open this bug report. Regards, Clifford |
Description of problem: Customer received a new RHN Satellite certificate, but is unable to activate. It does the activation on the RHN Hosted side just fine, but hangs when attempting to apply it to the local Satellite (whether through rhn-satellite-activate or through satellite-sync to pull down from Hosted, all attempts seem to "hang"). This is a external-database Satellite, customer notes that the load/cpu is minimal on the satellite itself, but the database box is spinning at 100%. The DBA observers the following query hanging for quite some time: UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = ( SELECT COUNT(SC.SERVER_ID) FROM RHNSERVERCHANNEL SC, RHNCHANNELFAMILYMEMBERS CFM, RHNSERVER S WHERE S.ORG_ID = :B2 AND S.ID = SC.SERVER_ID AND CFM.CHANNEL_FAMILY_ID = :B1 AND CFM.CHANNEL_ID = SC.CHANNEL_ID ) WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1 Version-Release number of selected component (if applicable): Red Hat Network (RHN) Satellite 4.2.3 System architecture(s): RHEL 5 x86 How reproducible: Always. Steps to Reproduce: 1. Have a satellite with a medium-to-large deployment - for example, customer's db has: SQL> select count(*) from rhnChannel; COUNT(*) ---------- 503 SQL> select count(*) from rhnServer; COUNT(*) ---------- 8257 SQL> select count(*) from rhnServerGroup; COUNT(*) ---------- 145 SQL> select count (*) from rhnServerChannel; COUNT(*) ---------- 18693 SQL> select count(*) from rhnChannelFamilyMembers; COUNT(*) ---------- 503 (an internal database with customer's dump has been set up as a reproducer, information to follow) 2. Activate a new satellite cert (sample attached) - rhn-satellite-activate --rhn-cert=/root/1-21-bac.cert 3. Wait. Actual results: Activation appears to "hang" with high cpu/mem usage on Oracle - PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 9230 oracle 25 0 236m 60m 58m R 99.9 4.1 21:28.36 oracle Internal reproducer is a beefy satellite with embedded db and it took more than 11 hours to activate the certificate (crashing satellite services in the process). Expected results: Cert activation completes fast, as normal. Additional info: Already discussed with engineering.