Description of problem: Customer received a new RHN Satellite certificate, but is unable to activate. It does the activation on the RHN Hosted side just fine, but hangs when attempting to apply it to the local Satellite (whether through rhn-satellite-activate or through satellite-sync to pull down from Hosted, all attempts seem to "hang"). This is a external-database Satellite, customer notes that the load/cpu is minimal on the satellite itself, but the database box is spinning at 100%. The DBA observers the following query hanging for quite some time: UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = ( SELECT COUNT(SC.SERVER_ID) FROM RHNSERVERCHANNEL SC, RHNCHANNELFAMILYMEMBERS CFM, RHNSERVER S WHERE S.ORG_ID = :B2 AND S.ID = SC.SERVER_ID AND CFM.CHANNEL_FAMILY_ID = :B1 AND CFM.CHANNEL_ID = SC.CHANNEL_ID ) WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1 Version-Release number of selected component (if applicable): Red Hat Network (RHN) Satellite 4.2.3 System architecture(s): RHEL 5 x86 How reproducible: Always. Steps to Reproduce: 1. Have a satellite with a medium-to-large deployment - for example, customer's db has: SQL> select count(*) from rhnChannel; COUNT(*) ---------- 503 SQL> select count(*) from rhnServer; COUNT(*) ---------- 8257 SQL> select count(*) from rhnServerGroup; COUNT(*) ---------- 145 SQL> select count (*) from rhnServerChannel; COUNT(*) ---------- 18693 SQL> select count(*) from rhnChannelFamilyMembers; COUNT(*) ---------- 503 (an internal database with customer's dump has been set up as a reproducer, information to follow) 2. Activate a new satellite cert (sample attached) - rhn-satellite-activate --rhn-cert=/root/1-21-bac.cert 3. Wait. Actual results: Activation appears to "hang" with high cpu/mem usage on Oracle - PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 9230 oracle 25 0 236m 60m 58m R 99.9 4.1 21:28.36 oracle Internal reproducer is a beefy satellite with embedded db and it took more than 11 hours to activate the certificate (crashing satellite services in the process). Expected results: Cert activation completes fast, as normal. Additional info: Already discussed with engineering.
Statspack reports taken on reproducer show high CPU Time and Time for rhn_entitlements.set_group_count - sp_4_5.lst: SQL ordered by Reads for DB: RHNSAT Instance: rhnsat Snaps: 4 -5 -> End Disk Reads Threshold: 1000 CPU Elapsd Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 10,458 10 1,045.8 68.5 ######## 38946.47 912455645 Module: python@...com (TNS V1-V3) begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end; sp_1_2.lst CPU Elapsd Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 10,694 10 1,069.4 66.5 ######## 39461.51 912455645 Module: python@...com (TNS V1-V3) begin rhn_entitlements.set_group_count(:1,:2,:3,:4); end; And customer's DBA had reported seeing "this query hanging for quite some time: UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = ( SELECT COUNT(SC.SERVER_ID) FROM RHNSERVERCHANNEL SC, RHNCHANNELFAMILYMEMBERS CFM, RHNSERVER S WHERE S.ORG_ID = :B2 AND S.ID = SC.SERVER_ID AND CFM.CHANNEL_FAMILY_ID = :B1 AND CFM.CHANNEL_ID = SC.CHANNEL_ID ) WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1" ...which is part of update_family_counts procedure, so one possible path would be: set_group_count calls prune_server calls clear_subscriptions calls update_family_counts, and the bottleneck is there. Relevant SQL: ... procedure set_group_count ( customer_id_in in number, type_in in char, group_type_in in number, quantity_in in number ) is group_id number; quantity number; begin quantity := quantity_in; if quantity is not null and quantity < 0 then quantity := 0; end if; if type_in = 'U' then select rug.id into group_id from rhnUserGroup rug where 1=1 and rug.org_id = customer_id_in and rug.group_type = group_type_in; elsif type_in = 'S' then select rsg.id into group_id from rhnServerGroup rsg where 1=1 and rsg.org_id = customer_id_in and rsg.group_type = group_type_in; end if; rhn_entitlements.prune_group( group_id, type_in, quantity ); exception when no_data_found then if type_in = 'U' then insert into rhnUserGroup ( id, name, description, max_members, current_members, group_type, org_id, created, modified ) ( select rhn_user_group_id_seq.nextval, name, name, quantity, 0, id, customer_id_in, sysdate, sysdate from rhnUserGroupType where id = group_type_in ); elsif type_in = 'S' then insert into rhnServerGroup ( id, name, description, max_members, current_members, group_type, org_id, created, modified ) ( select rhn_server_group_id_seq.nextval, name, name, quantity, 0, id, customer_id_in, sysdate, sysdate from rhnServerGroupType where id = group_type_in ); end if; end set_group_count; ... PROCEDURE update_family_counts(channel_family_id_in IN NUMBER, org_id_in IN NUMBER) IS BEGIN update rhnPrivateChannelFamily set current_members = ( select count(sc.server_id) from rhnServerChannel sc, rhnChannelFamilyMembers cfm, rhnServer s where s.org_id = org_id_in and s.id = sc.server_id and cfm.channel_family_id = channel_family_id_in and cfm.channel_id = sc.channel_id ) where org_id = org_id_in and channel_family_id = channel_family_id_in; END update_family_counts; ...
Investigation summary: Simplified call stack is: rhn-satellite-activation -> activate_system_entitlement(org_id_in, group_label_in) -> set_group_count(org_id_in, group_type) -> prune_group(group_id) for sg in servergroups loop -> rhn_channel.clear_subscriptions(sg.server_id); for channel in server_channels(server_id_in) loop rhn_channel.update_family_counts(channel.channel_family_id) UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS... So update_family_counts() is called once per subscribed channel per server in group which means we call it thousand times over and over per same family(!). I'm not sure if calling update_family_counts() during activation isn't a non-sense and shouldn't be removed completely. Anyway... at least we should call it only once per family which is also my proposed way to fix it. Same call stack can be seen on satellite 4.2 and 5.3.
The issue has been fixed in spacewalk.git commit 9376fa4c50709439d71a5dc7071b22f552021820 559447 - speed up rhn-satelite-activate we recount family group members during activation; the trouble is we do it in clear_subscriptions() which means we recount every family numerous times (once per every server subscribed to channel belonging to the family which sometimes means up to hundred to thousand times) so this commit added update_family_countsYN parameter to clear_subscriptions() which inhibits recounting inside and moved it to the update_group_family_counts() which does bulk update once per family
This BZ is for rhn42maint tracking. BZ for Satellite 6.0.0 fix tracking has been cloned to bug 561140.
The RHN Satellite 4.x and RHN Proxy 4.x products have reached their end of life. Please see: Satellite 4 EOL Errata - https://rhn.redhat.com/errata/RHSA-2011-0001.html RHN Proxy 4 EOL Errata - https://rhn.redhat.com/errata/RHSA-2011-0002.html This bugzilla was reported for a product which is no longer supported. As such we are closing this bugzilla out. If you feel this bug report should be reviewed again since it is valid for a currently supported product version, then please feel free to re-open this bug report. Regards, Clifford