Bug 573153
| Summary: | Satellite 5.2 & 5.3: Systems tab or System Groups link takes 2-3+ min to load or fails with "503 Service Temporarily Unavailable" | ||
|---|---|---|---|
| Product: | Red Hat Satellite 5 | Reporter: | Xixi <xdmoon> |
| Component: | Server | Assignee: | Justin Sherrill <jsherril> |
| Status: | CLOSED ERRATA | QA Contact: | Petr Sklenar <psklenar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 521 | CC: | cperry, ktoyama, psklenar, tao, xdmoon |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2010-06-22 14:04:26 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 518256 | ||
|
Description
Xixi
2010-03-13 01:00:18 UTC
Some stats from customer db:
SQL> SELECT COUNT(*) from web_contact;
COUNT(*)
----------
55
SQL> SELECT COUNT(*) FROM rhnUserManagedServerGroups;
COUNT(*)
----------
542
SQL> SELECT COUNT(*) FROM rhnServerGroupMembers;
COUNT(*)
----------
2603
SQL> select count(*) from rhnServerNeededErrataCache;
COUNT(*)
----------
2948048
SQL> select count(*) from rhnServerNeededPackageCache;
COUNT(*)
----------
7921017
Things I've tried so far:
1. reproduced issue and took statspack report #1, sp_1_11.lst, which highlights the following query which appears to be visible_to_user_overview_fast -
...
^LSQL ordered by CPU DB/Inst: RHNSAT/rhnsat Snaps: 1-11
-> Resources reported for PL/SQL code includes the resources used by all SQL
statements called by the code.
-> Total DB CPU (s): 409
-> Captured SQL accounts for 98.8% of Total DB CPU
-> SQL reported below exceeded 1.0% of Total DB CPU
CPU CPU per Elapsd Old
Time (s) Executions Exec (s) %Total Time (s) Buffer Gets Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
401.90 1 401.90 98.4 693.47 95,896,297 721418455
SELECT G.ID AS ID, G.NAME AS NAME,
(SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id AND EXISTS ( SELECT 1
Elapsed Elap per CPU Old
Time (s) Executions Exec (s) %Total Time (s) Physical Reads Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
693.47 1 693.47 94.1 401.90 322,395 721418455
SELECT G.ID AS ID, G.NAME AS NAME,
(SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id AND EXISTS ( SELECT 1
Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
95,896,297 1 95,896,297.0 100.4 401.90 693.47 721418455
SELECT G.ID AS ID, G.NAME AS NAME,
(SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id AND EXISTS ( SELECT 1
CPU Elapsd Old
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
322,395 1 322,395.0 99.8 401.90 693.47 721418455
SELECT G.ID AS ID, G.NAME AS NAME,
(SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id AND EXISTS ( SELECT 1
...
2. Updated indexes to be noparallel:
alter index RHN_SGMEMBERS_SGID_SID_IDX noparallel;
alter index RHN_UGMEMBERS_UID_UGID_UQ noparallel;
alter index RHN_UG_ORG_ID_GTYPE_IDX noparallel;
alter index WEB_CONTACT_ID_OID_CUST_LUC noparallel;
alter index WEB_CONTACT_OID_ID noparallel;
alter index RHN_UG_ORG_ID_TYPE_IDX noparallel;
alter index RHN_UGMEMBERS_UGID_UID_IDX noparallel;
alter index WEB_USER_SITE_INFO_WUID noparallel;
alter index RHN_SP_SNEP_IDX noparallel;
alter index RHN_SNEC_OID_EID_SID_IDX noparallel;
alter index RHN_SNEC_SID_EID_IDX noparallel;
alter index RHN_SNEC_EID_SID_IDX noparallel;
alter index RHN_UG_ORG_ID_NAME_IDX noparallel;
alter index RHN_UG_ID_NAME_ORG noparallel;
alter index RHN_SG_OID_TYPE_ID_IDX noparallel;
alter index RHN_SG_TYPE_ID_IDX noparallel;
alter index RHN_SG_OID_ID_NAME_IDX noparallel;
alter index RHN_SG_ID_OID_NAME_IDX noparallel;
alter index RHN_SNPC_OID_IDX noparallel;
alter index RHN_SNPC_EID_IDX noparallel;
alter index RHN_SNPC_SID_IDX noparallel;
alter index RHN_SNPC_PID_IDX noparallel;
no visible difference in UI behavior.
3. took another statspack report sp_21_31.lst which shows slightly better times, after step 2 above:
...
^LSQL ordered by CPU DB/Inst: RHNSAT/rhnsat Snaps: 21-31
-> Resources reported for PL/SQL code includes the resources used by all SQL
statements called by the code.
-> Total DB CPU (s): 78
-> Captured SQL accounts for 96.7% of Total DB CPU
-> SQL reported below exceeded 1.0% of Total DB CPU
CPU CPU per Elapsd Old
Time (s) Executions Exec (s) %Total Time (s) Buffer Gets Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
71.98 1 71.98 91.8 119.00 36,928,272 2836748205
SELECT G.ID AS ID, G.NAME AS NAME,
(SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id AND EXISTS ( SELECT 1
2.42 1 2.42 3.1 3.56 9,991 338002664
Module: SQL*Plus
begin :snap := statspack.snap(i_snap_level=>7); end;
119.00 1 119.00 90.9 71.98 17,268 2836748205
SELECT G.ID AS ID, G.NAME AS NAME,
(SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id AND EXISTS ( SELECT 1
...
4. Tried workaround from https://bugzilla.redhat.com/show_bug.cgi?id=525575#c12 - no visible difference still, i.e., clicking on Systems tab still times out.
5. Attempting a gather-stats to see if that'll help, the first time it failed due to TEMP tablespace running out of free space -
-bash-3.2$ db-control gather-stats
Gathering statistics...
WARNING: this may be a very slow process.
DBD::Oracle::db do failed: ORA-12801: error signaled in parallel query server P003
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP_TBS
ORA-06512: at "SYS.DBMS_STATS", line 13591
ORA-06512: at "SYS.DBMS_STATS", line 13937
ORA-06512: at "SYS.DBMS_STATS", line 14015
ORA-06512: at "SYS.DBMS_STATS", line 13974
ORA-06512: at line 1 (DBD ERROR: OCIStmtExecute) [for Statement "begin dbms_stats.gather_schema_stats(NULL, ESTIMATE_PERCENT=> 15, DEGREE=>DBMS_STATS.DEFAULT_DEGREE, CASCADE=>TRUE); end;"] at /usr/lib/perl5/site_perl/5.8.8/Dobby/DB.pm line 174.
...as confirmed via sqlplus:
SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER;
select * from V$TEMP_SPACE_HEADER;
TABLESPACE_NAME BYTES_USED BYTES_FREE
------------------------------ ---------- ----------
TEMP_TBS 262144000 0
SQL>
TABLESPACE_NAME FILE_ID BYTES_USED BLOCKS_USED BYTES_FREE
------------------------------ ---------- ---------- ----------- ----------
BLOCKS_FREE RELATIVE_FNO
----------- ------------
TEMP_TBS 1 262144000 32000 0
0 1
So I added more datafiles to TEMP_TBS and ran gather-stats again.
alter TABLESPACE TEMP_TBS ADD TEMPFILE '/rhnsat/data/rhnsat/temp_02.dbf' SIZE 512M REUSE;
alter TABLESPACE TEMP_TBS ADD TEMPFILE '/rhnsat/data/rhnsat/temp_03.dbf' SIZE 512M REUSE;
-bash-3.2$ db-control gather-stats
Gathering statistics...
WARNING: this may be a very slow process.
...problem persists after gather-stats.
6. Tried turning off Jasper development mode https://bugzilla.redhat.com/show_bug.cgi?id=533527
[root@dhcp243-194 ~]# diff -ru /etc/tomcat5/web.xml.orig /etc/tomcat5/web.xml
--- /etc/tomcat5/web.xml.orig 2009-06-25 12:44:46.000000000 -0400
+++ /etc/tomcat5/web.xml 2010-03-11 15:37:41.000000000 -0500
@@ -220,6 +220,10 @@
<param-name>xpoweredBy</param-name>
<param-value>false</param-value>
</init-param>
+ <init-param>
+ <param-name>development</param-name>
+ <param-value>false</param-value>
+ </init-param>
<load-on-startup>3</load-on-startup>
</servlet>
7. Upped apache connection timeout:
[root@dhcp243-194 ~]# diff -ru /etc/httpd/conf/httpd.conf.orig /etc/httpd/conf/httpd.conf
--- /etc/httpd/conf/httpd.conf.orig 2009-11-10 09:09:14.000000000 -0500
+++ /etc/httpd/conf/httpd.conf 2010-03-11 16:00:30.000000000 -0500
@@ -65,7 +65,8 @@
#
# Timeout: The number of seconds before receives and sends time out.
#
-Timeout 120
+#Timeout 120
+Timeout 300
#
# KeepAlive: Whether or not to allow persistent connections (more than
Still getting same error.
Public update: This also appears to affect Satellite 5.3, although the workaround and long-term fix will be different. This BZ is for tracking the fix for 5.3.x. fixed in commit: 73369cb5971a658c38f3f41e8354ad1a15a67fb4 Basically i reduced the 3 sub queries down to a single one, and changed it to use decode which seemed to make it a lot more speedy :] Will build some packages to test properly on our reproducer. test procedure: have customer db from Comment 573153#c19 - thanks much Xixi :) OLD: I could see 503 Service Temporarily Unavailable on Overview.do, SystemList.do, SystemGroupList.do pages NEW: I could open repeatedly these links: https://satellite-1.brq.redhat.com/rhn/systems/Overview.do https://satellite-1.brq.redhat.com/rhn/systems/SystemList.do https://satellite-1.brq.redhat.com/rhn/systems/SystemGroupList.do I takes longer (1-2 min) but I haven't seen 503 error Verified with rhel55@x8664 with sat530+ upgraded form errata.stage and with errata's packages: spacewalk-java-0.5.44-79.el5sat spacewalk-java-config-0.5.44-79.el5sat spacewalk-java-lib-0.5.44-79.el5sat spacewalk-taskomatic-0.5.44-79.el5sat cobbler-1.6.6-8.el5sat cobbler-debuginfo-1.6.6-8.el5sat spacewalk-base-0.5.23-35.el5sat spacewalk-base-minimal-0.5.23-35.el5sat spacewalk-cypress-0.5.23-35.el5sat spacewalk-grail-0.5.23-35.el5sat spacewalk-html-0.5.23-35.el5sat spacewalk-moon-0.5.23-35.el5sat spacewalk-pxt-0.5.23-35.el5sat spacewalk-sniglets-0.5.23-35.el5sat machine info: Mem= 4Gigs, CPU= 2x Intel Xeon @ 1.60GHz An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0498.html |