Description of problem: When clicking on the Systems tab (link at the top) or the System Groups link, the page could take upwards of 2-3 minutes to load, if at all. Sometimes, a "503 Service Temporarily Unavailable" page will be displayed instead. Customer has had to extend the "DATA_TBS" and "UNDO_TBS" Oracle tablespaces using 'db-control extend" when they noticed that they were full or nearly full. This has been an ongoing issue since their upgrade to 5.2. However, it has gotten really bad recently, notably right after updating to sat-5.2.1 on 1/20. They were still running the hotfix rpms up until a few weeks ago when they installed all available updates (rhns-* rpms included), which effectively brought their Satellite to revision 5.2.1. Version-Release number of selected component (if applicable): Red Hat Network (RHN) Satellite 5.2.0 System architecture(s): Red Hat Enterprise Linux Server release 5.4 (Tikanga) x86_64 rhns-5.2.1-1.el5-noarch How reproducible: Always. Steps to Reproduce: 1. Have a 5.2 satellite and import customer db dump. 2. Click on Systems tab or manually try to go to rhn/systems/Overview.do and rhn/systems/SystemList.do 3. Wait. Actual results: "503 Service Temporarily Unavailable" (it loads sometimes for customer after more than 2-4 minutes but on my reproducer always times out) Expected results: Page loads in reasonable time, no 503 timeouts. Additional info:
Some stats from customer db: SQL> SELECT COUNT(*) from web_contact; COUNT(*) ---------- 55 SQL> SELECT COUNT(*) FROM rhnUserManagedServerGroups; COUNT(*) ---------- 542 SQL> SELECT COUNT(*) FROM rhnServerGroupMembers; COUNT(*) ---------- 2603 SQL> select count(*) from rhnServerNeededErrataCache; COUNT(*) ---------- 2948048 SQL> select count(*) from rhnServerNeededPackageCache; COUNT(*) ---------- 7921017 Things I've tried so far: 1. reproduced issue and took statspack report #1, sp_1_11.lst, which highlights the following query which appears to be visible_to_user_overview_fast - ... ^LSQL ordered by CPU DB/Inst: RHNSAT/rhnsat Snaps: 1-11 -> Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code. -> Total DB CPU (s): 409 -> Captured SQL accounts for 98.8% of Total DB CPU -> SQL reported below exceeded 1.0% of Total DB CPU CPU CPU per Elapsd Old Time (s) Executions Exec (s) %Total Time (s) Buffer Gets Hash Value ---------- ------------ ---------- ------ ---------- --------------- ---------- 401.90 1 401.90 98.4 693.47 95,896,297 721418455 SELECT G.ID AS ID, G.NAME AS NAME, (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou p_id = G.id AND EXISTS ( SELECT 1 Elapsed Elap per CPU Old Time (s) Executions Exec (s) %Total Time (s) Physical Reads Hash Value ---------- ------------ ---------- ------ ---------- --------------- ---------- 693.47 1 693.47 94.1 401.90 322,395 721418455 SELECT G.ID AS ID, G.NAME AS NAME, (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou p_id = G.id AND EXISTS ( SELECT 1 Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 95,896,297 1 95,896,297.0 100.4 401.90 693.47 721418455 SELECT G.ID AS ID, G.NAME AS NAME, (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou p_id = G.id AND EXISTS ( SELECT 1 CPU Elapsd Old Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 322,395 1 322,395.0 99.8 401.90 693.47 721418455 SELECT G.ID AS ID, G.NAME AS NAME, (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou p_id = G.id AND EXISTS ( SELECT 1 ... 2. Updated indexes to be noparallel: alter index RHN_SGMEMBERS_SGID_SID_IDX noparallel; alter index RHN_UGMEMBERS_UID_UGID_UQ noparallel; alter index RHN_UG_ORG_ID_GTYPE_IDX noparallel; alter index WEB_CONTACT_ID_OID_CUST_LUC noparallel; alter index WEB_CONTACT_OID_ID noparallel; alter index RHN_UG_ORG_ID_TYPE_IDX noparallel; alter index RHN_UGMEMBERS_UGID_UID_IDX noparallel; alter index WEB_USER_SITE_INFO_WUID noparallel; alter index RHN_SP_SNEP_IDX noparallel; alter index RHN_SNEC_OID_EID_SID_IDX noparallel; alter index RHN_SNEC_SID_EID_IDX noparallel; alter index RHN_SNEC_EID_SID_IDX noparallel; alter index RHN_UG_ORG_ID_NAME_IDX noparallel; alter index RHN_UG_ID_NAME_ORG noparallel; alter index RHN_SG_OID_TYPE_ID_IDX noparallel; alter index RHN_SG_TYPE_ID_IDX noparallel; alter index RHN_SG_OID_ID_NAME_IDX noparallel; alter index RHN_SG_ID_OID_NAME_IDX noparallel; alter index RHN_SNPC_OID_IDX noparallel; alter index RHN_SNPC_EID_IDX noparallel; alter index RHN_SNPC_SID_IDX noparallel; alter index RHN_SNPC_PID_IDX noparallel; no visible difference in UI behavior. 3. took another statspack report sp_21_31.lst which shows slightly better times, after step 2 above: ... ^LSQL ordered by CPU DB/Inst: RHNSAT/rhnsat Snaps: 21-31 -> Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code. -> Total DB CPU (s): 78 -> Captured SQL accounts for 96.7% of Total DB CPU -> SQL reported below exceeded 1.0% of Total DB CPU CPU CPU per Elapsd Old Time (s) Executions Exec (s) %Total Time (s) Buffer Gets Hash Value ---------- ------------ ---------- ------ ---------- --------------- ---------- 71.98 1 71.98 91.8 119.00 36,928,272 2836748205 SELECT G.ID AS ID, G.NAME AS NAME, (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou p_id = G.id AND EXISTS ( SELECT 1 2.42 1 2.42 3.1 3.56 9,991 338002664 Module: SQL*Plus begin :snap := statspack.snap(i_snap_level=>7); end; 119.00 1 119.00 90.9 71.98 17,268 2836748205 SELECT G.ID AS ID, G.NAME AS NAME, (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE UMSG.server_group_id = G.id) AS GROUP_ADMINS, (SELE CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou p_id = G.id AND EXISTS ( SELECT 1 ... 4. Tried workaround from https://bugzilla.redhat.com/show_bug.cgi?id=525575#c12 - no visible difference still, i.e., clicking on Systems tab still times out. 5. Attempting a gather-stats to see if that'll help, the first time it failed due to TEMP tablespace running out of free space - -bash-3.2$ db-control gather-stats Gathering statistics... WARNING: this may be a very slow process. DBD::Oracle::db do failed: ORA-12801: error signaled in parallel query server P003 ORA-01652: unable to extend temp segment by 128 in tablespace TEMP_TBS ORA-06512: at "SYS.DBMS_STATS", line 13591 ORA-06512: at "SYS.DBMS_STATS", line 13937 ORA-06512: at "SYS.DBMS_STATS", line 14015 ORA-06512: at "SYS.DBMS_STATS", line 13974 ORA-06512: at line 1 (DBD ERROR: OCIStmtExecute) [for Statement "begin dbms_stats.gather_schema_stats(NULL, ESTIMATE_PERCENT=> 15, DEGREE=>DBMS_STATS.DEFAULT_DEGREE, CASCADE=>TRUE); end;"] at /usr/lib/perl5/site_perl/5.8.8/Dobby/DB.pm line 174. ...as confirmed via sqlplus: SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER; select * from V$TEMP_SPACE_HEADER; TABLESPACE_NAME BYTES_USED BYTES_FREE ------------------------------ ---------- ---------- TEMP_TBS 262144000 0 SQL> TABLESPACE_NAME FILE_ID BYTES_USED BLOCKS_USED BYTES_FREE ------------------------------ ---------- ---------- ----------- ---------- BLOCKS_FREE RELATIVE_FNO ----------- ------------ TEMP_TBS 1 262144000 32000 0 0 1 So I added more datafiles to TEMP_TBS and ran gather-stats again. alter TABLESPACE TEMP_TBS ADD TEMPFILE '/rhnsat/data/rhnsat/temp_02.dbf' SIZE 512M REUSE; alter TABLESPACE TEMP_TBS ADD TEMPFILE '/rhnsat/data/rhnsat/temp_03.dbf' SIZE 512M REUSE; -bash-3.2$ db-control gather-stats Gathering statistics... WARNING: this may be a very slow process. ...problem persists after gather-stats. 6. Tried turning off Jasper development mode https://bugzilla.redhat.com/show_bug.cgi?id=533527 [root@dhcp243-194 ~]# diff -ru /etc/tomcat5/web.xml.orig /etc/tomcat5/web.xml --- /etc/tomcat5/web.xml.orig 2009-06-25 12:44:46.000000000 -0400 +++ /etc/tomcat5/web.xml 2010-03-11 15:37:41.000000000 -0500 @@ -220,6 +220,10 @@ <param-name>xpoweredBy</param-name> <param-value>false</param-value> </init-param> + <init-param> + <param-name>development</param-name> + <param-value>false</param-value> + </init-param> <load-on-startup>3</load-on-startup> </servlet> 7. Upped apache connection timeout: [root@dhcp243-194 ~]# diff -ru /etc/httpd/conf/httpd.conf.orig /etc/httpd/conf/httpd.conf --- /etc/httpd/conf/httpd.conf.orig 2009-11-10 09:09:14.000000000 -0500 +++ /etc/httpd/conf/httpd.conf 2010-03-11 16:00:30.000000000 -0500 @@ -65,7 +65,8 @@ # # Timeout: The number of seconds before receives and sends time out. # -Timeout 120 +#Timeout 120 +Timeout 300 # # KeepAlive: Whether or not to allow persistent connections (more than Still getting same error.
Public update: This also appears to affect Satellite 5.3, although the workaround and long-term fix will be different. This BZ is for tracking the fix for 5.3.x.
fixed in commit: 73369cb5971a658c38f3f41e8354ad1a15a67fb4 Basically i reduced the 3 sub queries down to a single one, and changed it to use decode which seemed to make it a lot more speedy :] Will build some packages to test properly on our reproducer.
test procedure: have customer db from Comment 573153#c19 - thanks much Xixi :) OLD: I could see 503 Service Temporarily Unavailable on Overview.do, SystemList.do, SystemGroupList.do pages NEW: I could open repeatedly these links: https://satellite-1.brq.redhat.com/rhn/systems/Overview.do https://satellite-1.brq.redhat.com/rhn/systems/SystemList.do https://satellite-1.brq.redhat.com/rhn/systems/SystemGroupList.do I takes longer (1-2 min) but I haven't seen 503 error Verified with rhel55@x8664 with sat530+ upgraded form errata.stage and with errata's packages: spacewalk-java-0.5.44-79.el5sat spacewalk-java-config-0.5.44-79.el5sat spacewalk-java-lib-0.5.44-79.el5sat spacewalk-taskomatic-0.5.44-79.el5sat cobbler-1.6.6-8.el5sat cobbler-debuginfo-1.6.6-8.el5sat spacewalk-base-0.5.23-35.el5sat spacewalk-base-minimal-0.5.23-35.el5sat spacewalk-cypress-0.5.23-35.el5sat spacewalk-grail-0.5.23-35.el5sat spacewalk-html-0.5.23-35.el5sat spacewalk-moon-0.5.23-35.el5sat spacewalk-pxt-0.5.23-35.el5sat spacewalk-sniglets-0.5.23-35.el5sat machine info: Mem= 4Gigs, CPU= 2x Intel Xeon @ 1.60GHz
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0498.html