Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 573153

Summary:	Satellite 5.2 & 5.3: Systems tab or System Groups link takes 2-3+ min to load or fails with "503 Service Temporarily Unavailable"
Product:	Red Hat Satellite 5	Reporter:	Xixi <xdmoon>
Component:	Server	Assignee:	Justin Sherrill <jsherril>
Status:	CLOSED ERRATA	QA Contact:	Petr Sklenar <psklenar>
Severity:	high	Docs Contact:
Priority:	high
Version:	521	CC:	cperry, ktoyama, psklenar, tao, xdmoon
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-06-22 14:04:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	518256

Description Xixi 2010-03-13 01:00:18 UTC

Description of problem:
When clicking on the Systems tab (link at the top) or the System Groups link, the page could take upwards of 2-3 minutes to load, if at all.  Sometimes, a "503 Service Temporarily Unavailable" page will be displayed instead. 
Customer has had to extend the "DATA_TBS" and "UNDO_TBS" Oracle tablespaces using 'db-control extend" when they noticed that they were full or nearly full.

This has been an ongoing issue since their upgrade to 5.2.  However, it has gotten really bad recently, notably right after updating to sat-5.2.1 on 1/20.  They were still running the hotfix rpms up until a few weeks ago when they installed all available updates (rhns-* rpms included), which effectively brought their Satellite to revision 5.2.1.

Version-Release number of selected component (if applicable):
Red Hat Network (RHN) Satellite 5.2.0

System architecture(s):
Red Hat Enterprise Linux Server release 5.4 (Tikanga) x86_64
rhns-5.2.1-1.el5-noarch                  

How reproducible:
Always.

Steps to Reproduce:
1. Have a 5.2 satellite and import customer db dump. 
2. Click on Systems tab or manually try to go to rhn/systems/Overview.do and rhn/systems/SystemList.do
3. Wait.

Actual results:
"503 Service Temporarily Unavailable" (it loads sometimes for customer after more than 2-4 minutes but on my reproducer always times out)

Expected results:
Page loads in reasonable time, no 503 timeouts.

Additional info:

Comment 2 Xixi 2010-03-13 01:05:14 UTC

Some stats from customer db:

SQL> SELECT COUNT(*) from web_contact;

 COUNT(*)
----------
       55

SQL> SELECT COUNT(*) FROM rhnUserManagedServerGroups;

 COUNT(*)
----------
      542

SQL> SELECT COUNT(*) FROM rhnServerGroupMembers;

 COUNT(*)
----------
     2603

SQL> select count(*) from rhnServerNeededErrataCache;

 COUNT(*)
----------
  2948048

SQL> select count(*) from rhnServerNeededPackageCache;

 COUNT(*)
----------
  7921017


Things I've tried so far:

1. reproduced issue and took statspack report #1, sp_1_11.lst, which highlights the following query which appears to be visible_to_user_overview_fast -

...
^LSQL ordered by CPU  DB/Inst: RHNSAT/rhnsat  Snaps: 1-11
-> Resources reported for PL/SQL code includes the resources used by all SQL
  statements called by the code.
-> Total DB CPU (s):             409
-> Captured SQL accounts for   98.8% of Total DB CPU
-> SQL reported below exceeded  1.0% of Total DB CPU

   CPU                  CPU per             Elapsd                     Old
 Time (s)   Executions  Exec (s)  %Total   Time (s)    Buffer Gets  Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
   401.90            1     401.90   98.4     693.47      95,896,297  721418455
SELECT             G.ID AS ID, G.NAME AS NAME,
    (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id)          AS GROUP_ADMINS,    (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id                  AND EXISTS ( SELECT 1

 Elapsed                Elap per            CPU                        Old
 Time (s)   Executions  Exec (s)  %Total   Time (s)  Physical Reads Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
   693.47            1     693.47   94.1     401.90         322,395  721418455
SELECT             G.ID AS ID, G.NAME AS NAME,
    (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id)          AS GROUP_ADMINS,    (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id                  AND EXISTS ( SELECT 1

 Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    95,896,297            1   95,896,297.0  100.4   401.90    693.47  721418455
SELECT             G.ID AS ID, G.NAME AS NAME,
    (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id)          AS GROUP_ADMINS,    (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id                  AND EXISTS ( SELECT 1

                                                    CPU      Elapsd     Old
Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
       322,395            1      322,395.0   99.8   401.90    693.47  721418455
SELECT             G.ID AS ID, G.NAME AS NAME,
    (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id)          AS GROUP_ADMINS,    (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id                  AND EXISTS ( SELECT 1
...

2. Updated indexes to be noparallel: 

alter index RHN_SGMEMBERS_SGID_SID_IDX noparallel;
alter index RHN_UGMEMBERS_UID_UGID_UQ noparallel;
alter index RHN_UG_ORG_ID_GTYPE_IDX noparallel;
alter index WEB_CONTACT_ID_OID_CUST_LUC noparallel;
alter index WEB_CONTACT_OID_ID noparallel;
alter index RHN_UG_ORG_ID_TYPE_IDX noparallel;
alter index RHN_UGMEMBERS_UGID_UID_IDX noparallel;
alter index WEB_USER_SITE_INFO_WUID noparallel;
alter index RHN_SP_SNEP_IDX noparallel;
alter index RHN_SNEC_OID_EID_SID_IDX noparallel;
alter index RHN_SNEC_SID_EID_IDX noparallel;
alter index RHN_SNEC_EID_SID_IDX noparallel;
alter index RHN_UG_ORG_ID_NAME_IDX noparallel;
alter index RHN_UG_ID_NAME_ORG noparallel;
alter index RHN_SG_OID_TYPE_ID_IDX noparallel;
alter index RHN_SG_TYPE_ID_IDX noparallel;
alter index RHN_SG_OID_ID_NAME_IDX noparallel;
alter index RHN_SG_ID_OID_NAME_IDX noparallel;
alter index RHN_SNPC_OID_IDX noparallel;
alter index RHN_SNPC_EID_IDX noparallel;
alter index RHN_SNPC_SID_IDX noparallel;
alter index RHN_SNPC_PID_IDX noparallel;

no visible difference in UI behavior.

3. took another statspack report sp_21_31.lst which shows slightly better times, after step 2 above:

...
^LSQL ordered by CPU  DB/Inst: RHNSAT/rhnsat  Snaps: 21-31
-> Resources reported for PL/SQL code includes the resources used by all SQL
  statements called by the code.
-> Total DB CPU (s):              78
-> Captured SQL accounts for   96.7% of Total DB CPU
-> SQL reported below exceeded  1.0% of Total DB CPU

   CPU                  CPU per             Elapsd                     Old
 Time (s)   Executions  Exec (s)  %Total   Time (s)    Buffer Gets  Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
    71.98            1      71.98   91.8     119.00      36,928,272 2836748205
SELECT             G.ID AS ID, G.NAME AS NAME,
    (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id)          AS GROUP_ADMINS,    (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id                  AND EXISTS ( SELECT 1

     2.42            1       2.42    3.1       3.56           9,991  338002664
Module: SQL*Plus
begin   :snap := statspack.snap(i_snap_level=>7);   end;

   119.00            1     119.00   90.9      71.98          17,268 2836748205
SELECT             G.ID AS ID, G.NAME AS NAME,
    (SELECT COUNT(*) FROM rhnUserManagedServerGroups UMSG WHERE
UMSG.server_group_id = G.id)          AS GROUP_ADMINS,    (SELE
CT COUNT(*) FROM rhnServerGroupMembers SGM WHERE SGM.server_grou
p_id = G.id                  AND EXISTS ( SELECT 1
...

4. Tried workaround from https://bugzilla.redhat.com/show_bug.cgi?id=525575#c12 - no visible difference still, i.e., clicking on Systems tab still times out.

5. Attempting a gather-stats to see if that'll help, the first time it failed due to TEMP tablespace running out of free space -

-bash-3.2$ db-control gather-stats
Gathering statistics...
WARNING: this may be a very slow process.
DBD::Oracle::db do failed: ORA-12801: error signaled in parallel query server P003
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP_TBS
ORA-06512: at "SYS.DBMS_STATS", line 13591
ORA-06512: at "SYS.DBMS_STATS", line 13937
ORA-06512: at "SYS.DBMS_STATS", line 14015
ORA-06512: at "SYS.DBMS_STATS", line 13974
ORA-06512: at line 1 (DBD ERROR: OCIStmtExecute) [for Statement "begin dbms_stats.gather_schema_stats(NULL, ESTIMATE_PERCENT=> 15, DEGREE=>DBMS_STATS.DEFAULT_DEGREE, CASCADE=>TRUE); end;"] at /usr/lib/perl5/site_perl/5.8.8/Dobby/DB.pm line 174.

...as confirmed via sqlplus:

SQL> select TABLESPACE_NAME, BYTES_USED, BYTES_FREE from V$TEMP_SPACE_HEADER;
select * from V$TEMP_SPACE_HEADER;

TABLESPACE_NAME                BYTES_USED BYTES_FREE
------------------------------ ---------- ----------
TEMP_TBS                        262144000          0

SQL>
TABLESPACE_NAME                   FILE_ID BYTES_USED BLOCKS_USED BYTES_FREE
------------------------------ ---------- ---------- ----------- ----------
BLOCKS_FREE RELATIVE_FNO
----------- ------------
TEMP_TBS                                1  262144000       32000          0
         0            1

So I added more datafiles to TEMP_TBS and ran gather-stats again.

alter TABLESPACE TEMP_TBS ADD TEMPFILE '/rhnsat/data/rhnsat/temp_02.dbf' SIZE 512M REUSE;
alter TABLESPACE TEMP_TBS ADD TEMPFILE '/rhnsat/data/rhnsat/temp_03.dbf' SIZE 512M REUSE;

-bash-3.2$ db-control gather-stats
Gathering statistics...
WARNING: this may be a very slow process.

...problem persists after gather-stats.

6. Tried turning off Jasper development mode https://bugzilla.redhat.com/show_bug.cgi?id=533527

[root@dhcp243-194 ~]# diff -ru /etc/tomcat5/web.xml.orig /etc/tomcat5/web.xml
--- /etc/tomcat5/web.xml.orig   2009-06-25 12:44:46.000000000 -0400
+++ /etc/tomcat5/web.xml        2010-03-11 15:37:41.000000000 -0500
@@ -220,6 +220,10 @@
            <param-name>xpoweredBy</param-name>
            <param-value>false</param-value>
        </init-param>
+        <init-param>
+            <param-name>development</param-name>
+            <param-value>false</param-value>
+        </init-param>
        <load-on-startup>3</load-on-startup>
    </servlet>

7. Upped apache connection timeout:

[root@dhcp243-194 ~]# diff -ru /etc/httpd/conf/httpd.conf.orig /etc/httpd/conf/httpd.conf
--- /etc/httpd/conf/httpd.conf.orig     2009-11-10 09:09:14.000000000 -0500
+++ /etc/httpd/conf/httpd.conf  2010-03-11 16:00:30.000000000 -0500
@@ -65,7 +65,8 @@
#
# Timeout: The number of seconds before receives and sends time out.
#
-Timeout 120
+#Timeout 120
+Timeout 300

#
# KeepAlive: Whether or not to allow persistent connections (more than

Still getting same error.

Comment 8 Xixi 2010-04-01 18:50:38 UTC

Public update:
This also appears to affect Satellite 5.3, although the workaround and long-term fix will be different.  This BZ is for tracking the fix for 5.3.x.

Comment 10 Justin Sherrill 2010-04-07 15:36:51 UTC

fixed in commit:

73369cb5971a658c38f3f41e8354ad1a15a67fb4


Basically i reduced the 3 sub queries down to a single one, and changed it to use decode which seemed to make it a lot more speedy :]

Will build some packages to test properly on our reproducer.

Comment 23 Petr Sklenar 2010-06-09 15:54:25 UTC

test procedure:
have customer db from Comment 573153#c19 - thanks much Xixi :) 

OLD: I could see 503 Service Temporarily Unavailable on Overview.do, SystemList.do, SystemGroupList.do pages

NEW:
I could open repeatedly these links:
https://satellite-1.brq.redhat.com/rhn/systems/Overview.do
https://satellite-1.brq.redhat.com/rhn/systems/SystemList.do
https://satellite-1.brq.redhat.com/rhn/systems/SystemGroupList.do
I takes longer (1-2 min) but I haven't seen 503 error

Verified with rhel55@x8664 with sat530+ upgraded form errata.stage and with errata's packages:
spacewalk-java-0.5.44-79.el5sat
spacewalk-java-config-0.5.44-79.el5sat
spacewalk-java-lib-0.5.44-79.el5sat
spacewalk-taskomatic-0.5.44-79.el5sat
cobbler-1.6.6-8.el5sat
cobbler-debuginfo-1.6.6-8.el5sat
spacewalk-base-0.5.23-35.el5sat
spacewalk-base-minimal-0.5.23-35.el5sat
spacewalk-cypress-0.5.23-35.el5sat
spacewalk-grail-0.5.23-35.el5sat
spacewalk-html-0.5.23-35.el5sat
spacewalk-moon-0.5.23-35.el5sat
spacewalk-pxt-0.5.23-35.el5sat
spacewalk-sniglets-0.5.23-35.el5sat

machine info: Mem= 4Gigs, CPU= 2x Intel Xeon @ 1.60GHz

Comment 25 errata-xmlrpc 2010-06-22 14:04:26 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0498.html