Created attachment 742933 [details]
Server log with DEBUG enabled for org.rhq

Description of problem:

If there are thousands of alerts, the JBoss ON server becomes slow or unresponsive after attempting to view the Recent Alerts dashboard portlet or the alerts history page for a resource or resource group. Eventually, the server stops working altogether after a java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space exception is logged. Depending on the environment, this can happen with as few as 50,000 alerts in the rhq_alerts table.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Almost always

Steps to Reproduce:
1. Start the JBoss ON system.
2. Import the RHQ Agent resource into inventory.
3. Create an alert definition that will fire 200,000 times. To make this testable without waiting for 200,000 alert conditions to occur, the following can be done to simulate 200,000 alerts.

The following Linux shell script produces two SQL files: alertDef.sql, which contains the alert definition for the platform resource (id 10001), and alerts.sql, which contains <_numOfAlerts> alerts based on that alert definition.

_resourceId=10001
_alertTime=1367417645665
_alertCondTime=1367417655665
_alertId=10001
_numOfAlerts=200000

echo "INSERT INTO rhq_alert_definition VALUES (10001, 'Alert Def 01', 1367353530092, 1367353530092, 0, NULL, NULL, 'MEDIUM', NULL, ${_resourceId}, NULL, true, 0, 0, false, false, false, false, false, 0, 0, NULL, 0, NULL);" >alertDef.sql
echo "INSERT INTO rhq_alert_condition VALUES (10001, 'CONTROL', NULL, 'viewProcessList', NULL, NULL, 'SUCCESS', 10001, NULL);" >>alertDef.sql

echo "COPY rhq_alert (id, alert_definition_id, ctime, recovery_id, will_recover, ack_time, ack_subject) FROM stdin;" >alertsTmp.sql
echo "COPY rhq_alert_condition_log (id, ctime, alert_id, condition_id, value) FROM stdin;" >alert_conditionsTmp.sql

for (( i=1; i<=_numOfAlerts; i++ )); do
    echo "${_alertId}"$'\t'"10001"$'\t'"${_alertTime}"$'\t'"0"$'\t'"f"$'\t'"-1"$'\t'"\N" >>alertsTmp.sql
    echo "${_alertId}"$'\t'"${_alertCondTime}"$'\t'"${_alertId}"$'\t'"10001"$'\t'"Success" >>alert_conditionsTmp.sql
    (( _alertTime += 1000 ))
    (( _alertCondTime += 1000 ))
    (( _alertId++ ))
done

echo "\." >>alertsTmp.sql
echo "" >>alertsTmp.sql
echo "\." >>alert_conditionsTmp.sql
echo "" >>alert_conditionsTmp.sql

cat alertsTmp.sql alert_conditionsTmp.sql >alerts.sql
rm alertsTmp.sql alert_conditionsTmp.sql

The two files can be imported using psql:

psql -d rhq -f alertDef.sql
psql -d rhq -f alerts.sql

4. Log in to the JBoss ON UI. The dashboard should appear, and the default dashboard contains the Recent Alerts portlet.

Actual results:

After 30 seconds, the following warning appears in the UI:

Message: Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
com.google.gwt.http.client.RequestTimeoutException: A request timeout has expired after 30000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 30000 ms
    at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
    at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
    at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
    at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
    at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
    at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.anonymous(Unknown Source)

At this point the server is very sluggish and may become unresponsive. After about 20 minutes, the server throws an OutOfMemoryError exception similar to:

2013-05-02 16:50:36,871 DEBUG [org.rhq.enterprise.server.core.plugin.AgentPluginScanner] Scanning for agent plugins
2013-05-02 16:51:54,011 DEBUG [org.rhq.enterprise.server.util.HibernatePerformanceMonitor] HibernateStats[ queries=1, xactions=0, loads=1, connects=1, time=272118 ](perf: slowness?) for SLSB:CloudManagerBean.getServerByName
2013-05-02 16:51:43,575 ERROR [STDERR] Exception in thread "Timer-0"
2013-05-02 16:55:23,001 ERROR [STDERR] java.lang.OutOfMemoryError: Java heap space
2013-05-02 16:55:16,302 DEBUG [org.rhq.enterprise.server.util.CriteriaQueryRunner] restriction=null, resultSize=0, resultCount=0
2013-05-02 16:54:41,810 DEBUG [org.rhq.enterprise.server.util.HibernatePerformanceMonitor] HibernateStats[ queries=2, xactions=3, loads=1, connects=3, time=498407 ](perf: slowness?) for SLSB:StatusManagerBean.getAndClearAgentsWithStatusForServer

Expected results:

The server should remain stable and not throw an OutOfMemoryError exception.

Additional info:
Is this essentially the same as bug #959593?
The root cause is most likely the same. With bug 959593, the alerts history page does not display because of a UI timeout, regardless of available memory. With this bug, the server throws an OutOfMemoryError if the JVM's max heap is not large enough. In both cases, the cause appears to be that no limit/filter is applied to the database query, so the entire contents of the alerts table are loaded into memory.
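For illustration only (this is not the actual RHQ server code): a minimal JPA sketch contrasting an unbounded query, which materializes every rhq_alert row on the heap, with a paged query, where setFirstResult/setMaxResults push the limit down to the database. The entity and field names (Alert, ctime) are assumptions made for the sketch.

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

// Hypothetical sketch; Alert/ctime are assumed names, not the real RHQ classes.
public class AlertPagingSketch {

    // Unbounded: every alert row is loaded into the heap before the portlet
    // can render anything -- with 200,000 alerts this is the pattern that
    // drives the server toward "GC overhead limit exceeded" / "Java heap space".
    static List<Object[]> fetchAllAlerts(EntityManager em) {
        return em.createQuery(
                "SELECT a.id, a.ctime FROM Alert a", Object[].class)
            .getResultList();
    }

    // Bounded: setFirstResult/setMaxResults let the database apply
    // LIMIT/OFFSET, so only one page of rows is ever held in memory.
    static List<Object[]> fetchAlertPage(EntityManager em, int page, int pageSize) {
        TypedQuery<Object[]> q = em.createQuery(
                "SELECT a.id, a.ctime FROM Alert a ORDER BY a.ctime DESC",
                Object[].class);
        q.setFirstResult(page * pageSize);
        q.setMaxResults(pageSize);
        return q.getResultList();
    }
}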
As with bug 959593, I have a feeling this may be fundamentally due to the problem in bug 620603, recently fixed by Lukas. That problem was basically responsible for loading all rows into memory in order to perform a sort.
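To make that point concrete, here is a rough sketch (not the actual 620603 fix; entity and getter names are assumed) contrasting a sort performed in Java, which forces every row to be materialized first, with an ORDER BY plus a result cap delegated to the database.

// Sketch only; Alert/getCtime are assumptions for illustration.
// In-memory sort: every alert must be loaded before the sort can run,
// even though the UI only shows the first page.
List<Alert> all = em.createQuery("SELECT a FROM Alert a", Alert.class)
        .getResultList();
all.sort(Comparator.comparingLong(Alert::getCtime).reversed());
List<Alert> firstPage = all.subList(0, Math.min(100, all.size()));

// Database-side sort: ORDER BY plus setMaxResults keeps only one page
// of rows in memory.
List<Alert> firstPageDb = em.createQuery(
        "SELECT a FROM Alert a ORDER BY a.ctime DESC", Alert.class)
    .setMaxResults(100)
    .getResultList();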
I will re-test with an alpha build of 3.2 to see if the work done in bug 620603 resolves this.
The fix for bug 620603 also fixed this bug. The test case for this is described in bug 959593. Marking this as VERIFIED in 3.2.0.ER3 as I was not able to reproduce this issue using 200,000 alerts.