Bug 958993 - Alert history portlet or page can crash server if thousands of alerts exist
Summary: Alert history portlet or page can crash server if thousands of alerts exist
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: UI
Version: JON 3.1.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ER03
Target Release: JON 3.2.0
Assignee: RHQ Project Maintainer
QA Contact: Larry O'Leary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-02 22:19 UTC by Larry O'Leary
Modified: 2018-12-01 15:12 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-02 20:36:35 UTC
Type: Bug
Embargoed:


Attachments
Server log with DEBUG enabled for org.rhq (342.98 KB, text/x-log)
2013-05-02 22:19 UTC, Larry O'Leary


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 720826 0 urgent NEW Recent Alerts subsystem view times out if a large number of alerts exist 2024-03-04 13:35:15 UTC
Red Hat Bugzilla 959593 0 medium CLOSED Alert history and recent alerts views are unavailable and timeout when a large number of alerts exist 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 362084 0 None None None Never

Internal Links: 720826 959593

Description Larry O'Leary 2013-05-02 22:19:19 UTC
Created attachment 742933 [details]
Server log with DEBUG enabled for org.rhq

Description of problem:
If there are thousands of alerts, the JBoss ON server will become slow or unresponsive after attempting to view the recent alerts dashboard portlet or the alert history page for a resource or resource group. Eventually, the server will stop working altogether after a java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space exception is logged.

Depending on the environment, this can happen with as few as 50,000 alerts in the rhq_alert table.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Almost always

Steps to Reproduce:
1.  Start JBoss ON system.
2.  Import RHQ Agent resource into inventory.
3.  Create an alert definition that will fire 200,000 times. To make this testable without waiting for 200,000 alert conditions to occur, the following can be done to simulate 200,000 alerts:

    The following Linux shell script produces two SQL files: alertDef.sql, which contains the alert definition for the platform resource (id 10001), and alerts.sql, which contains ${_numOfAlerts} alerts based on that definition.

        _resourceId=10001
        _alertTime=1367417645665
        _alertCondTime=1367417655665
        _alertId=10001
        _numOfAlerts=200000

        echo "INSERT INTO rhq_alert_definition VALUES (10001, 'Alert Def 01', 1367353530092, 1367353530092, 0, NULL, NULL, 'MEDIUM', NULL, ${_resourceId}, NULL, true, 0, 0, false, false, false, false, false, 0, 0, NULL, 0, NULL);" >alertDef.sql
        echo "INSERT INTO rhq_alert_condition VALUES (10001, 'CONTROL', NULL, 'viewProcessList', NULL, NULL, 'SUCCESS', 10001, NULL);" >>alertDef.sql

        echo "COPY rhq_alert (id, alert_definition_id, ctime, recovery_id, will_recover, ack_time, ack_subject) FROM stdin;" >alertsTmp.sql
        echo "COPY rhq_alert_condition_log (id, ctime, alert_id, condition_id, value) FROM stdin;" >alert_conditionsTmp.sql

        for (( i=1; i<=_numOfAlerts; i++ )); do
            echo "${_alertId}"$'\t'"10001"$'\t'"${_alertTime}"$'\t'"0"$'\t'"f"$'\t'"-1"$'\t'"\N" >>alertsTmp.sql
            echo "${_alertId}"$'\t'"${_alertCondTime}"$'\t'"${_alertId}"$'\t'"10001"$'\t'"Success" >>alert_conditionsTmp.sql
            (( _alertTime += 1000 ))
            (( _alertCondTime += 1000 ))
            (( _alertId++ ))
        done
        echo "\." >>alertsTmp.sql
        echo "" >>alertsTmp.sql
        echo "\." >>alert_conditionsTmp.sql
        echo "" >>alert_conditionsTmp.sql

        cat alertsTmp.sql alert_conditionsTmp.sql >alerts.sql
        rm alertsTmp.sql alert_conditionsTmp.sql


    The two files can be imported using psql:
        psql -d rhq -f alertDef.sql
        psql -d rhq -f alerts.sql 
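
    To confirm the import, a quick sanity check (assuming the same rhq database name used above) is to count the loaded rows; both counts should be at least ${_numOfAlerts}:

        # Both tables should contain at least ${_numOfAlerts} rows after the import
        psql -d rhq -c "SELECT count(*) FROM rhq_alert;"
        psql -d rhq -c "SELECT count(*) FROM rhq_alert_condition_log;"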

4.  Log in to the JBoss ON UI. The dashboard should appear, and the default dashboard contains the recent alerts portlet.
  
Actual results:
After 30 seconds, the following warning appears in the UI:

    Message :	Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.

    com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
    --- STACK TRACE FOLLOWS ---
    A request timeout has expired after 30000 ms
       at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
       at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.anonymous(Unknown Source)

At this point the server is very sluggish and may become unresponsive. After about 20 minutes, the server will throw an OutOfMemoryError exception, with log output similar to:

    2013-05-02 16:50:36,871 DEBUG [org.rhq.enterprise.server.core.plugin.AgentPluginScanner] Scanning for agent plugins
    2013-05-02 16:51:54,011 DEBUG [org.rhq.enterprise.server.util.HibernatePerformanceMonitor] HibernateStats[ queries=1, xactions=0, loads=1, connects=1, time=272118 ](perf: slowness?)  for SLSB:CloudManagerBean.getServerByName
    2013-05-02 16:51:43,575 ERROR [STDERR] Exception in thread "Timer-0" 
    2013-05-02 16:55:23,001 ERROR [STDERR] java.lang.OutOfMemoryError: Java heap space
    2013-05-02 16:55:16,302 DEBUG [org.rhq.enterprise.server.util.CriteriaQueryRunner] restriction=null, resultSize=0, resultCount=0
    2013-05-02 16:54:41,810 DEBUG [org.rhq.enterprise.server.util.HibernatePerformanceMonitor] HibernateStats[ queries=2, xactions=3, loads=1, connects=3, time=498407 ](perf: slowness?)  for SLSB:StatusManagerBean.getAndClearAgentsWithStatusForServer


Expected results:
The server should remain stable and not throw an OutOfMemoryError exception.

Additional info:

Comment 1 John Mazzitelli 2013-05-13 18:57:28 UTC
Is this essentially the same as bug #959593?

Comment 2 Larry O'Leary 2013-05-13 19:38:08 UTC
The root cause of the issue is most likely the same. With bug 959593 the alert history page does not display due to a UI timeout, regardless of memory.

With this bug, the server will OOME in the event that the JVM's max heap isn't large enough.

In both cases, the cause appears to be that no limit/filter is applied to the database query, so the entire contents of the alert table are loaded into memory.
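
For illustration only (this is not the actual query the server issues), the difference amounts to letting the database sort and page the result set instead of materializing every alert row in the JVM:

    -- Unbounded: effectively what happens today; every row ends up in server memory
    SELECT * FROM rhq_alert ORDER BY ctime DESC;

    -- Bounded: the database sorts and returns a single page at a time
    SELECT * FROM rhq_alert ORDER BY ctime DESC LIMIT 100 OFFSET 0;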

Comment 3 Jay Shaughnessy 2013-06-03 21:22:10 UTC
As with bug 959593, I have a feeling this may be fundamentally due to the problem in bug 620603, recently fixed by Lukas. That problem was basically responsible for loading all rows into memory in order to perform a sort.

Comment 4 Larry O'Leary 2013-08-09 14:52:17 UTC
I will re-test with an alpha build of 3.2 to see if the work done in bug 620603 resolves this.

Comment 5 Larry O'Leary 2013-10-11 20:26:39 UTC
The fix for bug 620603 also fixed this bug. The test case for this is described in bug 959593. 

Marking this as VERIFIED in 3.2.0.ER3 as I was not able to reproduce this issue using 200,000 alerts.

