Created attachment 742933 [details]
Server log with DEBUG enabled for org.rhq

Description of problem:

If there are thousands of alerts, the JBoss ON server becomes slow or unresponsive after attempting to view the Recent Alerts dashboard portlet or the alerts history page for a resource or resource group. Eventually, the server stops working altogether after a java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space exception is logged. Depending on the environment, this can happen with as few as 50,000 alerts in the rhq_alerts table.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Almost always

Steps to Reproduce:
1. Start the JBoss ON system.
2. Import the RHQ Agent resource into inventory.
3. Create an alert definition that will fire 200,000 times. To make this testable without waiting for 200,000 alert conditions to occur, the following can be done to simulate 200,000 alerts.

The following Linux shell script produces two SQL files: alertDef.sql, which contains the alert definition for the platform resource (id 10001), and alerts.sql, which contains <_numOfAlerts> alerts based on that alert definition.

_resourceId=10001
_alertTime=1367417645665
_alertCondTime=1367417655665
_alertId=10001
_numOfAlerts=200000

echo "INSERT INTO rhq_alert_definition VALUES (10001, 'Alert Def 01', 1367353530092, 1367353530092, 0, NULL, NULL, 'MEDIUM', NULL, ${_resourceId}, NULL, true, 0, 0, false, false, false, false, false, 0, 0, NULL, 0, NULL);" >alertDef.sql
echo "INSERT INTO rhq_alert_condition VALUES (10001, 'CONTROL', NULL, 'viewProcessList', NULL, NULL, 'SUCCESS', 10001, NULL);" >>alertDef.sql

echo "COPY rhq_alert (id, alert_definition_id, ctime, recovery_id, will_recover, ack_time, ack_subject) FROM stdin;" >alertsTmp.sql
echo "COPY rhq_alert_condition_log (id, ctime, alert_id, condition_id, value) FROM stdin;" >alert_conditionsTmp.sql

for (( i=1; i<=_numOfAlerts; i++ )); do
    echo "${_alertId}"$'\t'"10001"$'\t'"${_alertTime}"$'\t'"0"$'\t'"f"$'\t'"-1"$'\t'"\N" >>alertsTmp.sql
    echo "${_alertId}"$'\t'"${_alertCondTime}"$'\t'"${_alertId}"$'\t'"10001"$'\t'"Success" >>alert_conditionsTmp.sql
    (( _alertTime += 1000 ))
    (( _alertCondTime += 1000 ))
    (( _alertId++ ))
done

echo "\." >>alertsTmp.sql
echo "" >>alertsTmp.sql
echo "\." >>alert_conditionsTmp.sql
echo "" >>alert_conditionsTmp.sql

cat alertsTmp.sql alert_conditionsTmp.sql >alerts.sql
rm alertsTmp.sql alert_conditionsTmp.sql

The two files can be imported using psql:

psql -d rhq -f alertDef.sql
psql -d rhq -f alerts.sql

4. Log in to the JBoss ON UI. The dashboard should appear, and the default dashboard contains the Recent Alerts portlet.

Actual results:

After 30 seconds, the following warning appears in the UI:

Message: Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
com.google.gwt.http.client.RequestTimeoutException: A request timeout has expired after 30000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 30000 ms
    at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
    at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
    at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
    at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
    at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
    at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.anonymous(Unknown Source)

At this point the server is very sluggish and may become unresponsive. After about 20 minutes, the server throws an OutOfMemoryError exception similar to:

2013-05-02 16:50:36,871 DEBUG [org.rhq.enterprise.server.core.plugin.AgentPluginScanner] Scanning for agent plugins
2013-05-02 16:51:54,011 DEBUG [org.rhq.enterprise.server.util.HibernatePerformanceMonitor] HibernateStats[ queries=1, xactions=0, loads=1, connects=1, time=272118 ](perf: slowness?) for SLSB:CloudManagerBean.getServerByName
2013-05-02 16:51:43,575 ERROR [STDERR] Exception in thread "Timer-0"
2013-05-02 16:55:23,001 ERROR [STDERR] java.lang.OutOfMemoryError: Java heap space
2013-05-02 16:55:16,302 DEBUG [org.rhq.enterprise.server.util.CriteriaQueryRunner] restriction=null, resultSize=0, resultCount=0
2013-05-02 16:54:41,810 DEBUG [org.rhq.enterprise.server.util.HibernatePerformanceMonitor] HibernateStats[ queries=2, xactions=3, loads=1, connects=3, time=498407 ](perf: slowness?) for SLSB:StatusManagerBean.getAndClearAgentsWithStatusForServer

Expected results:

The server should remain stable and not throw an OutOfMemoryError exception.

Additional info:
Is this essentially the same as bug #959593?
The root cause is most likely the same. With bug 959593, the alerts history page does not display because of a UI timeout, regardless of available memory. With this bug, the server throws an OutOfMemoryError if the JVM's max heap is not large enough. In both cases, the cause appears to be that no limit/filter is applied to the database query, so the entire contents of the alerts table are loaded into memory.
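For illustration only (this is not the actual RHQ server code): a minimal JPA sketch contrasting an unbounded query, which materializes every rhq_alert row on the heap, with a paged query, where setFirstResult/setMaxResults push the limit down to the database. The entity and field names (Alert, ctime) are assumptions made for the sketch.

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

// Hypothetical sketch; Alert/ctime are assumed names, not the real RHQ classes.
public class AlertPagingSketch {

    // Unbounded: every alert row is loaded into the heap before the portlet
    // can render anything -- with 200,000 alerts this is the pattern that
    // drives the server toward "GC overhead limit exceeded" / "Java heap space".
    static List<Object[]> fetchAllAlerts(EntityManager em) {
        return em.createQuery(
                "SELECT a.id, a.ctime FROM Alert a", Object[].class)
            .getResultList();
    }

    // Bounded: setFirstResult/setMaxResults let the database apply
    // LIMIT/OFFSET, so only one page of rows is ever held in memory.
    static List<Object[]> fetchAlertPage(EntityManager em, int page, int pageSize) {
        TypedQuery<Object[]> q = em.createQuery(
                "SELECT a.id, a.ctime FROM Alert a ORDER BY a.ctime DESC",
                Object[].class);
        q.setFirstResult(page * pageSize);
        q.setMaxResults(pageSize);
        return q.getResultList();
    }
}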
As with bug 959593, I have a feeling this may be fundamentally due to the problem in bug 620603, recently fixed by Lukas. That problem was basically responsible for loading all rows into memory in order to perform a sort.
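To make that point concrete, here is a rough sketch (not the actual 620603 fix; entity and getter names are assumed) contrasting a sort performed in Java, which forces every row to be materialized first, with an ORDER BY plus a result cap delegated to the database.

// Sketch only; Alert/getCtime are assumptions for illustration.
// In-memory sort: every alert must be loaded before the sort can run,
// even though the UI only shows the first page.
List<Alert> all = em.createQuery("SELECT a FROM Alert a", Alert.class)
        .getResultList();
all.sort(Comparator.comparingLong(Alert::getCtime).reversed());
List<Alert> firstPage = all.subList(0, Math.min(100, all.size()));

// Database-side sort: ORDER BY plus setMaxResults keeps only one page
// of rows in memory.
List<Alert> firstPageDb = em.createQuery(
        "SELECT a FROM Alert a ORDER BY a.ctime DESC", Alert.class)
    .setMaxResults(100)
    .getResultList();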
I will re-test with an alpha build of 3.2 to see if the work done in bug 620603 resolves this.
The fix for bug 620603 also fixed this bug. The test case for this is described in bug 959593. Marking this as VERIFIED in 3.2.0.ER3 as I was not able to reproduce this issue using 200,000 alerts.