Bug 959593

Summary:	Alert history and recent alerts views are unavailable and timeout when a large number of alerts exist
Product:	[JBoss] JBoss Operations Network	Reporter:	Larry O'Leary <loleary>
Component:	UI	Assignee:	Larry O'Leary <loleary>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Mike Foley <mfoley>
Severity:	high	Docs Contact:
Priority:	medium
Version:	JON 3.1.2	CC:	hrupp, jshaughn, loleary, mazz, myarboro
Target Milestone:	ER03	Keywords:	TestCaseNeeded
Target Release:	JON 3.2.0
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	720826	Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	720826
Bug Blocks:

Description Larry O'Leary 2013-05-03 22:11:25 UTC

Description of problem:
When there are thousands of alerts, the JBoss ON UI will fail to display the recent alerts portlet on the dashboard, resource/group summary page, and resource/group alerts history page due to a timeout error/warning:

    Message: Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
    Severity: Warning
	
    Time: Thursday, May 2, 2013 2:02:05 PM UTC+1
    Detail:	com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
    --- STACK TRACE FOLLOWS ---
    A request timeout has expired after 30000 ms
       at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
       at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.anonymous(Unknown Source)


The issue becomes much worse if the user navigates away from a page that contains a recent alert or alert history view to another page that contains the same or similar recent alert or alert history view. 

The overall impact:
 - User is not able to view alert history or recent alert portlets
 - JON server puts massive load on CPU (complete saturation of 3 or more cores)
 - JON server becomes unresponsive
 - JON server stops responding altogether


Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Almost always

Steps to Reproduce:
1.  Configure JBoss ON server with 2GB max heap.
2.  Start JBoss ON system.
3.  Import RHQ Agent resource into inventory.
4.  Create an alert definition that will get fired off 200,000 times. To make this testable without waiting for 200,000 alert conditions to occur, the following can be done to simulate 200,000 alerts:

    The following Linux shell command will produce two SQL files. One named alertDef.sql which contains the alert definition for the platform resource (id 10001) and alerts.sql which will contain <_numOfAlerts> alerts based on the alertDef. 

        _resourceId=10001
        _alertTime=1367417645665
        _alertCondTime=1367417655665
        _alertId=10001
        _numOfAlerts=200000

        echo "INSERT INTO rhq_alert_definition VALUES (10001, 'Alert Def 01', 1367353530092, 1367353530092, 0, NULL, NULL, 'MEDIUM', NULL, ${_resourceId}, NULL, true, 0, 0, false, false, false, false, false, 0, 0, NULL, 0, NULL);" >alertDef.sql
        echo "INSERT INTO rhq_alert_condition VALUES (10001, 'CONTROL', NULL, 'viewProcessList', NULL, NULL, 'SUCCESS', 10001, NULL);" >>alertDef.sql

        echo "COPY rhq_alert (id, alert_definition_id, ctime, recovery_id, will_recover, ack_time, ack_subject) FROM stdin;" >alertsTmp.sql
        echo "COPY rhq_alert_condition_log (id, ctime, alert_id, condition_id, value) FROM stdin;" >alert_conditionsTmp.sql

        for (( i=1; i<=_numOfAlerts; i++ )); do
            echo "${_alertId}"$'\t'"10001"$'\t'"${_alertTime}"$'\t'"0"$'\t'"f"$'\t'"-1"$'\t'"\N" >>alertsTmp.sql
            echo "${_alertId}"$'\t'"${_alertCondTime}"$'\t'"${_alertId}"$'\t'"10001"$'\t'"Success" >>alert_conditionsTmp.sql
            (( _alertTime += 1000 ))
            (( _alertCondTime += 1000 ))
            (( _alertId++ ))
        done
        echo "\." >>alertsTmp.sql
        echo "" >>alertsTmp.sql
        echo "\." >>alert_conditionsTmp.sql
        echo "" >>alert_conditionsTmp.sql

        cat alertsTmp.sql alert_conditionsTmp.sql >alerts.sql
        rm alertsTmp.sql alert_conditionsTmp.sql


    The two files can be imported using psql:
        psql -d rhq -f alertDef.sql
        psql -d rhq -f alerts.sql 

5.  Login to the JBoss ON UI. Presumably the dashboard should appear and the default dashboard contains the recent alerts portlet.

It may take multiple refreshes of the dashboard page or other pages that contain alert history/recent alerts in order to see the condition occur.
  
Actual results:
Recent alerts portlet contains the message "No results found using specified criteria" and the following warning appears in the UI:

    Message :	Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.

    com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 30000 ms
    --- STACK TRACE FOLLOWS ---
    A request timeout has expired after 30000 ms
       at Unknown.java_lang_Exception_Exception__Ljava_lang_String_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_RequestTimeoutException_RequestTimeoutException__Lcom_google_gwt_http_client_Request_2IV(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown Source)
       at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.anonymous(Unknown Source)


Expected results:
The 5 most recent alerts should appear after a few seconds (less then 5)

Additional info:
This appears to have been identified in upstream bug 720826 and was originally marked as resolved. However, after looking at the bug I do not see any actual resolution. We only made some performance tweaks and increased the timeout value in the hopes that the issue would impact fewer users. 

From looking at the code and the reproducer, it appears that there are multiple things at play which are causing the slow down:

1) As identified in bug 958993, the entire result set (in this case 200,000 rows) is being brought back into memory and manipulated by the server. This is done each time the query is executed. If the server is under any kind of load and the user happens to navigate away from the page and back or click the refresh or time a refresh at the same time as an auto-refresh, the result is 200,000 x {n} where {n} = the number of times the refresh occurred or a new page was requested. 

The end result is a major degradation in overall server performance and capacity while the alerts result set is being retrieved. 



2) Paging is not being used on the alert history pages. This means that all 200,000 rows are loaded into the server and then prepared for delivery to the UI. Because the result set is so large, it takes more time to process all 200,000 rows in this case. So, even though query execution at the database may have only been 8 seconds, it still takes the server 20+ seconds to process the large result set. 



3) Even if paging was being used, it appears Hibernate doesn't support this on the fetch joins we are using. I think the upstream bug 720826 describes this in some capacity. As such, we should probably not use Hibernate or check to see if this is something addressed in a later release. 

What this basically means is that even in the case of the recent alerts portlet it appears that we are paging (by default) to 5 rows. But the translated query from JPA is missing the LIMIT/OFFSET clause in the case of the alerts query. So, a query that would normally take 0.9 seconds is taken 9 seconds due to a simple "LIMIT 5" missing from the end of the translated query.



4) In comment 3 of upstream bug 720826 it states that we already have an index on the sort column 'ctime'. I am not seeing this index, nor do I see an index on any of the other columns that make up the sort or join criteria.

I did a quick test to see if indexes would have a difference and could saw an improvement of approximately 3 seconds in execution time. Doesn't seem like much but considering that is a 33% improvement, might be worth looking into. Indexes for sorting can improve query performance but I am not sure how the join criteria is used when looking at indexing. Therefore, my index commands that I tested with may not all be required:

 CREATE INDEX RHQ_ALERT_IDX_CTIME ON RHQ_ALERT (ctime);
 CREATE INDEX RHQ_ALERT_IDX_DEFID ON RHQ_ALERT (alert_definition_id);
 CREATE INDEX RHQ_ALERT_IDX_ID ON RHQ_ALERT (id);
 CREATE INDEX RHQ_ALERT_IDX_RECID ON RHQ_ALERT (recovery_id);
 CREATE INDEX RHQ_ALERT_COND_LOG_IDX ON RHQ_ALERT_CONDITION_LOG (id);

Comment 1 Jay Shaughnessy 2013-06-03 21:21:04 UTC

I have a feeling this may be fundamentally due to the problem in Bug 620603, recently fixed by Lukas.  That problem basically was responsible for loading all rows into memory in order to perform a sort.

Comment 2 Jay Shaughnessy 2013-06-04 15:19:37 UTC

As for indexing.  There should already be an index on ctime.  *** If a db is lacking this index then there must be a problem with upgrade, because it exists for newly created dbs. *** This index really MUST exist for perf.  It is used for the recent alerts portlet, the alerts report and I believe for the resource/group alert views; all of which are sorted ctime desc by default.

There is also an index of alert_definition_id, although I somewhat question this one due to the fairly low cardinality of values.  But it does support the FK to the resource, so perhaps it is useful even if bloated. It would require testing.

ID is the primary key, and is therefore indexed [unique] by default.

Indexes generate a lot of overhead on insert/update/delete so over-indexing can be a problem as well. I would not recommend further indexing without perf proof of the necessity.  But I would make sure that we have the expected indexes in place for upgraded dbs.

Comment 3 Jay Shaughnessy 2013-06-04 15:23:13 UTC

As a follow-up, if we suspect that some dbs may be lacking the proper indexing we could add logic to recreate the indexes in the next db-upgrade.  

Larry, is this something you think is needed?

Comment 4 Larry O'Leary 2013-06-04 16:04:25 UTC

I only raised the issue about CTIME because I didn't see the index when I reviewed the table metadata for RHQ_ALERT. I have taken another look and now see it as RHQ_ALERT_IDX_TIME. Not sure why I didn't see this the first time. 

(In reply to Jay Shaughnessy from comment #3)
> As a follow-up, if we suspect that some dbs may be lacking the proper
> indexing we could add logic to recreate the indexes in the next db-upgrade.  
> 
> Larry, is this something you think is needed?

No. Everything looks intact. As for indexing the other columns, it appears that the improvement seen by my test is actually 3 seconds regardless of the total query execution time without the index. I can only assume that the improvement is due to index caching being done by the database itself and does not provide a significant performance boost to warrant any changes.

I completely agree with your comment 1 and think that the lack of a LIMIT is the issue here.

Comment 5 Larry O'Leary 2013-08-09 14:48:10 UTC

I will re-test with a alpha build of 3.2 to see if the work done in bug 620603 resolves this.

Comment 6 Larry O'Leary 2013-10-11 20:23:32 UTC

Moving this to ON_QA so that a test case can be captured.

I have retested this against 3.2.0.ER3 and the fix for bug 620603 fixes this issue.

The test case should be captured so that this is part of automated testing or UI testing to ensure that a very large number of alerts can be handled/rendered in the UI.

Here are the steps I used to test in 3.2:

1.  Configure JBoss ON server with 2GB max heap.
2.  Start JBoss ON system.
3.  Import RHQ Agent resource into inventory.
4.  Create an alert definition that will get fired off 200,000 times. To make this testable without waiting for 200,000 alert conditions to occur, the following can be done to simulate 200,000 alerts:

    The following Linux shell command will produce two SQL files. One named alertDef.sql which contains the alert definition for the platform resource (id 10001) and alerts.sql which will contain <_numOfAlerts> alerts based on the alertDef. 


        function printLogTimestamp_ms() {
            timestamp=$(date +"%s|%N")
            ms=$((${timestamp#*|}/1000000 ))
            timestamp_ms=$(( (${timestamp%|*} * 1000) + ${ms}))
            echo -n "${timestamp_ms}"
        }

        _resourceId=10001
        _alertTime=$(( $(printLogTimestamp_ms)-28800000))
        _alertCondTime=$(( ${_alertTime}-600000 ))
        _alertId=10100
        _alertDefId=11123
        _alertCondId=32111
        _numOfAlerts=200000

        echo "INSERT INTO rhq_alert_definition (id, name, ctime, mtime, priority, resource_id, enabled, required, recovery_id, will_recover, notify_filtered, control_filtered, deleted, read_only, dampening_category, dampening_value, dampening_period) VALUES (${_alertDefId}, 'Test Alert Def 01', 1367353530092, 1367353530092, 'MEDIUM', ${_resourceId}, true, 0, 0, false, false, false, false, false, 0, 0, 0);" >alertDef.sql

        echo "INSERT INTO rhq_alert_condition (id, type, name, option_status, alert_definition_id) VALUES (${_alertCondId}, 'CONTROL', 'viewProcessList', 'SUCCESS', ${_alertDefId});" >>alertDef.sql

        echo "COPY rhq_alert (id, alert_definition_id, ctime, recovery_id, will_recover, ack_time, ack_subject) FROM stdin;" >alertsTmp.sql
        echo "COPY rhq_alert_condition_log (id, ctime, alert_id, condition_id, value) FROM stdin;" >alert_conditionsTmp.sql

        for (( i=1; i<=_numOfAlerts; i++ )); do
            echo "${_alertId}"$'\t'"${_alertDefId}"$'\t'"${_alertTime}"$'\t'"0"$'\t'"f"$'\t'"-1"$'\t'"\N" >>alertsTmp.sql
            echo "${_alertId}"$'\t'"${_alertCondTime}"$'\t'"${_alertId}"$'\t'"${_alertCondId}"$'\t'"Success" >>alert_conditionsTmp.sql
            (( _alertTime += 1000 ))
            (( _alertCondTime += 1000 ))
            (( _alertId++ ))
        done
        echo "\." >>alertsTmp.sql
        echo "" >>alertsTmp.sql
        echo "\." >>alert_conditionsTmp.sql
        echo "" >>alert_conditionsTmp.sql

        cat alertsTmp.sql alert_conditionsTmp.sql >alerts.sql
        rm alertsTmp.sql alert_conditionsTmp.sql


    The two files can be imported using psql:
        psql -d rhq -f alertDef.sql
        psql -d rhq -f alerts.sql 

5.  Login to the JBoss ON UI.
6.  From the Dashboard page that contains the recent alerts portlet, perform a page refresh 5 times.
7.  Navigate to the alerts history page for the platform resource.
8.  Verify total rows is 200000.
9.  Perform page refresh 5 times.

In the end, the goal is to get at least 200,000 alerts to be displayed in the UI over several page reloads (within a 30 second time span) to confirm that such actions won't result in an out-of-memory condition to occur on the server or the UI to fail with a timeout.