Bug 720826

Summary: Recent Alerts subsystem view times out if a large number of alerts exist
Product: [Other] RHQ Project
Reporter: Ian Springer <ian.springer>
Component: Core UI
Assignee: Nobody <nobody>
Status: NEW ---
QA Contact:
Severity: high
Docs Contact:
Priority: urgent
Version: 4.0.1, 4.4
CC: hrupp, loleary, mazz
Target Milestone: ---
Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 959593
Environment:
Last Closed: 2012-02-07 19:27:58 UTC
Type: ---
Bug Depends On:    
Bug Blocks: 717358, 722548, 729848, 730796, 959593    

Description Ian Springer 2011-07-12 22:08:29 UTC
I have around 100k alerts, and the view times out with the following exception:

Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
Severity: Warning
Time: Tuesday, July 12, 2011 6:02:43 PM Etc/GMT+4
Detail:
com.google.gwt.http.client.RequestTimeoutException:A request timeout has expired after 10000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 10000 ms
    at Unknown.com_google_gwt_http_client_RequestTimeoutException_$RequestTimeoutException__Lcom_google_gwt_http_client_RequestTimeoutException_2Lcom_google_gwt_http_client_Request_2ILcom_google_gwt_http_client_RequestTimeoutException_2(Unknown source:0)
    at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown source:0)
    at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown source:0)
    at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown source:0)
    at Unknown.anonymous(Unknown source:0)
    at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown source:0)
    at Unknown.anonymous(Unknown source:0)
    at Unknown.anonymous(Unknown source:0)


100k is a ton of alerts, but probably not an unrealistic number for an environment that's been running for a long time (e.g. a few years). We should look into improving the performance of the select query and/or increasing the GWT RPC timeout.
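
As a rough illustration of the timeout trade-off mentioned above, the sketch below uses only the Java standard library and no GWT classes. `ClientTimeoutSketch` and `callWithTimeout` are hypothetical names, and the millisecond values are scaled-down stand-ins for the 10000 ms timeout in the error. It shows that a caller that gives up on a slow call does not stop the work behind it, which matches the warning that the server may still be processing the request.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; it does not use GWT or RHQ classes.
final class ClientTimeoutSketch {

    // Waits up to timeoutMillis for a simulated server call that takes serverMillis.
    // Returns "ok" if the response arrived in time, otherwise "timeout".
    static String callWithTimeout(long serverMillis, long timeoutMillis) {
        ExecutorService server = Executors.newSingleThreadExecutor();
        try {
            Future<String> response = server.submit(() -> {
                Thread.sleep(serverMillis); // stands in for the slow query
                return "ok";
            });
            return response.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gives up, but the submitted task keeps running to completion.
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            server.shutdown(); // lets the running task finish, then frees the thread
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(300, 50));  // server slower than the timeout
        System.out.println(callWithTimeout(10, 2000)); // server faster than the timeout
    }
}
```

Raising the timeout in this model only changes which branch the caller sees; the simulated query takes just as long either way.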

Comment 1 Ian Springer 2011-09-15 20:16:24 UTC
On my box, with 112,000 total alerts, upon going to the Alerts subsystem view, the findAlertsByCriteria RPC call took 42 seconds to complete. This seems really long, considering the underlying query should use paging.
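
For reference, a minimal sketch of what paging is expected to buy here, assuming a result that is already sorted. `PagingSketch` and `fetchPage` are hypothetical names used for illustration, not RHQ code, and the 50-row page size is only an example value. The point is that the work per request should scale with the page size rather than with the 112,000 total rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of RHQ.
final class PagingSketch {

    // Returns one page of an already-sorted result, or an empty list past the end.
    static <T> List<T> fetchPage(List<T> sortedRows, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        long start = (long) pageNumber * pageSize; // long avoids int overflow
        if (start >= sortedRows.size()) {
            return new ArrayList<>();
        }
        int end = (int) Math.min(start + pageSize, sortedRows.size());
        return new ArrayList<>(sortedRows.subList((int) start, end));
    }

    public static void main(String[] args) {
        List<Integer> alertIds = new ArrayList<>();
        for (int id = 0; id < 112_000; id++) {
            alertIds.add(id);
        }
        List<Integer> firstPage = fetchPage(alertIds, 0, 50);
        System.out.println("rows in first page: " + firstPage.size());
    }
}
```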

Comment 2 Charles Crouch 2011-10-10 20:04:38 UTC
That's both a lot of alerts and a long time for the query. We should double-check that the query is working as expected with respect to paging. If the query is simply slow for 100k+ alerts, we should address this for the release by increasing the GWT timeout on this page. Post-release, we should come back and determine whether other speedups are possible.

Comment 3 Ian Springer 2011-10-14 19:44:32 UTC
The load time of this view has been improved, though the page can still take 10+ seconds to load if there are thousands of alerts (e.g. with 8000 alerts, the view takes 12 seconds to load).

Two commits contribute to the improvement in load time:

1) Work around a SmartGWT bug that was causing 1000 alerts to be requested on the first fetch rather than 50 - http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=c64f046
2) Make alert.alertDefinition.groupAlertDefinition a LAZY fetch, since that field isn't needed for alert list views - http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=5c44fc6

I think the call is still somewhat slow because there are n + 1 queries required to fetch the measurementDefinition of each of the alert's condition logs (needed to render the condition in the GUI). I see no way to get these in a single query using JPA, and going outside of JPA would not allow us to use our criteria framework.
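
To make the n + 1 pattern described above concrete, here is a minimal in-memory sketch. Everything in it is hypothetical (`NPlusOneSketch`, `FakeDefinitionStore`, and the method names are not RHQ or JPA classes), and the store simply counts simulated round trips. It contrasts one lookup per condition log with a single batched lookup by id. It does not answer whether the criteria framework could express such a batched lookup, and the batched variant is still a second query rather than a single one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the database; none of these names are RHQ classes.
final class NPlusOneSketch {

    // Counts simulated round trips so the two strategies can be compared.
    static final class FakeDefinitionStore {
        private final Map<Integer, String> definitionsById = new HashMap<>();
        int queryCount = 0;

        FakeDefinitionStore() {
            for (int id = 1; id <= 5; id++) {
                definitionsById.put(id, "metric-" + id);
            }
        }

        // One round trip per call: the shape that produces the "+ n" queries.
        String findById(int id) {
            queryCount++;
            return definitionsById.get(id);
        }

        // One round trip for a whole set of ids (an "IN (...)" style lookup).
        Map<Integer, String> findByIds(Set<Integer> ids) {
            queryCount++;
            Map<Integer, String> found = new HashMap<>();
            for (Integer id : ids) {
                found.put(id, definitionsById.get(id));
            }
            return found;
        }
    }

    // Resolves the definition of each condition log one at a time.
    static List<String> resolveOneByOne(FakeDefinitionStore store, List<Integer> definitionIdPerLog) {
        List<String> names = new ArrayList<>();
        for (int id : definitionIdPerLog) {
            names.add(store.findById(id));
        }
        return names;
    }

    // Collects the distinct ids first, then resolves them in a single lookup.
    static List<String> resolveBatched(FakeDefinitionStore store, List<Integer> definitionIdPerLog) {
        Map<Integer, String> byId = store.findByIds(new LinkedHashSet<>(definitionIdPerLog));
        List<String> names = new ArrayList<>();
        for (int id : definitionIdPerLog) {
            names.add(byId.get(id));
        }
        return names;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            ids.add(1 + (i % 5)); // 50 condition logs referring to 5 definitions
        }
        FakeDefinitionStore oneByOne = new FakeDefinitionStore();
        resolveOneByOne(oneByOne, ids);
        FakeDefinitionStore batched = new FakeDefinitionStore();
        resolveBatched(batched, ids);
        System.out.println("one-by-one round trips: " + oneByOne.queryCount);
        System.out.println("batched round trips: " + batched.queryCount);
    }
}
```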

Note, by default, the results are sorted by the ctime column. There is already an index defined for that column.

Comment 4 Ian Springer 2011-10-14 19:46:16 UTC
http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=e96159b fixes an NPE that was introduced by [master c64f046].

Comment 5 Mike Foley 2011-10-18 15:00:50 UTC
Verified basic functionality.

Comment 6 Mike Foley 2012-02-07 19:27:58 UTC
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE

Comment 7 Larry O'Leary 2013-05-03 21:15:31 UTC
Reopening this issue as it isn't actually resolved; the issue still occurs. The commits improved performance but did not fix the underlying problem: the entire data set is fetched within a single request without the use of a limit (due to the Hibernate fetch join implementation).
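
The interaction between a fetch join and a row limit can be sketched in memory as follows. The background assumption is that Hibernate applies firstResult/maxResults in memory when a query fetch-joins a collection, because a SQL row limit counts joined rows rather than entities. All names here (`FetchJoinPagingSketch`, `JoinedRow`, and the methods) are hypothetical, and no Hibernate or RHQ classes are involved. The two-step variant, which pages the ids first and then loads details only for those ids, is a commonly used workaround and not something proposed in this bug.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; no Hibernate or RHQ classes are used.
final class FetchJoinPagingSketch {

    // One row of a simulated "alert JOIN condition_log" result.
    static final class JoinedRow {
        final int alertId;
        final int conditionLogId;

        JoinedRow(int alertId, int conditionLogId) {
            this.alertId = alertId;
            this.conditionLogId = conditionLogId;
        }
    }

    // Builds the joined result: every alert contributes logsPerAlert rows.
    static List<JoinedRow> joinedRows(int alertCount, int logsPerAlert) {
        List<JoinedRow> rows = new ArrayList<>();
        for (int alertId = 1; alertId <= alertCount; alertId++) {
            for (int log = 1; log <= logsPerAlert; log++) {
                rows.add(new JoinedRow(alertId, alertId * 100 + log));
            }
        }
        return rows;
    }

    // Folds joined rows back into alertId -> condition log ids.
    static Map<Integer, List<Integer>> groupByAlert(List<JoinedRow> rows) {
        Map<Integer, List<Integer>> alerts = new LinkedHashMap<>();
        for (JoinedRow row : rows) {
            alerts.computeIfAbsent(row.alertId, id -> new ArrayList<>()).add(row.conditionLogId);
        }
        return alerts;
    }

    // Limit applied to joined rows: too few alerts, and the last one may be truncated.
    static Map<Integer, List<Integer>> limitJoinedRows(List<JoinedRow> rows, int limit) {
        return groupByAlert(rows.subList(0, Math.min(limit, rows.size())));
    }

    // Two steps: choose the page of alert ids first, then keep only their joined rows.
    // The filter loop stands in for a second query restricted to those ids.
    static Map<Integer, List<Integer>> pageIdsThenFetch(List<JoinedRow> rows, int alertCount,
                                                        int offset, int pageSize) {
        int firstId = offset + 1;
        int lastId = Math.min(offset + pageSize, alertCount);
        List<JoinedRow> pageRows = new ArrayList<>();
        for (JoinedRow row : rows) {
            if (row.alertId >= firstId && row.alertId <= lastId) {
                pageRows.add(row);
            }
        }
        return groupByAlert(pageRows);
    }

    public static void main(String[] args) {
        List<JoinedRow> rows = joinedRows(10, 3);
        System.out.println("row limit of 4 yields alerts: " + limitJoinedRows(rows, 4).keySet());
        System.out.println("id page of 4 yields alerts: " + pageIdsThenFetch(rows, 10, 0, 4).keySet());
    }
}
```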

With 50,000 alerts, this failure can still occur. It depends on the amount of memory/heap available to the RHQ server and the performance of the database, but in the end the warning is displayed and alert history and recent alerts are not available.

To make matters worse, if the UI is refreshed while a previous request is still processing, the user can end up crashing the server: the server executes the same query again, without criteria, so the same massive result set is retrieved multiple times.
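
One possible mitigation for the refresh scenario, offered only as a sketch and not something proposed in this bug, is to let identical requests share a single in-flight load instead of starting the query again. `InFlightRequestsSketch`, its `submit` method, and the string request key are hypothetical names; the example uses only the Java standard library and says nothing about how RHQ dispatches its RPC calls.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical illustration only; not RHQ code.
final class InFlightRequestsSketch<T> {
    private final ConcurrentMap<String, CompletableFuture<T>> inFlight = new ConcurrentHashMap<>();

    // Starts the load for key unless an identical request is already running,
    // in which case the caller is handed the pending result instead.
    CompletableFuture<T> submit(String key, Supplier<CompletableFuture<T>> loader) {
        CompletableFuture<T> mine = new CompletableFuture<>();
        CompletableFuture<T> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            return existing; // join the request that is already in flight
        }
        try {
            loader.get().whenComplete((value, error) -> {
                inFlight.remove(key, mine); // later requests start a fresh load
                if (error != null) {
                    mine.completeExceptionally(error);
                } else {
                    mine.complete(value);
                }
            });
        } catch (RuntimeException e) {
            // The loader failed before producing a future; do not leave a stale entry.
            inFlight.remove(key, mine);
            mine.completeExceptionally(e);
        }
        return mine;
    }

    public static void main(String[] args) {
        InFlightRequestsSketch<String> requests = new InFlightRequestsSketch<>();
        CompletableFuture<String> slowQuery = new CompletableFuture<>();
        CompletableFuture<String> first = requests.submit("recent-alerts", () -> slowQuery);
        CompletableFuture<String> refresh = requests.submit("recent-alerts", () -> slowQuery);
        System.out.println("refresh shares the first request: " + (first == refresh));
        slowQuery.complete("50 rows");
        System.out.println(first.join());
    }
}
```

In this model a refresh issued while the first load is pending triggers no second query; once the load completes, the next request starts a new one.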