720826 – Recent Alerts subsystem view times out if a large number of alerts exist

Keywords:
Status:	NEW
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core UI
Sub Component:
Version:	4.0.1,4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	jon30-perf rhq-gui-timeouts rhq41 rhq41-ui 959593

Reported:	2011-07-12 22:08 UTC by Ian Springer
Modified:	2024-03-04 13:35 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Clones:	959593
Environment:
Last Closed:	2012-02-07 19:27:58 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	958993	0	medium	CLOSED	Alert history portlet or page can crash server if thousands of alerts exist	2021-02-22 00:41:40 UTC

Internal Links: 958993

Description Ian Springer 2011-07-12 22:08:29 UTC

I have around 100k alerts, and the view times out with the following exception:

Failed to fetch alerts data. This occurred because the server is taking a long time to complete this request. Please be aware that the server may still be processing your request and it may complete shortly. You can check the server logs to see if any abnormal errors occurred.
Severity: Warning
Time: Tuesday, July 12, 2011 6:02:43 PM Etc/GMT+4
Detail: com.google.gwt.http.client.RequestTimeoutException: A request timeout has expired after 10000 ms
--- STACK TRACE FOLLOWS ---
A request timeout has expired after 10000 ms
    at Unknown.com_google_gwt_http_client_RequestTimeoutException_$RequestTimeoutException__Lcom_google_gwt_http_client_RequestTimeoutException_2Lcom_google_gwt_http_client_Request_2ILcom_google_gwt_http_client_RequestTimeoutException_2(Unknown source:0)
    at Unknown.com_google_gwt_http_client_Request_$fireOnTimeout__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown source:0)
    at Unknown.com_google_gwt_http_client_Request$3_run__V(Unknown source:0)
    at Unknown.com_google_gwt_user_client_Timer_fire__V(Unknown source:0)
    at Unknown.anonymous(Unknown source:0)
    at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown source:0)
    at Unknown.anonymous(Unknown source:0)
    at Unknown.anonymous(Unknown source:0)


100k is a ton of alerts, but probably not an unrealistic number for an environment that's been running for a long time (e.g. a few years). We should look into improving the performance of the select query and/or increasing the GWT RPC timeout.

Comment 1 Ian Springer 2011-09-15 20:16:24 UTC

On my box, with 112,000 total alerts, upon going to the Alerts subsystem view, the findAlertsByCriteria RPC call took 42 seconds to complete. This seems really long, considering the underlying query should use paging.

Comment 2 Charles Crouch 2011-10-10 20:04:38 UTC

That's both a lot of alerts and a long time for the query. We should double-check that the query is working as expected with respect to paging, etc. If the query is just slow for 100k+ alerts, we should try to address this for the release by upping the GWT timeout on this page. Post-release, we should come back and determine whether other speedups are possible.

Comment 3 Ian Springer 2011-10-14 19:44:32 UTC

The load time of this view has been improved, though the page can still take 10+ seconds to load if there are thousands of alerts (e.g. with 8000 alerts, the view takes 12 seconds to load).

Two commits contribute to the improvement in load time:

1) Work around a SmartGWT bug that was causing 1000 alerts to be requested on the first fetch rather than 50 - http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=c64f046
2) Make alert.alertDefinition.groupAlertDefinition a LAZY fetch, since that field isn't needed for alert list views - http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=5c44fc6

I think the call is still somewhat slow because n + 1 queries are required to fetch the measurementDefinition of each alert's condition logs (needed to render the condition in the GUI). I see no way to get these in a single query using JPA, and going outside of JPA would not allow us to use our criteria framework.

Note, by default, the results are sorted by the ctime column. There is already an index defined for that column.
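
To illustrate the n + 1 cost described above, here is a minimal sketch in plain Java. It is not RHQ code: FakeDb, findDefinition, findDefinitions, and the two resolve methods are hypothetical names, and the "database" is an in-memory map that only counts simulated round trips. It contrasts one lookup per condition log with a single batched lookup (similar in spirit to a "WHERE id IN (...)" query). Whether a batched lookup like this can be expressed within the criteria framework is not established here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

final class ConditionLogLookup {
    /** Hypothetical stand-in for the database; counts each simulated round trip. */
    static final class FakeDb {
        private final Map<Integer, String> definitionsById;
        int queryCount = 0;

        FakeDb(Map<Integer, String> definitionsById) {
            this.definitionsById = new HashMap<>(definitionsById);
        }

        // One round trip per id: the per-row lookup behind the n + 1 pattern.
        String findDefinition(int id) {
            queryCount++;
            return definitionsById.get(id);
        }

        // One round trip for all ids at once.
        Map<Integer, String> findDefinitions(Collection<Integer> ids) {
            queryCount++;
            Map<Integer, String> found = new HashMap<>();
            for (Integer id : ids) {
                if (definitionsById.containsKey(id)) {
                    found.put(id, definitionsById.get(id));
                }
            }
            return found;
        }
    }

    // Resolves each condition log's definition with its own lookup.
    static List<String> resolveOneByOne(FakeDb db, List<Integer> definitionIds) {
        List<String> names = new ArrayList<>();
        for (Integer id : definitionIds) {
            names.add(db.findDefinition(id));
        }
        return names;
    }

    // Collects the distinct ids first, then resolves them in a single lookup.
    static List<String> resolveBatched(FakeDb db, List<Integer> definitionIds) {
        Map<Integer, String> byId = db.findDefinitions(new LinkedHashSet<>(definitionIds));
        List<String> names = new ArrayList<>();
        for (Integer id : definitionIds) {
            names.add(byId.get(id));
        }
        return names;
    }
}
```

Both methods return the same names in the same order; only the number of simulated round trips differs (one per condition log versus one in total).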

Comment 4 Ian Springer 2011-10-14 19:46:16 UTC

http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=e96159b fixes an NPE that was introduced by [master c64f046].

Comment 5 Mike Foley 2011-10-18 15:00:50 UTC

Verified basic functionality.

Comment 6 Mike Foley 2012-02-07 19:27:58 UTC

Changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE.

Comment 7 Larry O'Leary 2013-05-03 21:15:31 UTC

Reopening this issue as it isn't actually resolved. The issue still occurs. The commits only improved performance but did not fix the underlying problem of fetching the entire data set within a single request without the use of a limit (due to the Hibernate fetch join implementation).
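
One commonly used way around an unlimited fetch join is to page in two steps: first select only the ids of the requested page, then load the joined child rows for just those ids. The sketch below shows the idea in plain Java under stated assumptions. It is not the RHQ fix: Alert, pageOfIds, and logsFor are hypothetical names, and in-memory collections stand in for the two SQL queries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoStepAlertPaging {
    /** Hypothetical, minimal alert row: id plus creation time. */
    static final class Alert {
        final int id;
        final long ctime;

        Alert(int id, long ctime) {
            this.id = id;
            this.ctime = ctime;
        }
    }

    // Step 1: pick only the ids for the requested page, newest first.
    // Stands in for a join-free "SELECT id ... ORDER BY ctime DESC" with a row limit.
    // Assumes offset >= 0 and limit >= 0.
    static List<Integer> pageOfIds(List<Alert> alerts, int offset, int limit) {
        List<Alert> sorted = new ArrayList<>(alerts);
        sorted.sort(Comparator.comparingLong((Alert a) -> a.ctime).reversed());
        List<Integer> ids = new ArrayList<>();
        for (int i = offset; i < sorted.size() && ids.size() < limit; i++) {
            ids.add(sorted.get(i).id);
        }
        return ids;
    }

    // Step 2: load child rows (condition logs) for just those ids, keeping page order.
    // Stands in for a second query restricted to "WHERE alert id IN (:ids)".
    static Map<Integer, List<String>> logsFor(List<Integer> ids,
                                              Map<Integer, List<String>> allLogs) {
        Map<Integer, List<String>> page = new LinkedHashMap<>();
        for (Integer id : ids) {
            page.put(id, allLogs.getOrDefault(id, new ArrayList<>()));
        }
        return page;
    }
}
```

The point of the split is that the row limit applies to a query with no collection join, so the size of the second query is bounded by the page size rather than by the total number of alerts.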

In cases of 50,000 alerts, this failure can still occur. It really depends on the amount of memory/heap for the RHQ server and the performance of the database, but in the end, the warning is displayed and alert history and recent alerts are not available.

To make matters worse, if the UI is refreshed while a previous request is still processing, the user can end up crashing the server: the server executes the same query again, without criteria, so the same massive result set is loaded multiple times.
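
One possible mitigation for the repeated-refresh case is to let identical requests that arrive while a query is still running share its result instead of starting the query again. The sketch below shows that idea with standard java.util.concurrent classes only. It is an assumption for illustration, not how RHQ handles requests: InFlightRequests, fetch, and the string key are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

final class InFlightRequests<T> {
    // Requests currently running, keyed by a caller-chosen description of the query.
    private final ConcurrentMap<String, CompletableFuture<T>> inFlight =
            new ConcurrentHashMap<>();

    // Starts the loader only if no request with the same key is already running;
    // otherwise returns the future of the running request.
    CompletableFuture<T> fetch(String key, Supplier<T> loader) {
        CompletableFuture<T> future =
                inFlight.computeIfAbsent(key, k -> CompletableFuture.supplyAsync(loader));
        // Once the request finishes (normally or with an error), forget it so that
        // a later refresh runs a fresh query. Removal is conditional on this exact
        // future, so it cannot evict a newer request registered under the same key.
        future.whenComplete((result, error) -> inFlight.remove(key, future));
        return future;
    }
}
```

With this in place, a second call such as fetch("recent-alerts", loader) made while the first is still running returns the same future, so the expensive query runs once. This limits duplicate work but does not bound the size of a single result set.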
