Bug 722548 (rhq-gui-timeouts)

Summary: tracker for views w/ GWT RPC call timeouts that are too short for calls that return lots of results
Product: [Other] RHQ Project
Component: Core UI
Version: 4.0.1
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Ian Springer <ian.springer>
Assignee: Ian Springer <ian.springer>
QA Contact: Mike Foley <mfoley>
CC: ccrouch, hrupp, jshaughn
Keywords: Tracking
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-02-26 21:37:42 UTC
Bug Depends On: 720826, 720497, 720794, 720835, 727869, 734527, 734599, 736517, 736802, 736807, 738798
Bug Blocks: 717358

Description Ian Springer 2011-07-15 15:50:46 UTC
Using the perf env setup, I've already hit a number of RPC calls that take more than 10s to complete, e.g.:

- importing large number of resources
- viewing Monitor>Graphs or Monitor>Tables subtabs for large group
- quickly scrolling to bottom of Table containing large number of Resources, alertdefs, alerts, etc.

It would be great if we could just increase the 10s timeout to 100s, but alas there is a good reason we have the 10s timeout. By default, Firefox and other browsers will only use one HTTP connection at a time for a given session. So just one long-running RPC call would cause the GUI to become unusable until that call completes or is timed out. It's possible to increase the maximum # of simultaneous HTTP connections per session in the browser's configuration, but we really don't want to require users to do that in order to use our GUI, since it's a global browser setting.

Comment 1 Charles Crouch 2011-07-15 17:28:56 UTC
my 2c, 
this is not an issue of making the user wait longer
the issue is the length of time the work is taking

either
a) do less work, e.g. return fewer rows, or do the work async
b) do the work faster, e.g. optimize the query

In general the solution is not to have the user sit there and wait minutes for the result.

Obviously the Monitor>Graphs tab for large groups is not amenable to a) and has historically been optimized, so there may not be much more that can be squeezed out of it given the current architecture. Don't these graphs still live in iframes? Can't that be used to overcome the connection problem? The desire here specifically would be not to regress over prior releases.

Comment 2 Ian Springer 2011-07-15 17:50:33 UTC
After discussing with some other devs, the consensus here is to try to address this as follows:

1) make the default GWT RPC timeout a configurable Server setting; make sure to grab this setting from the Server right at GUI startup, so the GUI starts using it right away; this will be very valuable in the perf env, where we can set it very high so nothing times out, allowing us to see exactly how long various calls are taking
2) also add an optional query string parameter (e.g. rpcTimeout=100) that can be added to any URL to tell the GUI to override the default timeout; this will be valuable, because it will allow a user who is hitting a timeout for a particular view to increase the timeout enough to allow the view to load
3) identify the calls that are taking more than 10s and investigate whether their performance can be tuned to a more acceptable level - use Oracle EM, hibernate perf monitor, code analysis, etc.
4) for certain RPC calls (e.g. group metric queries) increase the timeout for those particular calls to a value higher than 10s; we should make sure that the query string parameter described by 2) takes precedence over per-call timeouts, so user still has the ability to override them if they're not high enough
5) recommend in our docs that users increase their max simultaneous http connections browser setting to some value that allows for more concurrency in the GUI; note, we can not *require* this, since it's a global browser setting and users' corporate policies may not allow changing such settings; we also need to be careful not to recommend too high of a value, since more simultaneous connections from multiple clients will also put more stress on the Server and could possibly hit simultaneous connection limits configured in the RHQ Server's JBoss Web Server.

We could also consider adding a user preference in the future, but I think 1) and 2) should be sufficient.

Hopefully, the above will be enough to get the situation under control, and we will not have to undertake the major refactoring of making all RPC fully async (i.e. where the RPC calls immediately return void and then the Server pushes the responses to the client via Errai or some other web push library once the server-side SLSB method returns).
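
The precedence described in 1), 2) and 4) above could be sketched roughly as follows. This is only an illustration, not the code that was merged: the class and method names are hypothetical, the rpcTimeout value is assumed to be in seconds (based on the rpcTimeout=100 example), and the 10s fallback simply mirrors the original default.

```java
/** Sketch only: picks the effective RPC timeout using the precedence from comment 2. */
final class RpcTimeoutResolver {

    /** Hypothetical last-resort value, mirroring the original 10s default. */
    static final int FALLBACK_SECONDS = 10;

    /** Returns the rpcTimeout value from a URL query string, or null if absent or invalid. */
    static Integer parseQueryOverride(String queryString) {
        if (queryString == null || queryString.isEmpty()) {
            return null;
        }
        String query = queryString.startsWith("?") ? queryString.substring(1) : queryString;
        for (String pair : query.split("&")) {
            String[] keyValue = pair.split("=", 2);
            if (keyValue.length == 2 && keyValue[0].equals("rpcTimeout")) {
                try {
                    int seconds = Integer.parseInt(keyValue[1].trim());
                    if (seconds > 0) {
                        return seconds;
                    }
                } catch (NumberFormatException ignored) {
                    // Not a number: treat the parameter as if it were absent.
                }
            }
        }
        return null;
    }

    /**
     * Precedence: query string override, then per-call timeout, then the
     * Server-configured default, then the hard-coded fallback.
     * Pass null for any value that is not set.
     */
    static int resolveSeconds(String queryString, Integer perCallSeconds, Integer serverDefaultSeconds) {
        Integer override = parseQueryOverride(queryString);
        if (override != null) {
            return override;
        }
        if (perCallSeconds != null) {
            return perCallSeconds;
        }
        if (serverDefaultSeconds != null) {
            return serverDefaultSeconds;
        }
        return FALLBACK_SECONDS;
    }
}
```

With this ordering, a user hitting a timeout on a view that already has a raised per-call timeout can still push it higher from the URL, which is the behavior 4) asks for.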

Comment 3 Ian Springer 2011-07-15 17:57:41 UTC
(In reply to comment #1)
> my 2c, 
> this is not an issue of making the user wait longer
> the issue is the length of time the work is taking
> 
> either
> a) do less work, e.g. return fewer rows, or do the work async

all of these queries *should* already be paged. however, i already discovered one, the autodiscovery list view, that is not being paged (and hence took 65s to load w/ 1500 NEW Resources). so we should definitely check all slow list queries (well, all list queries really) and make sure they are actually doing paging.

async is last resort since it will require adding something like Errai, which will be a pretty big effort.

> b) do the work faster, e.g. optimize the query
> 
> In general the solution is not to have the user sit there and wait minutes for
> the result.

agreed. however, if they have a group w/ 100k resources, they might have to accept that certain views for that group will take more than 10s to load, especially if we've already done all we can to optimize the underlying queries.
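
To make the paging point concrete, here is a minimal sketch of what "actually doing paging" means for a list query: only one page worth of rows is returned per call, no matter how many rows exist. The class and method names are hypothetical and the slicing is done in memory purely for illustration; a real query would push the offset and limit down to the database rather than loading everything first.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: returns a single page of results instead of the full list. */
final class PagingSketch {

    /**
     * @param all        the full result set (in a real query this would stay in the database)
     * @param pageNumber zero-based page index
     * @param pageSize   maximum number of rows per page
     */
    static <T> List<T> page(List<T> all, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        // Use long arithmetic so large page numbers cannot overflow.
        int from = (int) Math.min((long) pageNumber * pageSize, all.size());
        int to = (int) Math.min((long) from + pageSize, all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

For the 1500 NEW Resources case mentioned above, a page size of 50 would mean each call returns at most 50 rows, and a request past the last page returns an empty list.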

Comment 4 Ian Springer 2011-07-15 18:45:53 UTC
Currently, if an RPC call is in progress and the user decides to navigate to some other view, that call will continue to consume one of the precious http connections until it completes or times out, even though the user has abandoned the view that initiated the call. So we should consider adding something that would cancel any *user-initiated* calls that are still in progress when the user navigates to another view.
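
One possible shape for this, as a rough sketch: keep a registry of in-flight user-initiated calls and cancel whatever is left when the view changes. Everything here is hypothetical, including the Cancellable interface and the tracker class; in GWT the handle could plausibly be backed by the Request object's cancel() method, but that wiring is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: tracks user-initiated calls so they can be cancelled on navigation. */
final class InFlightCallTracker {

    /** Hypothetical handle for a call that can be aborted on the client side. */
    interface Cancellable {
        void cancel();
    }

    private final List<Cancellable> userInitiatedCalls = new ArrayList<>();

    /** Register a user-initiated call when it is sent. */
    void track(Cancellable call) {
        userInitiatedCalls.add(call);
    }

    /** Forget a call once it has succeeded, failed, or timed out. */
    void completed(Cancellable call) {
        userInitiatedCalls.remove(call);
    }

    /** Cancel everything still pending; returns how many calls were cancelled. */
    int onViewChange() {
        int cancelled = 0;
        for (Cancellable call : new ArrayList<>(userInitiatedCalls)) {
            call.cancel();
            cancelled++;
        }
        userInitiatedCalls.clear();
        return cancelled;
    }
}
```

Background calls that are not user-initiated would simply never be registered, so they survive navigation. Note that cancelling on the client only frees the browser connection; the server-side work may still run to completion.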

Comment 5 Charles Crouch 2011-07-25 15:59:24 UTC
(In reply to comment #3)
> 
> all of these queries *should* already be paged. however, i already discovered
> one, the autodiscovery list view, that is not being paged (and hence took 65s
> to load w/ 1500 NEW Resources). so we should definitely check all slow list
> queries (well, all list queries really) and make sure they are actually doing
> paging.
> 

Is this fixed, is there another BZ for this?

Comment 6 Charles Crouch 2011-07-25 17:43:59 UTC
Answering my own question I see
https://bugzilla.redhat.com/show_bug.cgi?id=720791
was raised to cover the autodiscovery queue issue

Comment 7 Robert Buck 2011-08-08 20:23:35 UTC
Merged rpcTimeout parameter to master in:

4b7240fe1fbd0a169a4ad0663ecc3602f100dcaa
3268a40274ada82656d2b5ea095280c5985efde6

Comment 8 Ian Springer 2011-08-09 14:57:24 UTC
So far we've only addressed 2) from Comment 2 above. 3) and 4) also need to be addressed before this issue should be moved to ON_QA.

Comment 9 Ian Springer 2011-08-31 21:29:21 UTC
[master f4fdec7] includes the following:

a) add new #Test/Rpc view that can be used to invoke a new sleep() RPC method that sleeps for a specified number of seconds
b) increase default RPC timeout from 10s to 30s
c) use custom timeouts for autodiscovery queue, config history list, group schedules list views, which may exceed even the new default timeout

a) allows us to test how the app behaves when the browser's max http connections setting has been exceeded,
b) gives us a little more breathing room in the default timeout to handle calls that take more than 10s but not ridiculously long,
and c) are temporary measures for specific calls that are known to take unacceptably long; the longer term plan is to optimize the performance of these calls so they take no more than 10s
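
For anyone wanting to reason about a) outside the GUI, the effect of a sleep() call running into a client-side timeout can be approximated in plain Java. This is only an analogy and not the code in f4fdec7: the class and method names are made up, and a thread with Future.get(timeout) stands in for the GWT RPC call and its timeout.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch only: a sleeping "call" that either finishes or hits the caller's timeout. */
final class SleepRpcSketch {

    /** Returns true if a call sleeping for sleepMillis completes within timeoutMillis. */
    static boolean completesWithin(long sleepMillis, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the server-side sleep() method.
            Future<Object> call = executor.submit(() -> {
                Thread.sleep(sleepMillis);
                return null;
            });
            try {
                call.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                call.cancel(true); // give up on the call, as the GUI does on timeout
                return false;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the new 30s default from b), a 20s sleep would complete and a 40s sleep would time out unless the view uses a custom timeout as in c).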

Comment 10 Charles Crouch 2011-09-30 21:56:56 UTC
removing superfluous blocks bz's

Comment 11 Jay Shaughnessy 2013-02-26 21:37:42 UTC
All dependent bugs completed.