Bug 722320 - after 2-3 GWT RPC requests timeout, all subsequent requests timeout, and GUI becomes unusable and never recovers, and Server consumes tons of memory and 100% CPU
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: RHQ Project
Classification: Other
Component: Core UI
Version: 4.0.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ian Springer
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks: jon3 jon30-perf rhq41 rhq41-ui
Reported: 2011-07-14 21:33 UTC by Ian Springer
Modified: 2013-08-06 00:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-06 20:54:45 UTC
Embargoed:


Description Ian Springer 2011-07-14 21:33:22 UTC
To reproduce this issue in the perf env, do the following:

1) create a compat group of all service-d-metrics resources. this group will have around 100k members. the easiest way to do this is to create a dynagroup that creates one compat group per resource type.
2) go to the service-d-metrics compat group's Monitor>Graphs subtab. after 10s, you will get a timeout error.
3) go to the service-d-metrics compat group's Monitor>Tables subtab. after 10s, you will get a timeout error.
4) repeat step 2

At this point, the GUI will be in a totally hosed state. Try to do anything else in the GUI that requires one or more RPC calls to the Server and you will get timeout errors.

If you log out and try to log back in, even that will fail with a "Backend datasource is unavailable." error. The only way to recover seems to be to log out, clear the browser's cache, and then log back in.

However, the Server also appears to be in a very bad state at this point. Its heap is almost full and it sporadically logs "GC overhead limit has been reached." OOMEs. top reports that the Server java process is consuming around 6 GB of memory and 100% CPU. The only way to get the Server back to a stable state seems to be to restart it...

Comment 1 Ian Springer 2011-07-14 21:46:06 UTC
Here's the output from top for the Server java process...

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                             
21720 test_jon  20   0 6834m 2.3g  13m S  1.7 14.8  38:26.68 java

Comment 2 Ian Springer 2011-07-15 17:26:53 UTC
Mazz's opinion is that the GUI getting into the state of all RPC calls timing out is just a side effect of the Server being in a bad state.

It seems that the group metric queries for the 100k-member group are what sent the Server into the hosed state, but this needs to be confirmed.

To further analyze this, there are a number of things we can do:

1) configure the Server JVM to write out verbose GC logs (-Xloggc:file etc.)
2) get a heap dump from the Server JVM and analyze it w/ MAT
3) do live profiling of the Server JVM w/ JProfiler - this would be nice because we could see what spikes right after doing the group metric queries, however connecting to one of the Server JVMs remotely may not be feasible due to firewall restrictions and/or network lag
4) use Oracle EM to see if particular queries are taking a really long time and see if it has any suggestions 
5) restart the Server and then try executing the same group metric queries via portal-war or the CLI, and see if the Server goes into the hosed state; this will tell us whether it is something specific to coregui, or if it is purely a Server/DB issue
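
As a rough complement to items 1 and 2 above, heap occupancy and cumulative GC time can also be sampled through the standard java.lang.management MXBeans. The sketch below is only an illustration, not anything from the RHQ code base: the class name HeapPressureSketch, the 5-second interval, and the three-sample loop are all made up for the example. It also measures the JVM it runs in, so to observe the Server it would have to run inside the Server JVM or be adapted to read the same MXBeans over a remote JMX connection, which is assumed to be possible here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: samples heap usage and GC time of the current JVM.
class HeapPressureSketch {

    /** Fraction of the heap in use, relative to max (or committed when max is undefined). */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when no maximum is defined; fall back to committed.
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    /** Approximate total milliseconds spent in all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 5000; // arbitrary sampling interval for this sketch
        long previous = totalGcMillis();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(intervalMillis);
            long now = totalGcMillis();
            // A nearly full heap plus GC time close to the interval length
            // is the pattern behind "GC overhead limit" errors.
            System.out.printf("heap used: %.0f%%, GC time in last interval: %d ms%n",
                    heapUsedFraction() * 100, now - previous);
            previous = now;
        }
    }
}
```

If the heap stays near 100% and most of each interval is spent in GC right after the group metric queries, that would point at the queries; if the same pattern shows up without touching the GUI, something else is loading the Server.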

Comment 3 Jay Shaughnessy 2011-08-16 20:49:56 UTC
If this is really due to the server being jammed then perhaps it can be closed.

Comment 4 Ian Springer 2011-09-06 20:54:45 UTC
It turned out the Server was processing a large amount of call-time data from the perftest plugin, which is what was bringing the Server and the DB to their knees. The issues in the GUI were just a side effect, so I am going to close this bug.

