Bug 722320

Summary: after 2-3 GWT RPC requests time out, all subsequent requests time out, the GUI becomes unusable and never recovers, and the Server consumes a huge amount of memory and 100% CPU
Product: [Other] RHQ Project
Reporter: Ian Springer <ian.springer>
Component: Core UI
Assignee: Ian Springer <ian.springer>
Status: CLOSED NOTABUG
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: low
Version: 4.0.1
CC: ccrouch, hrupp, jshaughn
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2011-09-06 20:54:45 UTC
Bug Blocks: 678340, 717358, 729848, 730796    

Description Ian Springer 2011-07-14 21:33:22 UTC
To reproduce this issue in the perf env, do the following:

1) create a compat group of all service-d-metrics resources. This group will have around 100k members. The easiest way to do this is to create a dynagroup that creates one compat group per restype (see the example expression after this list).
2) go to the service-d-metrics compat group's Monitor>Graphs subtab. After about 10 seconds, you will get a timeout error.
3) go to the service-d-metrics compat group's Monitor>Tables subtab. After about 10 seconds, you will get a timeout error.
4) repeat step 2
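
For step 1, a group definition (dynagroup) expression along the following lines should yield one compatible group per resource type. This is only a sketch; the plugin name below is an assumption and would need to match the actual perftest plugin name in the perf env:

  resource.type.plugin = PerfTest
  groupby resource.type.name

The groupby clause pivots the matching resources into one compatible group per distinct resource type, so the service-d-metrics group falls out as one of the generated groups.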

At this point, the GUI will be in a totally hosed state. Try to do anything else in the GUI that requires one or more RPC calls to the Server, and you will get timeout errors.

If you log out and try to log back in, even that fails with a "Backend datasource is unavailable." error. The only way to recover seems to be to log out, clear the browser's cache, and then log back in.

However, the Server also appears to be in a very bad state at this point. Its heap is almost full, and it sporadically logs "GC overhead limit exceeded" OOMEs. top reports that the Server java process is consuming around 6 GB of memory and 100% CPU. The only way to get the Server back to a stable state seems to be to restart it...

Comment 1 Ian Springer 2011-07-14 21:46:06 UTC
Here's the output from top for the Server java process...

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                             
21720 test_jon  20   0 6834m 2.3g  13m S  1.7 14.8  38:26.68 java

Comment 2 Ian Springer 2011-07-15 17:26:53 UTC
Mazz's opinion is that the GUI getting into the state of all RPC calls timing out is just a side effect of the Server being in a bad state.

It seems that the group metric queries for the 100k-member group are what sent the Server into the hosed state, but this needs to be confirmed.

To further analyze this, there are a number of things we can do:

1) configure the Server JVM to write out verbose GC logs (-Xloggc:<file> etc.; see the example flags after this list)
2) get a heap dump from the Server JVM and analyze it w/ MAT
3) do live profiling of the Server JVM with JProfiler - this would be nice because we could see what spikes right after the group metric queries run; however, connecting to one of the Server JVMs remotely may not be feasible due to firewall restrictions and/or network lag
4) use Oracle EM to see if particular queries are taking a really long time and see if it has any suggestions 
5) restart the Server and then try executing the same group metric queries via portal-war or the CLI, and see if the Server goes into the hosed state; this will tell us whether it is something specific to coregui, or if it is purely a Server/DB issue
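
For items 1 and 2, something like the following JVM options and jmap invocation could be used; the file paths and pid are placeholders, and where exactly to set the options depends on how the perf env Server is launched:

  # GC logging (item 1)
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log

  # automatic heap dump on OutOfMemoryError (item 2)
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps

  # or grab a heap dump on demand from the running Server process, then load it into MAT
  jmap -dump:format=b,file=/path/to/rhq-server-heap.hprof <server-pid>

MAT's dominator tree / leak suspects report should then show what is pinning the heap right after the group metric queries run.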

Comment 3 Jay Shaughnessy 2011-08-16 20:49:56 UTC
If this is really due to the Server being jammed, then perhaps this bug can be closed.

Comment 4 Ian Springer 2011-09-06 20:54:45 UTC
It turned out the Server was processing a large amount of call-time data from the perftest plugin, which is what was bringing the Server and the DB to their knees. The issues in the GUI were just a side effect, so I am going to close this bug.