Bug 722320
| Field | Value |
|---|---|
| Summary | after 2-3 GWT RPC requests time out, all subsequent requests time out, the GUI becomes unusable and never recovers, and the Server consumes excessive memory and 100% CPU |
| Product | [Other] RHQ Project |
| Component | Core UI |
| Reporter | Ian Springer <ian.springer> |
| Assignee | Ian Springer <ian.springer> |
| QA Contact | Mike Foley <mfoley> |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | low |
| Version | 4.0.1 |
| CC | ccrouch, hrupp, jshaughn |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Last Closed | 2011-09-06 20:54:45 UTC |
| Bug Blocks | 678340, 717358, 729848, 730796 |
Description
Ian Springer, 2011-07-14 21:33:22 UTC
Here's the output from top for the Server java process:

```
  PID   USER     PR  NI VIRT  RES  SHR S %CPU %MEM TIME+    COMMAND
21720   test_jon 20   0 6834m 2.3g 13m S  1.7 14.8 38:26.68 java
```

Mazz's opinion is that the GUI getting into the state where all RPC calls time out is just a side effect of the Server being in a bad state. It seems that the group metric queries for the 100k-member group are what sent the Server into the hosed state, but this needs to be confirmed.

To analyze this further, there are a number of things we can do:

1. Configure the Server JVM to write out verbose GC logs (`-Xloggc:<file>`, etc.).
2. Get a heap dump from the Server JVM and analyze it with MAT.
3. Do live profiling of the Server JVM with JProfiler. This would be nice because we could see what spikes right after running the group metric queries; however, connecting to one of the Server JVMs remotely may not be feasible due to firewall restrictions and/or network lag.
4. Use Oracle EM to see if particular queries are taking a very long time, and see if it has any suggestions.
5. Restart the Server and then try executing the same group metric queries via portal-war or the CLI, and see if the Server goes into the hosed state. This will tell us whether it is something specific to coregui, or purely a Server/DB issue.

If this is really due to the server being jammed, then perhaps this bug can be closed.

It turned out the Server was processing a large amount of call-time data from the perftest plugin, which is what was bringing the Server and the DB to their knees. The issues in the GUI were just a side effect, so I am going to close this bug.
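The first two diagnostic steps above could be carried out roughly as follows; this is a sketch, not the commands actually run for this bug. The log and dump file paths are placeholders, the PID comes from the top output above, and the GC flags shown are the classic HotSpot options of that era (they were replaced by `-Xlog:gc` in JDK 9+):

```shell
# Step 1: enable verbose GC logging on the Server JVM (append to the
# existing JAVA_OPTS before starting the server; paths are placeholders)
JAVA_OPTS="$JAVA_OPTS -verbose:gc -Xloggc:/tmp/rhq-server-gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

# Step 2: capture a heap dump of live objects from the running Server
# JVM (PID 21720 per the top output) for offline analysis in MAT
jmap -dump:live,format=b,file=/tmp/rhq-server-heap.hprof 21720
```

A `-dump:live` dump forces a full GC first, so only reachable objects land in the `.hprof` file, which keeps the dump smaller but briefly pauses the already-struggling JVM.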