To reproduce this issue in the perf env, do the following:
1) Create a compat group of all service-d-metrics resources. This group will have around 100k members. The easiest way to do this is to create a dynagroup that creates one compat group per restype.
2) Go to the service-d-metrics compat group's Monitor>Graphs subtab. After 10s, you will get a timeout error.
3) Go to the service-d-metrics compat group's Monitor>Tables subtab. After 10s, you will get a timeout error.
4) Repeat step 2.
At this point, the GUI will be in a totally hosed state: trying to do anything else in the GUI that requires one or more RPC calls to the Server results in timeout errors.
If you log out and try to log back in, even that fails, with a "Backend datasource is unavailable." error. The only way to recover seems to be to log out, clear the browser's cache, and then log back in.
However, the Server also appears to be in a very bad state at this point. Its heap is almost full, and it sporadically logs "GC overhead limit exceeded" OOMEs. top reports that the Server java process is consuming around 6 GB of memory and 100% CPU. The only way to get the Server back to a stable state seems to be to restart it...
Here's the output from top for the Server java process...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21720 test_jon 20 0 6834m 2.3g 13m S 1.7 14.8 38:26.68 java
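The GC-thrashing symptom can also be confirmed from the command line with `jstat`, which samples heap occupancy and accumulated GC time directly from the running JVM. A sketch (the PID is taken from the top output above; the 2-second interval is just an example):

```shell
# Sample HotSpot GC/heap utilization for the Server JVM every 2 seconds.
# 21720 is the Server java PID reported by top.
jstat -gcutil 21720 2000
# An old generation (O column) pinned near 100% together with a rapidly
# climbing full-GC count/time (FGC/FGCT) would corroborate the
# "GC overhead limit" OOMEs in the Server log.
```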
Mazz's opinion is that the GUI getting into a state where all RPC calls time out is just a side effect of the Server being in a bad state.
It seems that the group metric queries for the 100k-member group are what sent the Server into the hosed state, but this needs to be confirmed.
To further analyze this, there are a number of things we can do:
1) Configure the Server JVM to write out verbose GC logs (-Xloggc:file etc.).
2) Get a heap dump from the Server JVM and analyze it with MAT.
3) Do live profiling of the Server JVM with JProfiler. This would be nice because we could see what spikes right after running the group metric queries; however, connecting to one of the Server JVMs remotely may not be feasible due to firewall restrictions and/or network lag.
4) Use Oracle EM to see if particular queries are taking a really long time and whether it has any suggestions.
5) Restart the Server and then try executing the same group metric queries via portal-war or the CLI, and see if the Server goes into the hosed state. This will tell us whether the problem is something specific to coregui, or purely a Server/DB issue.
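Steps 1 and 2 above can be sketched as follows. The paths and the PID are examples, the env var name may differ depending on how the Server startup script is configured, and the flag names are the standard HotSpot ones for the JVM generation we run:

```shell
# Step 1: add verbose GC logging to the Server JVM options
# (e.g. via the Server's startup script / JAVA_OPTS equivalent).
JAVA_OPTS="$JAVA_OPTS -Xloggc:/tmp/server-gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

# Also useful: have the JVM write a dump automatically on the next OOME,
# so we don't have to catch the Server in the act.
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/tmp/server-heap.hprof"

# Step 2: or take a heap dump on demand from the running JVM
# (21720 is the Server PID from the top output) and open it in MAT.
jmap -dump:live,format=b,file=/tmp/server-heap.hprof 21720
```

Note that `-dump:live` forces a full GC before dumping, so only reachable objects land in the .hprof file; drop `live,` to capture everything on the heap.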
If this is really due to the Server being jammed, then perhaps this bug can be closed.
It turned out the Server was processing a large amount of call-time data from the perftest plugin, which is what was bringing the Server and the DB to their knees. The issues in the GUI were just a side effect, so I am going to close this bug.