Created attachment 699024 [details]
logs

Description of problem:
After invocation of several CLI tests, the RHQ server was consuming ~99% of CPU even when the tests were no longer running. After ~10 minutes, CPU usage went down to ~0%.

Version-Release number of selected component (if applicable):
Version: 4.6.0-SNAPSHOT
Build Number: f7409c5

How reproducible:
Always

Steps to Reproduce:
No exact steps available. It happens after several runs of our suite with CLI tests. I'm not able to isolate the problem to a certain test. I can provide a machine showing this problem. Please add any hints on how to isolate the problem and I will re-test it and provide exact repro steps.

Additional info:
Complete agent and server logs are attached. I noticed one exception in the agent log which should be related to this problem:

2013-02-18 11:58:30,540 WARN [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Failure during discovery for [RHQ Server Plugin Subsystem] Resources - failed after 60001 ms.
java.lang.Exception: Discovery component invocation failed.
    at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:297)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.reflect.UndeclaredThrowableException
    at $Proxy89.queryNames(Unknown Source)
    at org.mc4j.ems.impl.jmx.connection.DConnection.queryBeans(DConnection.java:307)
    at org.mc4j.ems.impl.jmx.connection.DConnection.queryBeans(DConnection.java:332)
    at org.rhq.plugins.jmx.MBeanResourceDiscoveryComponent.performDiscovery(MBeanResourceDiscoveryComponent.java:149)
    at org.rhq.plugins.jmx.MBeanResourceDiscoveryComponent.discoverResources(MBeanResourceDiscoveryComponent.java:98)
    at org.rhq.plugins.jmx.MBeanResourceDiscoveryComponent.discoverResources(MBeanResourceDiscoveryComponent.java:85)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:293)
    ... 5 more
Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING
    at org.jboss.remotingjmx.protocol.v1.ClientConnection$TheConnection.queryNames(ClientConnection.java:890)
    at sun.reflect.GeneratedMethodAccessor90.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.mc4j.ems.impl.jmx.connection.support.providers.proxy.JMXRemotingMBeanServerProxy.invoke(JMXRemotingMBeanServerProxy.java:59)
Filip, try http://www.cubrid.org/blog/dev-platform/how-to-analyze-java-thread-dumps/, section "Example 1: When the CPU Usage is Abnormally High".
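If reading raw dumps is tedious, the per-thread CPU numbers that the article correlates with the dump can also be pulled via ThreadMXBean. A minimal sketch (HotThreadFinder is a made-up name, not RHQ code; it has to run inside, or be adapted to connect remotely to, the JVM under suspicion):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreadFinder {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // dumpAllThreads(false, false) skips lock details; we only need names/states.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            // getThreadCpuTime returns -1 if CPU time measurement is unsupported.
            long cpuNanos = mx.getThreadCpuTime(info.getThreadId());
            if (cpuNanos > 0) {
                System.out.printf("%-40s cpu=%dms state=%s%n",
                    info.getThreadName(), cpuNanos / 1000000L, info.getThreadState());
            }
        }
    }
}

Matching the thread names with the highest CPU times against the corresponding stacks in a SIGQUIT dump shows what the busy threads are actually doing.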
Just a suggestion... This is an extremely lightweight way of doing CPU profiling:

-Xrunhprof:cpu=samples,file=c:\temp\hprof.txt

So in rhq-server.sh, try this:

RHQ_SERVER_ADDITIONAL_JAVA_OPTS=-Xrunhprof:cpu=samples,file=/home/mfoley/cpudata.txt

When you ctrl-break the server, there will be a list of Java classes sorted by CPU usage. I did this and got the following:

CPU SAMPLES BEGIN (total = 66113) Mon Feb 18 13:30:55 2013
rank   self  accum   count trace method
   1 40.71% 40.71%  26917 301046 sun.nio.ch.EPollArrayWrapper.epollWait
   2 34.01% 74.73%  22487 301509 java.net.PlainSocketImpl.socketAccept
   3  5.72% 80.45%   3784 301671 java.net.PlainSocketImpl.socketAccept
   4  2.11% 82.56%   1398 300534 java.lang.Class.getClassLoader0
   5  1.65% 84.22%   1092 300493 java.io.FileInputStream.readBytes
   6  1.53% 85.74%   1009 300521 java.lang.ClassLoader.defineClass1
   7  0.95% 86.69%    625 300525 java.util.zip.Inflater.inflateBytes
   8  0.80% 87.49%    528 300618 java.io.UnixFileSystem.canonicalize0
   9  0.56

In your scenario we will see something different: the classes that are spinning the CPU should be near the top, and then we will have additional info on what is going on.
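If hprof is not available on the JVM in question, roughly the same signal can be approximated in plain Java by polling stacks. A crude, illustrative sketch (names are mine, not hprof or RHQ code; it must run inside the JVM being profiled):

import java.util.HashMap;
import java.util.Map;

public class CrudeSampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < 300; i++) {             // ~30s at a 100ms interval
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                    // Tally the top frame of each runnable thread; the most
                    // frequent frames correspond to hprof's top-ranked methods.
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    Integer c = counts.get(frame);
                    counts.put(frame, c == null ? 1 : c + 1);
                }
            }
            Thread.sleep(100);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + "  " + e.getKey());
        }
    }
}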
Adding full thread dumps and CPU samples. The attached thDumps.txt contains thread dumps with stdout and stderr output of the RHQ server; cpudata.txt contains CPU samples. Each sample was taken asynchronously by sending SIGQUIT to the RHQ server.

The first two samples were taken before the tests were running (CPU usage of the RHQ server was normal). Subsequent samples were taken once I noticed the first "Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING" exception in the agent log. The last three samples were taken after CPU usage of the RHQ server went back to normal (the very last while the RHQ server was shutting down).

Tests started at 08:04:30 and finished at 09:00; CPU usage of the RHQ server went back to normal at 09:13 EST (the timestamp of each sample is in cpudata.txt). I noticed that the method consuming the most CPU between 09:00 and 09:13 was 'java.util.regex.Pattern$Start.match', which seems a bit suspicious (see cpudata.txt). The last operation invoked by the tests was "uninventory all resources".
Created attachment 699495 [details]
cpudata.txt and thDumps.txt
I have to do it again. The default hprof stack trace depth is only 4, so the previous cpudata.txt is a bit useless. I will attach a new version, taken with a larger depth (e.g. -Xrunhprof:cpu=samples,depth=10,file=...), in about an hour.
I didn't manage to reproduce it this time and I don't have time for another run. I will continue tomorrow, and I will also do more runs with increased heap space to see whether the high CPU usage is related to the heap size.
I did another round of testing with the RHQ server's heap space increased to 2G (-Xms2G -Xmx2G). After several runs of our CLI test suite, the RHQ server was again consuming ~100% of CPU and the "Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING" exception was being thrown repeatedly in the agent log. See the attached heap and permGen charts; the high CPU usage should not be caused by low heap or permGen size.

There were again a lot of threads busy in "java.util.regex.Pattern$Start.match", which could be the cause of the high CPU usage; see the attached cpudata.txt for the full traces. Attaching the full agent and server logs, CPU samples and thread dumps.
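To make the regex suspicion concrete, here is a throwaway demo (mine, not taken from the RHQ code base) of how a backtracking-prone pattern pins a thread inside java.util.regex, producing exactly the kind of RUNNABLE stacks with Pattern$Start.match on top that show up in cpudata.txt:

import java.util.regex.Pattern;

public class RegexBacktrackingDemo {
    public static void main(String[] args) {
        // Nested quantifiers force exponential backtracking on a near-miss input.
        Pattern p = Pattern.compile("(a+)+b");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 28; i++) {
            sb.append('a');                         // almost-matching input, no 'b'
        }
        long start = System.nanoTime();
        boolean matched = p.matcher(sb).matches();  // burns CPU for seconds
        System.out.printf("matched=%b after %d ms%n", matched,
            (System.nanoTime() - start) / 1000000L);
    }
}

Either such backtracking or simply a very high match rate (a pattern applied per resource while uninventorying everything) would keep threads in the regex engine for the 09:00-09:13 window.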
Created attachment 700114 [details]
heap chart

Created attachment 700115 [details]
permGen chart

Created attachment 700116 [details]
agent and server logs from second run

Created attachment 700117 [details]
cpudata and threadDumps from second run
Are you running an agent that is monitoring this RHQ Server? I ask because this part of the stack trace:

    at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:293)
    ... 5 more
Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING
    at org.jboss.remotingjmx.protocol.v1.ClientConnection$TheConnection.queryNames(ClientConnection.java:890)

seems to indicate this is the rhq-server plugin making remote JMX calls to get metric data from the RHQ Server.
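For reference, the call path in that trace boils down to a remote queryNames over remoting-jmx, roughly like the sketch below (the URL, port and ObjectName pattern are placeholders, not the plugin's actual configuration, and the JBoss remoting-jmx client jars must be on the classpath):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteQueryNamesSketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:remoting-jmx://localhost:9999");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            // When the server side is too slow to answer, the remoting-jmx
            // client gives up with "Unable to obtain <operation>, status=WAITING",
            // which is the exception seen repeatedly in the agent log.
            System.out.println(server.queryNames(new ObjectName("rhq:*"), null));
        } finally {
            connector.close();
        }
    }
}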
Yes, the RHQ agent is monitoring the RHQ Server as well. All discovered resources are imported before the test suite runs. So the exception from comment 12 only indicates that the RHQ server is not able to respond to the agent in time? Is it just a consequence of an overloaded RHQ server?
This should be retested once we have a JON build with EAP 6.1.