Bug 912468 - RHQ server is consuming 99 % of cpu
Summary: RHQ server is consuming 99 % of cpu
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: CLI, Core Server   
(Show other bugs)
Version: 4.5
Hardware: Unspecified Unspecified
unspecified
unspecified vote
Target Milestone: ---
: ---
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-02-18 18:11 UTC by Filip Brychta
Modified: 2013-05-14 16:12 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (125.75 KB, application/x-gzip)
2013-02-18 18:11 UTC, Filip Brychta
no flags Details
cpudata.txt and thDumps.txt (278.31 KB, application/x-gzip)
2013-02-19 15:06 UTC, Filip Brychta
no flags Details
heap chart (140.35 KB, image/png)
2013-02-20 15:55 UTC, Filip Brychta
no flags Details
permGen chart (126.04 KB, image/png)
2013-02-20 15:56 UTC, Filip Brychta
no flags Details
agent and server logs from second run (239.26 KB, application/x-gzip)
2013-02-20 15:57 UTC, Filip Brychta
no flags Details
cpudata and threadDumps from second run (464.56 KB, application/x-gzip)
2013-02-20 15:58 UTC, Filip Brychta
no flags Details

Description Filip Brychta 2013-02-18 18:11:08 UTC
Created attachment 699024 [details]
logs

Description of problem:
After invocation of several CLI tests, RHQ server was consuming ~99% of CPU even when tests were not running any more. After ~10 minutes, usage of CPU went down to ~0%.


Version-Release number of selected component (if applicable):
Version: 4.6.0-SNAPSHOT
Build Number: f7409c5

How reproducible:
Always

Steps to Reproduce:
No exact steps available. It happens after several runs of our suite with CLI tests. I'm not able to isolate the problem to a certain test. I can provide a machine showing this problem.

Please add any hints how to isolate the problem and i will re-test it and provide exact repro steps.
  

Additional info:
Complete agent's and server's logs are attached. I noticed one exception in the agent's log, which should be related to this problem:

2013-02-18 11:58:30,540 WARN  [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Failure during discovery for [RHQ Server Plugin Subsystem] Resources - failed after 60001 ms.
1248 java.lang.Exception: Discovery component invocation failed.
1249         at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:297)
1250         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
1251         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
1252         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
1253         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
1254         at java.lang.Thread.run(Thread.java:679)
1255 Caused by: java.lang.reflect.UndeclaredThrowableException
1256         at $Proxy89.queryNames(Unknown Source)
1257         at org.mc4j.ems.impl.jmx.connection.DConnection.queryBeans(DConnection.java:307)
1258         at org.mc4j.ems.impl.jmx.connection.DConnection.queryBeans(DConnection.java:332)
1259         at org.rhq.plugins.jmx.MBeanResourceDiscoveryComponent.performDiscovery(MBeanResourceDiscoveryComponent.java:149)
1260         at org.rhq.plugins.jmx.MBeanResourceDiscoveryComponent.discoverResources(MBeanResourceDiscoveryComponent.java:98)
1261         at org.rhq.plugins.jmx.MBeanResourceDiscoveryComponent.discoverResources(MBeanResourceDiscoveryComponent.java:85)
1262         at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
1263         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
1264         at java.lang.reflect.Method.invoke(Method.java:616)
1265         at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:293)
1266         ... 5 more
1267 Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING
1268         at org.jboss.remotingjmx.protocol.v1.ClientConnection$TheConnection.queryNames(ClientConnection.java:890)
1269         at sun.reflect.GeneratedMethodAccessor90.invoke(Unknown Source)
1270         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
1271         at java.lang.reflect.Method.invoke(Method.java:616)
1272         at org.mc4j.ems.impl.jmx.connection.support.providers.proxy.JMXRemotingMBeanServerProxy.invoke(JMXRemotingMBeanServerProxy.java:59)

Comment 1 Charles Crouch 2013-02-18 18:24:43 UTC
Filip
Try http://www.cubrid.org/blog/dev-platform/how-to-analyze-java-thread-dumps/
section "Example 1: When the CPU Usage is Abnormally High"

Comment 2 Mike Foley 2013-02-18 18:39:22 UTC
Just a suggestion ...

This is an extremely lightweight way of doing CPU profiling ...

-Xrunhprof:cpu=samples,file=c:\temp\hprof.txt

so in rhq-server.sh, try this ... 


RHQ_SERVER_ADDITIONAL_JAVA_OPTS=-Xrunhprof:cpu=samples,file=/home/mfoley/cpudata.txt


when you ctrl-break the server ... there will be a list of java classes ...sorted by CPU usage.  i did this and got the following:

CPU SAMPLES BEGIN (total = 66113) Mon Feb 18 13:30:55 2013
rank   self  accum   count trace method
   1 40.71% 40.71%   26917 301046 sun.nio.ch.EPollArrayWrapper.epollWait
   2 34.01% 74.73%   22487 301509 java.net.PlainSocketImpl.socketAccept
   3  5.72% 80.45%    3784 301671 java.net.PlainSocketImpl.socketAccept
   4  2.11% 82.56%    1398 300534 java.lang.Class.getClassLoader0
   5  1.65% 84.22%    1092 300493 java.io.FileInputStream.readBytes
   6  1.53% 85.74%    1009 300521 java.lang.ClassLoader.defineClass1
   7  0.95% 86.69%     625 300525 java.util.zip.Inflater.inflateBytes
   8  0.80% 87.49%     528 300618 java.io.UnixFileSystem.canonicalize0
   9  0.56


so in your scenario here ...we will see something different.   the classes that are spinning the CPU should be near the top ... and then we will have additional info on what is going on ...

Comment 3 Filip Brychta 2013-02-19 15:04:52 UTC
Adding full thread dumps and cpu samples.
 
Attached file thDumps.txt contains thread dumps with std and stderr outputs of the RHQ server.
cpudata.txt contains cpu samples.

Each sample was taken asynchronously by sending SIGQUIT to the RHQ server.
First two samples were taken before tests were running (cpu usage of RHQ server was normal)
Subsequent samples were taken when i noticed first "Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING" exception in the agent's log.
Last three samples were taken when a cpu usage of the RHQ server went back to normal (the very last when the RHQ server was in a shutdown process) 

Tests started at 08:04:30 and finished at 09:00 and cpu usage of RHQ server went back to normal at 09:13 EST (you can see timestamp of each sample in cpudata.txt)


I noticed that the method which was consuming cpu the most between 09:00 and 09:13 was 'java.util.regex.Pattern$Start.match'. It seems a bit suspicious (see cpudata.txt). Last operation invoked by tests was "uninventory all resources".

Comment 4 Filip Brychta 2013-02-19 15:06:22 UTC
Created attachment 699495 [details]
cpudata.txt and thDumps.txt

Comment 5 Filip Brychta 2013-02-19 15:41:28 UTC
I have to do it again. Default stack trace depth is only 4 so the previous cpudata.txt is a bit useless. I will attached a new version in 1 hour.

Comment 6 Filip Brychta 2013-02-19 16:51:08 UTC
I didn't manage to reproduce it this time and I don't have time for another run. I will continue tomorrow, plus i will do another runs with increased heap space to see if the high cpu usage is related to the size of heap space.

Comment 7 Filip Brychta 2013-02-20 15:54:46 UTC
I did another testing with RHQ server's heap space increased to 2G (-Xms2G -Xmx2G). After several runs of our CLI test suite was the RHQ server consuming again ~100% of CPU and the exception "Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING" was being thrown to the agent's log repeatedly. 

See attached pictures with heap and pemrGen graphs. High cpu usage should not be caused by low heap or permGen size.

There were again a lot of threads doing "java.util.regex.Pattern$Start.match" which could be the cause of high cpu usage. See attached cpudata.txt for full trace.

Attaching full agent's and server's logs, cpu samples and thread dumps.

Comment 8 Filip Brychta 2013-02-20 15:55:57 UTC
Created attachment 700114 [details]
heap chart

Comment 9 Filip Brychta 2013-02-20 15:56:30 UTC
Created attachment 700115 [details]
permGen chart

Comment 10 Filip Brychta 2013-02-20 15:57:26 UTC
Created attachment 700116 [details]
agent and server logs from second run

Comment 11 Filip Brychta 2013-02-20 15:58:07 UTC
Created attachment 700117 [details]
cpudata and threadDumps from second run

Comment 12 John Mazzitelli 2013-02-20 17:51:09 UTC
are you running an agent that is monitoring this RHQ Server?

I ask because :

org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:293)
1266         ... 5 more
1267 Caused by: java.io.IOException: Unable to obtain isRegistered, status=WAITING
1268         at org.jboss.remotingjmx.protocol.v1.ClientConnection$TheConnection.queryNames(ClientConnection.java:890)

seems to indicate this is the rhq-server plugin making remote JMX calls to get metric data from the RHQ Server.

Comment 13 Filip Brychta 2013-02-21 09:25:24 UTC
Yes, the RHQ agent is monitoring the RHQ Server as well. All discovered resources are imported before the test suite.

So the exception from previous comment 12 only indicates that RHQ serve is not able to respond to the agent in time?It's only a consequence of overloaded RHQ server?

Comment 14 Charles Crouch 2013-05-14 16:12:24 UTC
This should be retested once we have a JON build with EAP6.1


Note You need to log in before you can comment on or make changes to this bug.