Bug 589674
Summary: Important memory leak of the agent on Linux

Product: [Other] RHQ Project
Component: Agent
Version: 1.3.1
Hardware: All
OS: Linux
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: lionel.duriez
Assignee: RHQ Project Maintainer <rhq-maint>
QA Contact: Mike Foley <mfoley>
CC: jshaughn, loleary, mazz, runtis, soda-ghanassia.consultant, xavier.chatelain
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Bug Blocks: 591531
Last Closed: 2014-05-29 18:00:48 UTC
Description (lionel.duriez, 2010-05-06 16:29:41 UTC)
Hi Lionel, can you add the HeapDumpOnOutOfMemoryError system property to the agent startup, reduce the Xmx setting back to its default, and then make available to us the hprof file that is produced when the agent goes OOM? Also, which JDK are you using? Thanks, Charles

Created attachment 414497 [details]
RHQ agent Java heap dump
The heap dump was obtained when the RHQ agent Java process occupied 1 GB of memory; only 9.7 MB of Java heap was in use.
We use JDK 1.5.0_19-b02, and the JVM memory settings (-Xms, -Xmx) are the defaults. We do not get an out-of-memory error on the Java side. Analyzing the heap dump shows that only 9.7 MB of Java memory is in use while the RHQ agent Java process occupies 1 GB of memory. The memory leak is in native code, but we do not know how to identify it; it might be due to a bug in the JNI code of the agent or a leak in the augeas library.

Hi Lionel, can you confirm which version of RHQ you are using? If you are not on the latest, can you try installing RHQ 3.0.0.B05: http://sourceforge.net/projects/rhq/files/rhq/rhq-3.0.0.B05/rhq-enterprise-server-3.0.0.B05.zip/download I don't think the problem is augeas-related, since we have not yet released a version of RHQ which supports using augeas with JDK 5; any augeas usage should have resulted in a class version exception in the agent logs. One thing to try if you install a new version is to remove the rhq-virt-plugin-3.0.0.B05.jar plugin from \rhq-server-3.0.0.B05\jbossas\server\default\deploy\rhq.ear.rej\rhq-downloads\rhq-plugins\ before you install; this removes one more plugin which uses a native library. If you still have the problem on the latest version of RHQ, enable debug logging (set the environment variable RHQ_AGENT_DEBUG=true), start the agent, let it run while the memory builds up, and then attach the zipped log files to this issue. Thanks, Charles

Hi, we use Jopr 2.3.1 and the bundled RHQ agent 1.3.1. When the augeas library was not installed on the monitored servers, we got error messages in the logs about the missing augeas library. What is the difference between the RHQ server and Jopr? If the problem is fixed in version 3.0.0.B05 of the RHQ server, are we assured it will also be fixed in the next Jopr version? Thanks, Lionel

Related bug 582275 (High memory consumption, OOM errors (heap space)) has been set to verified.
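The gap Lionel describes (a process footprint near 1 GB but only 9.7 MB of live Java heap) can be checked directly on Linux. This is a sketch, not part of the original report; AGENT_PID is a placeholder you would set to the agent's process ID (it defaults to the current shell so the snippet runs standalone):

```shell
# AGENT_PID is a placeholder; set it to the RHQ agent's PID.
AGENT_PID=${AGENT_PID:-$$}

# VmRSS is everything resident for the process: Java heap, JVM
# overhead, and any native (e.g. SIGAR/JNI) allocations combined.
rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/${AGENT_PID}/status")
echo "process RSS: ${rss_kb} kB"
```

If a heap dump shows only a few MB of live objects while VmRSS is near 1 GB, the difference lives in native allocations, which is exactly the pattern reported in this bug.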
The comment made on bug 582275 is not very clear; can you confirm that the memory leak is fixed? Thanks, Lionel

Hi Lionel, from bug 582275 we were not able to determine any change in memory usage from JON 2.3.1 to the latest version. We have tested with multiple AS instances on a server and not seen any leaks. I suggest you try out the RHQ 3.0.0 release, which is due out very shortly, and see if you can reproduce the problem. If you still have the problem on the latest version of RHQ, enable debug logging (set the environment variable RHQ_AGENT_DEBUG=true), start the agent, let it run while the memory builds up, and then attach the zipped log files to this issue. Thanks, Charles

FYI: I ran a 90-hour test over the July 2-5 timeframe. The agent was on Sun JDK 6, Fedora 11 with a quad-core CPU and 12 GB RAM. It was monitoring 15 JBossAS 4.2.3 instances and 5 JBossAS 5.0.1 instances with the native SIGAR library enabled, and I had it hooked up to JProfiler. I was not able to detect any memory leaks. I also ran a 24-hour test with 30 JBossAS 5.0.1 instances being monitored, with some servers down and others up. Still no leaks detected. There must be something very specific that causes this, because I've heard a couple of people report memory bloat in the agent but I've never been able to replicate it in my tests. The fact that Lionel says the memory bloat went away when the native library was deactivated in the agent configuration (rhq.agent.disable-native-system set to true) tells me there is something in the use of SIGAR: "native system" does not refer to augeas, which is not used by the core agent; it refers to the third-party SIGAR library. Augeas is only used by the augeas-based plugins. This also makes sense because Lionel says the problem doesn't exist on HP-UX 11 but does exist on other OS platforms, and since SIGAR has different builds for different platforms, it is possible that this problem occurs on one platform but not another.
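The debug capture Charles asks for twice above boils down to two environment variables set before the agent starts. A minimal sketch; the launcher path is an assumption (it varies by install), so the start command is only echoed here:

```shell
# Enable verbose agent logging, as requested in the comments above:
export RHQ_AGENT_DEBUG=true

# Also ask the JVM for a heap dump if the agent ever goes OOM:
export RHQ_AGENT_ADDITIONAL_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError"

# Start the agent (path is an assumption; echoed rather than executed):
echo "/opt/rhq-agent/bin/rhq-agent.sh"
```

Let the agent run while the memory builds up, then zip and attach the log directory to the issue.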
Let's try to narrow this down. So far it seems as though disabling the native system (i.e. SIGAR) makes the leak go away on Fedora and CentOS (curiously, both Linux-based), but there is no leak on HP-UX (curiously, not Linux-based :). I would like to know more about the resources that are backed by the native code. Specifically, we'd like more information on the types/number of CPUs, filesystems, and network adapters you have imported into your RHQ inventory under the Fedora and CentOS platforms. Any and all info on that might be helpful, especially if we can find a way to duplicate that environment for testing.

Hi John, I am taking over from Lionel and will try to give you as much information as possible to solve this problem. I installed the server and agent in version 3.0.0 on Linux CentOS. At start-up the agent uses 311 MB of memory, and then memory consumption increases gradually, at a rate of 200 MB every 24 hours. Here is the requested information (gathered on the CentOS release 5.3 (Final) platform):

CPU
- CPU 0: Type: CPU (Service); Description: AMD Sempron(tm) Processor 2600+; Version: Sempron(tm) Processor 2600+; Parent: Linux 'jopragent'. Recent measurements: System Load 0,04%; User Load 0,6%; Wait Load 0,003%.

File systems
- '/': Type: File System (Service); Description: /dev/hda2: /; Version: none; Parent: Linux 'jopragent'; Volume Type: local; Drive Type: ext3. Recent measurements: Disk Read Bytes per Minute 0KB; Disk Reads per Minute 0; Disk Write Bytes per Minute 381,6KB; Disk Writes per Minute 48,2; Free Space 69,1675GB; Used 2,6851GB; Used Percentage 4%.
- /boot: Type: File System (Service); Description: /dev/hda1: /boot; Version: none; Parent: Linux 'jopragent'; Volume Type: local; Drive Type: ext3. Recent measurements: Disk Read Bytes per Minute 0B; Disk Reads per Minute 0; Disk Write Bytes per Minute 0B; Disk Writes per Minute 0; Free Space 87,6494MB; Used 11,0674MB; Used Percentage 12%.

Network adapters
- eth0: Type: Network Adapter (Service); Description: 00:13:D4:E7:34:6F; Version: none; Parent: Linux 'jopragent'; Inet4Address: 10.156.246.133; Interface Flags: UP BROADCAST RUNNING MULTICAST. Recent measurements: Bytes Received per Minute 520,1573KB; Bytes Transmitted per Minute 79,9KB; Packets Received per Minute 6 625,3565; Packets Transmitted per Minute 67,1.
- lo: Type: Network Adapter (Service); Description: 00:00:00:00:00:00; Version: none; Parent: Linux 'jopragent'; Inet4Address: 127.0.0.1; Interface Flags: UP LOOPBACK RUNNING. Recent measurements: Bytes Received per Minute 0B; Bytes Transmitted per Minute 0B; Packets Received per Minute 0; Packets Transmitted per Minute 0.

I hope this helps; if you need any further information, don't hesitate to contact me. Best regards, Cédric Ghanassia

Cédric, can I ask you to do something? When your agent has been running for a while (preferably over 24 or 48 hours, at a point where you know the agent has gotten bigger), can you take a heap dump and zip up/attach that dump file to this bugzilla issue? I'd like to examine it to see if it shows anything. Read this to learn how to take a heap dump: http://management-platform.blogspot.com/2009/02/quick-java-heap-analysis.html Specifically, you run this shell command while the agent is still running: jmap -dump:format=b,file=dump.hprof <pid> where <pid> is your RHQ agent's process ID. I'm hoping your version of the Java JRE supports this. If not, see if you can get the heap any way you can; this might help: http://java.sun.com/developer/technicalArticles/Programming/HPROF.html which tells you how to set up the agent's JRE to dump heap at exit (using "-Xrunhprof:format=b,file=dump.hprof", for example). If you can only get the heap this way, start the agent with that setting (RHQ_AGENT_ADDITIONAL_JAVA_OPTS might be useful here), let it run for a day or two, then exit the agent normally so it can dump heap. FYI: I did notice that there is already a heap dump attached to this issue.
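The jmap invocation above can be wrapped in a small sketch. The pgrep pattern for finding the agent's PID is an assumption about the process command line, and the final command is echoed rather than executed so nothing is dumped by accident:

```shell
# Find the agent's PID (the "rhq-agent" pattern is an assumption;
# substitute the real PID if pgrep finds nothing):
pid=$(pgrep -f rhq-agent | head -n 1)

# The dump command from the comment above, to be run while the
# agent is still up. jmap ships with the Sun/Oracle JDK.
cmd="jmap -dump:format=b,file=dump.hprof ${pid:-<pid>}"
echo "$cmd"
```

The resulting dump.hprof is what should be zipped and attached to the issue.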
I looked at it and I concur with Lionel's earlier comment that the heap doesn't appear to be used very much. It would be useful to get a second heap dump just to confirm that, if you get another OOM and the heap dump doesn't show much usage, we should look at the native stuff as the culprit. This would at least eliminate the possibility that the agent's creation of SIGAR Java objects is the problem. That said, if I had to guess, I would say, based on what we know now, that it's a native leak somewhere.

Created attachment 433895 [details]
RHQ agent Java heap dump V2
OK, the second dump gives the same result: the memory leak is still in native code, as you can see in the attachment. I'm trying with the latest version of CentOS 5.4 to see if the problem occurs again.

Hi John, I retried with a stock CentOS 5.3 without additional packages, except augeas and Java, and I did not reproduce this issue. The problem is probably with our specific distribution of CentOS. We are investigating the differences between our version and the original version, and we'll let you know if we resolve this issue. Thank you, Cédric Ghanassia

Hello Cédric, have you come to a resolution on this issue?

Hi guys, I believe I have this issue as well and wanted to share it. It occurs on RHEL servers with the following specs:

1) OS/Kernel/Arch (uname -ar): Linux xxxxxxx 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
2) Java (/usr/java/latest/bin/java -version): java version "1.6.0_07", Java(TM) SE Runtime Environment (build 1.6.0_07-b06), Java HotSpot(TM) Server VM (build 10.0-b23, mixed mode) (note it is an i586 JDK)
3) RHEL version (/etc/redhat-release): Red Hat Enterprise Linux Server release 5.2 (Tikanga)
4) rhq-agent version: 3.0.0.GA
5) Relevant agent Java settings: RHQ_AGENT_JAVA_EXE_FILE_PATH="/usr/java/latest/bin/java" RHQ_AGENT_JAVA_OPTS="-Xms64m -Xmx128m -Djava.net.preferIPv4Stack=true"

The rhq-agent process will reach about 1 GB if left running for a day, and in excess of 4 GB if running for two days. Changing RHQ_AGENT_JAVA_OPTS to RHQ_AGENT_JAVA_OPTS="-Xms64m -Xmx128m -Djava.net.preferIPv4Stack=true -verbose:gc -Xloggc:/apps/rhq-agent/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError" and analysing the logs shows garbage collection is functioning normally and the agent is operating within expected thresholds for GC, used heap, and permgen.

Setting <entry key="rhq.agent.disable-native-system" value="true"/> in $AGENT_HOME/conf/agent-configuration.xml and restarting the agent addresses the issue, but I believe it will prevent monitoring of CPU/filesystem etc. This workaround is fine for now, as these are monitored by another monitoring system. However, on other servers with the same environment as listed before, but with an x86_64 JDK and an updated Java version (java version "1.6.0_23", Java(TM) SE Runtime Environment (build 1.6.0_23-b05), Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)), the memory leak has not occurred and the agent has been running for 10 days. The process uses between 130 MB and 190 MB of memory on different servers (within my expectations given the default OPTS), and rhq.agent.disable-native-system is not set.

Very interesting data, Myee; thanks for sharing that. So it looks like (at least for your case) the issue is probably due to either: a) an older version of the JRE 1.6 (_07 vs _20), or b) running a 32-bit JRE on a 64-bit machine. Correct? In addition, it seems everyone is seeing a leak in the native SIGAR layer here, not Java heap; the fact that disabling the native layer works around the issue tells me this. The original reporter mentioned they had a 32-bit machine and JDK 1.5.0_19-b02. That takes the 32-bit/64-bit issue out of the equation, but leaves the JRE as a possible culprit. This all seems to be pointing to a JRE bug that has been fixed in a later JRE (1.6.0_20 at least). So, if anyone sees this problem, the first recommendation would be to upgrade their JRE and see if it fixes the problem.

Yep, I agree. In my instance it is the Java version. I have been testing rhq-agent running on a 32-bit JVM on a 64-bit server, and the process uses the expected amount of memory and has been running for a day. This further rules out 32-bit/64-bit. This issue appears to be very similar to Bug 721152.
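The workaround Myee describes is a one-line config change. Here is a sketch that performs the edit with sed on a throwaway copy of the entry, purely for illustration; in practice you would edit the real $AGENT_HOME/conf/agent-configuration.xml (by hand works just as well) and restart the agent:

```shell
# Work on a temporary file holding the relevant entry so this
# sketch is self-contained and does not touch a real install:
conf=$(mktemp)
printf '%s\n' '<entry key="rhq.agent.disable-native-system" value="false"/>' > "$conf"

# Flip the value to "true". Note the trade-off from the comments
# above: this disables SIGAR, so native CPU/filesystem/network
# metrics will no longer be collected.
sed -i 's/value="false"/value="true"/' "$conf"
cat "$conf"
```

After restarting with this setting, the agent no longer loads the native SIGAR library, which is what makes the leak disappear in the reports above.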
Regarding comment 18, I want to point out that by disabling native support on the agent, the Apache plug-in no longer uses Augeas. I am not sure whether this was intentional or a bug, but it appears that if native support is disabled, these resources fail to start.

Assuming resolved in various future releases.