Bug 589674

Summary: Important memory leak of the agent on Linux
Product: [Other] RHQ Project    Reporter: lionel.duriez
Component: Agent    Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED CURRENTRELEASE    QA Contact: Mike Foley <mfoley>
Severity: high    Docs Contact:
Priority: high
Version: 1.3.1    CC: jshaughn, loleary, mazz, runtis, soda-ghanassia.consultant, xavier.chatelain
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2014-05-29 18:00:48 UTC    Type: ---
Bug Depends On:    
Bug Blocks: 591531    
Attachments:
  RHQ agent Java heap dump (flags: none)
  RHQ agent Java heap dump V2 (flags: none)

Description lionel.duriez 2010-05-06 16:29:41 UTC
We are experiencing severe memory leaks on the agent side.
We have installed the agent on 20 servers, and the leak has occurred on 18 of them: the agent process occupies around 1 GB of memory after running continuously for 3 days.

Here is the list of operating systems showing the problem:
- Fedora core 6 32 bit
- CentOS 5 32 bit

Only one OS is immune to the memory leak: HP-UX 11.11.

It may be due to the type of monitored servers: on HP-UX we only run Oracle, whereas on the Linux boxes we run JBoss and Tomcat servers.
A Java heap dump of the agent JVM shows no leak on the Java side, so the leak is probably in the augeas native library used by the agent.
We have tried several versions of augeas (0.5.1-1 and 0.7.0-1) without noticing any change.

Disabling the native library in the agent configuration (rhq.agent.disable-native-system set to true) stops the memory leak, but then some important metrics are no longer available (CPU, File System and Network Adapter).
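
For reference, a minimal sketch of this workaround (AGENT_HOME stands for the agent install directory; the exact path is an assumption about a default layout):

# Check the native-system switch in the agent configuration:
grep 'rhq.agent.disable-native-system' "$AGENT_HOME/conf/agent-configuration.xml"
# The entry should read:
#   <entry key="rhq.agent.disable-native-system" value="true"/>
# Restart the agent afterwards so the new setting takes effect.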

Comment 1 Charles Crouch 2010-05-12 17:52:44 UTC
Hi Lionel
Can you add the HeapDumpOnOutOfMemoryError JVM option to the agent startup, reduce the Xmx setting back to its default, and then make available to us the hprof file that is produced when the agent goes OOM? Also, which JDK are you using?
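
A rough sketch of one way to do that (the RHQ_AGENT_ADDITIONAL_JAVA_OPTS variable and the rhq-agent.sh script name are assumptions about a default agent install):

# Have the agent JVM write an hprof file if it ever hits an OutOfMemoryError.
export RHQ_AGENT_ADDITIONAL_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError"
"$AGENT_HOME/bin/rhq-agent.sh"    # then start the agent as usual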

Thanks
Charles

Comment 2 lionel.duriez 2010-05-17 09:35:52 UTC
Created attachment 414497 [details]
RHQ agent Java heap dump

The heap dump was taken while the RHQ agent Java process occupied 1 GB of memory. Only 9.7 MB of Java heap is in use.

Comment 3 lionel.duriez 2010-05-17 09:38:01 UTC
We use JDK 1.5.0_19-b02; the JVM memory settings (-Xms, -Xmx) are the defaults.
We do not get an out-of-memory error on the Java side.

Analyzing the heap dump shows that only 9.7 MB of Java memory is in use while the RHQ agent Java process occupies 1 GB. The memory leak is in native code, but we do not know how to identify it; it might be due to a bug in the JNI code of the agent or a leak in the augeas library.
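
A minimal shell sketch of one way to show that the growth is outside the Java heap: log the process resident size alongside the collector's view of the heap (assumes the JDK's jstat tool is available and that the agent process can be found with pgrep):

# Record whole-process RSS and Java heap utilisation every 5 minutes.
AGENT_PID=$(pgrep -f rhq-agent | head -n 1)
while true; do
    RSS_KB=$(ps -o rss= -p "$AGENT_PID")              # resident size of the whole process, in KB
    HEAP=$(jstat -gcutil "$AGENT_PID" | tail -n 1)    # heap/permgen utilisation percentages
    echo "$(date) rss_kb=$RSS_KB jstat=$HEAP" >> /tmp/rhq-agent-mem.log
    sleep 300
done
# If rss_kb keeps climbing while the jstat columns stay flat, the growth is in
# native memory (JNI/SIGAR), not in the Java heap.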

Comment 4 Charles Crouch 2010-05-24 17:46:10 UTC
Hi Lionel
Can you confirm which version of RHQ you are using? If you are not using the latest, can you try installing RHQ 3.0.0.B05:

http://sourceforge.net/projects/rhq/files/rhq/rhq-3.0.0.B05/rhq-enterprise-server-3.0.0.B05.zip/download

I don't think the problem is augeas related since we've not yet released a version of RHQ which supports using augeas with JDK5. Any augeas usage should have resulted in a class version exception in the agent logs.

One thing to try if you install a new version is to remove the rhq-virt-plugin-3.0.0.B05.jar plugin from rhq-server-3.0.0.B05/jbossas/server/default/deploy/rhq.ear.rej/rhq-downloads/rhq-plugins/ before you install. This will remove one more plugin which uses a native library.

If you still have the problem on the latest version of RHQ, can you enable debug logging: set the environment variable RHQ_AGENT_DEBUG=true, start the agent, let it run while the memory builds up, and then attach the zipped log files to this issue.
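
A short sketch of that sequence (the install path and script name are assumptions about a default agent layout):

# Enable agent debug logging and start the agent.
export RHQ_AGENT_DEBUG=true
"$AGENT_HOME/bin/rhq-agent.sh"
# ...let it run until the memory has visibly built up, then zip the logs:
zip -r rhq-agent-logs.zip "$AGENT_HOME/logs"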

Thanks
Charles

Comment 5 lionel.duriez 2010-05-25 08:36:35 UTC
Hi

We use Jopr 2.3.1 and the bundled RHQ agent 1.3.1.
When the augeas library was not installed on the monitored servers, we were getting error messages in the logs about the missing augeas library.

What is the difference between the RHQ server and Jopr?

If the problem is fixed in version 3.0.0.B05 of the RHQ server, are we assured it will also be fixed in the next Jopr version?

Thanks
Lionel

Comment 6 lionel.duriez 2010-06-28 13:59:49 UTC
Related bug 582275 (High memory consumption, OOM errors (heap space)) has been set to verified.
The comment made on bug 582275 is not very clear; can you confirm that the memory leak is fixed?

Thanks,
Lionel

Comment 7 Charles Crouch 2010-07-07 02:52:00 UTC
Hi Lionel
From bug 582275 we were not able to determine any change in memory usage from JON 2.3.1 to the latest version. We have tested with multiple AS instances on a server and not seen any leaks.

I suggest you try out the RHQ 3.0.0 release, which is due out very shortly, and see if you can reproduce the problem. If you still have the problem on the latest version of RHQ, enable debug logging (set the environment variable RHQ_AGENT_DEBUG=true), start the agent, let it run while the memory builds up, and then attach the zipped log files to this issue.

Thanks
Charles

Comment 8 John Mazzitelli 2010-07-08 17:36:03 UTC
FYI: I ran a 90-hour test over the July 2-5 timeframe. The agent was on Sun JDK 6, Fedora 11, with a quad-core CPU and 12 GB RAM. It was monitoring 15 JBossAS 4.2.3 instances and 5 JBossAS 5.0.1 instances with the native SIGAR library enabled. I had it hooked up to JProfiler. I was not able to detect any memory leaks. I also ran a 24-hour test with 30 JBossAS 5.0.1 instances being monitored, with some servers down and others up. Still no leaks detected.

There must be something very specific that causes this - because I've heard a couple people report memory bloat in the agent but I've never been able to replicate it in my tests.

The fact that Lionel says the memory bloat went away after disabling the native library in the agent configuration (rhq.agent.disable-native-system set to true) tells me there is something in the use of SIGAR ("native system" does not refer to augeas, which is not used by the core agent; "native system" refers to the third-party SIGAR library). Augeas is only used by the augeas-based plugins.

This also makes sense because Lionel says the problem doesn't exist on HPUX 11 but does exist on other OS platforms - and since SIGAR has different builds for different platforms, it is possible that this problem occurs on one platform but not another.

Comment 9 John Mazzitelli 2010-07-08 18:15:56 UTC
Let's try to narrow this down. So far it seems as though disabling the native system (i.e. SIGAR) makes the leak problem go away on Fedora and CentOS (curiously, both are Linux based) but there is no leak on HPUX (curiously, not Linux based :)

I would like more information about the resources that are backed by the native code. Specifically, we'd like more detail on the types/number of CPUs, filesystems, and network adapters you have imported into your RHQ inventory under the Fedora and CentOS platforms. Any and all info on that might be helpful, especially if we can find a way to duplicate that environment for testing.
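
If it helps, plain Linux commands along these lines (nothing RHQ-specific) would capture the raw platform side of that information:

grep -c '^processor' /proc/cpuinfo      # number of CPUs
grep -m1 'model name' /proc/cpuinfo     # CPU type
df -hT                                  # mounted filesystems and their types
/sbin/ifconfig -a                       # network adapters and their addresses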

Comment 10 soda-ghanassia.consultant 2010-07-20 12:15:19 UTC
Hi John,
I am taking over from Lionel and I'll try to give you as much information as possible to solve this problem. I installed the server and agent version 3.0.0 on Linux CentOS. At start-up the agent uses 311 MB of memory, and then memory consumption gradually increases at a rate of about 200 MB every 24 hours.

Here is the requested information (gathered on the CentOS release 5.3 (Final) platform):

CPU
CPU 0
Type: CPU (Service)	            Description: AMD Sempron(tm) Processor 2600+
Version: Sempron(tm) Processor 2600+    Parent: Linux 'jopragent'

Recent Measurements
System Load: 0.04%
User Load: 0.6%
Wait Load: 0.003%

File System:
'/'
Type: File System (Service)	Description: /dev/hda2: /
Version: none			Parent: Linux 'jopragent'

Volume Type: local	Drive Type: ext3

Recent Measurements
Disk Read Bytes per Minute: 0 KB
Disk Reads per Minute: 0
Disk Write Bytes per Minute: 381.6 KB
Disk Writes per Minute: 48.2
Free Space: 69.1675 GB
Used: 2.6851 GB
Used Percentage: 4%

/boot
Type: File System (Service)	Description: /dev/hda1: /boot
Version: none			Parent: Linux 'jopragent'

Volume Type: local	Drive Type: ext3
Recent Measurements
Disk Read Bytes per Minute: 0 B
Disk Reads per Minute: 0
Disk Write Bytes per Minute: 0 B
Disk Writes per Minute: 0
Free Space: 87.6494 MB
Used: 11.0674 MB
Used Percentage: 12%

Network Adapter

ETH0
Type: Network Adapter (Service)	        Description: 00:13:D4:E7:34:6F
Version: none				Parent: Linux 'jopragent'
Inet4Address: 10.156.246.133		Interface Flags: UP BROADCAST RUNNING MULTICAST 

Recent Measurements
Bytes Received per Minute: 520.1573 KB
Bytes Transmitted per Minute: 79.9 KB
Packets Received per Minute: 6,625.3565
Packets Transmitted per Minute: 67.1

LO
Type: Network Adapter (Service)	Description: 00:00:00:00:00:00
Version: none			Parent: Linux 'jopragent'
Inet4Address: 127.0.0.1		Interface Flags: UP LOOPBACK RUNNING

Recent Measurements
Bytes Received per Minute: 0B 
Bytes Transmitted per Minute: 0B 
Packets Received per Minute: 0 
Packets Transmitted per Minute: 0



I hope this helps; if you need any further information, don't hesitate to contact me.

Best regards,
Cédric Ghanassia

Comment 11 John Mazzitelli 2010-07-20 14:01:43 UTC
Cedric,

Can I ask you to try something? When your agent has been running for a while (preferably over 24 or 48 hours, at a point when you know the agent has grown), can you take a heap dump and zip up and attach that dump file to this Bugzilla issue? I'd like to examine it to see if it shows anything.

Read this to know how to take a heap dump:

http://management-platform.blogspot.com/2009/02/quick-java-heap-analysis.html

Specifically you run this shell command while the agent is still running:

jmap -dump:format=b,file=dump.hprof <pid>

where <pid> is your RHQ agent's process ID.

I'm hoping your version of the Java JRE supports this. If not, see if you can get the heap any way you can - these might help:

http://java.sun.com/developer/technicalArticles/Programming/HPROF.html

That tells you how to set up the agent's JRE to dump the heap at exit (using "-Xrunhprof:format=b,file=dump.hprof", for example). If you can only get the heap this way, start the agent with that setting (RHQ_AGENT_ADDITIONAL_JAVA_OPTS might be useful here), let it run for a day or two, and then exit the agent normally so it can dump the heap.
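
Putting the above together, a sketch of the whole sequence (the pgrep pattern is only a guess at the agent's process name; any way of finding the PID is fine):

AGENT_PID=$(pgrep -f rhq-agent | head -n 1)        # locate the running agent
jmap -dump:format=b,file=/tmp/rhq-agent.hprof "$AGENT_PID"
zip /tmp/rhq-agent-heap.zip /tmp/rhq-agent.hprof   # zip it before attaching here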

Comment 12 John Mazzitelli 2010-07-22 12:16:54 UTC
FYI: I did notice that there is already a heap dump attached to this issue. I looked at it and I concur with Lionel's earlier comment that the heap does not appear to be used very much. It would be useful to get a second heap dump just to confirm that, if you get another OOM and the heap dump still doesn't show much usage, we should look at the native layer as the culprit. This would at least eliminate the possibility that the agent's creation of SIGAR Java objects is the problem.

That said, if I had to guess, I would say, based on what we know now, that it's a native leak somewhere.

Comment 13 soda-ghanassia.consultant 2010-07-23 08:06:22 UTC
Created attachment 433895 [details]
RHQ agent Java heap dump V2

Comment 14 soda-ghanassia.consultant 2010-07-23 08:08:48 UTC
OK, the second dump gives the same result: the memory leak is still in native code, as you can see in the attachment.
I'm trying with the latest version of CentOS (5.4) to see if the problem occurs again.

Comment 15 soda-ghanassia.consultant 2010-08-03 12:33:56 UTC
Hi John,
I retried with a plain CentOS 5.3 install without additional packages, except augeas and Java, and I did not reproduce this issue.
The problem is probably with our specific distribution of CentOS. We are investigating the differences between our version and the original version.
We'll let you know if we resolve this issue.
Thank you,
Cédric Ghanassia

Comment 16 Corey Welton 2010-09-21 02:02:21 UTC
Hello Cédric,

Have you come to a resolution on this issue?

Comment 17 Myee Riri 2010-12-15 00:23:29 UTC
Hi Guys,

I believe I have this issue as well and wanted to share it.

On RHEL servers with the following specs:

1) OS/Kernel/Arch

uname -ar

Linux xxxxxxx 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

2) Java

/usr/java/latest/bin/java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
Java HotSpot(TM) Server VM (build 10.0-b23, mixed mode)

(note it is an i586 JDK)

3) RHEL Version

/etc/redhat-release

Red Hat Enterprise Linux Server release 5.2 (Tikanga)

4) rhq-agent version 3.0.0.GA

5) Relevant agent Java settings:

RHQ_AGENT_JAVA_EXE_FILE_PATH="/usr/java/latest/bin/java"
RHQ_AGENT_JAVA_OPTS="-Xms64m -Xmx128m -Djava.net.preferIPv4Stack=true"

The rhq-agent process will reach about 1 GB if left running for a day, and in excess of 4 GB if running for 2 days.

Changing the RHQ_AGENT_JAVA_OPTS to 

RHQ_AGENT_JAVA_OPTS="-Xms64m -Xmx128m -Djava.net.preferIPv4Stack=true -verbose:gc -Xloggc:/apps/rhq-agent/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError"

and analysing the logs shows garbage collection is functioning normally and the agent is operating within expected thresholds for gc, used heap, and permgen.

Setting

 <entry key="rhq.agent.disable-native-system" value="true"/>

in $AGENT_HOME/conf/agent-configuration.xml and restarting the agent addresses the issue, but I believe it will prevent monitoring of CPU, filesystem, etc. This workaround is fine for now, as these are monitored by another monitoring system.

However, on other servers with the same environment as listed above, but with an x86_64 JDK and an updated Java version:

/usr/java/latest/bin/java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)

the memory leak has not occurred and the agent has been running for 10 days. The process uses between 130 MB and 190 MB of memory on different servers (this is within my expectations given the default OPTS), and rhq.agent.disable-native-system has not been set.

Comment 18 John Mazzitelli 2010-12-15 04:19:26 UTC
Very interesting data, Myee - thanks for sharing that. So it looks like (at least in your case) the issue is probably due to either:

a) an older version of the JRE 1.6 (_07 vs _20)

*or*

b) running a 32 bit JRE on a 64 bit machine.

Correct?

In addition, it seems everyone is seeing a leak in the native SIGAR layer here, not in the Java heap. The fact that disabling the native layer works around the issue tells me this.

The original requestor mentioned they had a 32 bit machine and JDK 1.5.0_19-b02. That takes out the 32-bit/64-bit issue from the equation, but leaves the JRE as a possible culprit.

This seems to all be pointing to a JRE bug that has been fixed in a later JRE (1.6.0_20 at least).

So, if anyone sees this problem, the first recommendation would be to upgrade their JRE and see if it fixes the problem.
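
As a quick check before upgrading, something like this shows which JRE the agent actually uses and whether it is a 32-bit or 64-bit build (RHQ_AGENT_JAVA_EXE_FILE_PATH is the variable shown in comment 17; set it from the agent's environment script if it is not already in your shell):

"$RHQ_AGENT_JAVA_EXE_FILE_PATH" -version      # JRE version the agent runs on
file -L "$RHQ_AGENT_JAVA_EXE_FILE_PATH"       # reports an ELF 32-bit or 64-bit executable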

Comment 19 Myee Riri 2010-12-15 05:36:51 UTC
Yep, I agree.

In my case it is the Java version. I have been testing the rhq-agent running on a 32-bit JVM on a 64-bit server; the process uses the expected amount of memory and has been running for a day. This further rules out the 32-bit/64-bit theory.

Comment 20 Larry O'Leary 2011-07-13 21:25:04 UTC
This issue appears to be very similar to Bug 721152.

With regard to comment 18, I want to point out that by disabling native support on the agent, the Apache plug-in no longer uses Augeas. I am not sure if this was on purpose or a bug, but it appears that if native support is disabled, those resources fail to start.

Comment 21 Jay Shaughnessy 2014-05-29 18:00:48 UTC
Assuming this has been resolved in subsequent releases.