Bug 766959

Summary: Memory leak in agent; likely use of native code
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Agent
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: medium
Version: 4.1
CC: dsteigne, hrupp, lkrejci, loleary
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=773031
Doc Type: Bug Fix
Last Closed: 2013-09-01 15:18:46 EDT
Attachments:
result of leak detection
Patch based on commit 1174064a0372d31199d75939b64d36eaa2232d02

Description Elias Ross 2011-12-12 15:57:53 EST
Description of problem:

Memory leaks over time with RHQ agent loaded.
Java heap shows very little memory consumed, most seems to be non-Java memory, i.e. native code leaking.

See: http://community.jboss.org/message/641142#641142

From Top:

	 PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
	 15   0 11.0g  10g  13m S  0.0 33.0  26:03.99 java

That's 11 GB virtual and 10 GB resident. jhat and the like show far less in the JVM itself. Here is a class histogram:
num 	#instances     	#bytes  class name
   1:     	39295    	5184840  <constMethodKlass>
   2:     	39295    	4725784  <methodKlass>
   3:      	4091    	4414496  <constantPoolKlass>
   4:     	43604    	3832328  [C
   5:     	64747    	3511200  <symbolKlass>
   6:      	4712    	3101592  [I
   7:      	4091    	2994248  <instanceKlassKlass>
   8:      	3706    	2709792  <constantPoolCacheKlass>
   9:     	23291    	2499224  [Ljava.lang.Object;
  10:     	12682    	2042320  [Ljava.util.HashMap$Entry;

$ /usr/java/jdk1.6.0_17/bin/jmap -heap 28675 
Attaching to process ID 28675, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 14.3-b01

using thread-local object allocation.
Parallel GC with 8 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 134217728 (128.0MB)
   NewSize          = 2686976 (2.5625MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 88080384 (84.0MB)
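
The gap between the resident size reported by top and the Java heap can also be sampled from inside a JVM. What follows is only a rough sketch, assuming the Linux /proc/<pid>/status format (a "VmRSS:" line in kB). The class and method names are made up for illustration, and the "outside heap" figure still includes normal JVM overhead such as thread stacks and the code cache, so only its growth over time is meaningful.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RssVersusHeap {

    /** Returns the VmRSS value in kB from /proc/<pid>/status lines, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:     123456 kB".
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    /** Java heap currently in use, in kB, as seen by this JVM. */
    static long usedHeapKb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / 1024;
    }

    public static void main(String[] args) throws IOException {
        // Reads this JVM's own status file; the path is Linux-specific.
        List<String> lines =
                Files.readAllLines(Paths.get("/proc/self/status"), StandardCharsets.UTF_8);
        long rssKb = parseVmRssKb(lines);
        long heapKb = usedHeapKb();
        System.out.println("resident set: " + rssKb + " kB");
        System.out.println("used heap:    " + heapKb + " kB");
        System.out.println("outside heap: " + (rssKb - heapKb) + " kB (rough)");
    }
}
```

Logging these three numbers periodically would show whether the growth is in the heap or outside it, which is the distinction the top and jmap output above is pointing at.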

I'm guessing the plugin update/upgrade doesn't update the shared libraries as well?

Or the new versions have leaks?

Here is what I have (notice the dates):

$ ls -l /usr/local/rhq-agent/lib/augeas/lib/
total 412
-rw-r--r-- 1 root root 260432 Jul 26 20:19 libaugeas.so
-rw-r--r-- 1 root root  72404 Jul 26 20:19 libfa.so
-rw-r--r-- 1 root root  72404 Jul 26 20:19 libfa.so.1

Here is the inventory for this particular host:

		Aliases File
		Apache HTTP Servers
		Bundle Handler - Ant
		Bundle Handler - File Template
		File Systems
		Hosts File
		Network Adapters
		Postfix Servers
		RHQ Agent
		Samba Servers

Version-Release number of selected component (if applicable):

Agent version 4.1 (updated from 4.0). Also present in 4.0.
Linux vg61l01ad-opsdev002.apple.com 2.6.18- #1 SMP Fri Jul 15 04:42:13 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

How reproducible:

Every time, every system we have.

Steps to Reproduce:
1. Install agent
2. Import all resources
3. Wait a few days.

Actual results:

Memory usage increases slowly over time (maybe 1GB per few days?)

Expected results:

No leaks ;-)

Additional info:
Comment 1 Elias Ross 2011-12-12 16:36:05 EST
Hardware info:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         32183      26323       5860          0        642      12349
-/+ buffers/cache:      13331      18852
Swap:        18047         14      18033


8 of these (from /proc/cpuinfo):
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
cpu MHz		: 2133.469
cache size	: 4096 KB

root@vg61l01ad-opsdev002 # lspci -tv

-[0000:00]-+-00.0  Intel Corporation 5520 I/O Hub to ESI Port
           +-01.0-[03]----00.0  Hewlett-Packard Company Smart Array G6 controllers
           +-08.0-[02]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
           |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
           +-1d.0  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
           +-1d.1  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
           +-1d.2  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
           +-1d.3  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
           +-1d.7  Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
           +-1e.0-[01]--+-03.0  ATI Technologies Inc ES1000
           |            +-04.0  Compaq Computer Corporation Integrated Lights Out Controller
           |            +-04.2  Compaq Computer Corporation Integrated Lights Out  Processor
           |            +-04.4  Hewlett-Packard Company Proliant iLO2/iLO3 virtual USB controller
           |            \-04.6  Hewlett-Packard Company Proliant iLO2 virtual UART
           +-1f.0  Intel Corporation 82801JIB (ICH10) LPC Interface Controller
           \-1f.2  Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1
Comment 2 Lukas Krejci 2011-12-13 02:50:05 EST
Elias, this indeed looks like it might be a problem with augeas as mentioned on the forum pages.

Could you check whether the augeas library is installed system-wide, and if so, which version?

On RHEL/Fedora, do something like:

yum info augeas-libs
Comment 3 Elias Ross 2011-12-13 14:29:19 EST
Installed Packages
Name       : augeas-libs
Arch       : x86_64
Version    : 0.7.4
Release    : 1.el5
Size       : 950 k
Repo       : installed
Summary    : Libraries for augeas
URL        : http://augeas.net/
License    : LGPLv2+
Description: The libraries for augeas.

Available Packages
Name       : augeas-libs
Arch       : i386
Version    : 0.9.0
Release    : 1.el5
Size       : 338 k
Repo       : epel-5
Summary    : Libraries for augeas
URL        : http://augeas.net/
License    : LGPLv2+
Description: The libraries for augeas.

Name       : augeas-libs
Arch       : x86_64
Version    : 0.9.0
Release    : 1.el5
Size       : 340 k
Repo       : epel-5
Summary    : Libraries for augeas
URL        : http://augeas.net/
License    : LGPLv2+
Description: The libraries for augeas.

Yes, looks so... So maybe I need to update? What's the version to use?
Comment 4 Lukas Krejci 2011-12-14 03:20:30 EST
Ok, that I hope explains it.

While the RHQ agent bundles the augeas libraries (in version 0.9.0), it uses the system-wide augeas in preference if it is available.

augeas 0.7.4 has a known memory leak (I think), so I'd recommend either a) upgrading the system-wide augeas to 0.9.0 or, if that is not possible, b) starting the agent with a modified LD_LIBRARY_PATH that points to the bundled augeas location before the standard paths:

LD_LIBRARY_PATH=$RHQ_AGENT_HOME/lib/augeas/lib64:"$LD_LIBRARY_PATH" bin/rhq-agent.sh
Comment 5 Elias Ross 2011-12-15 12:06:32 EST
I'll keep an eye on things and verify that it isn't leaking memory.
Comment 6 Elias Ross 2011-12-20 19:31:48 EST
I am still seeing a leak, even though Augeas was updated to 0.9.0 system-wide.

Contrary to what you claim, the wrapper seems to always boot with the shipped version of Augeas, as it prepends LD_LIBRARY_PATH no matter what...

# ----------------------------------------------------------------------
# Prepare LD_LIBRARY_PATH to include libraries shipped with the agent
# ----------------------------------------------------------------------

if [ "x$_LINUX" != "x" ]; then
   if [ "x$LD_LIBRARY_PATH" = "x" ]; then
      if [ "x$_X86_64"  != "x" ]; then
      if [ "x$_X86_64"  != "x" ]; then
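
Which copy of libaugeas a running agent has actually mapped can be confirmed from /proc/<pid>/maps rather than inferred from the script. A minimal sketch, assuming the usual Linux maps format (address, perms, offset, dev, inode, optional pathname); the class name is hypothetical, and reading another process's maps file needs the same user or root.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MappedLibraryFinder {

    /** Collects the distinct mapped file paths whose name contains the needle. */
    static Set<String> findMapped(List<String> mapsLines, String needle) {
        Set<String> paths = new TreeSet<String>();
        for (String line : mapsLines) {
            // Fields: address perms offset dev inode [pathname]
            String[] fields = line.trim().split("\\s+", 6);
            if (fields.length == 6 && fields[5].contains(needle)) {
                paths.add(fields[5]);
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MappedLibraryFinder <pid>   (defaults to this JVM)
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines =
                Files.readAllLines(Paths.get("/proc/" + pid + "/maps"), StandardCharsets.UTF_8);
        for (String path : findMapped(lines, "libaugeas")) {
            System.out.println(path);
        }
    }
}
```

Run against the agent's PID, this prints each distinct libaugeas path the process has loaded, which settles whether the bundled or the system-wide copy is in use.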

In any case, what I have system-wide (0.9.0) matches the agent's included copy.

$ rpm --query --file /usr/lib/libaugeas.so.0.14.0 
$ diff /usr/lib/libaugeas.so.0.14.0 /usr/local/rhq-agent/lib/augeas/lib/libaugeas.so ; echo $?

I'm seeing a growth of about 20GB of memory over 14 days.

I'm guessing there must be a leak someplace else. What can I look at?
Comment 7 Lukas Krejci 2011-12-22 11:26:39 EST
OK, this is bad.

Not sure what's leaking and where so I would like to ask you to do the following steps to get to the bottom of this:

1) Check out https://github.com/metlos/rhq-project-samples and build the augeas leak detector that I put together, which is located in rhq-project-samples/agent/debug-tools/augeas-leak-detector

2) Follow the directions in the README.txt there to run the agent with the detector.

3) Grab the agent log covering the time the agent ran with the detector, plus augeas-leak-detection-results.txt, which should be generated in RHQ_AGENT_HOME after you stop the instrumented agent, and attach both files to this BZ.
Comment 8 Elias Ross 2012-01-09 17:47:40 EST
Created attachment 551692
result of leak detection

I ran the agent for about 4 hours.

I still see the memory usage increase, but nothing in the report seems suspicious. Could it be something in another native library?
Comment 9 Lukas Krejci 2012-01-10 10:10:20 EST
Well, there are a couple of leaked references, but I'd agree with you that they probably don't contribute to the continuous increase in memory usage. These references are created on resource component startup and stored in instance fields of the components, and the components live for the lifetime of the agent. So they shouldn't add to the leak unless there is some more "hidden" leak inside the augeas library itself (none has been reported as of yet).

The only other native library we use is sigar. It has been stable for a long time, so I doubt the leak is there, but of course I can't rule that out. You can turn off the usage of sigar by disabling the native system in the agent. On the agent prompt:
setconfig rhq.agent.disable-native-system=true
or if you have your agent inventoried, you can check the "Disable Native System" property in the Connection Settings of the agent resource.

Unfortunately, I was still not able to reproduce this locally so I will have to ask you for some more testing:

1) Does disabling the agent's native system help?

2) How does the leak change when you disable the augeas-based plugins?
Aliases, Apache HTTP Server, Cobbler, Cron, Hosts, Postfix, Samba, Sudo Access

3) If the leak disappears when you disable the above, enable them one by one to see which of them leak.
Comment 10 Lukas Krejci 2012-01-11 07:12:41 EST
The cause of the leaks detected by the augeas leak detector is most probably bug 773031.

At the same time I think we're dealing with something more than just that here.
Comment 11 Elias Ross 2012-01-11 19:39:42 EST
1) Native system disabled: Memory still leaked.
2) I disabled those but it still leaked.

Once I ALSO disabled "OpenSSH,GRUB,Iptables" the memory use seems to be stable, but I will check overnight. (I suspected there were more plugins than the ones you listed, so I grepped around for Augeas in the source tree.)

So, I suspect one of those three plugins, though there could be more than one that leaks. What seems likeliest?

I'll report tomorrow.
Comment 12 Elias Ross 2012-01-13 00:57:04 EST
Seems that GRUB is suspect, and perhaps others, but at least enabling GRUB has created a leak. (It is hard to observe.)

Few questions:

1) How often is loadResourceConfiguration() called?

I don't know the call flow, but it appears "augeas" is never closed:

public class GrubComponent implements ResourceComponent, ConfigurationFacet {

    public Configuration loadResourceConfiguration() throws Exception {
        Configuration pluginConfiguration = resourceContext.getPluginConfiguration();

        return loadResourceConfiguration(pluginConfiguration);
    }

    public Configuration loadResourceConfiguration(Configuration pluginConfiguration) throws Exception {
        // Gather data necessary to create the Augeas hook
        // ... (excerpt elided)

        Augeas augeas = new Augeas(rootPath, lensesPath, Augeas.NONE);
        // ... (rest of method elided; augeas.close() is never called)
    }
}

The same issue (no "close") appears in other places:

rhq/modules/plugins (plugingen) $ git grep -E "new +Augeas\\("

apache/src/main/java/org/rhq/plugins/apache/ApacheServerComponent.java:                ag = new Augeas();
apache/src/test/java/org/rhq/plugins/apache/ApacheAugeasTest.java:              Augeas ag = new Augeas();
augeas/src/main/java/org/rhq/augeas/AugeasProxy.java:            augeas = new Augeas(config.getRootPath(), config.getLoadPath(), config.getMode());
augeas/src/main/java/org/rhq/plugins/augeas/AugeasConfigurationComponent.java:            augeas = new Augeas(this.augeasRootPath, augeasLoadPath, Augeas.NO_MODL_AUTOLOAD);
augeas/src/main/java/org/rhq/plugins/augeas/helper/AugeasRawConfigHelper.java:        Augeas aug = new Augeas(rootPath, loadPath, Augeas.NO_MODL_AUTOLOAD);
grub/src/main/java/org/rhq/plugins/grub/GrubComponent.java:        Augeas augeas = new Augeas(rootPath, lensesPath, Augeas.NONE);
sshd/src/main/java/org/rhq/plugins/sshd/OpenSSHDComponent.java:        Augeas augeas = new Augeas(rootPath, lensesPath, Augeas.NONE);

I have a proposed patch but will test and need approval before submission.
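
For illustration only (this is not the attached patch): a minimal sketch of the try/finally shape that releases a native handle on every call. NativeHandle is a hypothetical stand-in for the Augeas wrapper so the example runs without the native library; it only assumes the real class has a close() method, as noted above. The path string is a placeholder.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AugeasCloseSketch {

    /** Hypothetical stand-in for a Java wrapper around a native handle. */
    static class NativeHandle {
        static final AtomicInteger OPEN = new AtomicInteger();

        NativeHandle() {
            OPEN.incrementAndGet(); // models native memory being allocated
        }

        String get(String path) {
            if (path == null) {
                throw new IllegalArgumentException("path must not be null");
            }
            return "value-of:" + path;
        }

        void close() {
            OPEN.decrementAndGet(); // models native memory being released
        }
    }

    /** Leaky shape: a handle is created on every call and never closed. */
    static String loadLeaky(String path) {
        NativeHandle handle = new NativeHandle();
        return handle.get(path);
    }

    /** Fixed shape: close() runs in finally, even if reading throws. */
    static String loadClosed(String path) {
        NativeHandle handle = new NativeHandle();
        try {
            return handle.get(path);
        } finally {
            handle.close();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            loadLeaky("/placeholder/path");
        }
        System.out.println("open after leaky calls:  " + NativeHandle.OPEN.get());

        NativeHandle.OPEN.set(0);
        for (int i = 0; i < 100; i++) {
            loadClosed("/placeholder/path");
        }
        System.out.println("open after closed calls: " + NativeHandle.OPEN.get());
    }
}
```

The leaky loop leaves 100 handles open and the closed loop leaves none. Native memory held by such handles is invisible to the Java heap, which would fit the symptoms reported here if loadResourceConfiguration() is called repeatedly.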
Comment 13 Elias Ross 2012-01-16 12:53:36 EST
My plugin fixes seem to have worked.

Memory usage does grow for about 24 hours, then remains steady at about 300MB with perhaps slow growth over time, but that is much better than before.

So the fixes required are for 'grub' and 'sshd' plugins only.
Comment 14 dsteigne 2012-01-19 07:35:57 EST
Will there be a JON 3.0 patch for this?  If so, is there an ETA?
Comment 15 Larry O'Leary 2012-01-19 11:22:30 EST
If the issue affects the grub and sshd plug-ins, there will be no patch for JON 3.0 as these plug-ins are not included with the product.
Comment 16 Elias Ross 2012-02-13 17:29:28 EST
Created attachment 561689
Patch based on commit 1174064a0372d31199d75939b64d36eaa2232d02

Approved by employer
Comment 17 Lukas Krejci 2012-02-14 05:12:42 EST
Thanks, Elias!

I only changed one minor thing - in SSHD plugin I changed the getConfig() method to throw an Exception, just to minimize the diff.

master: http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=ff54dedcfdefb994c8390d6f68393134b3842e8e
Author: Elias Ross <elias_ross@apple.com>
Date:   Thu Jan 12 21:51:51 2012 -0800

    [BZ 766959] fix possible memory leak in plugins
Comment 18 Heiko W. Rupp 2013-09-01 15:18:46 EDT
Bulk closing of BZs that have no target version set, but which are ON_QA for more than a year and thus are in production for a long time.