Description of problem:

Memory leaks over time with RHQ agent loaded. Java heap shows very little memory consumed, most seems to be non-Java memory, i.e. native code leaking.

See: http://community.jboss.org/message/641142#641142

From top:

    PID USER PR NI VIRT RES SHR S %CPU %MEM
    15 0 11.0g 10g 13m S 0.0 33.0 26:03.99 java

That's 11g of memory usage. jhat and the like show quite a lot less in the JVM itself. Here is a histogram:

    num    #instances    #bytes   class name
      1:        39295   5184840   <constMethodKlass>
      2:        39295   4725784   <methodKlass>
      3:         4091   4414496   <constantPoolKlass>
      4:        43604   3832328   [C
      5:        64747   3511200   <symbolKlass>
      6:         4712   3101592   [I
      7:         4091   2994248   <instanceKlassKlass>
      8:         3706   2709792   <constantPoolCacheKlass>
      9:        23291   2499224   [Ljava.lang.Object;
     10:        12682   2042320   [Ljava.util.HashMap$Entry;

    $ /usr/java/jdk1.6.0_17/bin/jmap -heap 28675
    Attaching to process ID 28675, please wait...
    Debugger attached successfully.
    Server compiler detected.
    JVM version is 14.3-b01

    using thread-local object allocation.
    Parallel GC with 8 thread(s)

    Heap Configuration:
       MinHeapFreeRatio = 40
       MaxHeapFreeRatio = 70
       MaxHeapSize      = 134217728 (128.0MB)
       NewSize          = 2686976 (2.5625MB)
       MaxNewSize       = 17592186044415 MB
       OldSize          = 5439488 (5.1875MB)
       NewRatio         = 2
       SurvivorRatio    = 8
       PermSize         = 21757952 (20.75MB)
       MaxPermSize      = 88080384 (84.0MB)

I'm guessing the plugin update/upgrade doesn't update the shared libraries as well? Or the new versions have leaks?
Here is what I have (notice the dates):

    $ ls -l /usr/local/rhq-agent/lib/augeas/lib/
    total 412
    -rw-r--r-- 1 root root 260432 Jul 26 20:19 libaugeas.so
    -rw-r--r-- 1 root root  72404 Jul 26 20:19 libfa.so
    -rw-r--r-- 1 root root  72404 Jul 26 20:19 libfa.so.1

Here is the inventory for this particular host:

    host.com
      Aliases File
      Apache HTTP Servers
      Bundle Handler - Ant
      Bundle Handler - File Template
      Cobblers
      CPUs
      Cron
      File Systems
      GRUB
      Hosts File
      Network Adapters
      Postfix Servers
      RHQ Agent
      Samba Servers
      SnmpTrapds
      SSHDs
      Sudoers

Version-Release number of selected component (if applicable):

Agent version 4.1 (updated from 4.0.) Also present in 4.0.

    Linux vg61l01ad-opsdev002.apple.com 2.6.18-238.19.1.0.1.el5 #1 SMP Fri Jul 15 04:42:13 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
    Red Hat Enterprise Linux Server release 5.6 (Tikanga)

How reproducible:

Every time, every system we have.

Steps to Reproduce:
1. Install agent
2. Import all resources
3. Wait a few days.

Actual results:

Memory usage increases slowly over time (maybe 1GB per few days?)

Expected results:

No leaks ;-)

Additional info:
Hardware info:

    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:         32183      26323       5860          0        642      12349
    -/+ buffers/cache:      13331      18852
    Swap:        18047         14      18033

CPU: 8 of these

    vendor_id   : GenuineIntel
    cpu family  : 6
    model       : 26
    model name  : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
    stepping    : 5
    cpu MHz     : 2133.469
    cache size  : 4096 KB

    root@vg61l01ad-opsdev002 # lspci -tv
    -[0000:00]-+-00.0  Intel Corporation 5520 I/O Hub to ESI Port
               +-01.0-[03]----00.0  Hewlett-Packard Company Smart Array G6 controllers
               +-08.0-[02]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
               |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
               +-1d.0  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
               +-1d.1  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
               +-1d.2  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
               +-1d.3  Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
               +-1d.7  Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
               +-1e.0-[01]--+-03.0  ATI Technologies Inc ES1000
               |            +-04.0  Compaq Computer Corporation Integrated Lights Out Controller
               |            +-04.2  Compaq Computer Corporation Integrated Lights Out Processor
               |            +-04.4  Hewlett-Packard Company Proliant iLO2/iLO3 virtual USB controller
               |            \-04.6  Hewlett-Packard Company Proliant iLO2 virtual UART
               +-1f.0  Intel Corporation 82801JIB (ICH10) LPC Interface Controller
               \-1f.2  Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1
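The description estimates growth from occasional top readings ("maybe 1GB per few days?"). A minimal sketch of how the resident set size could be sampled more systematically on Linux is shown here. It is only an illustration: the class name RssSample is hypothetical, it assumes Java 7 or later for java.nio.file (the reported JVM is 1.6), and it reads the VmRSS field from /proc/<pid>/status. Heap figures for comparison would still have to come from jmap or a similar tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RssSample {

    /** Returns the VmRSS value in kB from /proc/<pid>/status lines, or -1 if not present. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:    123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the agent's PID, or omit it to sample this JVM itself.
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(
                Paths.get("/proc", pid, "status"), StandardCharsets.UTF_8);
        long rssKb = parseVmRssKb(lines);
        // One timestamped sample per run; schedule it (e.g. from cron) to build a trend.
        System.out.println(System.currentTimeMillis() + " pid=" + pid + " VmRSS_kB=" + rssKb);
    }
}
```

Running it at a fixed interval against the agent's PID and comparing the samples with the jmap heap figures would show whether the growth is outside the Java heap.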
Elias, this indeed looks like it might be a problem with augeas, as mentioned on the forum pages. Could you check whether the augeas library is installed system-wide and what version it is? On RHEL/Fedora, do something like:

    yum info augeas-libs
    Installed Packages
    Name       : augeas-libs
    Arch       : x86_64
    Version    : 0.7.4
    Release    : 1.el5
    Size       : 950 k
    Repo       : installed
    Summary    : Libraries for augeas
    URL        : http://augeas.net/
    License    : LGPLv2+
    Description: The libraries for augeas.

    Available Packages
    Name       : augeas-libs
    Arch       : i386
    Version    : 0.9.0
    Release    : 1.el5
    Size       : 338 k
    Repo       : epel-5
    Summary    : Libraries for augeas
    URL        : http://augeas.net/
    License    : LGPLv2+
    Description: The libraries for augeas.

    Name       : augeas-libs
    Arch       : x86_64
    Version    : 0.9.0
    Release    : 1.el5
    Size       : 340 k
    Repo       : epel-5
    Summary    : Libraries for augeas
    URL        : http://augeas.net/
    License    : LGPLv2+
    Description: The libraries for augeas.

Yes, looks so... So maybe I need to update? What's the version to use?
OK, I hope that explains it. While the RHQ agent bundles the augeas libraries (version 0.9.0), it uses the system-wide augeas in preference if one is available. augeas 0.7.4 has a known memory leak (I think), so I'd recommend either a) upgrading the system-wide augeas to 0.9.0 or b) (if that is not possible) starting the agent with a modified LD_LIBRARY_PATH that points to the bundled augeas location before the standard paths:

    LD_LIBRARY_PATH=$RHQ_AGENT_HOME/lib/augeas/lib64:"$LD_LIBRARY_PATH" bin/rhq-agent.sh
I'll keep an eye on things and verify that it isn't leaking memory.
I am still seeing a leak, even though Augeas was updated to 0.9.0 system-wide.

Contrary to what you claim, the wrapper seems to always boot with the shipped version of Augeas, as it prepends to LD_LIBRARY_PATH no matter what...

    # ----------------------------------------------------------------------
    # Prepare LD_LIBRARY_PATH to include libraries shipped with the agent
    # ----------------------------------------------------------------------
    if [ "x$_LINUX" != "x" ]; then
       if [ "x$LD_LIBRARY_PATH" = "x" ]; then
          if [ "x$_X86_64" != "x" ]; then
             LD_LIBRARY_PATH="${RHQ_AGENT_HOME}/lib/augeas/lib64"
          else
             LD_LIBRARY_PATH="${RHQ_AGENT_HOME}/lib/augeas/lib"
          fi
       else
          if [ "x$_X86_64" != "x" ]; then
             LD_LIBRARY_PATH="${RHQ_AGENT_HOME}/lib/augeas/lib64:${LD_LIBRARY_PATH}"
          else
             LD_LIBRARY_PATH="${RHQ_AGENT_HOME}/lib/augeas/lib:${LD_LIBRARY_PATH}"
          fi
       fi
       export LD_LIBRARY_PATH
       debug_msg "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
    fi

In any case, what I have system-wide (0.9.0) matches the agent's included copy.

    $ rpm --query --file /usr/lib/libaugeas.so.0.14.0
    augeas-libs-0.9.0-1.el5
    $ diff /usr/lib/libaugeas.so.0.14.0 /usr/local/rhq-agent/lib/augeas/lib/libaugeas.so ; echo $?
    0

I'm seeing growth of about 20GB of memory over 14 days. I'm guessing there must be a leak someplace else. What can I look at?
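Since the question of which libaugeas copy the agent actually loads came up, one way to check it on Linux is to look at the shared objects mapped into the running process. The following is a minimal sketch, not part of RHQ: the class name MappedLibs is hypothetical, it assumes Java 7 or later for java.nio.file, and it simply filters the pathname column of /proc/<pid>/maps.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MappedLibs {

    /** Collects the distinct pathnames in /proc/<pid>/maps lines that contain the given name. */
    static Set<String> mappedPaths(List<String> mapsLines, String name) {
        Set<String> paths = new TreeSet<String>();
        for (String line : mapsLines) {
            // Fields: address perms offset dev inode [pathname]
            String[] fields = line.trim().split("\\s+", 6);
            if (fields.length == 6 && fields[5].contains(name)) {
                paths.add(fields[5]);
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the agent's PID; "self" inspects this JVM instead.
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(
                Paths.get("/proc", pid, "maps"), StandardCharsets.UTF_8);
        for (String path : mappedPaths(lines, "libaugeas")) {
            System.out.println(path);
        }
    }
}
```

If the printed path is under the agent's lib/augeas directory, the bundled copy is in use; a path under /usr/lib or /usr/lib64 would indicate the system-wide one. No output means the library is not mapped yet, for example because no augeas-based plugin has started.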
OK, this is bad. I'm not sure what's leaking and where, so I would like to ask you to do the following to get to the bottom of this:

1) Check out https://github.com/metlos/rhq-project-samples and build the augeas leak detector that I put together, which is located in rhq-project-samples/agent/debug-tools/augeas-leak-detector
2) Follow the directions in the README.txt there to run the agent with the detector.
3) Grab the agent log for the time the agent was running with the detector, plus the augeas-leak-detection-results.txt which should be generated in RHQ_AGENT_HOME after you stop the instrumented agent, and attach both files to this BZ.
Created attachment 551692
result of leak detection

I ran the agent for about 4 hours. I still see the memory usage increase, but nothing in the report seems suspicious. Could it be something in another native library?
Well, there are a couple of leaked references, but I agree with you that those probably don't contribute to the continuous increase in memory usage. These references are created on resource component startup and are stored in instance fields of the components. The components live for the lifetime of the agent, so they shouldn't keep adding to the leak unless there is some more "hidden" leak inside the augeas library itself (which has not been reported as of yet).

The only other native library we use is sigar. It has been stable for a long time, so I doubt the leak is there, but of course I can't rule that out. You can turn off the usage of sigar by disabling the native system in the agent. On the agent prompt:

    setconfig rhq.agent.disable-native-system=true

or, if you have your agent inventoried, you can check the "Disable Native System" property in the Connection Settings of the agent resource.

Unfortunately, I was still not able to reproduce this locally, so I will have to ask you for some more testing:

1) Does disabling the agent's native system help?
2) How does the leak change when you disable the augeas-based plugins (Aliases, Apache HTTP Server, Cobbler, Cron, Hosts, Postfix, Samba, Sudo Access)?
3) If the leak disappears when you disable the above, you should enable them one by one to see which of them leak.
FYI, the cause for leaks detected by the augeas leak detector is most probably bug 773031. At the same time I think we're dealing with something more than just that here.
1) Native system disabled: Memory still leaked.
2) I disabled those but it still leaked. Once I ALSO disabled "OpenSSH,GRUB,Iptables" the memory use seems to be stable, but I will check overnight. (I suspected there were more than what you listed, so I grepped around for Augeas in the source tree.)

So, I suspect one of those three plugins, though there could be more than one that leaks. What seems likeliest? I'll report tomorrow.
It seems that GRUB is suspect, and perhaps others, but at least enabling GRUB has created a leak. (It is hard to observe.)

A few questions:

1) How often is loadResourceConfiguration() called? I don't know the call flow, but it appears "augeas" is never closed:

    public class GrubComponent implements ResourceComponent, ConfigurationFacet {

        public Configuration loadResourceConfiguration() throws Exception {
            Configuration pluginConfiguration = resourceContext.getPluginConfiguration();
            return loadResourceConfiguration(pluginConfiguration);
        }

        public Configuration loadResourceConfiguration(Configuration pluginConfiguration) throws Exception {
            // Gather data necessary to create the Augeas hook
            ...
            Augeas augeas = new Augeas(rootPath, lensesPath, Augeas.NONE);

The same issue (no "close") appears in sshd/src/main/java/org/rhq/plugins/sshd/OpenSSHDComponent.java also.

    rhq/modules/plugins (plugingen) $ git grep -E "new +Augeas\\("
    apache/src/main/java/org/rhq/plugins/apache/ApacheServerComponent.java: ag = new Augeas();
    apache/src/test/java/org/rhq/plugins/apache/ApacheAugeasTest.java: Augeas ag = new Augeas();
    augeas/src/main/java/org/rhq/augeas/AugeasProxy.java: augeas = new Augeas(config.getRootPath(), config.getLoadPath(), config.getMode());
    augeas/src/main/java/org/rhq/plugins/augeas/AugeasConfigurationComponent.java: augeas = new Augeas(this.augeasRootPath, augeasLoadPath, Augeas.NO_MODL_AUTOLOAD);
    augeas/src/main/java/org/rhq/plugins/augeas/helper/AugeasRawConfigHelper.java: Augeas aug = new Augeas(rootPath, loadPath, Augeas.NO_MODL_AUTOLOAD);
    grub/src/main/java/org/rhq/plugins/grub/GrubComponent.java: Augeas augeas = new Augeas(rootPath, lensesPath, Augeas.NONE);
    sshd/src/main/java/org/rhq/plugins/sshd/OpenSSHDComponent.java: Augeas augeas = new Augeas(rootPath, lensesPath, Augeas.NONE);

I have a proposed patch, but I need to test it and get approval before submission.
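The concern raised here is a handle to native memory that is created on every configuration load and never released. The following is a minimal, self-contained sketch of that pattern and of the usual remedy (releasing the handle in a finally block). It is not the actual patch: NativeHandle is a hypothetical stand-in for a native-backed object such as Augeas, and the sketch assumes the real binding exposes a close() method, as the comment above implies. The counter only simulates native allocations.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CloseHandleSketch {

    /** Hypothetical stand-in for a native-backed handle such as Augeas. */
    static final class NativeHandle {
        // Counts handles that were opened but not yet closed (simulated native memory).
        static final AtomicInteger OPEN = new AtomicInteger();
        private boolean closed;

        NativeHandle() {
            OPEN.incrementAndGet();
        }

        String get(String path) {
            if (closed) {
                throw new IllegalStateException("handle is closed");
            }
            return "value-of:" + path;
        }

        void close() {
            if (!closed) {
                closed = true;
                OPEN.decrementAndGet();
            }
        }
    }

    /** Leaky variant: opens a handle per call and never releases it. */
    static String loadLeaky(String path) {
        NativeHandle handle = new NativeHandle();
        return handle.get(path);
    }

    /** Fixed variant: the handle is released even if the read throws. */
    static String loadAndClose(String path) {
        NativeHandle handle = new NativeHandle();
        try {
            return handle.get(path);
        } finally {
            handle.close();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            loadLeaky("/files/etc/grub.conf");
        }
        System.out.println("open after leaky calls: " + NativeHandle.OPEN.get());

        NativeHandle.OPEN.set(0);
        for (int i = 0; i < 1000; i++) {
            loadAndClose("/files/etc/grub.conf");
        }
        System.out.println("open after closing calls: " + NativeHandle.OPEN.get());
    }
}
```

try/finally is used rather than try-with-resources because the JVM reported in this bug is Java 6. Because the Java-side wrapper object is tiny, the heap stays small and the garbage collector has little reason to run, so relying on finalization to free the native side may take a long time.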
My plugin fixes seem to have worked. Memory usage grows for about 24 hours and then remains steady at about 300MB, with perhaps slow growth over time, but it is much better than before. So the fixes required are for the 'grub' and 'sshd' plugins only.
Will there be a JON 3.0 patch for this? If so, is there an ETA?
If the issue affects the grub and sshd plug-ins, there will be no patch for JON 3.0 as these plug-ins are not included with the product.
Created attachment 561689
Patch based on commit 1174064a0372d31199d75939b64d36eaa2232d02

Approved by employer
Thanks, Elias! I only changed one minor thing: in the SSHD plugin I changed the getConfig() method to throw an Exception, just to minimize the diff.

master: http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=ff54dedcfdefb994c8390d6f68393134b3842e8e

    Author: Elias Ross <elias_ross>
    Date:   Thu Jan 12 21:51:51 2012 -0800

        [BZ 766959] fix possible memory leak in plugins
Bulk closing of BZs that have no target version set, but which are ON_QA for more than a year and thus are in production for a long time.