After the plugin container (PC) is restarted, the RootPluginClassLoader and all its descendant classloaders from the previous PC instance are not cleaned up. This causes permgen to leak and, if the PC is restarted enough times, the accumulated leaked classloaders will eventually cause a permgen OutOfMemoryError. The classloaders consume significantly more space if one or more JBossAS servers are being managed by the PC, because the plugins for those servers define lots of resource types and have lots of dependencies, so lots of classes get loaded into the corresponding plugin classloaders. The underlying cause is the following bug in the JDK, which prevents classloaders that are no longer used from being cleaned up: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5041014 The bug has a comment that says this: "Please see bug report 4167874. A new method URLClassLoader.close() is being added in jdk 7. It should be integrated in the next few weeks." so it sounds unlikely we will ever see a fix for this in JDK6. There are lots of blogs on this topic; here are just a couple, one written by Mazz: http://management-platform.blogspot.com/2009/01/classloaders-keeping-jar-files-open.html http://my.opera.com/karmazilla/blog/2007/03/13/good-riddance-permgen-outofmemoryerror
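
For reference, the URLClassLoader.close() method mentioned in that bug comment can be used as in the minimal sketch below. It assumes Java 7 or later, so it would not have helped on JDK6. The class and helper names are made up for illustration, and the empty URL array stands in for a plugin's jar URLs. Note that close() releases the jar files the loader opened; it does not by itself make the loader unreachable if something else still references it.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical helper, not part of the plugin container code.
final class CloseLoaderSketch {

    /** Closes the loader and reports whether close() completed without an IOException. */
    static boolean closeQuietly(URLClassLoader loader) {
        try {
            loader.close(); // Java 7+: releases any jar files the loader opened
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Empty URL array: a placeholder for a plugin's jar URLs.
        URLClassLoader pluginLoader = new URLClassLoader(new URL[0]);
        System.out.println("closed cleanly: " + closeQuietly(pluginLoader));
    }
}
```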
The following article gives a nice tutorial on using Eclipse Memory Analyzer (MAT) to analyze permgen out-of-memory errors and leaks from a heap dump file: http://dev.eclipse.org/blogs/memoryanalyzer/2008/05/17/the-unknown-generation-perm/ Note: if you notice a bunch of sun.reflect.DelegatingClassLoaders while analyzing the heap dump, this is what they are: "Hotspot JVM generates classes on-the-fly to speed up Java reflective calls (Constructor.newInstance, Method.invoke etc.). We want to find out how many such classes were generated. Piece of implementation detail before proceeding: All reflection speed-up classes are loaded by classloaders of type sun.reflect.DelegatingClassLoader. Each such loader only loads a single class." So they are probably not the culprits for the permgen leak.
Created attachment 432433: classloaders-1-plugin-10-restarts.txt. Taking hprof profiler dumps of the VM, I do see many tens (if not over a hundred) of sun.reflect.DelegatingClassLoader instances. As ips said, that's probably not the source of the problem: each one of those has only a single class definition loaded inside it. I ran a test where I disabled all agent plugins but the platform plugin and then ran "plugins update" 10 times. I took an hprof dump and examined it via Eclipse MAT. Looking at the classloaders, I see 132 of those sun.reflect.DelegatingClassLoaders. I also see a single RootPluginClassLoader and a single PluginClassLoader, which is to be expected with a single plugin deployed. See attached "classloaders-1-plugin-10-restarts.txt". So this tells me the core plugin container is ok and is cleaning up after itself: after restarting the PC 10 times I still have only one plugin classloader. I'm going to look at what happens with other plugins. I did some test runs with those and I've seen bunches of EMS classloaders, so either we are not cleaning those up or EMS isn't cleaning up after itself. I'm going to see if there is something we can do to help clean up the EMS classloaders (if indeed that is a source of leakage).
I noticed the EMS child-first classloader is referenced by javax.security.auth.login.Configuration. See this Sun bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6727821
This might involve the JBossConfiguration object that is in EMS. Gonna investigate this further, but I sense this is the area where the "bad things" might be caused.
After the agent is fully shut down (PC is shut down, comm is down, just the agent input prompt thread is running), if I invoke javax.security.auth.login.Configuration.getConfiguration(), I see EMS classes in there (stored by the jboss plugin and/or the EMS library). That object's data members are:
configuration=org.mc4j.ems.impl.jmx.connection.support.providers.jaas.JBossConfiguration@32d463e5
contextClassLoader=org.mc4j.ems.connection.support.classloader.ChildFirstClassloader@77a82f1
Remember, the entire PC is down, so all of our classloaders should be freed/unused. However, we store references to these EMS classes in a JRE javax static location. This is why I think we are leaking. Just a theory for now, but we definitely need to clean out these references regardless.
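
The check described above can be sketched as follows. This is only an illustration: JaasConfigurationProbe and DummyConfiguration are hypothetical names, and DummyConfiguration merely stands in for something like the EMS JBossConfiguration so the example is self-contained. The probe reports which class is installed in the JAAS static and which classloader defined it.

```java
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Hypothetical probe, not part of the agent code.
final class JaasConfigurationProbe {

    /** Stand-in for a Configuration installed by plugin code (assumption for this sketch). */
    static final class DummyConfiguration extends Configuration {
        @Override
        public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
            return null; // no login modules configured
        }
    }

    /** Describes the installed JAAS Configuration and the loader that defined its class. */
    static String describe() {
        try {
            Configuration current = Configuration.getConfiguration();
            return current.getClass().getName() + " loaded by " + current.getClass().getClassLoader();
        } catch (Throwable t) {
            // getConfiguration() can fail when no login configuration is available
            return "no configuration available: " + t;
        }
    }

    public static void main(String[] args) {
        Configuration.setConfiguration(new DummyConfiguration());
        System.out.println(describe());
    }
}
```

If the reported classloader is one that belongs to a plugin, that static reference keeps the plugin's classes reachable even after the PC is down.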
I added this to the code. I'm still seeing perm gen grow, but I verified in my debugger that the rhq/ems references in that javax Configuration static are no longer there. Still searching for other areas where we are similarly leaking. Notice I also do a LogFactory.releaseAll for good measure here:
@@ -116,2 +118,7 @@ public class PluginContainer implements ContainerService {
     private PluginContainer() {
+        // for why we need to do this, see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6727821
+        try {
+            Configuration.getConfiguration();
+        } catch (Throwable t) {
+        }
     }
@@ -338,2 +345,11 @@ public class PluginContainer implements ContainerService {
         Introspector.flushCaches();
+
+        // for why we need to do this, see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6727821
+        try {
+            Configuration.setConfiguration(null);
+        } catch (Throwable t) {
+        }
+
+        LogFactory.releaseAll();
+        System.gc();
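
As a standalone illustration of the shutdown half of that change, here is a minimal sketch using only JDK classes. StaticReferenceCleanup and PluginSideConfiguration are hypothetical names, and the LogFactory.releaseAll() call from the real change is left out because it needs commons-logging on the classpath.

```java
import java.beans.Introspector;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Hypothetical sketch, not the actual PluginContainer code.
final class StaticReferenceCleanup {

    /** Stand-in for a Configuration that plugin code left in the JAAS static. */
    static final class PluginSideConfiguration extends Configuration {
        @Override
        public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
            return null;
        }
    }

    /** Drops JDK-held static references; returns true if every step completed. */
    static boolean releaseJdkStatics() {
        boolean ok = true;
        Introspector.flushCaches(); // discard cached BeanInfo for previously introspected classes
        try {
            Configuration.setConfiguration(null); // forget the installed JAAS Configuration
        } catch (Throwable t) {
            ok = false;
        }
        return ok;
    }

    /** True if the given instance is still the one held by the JAAS static. */
    static boolean isInstalled(Configuration candidate) {
        try {
            return Configuration.getConfiguration() == candidate;
        } catch (Throwable t) {
            return false; // no default configuration could be created, so ours is not installed
        }
    }

    public static void main(String[] args) {
        Configuration leaked = new PluginSideConfiguration();
        Configuration.setConfiguration(leaked);
        System.out.println("installed before cleanup: " + isInstalled(leaked));
        releaseJdkStatics();
        System.out.println("installed after cleanup: " + isInstalled(leaked));
    }
}
```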
Another place to fix is inside EMS. With this change, I see perm gen usage become much more stable, with more things getting freed up.
Index: src/ems-impl/org/mc4j/ems/impl/jmx/connection/DConnection.java
===================================================================
--- src/ems-impl/org/mc4j/ems/impl/jmx/connection/DConnection.java (revision 616)
+++ src/ems-impl/org/mc4j/ems/impl/jmx/connection/DConnection.java (working copy)
@@ -102,6 +102,7 @@
         // tracker.stopTracker();
         connectionProvider.disconnect();
+        LogFactory.release(connectionProvider.getClass().getClassLoader());
     }
Getting further along. I will be committing a new EMS change (1.2.13) that will provide a public API (ClassLoaderFactory.clearCaches) so I can clear its caches of jar files, temp files and, most importantly, classloaders. I ran a test and that seems to help even further. However, I do notice some leakage if I just restart the PC (via "plugins update", for example). I see URLClassLoader instances grow unbounded, among other things. However, if I shut down the full agent core internals (via "shutdown", which also kills the comm layer and the agent management MBean), most of those leaked instances free up (though not entirely). So there are still some things we can do to further fix this. At least with my current fixes, the agent is able to have a much more stable perm gen.
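
To make the idea of a cache-clearing hook concrete, here is a minimal sketch of a classloader cache with a clearCaches method. This is not the EMS implementation: LoaderCacheSketch, loaderFor and size are hypothetical names, the cache key is an arbitrary string, and URLClassLoader.close() assumes Java 7 or later.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a loader cache; not the EMS ClassLoaderFactory.
final class LoaderCacheSketch {

    private static final Map<String, URLClassLoader> CACHE = new ConcurrentHashMap<>();

    /** Returns the cached loader for the key, creating it on first use. */
    static URLClassLoader loaderFor(String key, URL[] urls) {
        return CACHE.computeIfAbsent(key, k -> new URLClassLoader(urls));
    }

    /** Removes and closes every cached loader; returns how many were released. */
    static int clearCaches() {
        int released = 0;
        // ConcurrentHashMap key iteration tolerates removal during the loop
        for (String key : CACHE.keySet()) {
            URLClassLoader loader = CACHE.remove(key);
            if (loader != null) {
                try {
                    loader.close(); // release jar file handles held by the loader
                } catch (IOException ignored) {
                    // best effort: the cache entry is already gone
                }
                released++;
            }
        }
        return released;
    }

    static int size() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        loaderFor("connection-a", new URL[0]);
        loaderFor("connection-b", new URL[0]);
        System.out.println("released: " + clearCaches() + ", remaining: " + size());
    }
}
```

The point of such a hook is that a static cache inside a library is another place where classloader references survive a PC restart unless the caller can explicitly empty it.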
I checked in some minor tweaks to PluginContainer. I don't think they change much with respect to the perm gen issues, but they might. In addition, we may want to consider adding these VM options to the agent, since I'm reading that they may help: -XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
I'm gonna close out this BZ. Most of the perm gen leaks are fixed. There still seems to be some minor perm gen leakage that occurs when restarting the internals (using either "shutdown/start", "plugins update" or "pc stop/pc start", which all restart the PC; the first one also shuts down the rest of the core agent internals). Those minor leaks notwithstanding, I think the bulk of the problems with perm gen are fixed. Here are the master commits (sha) that were involved:
plugin container change: 4ae53d12b5b30aafe5362ad5435b0fbe548962b6 69c6da3af5ef3a12988836d5a11b5b09e459075c b13693f2a55c24422be11d38bf23361ed1db3950
jmx-plugin change: 774ed6788f6d14ec7b14f169b2181fbcf20e5302
jboss-as plugin change: 724682f8e03a5f804774559a97a70298dc402a43
There is nothing to test really, since this is all code changes. You'd actually test by hooking up JProfiler to the agent and confirming that perm gen is more stable. You'd have to do lots of "plugins update" commands after monitoring one or more JBossAS instances (the RHQ Server is a valid one to test with).
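
As a lightweight complement to a profiler, non-heap memory usage can also be read from inside the JVM through the standard java.lang.management API. The sketch below is only an illustration (NonHeapUsageSketch is a made-up name). On the HotSpot JVMs of that era the permanent generation typically shows up as a pool whose name contains "Perm Gen"; on Java 8 and later it is replaced by a "Metaspace" pool, so the exact pool names are an assumption that depends on the JVM in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for watching non-heap pools between "plugins update" runs.
final class NonHeapUsageSketch {

    /** Returns one line per valid non-heap memory pool with its current usage. */
    static List<String> nonHeapPoolUsage() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP || !pool.isValid()) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage != null) {
                lines.add(pool.getName() + ": " + (usage.getUsed() / 1024) + " KB used");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : nonHeapPoolUsage()) {
            System.out.println(line);
        }
    }
}
```

Sampling these numbers after each restart of the PC gives a rough picture of whether class metadata usage levels off or keeps climbing.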
moving to verified.
Mass-closure of verified bugs against JON.
This wasn't fixed in JON2.4
Moving back to Mazz's last state of verified
Bookkeeping - closing bug - fixed in recent release.