Bug 1051245
| Summary: | java.lang.ref.Finalizer leak when upgrading from 1.62 to 1.66 | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Stuart D Gathman <stuart> | ||||||||||||||||
| Component: | java-1.6.0-openjdk | Assignee: | Andrew Dinn <adinn> | ||||||||||||||||
| Status: | CLOSED ERRATA | QA Contact: | Lukáš Zachar <lzachar> | ||||||||||||||||
| Severity: | unspecified | Docs Contact: | |||||||||||||||||
| Priority: | high | ||||||||||||||||||
| Version: | 6.5 | CC: | ahughes, aph, dbhole, iweiss, jkurik, jvanek, lakagwu, lzachar, rhatlapa, rsvoboda, stuart | ||||||||||||||||
| Target Milestone: | rc | ||||||||||||||||||
| Target Release: | --- | ||||||||||||||||||
| Hardware: | Unspecified | ||||||||||||||||||
| OS: | Unspecified | ||||||||||||||||||
| Whiteboard: | |||||||||||||||||||
| Fixed In Version: | java-1.6.0-openjdk-1.6.0.0-5.1.13.3.el6_5 | Doc Type: | Bug Fix | ||||||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||||||
| Clone Of: | |||||||||||||||||||
| : | 1070387 1075659 (view as bug list) | Environment: | |||||||||||||||||
| Last Closed: | 2014-06-16 15:47:16 UTC | Type: | Bug | ||||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
| Embargoed: | |||||||||||||||||||
| Bug Depends On: | |||||||||||||||||||
| Bug Blocks: | 1070387, 1075659 | ||||||||||||||||||
| Attachments: |
|
||||||||||||||||||
Created attachment 847884 [details]
Sample OutOfMemory
The usual source of crash.
The machine on which this is running appears to be old (32-bit Pentium II). The message indicates that the 32-bit size limit (4GB) was hit. 1.66 contains a newer version of HotSpot that likely changed object sizes and is causing the 4GB memory limit to be hit. Is this reproducible on a newer 64-bit machine? Adding aph to cc: for reference. Created attachment 847901 [details]
An actual OutOfMemory exception
Java level OutOfMemory
The 32-bit memory limit is hit because of the rapid memory growth. On working JVMs (last known good release is 1.62), the application never uses more than 1G. I do plan to try it on a 64-bit VM, as JVM engineers probably don't like rehashing old CPU architectures - but this requires migration to run in production. The current production application system runs in a 32-bit virtual machine under an EL6.5 host system. (In reply to Deepak Bhole from comment #3) > The machine on which this is running appears to be old (32-bit Pentium II). > The message indicates that the 32-bit size limit (4GB) was hit. 1.66 > contains a newer version of HotSpot that likely changed object sizes and is > causing the 4GB memory limit to be hit. > > Is this reproducible on a newer 64-bit machine? Does your theory explain the rapid heap growth as well? Does "object size" refer to compiled code cache, or to larger headers on Java objects? Hmm, so perhaps reserving additional code cache space could fix it in 32-bit mode? I will work on migrating to a 64-bit VM. Created attachment 847937 [details]
Normal memory usage (under 1.62)
My concern is that the rapid memory *growth* is the real problem. Using more memory with the new hotspot is understandable, but it should still stabilize at a higher number than the previous version. Instead, the memory use keeps growing until it hits the limit. I'm worried that in 64-bit mode, it will simply eat more memory before crashing. Fair enough -- I consulted with one of our engineers and they agree that there could be a potential issue here. Sorry for jumping the gun earlier. Can you provide detailed steps on how to reproduce this, by any chance? I know in the report you said to just start adempiere but no one on our team is familiar with it. How to reproduce, aye there's the rub. It is not just starting Adempiere (http://adempiere.org - we are on the rather old version 352a), but running it under load. I, of course, tested things with just a single user after updating. The web user interface could be exercised with a load tester, and I could try to do that - but I have no experience in load testing. Could you recommend an open source product I could use here? If I can reproduce without turning the system loose on the users, I can package it up so you can also. My instinct is that while the new hotspot may have a larger stable memory footprint, the rapid memory leak is in the classpath java library. Many of the clean java OutOfMemory exceptions are in StringBuilder allocating a rather larger that reasonable char[]. I'm wondering if there is a clean way to use the new classpath library with the old JVM, or vice versa? Or are the two tied together? One thing I can try is updating to 1.65 - that is prior to the new hotspot (correct?), but perhaps already has the hypothetical classpath library leak bug. Andrew, any suggests re: comment #11? I have 1.66 on my devel VM, and the memory growth happens even with just a single user, although much slower than on production. If I set max heap to 300M, it takes a day or so instead of an hour or so to hit the limit. But that is sufficient for me to determine whether a JVM version has the problem or not without bothering production. That also means that heavy load isn't necessary for failure - there is just some operation that leaks memory. I will see if I can enable JMDI on the JVM for devel, and see if anything obviously out of place is hogging the heap. Sahi web automation and test tool seems to be popular with Adempiere developers, so I can see my way to get this reproducible in a lab. Due to many responsibilities, it may take some time. Hi Stuart, no problem. We have not seen this issue anywhere else so it likely has only a few triggers. Please let us know when you come across an easy way to reproduce and we'd be happy to take a look! I set the JVM to dump the heap on OutOfMemory, and got a heap profile. I can make the hprof available after checking with the boss (600M - with no passwords but private accounting data). I looked at it with eclipse memory analyzer, and 49% of the memory (the amount over the normal stable size) is filled with: 257,837 instances of "java.lang.ref.Finalizer", loaded by "<system class loader>" occupy 298,262,288 (49.26%) bytes. These instances are referenced from one instance of "java.util.Hashtable$Entry[]", loaded by "<system class loader>" I'm trying to see if there is a way to export reports... Created attachment 856883 [details]
Memory analyzer leak suspects report
Hi Stuart, I took a look at the heap report and the large number of Finalizer instances makes me suspect that the FinalizerThread instance may be wedged. Finalizers are only ever created by Finalizer.register() which gets called by the VM after GC. So, these instances should not be accessible to any other thread (short of code with suitable privilege accessing them via reflection). The FinalizerThread thread does not appear to be dead (i.e. it has not exited on an exception) -- at least it appears in the thread listing. If it is actually running then it must be executing finalizers very slowly. Alternatively, maybe it is deadlocked under a specific finalize() call. This sounds like the most likely explanation -- a deadlock could easily have manifested in the upgrade because of changes in the order of how things get finalized (e.g. subclasses of java.awt.Graphics have had their dispose() implementation changed and dispose gets called by java.awt.Graphics.finalize()). It would help if you could provide some thread dumps (complete with Java thread stack backtraces) as the application approaches full memory or at OOM. These should show which finalize method the FinalizerThread is executing which might help pin down the problem. The stack backtraces might also show if there is a deadlock. You can obtain a thread dump by sending SIGHUP to the JVM process. Created attachment 862524 [details]
jstack -l output at 100% committed, but not yet out of memory
SIGHUP shuts down the JVM. I found the jstack utility, and it seems to work. This thread dump was taken at 100% committed, but before the OutOfMemory exception. I will take another at OutOfMemory.
The Finalizer thread doesn't seem to be executing a finalizer - just waiting on a queue.
Hi Stuart, Thanks for the stack dump which confirms a similar state found from another report of this problem (which included a heap dump). It appears that the FinalizerThread is sitting on an empty ReferenceQueue yet there are many Finalizer instances in the system pending finalization but not yet transferred into the queue by the garbage collector. So, it appears that the GC either is failing to pass them to the FinalizerThread or cannot keep up with the rate of accumulation of finalizeable objects. I checked the changes to the GC reference processor code between the two releases you are using and there are a couple of significant changes which might be responsible. However, they look correct by eyeball. So, I think it is unlikely that the GC is not correctly communicating data to the FinalizerThread. There are also changes in the JDK code which might have changed the rate at which Finalizable objects are created and/or released and made available for transfer. We would very much like to investigate this further but we cannot really do so without a program which reproduces the problem. Could you provide us a version of your code (suitably stripped to remove confidential data) which will manifest this error? Thanks for your help. I can think of two approaches to get what you need: 1) See if the stock adempiere352a distribution with sample data will reproduce the problem. Even if that works, it takes a while for the Finalizers to build up, so it isn't the most convenient method. 2) Find out what class of objects the Finalizers are pointing at, and see if I can make a fail quick program. Most of the Finalizers are pointing at org.postgresql.jdbc4.Jdcb4PreparedStatement instances. I may be able to create a synthetic reproducer after all. Most, as in 99% org.postgresql.jdbc4 makes extensive use of PhantomReference: -------------------- When a statement is successfully parsed, registerParsedQuery is called. This creates a PhantomReference referring to the "owner" of the statement (the originating Query object) and inserts that reference as a key in parsedQueryMap. The values of parsedQueryMap are the corresponding allocated statement names. The originating Query object also holds a reference to the PhantomReference. When the owning Query object is closed, it enqueues and clears the associated PhantomReference. If the owning Query object becomes unreachable (see java.lang.ref javadoc) before being closed, the corresponding PhantomReference is enqueued on parsedQueryCleanupQueue. In the Sun JVM, phantom references are only enqueued when a GC occurs, so this is not necessarily prompt but should eventually happen. Periodically (currently, just before query execution), the parsedQueryCleanupQueue is polled. For each enqueued PhantomReference we find, we remove the corresponding entry from parsedQueryMap, obtaining the name of the underlying statement in the process. Then we send a message to the backend to deallocate that statement. -------------------- *** notices that Adempiere wraps the java.sql.PreparedStatement in its own CPreparedStatement class - which has a finalizer that calls close() on the PreparedStatement when invoked, and speculates about interactions between finalize() and PhantomReference. "In the Sun JVM, phantom references are only enqueued when a GC occurs, so this is not necessarily prompt but should eventually happen."
"should eventually" is not enough of a guarantee. You can run a full GC and still not get back much working space. That's because the lifetime of the referenced objects discovered during the GC extends at least to the next GC. So, if your program builds up a lot of such references you will have a problem.
This can happen either because you release a lot of phantom referenced objects all at once or it can happen gradually over time if your program creates references at a faster rate which than the reference processor and finalizer threads clear them (n your case the threads appear to be blocking so it appears to be the former). When the references queued for processing refer to a large data set then as your application adds new objects to its live working set it can push the total working set over the available heap limit. It loks to me like that is what is happening here.
"speculates about interactions between finalize() and PhantomReference".
I don't think there is any very significant interaction. Both FinalReference and PhantomReference instances get pushed to their respective reference queues by the reference processor thread. I believe the only interaction here is that they combine additively to determine the load on this thread.
Why this has become a problem for your app between releases is hard to say. It may be that changes in the JDK awt and/or memory buffer code (which also use references) have increased the amount of working set held by referenced objects. It may be that the timing with which the finalizer and reference processor threads runs has changed or the timing of GC has changed. It may even be that with this release your app footprint or the JBoss footprint has become slightly larger and pushed the total working set over the heap limit. All of these tip your app over from fitting in heap to hitting the heap limit.
I think the way to work around this problem would be either to find a way minimise your use of the objects which employ PhatomReferences/Finalizers (e.g. can you reuse PreparedStatement instances) or to scale down your app so that it employs a smaller working set. A bigger heap on a 64-bit JVM might possibly help but, of course, above 4GB you might also start seeing noticeable GC pauses ('noticeable' depending up in what latencies your app can tolerate).
At the moment I don't think there is any more advice I can provide. If you can come up with any further details which might allow me to investigate further I will be happy to look at them and, if possible, provide more detailed advice.
The working set is only about 220M - that is the stable heap size used with a single user on the test system where I am investigating this. That is nowhere near any heap limit. For some reason, the new release prevents PreparedStatements from getting collected (or collected fast enough). I think I will be able to create a synthetic app that works in 1.62 and fails in 1.66. (In reply to Stuart D Gathman from comment #28) > The working set is only about 220M - that is the stable heap size used with > a single user on the test system where I am investigating this. That is > nowhere near any heap limit. For some reason, the new release prevents > PreparedStatements from getting collected (or collected fast enough). I > think I will be able to create a synthetic app that works in 1.62 and fails > in 1.66. I have found a simple way to reproduce the problem and I think I have pinned down the VM change which causes it. There does not appear to be a way to work around this behaviour. I also found that the same change does not affect the latest jdk7u tree -- so some later change included after jdk6-b30 must have repaired the problem. I am continuing to investigate to see i) what this fix is and ii) whether we can pull it down into jdk6. Hmm, there may be a workaround you can try here. Further testing of the later 1.66 release indicates that the CMS GC is much better able to handle programs which generate large numbers of finalizers than the (default) parallel scavenge GC which you have been using. So, if you try running your program with -XX:+UseConcMarkSweepGC you might find that it is able to keep up with your application's rate of creation of finalizable objects. However, this may still cause a slowdown in throughput and/or larger GC pause times relative to the behaviour on 1.62 with the parallel scavenge GC. Hi Stuart, I finally pinned this down to a simple error in the jdk native code (yes, nothing was actually wrong in the GCs or any other part of the hotspot VM code). The jdk code which notifies the JVM of the JDK version was failing to zero all of the jdk_version_info data. In particular, a single bit was added in jdk6_b28 allowing the GC to forward the chain of discovered Reference instances linked either through Reference.next or Reference.discovered. Depending on what is on the stack you get one or other behaviour. Unfortunately, the JDK code was always expecting instances to be linked by Reference.next. This simply requires a one line patch to make a call to memset use the correct size. The fix will be in the next JDK6 release. *** Bug 1066981 has been marked as a duplicate of this bug. *** Created attachment 868394 [details]
Reproducer
Reproducer program which stresses the reference processor code by retaining a large Finalizer set while turning it over at a high rate. Run as
java -XX:+PrintGCDetails Finalizetest
The test should finalize about 500 blocks before OOM. The bug in jdk6_b28 reduces it to a few blocks at most.
Wow. I don't know whether to be disappointed that I didn't get to a reproducer first, or grateful that I don't have to. :-) In any case, I am vicariously thrilled to nail the bug! Thanks for the GC algorithm tip, I will try it. I've been fixing some of the more egregious Finalizer creating code in Adempiere: For instance, creating PreparedStatements for the same query in an inner loop! But it is a drop in the bucket of a large project, and the JVM will have to be able to handle the inefficient code. (In reply to Stuart D Gathman from comment #40) > Wow. I don't know whether to be disappointed that I didn't get to a > reproducer first, or grateful that I don't have to. :-) In any case, I am > vicariously thrilled to nail the bug! The hardest bit was looking for a needle in a haystack labelled hotspot only to find that it was actually in the next door haystack labelled jdk. Be very, very grateful :-) > Thanks for the GC algorithm tip, I will try it. I've been fixing some of > the more egregious Finalizer creating code in Adempiere: For instance, > creating PreparedStatements for the same query in an inner loop! But it is > a drop in the bucket of a large project, and the JVM will have to be able to > handle the inefficient code. Well, beware. This worked when I tried it. However, it probably only did so by luck -- the same reference forwarding code is used by all GCs. I suspect that choosing a different GC left different data on the stack and these happened to include the necessary 0 bit to select the old vs new path for discovered Reference linking. pushed http://pkgs.devel.redhat.com/cgit/rpms/java-1.6.0-openjdk/commit/?h=rhel-6.6&id=2c89d7c91d63866bc5630c23eff83c9d92c20db9 and build http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7194088 *** Bug 1025555 has been marked as a duplicate of this bug. *** Restoring needinfo ... Restoring needinfo ... I am not clear on what info is needed from me. I don't believe the patch has hit CentOS yet, and unfortunately, my current employer is not a RedHat subscriber. This is fixed. This was released in 1.13.2: http://blog.fuseyism.com/index.php/2014/03/28/icedtea-1-13-2-for-openjdk-6-sizeof-considered-harmful-released/ It was picked up in RHEL 6.5 when it upgraded to 1.13.3 in April 2014. |
Created attachment 847872 [details] sample SIGSEGV Description of problem: After upgrading to 1.6.0.0-1.66.1.13.0.el6, application system (Adempiere-352a) leaks memory rapidly, lasting only an hour or so before crashing under load. Reverting back to java-1.6.0-openjdk-1.6.0.0-1.62.1.11.11.90.el6_4.i686 restores normal operation. Version-Release number of selected component (if applicable): How reproducible: Always with application load. Might be tricky to reproduce in a lab setting. Steps to Reproduce: 1. Start adempiere 2. Wait 1 or 2 hours for crash 3. Actual results: Usually an OutOfMemory exception in StringWriter. Sometimes a SIGSEGV. Expected results: Normal operation as with 1.62 Additional info: Attached stack traces of a SIGSEGV and an OutOfMemory