Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1051245

Summary:

java.lang.ref.Finalizer leak when upgrading from 1.62 to 1.66

Product:

Red Hat Enterprise Linux 6

Reporter:

Stuart D Gathman <stuart>

Component:

java-1.6.0-openjdk

Assignee:

Andrew Dinn <adinn>

Status:

CLOSED ERRATA

QA Contact:

Lukáš Zachar <lzachar>

Severity:

unspecified

Docs Contact:

Priority:

high

Version:

6.5

CC:

ahughes, aph, dbhole, iweiss, jkurik, jvanek, lakagwu, lzachar, rhatlapa, rsvoboda, stuart

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

java-1.6.0-openjdk-1.6.0.0-5.1.13.3.el6_5

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

1070387 1075659 (view as bug list)

Environment:

Last Closed:

2014-06-16 15:47:16 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1070387, 1075659

Attachments:

Description	Flags
sample SIGSEGV	none
Sample OutOfMemory	none
An actual OutOfMemory exception	none
Normal memory usage (under 1.62)	none
Memory analyzer leak suspects report	none
jstack -l output at 100% committed, but not yet out of memory	none
Reproducer	adinn: review-

Description Stuart D Gathman 2014-01-09 22:12:12 UTC

Created attachment 847872 [details]
sample SIGSEGV

Description of problem:
After upgrading to 1.6.0.0-1.66.1.13.0.el6, application system (Adempiere-352a) leaks memory rapidly, lasting only an hour or so before crashing under load.  Reverting back to java-1.6.0-openjdk-1.6.0.0-1.62.1.11.11.90.el6_4.i686 restores normal operation.  

Version-Release number of selected component (if applicable):


How reproducible:
Always with application load.  Might be tricky to reproduce in a lab setting.

Steps to Reproduce:
1. Start adempiere 
2. Wait 1 or 2 hours for crash
3. 

Actual results:
Usually an OutOfMemory exception in StringWriter.  Sometimes a SIGSEGV. 

Expected results:
Normal operation as with 1.62

Additional info:
Attached stack traces of a SIGSEGV and an OutOfMemory

Comment 1 Stuart D Gathman 2014-01-09 22:14:43 UTC

Created attachment 847884 [details]
Sample OutOfMemory

The usual source of crash.

Comment 3 Deepak Bhole 2014-01-09 22:42:55 UTC

The machine on which this is running appears to be old (32-bit Pentium II). The message indicates that the 32-bit size limit (4GB) was hit. 1.66 contains a newer version of HotSpot that likely changed object sizes and is causing the 4GB memory limit to be hit.

Is this reproducible on a newer 64-bit machine?

Comment 4 Deepak Bhole 2014-01-09 22:43:21 UTC

Adding aph to cc: for reference.

Comment 5 Stuart D Gathman 2014-01-09 22:48:38 UTC

Created attachment 847901 [details]
An actual OutOfMemory exception

Java level OutOfMemory

Comment 6 Stuart D Gathman 2014-01-09 22:53:58 UTC

The 32-bit memory limit is hit because of the rapid memory growth.  On working JVMs (last known good release is 1.62), the application never uses more than 1G.  I do plan to try it on a 64-bit VM, as JVM engineers probably don't like rehashing old CPU architectures - but this requires migration to run in production.  The current production application system runs in a 32-bit virtual machine under an EL6.5 host system.

Comment 7 Stuart D Gathman 2014-01-09 23:01:21 UTC

(In reply to Deepak Bhole from comment #3)
> The machine on which this is running appears to be old (32-bit Pentium II).
> The message indicates that the 32-bit size limit (4GB) was hit. 1.66
> contains a newer version of HotSpot that likely changed object sizes and is
> causing the 4GB memory limit to be hit.
> 
> Is this reproducible on a newer 64-bit machine?

Does your theory explain the rapid heap growth as well?  Does "object size" refer to compiled code cache, or to larger headers on Java objects?

Hmm, so perhaps reserving additional code cache space could fix it in 32-bit mode?  I will work on migrating to a 64-bit VM.

Comment 8 Stuart D Gathman 2014-01-09 23:29:11 UTC

Created attachment 847937 [details]
Normal memory usage (under 1.62)

Comment 9 Stuart D Gathman 2014-01-10 21:18:30 UTC

My concern is that the rapid memory *growth* is the real problem.  Using more memory with the new hotspot is understandable, but it should still stabilize at a higher number than the previous version.  Instead, the memory use keeps growing until it hits the limit.  I'm worried that in 64-bit mode, it will simply eat more memory before crashing.

Comment 10 Deepak Bhole 2014-01-13 15:17:40 UTC

Fair enough -- I consulted with one of our engineers and they agree that there could be a potential issue here. Sorry for jumping the gun earlier. 

Can you provide detailed steps on how to reproduce this, by any chance? I know in the report you said to just start adempiere but no one on our team is familiar with it.

Comment 11 Stuart D Gathman 2014-01-13 17:04:00 UTC

How to reproduce, aye there's the rub.  It is not just starting Adempiere (http://adempiere.org - we are on the rather old version 352a), but running it under load.  I, of course, tested things with just a single user after updating.  

The web user interface could be exercised with a load tester, and I could try to do that - but I have no experience in load testing.  Could you recommend an open source product I could use here?  If I can reproduce without turning the system loose on the users, I can package it up so you can also.

My instinct is that while the new hotspot may have a larger stable memory footprint, the rapid memory leak is in the classpath java library.  Many of the clean java OutOfMemory exceptions are in StringBuilder allocating a rather larger that reasonable char[].  I'm wondering if there is a clean way to use the new classpath library with the old JVM, or vice versa?  Or are the two tied together?

One thing I can try is updating to 1.65 - that is prior to the new hotspot (correct?), but perhaps already has the hypothetical classpath library leak bug.

Comment 12 Deepak Bhole 2014-01-13 22:14:42 UTC

Andrew, any suggests re: comment #11?

Comment 13 Stuart D Gathman 2014-01-15 00:25:00 UTC

I have 1.66 on my devel VM, and the memory growth happens even with just a single user, although much slower than on production.  If I set max heap to 300M, it takes a day or so instead of an hour or so to hit the limit.  But that is sufficient for me to determine whether a JVM version has the problem or not without bothering production.  That also means that heavy load isn't necessary for failure - there is just some operation that leaks memory.  

I will see if I can enable JMDI on the JVM for devel, and see if anything obviously out of place is hogging the heap.

Comment 14 Stuart D Gathman 2014-01-17 03:34:07 UTC

Sahi web automation and test tool seems to be popular with Adempiere developers, so I can see my way to get this reproducible in a lab.  Due to many responsibilities, it may take some time.

Comment 15 Deepak Bhole 2014-01-17 21:29:56 UTC

Hi Stuart, no problem. We have not seen this issue anywhere else so it likely has only a few triggers. Please let us know when you come across an easy way to reproduce and we'd be happy to take a look!

Comment 16 Stuart D Gathman 2014-01-29 00:15:07 UTC

I set the JVM to dump the heap on OutOfMemory, and got a heap profile.  I can make the hprof available after checking with the boss (600M - with no passwords but private accounting data).  I looked at it with eclipse memory analyzer, and 49% of the memory (the amount over the normal stable size) is filled with:

257,837 instances of "java.lang.ref.Finalizer", loaded by "<system class loader>" occupy 298,262,288 (49.26%) bytes.

These instances are referenced from one instance of "java.util.Hashtable$Entry[]", loaded by "<system class loader>"

I'm trying to see if there is a way to export reports...

Comment 17 Stuart D Gathman 2014-01-29 00:46:23 UTC

Created attachment 856883 [details]
Memory analyzer leak suspects report

Comment 18 Andrew Dinn 2014-02-05 17:00:51 UTC

Hi Stuart,

I took a look at the heap report and the large number of Finalizer instances  makes me suspect that the FinalizerThread instance may be wedged. Finalizers are only ever created by Finalizer.register() which gets called by the VM after GC. So, these instances should not be accessible to any other thread (short of code with suitable privilege accessing them via reflection).

The FinalizerThread thread does not appear to be dead (i.e. it has not exited on an exception) -- at least it appears in the thread listing. If it is actually running then it must be executing finalizers very slowly. Alternatively, maybe it is deadlocked under a specific finalize() call. This sounds like the most likely explanation -- a deadlock could easily have manifested in the upgrade because of changes in the order of how things get finalized (e.g. subclasses of java.awt.Graphics have had their dispose() implementation changed and dispose gets called by java.awt.Graphics.finalize()).

It would help if you could provide some thread dumps (complete with Java thread stack backtraces) as the application approaches full memory or at OOM. These should show which finalize method the FinalizerThread is executing which might help pin down the problem. The stack backtraces might also show if there is a deadlock. You can obtain a thread dump by sending SIGHUP to the JVM process.

Comment 21 Stuart D Gathman 2014-02-12 20:01:49 UTC

Created attachment 862524 [details]
jstack -l output at 100% committed, but not yet out of memory

SIGHUP shuts down the JVM.  I found the jstack utility, and it seems to work.  This thread dump was taken at 100% committed, but before the OutOfMemory exception.  I will take another at OutOfMemory.

The Finalizer thread doesn't seem to be executing a finalizer - just waiting on a queue.

Comment 22 Andrew Dinn 2014-02-13 12:15:37 UTC

Hi Stuart,

Thanks for the stack dump which confirms a similar state found from another report of this problem (which included a heap dump). It appears that the FinalizerThread is sitting on an empty ReferenceQueue yet there are many Finalizer instances in the system pending finalization but not yet transferred into the queue by the garbage collector. So, it appears that the GC either  is failing to pass them to the FinalizerThread or cannot keep up with the rate of accumulation of finalizeable objects.

I checked the changes to the GC reference processor code between the two releases you are using and there are a couple of significant changes which might be responsible. However, they look correct by eyeball. So, I think it is unlikely that the GC is not correctly communicating data to the FinalizerThread. There are also changes in the JDK code which might have changed the rate at which Finalizable objects are created and/or released and made available for transfer.

We would very much like to investigate this further but we cannot really do so without a program which reproduces the problem. Could you provide us a version of your code (suitably stripped to remove confidential data) which will manifest this error? Thanks for your help.

Comment 23 Stuart D Gathman 2014-02-14 00:51:32 UTC

I can think of two approaches to get what you need:

1) See if the stock adempiere352a distribution with sample data will reproduce the problem.  Even if that works, it takes a while for the Finalizers to build up, so it isn't the most convenient method.

2) Find out what class of objects the Finalizers are pointing at, and see if I can make a fail quick program.

Comment 24 Stuart D Gathman 2014-02-14 15:17:06 UTC

Most of the Finalizers are pointing at org.postgresql.jdbc4.Jdcb4PreparedStatement instances.  I may be able to create a synthetic reproducer after all.

Comment 25 Stuart D Gathman 2014-02-14 15:19:55 UTC

Most, as in 99%

Comment 26 Stuart D Gathman 2014-02-14 18:17:57 UTC

org.postgresql.jdbc4 makes extensive use of PhantomReference:
--------------------
When a statement is successfully parsed, registerParsedQuery is called.  This creates a PhantomReference referring to the "owner" of the statement (the originating Query object) and inserts that reference as a key in parsedQueryMap. The values of parsedQueryMap are the corresponding allocated statement names. The originating Query object also holds a reference to the PhantomReference.

When the owning Query object is closed, it enqueues and clears the associated PhantomReference.

If the owning Query object becomes unreachable (see java.lang.ref javadoc) before being closed, the corresponding PhantomReference is enqueued on parsedQueryCleanupQueue. In the Sun JVM, phantom references are only enqueued when a GC occurs, so this is not necessarily prompt but should eventually happen.

Periodically (currently, just before query execution), the parsedQueryCleanupQueue is polled. For each enqueued PhantomReference we find, we remove the corresponding entry from parsedQueryMap, obtaining the name of the underlying statement in the process. Then we send a message to the backend to deallocate that statement.
--------------------
*** notices that Adempiere wraps the java.sql.PreparedStatement in its own CPreparedStatement class - which has a finalizer that calls close() on the PreparedStatement when invoked, and speculates about interactions between finalize() and PhantomReference.

Comment 27 Andrew Dinn 2014-02-17 15:42:18 UTC

"In the Sun JVM, phantom references are only enqueued when a GC occurs, so this is not necessarily prompt but should eventually happen."

"should eventually" is not enough of a guarantee. You can run a full GC and still not get back much working space. That's because the lifetime of the referenced objects discovered during the GC extends at least to the next GC. So, if your program builds up a lot of such references you will have a problem.

This can happen either because you release a lot of phantom referenced objects all at once or it can happen gradually over time if your program creates references at a faster rate which than the reference processor and finalizer threads clear them (n your case the threads appear to be blocking so it appears to be the former). When the references queued for processing refer to a large data set then as your application adds new objects to its live working set it can push the total working set over the available heap limit. It loks to me like that is what is happening here.

"speculates about interactions between finalize() and PhantomReference".

I don't think there is any very significant interaction. Both FinalReference and PhantomReference instances get pushed to their respective reference queues by the reference processor thread. I believe the only interaction here is that they combine additively to determine the load on this thread.

Why this has become a problem for your app between releases is hard to say. It may be that changes in the JDK awt and/or memory buffer code (which also use references) have increased the amount of working set held by referenced objects. It may be that the timing with which the finalizer and reference processor threads runs has changed or the timing of GC has changed. It may even be that with this release your app footprint or the JBoss footprint has become slightly larger and pushed the total working set over the heap limit. All of these tip your app over from fitting in heap to hitting the heap limit.

I think the way to work around this problem would be either to find a way minimise your use of the objects which employ PhatomReferences/Finalizers (e.g. can you reuse PreparedStatement instances) or to scale down your app so that it employs a smaller working set. A bigger heap on a 64-bit JVM might possibly help but, of course, above 4GB you might also start seeing noticeable GC pauses ('noticeable' depending up in what latencies your app can tolerate).

At the moment I don't think there is any more advice I can provide. If you can come up with any further details which might allow me to investigate further I will be happy to look at them and, if possible, provide more detailed advice.

Comment 28 Stuart D Gathman 2014-02-17 17:53:38 UTC

The working set is only about 220M - that is the stable heap size used with a single user on the test system where I am investigating this.  That is nowhere near any heap limit.  For some reason, the new release prevents PreparedStatements from getting collected (or collected fast enough).  I think I will be able to create a synthetic app that works in 1.62 and fails in 1.66.

Comment 30 Andrew Dinn 2014-02-25 10:25:11 UTC

(In reply to Stuart D Gathman from comment #28)
> The working set is only about 220M - that is the stable heap size used with
> a single user on the test system where I am investigating this.  That is
> nowhere near any heap limit.  For some reason, the new release prevents
> PreparedStatements from getting collected (or collected fast enough).  I
> think I will be able to create a synthetic app that works in 1.62 and fails
> in 1.66.

I have found a simple way to reproduce the problem and I think I have pinned down the VM change which causes it. There does not appear to be a way to work around this behaviour.

I also found that the same change does not affect the latest jdk7u tree -- so some later change included after jdk6-b30 must have repaired the problem. I am continuing to investigate to see i) what this fix is and ii) whether we can pull it down into jdk6.

Comment 31 Andrew Dinn 2014-02-25 10:34:44 UTC

Hmm, there may be a workaround you can try here. Further testing of the later 1.66 release indicates that the CMS GC is much better able to handle programs which generate large numbers of finalizers than the (default) parallel scavenge GC which you have been using. So, if you try running your program with -XX:+UseConcMarkSweepGC you might find that it is able to keep up with your application's rate of creation of finalizable objects. However, this may still cause a slowdown in throughput and/or larger GC pause times relative to the behaviour on 1.62 with the parallel scavenge GC.

Comment 32 Andrew Dinn 2014-02-26 15:37:05 UTC

Hi Stuart,

I finally pinned this down to a simple error in the jdk native code (yes, nothing was actually wrong in the GCs or any other part of the hotspot VM code). The jdk code which notifies the JVM of the JDK version was failing to zero all of the jdk_version_info data.

In particular, a single bit was added in jdk6_b28 allowing the GC to forward the chain of discovered Reference instances linked either through Reference.next or Reference.discovered. Depending on what is on the stack you get one or other behaviour. Unfortunately, the JDK code was always expecting instances to be linked by Reference.next.

This simply requires a one line patch to make a call to memset use the correct size. The fix will be in the next JDK6 release.

Comment 37 Deepak Bhole 2014-02-26 17:11:19 UTC

*** Bug 1066981 has been marked as a duplicate of this bug. ***

Comment 39 Andrew Dinn 2014-02-27 09:41:30 UTC

Created attachment 868394 [details]
Reproducer

Reproducer program which stresses the reference processor code by retaining a large Finalizer set while turning it over at a high rate. Run as

java -XX:+PrintGCDetails Finalizetest

The test should finalize about 500 blocks before OOM. The bug in jdk6_b28 reduces it to a few blocks at most.

Comment 40 Stuart D Gathman 2014-02-27 15:33:29 UTC

Wow.  I don't know whether to be disappointed that I didn't get to a reproducer first, or grateful that I don't have to.  :-)  In any case, I am vicariously thrilled to nail the bug!

Thanks for the GC algorithm tip, I will try it.  I've been fixing some of the more egregious Finalizer creating code in Adempiere:  For instance, creating PreparedStatements for the same query in an inner loop!  But it is a drop in the bucket of a large project, and the JVM will have to be able to handle the inefficient code.

Comment 41 Andrew Dinn 2014-02-27 16:02:46 UTC

(In reply to Stuart D Gathman from comment #40)
> Wow.  I don't know whether to be disappointed that I didn't get to a
> reproducer first, or grateful that I don't have to.  :-)  In any case, I am
> vicariously thrilled to nail the bug!

The hardest bit was looking for a needle in a haystack labelled hotspot only to find that it was actually in the next door haystack labelled jdk. Be very, very grateful :-)

> Thanks for the GC algorithm tip, I will try it.  I've been fixing some of
> the more egregious Finalizer creating code in Adempiere:  For instance,
> creating PreparedStatements for the same query in an inner loop!  But it is
> a drop in the bucket of a large project, and the JVM will have to be able to
> handle the inefficient code.

Well, beware. This worked when I tried it. However, it probably only did so by luck -- the same reference forwarding code is used by all GCs. I suspect that choosing a different GC left different data on the stack and these happened to include the necessary 0 bit to select the old vs new path for discovered Reference linking.

Comment 42 jiri vanek 2014-03-12 17:03:58 UTC

pushed http://pkgs.devel.redhat.com/cgit/rpms/java-1.6.0-openjdk/commit/?h=rhel-6.6&id=2c89d7c91d63866bc5630c23eff83c9d92c20db9
and build http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7194088

Comment 43 Deepak Bhole 2014-03-12 19:52:43 UTC

*** Bug 1025555 has been marked as a duplicate of this bug. ***

Comment 45 Rostislav Svoboda 2014-03-20 12:15:54 UTC

Restoring needinfo ...

Comment 49 jiri vanek 2014-03-21 09:00:31 UTC

Restoring needinfo ...

Comment 54 Stuart D Gathman 2014-06-16 15:35:23 UTC

I am not clear on what info is needed from me.  I don't believe the patch has hit CentOS yet, and unfortunately, my current employer is not a RedHat subscriber.

Comment 55 Andrew Haley 2014-06-16 15:47:16 UTC

This is fixed.

Comment 57 Andrew John Hughes 2017-02-07 16:42:49 UTC

This was released in 1.13.2: http://blog.fuseyism.com/index.php/2014/03/28/icedtea-1-13-2-for-openjdk-6-sizeof-considered-harmful-released/

It was picked up in RHEL 6.5 when it upgraded to 1.13.3 in April 2014.