Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1072374

Summary: OOM permGen after application redeploy
Product: [JBoss] JBoss Enterprise Application Platform 6
Component: Class Loading
Version: 6.3.0
Reporter: Jitka Kozana <jkudrnac>
Assignee: David M. Lloyd <david.lloyd>
QA Contact: Jitka Kozana <jkudrnac>
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
CC: dpospisi, jdoyle, jkudrnac, lthon, myarboro, paul.ferraro, rhusar
Keywords: Regression
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Last Closed: 2014-05-21 10:01:28 UTC
Attachments:
- jmap -permstat output for perf18 on 6.3.0.DR4 (no flags)
- permgen-related parts of jmap -heap $PID output for 6.2.0.GA and 6.3.0.ER1 (no flags)

Description Jitka Kozana 2014-03-04 13:20:40 UTC
EAP 6.3.0.DR1. Failover tests, failure type: undeploy of application. Standalone clients are accessing EJB via servlet. Cache setup: REPL SYNC, DIST SYNC.

After the first undeploy, an OOM permgen error is thrown. The node has to be killed manually.

See the server log [1] and [2]. 

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/47/console-perf18/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-sync/34/console-perf18/

I'll give you a heapdump shortly.

Comment 1 Paul Ferraro 2014-03-04 15:24:38 UTC
There are known classloader leaks in the distributable SFSB implementation in EAP 6.3.  This will be fixed in EAP 7.

Comment 2 Radoslav Husar 2014-03-04 16:34:58 UTC
Simply adding e.g. -XX:MaxPermSize=512m avoids the problem in that test. [1]

Looking at the heap dump, I don't see anything else leaking.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/rhusar___eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/5/
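The workaround above can be made permanent in the server's startup configuration. A sketch, assuming the standard EAP 6 launch scripts (which source bin/standalone.conf and pass JAVA_OPTS to the JVM); 512m is simply the value that avoided the OOM in the linked run, not a tuned recommendation:

```shell
# bin/standalone.conf -- sketch of the workaround from this comment.
# Raises the permgen ceiling for the server JVM.
JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=512m"
```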

Comment 5 Ladislav Thon 2014-03-12 09:00:22 UTC
I've also seen a permgen OOM in https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/45/ (look for "OutOfMemory" in https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/45/console-perf20/), which is a test that doesn't involve EJBs at all.

Note that I haven't dug deeper yet, so this might be a separate issue; I'm writing it down here for now, though.

Comment 6 Ladislav Thon 2014-03-12 09:52:14 UTC
I was a bit too fast; there's another case here: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-sync/42/ -- the funny thing in this case is that the OOM happens during shutdown (both on perf19 and perf20).

Comment 7 Paul Ferraro 2014-03-12 14:13:12 UTC
"After redeploy of application using distributable SFSB EJB3, an OOM is thrown."
This is not a categorically true statement.  The clustering testsuite has several tests that undeploy an application using distributable SFSBs, and they do not cause OOMs.  So, no, the results of this test run do not necessarily correspond to the behavior that customers will see.

To clarify the scope of the problem: currently, SFSB instances for all SFSB types from all deployments are stored in a single cache instance.  When a deployment containing clustered SFSBs is undeployed, any associated SFSB instances will remain in memory (although still subject to passivation), and that memory will not be completely freed until all deployments containing clustered SFSBs are undeployed (on that node). Only then will that global cache instance shut down. This has been a problem since EAP 6.0 and will remain so until the new implementation found in WildFly is productized.

Comment 8 Ladislav Thon 2014-03-13 09:52:39 UTC
OK, looking at some heap dumps, the OOMs in the "http-session" tests are probably bug 1032552. Not related to this bug.

Comment 11 Jitka Kozana 2014-03-18 14:55:56 UTC
We have run the tests mentioned in #1 with EAP 6.3.0.DR0, which was using Infinispan 5.2.7, and the OOM issue is there as well. Hence, this leak seems unrelated to the Infinispan upgrade from 5.2.7 to 5.2.8.

Comment 13 Jitka Kozana 2014-03-19 14:48:49 UTC
Created attachment 876374 [details]
jmap -permstat output for perf18 on 6.3.0.DR4

Comment 14 Rostislav Svoboda 2014-03-21 10:40:25 UTC
We still think it's a blocker; proposing for the beta blocker tracker.

Comment 15 Paul Ferraro 2014-03-21 13:55:12 UTC
Given comment #11, there are no clustering changes between 6.2 and 6.3 that would indicate that clustering is to blame for this OOM.  Therefore, QE should look elsewhere to find the source of the increased memory requirements that lead to this OOM.  My negative ack stands.

Comment 16 Radoslav Husar 2014-03-25 16:28:39 UTC
According to comment #12 this is reproducible on DR0. There are virtually no changes to clustering in DR0, so it's unlikely that this is a clustering regression as the BZ currently describes.

I am going to move it to another component so that others can help look into this.

Comment 17 David M. Lloyd 2014-03-25 16:43:03 UTC
The build #41 heap dump does not show any evidence of a leak.  There is only one live service class loader, which would be the single deployment.  It is possible that undeploy may cause additional classes to be loaded; in this case the correct fix would be to simply increase the max perm gen by some percentage margin.

Comment 18 David M. Lloyd 2014-03-25 16:53:48 UTC
One thing worth noting is that in JDK 6 and earlier, interned strings seem to have gone into the permgen, which is no longer the case on JDK 7.  Given the jmap output showing what appears to be a very reasonable number and size of classes, and a slightly larger <internal> area, it's *possible* that something is interning strings that should not be.
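The interning behavior referred to here can be sketched in a few lines (class and variable names are illustrative only):

```java
// Sketch of string interning: intern() returns the canonical pooled copy
// of a string -- stored in permgen on JDK 6 and earlier, on the ordinary
// heap from JDK 7 on.
public class InternSketch {
    public static void main(String[] args) {
        String fresh = new String("redeploy"); // a new heap object, not the pooled literal
        String pooled = fresh.intern();        // the canonical copy from the string pool
        System.out.println(fresh == "redeploy");  // false: distinct object
        System.out.println(pooled == "redeploy"); // true: literals are interned
    }
}
```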

However I'll stick with my initial hypothesis that the perm gen is simply too small based on the number of classes being loaded.  If it were right on the edge, then ordering of operations might cause certain classes to be loaded or not loaded depending on factors such as scheduling, as there are several code paths in the AS which optimize themselves if a calculation result is available earlier than expected.

Comment 19 Radoslav Husar 2014-03-25 23:50:15 UTC
> It is possible that undeploy may cause additional classes to be loaded; in this 
> case the correct fix would be to simply increase the max perm gen by some 
> percentage margin.

Let me note here, though, that IIRC the OOM happens when the application is being deployed again, not when the actual undeployment happens.

Nevertheless, I agree with your theory. Should we actually consider increasing the default permgen, since there are simply more classes being loaded in 6.3 compared to 6.2 -- even though it is generally up to the administrator to tune the permgen space given the deployment requirements?

Comment 20 David M. Lloyd 2014-03-27 19:42:44 UTC
We could; someone would have to do some investigation to determine a good size though.  A good approach might be:

1. Set permgen to something big, like 400MB
2. Deploy three or four quickstarts or typical applications
3. Examine permgen size ("s")
4. Change default to be s or k*s where k = 1.2 or something like that
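Step 3 can also be measured from inside the JVM via JMX instead of parsing `jmap -heap` output. A sketch (the class name is hypothetical; pool names vary by collector and JDK version, e.g. "PS Perm Gen" on JDK 7 and "Metaspace" on JDK 8+, hence the loose match):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Sketch of steps 3-4: read the class-metadata pool usage "s" and pad it
// by k = 1.2 to suggest a default ceiling.
public class PermGenSize {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Perm") || name.contains("Metaspace")) {
                long usedMb = pool.getUsage().getUsed() / (1024 * 1024);
                System.out.println(name + ": used " + usedMb
                        + " MB, suggested ceiling " + (usedMb * 12 / 10) + " MB");
            }
        }
    }
}
```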

Comment 21 Ladislav Thon 2014-04-10 13:11:50 UTC
I tried to bisect between EAP_6.2.0.CR3-dev2 and EAP_6.3.0.DR0-dev in the hope of finding a single commit responsible for this. No luck :-( This really seems to be caused by simply loading more classes than before.

I'd argue that we _should_ adjust our default permgen size, since clusterbench is really a _tiny_ application and we are not doing anything extraordinary here.

Comment 22 David M. Lloyd 2014-04-10 13:47:50 UTC
Honestly we haven't ruled out a leak; I'm just still inclined to go with the "simplest explanation" in absence of better evidence.

Some questions need to be answered:

* How many redeploys does it take to hit OOME?
* Can you set a large permgen size and perform several redeploys, then inspect the heap to see if the old deployment class loaders are still around?  (In other words we do not need to hit OOME to know there is a problem.)

If the answer to the first question is "1" *and* the answer to the second question is "there are none", then yes, we can be deterministic in saying we just need a larger permgen size.

Comment 23 Ladislav Thon 2014-04-11 13:50:31 UTC
With small permgen, yes, 1 redeploy is enough to hit OOM.

I did another test run with larger permgen where each server in the cluster performed 5 undeploy/redeploy cycles, and I've taken heap dumps from all servers at the end of the test. No OOM happened, though the heap dumps show that on each server, there are 6 ModuleClassLoaders that loaded clusterbench classes. I don't know the details, but this might be the known classloader leak Paul mentioned above (comment 1).

In case anyone is interested, I uploaded the heap dumps over here: http://lacrosse.corp.redhat.com/~lthon/bz1072374/04-11/

Comment 24 David M. Lloyd 2014-04-11 16:04:40 UTC
Okay if the class loaders are hanging around then there is definitely a leak; the question is whether the leak is in something within the app server or whether the clusterbench library itself causes it somehow.

Comment 25 Rostislav Svoboda 2014-04-15 12:01:22 UTC
Customers/users won't see this problem immediately with a standard configuration, but only after several redeploys. Thus removing Beta Blocker; still a blocker for GA.

Comment 26 David M. Lloyd 2014-04-15 12:30:08 UTC
I don't have permission to download those heap dumps, Ladislav.

Comment 27 Ladislav Thon 2014-04-15 13:15:51 UTC
Sorry about that, David, I obviously forgot to adjust permissions. Should be fine now.

Comment 28 David M. Lloyd 2014-04-15 14:20:38 UTC
Rostislav, is there evidence that such leaks occur for all applications?  I thought this only was happening with clusterbench.

Comment 29 David M. Lloyd 2014-04-15 14:28:00 UTC
The heap dump does indeed show an extra 24 modules of clusterbench, but they are all pending finalization.  Perhaps force a GC before gathering the heap dump?  This probably explains why you never actually hit OOM.
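Why the forced GC matters can be sketched with a WeakReference standing in for the class loader of an undeployed app (class name hypothetical; `System.gc()` is only a hint to the JVM, and jmap's `-dump:live` option forces the same full collection implicitly):

```java
import java.lang.ref.WeakReference;

// Sketch: an unreachable object still shows up in a heap dump until a
// collection actually runs, which is why dumps taken without a forced GC
// can show "leaked" class loaders that are merely pending collection.
public class GcBeforeDump {
    public static void main(String[] args) {
        Object deployment = new Object();
        WeakReference<Object> ref = new WeakReference<>(deployment);
        deployment = null;  // "undeploy": drop the last strong reference
        for (int i = 0; i < 5 && ref.get() != null; i++) {
            System.gc();    // request a collection, retrying a few times
        }
        System.out.println(ref.get() == null); // collected once a GC has run
    }
}
```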

Comment 30 Ladislav Thon 2014-04-16 07:32:49 UTC
Yes, in this case permgen was big enough so that it didn't have to be collected. Forcing GC before taking a heap dump totally makes sense, I'll try to do another run shortly (hopefully tomorrow).

BTW, how can I find out from the heap dump whether an object is pending finalization? I don't remember seeing that information in Eclipse MAT, though I admit my understanding of the tool is rather shallow. Or did you use another tool (jhat)?

Comment 31 David M. Lloyd 2014-04-16 12:50:01 UTC
I looked in jhat first, but jhat is not so great.  I switched to yourkit which gave much better information (faster too).

Comment 32 Ladislav Thon 2014-04-22 11:46:03 UTC
I just got to this and ran another small test, looking at the permgen size at the beginning of the test (before the first undeploy) and at the end of the test (after the last redeploy) using `jmap -heap $PID`. I didn't trigger GC manually.

On 6.2.0.GA with the JVM-default permgen size (which in our environment is 82 MB), the permgen is always something like 99.9% full. For 6.3.0.ER1, I adjusted MaxPermSize to 90 MB, and the comparison suggests that 6.3.0.ER1 needs a little more (about 1 MB) permgen than 6.2.0.GA. This sounds reasonable to me. (I will attach a file with all the numbers in a moment, if anyone wants to take a look.)

From my point of view, no additional investigation is needed.

Comment 33 Ladislav Thon 2014-04-22 11:47:13 UTC
Created attachment 888457 [details]
permgen-related parts of jmap -heap $PID output for 6.2.0.GA and 6.3.0.ER1

Comment 34 Rostislav Svoboda 2014-05-21 10:01:28 UTC
Comment 32 says only a small increase is needed.