Bug 1072374
| Summary: | OOM permGen after application redeploy |
|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 |
| Component: | Class Loading |
| Version: | 6.3.0 |
| Status: | CLOSED WONTFIX |
| Severity: | high |
| Priority: | unspecified |
| Keywords: | Regression |
| Reporter: | Jitka Kozana <jkudrnac> |
| Assignee: | David M. Lloyd <david.lloyd> |
| QA Contact: | Jitka Kozana <jkudrnac> |
| CC: | dpospisi, jdoyle, jkudrnac, lthon, myarboro, paul.ferraro, rhusar |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2014-05-21 10:01:28 UTC |
Description: Jitka Kozana, 2014-03-04 13:20:40 UTC
There are known classloader leaks in the distributable SFSB implementation in EAP 6.3. This will be fixed in EAP 7.

Simply adding e.g. -XX:MaxPermSize=512m avoids the problem in that test. [1] Looking at the heap dump, I don't see anything else leaking.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/rhusar___eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/5/

I've also seen a permgen OOM in https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/45/ (look for "OutOfMemory" in https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/45/console-perf20/), which is a test that doesn't involve EJBs at all. Note that I haven't dug deeper yet, so this might be a separate issue; I'm writing it here for now, though.

I was a bit too fast; there's another case here: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-sync/42/ -- the funny thing in this case is that the OOM happens during shutdown (both on perf19 and perf20).

"After redeploy of application using distributable SFSB EJB3, an OOM is thrown." This is not a categorically true statement. The clustering testsuite has several tests that undeploy an application using distributable SFSBs, and they do not cause OOMs. So, no, the results of this test run do not necessarily correspond to the behavior that customers will see.

To clarify the scope of the problem: currently, SFSB instances for all SFSB types from all deployments are stored in a single cache instance. When a deployment containing clustered SFSBs is undeployed, any associated SFSB instances remain in memory (although still subject to passivation), and that memory is not completely freed until all deployments containing clustered SFSBs are undeployed (on that node). Only then will that global cache instance shut down.
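The -XX:MaxPermSize workaround raises the permanent generation ceiling rather than fixing any leak. As a minimal sketch of where it would go in EAP 6 (the flag is real and only applies to JDK 7 and earlier; the surrounding JAVA_OPTS values are illustrative, not taken from this bug):

```shell
# bin/standalone.conf (illustrative excerpt): raise MaxPermSize so redeploys
# have permgen headroom. Heap sizes here are examples only.
if [ "x$JAVA_OPTS" = "x" ]; then
   JAVA_OPTS="-Xms1024m -Xmx1024m -XX:MaxPermSize=512m"
fi
```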
This has been a problem since EAP 6.0 and will remain so until the new implementation found in WildFly is productized.

OK, looking at some heap dumps, the OOMs in the "http-session" tests are probably bug 1032552. Not related to this bug.

We have run the tests mentioned in #1 with EAP 6.3.0.DR0, which was using Infinispan 5.2.7, and the OOM issue is there as well. Hence, this leak seems unrelated to the Infinispan upgrade from 5.2.7 to 5.2.8.

Created attachment 876374 [details]: jmap -permstat output for perf18 on 6.3.0.DR4
We still think it's a blocker ... proposing for the beta blocker tracker.

Given comment #11, there are no clustering changes between 6.2 and 6.3 that would indicate that clustering is to blame for this OOM. Therefore, QE should look elsewhere to find the source of the increased memory requirements that lead to this OOM. My negative ack stands.

According to comment #12 this is reproducible on DR0. There are virtually no changes to clustering in DR0, so it's unlikely that this is a clustering regression as the BZ currently describes. I am going to move it to another component so others can help look into this.

The build #41 heap dump does not show any evidence of a leak. There is only one live service class loader, which would be the single deployment. It is possible that undeploy may cause additional classes to be loaded; in this case the correct fix would be to simply increase the max perm gen by some percentage margin.

One thing worth noting is that in JDK 6 and earlier, interned strings went into the permgen, which is no longer the case on JDK 7. Given the jmap output showing what appears to be a very reasonable number and size of classes, and a slightly larger <internal> area, it's *possible* that something is interning strings that should not be. However, I'll stick with my initial hypothesis that the perm gen is simply too small for the number of classes being loaded. If it were right on the edge, then the ordering of operations might cause certain classes to be loaded or not loaded depending on factors such as scheduling, as there are several code paths in the AS which optimize themselves if a calculation result is available earlier than expected.

> It is possible that undeploy may cause additional classes to be loaded; in this case the correct fix would be to simply increase the max perm gen by some percentage margin.
Let me note, though, that IIRC the OOM happens when the application is being deployed again, not during the actual undeployment.
Nevertheless, I agree with your theory. Should we actually consider increasing the default permgen, since there are simply more classes being loaded in 6.3 compared to 6.2? Even though it is generally up to the administrator to tune the permgen space given the deployment requirements.
We could; someone would have to do some investigation to determine a good size though. A good approach might be:
1. Set permgen to something big, like 400MB
2. Deploy three or four quickstarts or typical applications
3. Examine permgen size ("s")
4. Change default to be s or k*s where k = 1.2 or something like that
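The sizing rule in step 4 can be sketched as a tiny script. Nothing here comes from the bug itself: `suggest_permsize` is a hypothetical helper, and the measured value would come from step 3 (e.g. the permgen section of `jmap -heap` output).

```shell
# Sketch of step 4: given a measured permgen usage s (in MB, from step 3),
# suggest a new default of k*s with a safety factor k = 1.2.
suggest_permsize() {
  s_mb=$1   # measured permgen usage in MB
  k=1.2     # safety margin suggested above
  awk -v s="$s_mb" -v k="$k" 'BEGIN { printf "%.0f\n", s * k }'
}

suggest_permsize 82   # with the 82 MB JVM default observed later in this bug, prints 98
```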
I tried to bisect between EAP_6.2.0.CR3-dev2 and EAP_6.3.0.DR0-dev in the hope of finding a single commit responsible for this. No luck :-( This really seems to be caused by simply loading more classes than before. I'd argue that we _should_ adjust our default permgen size, since clusterbench is really a _tiny_ application and we are not doing anything extraordinary here.

Honestly, we haven't ruled out a leak; I'm just still inclined to go with the "simplest explanation" in the absence of better evidence. Some questions need to be answered:

* How many redeploys does it take to hit the OOME?
* Can you set a large permgen size and perform several redeploys, then inspect the heap to see if the old deployment class loaders are still around? (In other words, we do not need to hit an OOME to know there is a problem.)

If the answer to the first question is "1" *and* the answer to the second question is "there are none", then yes, we can deterministically say we just need a larger permgen size.

With a small permgen, yes, 1 redeploy is enough to hit the OOM. I did another test run with a larger permgen where each server in the cluster performed 5 undeploy/redeploy cycles, and I took heap dumps from all servers at the end of the test. No OOM happened, though the heap dumps show that on each server there are 6 ModuleClassLoaders that loaded clusterbench classes. I don't know the details, but this might be the known classloader leak Paul mentioned above (comment 1). In case anyone is interested, I uploaded the heap dumps here: http://lacrosse.corp.redhat.com/~lthon/bz1072374/04-11/

Okay, if the class loaders are hanging around then there is definitely a leak; the question is whether the leak is in something within the app server or whether the clusterbench library itself causes it somehow.

Customers/users won't see this problem immediately with a standard configuration, but only after several redeploys. Thus removing Beta Blocker; still a blocker for GA.
I don't have permission to download those heap dumps, Ladislav.

Sorry about that, David, I obviously forgot to adjust the permissions. Should be fine now.

Rostislav, is there evidence that such leaks occur for all applications? I thought this was only happening with clusterbench.

The heap dump does indeed show an extra 24 modules of clusterbench, but they are all pending finalization. Perhaps force a GC before gathering the heap dump? This probably explains why you never actually hit the OOM.

Yes, in this case the permgen was big enough that it didn't have to be collected. Forcing a GC before taking a heap dump totally makes sense; I'll try to do another run shortly (hopefully tomorrow). BTW, how can I find out from a heap dump whether an object is pending finalization? I don't remember seeing that information in Eclipse MAT, though I admit my understanding of the tool is rather shallow. Or did you use another tool (jhat)?

I looked in jhat first, but jhat is not so great. I switched to YourKit, which gave much better information (faster too).

I just got to this and ran another small test, looking at the permgen size at the beginning of the test (before the first undeploy) and at the end of the test (after the last redeploy) using `jmap -heap $PID`. I didn't trigger GC manually. On 6.2.0.GA with the JVM-default permgen size (which in our environment is 82 MB), the permgen is always about 99.9% full. For 6.3.0.ER1, I adjusted MaxPermSize to 90 MB, and the comparison suggests that 6.3.0.ER1 needs a little more (circa 1 MB) permgen than 6.2.0.GA. This sounds reasonable to me. (I will attach a file with all the numbers in a moment, if anyone wants to take a look.) From my point of view, no additional investigation is needed.

Created attachment 888457 [details]: permgen-related parts of jmap -heap $PID output for 6.2.0.GA and 6.3.0.ER1

Comment 32 says only a small increase is needed.
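A comparison like the one in the attachment can be scripted. This is a sketch under assumptions: `permgen_pct` is a hypothetical helper, and the sample text below only mimics the shape of the Perm Generation section printed by JDK 7 `jmap -heap`; the numbers are illustrative, not the attachment's.

```shell
# Extract the "% used" figure from the Perm Generation section of
# `jmap -heap $PID` output (JDK 7 output format assumed).
permgen_pct() {
  awk '/Perm Generation/ { inperm = 1 }
       inperm && /% used/ { print $1; exit }'
}

# Illustrative sample shaped like jmap -heap output:
sample='PS Perm Generation
   capacity = 85983232 (82.0MB)
   used     = 85897216 (81.918MB)
   free     = 86016 (0.082MB)
   99.9% used'

printf '%s\n' "$sample" | permgen_pct   # prints 99.9%
```

Running this for both 6.2.0.GA and 6.3.0.ER1 dumps would make the "always like 99.9% full" comparison mechanical.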