Bug 993559
| Field | Value |
|---|---|
| Summary | Regression: OutOfMemoryError, round 2: still looks like a leak somewhere |
| Product | [JBoss] JBoss Enterprise Application Platform 6 |
| Component | Clustering |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | urgent |
| Version | 6.1.1 |
| Target Milestone | ER5 |
| Target Release | EAP 6.1.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Keywords | Regression |
| Type | Bug |
| Doc Type | Bug Fix |
| Reporter | Ladislav Thon <lthon> |
| Assignee | Radoslav Husar <rhusar> |
| QA Contact | Ladislav Thon <lthon> |
| Docs Contact | Russell Dickenson <rdickens> |
| CC | cdewolf, jkudrnac, lthon, myarboro, paul.ferraro, pslavice, rhusar |
| Last Closed | 2013-09-16 20:22:50 UTC |
Description
Ladislav Thon, 2013-08-06 07:46:42 UTC
@Ladislav could you please provide the heap snapshot (hprof) like last time?

Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/

> Forbidden
> You don't have permission to access /~lthon/bz993559/java_pid13628.hprof on this server.

I don't have read permission on those files. (I got read permission on the directory itself.)

I received feedback that the original description could be... more clear, so I'll try to rephrase.

With -Xmx2250m, memory consumption in EAP 6.1.0.ER8 [1] versus EAP 6.1.1.ER4 + fix [2]. Didn't hit OOM in [2].

With -Xmx768m, memory consumption in EAP 6.1.0.GA [3] versus EAP 6.1.1.ER4 + fix [4]. Did hit OOM in [4].

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web/28/artifact/report/graph-cluster-memory.png
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/7/artifact/report/graph-cluster-memory.png
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/6/artifact/report/graph-cluster-memory.png

Regarding comment 3, that was a stupid mistake on my side, fixed.

Can you fill in some metadata about the heaps? I.e.: which server is which and where in the test were the snapshots taken? Thanks

I have analyzed the heaps and we are having the same problem as in BZ 990092: we seem to never end some batches. I am investigating the root cause now.

Rado: heap dumps are taken automatically via -XX:+HeapDumpOnOutOfMemoryError. Which server is which dump can be seen in the logs:

- perf18: java_pid20159.hprof
- perf19: java_pid27241.hprof
- perf20: java_pid13628.hprof
- perf21: java_pid16827.hprof

Created attachment 783299 [details]
jboss web with filled in missing endBatch()
Ladislav, could you try with this patched Jar I built for you?
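The attached fix fills in a missing endBatch() call. As a minimal sketch of that leak pattern in plain Java (the `Batcher` class and method names here are illustrative only, not the actual JBoss Web or Infinispan API): a batch opened on the session cache must be closed on every exit path, otherwise its bookkeeping is retained after the request finishes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for a cache batching facility: counts open batches.
class Batcher {
    final AtomicInteger open = new AtomicInteger();
    void startBatch() { open.incrementAndGet(); }
    void endBatch()   { open.decrementAndGet(); }
    int openBatches() { return open.get(); }
}

public class BatchLeak {
    // Leaky variant: an exception between start and end skips endBatch().
    static void requestLeaky(Batcher b) {
        b.startBatch();
        process();        // may throw
        b.endBatch();     // never reached on exception -> batch stays open
    }

    // Fixed variant: endBatch() in finally runs on every exit path.
    static void requestFixed(Batcher b) {
        b.startBatch();
        try {
            process();
        } finally {
            b.endBatch();
        }
    }

    static void process() { throw new RuntimeException("request failed"); }

    public static void main(String[] args) {
        Batcher b1 = new Batcher();
        try { requestLeaky(b1); } catch (RuntimeException ignored) {}
        System.out.println("open after leaky request: " + b1.openBatches()); // 1

        Batcher b2 = new Batcher();
        try { requestFixed(b2); } catch (RuntimeException ignored) {}
        System.out.println("open after fixed request: " + b2.openBatches()); // 0
    }
}
```

Under load, each request that fails in the leaky variant permanently retains one batch's worth of state, which matches the "we seem to never end some batches" observation above.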
Rado, I am double checking here: does your fix (the jar in comment #9) contain the first patch for BZ 990092?

We are doing another set of runs with a very restricted configuration to "quickly" reproduce the issue. The tests are running for 2 hours and servers are configured with -Xmx512m. [1] is EAP 6.1.1.ER4 + the original fix, OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/. [2] is EAP 6.1.1.ER4 + the new fix. Still running, will see. Then, we will run the same test with EAP 6.1.0 for comparison.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/8/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/11/

This is an update of comment 12. We've done another set of runs with a very restricted configuration to "quickly" reproduce the issue and test the fixes. The tests are running for 2 hours and servers are configured with -Xmx512m. See graph-cluster-memory.png in all runs for memory consumption over time.

[1] is EAP 6.1.0. The memory is OK.
[2] is EAP 6.1.1.ER4 + the original fix (fix #1), OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.
[3] is EAP 6.1.1.ER4 + the new fix (fix #2). OOM not present, memory is looking OK for 3 of 4 servers. For perf18, which is the cluster coordinator, the memory usage still looks bad.

To conclude: fix #2 (the one attached to this BZ) seems to fix most of the memory leaks, but not all of them. Will run another test with fix #2 to try to reproduce OOM. Expecting OOM to only happen on the cluster coordinator node.
[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/8/
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/11/

Thanks Ladislav, I was just analyzing the results too. It seems as though the 2nd fix managed to mitigate the issue even more. I am going to try more ideas now; meanwhile I am submitting this fix as a PR.

PR: https://github.com/jbossas/jboss-eap/pull/280

@Jitka, yes, the Jar contains all fixes up to date.

So I ran the last test (6.1.1.ER4 + fix #2, -Xmx512m) again, now for 9 hours, see [1]. Memory usage on the cluster coordinator node (perf18) goes up right from the beginning, but after approx. 5 hours, memory usage on the other nodes goes up as well. Looking at the graph, this probably starts to happen when the coordinator OOMs. Not yet sure about the implications. Uploaded heap dumps from this run to [2] for analysis.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/13/
[2] http://file.brq.redhat.com/~lthon/bz993559/run-no13/

Small addition to comment 16: I just figured that it would probably be useful to be able to distinguish the perf18 heap dump from the others :-) Here's the mapping of heap dumps in http://file.brq.redhat.com/~lthon/bz993559/run-no13/ to cluster node names:

- perf18: java_pid8497.hprof
- perf19: java_pid24304.hprof
- perf20: java_pid29232.hprof
- perf21: java_pid5350.hprof

Created attachment 784400 [details]
jboss web for qe use with fixes until 2013/8/8 15:50 CET
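The heap dumps in this bug were produced automatically by `-XX:+HeapDumpOnOutOfMemoryError`. For completeness, the same `.hprof` files can also be captured on demand mid-test through the JDK's standard HotSpotDiagnostic MXBean; a sketch (the class and output file names are arbitrary):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class DumpHeap {
    public static void main(String[] args) throws Exception {
        File out = new File("manual-dump.hprof"); // name must end in .hprof on recent JDKs
        if (out.exists()) out.delete();           // dumpHeap refuses to overwrite
        HotSpotDiagnosticMXBean mxBean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        mxBean.dumpHeap(out.getPath(), true);     // true = dump live objects only
        System.out.println("wrote " + out.getPath() + " (" + out.length() + " bytes)");
    }
}
```

This produces a dump readable by the same tools (MAT, jhat, VisualVM) as the OOM-triggered ones, which can help compare a healthy snapshot against a near-OOM one.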
I should add a note about the progress: the heap dumps show that PR 280 resolves the problem with open batches. However, it hints that there is a remaining problem with session expiration not expiring sessions but rather keeping them in the cache.

QA note: the tests could probably be sped up to demonstrate the problem by shortening the session lifespan and increasing the number of sessions left for expiration and invalidation. Let me know if you need some help tweaking.

We have run a test with the new patch (comment #18) and still got the OutOfMemory error. I will upload the heap dumps shortly.

Here are the heap dumps: http://lacrosse.redhat.com/~jkudrnac/oom-run-15/

- perf18: java_pid17558.hprof
- perf19: java_pid26989.hprof
- perf20: java_pid23944.hprof
- perf21: java_pid8953.hprof

@Jitka Have you run this last test (which includes the 2 fixes) against both 6.1.0 and 6.1.1? I just want to make sure the current state of this is still a behavioral regression. I ask because I don't see any issue with expiration, as suspected in #c19.

@Paul, the 6.1.0 run of this test scenario is still here: [1]. The 6.1.1 run with all the latest fixes is [2], as you have already found out. The OOM was seen approx. 70 minutes after starting the test; see the snippet from the logs of run [2] below [3].

[1] 6.1.0: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/
[2] 6.1.1 + patch #3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/15/
[3]
10:28:37:765 EDT Starting the test.
(...)
11:36:58,416 java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid17558.hprof

Verified on EAP 6.1.1.ER6.
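As an appendix, the session-expiration suspicion raised in comment 19 (sessions past their timeout staying in the cache) can be illustrated with a minimal, hypothetical sketch; `SessionExpiry` and its fields are illustrative only, not the actual clustering code:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class SessionExpiry {
    static final long TIMEOUT_MS = 100;
    // Stand-in for the replicated session cache: session id -> last access time.
    final Map<String, Long> lastAccess = new HashMap<>();

    void touch(String id, long now) { lastAccess.put(id, now); }

    // Leaky variant: expired sessions are merely skipped, never removed,
    // so the map (and the heap) keeps growing with dead sessions.
    int liveSessionsLeaky(long now) {
        int live = 0;
        for (long t : lastAccess.values())
            if (now - t <= TIMEOUT_MS) live++;
        return live;
    }

    // Fixed variant: expiration actually evicts entries from the cache.
    void expire(long now) {
        Iterator<Map.Entry<String, Long>> it = lastAccess.entrySet().iterator();
        while (it.hasNext())
            if (now - it.next().getValue() > TIMEOUT_MS) it.remove();
    }

    public static void main(String[] args) {
        SessionExpiry cache = new SessionExpiry();
        for (int i = 0; i < 1000; i++) cache.touch("session-" + i, 0);
        long later = TIMEOUT_MS + 1;
        System.out.println("live: " + cache.liveSessionsLeaky(later));      // 0
        System.out.println("retained: " + cache.lastAccess.size());         // 1000
        cache.expire(later);
        System.out.println("after eviction: " + cache.lastAccess.size());   // 0
    }
}
```

This also motivates the QA note above: a short session lifespan plus many abandoned sessions makes the retained-but-expired population (and thus the leak) visible much faster.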