Red Hat Bugzilla – Bug 993559
Regression: OutOfMemoryError, round 2: still looks like a leak somewhere
Last modified: 2013-09-16 16:22:50 EDT
This is a continuation of bug 990092, as we've run the soak test several times with 6.1.1.ER4 + the fix from https://github.com/jbossas/jboss-eap/pull/265. See .
Memory consumption looks a lot better than with pure 6.1.1.ER4, but I believe that there is still something wrong. Compare , which is the latest ER build from EAP 6.1.0 testing, with , which is 6.1.1.ER4 + fix under the same conditions (except that it's only half a day). In the run of , there is no OOM.
However, we did hit the OOM when we ran the full-length test (1 day) with 6.1.1.ER4 + fix with reduced memory (-Xmx768m). It's questionable whether this is representative, but I believe that it is. Compare , which is 6.1.1.ER4 + fix with reduced memory (-Xmx768m) running for 1 day, with , which is EAP 6.1.0 running with reduced memory (-Xmx768m).  only runs for a few hours, but it clearly shows the trend.
Given that we don't have directly comparable data and have to extrapolate, I'm setting severity to "high" only, but I believe we should look into this. I will upload heap dumps from  and have a short look at them soon.
@Ladislav could you please provide the heap snapshot (hprof) like last time?
Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/
You don't have permission to access /~lthon/bz993559/java_pid13628.hprof on this server.
I don't have read permission on those files. (I got read permission on the directory itself.)
I received feedback that the original description could be... clearer, so I'll try to rephrase.
With -Xmx2250m, memory consumption in EAP 6.1.0.ER8  versus EAP 6.1.1.ER4 + fix . Didn't hit OOM in .
With -Xmx768m, memory consumption in EAP 6.1.0.GA  versus EAP 6.1.1.ER4 + fix . Did hit OOM in .
Regarding comment 3, that was a stupid mistake on my side, fixed.
Can you fill in some metadata about the heaps? I.e.: which server is which, and where in the test were the snapshots taken? Thanks
I have analyzed the heaps and we are having the same problem as in BZ 990092: we seem to never end some batches. I am investigating the root cause now.
Rado: heap dumps are taken automatically via -XX:+HeapDumpOnOutOfMemoryError. Which dump belongs to which server can be seen in the logs:
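For reference, the automatic dumps come from standard HotSpot options along these lines (the exact -Xmx varied per run, and the dump path here is only illustrative):

```
-Xmx768m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps
```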
Created attachment 783299 [details]
jboss web with filled in missing endBatch()
Ladislav, could you try with this patched Jar I built for you?
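To make the "missing endBatch()" failure mode concrete, here is a minimal self-contained sketch. It is not the actual JBoss Web / Infinispan code: BatchingCache and its method names are illustrative stand-ins mirroring the shape of Infinispan's startBatch()/endBatch(). The point is that an exception on the work path skips endBatch(), so the batch context (and everything it references) stays reachable; a try/finally, which is the shape of the attached patch, closes the batch on every exit path.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative stand-in for a batching cache; a batch that is started
// but never ended keeps its context strongly reachable (the leak).
class BatchingCache {
    private final Deque<String> openBatches = new ArrayDeque<>();

    boolean startBatch(String owner) {
        openBatches.push(owner); // context held until endBatch()
        return true;
    }

    void endBatch(String owner, boolean commit) {
        openBatches.pop(); // releases the held context
    }

    int openBatchCount() {
        return openBatches.size();
    }
}

public class BatchLeakDemo {

    // Leaky shape: if work() throws, endBatch() is never reached.
    static void leaky(BatchingCache cache) {
        cache.startBatch("session-repl");
        work();
        cache.endBatch("session-repl", true);
    }

    // Fixed shape: try/finally guarantees endBatch() on every exit path.
    static void fixed(BatchingCache cache) {
        boolean started = cache.startBatch("session-repl");
        try {
            work();
        } finally {
            if (started) {
                cache.endBatch("session-repl", false);
            }
        }
    }

    static void work() {
        throw new RuntimeException("simulated replication failure");
    }

    public static void main(String[] args) {
        BatchingCache c1 = new BatchingCache();
        try { leaky(c1); } catch (RuntimeException ignored) { }
        System.out.println("leaky open batches: " + c1.openBatchCount());  // 1

        BatchingCache c2 = new BatchingCache();
        try { fixed(c2); } catch (RuntimeException ignored) { }
        System.out.println("fixed open batches: " + c2.openBatchCount());  // 0
    }
}
```

In a heap dump the leaky variant shows up exactly as observed here: batch/transaction contexts that are never released, accumulating until OOM.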
I am double-checking here: does your fix (the jar in comment #9) contain the first patch for BZ 990092?
We are doing another set of runs with a very restricted configuration to "quickly" reproduce the issue. The tests are running for 2 hours and servers are configured with -Xmx512m.
 is EAP 6.1.1.ER4 + the original fix, OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.
 is EAP 6.1.1.ER4 + the new fix. Still running, will see.
Then, we will run the same test with EAP 6.1.0 for comparison.
This is an update of comment 12.
We've done another set of runs with a very restricted configuration to "quickly" reproduce the issue and test the fixes. The tests are running for 2 hours and servers are configured with -Xmx512m. See graph-cluster-memory.png in all runs for memory consumption over time.
 is EAP 6.1.0. The memory is OK.
 is EAP 6.1.1.ER4 + the original fix (fix #1), OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.
 is EAP 6.1.1.ER4 + the new fix (fix #2). OOM not present, memory is looking OK for 3 of 4 servers. For perf18, which is the cluster coordinator, the memory usage still looks bad.
To conclude: fix #2 (the one attached to this BZ) seems to fix most of the memory leaks, but not all of them. Will run another test with fix #2 to try to reproduce OOM. Expecting OOM to only happen on the cluster coordinator node.
Thanks Ladislav, I was just analyzing the results too. It seems as though the 2nd fix managed to mitigate the issue even more.
I am going to try more ideas now; meanwhile I am submitting this fix as a PR.
@Jitka, yes, the Jar contains all fixes up to date.
So I ran the last test (6.1.1.ER4 + fix #2, -Xmx512m) again, now for 9 hours, see .
Memory usage on the cluster coordinator node (perf18) goes up right from the beginning, but after about 5 hours, memory usage on the other nodes goes up as well. Looking at the graph, this probably starts happening when the coordinator OOMs. Not yet sure about the implications.
Uploaded heap dumps from this run to  for analysis.
Small addition to comment 16: I just figured that it would be probably useful to be able to distinguish perf18 heap dump from the others :-) Here's the mapping of heap dumps in http://file.brq.redhat.com/~lthon/bz993559/run-no13/ to cluster node names:
Created attachment 784400 [details]
jboss web for qe use with fixes until 2013/8/8 15:50 CET
I should add a note about the progress: the heap dumps show that PR 280 resolves the problem with open batches.
However, they hint at a remaining problem with session expiration: sessions are not being expired, but rather kept in the cache.
QA note: the tests could probably be sped up to demonstrate the problem by shortening the session lifespan and increasing the number of sessions left for expiration and invalidation. Let me know if you need some help tweaking.
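Following up on the tweaking suggestion above: one low-effort way to shorten the session lifespan is the standard Servlet session-config in the test application's web.xml. The 1-minute value below is only an illustrative, aggressive setting, not what the soak test actually uses:

```xml
<!-- web.xml fragment: expire idle HTTP sessions after 1 minute so that
     expiration/invalidation paths are exercised quickly.
     Value is in minutes, per the Servlet spec; pick what the test needs. -->
<session-config>
    <session-timeout>1</session-timeout>
</session-config>
```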
We have run a test with the new patch (comment #18) and still got the OutOfMemoryError.
I will upload the heap dumps shortly.
Here are the heap dumps:
@Jitka Have you run this last test (which includes the 2 fixes) against both 6.1.0 and 6.1.1? I just want to make sure the current state of this is still a behavioral regression.
I ask because I don't see any issue with expiration - as suspected in #c19
the 6.1.0 run of this test scenario is still here: .
The 6.1.1 run with all the latest fixes is , as you have already found out. The OOM was seen approx. 70 min after starting the test, see the snippet from logs of run  below .
 6.1.0: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/
 6.1.1 + patch #3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/15/
 10:28:37:765 EDT Starting the test.
11:36:58,416 java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid17558.hprof
Verified on EAP 6.1.1.ER6.