Bug 993559 - Regression: OutOfMemoryError, round 2: still looks like a leak somewhere
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Hardware: Unspecified  OS: Unspecified
Priority: urgent  Severity: urgent
: ER5
: EAP 6.1.1
Assigned To: Radoslav Husar
Ladislav Thon
Russell Dickenson
: Regression
Depends On:
Reported: 2013-08-06 03:46 EDT by Ladislav Thon
Modified: 2013-09-16 16:22 EDT (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-09-16 16:22:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
jboss web with filled in missing endBatch() (600.74 KB, application/x-java-archive)
2013-08-06 08:09 EDT, Radoslav Husar
jboss web for qe use with fixes until 2013/8/8 15:50 CET (600.97 KB, application/x-java-archive)
2013-08-08 09:51 EDT, Radoslav Husar

Description Ladislav Thon 2013-08-06 03:46:42 EDT
This is a continuation of bug 990092, as we've run the soak test several times with 6.1.1.ER4 + the fix from https://github.com/jbossas/jboss-eap/pull/265. See [1].

Memory consumption looks a lot better than with pure 6.1.1.ER4, but I believe that there is still something wrong. Compare [2], which is the latest ER build from EAP 6.1.0 testing, with [3], which is 6.1.1.ER4 + fix under the same conditions (except that it runs for only half a day). In the run of [3], there is no OOM.

However, we did hit the OOM when we ran the full-length test (1 day) with 6.1.1.ER4 + fix with reduced memory (-Xmx768m). It's questionable whether this is representative, but I believe that it is. Compare [4], which is 6.1.1.ER4 + fix with reduced memory (-Xmx768m) running for 1 day, with [5], which is EAP 6.1.0 running with reduced memory (-Xmx768m). [5] only runs for a few hours, but it clearly shows the trend.

Given that we don't have directly comparable data and have to extrapolate, I'm setting severity to "high" only, but I believe we should look into this. I will upload heap dumps from [4] and take a quick look at them soon.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web/28/artifact/report/graph-cluster-memory.png
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/7/artifact/report/graph-cluster-memory.png
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/6/artifact/report/graph-cluster-memory.png
[5] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png
Comment 1 Radoslav Husar 2013-08-06 04:05:13 EDT
@Ladislav could you please provide the heap snapshot (hprof) like last time?
Comment 2 Ladislav Thon 2013-08-06 04:34:46 EDT
Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/
Comment 3 Carlo de Wolf 2013-08-06 04:45:21 EDT

You don't have permission to access /~lthon/bz993559/java_pid13628.hprof on this server.

I don't have read permission on those files. (I do have read permission on the directory itself.)
Comment 4 Ladislav Thon 2013-08-06 04:49:39 EDT
I received feedback that the original description could be... clearer, so I'll try to rephrase.

With -Xmx2250m, memory consumption in EAP 6.1.0.ER8 [1] versus EAP 6.1.1.ER4 + fix [2]. Didn't hit OOM in [2].

With -Xmx768m, memory consumption in EAP 6.1.0.GA [3] versus EAP 6.1.1.ER4 + fix [4]. Did hit OOM in [4].

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web/28/artifact/report/graph-cluster-memory.png
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/7/artifact/report/graph-cluster-memory.png
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/6/artifact/report/graph-cluster-memory.png

Regarding comment 3, that was a stupid mistake on my side, fixed.
Comment 5 Radoslav Husar 2013-08-06 05:00:56 EDT
Can you fill in some metadata about the heaps? I.e.: which server is which, and where in the test were the snapshots taken? Thanks
Comment 6 Radoslav Husar 2013-08-06 05:12:09 EDT
I have analyzed the heaps and we are having the same problem as in BZ 990092: we seem to never end some batches. I am investigating the root cause now.
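For context, the "never end some batches" failure mode looks roughly like the sketch below. This is purely illustrative: the Batching interface is a stand-in mirroring Infinispan's Cache.startBatch()/Cache.endBatch(boolean) pattern, not the actual JBoss Web code being patched.

```java
// Stand-in for Infinispan's batching API (Cache.startBatch()/endBatch(boolean)).
// Illustrative only -- not the real JBoss Web / Infinispan classes.
interface Batching {
    boolean startBatch();
    void endBatch(boolean successful);
}

public class BatchLeakSketch {
    static int openBatches = 0;

    static final Batching cache = new Batching() {
        public boolean startBatch() { openBatches++; return true; }
        public void endBatch(boolean successful) { openBatches--; }
    };

    // Leaky variant: an exception skips endBatch(), so the batch (and the
    // session state it pins) is never released -- the pattern seen in the heaps.
    static void leaky(Runnable work) {
        cache.startBatch();
        work.run();                 // if this throws, endBatch() never runs
        cache.endBatch(true);
    }

    // Fixed variant: endBatch() in a finally block guarantees the batch closes.
    static void fixed(Runnable work) {
        boolean started = cache.startBatch();
        boolean success = false;
        try {
            work.run();
            success = true;
        } finally {
            if (started) cache.endBatch(success);
        }
    }

    public static void main(String[] args) {
        try { leaky(() -> { throw new RuntimeException("boom"); }); }
        catch (RuntimeException ignored) { }
        System.out.println("open after leaky: " + openBatches);

        try { fixed(() -> { throw new RuntimeException("boom"); }); }
        catch (RuntimeException ignored) { }
        System.out.println("open after fixed: " + openBatches);
    }
}
```

After the leaky call one batch stays open forever; the fixed variant leaves the count unchanged even when the work throws.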
Comment 7 Ladislav Thon 2013-08-06 05:34:24 EDT
Rado: heap dumps are taken automatically via -XX:+HeapDumpOnOutOfMemoryError. Which server is which dump can be seen in the logs:

perf18: java_pid20159.hprof
perf19: java_pid27241.hprof
perf20: java_pid13628.hprof
perf21: java_pid16827.hprof
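The java_pid<PID>.hprof naming comes from -XX:+HeapDumpOnOutOfMemoryError, so the mapping above can be reconstructed from each node's logs. A hypothetical startup helper (not part of the actual soak-test harness) could record the mapping up front:

```java
import java.lang.management.ManagementFactory;
import java.net.InetAddress;

// Prints this JVM's PID and hostname at startup, so a heap dump named
// java_pid<PID>.hprof can later be matched to the node that produced it.
// A hypothetical convenience helper, not part of the test harness.
public class DumpIdentity {
    public static void main(String[] args) throws Exception {
        // On HotSpot JVMs, RuntimeMXBean.getName() conventionally
        // returns a string of the form "<pid>@<host>".
        String name = ManagementFactory.getRuntimeMXBean().getName();
        String pid = name.split("@")[0];
        String host = InetAddress.getLocalHost().getHostName();
        System.out.println(host + ": java_pid" + pid + ".hprof");
    }
}
```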
Comment 9 Radoslav Husar 2013-08-06 08:09:21 EDT
Created attachment 783299 [details]
jboss web with filled in missing endBatch()

Ladislav, could you try this patched JAR I built for you?
Comment 11 Jitka Kozana 2013-08-06 08:59:25 EDT

I am double checking here: does your fix (the jar in comment #9) contain the first patch for BZ 990092?
Comment 12 Ladislav Thon 2013-08-06 10:02:48 EDT
We are doing another set of runs with a very restricted configuration to "quickly" reproduce the issue. The tests are running for 2 hours and servers are configured with -Xmx512m. 

[1] is EAP 6.1.1.ER4 + the original fix, OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.

[2] is EAP 6.1.1.ER4 + the new fix. Still running, will see.

Then, we will run the same test with EAP 6.1.0 for comparison.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/8/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/11/
Comment 13 Ladislav Thon 2013-08-06 15:37:12 EDT
This is an update of comment 12.

We've done another set of runs with a very restricted configuration to "quickly" reproduce the issue and test the fixes. The tests are running for 2 hours and servers are configured with -Xmx512m. See graph-cluster-memory.png in all runs for memory consumption in time.

[1] is EAP 6.1.0. The memory is OK.

[2] is EAP 6.1.1.ER4 + the original fix (fix #1), OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.

[3] is EAP 6.1.1.ER4 + the new fix (fix #2). OOM not present, memory is looking OK for 3 of 4 servers. For perf18, which is the cluster coordinator, the memory usage still looks bad.

To conclude: fix #2 (the one attached to this BZ) seems to fix most of the memory leaks, but not all of them. Will run another test with fix #2 to try to reproduce OOM. Expecting OOM to only happen on the cluster coordinator node.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/8/

[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/11/
Comment 14 Radoslav Husar 2013-08-06 15:46:15 EDT
Thanks Ladislav, I was just analyzing the results too. It seems as though the 2nd fix managed to mitigate the issue even more.

I am going to try more ideas now; meanwhile, I will submit this fix as a PR.
Comment 15 Radoslav Husar 2013-08-06 15:52:06 EDT
PR: https://github.com/jbossas/jboss-eap/pull/280

@Jitka, yes, the JAR contains all fixes to date.
Comment 16 Ladislav Thon 2013-08-07 02:43:43 EDT
So I ran the last test (6.1.1.ER4 + fix #2, -Xmx512m) again, now for 9 hours, see [1].

Memory usage on the cluster coordinator node (perf18) goes up right from the beginning, but after approximately 5 hours, memory usage on the other nodes goes up as well. Looking at the graph, this probably starts to happen when the coordinator OOMs. Not yet sure about the implications.

Uploaded heap dumps from this run to [2] for analysis.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/13/

[2] http://file.brq.redhat.com/~lthon/bz993559/run-no13/
Comment 17 Ladislav Thon 2013-08-07 02:46:37 EDT
Small addition to comment 16: I just realized that it would probably be useful to be able to distinguish the perf18 heap dump from the others :-) Here's the mapping of heap dumps in http://file.brq.redhat.com/~lthon/bz993559/run-no13/ to cluster node names:

perf18: java_pid8497.hprof
perf19: java_pid24304.hprof
perf20: java_pid29232.hprof
perf21: java_pid5350.hprof
Comment 18 Radoslav Husar 2013-08-08 09:51:06 EDT
Created attachment 784400 [details]
jboss web for qe use with fixes until 2013/8/8 15:50 CET
Comment 19 Radoslav Husar 2013-08-08 16:55:39 EDT
I should add a note about the progress: the heap dumps show that PR 280 resolves the problem with open batches. 

However, it hints that there is a remaining problem with session expiration: expired sessions are not being removed, but are instead kept in the cache.

QA note: the tests could probably be sped up to demonstrate the problem by shortening the session lifespan and increasing the number of sessions left for expiration and invalidation. Let me know if you need some help tweaking.
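The suspected failure mode (expiration detecting stale sessions but leaving them in the cache) can be illustrated with a toy session store. All names and structure here are assumptions for illustration, not JBoss Web's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy session store illustrating the suspected leak: an expiration pass
// that notices stale sessions but never evicts them, so the cache grows
// with every expired session. Illustrative only.
public class SessionExpirySketch {
    static final long MAX_INACTIVE_MS = 100; // short lifespan, as in the QA note
    final Map<String, Long> cache = new HashMap<>(); // session id -> last access

    void access(String id, long now) { cache.put(id, now); }

    // Buggy pass: counts expired sessions but forgets to remove them
    // (analogous to sessions being "kept in the cache").
    int expireBuggy(long now) {
        int expired = 0;
        for (Long last : cache.values())
            if (now - last > MAX_INACTIVE_MS) expired++;
        return expired; // counted, but never evicted
    }

    // Correct pass: evicts expired sessions from the cache.
    int expireFixed(long now) {
        int before = cache.size();
        cache.values().removeIf(last -> now - last > MAX_INACTIVE_MS);
        return before - cache.size();
    }

    public static void main(String[] args) {
        SessionExpirySketch store = new SessionExpirySketch();
        store.access("s1", 0);
        store.access("s2", 0);
        store.expireBuggy(1000);
        System.out.println("after buggy pass: " + store.cache.size());  // 2
        store.expireFixed(1000);
        System.out.println("after fixed pass: " + store.cache.size());  // 0
    }
}
```

Shortening MAX_INACTIVE_MS and creating many sessions, as suggested above, makes the buggy variant's growth visible quickly.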
Comment 20 Jitka Kozana 2013-08-09 01:05:02 EDT
We have run a test with the new patch (comment #18) and still got the OutOfMemoryError.

I will upload the heap dumps shortly.
Comment 21 Jitka Kozana 2013-08-09 02:24:29 EDT
Here are the heap dumps:

perf18: java_pid17558.hprof
perf19: java_pid26989.hprof
perf20: java_pid23944.hprof
perf21: java_pid8953.hprof
Comment 22 Paul Ferraro 2013-08-10 11:10:43 EDT
@Jitka Have you run this last test (which includes the 2 fixes) against both 6.1.0 and 6.1.1?  I just want to make sure the current state of this is still a behavioral regression.
Comment 23 Paul Ferraro 2013-08-10 12:22:05 EDT
I ask because I don't see any issue with expiration, as was suspected in comment 19.
Comment 25 Jitka Kozana 2013-08-12 01:26:50 EDT
The 6.1.0 run of this test scenario is still here: [1].

The 6.1.1 run with all the latest fixes is [2], as you have already found out. The OOM was seen approximately 70 minutes after the test started; see the snippet from the logs of run [2] below [3].

[1] 6.1.0: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/

[2] 6.1.1 + patch #3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/15/

[3] 10:28:37:765 EDT Starting the test.
11:36:58,416 java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid17558.hprof
Comment 29 Jitka Kozana 2013-08-19 03:59:15 EDT
Verified on EAP 6.1.1.ER6.
