Bug 993559

Summary: Regression: OutOfMemoryError, round 2: still looks like a leak somewhere
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: Ladislav Thon <lthon>
Component: Clustering
Assignee: Radoslav Husar <rhusar>
Status: CLOSED CURRENTRELEASE
QA Contact: Ladislav Thon <lthon>
Severity: urgent
Docs Contact: Russell Dickenson <rdickens>
Priority: urgent
Version: 6.1.1
CC: cdewolf, jkudrnac, lthon, myarboro, paul.ferraro, pslavice, rhusar
Target Milestone: ER5
Keywords: Regression
Target Release: EAP 6.1.1
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2013-09-16 20:22:50 UTC
Type: Bug
Attachments:
- jboss web with filled in missing endBatch()
- jboss web for qe use with fixes until 2013/8/8 15:50 CET

Description Ladislav Thon 2013-08-06 07:46:42 UTC
This is a continuation of bug 990092, as we've run the soak test several times with 6.1.1.ER4 + the fix from https://github.com/jbossas/jboss-eap/pull/265. See [1].

Memory consumption looks a lot better than with pure 6.1.1.ER4, but I believe there is still something wrong. Compare [2], which is the latest ER build from EAP 6.1.0 testing, with [3], which is 6.1.1.ER4 + fix under the same conditions (except that it runs for only half a day). In the run of [3], there is no OOM.

However, we did hit the OOM when we ran the full-length test (1 day) with 6.1.1.ER4 + fix and reduced memory (-Xmx768m). It's questionable whether this is representative, but I believe it is. Compare [4], which is 6.1.1.ER4 + fix with reduced memory (-Xmx768m) running for 1 day, with [5], which is EAP 6.1.0 running with reduced memory (-Xmx768m). [5] only runs for a few hours, but it clearly shows the trend.

Given that we don't have directly comparable data and have to extrapolate, I'm setting severity to "high" only, but I believe we should look into this. I will upload heap dumps from [4] and take a quick look at them soon.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web/28/artifact/report/graph-cluster-memory.png
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/7/artifact/report/graph-cluster-memory.png
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/6/artifact/report/graph-cluster-memory.png
[5] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png

Comment 1 Radoslav Husar 2013-08-06 08:05:13 UTC
@Ladislav could you please provide the heap snapshot (hprof) like last time?

Comment 2 Ladislav Thon 2013-08-06 08:34:46 UTC
Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/

Comment 3 Carlo de Wolf 2013-08-06 08:45:21 UTC
Forbidden

You don't have permission to access /~lthon/bz993559/java_pid13628.hprof on this server.

I don't have read permission on those files. (I do have read permission on the directory itself.)

Comment 4 Ladislav Thon 2013-08-06 08:49:39 UTC
I received feedback that the original description could be... clearer, so I'll try to rephrase.

With -Xmx2250m, memory consumption in EAP 6.1.0.ER8 [1] versus EAP 6.1.1.ER4 + fix [2]. Didn't hit OOM in [2].

With -Xmx768m, memory consumption in EAP 6.1.0.GA [3] versus EAP 6.1.1.ER4 + fix [4]. Did hit OOM in [4].

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web/28/artifact/report/graph-cluster-memory.png
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/7/artifact/report/graph-cluster-memory.png
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/6/artifact/report/graph-cluster-memory.png

Regarding comment 3, that was a stupid mistake on my side, fixed.

Comment 5 Radoslav Husar 2013-08-06 09:00:56 UTC
Can you fill in some metadata about the heaps? I.e., which server is which and where in the test were the snapshots taken? Thanks

Comment 6 Radoslav Husar 2013-08-06 09:12:09 UTC
I have analyzed the heaps and we are hitting the same problem as in BZ 990092: some batches are never ended. I am investigating the root cause now.
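
For illustration only (this is not the actual patch), a minimal sketch of the Infinispan batching pattern that a missing endBatch() breaks; the SessionStore wrapper, cache name and key/value types here are made up:

import org.infinispan.Cache;

public class SessionStore {

    // Assumes invocation batching is enabled on this cache.
    private final Cache<String, Object> cache;

    public SessionStore(Cache<String, Object> cache) {
        this.cache = cache;
    }

    public void storeSession(String id, Object data) {
        // startBatch() returns true only if it started a new batch on this thread
        boolean started = cache.startBatch();
        boolean success = false;
        try {
            cache.put(id, data);
            success = true;
        } finally {
            // If endBatch() is never reached, the batch stays open on the
            // worker thread and keeps its modifications reachable.
            if (started) {
                cache.endBatch(success);
            }
        }
    }
}

A batch that is started but never ended stays associated with the thread that opened it, which would match the steady growth seen in the heap dumps.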

Comment 7 Ladislav Thon 2013-08-06 09:34:24 UTC
Rado: heap dumps are taken automatically via -XX:+HeapDumpOnOutOfMemoryError. Which dump belongs to which server can be seen in the logs:

perf18: java_pid20159.hprof
perf19: java_pid27241.hprof
perf20: java_pid13628.hprof
perf21: java_pid16827.hprof
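
(Aside: if a dump is needed before a node actually hits the OOM, one can also be taken programmatically via the HotSpot diagnostic MXBean. A minimal sketch, assuming an Oracle/OpenJDK JVM; the HeapDumper class name is made up:)

import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public final class HeapDumper {

    // Writes an hprof snapshot of the current JVM to the given file path.
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable objects, which keeps the file smaller
        bean.dumpHeap(path, true);
    }
}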

Comment 9 Radoslav Husar 2013-08-06 12:09:21 UTC
Created attachment 783299 [details]
jboss web with filled in missing endBatch()

Ladislav, could you try with this patched Jar I built for you?

Comment 11 Jitka Kozana 2013-08-06 12:59:25 UTC
Rado,

I am double checking here: does your fix (the jar in comment #9) contain the first patch for BZ 990092?

Comment 12 Ladislav Thon 2013-08-06 14:02:48 UTC
We are doing another set of runs with a very restricted configuration to "quickly" reproduce the issue. The tests run for 2 hours and the servers are configured with -Xmx512m.

[1] is EAP 6.1.1.ER4 + the original fix, OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.

[2] is EAP 6.1.1.ER4 + the new fix. Still running, will see.

Then, we will run the same test with EAP 6.1.0 for comparison.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/8/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/11/

Comment 13 Ladislav Thon 2013-08-06 19:37:12 UTC
This is an update of comment 12.

We've done another set of runs with a very restricted configuration to "quickly" reproduce the issue and test the fixes. The tests run for 2 hours and the servers are configured with -Xmx512m. See graph-cluster-memory.png in all runs for memory consumption over time.

[1] is EAP 6.1.0. The memory is OK.

[2] is EAP 6.1.1.ER4 + the original fix (fix #1), OOM was present. Uploaded heap dumps to http://file.brq.redhat.com/~lthon/bz993559/run-no8/.

[3] is EAP 6.1.1.ER4 + the new fix (fix #2). OOM not present, memory is looking OK for 3 of 4 servers. For perf18, which is the cluster coordinator, the memory usage still looks bad.

To conclude: fix #2 (the one attached to this BZ) seems to fix most of the memory leaks, but not all of them. Will run another test with fix #2 to try to reproduce OOM. Expecting OOM to only happen on the cluster coordinator node.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/8/

[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/11/

Comment 14 Radoslav Husar 2013-08-06 19:46:15 UTC
Thanks Ladislav, I was just analyzing the results too. It seems as though the 2nd fix managed to mitigate the issue even more.

I am going to try more ideas now; meanwhile I will submit this fix as a PR.

Comment 15 Radoslav Husar 2013-08-06 19:52:06 UTC
PR: https://github.com/jbossas/jboss-eap/pull/280

@Jitka, yes, the jar contains all fixes to date.

Comment 16 Ladislav Thon 2013-08-07 06:43:43 UTC
So I ran the last test (6.1.1.ER4 + fix #2, -Xmx512m) again, now for 9 hours, see [1].

Memory usage on the cluster coordinator node (perf18) goes up right from the beginning, but after about 5 hours, memory usage on the other nodes goes up as well. Looking at the graph, this probably starts to happen when the coordinator OOMs. Not yet sure about the implications.

Uploaded heap dumps from this run to [2] for analysis.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/13/

[2] http://file.brq.redhat.com/~lthon/bz993559/run-no13/

Comment 17 Ladislav Thon 2013-08-07 06:46:37 UTC
Small addition to comment 16: I just figured it would probably be useful to be able to distinguish the perf18 heap dump from the others :-) Here's the mapping of heap dumps in http://file.brq.redhat.com/~lthon/bz993559/run-no13/ to cluster node names:

perf18: java_pid8497.hprof
perf19: java_pid24304.hprof
perf20: java_pid29232.hprof
perf21: java_pid5350.hprof

Comment 18 Radoslav Husar 2013-08-08 13:51:06 UTC
Created attachment 784400 [details]
jboss web for qe use with fixes until 2013/8/8 15:50 CET

Comment 19 Radoslav Husar 2013-08-08 20:55:39 UTC
I should add a note about the progress: the heap dumps show that PR 280 resolves the problem with open batches. 

However, it hints that there is a remaining problem with session expiration: expired sessions are not removed but rather kept in the cache.

QA note: the tests could probably be sped up to demonstrate the problem by shortening the session lifespan and increasing the number of sessions left for expiration and invalidation. Let me know if you need some help tweaking this.
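
For what it's worth, a minimal sketch of what such a tweak could look like on the application side (a hypothetical servlet, not the actual soak-test application) that creates a short-lived session on every request:

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class ShortLivedSessionServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Every hit creates a fresh session that expires after 60 seconds,
        // so sessions that expire but are not evicted from the cache pile up quickly.
        HttpSession session = req.getSession(true);
        session.setMaxInactiveInterval(60);
        session.setAttribute("payload", new byte[10 * 1024]);
        resp.getWriter().println(session.getId());
    }
}

With a 60-second lifespan and a constant stream of new sessions, a leak in the expiration path should show up in the heap within minutes rather than hours.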

Comment 20 Jitka Kozana 2013-08-09 05:05:02 UTC
We have run a test with the new patch (comment #18) and still got an OutOfMemoryError.

I will upload the heap dumps shortly.

Comment 21 Jitka Kozana 2013-08-09 06:24:29 UTC
Here are the heap dumps:
http://lacrosse.redhat.com/~jkudrnac/oom-run-15/

perf18: java_pid17558.hprof
perf19: java_pid26989.hprof
perf20: java_pid23944.hprof
perf21: java_pid8953.hprof

Comment 22 Paul Ferraro 2013-08-10 15:10:43 UTC
@Jitka Have you run this last test (which includes the 2 fixes) against both 6.1.0 and 6.1.1?  I just want to make sure the current state of this is still a behavioral regression.

Comment 23 Paul Ferraro 2013-08-10 16:22:05 UTC
I ask because I don't see any issue with expiration (the one suspected in #c19).

Comment 25 Jitka Kozana 2013-08-12 05:26:50 UTC
@Paul,
the 6.1.0 run of this test scenario is still here: [1].

The 6.1.1 run with all the latest fixes is [2], as you have already found out. The OOM was seen approximately 70 minutes after starting the test; see the snippet from the logs of run [2] below [3].


[1] 6.1.0: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/12/

[2] 6.1.1 + patch #3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/15/

[3] 10:28:37:765 EDT Starting the test.
(...)
11:36:58,416 java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid17558.hprof

Comment 29 Jitka Kozana 2013-08-19 07:59:15 UTC
Verified on EAP 6.1.1.ER6.