Bug 990092

Summary: Regression: OutOfMemory exception
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: Jitka Kozana <jkudrnac>
Component: Clustering
Assignee: Paul Ferraro <paul.ferraro>
Status: CLOSED CURRENTRELEASE
QA Contact: Jitka Kozana <jkudrnac>
Severity: urgent
Docs Contact: Russell Dickenson <rdickens>
Priority: urgent
Version: 6.1.1
CC: cdewolf, jkudrnac, lthon, myarboro, rhusar, rjanik
Target Milestone: ER5
Keywords: Regression
Target Release: EAP 6.1.1
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-09-16 20:21:16 UTC
Type: Bug
Attachments:
Description      Flags
Leak Suspects    none
Leak Suspects    none
Leak Suspects    none
Leak Suspects    none

Description Jitka Kozana 2013-07-30 11:17:40 UTC
While running a clustering soak test, we observed a java.lang.OutOfMemoryError: Java heap space exception.

We investigated the issue and confirmed that the problem is _not_ present in EAP 6.1.0. Therefore, this is a regression.

During the investigation, we created two (shorter) instances of the soak test: [1] was run with 6.1.0.GA, [2] was run with 6.1.1.ER3. With the same setup, run [2] crashed with an OutOfMemoryError, while run [1] did not.

We have created heap dumps and uploaded them to:
http://lacrosse.redhat.com/~jkudrnac/oom/

We have briefly examined the heap dumps. In them, a huge instance of org.infinispan.transaction.synchronization.SyncLocalTransaction accounts for much of the memory.

Also, please see the memory consumption graphs:
6.1.0.GA: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png

6.1.1.ER3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/artifact/report/graph-cluster-memory.png

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/

Comment 1 Ladislav Thon 2013-07-30 11:54:28 UTC
Created attachment 780594 [details]
Leak Suspects

Comment 2 Ladislav Thon 2013-07-30 11:54:55 UTC
Created attachment 780595 [details]
Leak Suspects

Comment 3 Ladislav Thon 2013-07-30 11:55:17 UTC
Created attachment 780596 [details]
Leak Suspects

Comment 4 Ladislav Thon 2013-07-30 11:55:37 UTC
Created attachment 780597 [details]
Leak Suspects

Comment 5 Ladislav Thon 2013-07-30 11:56:31 UTC
As already mentioned, we've briefly looked at the heap dumps with Eclipse Memory Analyzer. I've attached the "Leak Suspects" reports for all 4 servers.

Comment 6 Radoslav Husar 2013-07-30 13:04:36 UTC
At first glance, it seems to me that the real problem manifests as a leak of org.infinispan.commands.write.RemoveCommand: there are approximately 330,000 instances (35M), which account for most of the char[]s (160M).

I am still investigating the root cause.
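
For readers unfamiliar with how this kind of finding looks in a heap dump, here is a minimal sketch of the general pattern: one long-lived object keeps a growing list of command objects, and each command keeps a String key (hence the char[]s). All class and method names below (LeakSketch, FakeRemoveCommand, FakeLocalTransaction, simulate) are hypothetical stand-ins. This is not the Infinispan code and not a statement about the actual root cause, which is still under investigation at this point.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakSketch {

    /** Hypothetical stand-in for a write command; it retains its String key. */
    static final class FakeRemoveCommand {
        final String key;

        FakeRemoveCommand(String key) {
            this.key = key;
        }
    }

    /** Hypothetical stand-in for a transaction object that records modifications. */
    static final class FakeLocalTransaction {
        private final List<FakeRemoveCommand> modifications = new ArrayList<>();

        void record(FakeRemoveCommand command) {
            modifications.add(command);
        }

        /** Releases the recorded commands, as a finished transaction would be expected to. */
        void complete() {
            modifications.clear();
        }

        int retained() {
            return modifications.size();
        }
    }

    /**
     * Records one command per simulated request on a single transaction object
     * and returns how many commands that object still retains at the end.
     */
    static int simulate(int requests, boolean completeEachRequest) {
        FakeLocalTransaction tx = new FakeLocalTransaction();
        for (int i = 0; i < requests; i++) {
            tx.record(new FakeRemoveCommand("session-" + i));
            if (completeEachRequest) {
                tx.complete();
            }
        }
        return tx.retained();
    }

    public static void main(String[] args) {
        System.out.println("never completed:       " + simulate(330_000, false));
        System.out.println("completed per request: " + simulate(330_000, true));
    }
}
```

In the first case, every command stays reachable through the single transaction object, so a tool such as Eclipse Memory Analyzer would attribute the commands and their key strings to that one object's retained size.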

Comment 7 Paul Ferraro 2013-07-30 16:03:50 UTC
I've located the issue.  I'll have a PR ready soon.

Comment 8 Paul Ferraro 2013-07-30 17:56:39 UTC
https://github.com/jbossas/jboss-eap/pull/265

Comment 12 Ladislav Thon 2013-08-06 07:47:55 UTC
Rado and all: as requested, I have created a new bug to follow up on this: bug 993559.

Comment 13 Radoslav Husar 2013-08-06 09:15:42 UTC
Not sure whether we needed a new BZ 993559 for this (who asked you?). It seems like the same problem, but the current fix doesn't seem to cover all scenarios.

Comment 14 Carlo de Wolf 2013-08-06 10:41:36 UTC
We should not re-use BZs that are past MODIFIED. A code change has been made, regardless of whether we solve any more issues.

Comment 15 Radoslav Husar 2013-08-06 11:01:00 UTC
Sorry, but that's not correct according to BZ workflow. Please see what the states represent and how they are supposed to be used: 

https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_status

Specifically: "From here, a bug can fail testing, or need additional testing information from the engineer and be moved to ASSIGNED"

What you described might be beneficial, but at the same time it makes this BZ unverifiable until all the memory leaks described here are fixed. It also defeats the whole point of "reopening" an issue, which is a failed QE verification. Needless to say, it increases the management overhead of the BZ in general.

Comment 16 Radoslav Husar 2013-08-06 14:33:05 UTC
I have attached a patched JAR with the potential fix to BZ 993559.

Comment 17 Jitka Kozana 2013-08-19 07:59:32 UTC
Verified on EAP 6.1.1.ER6.