Bug 990092 - Regression: OutOfMemory exception
Status: CLOSED CURRENTRELEASE
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.1.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ER5
Target Release: EAP 6.1.1
Assigned To: Paul Ferraro
QA Contact: Jitka Kozana
Docs Contact: Russell Dickenson
Keywords: Regression
Reported: 2013-07-30 07:17 EDT by Jitka Kozana
Modified: 2013-09-16 16:21 EDT
Doc Type: Bug Fix
Last Closed: 2013-09-16 16:21:16 EDT
Type: Bug

Attachments
Leak Suspects (109.05 KB, application/zip), 2013-07-30 07:54 EDT, Ladislav Thon
Leak Suspects (109.11 KB, application/zip), 2013-07-30 07:54 EDT, Ladislav Thon
Leak Suspects (108.07 KB, application/zip), 2013-07-30 07:55 EDT, Ladislav Thon
Leak Suspects (99.15 KB, application/zip), 2013-07-30 07:55 EDT, Ladislav Thon

Description Jitka Kozana 2013-07-30 07:17:40 EDT
While running a clustering soak test, we observed a java.lang.OutOfMemoryError: Java heap space exception.

We investigated the issue and confirmed that the problem is _not_ present in EAP 6.1.0. Therefore, this is a regression.

During the investigation, we created two shorter instances of the soak test: [1] was run with 6.1.0.GA and [2] was run with 6.1.1.ER3. With the same setup, run [2] crashed with an OutOfMemoryError, while run [1] did not.

We created heap dumps and uploaded them to:
http://lacrosse.redhat.com/~jkudrnac/oom/
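
For anyone who wants to reproduce this, a heap dump like the ones above can also be captured from inside the JVM. The following is a minimal sketch, assuming a HotSpot-based JVM; the class name HeapDumper and the temporary output path are hypothetical and are not part of the soak test setup.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the soak test: writes an .hprof dump
// of the JVM it runs in, suitable for Eclipse Memory Analyzer.
public class HeapDumper {

    /**
     * Dumps the heap of the current JVM to the given file.
     * If liveOnly is true, only reachable objects are included.
     */
    public static void dump(Path file, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap does not overwrite an existing file, so remove it first.
        Files.deleteIfExists(file);
        bean.dumpHeap(file.toString(), liveOnly);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary directory is an acceptable place for the dump.
        Path out = Files.createTempDirectory("oom").resolve("heap.hprof");
        dump(out, true);
        System.out.println("Heap dump written to " + out
                + " (" + Files.size(out) + " bytes)");
    }
}
```

Alternatively, starting the server JVM with -XX:+HeapDumpOnOutOfMemoryError makes HotSpot write a dump automatically when the error occurs.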

A brief examination of the heap dumps shows that a single huge instance of org.infinispan.transaction.synchronization.SyncLocalTransaction accounts for much of the memory.

Also, please see the memory consumption graphs:
6.1.0.GA: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png

6.1.1.ER3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/artifact/report/graph-cluster-memory.png

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/
Comment 1 Ladislav Thon 2013-07-30 07:54:28 EDT
Created attachment 780594
Leak Suspects
Comment 2 Ladislav Thon 2013-07-30 07:54:55 EDT
Created attachment 780595
Leak Suspects
Comment 3 Ladislav Thon 2013-07-30 07:55:17 EDT
Created attachment 780596
Leak Suspects
Comment 4 Ladislav Thon 2013-07-30 07:55:37 EDT
Created attachment 780597
Leak Suspects
Comment 5 Ladislav Thon 2013-07-30 07:56:31 EDT
As already mentioned, we've briefly looked at the heap dumps with Eclipse Memory Analyzer. I've attached the "Leak Suspects" reports for all 4 servers.
Comment 6 Radoslav Husar 2013-07-30 09:04:36 EDT
At first glance, it seems to me that the real problem manifests as a leak of org.infinispan.commands.write.RemoveCommand: there are approximately 330,000 instances (35M), which account for most of the char[]s (160M).

I am still investigating the root cause.
Comment 7 Paul Ferraro 2013-07-30 12:03:50 EDT
I've located the issue.  I'll have a PR ready soon.
Comment 8 Paul Ferraro 2013-07-30 13:56:39 EDT
https://github.com/jbossas/jboss-eap/pull/265
Comment 12 Ladislav Thon 2013-08-06 03:47:55 EDT
Rado and all: as we've been asked, I created a new bug to follow up with this: bug 993559.
Comment 13 Radoslav Husar 2013-08-06 05:15:42 EDT
Not sure whether we needed a new BZ 993559 for this (who asked you?). It seems like the same problem, but the current fix doesn't seem to cover all scenarios.
Comment 14 Carlo de Wolf 2013-08-06 06:41:36 EDT
We should not re-use BZs that are past MODIFIED. A code change has been made, regardless of whether we solve any more issues.
Comment 15 Radoslav Husar 2013-08-06 07:01:00 EDT
Sorry, but that's not correct according to BZ workflow. Please see what the states represent and how they are supposed to be used: 

https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_status

Specifically: "From here, a bug can fail testing, or need additional testing information from the engineer and be moved to ASSIGNED"

What you described might be beneficial, but at the same time it makes this BZ unverifiable until all the memory leaks described here are fixed. It also defeats the whole point of "reopening" an issue, which is failing QE verification. Needless to say, it increases management overhead with the BZ in general.
Comment 16 Radoslav Husar 2013-08-06 10:33:05 EDT
I have attached a patched JAR with the potential fix to BZ 993559.
Comment 17 Jitka Kozana 2013-08-19 03:59:32 EDT
Verified on EAP 6.1.1.ER6.