Bug 990092

Summary:

Regression: OutOfMemory exception

Product:

[JBoss] JBoss Enterprise Application Platform 6

Reporter:

Jitka Kozana <jkudrnac>

Component:

Clustering

Assignee:

Paul Ferraro <paul.ferraro>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Jitka Kozana <jkudrnac>

Severity:

urgent

Docs Contact:

Russell Dickenson <rdickens>

Priority:

urgent

Version:

6.1.1

CC:

cdewolf, jkudrnac, lthon, myarboro, rhusar, rjanik

Target Milestone:

ER5

Keywords:

Regression

Target Release:

EAP 6.1.1

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

Bug Fix

Last Closed:

2013-09-16 20:21:16 UTC

Type:

Bug

Attachments:

Description	Flags
Leak Suspects	none
Leak Suspects	none
Leak Suspects	none
Leak Suspects	none

Description Jitka Kozana 2013-07-30 11:17:40 UTC

While running a clustering soak test, we observed a java.lang.OutOfMemoryError: Java heap space exception.

We investigated the issue and confirmed that the problem is _not_ present in EAP 6.1.0. Therefore, this is a regression.

During the investigation, we created two shorter instances of the soak test: [1] was run with 6.1.0.GA, [2] was run with 6.1.1.ER3. With the same setup, run [2] crashed with OutOfMemoryError, while run [1] did not.

We have created heap dumps and uploaded them to:
http://lacrosse.redhat.com/~jkudrnac/oom/
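
For anyone who wants to capture a comparable heap dump: the report does not say how the dumps above were produced, so the following is only a minimal sketch of one common approach on a HotSpot JVM, using the standard HotSpotDiagnosticMXBean. The class name HeapDumper and the temporary output path are illustrative assumptions, not part of the soak test setup.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; assumes a HotSpot-based JVM.
public class HeapDumper {

    /**
     * Writes a heap dump in .hprof format to the given file.
     * The target file must not exist yet, otherwise dumpHeap throws an IOException.
     * With liveOnly = true, only objects that are still reachable are dumped.
     */
    public static Path dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location: a fresh temporary directory, so the file cannot already exist.
        Path dir = Files.createTempDirectory("oom-investigation");
        Path file = dump(dir.resolve("heap.hprof"), true);
        System.out.println("Heap dump written to " + file
                + " (" + Files.size(file) + " bytes)");
    }
}
```

Alternatively, starting the JVM with -XX:+HeapDumpOnOutOfMemoryError makes HotSpot write a dump automatically when the error occurs. Either kind of .hprof file can be opened in Eclipse Memory Analyzer.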

We briefly examined the heap dumps; the main finding is that a single huge instance of org.infinispan.transaction.synchronization.SyncLocalTransaction accounts for much of the memory.

Also, please see the memory consumption graphs:
6.1.0.GA: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png

6.1.1.ER3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/artifact/report/graph-cluster-memory.png

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/

Comment 1 Ladislav Thon 2013-07-30 11:54:28 UTC

Created attachment 780594 [details]
Leak Suspects

Comment 2 Ladislav Thon 2013-07-30 11:54:55 UTC

Created attachment 780595 [details]
Leak Suspects

Comment 3 Ladislav Thon 2013-07-30 11:55:17 UTC

Created attachment 780596 [details]
Leak Suspects

Comment 4 Ladislav Thon 2013-07-30 11:55:37 UTC

Created attachment 780597 [details]
Leak Suspects

Comment 5 Ladislav Thon 2013-07-30 11:56:31 UTC

As already mentioned, we've briefly looked at the heap dumps with Eclipse Memory Analyzer. I've attached the "Leak Suspects" reports for all 4 servers.

Comment 6 Radoslav Husar 2013-07-30 13:04:36 UTC

At first glance, it seems to me that the real problem manifests as a leak of org.infinispan.commands.write.RemoveCommand objects: there are approximately 330,000 instances (35M), which account for most of the char[]s (160M).

I am still investigating the root cause.
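
To illustrate the general shape of such a symptom (many command objects kept reachable by one long-lived owner), here is a purely hypothetical sketch. It is not Infinispan code and it does not claim to reproduce the root cause, which is not identified at this point; the names RemoveOp, TxContext and simulate are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical classes, not the Infinispan implementation.
public class RetainedCommandsSketch {

    /** Hypothetical stand-in for a write command that holds a String key. */
    static final class RemoveOp {
        final String key;
        RemoveOp(String key) { this.key = key; }
    }

    /** Hypothetical long-lived context that records every modification it sees. */
    static final class TxContext {
        private final List<RemoveOp> modifications = new ArrayList<RemoveOp>();

        void record(RemoveOp op) { modifications.add(op); }

        /** Releases the recorded commands so they become garbage-collectable. */
        void complete() { modifications.clear(); }

        int retained() { return modifications.size(); }
    }

    /**
     * Records n removals in one context and returns how many commands
     * the context still references afterwards.
     */
    public static int simulate(int n, boolean completeAfterwards) {
        TxContext tx = new TxContext();
        for (int i = 0; i < n; i++) {
            tx.record(new RemoveOp("session-" + i));
        }
        if (completeAfterwards) {
            tx.complete();
        }
        return tx.retained();
    }

    public static void main(String[] args) {
        System.out.println("never completed: " + simulate(1000, false));
        System.out.println("completed:       " + simulate(1000, true));
    }
}
```

In a heap dump, the uncompleted case would show up as one large owner object dominating many small command instances together with the strings they hold.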

Comment 7 Paul Ferraro 2013-07-30 16:03:50 UTC

I've located the issue.  I'll have a PR ready soon.

Comment 8 Paul Ferraro 2013-07-30 17:56:39 UTC

https://github.com/jbossas/jboss-eap/pull/265

Comment 12 Ladislav Thon 2013-08-06 07:47:55 UTC

Rado and all: as we've been asked, I created a new bug to follow up with this: bug 993559.

Comment 13 Radoslav Husar 2013-08-06 09:15:42 UTC

Not sure whether we needed a new BZ 993559 for this (who asked you?). It seems like the same problem, but the current fix doesn't seem to cover all scenarios.

Comment 14 Carlo de Wolf 2013-08-06 10:41:36 UTC

We should not re-use BZs that are past MODIFIED. A code change has been made, regardless of whether we solve any more issues.

Comment 15 Radoslav Husar 2013-08-06 11:01:00 UTC

Sorry, but that's not correct according to BZ workflow. Please see what the states represent and how they are supposed to be used: 

https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_status

Specifically: "From here, a bug can fail testing, or need additional testing information from the engineer and be moved to ASSIGNED"

What you described might be beneficial but at the same time it makes this BZ unverifiable until all the memory leaks described here are fixed. Also you are denying the whole point of "reopening" an issue - failing QE verification. Needless to say it increases management overhead with the BZ in general.

Comment 16 Radoslav Husar 2013-08-06 14:33:05 UTC

I have attached a patched Jar on BZ 993559 with the potential fix.

Comment 17 Jitka Kozana 2013-08-19 07:59:32 UTC

Verified on EAP 6.1.1.ER6.