990092 – Regression: OutOfMemory exception

Summary: Regression: OutOfMemory exception

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	Clustering
Sub Component:
Version:	6.1.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	ER5
Target Release:	EAP 6.1.1
Assignee:	Paul Ferraro
QA Contact:	Jitka Kozana
Docs Contact:	Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-07-30 11:17 UTC by Jitka Kozana
Modified:	2013-09-16 20:21 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-09-16 20:21:16 UTC
Type:	Bug
Embargoed:

Attachments
Leak Suspects (109.05 KB, application/zip) 2013-07-30 11:54 UTC, Ladislav Thon	no flags
Leak Suspects (109.11 KB, application/zip) 2013-07-30 11:54 UTC, Ladislav Thon	no flags
Leak Suspects (108.07 KB, application/zip) 2013-07-30 11:55 UTC, Ladislav Thon	no flags
Leak Suspects (99.15 KB, application/zip) 2013-07-30 11:55 UTC, Ladislav Thon	no flags

Description Jitka Kozana 2013-07-30 11:17:40 UTC

While running a clustering soak test, we observed a java.lang.OutOfMemoryError: Java heap space exception.

We have investigated the issue and confirmed that the problem is _not_ present in EAP 6.1.0. Therefore, this is a regression.

During the investigation, we created two (shorter) instances of the soak test: [1] was run with 6.1.0.GA and [2] was run with 6.1.1.ER3. With the same setup, run [2] crashed with an OutOfMemoryError, while run [1] did not crash.

We have created heap dumps and uploaded them to:
http://lacrosse.redhat.com/~jkudrnac/oom/

We have briefly examined the heap dumps; the main consumer is a huge instance of org.infinispan.transaction.synchronization.SyncLocalTransaction, which takes up much of the memory.
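For anyone reproducing the investigation, a heap dump suitable for Eclipse Memory Analyzer can also be captured from inside a running JVM. The following is a minimal sketch, assuming a HotSpot-based JVM; the class name, the temporary directory prefix and the dump name "node1" are hypothetical and are not taken from the soak test setup. It is not necessarily how the dumps linked above were produced.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes an .hprof heap dump of the current JVM.
public class HeapDumpSketch {

    // Dumps the heap into dir/<name>.hprof and returns the resulting path.
    // The target file must not already exist, otherwise dumpHeap fails.
    public static Path dumpHeap(Path dir, String name) throws IOException {
        Path target = dir.resolve(name + ".hprof");
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true dumps only objects that are still reachable,
        // which is what matters when looking for leak suspects.
        bean.dumpHeap(target.toString(), true);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("oom-investigation");
        Path dump = dumpHeap(dir, "node1");
        System.out.println("Heap dump written to " + dump
                + " (" + Files.size(dump) + " bytes)");
    }
}
```

Starting the JVM with -XX:+HeapDumpOnOutOfMemoryError is an alternative that produces a dump automatically at the moment the error is thrown.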

Also, please see the memory consumption graphs:
6.1.0.GA: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png

6.1.1.ER3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/artifact/report/graph-cluster-memory.png
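Memory consumption of this kind can be approximated by periodically sampling used heap through the standard java.lang.management API. The sketch below is illustrative only: the class and method names are hypothetical, and it does not claim to be how the graphs above were generated.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sampler: records used heap of the current JVM over time.
public class HeapUsageSampler {

    // Returns the number of heap bytes currently in use.
    public static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // Takes `count` samples, sleeping `intervalMillis` between them.
    public static List<Long> sample(int count, long intervalMillis)
            throws InterruptedException {
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            samples.add(usedHeapBytes());
            if (i < count - 1) {
                Thread.sleep(intervalMillis);
            }
        }
        return samples;
    }

    // Difference between the last and the first sample. A value that keeps
    // growing across long runs (and across full GCs) hints at a leak.
    public static long growthBytes(List<Long> samples) {
        return samples.get(samples.size() - 1) - samples.get(0);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Long> samples = sample(5, 1000);
        System.out.println("Samples (bytes): " + samples);
        System.out.println("Growth (bytes): " + growthBytes(samples));
    }
}
```

A single short window says little, because used heap drops after each garbage collection; comparing the trend over a long run, as the two soak test graphs do, is what distinguishes a leak from normal allocation.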

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/

[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/

Comment 1 Ladislav Thon 2013-07-30 11:54:28 UTC

Created attachment 780594 [details]
Leak Suspects

Comment 2 Ladislav Thon 2013-07-30 11:54:55 UTC

Created attachment 780595 [details]
Leak Suspects

Comment 3 Ladislav Thon 2013-07-30 11:55:17 UTC

Created attachment 780596 [details]
Leak Suspects

Comment 4 Ladislav Thon 2013-07-30 11:55:37 UTC

Created attachment 780597 [details]
Leak Suspects

Comment 5 Ladislav Thon 2013-07-30 11:56:31 UTC

As already mentioned, we've briefly looked at the heap dumps with Eclipse Memory Analyzer. I've attached the "Leak Suspects" reports for all 4 servers.

Comment 6 Radoslav Husar 2013-07-30 13:04:36 UTC

At first glance, it seems to me that the real problem manifests as an org.infinispan.commands.write.RemoveCommand leak: there are approximately 330,000 instances (35M), accounting for most of the char[]s (160M).

I am still investigating the root cause.

Comment 7 Paul Ferraro 2013-07-30 16:03:50 UTC

I've located the issue.  I'll have a PR ready soon.

Comment 8 Paul Ferraro 2013-07-30 17:56:39 UTC

https://github.com/jbossas/jboss-eap/pull/265

Comment 12 Ladislav Thon 2013-08-06 07:47:55 UTC

Rado and all: as we've been asked, I created a new bug to follow up with this: bug 993559.

Comment 13 Radoslav Husar 2013-08-06 09:15:42 UTC

Not sure whether we needed a new BZ 993559 for this (who asked you?). It seems like the same problem, but the current fix doesn't seem to cover all scenarios.

Comment 14 Carlo de Wolf 2013-08-06 10:41:36 UTC

We should not re-use BZs that are past MODIFIED. A code change has been made, regardless of whether we solve any more issues.

Comment 15 Radoslav Husar 2013-08-06 11:01:00 UTC

Sorry, but that's not correct according to BZ workflow. Please see what the states represent and how they are supposed to be used: 

https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_status

Specifically: "From here, a bug can fail testing, or need additional testing information from the engineer and be moved to ASSIGNED"

What you described might be beneficial, but at the same time it makes this BZ unverifiable until all the memory leaks described here are fixed. It also defeats the whole point of "reopening" an issue, which is failing QE verification. Needless to say, it increases management overhead for the BZ in general.

Comment 16 Radoslav Husar 2013-08-06 14:33:05 UTC

I have attached a patched JAR with a potential fix to BZ 993559.

Comment 17 Jitka Kozana 2013-08-19 07:59:32 UTC

Verified on EAP 6.1.1.ER6.
