While running a clustering soak test, we observed a java.lang.OutOfMemoryError: Java heap space exception. We investigated the issue and confirmed that the problem is _not_ present in EAP 6.1.0. Therefore, this is a regression.

During the investigation, we created two (shorter) instances of the soak test: [1] was run with 6.1.0.GA, [2] was run with 6.1.1.ER3. With the same setup, run [2] crashed with an OutOfMemoryError; run [1] did not crash.

We created heap dumps and uploaded them to:
http://lacrosse.redhat.com/~jkudrnac/oom/

A brief examination of the heap dumps shows that a huge instance of org.infinispan.transaction.synchronization.SyncLocalTransaction takes up much of the memory.

Also, please see the memory consumption graphs:
6.1.0.GA: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/artifact/report/graph-cluster-memory.png
6.1.1.ER3: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/artifact/report/graph-cluster-memory.png

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/3/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-clustering-soak-http-repl-async-web-oom-investigation/2/
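For anyone who wants to reproduce a memory-over-time comparison locally, here is a minimal sketch of sampling heap usage from inside a JVM with the standard java.lang.management API. The class name HeapSampler and the CSV line format are assumptions made for illustration; this is not how the soak test framework produced the graphs above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of EAP or the soak test framework.
public class HeapSampler {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Returns the number of heap bytes currently in use. */
    public static long usedHeapBytes() {
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Returns used/max heap, or -1.0 if the JVM reports no maximum. */
    public static double usedHeapFraction() {
        MemoryUsage usage = MEMORY.getHeapMemoryUsage();
        long max = usage.getMax();
        return max > 0 ? (double) usage.getUsed() / max : -1.0;
    }

    /** Formats one sample as "epochMillis,usedBytes" (assumed CSV layout). */
    public static String sampleLine() {
        return System.currentTimeMillis() + "," + usedHeapBytes();
    }
}
```

Calling HeapSampler.sampleLine() at a fixed interval and plotting the result should make a steadily climbing heap easy to tell apart from a normal sawtooth pattern. Starting the JVM with -XX:+HeapDumpOnOutOfMemoryError additionally writes a heap dump when the error occurs.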
Created attachment 780594 [details] Leak Suspects
Created attachment 780595 [details] Leak Suspects
Created attachment 780596 [details] Leak Suspects
Created attachment 780597 [details] Leak Suspects
As already mentioned, we've briefly looked at the heap dumps with Eclipse Memory Analyzer. I've attached the "Leak Suspects" reports for all 4 servers.
At first glance, it seems to me that the real problem manifests as a leak of org.infinispan.commands.write.RemoveCommand: there are approximately 330,000 instances (35M), accounting for most of the char[]s (160M). I am still investigating the root cause.
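The instance counts above came from Eclipse Memory Analyzer. As a quicker cross-check on a live server, a class histogram from `jmap -histo <pid>` can be ranked by retained bytes. Below is a minimal sketch of such a parser, assuming the usual histogram layout of "num: #instances #bytes class name". The HistogramTop class and its Row type are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for ranking rows of a `jmap -histo` class histogram.
public class HistogramTop {

    /** One histogram row: class name, instance count, total shallow bytes. */
    public static final class Row {
        public final String className;
        public final long instances;
        public final long bytes;

        Row(String className, long instances, long bytes) {
            this.className = className;
            this.instances = instances;
            this.bytes = bytes;
        }
    }

    /** Parses histogram lines and returns the rows sorted by bytes, largest first. */
    public static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String[] t = line.trim().split("\\s+");
            // Data rows start with a rank such as "1:"; skip header, separator and total lines.
            if (t.length < 4 || !t[0].endsWith(":")) {
                continue;
            }
            try {
                rows.add(new Row(t[3], Long.parseLong(t[1]), Long.parseLong(t[2])));
            } catch (NumberFormatException e) {
                // Not a data row after all; ignore it.
            }
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.bytes).reversed());
        return rows;
    }
}
```

Comparing the top rows between a 6.1.0.GA run and a 6.1.1.ER3 run at the same point in the test should show whether a class such as RemoveCommand keeps growing in only one of them.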
I've located the issue. I'll have a PR ready soon.
https://github.com/jbossas/jboss-eap/pull/265
Rado and all: as we were asked, I created a new bug to follow up on this: bug 993559.
Not sure whether we needed a new BZ 993559 for this (who asked you?). It seems like the same problem, but the current fix doesn't seem to cover all scenarios.
We should not re-use BZs that are past MODIFIED. A code change has been made, regardless of whether we solve any more issues.
Sorry, but that's not correct according to the BZ workflow. Please see what the states represent and how they are supposed to be used: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_status

Specifically: "From here, a bug can fail testing, or need additional testing information from the engineer and be moved to ASSIGNED"

What you described might be beneficial, but at the same time it makes this BZ unverifiable until all the memory leaks described here are fixed. It also defeats the whole point of "reopening" an issue, which is to record a failed QE verification. Needless to say, it increases the management overhead of the BZ in general.
I have attached a patched JAR with the potential fix to BZ 993559.
Verified on EAP 6.1.1.ER6.