Bug 745875 (EDG-113)

Summary: Large volumes of garbage when adding a single 1 MB value
Product: [JBoss] JBoss Data Grid 5
Reporter: Richard Achmatowicz <rachmato>
Component: Infinispan
Assignee: Default User <jbpapp-maint>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Version: unspecified
CC: galder.zamarreno, rachmato
Target Release: EAP 5.1.0 EDG TP
Hardware: Unspecified
OS: Unspecified
URL: http://jira.jboss.org/jira/browse/EDG-113
Doc Type: Bug Fix
Type: Bug
Last Closed: 2011-05-03 13:53:40 UTC
Attachments (flags: none):
    onemeg-heapchange.gif
    onemeg-referencedheapobjects.gif
    onemeg-alloccalltree.gif
    onemeg-allochotspot.gif
    onemegvalue-hr-referencedheapobjects-fix.gif
    onemegvalue-hr-recordedobjects-fix.gif
    onemegvalue-hr-allochotspot-fix.gif
    onemegvalue-hr-alloccalltree-fix.gif

Description Richard Achmatowicz 2011-03-30 23:07:34 UTC
project_key: EDG

When a key with a 1 MB value is stored into the grid, a very large amount of garbage is generated - on the order of 30 MB. 

This garbage eventually becomes unreferenced, but it is still a surprisingly large amount: it inflates the heap and increases the need for garbage collection. Some of this memory may be allocated unnecessarily, and even a partial reduction in allocation would probably improve performance.

Steps to reproduce:
1. Start a single node EDG cluster, instrumented for jProfiler
2. Remotely attach the jProfiler GUI to the node, with CPU and memory allocation recording turned off 
3. Once the GUI is attached, start memory allocation recording and CPU recording
4. Use a HotRod or memcached client to put a single key with a 1 MB value to the node (see the client sketch below)
5. Turn off recording

The Recorded Objects memory view shows that roughly 30 MB of data has been allocated (and not solely the first time through, either). The same result can be seen by using jmap -heap to view the overall heap stats.
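
For reference, a minimal sketch of the client side of step 4, assuming the HotRod Java client API (the server address, key name and exact client version are illustrative; the constructors in the 4.2-era client may differ slightly):

    import org.infinispan.client.hotrod.RemoteCache;
    import org.infinispan.client.hotrod.RemoteCacheManager;
    import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

    public class OneMegPut {
        public static void main(String[] args) {
            ConfigurationBuilder cfg = new ConfigurationBuilder();
            cfg.addServer().host("127.0.0.1").port(11222);    // default HotRod port

            RemoteCacheManager rcm = new RemoteCacheManager(cfg.build());
            try {
                RemoteCache<String, byte[]> cache = rcm.getCache();
                byte[] oneMeg = new byte[1024 * 1024];        // the 1 MB value from step 4
                cache.put("key-1", oneMeg);                   // single put, then stop recording
            } finally {
                rcm.stop();
            }
        }
    }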

Comment 1 Richard Achmatowicz 2011-03-30 23:09:36 UTC
Attaching jProfiler screen dumps

Comment 2 Richard Achmatowicz 2011-03-30 23:09:36 UTC
Attachment: Added: onemeg-heapchange.gif
Attachment: Added: onemeg-referencedheapobjects.gif
Attachment: Added: onemeg-alloccalltree.gif
Attachment: Added: onemeg-allochotspot.gif


Comment 3 Richard Achmatowicz 2011-03-30 23:14:55 UTC
The screen dumps taken during the exercise:

1. onemeg-heapchange.gif shows the recorded objects allocated during the put operation. The total allocated memory is 33 MB.
2. onemeg-referencedheapobjects.gif shows the allocated objects after a garbage collection - most of the garbage has been collected and we can see the 1 MB byte[] array.
3. onemeg-alloccalltree.gif shows the methods doing the allocation of the recorded objects.
4. onemeg-allochotspot.gif shows the stack traces of the methods doing the byte[] allocations.



Comment 4 Richard Achmatowicz 2011-03-30 23:17:05 UTC
32 MB of overhead seems a bit over the top for handling 1 MB of data. Could this be reduced?

This would help the results of the data stress tests, whose heaps are under serious pressure. 

Comment 5 Galder Zamarreño 2011-04-07 11:52:14 UTC
Richard, I'm investigating this...

Comment 6 Galder Zamarreño 2011-04-07 14:06:21 UTC
Richard, this is really interesting. It would appear that, in order to read the 1 MB key, it ends up allocating a new 1 MB byte[] 53 times. This is easily reproducible on my machine; a similar test, when profiled, shows 5 allocations of a 1 MB byte[]. I'll get back to you when I know more about this...

Comment 7 Galder Zamarreño 2011-04-07 14:15:41 UTC
Right, I think I know what the problem is. The Netty decoder, as long as it has not received all the data, will carry on re-reading the header. So, I need to put some markers in to make sure that once the key has been read, it is not read again on every pass, since that is what generates all those garbage byte[]. The same needs to happen with the value, which could potentially be big as well. I'll create a JIRA for this in Infinispan.
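
For illustration only (not the actual Infinispan code): a minimal sketch of the marker idea, assuming Netty's ReplayingDecoder API; the state names and wire layout here are hypothetical. Once a checkpoint is recorded after the key has been read, later decode passes resume from that state instead of re-reading the header and the key:

    import java.util.List;

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.ReplayingDecoder;

    // Hypothetical states and payload layout; the real decoder has more of both.
    enum DecodeState { READ_HEADER, READ_KEY, READ_VALUE }

    public class CheckpointedDecoder extends ReplayingDecoder<DecodeState> {

        private byte[] key;    // kept across decode() retries once fully read
        private byte[] value;

        public CheckpointedDecoder() {
            super(DecodeState.READ_HEADER);
        }

        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
            switch (state()) {
                case READ_HEADER:
                    in.readUnsignedByte();             // e.g. the operation code
                    checkpoint(DecodeState.READ_KEY);  // header will not be re-read again
                    // fall through
                case READ_KEY:
                    key = new byte[in.readInt()];
                    in.readBytes(key);                 // the large byte[] is read once
                    checkpoint(DecodeState.READ_VALUE);
                    // fall through
                case READ_VALUE:
                    value = new byte[in.readInt()];
                    in.readBytes(value);
                    out.add(new Object[] { key, value }); // placeholder for the decoded operation
                    checkpoint(DecodeState.READ_HEADER);
                    key = null;
                    value = null;
                    break;
            }
        }
    }

Without the checkpoints, every partial read restarts decode() from the header, allocating the large byte[] again on each pass, which is consistent with the tens of MB of garbage seen in the profiler snapshots.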

Comment 8 Galder Zamarreño 2011-04-07 14:16:10 UTC
Btw Richard, great catch! Fixing this is very important.

Comment 9 Galder Zamarreño 2011-04-07 14:20:47 UTC
Link: Added: This issue depends ISPN-1035


Comment 10 Richard Achmatowicz 2011-04-11 10:24:55 UTC
OK, that's great. The data stress tests showed really poor throughput (time in app / (time in app + GC)), which is why I tried to find out where all the garbage was coming from. Hope this fixes the problems with those tests.


Comment 11 Galder Zamarreño 2011-04-11 14:13:08 UTC
Richard, I've verified it and it will fix it :) - I'm tidying up the code now; it required a fair amount of restructuring in the decoding.

Comment 12 Richard Achmatowicz 2011-04-11 14:15:17 UTC
Wonderful. Let me know when I can grab a snapshot containing the changes and I'll give it a whirl.


Comment 13 Galder Zamarreño 2011-04-12 07:11:19 UTC
Hey Richard, I've just uploaded a snapshot taken from my 4.2.x branch of the ISPN-1035 issue. The id of the snapshot is: infinispan-4.2.1-20110412.064620-20

Give it a go and let's see if it makes a difference :)

Comment 14 Richard Achmatowicz 2011-04-12 12:33:43 UTC
Hi Galder

We rebuilt EDG using the snapshot here: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-build-alt/14
The console log shows that the snapshot is getting picked up:
     [java] [INFO] snapshot org.infinispan:infinispan-server-memcached:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-group
     [java] [INFO] snapshot org.infinispan:infinispan-server-memcached:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-devel-group
     [java] Downloading: https://repository.jboss.org/nexus/content/groups/developer//org/infinispan/infinispan-server-memcached/4.2.1-SNAPSHOT/infinispan-server-memcached-4.2.1-20110412.064620-22.pom
     [java] 3K downloaded
     [java] [INFO] snapshot org.infinispan:infinispan-server-parent:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-group
     [java] [INFO] snapshot org.infinispan:infinispan-server-parent:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-devel-group
     [java] Downloading: https://repository.jboss.org/nexus/content/groups/developer//org/infinispan/infinispan-server-parent/4.2.1-SNAPSHOT/infinispan-server-parent-4.2.1-20110412.064620-25.pom
     [java] 1K downloaded
     [java] [INFO] snapshot org.infinispan:infinispan-parent:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-group
     [java] [INFO] snapshot org.infinispan:infinispan-parent:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-devel-group
     [java] Downloading: https://repository.jboss.org/nexus/content/groups/developer//org/infinispan/infinispan-parent/4.2.1-SNAPSHOT/infinispan-parent-4.2.1-20110412.064620-30.pom
     [java] 19K downloaded
     [java] [INFO] snapshot org.infinispan:infinispan-server-core:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-group
     [java] [INFO] snapshot org.infinispan:infinispan-server-core:4.2.1-SNAPSHOT: checking for updates from jboss-public-repository-devel-group
     [java] Downloading: https://repository.jboss.org/nexus/content/groups/developer//org/infinispan/infinispan-server-core/4.2.1-SNAPSHOT/infinispan-server-core-4.2.1-20110412.064620-25.pom
     [java] 2K downloaded

and the zip file has been placed here:
/qa/home/mlinhard/edg-ispn-4.2.2-SNAPSHOT-20110412.064620.zip


However, when I rerun my 1 MB addition test using the new zip, I still see 31 MB of data being allocated in 2073 instances of byte[].
I also ran a comparative test using one of the data stress tests and saw no differences.

So it looks as though your fix is not getting picked up somehow.
Can you check to see if the new zip tallies with the fix you applied? 



Comment 15 Galder Zamarreño 2011-04-12 15:13:29 UTC
Just talked to Richard online and want to clarify something. In the profiler snapshots he attached, AbstractProtocolDecoder.readKeyParams() is generating over 50 MB from reading the key multiple times. With the fix, AbstractProtocolDecoder.readKey() only generates 1 MB, which is the right thing: the key should only be read once. This should be an easy way to make sure the snapshot contains at least a fix for that issue.

If even with the fix there's still too much byte[] being generated, it needs further investigation.

Comment 16 Galder Zamarreño 2011-04-12 15:17:39 UTC
To be precise, on my local machine AbstractProtocolDecoder.decode() is generating 1.8 MB of data, with 1 MB coming from reading the key and 0.5 MB from CacheDelegate.put(). This compares to Richard's original snapshot showing 57 MB being generated by AbstractProtocolDecoder.decode().

Comment 17 Galder Zamarreño 2011-04-12 16:26:36 UTC
Ok, I've inspected Richard's profiler snapshots. Checkpointing partially solves the problem, but even when reading the key or value we need to do it more efficiently and avoid creating so many useless byte[]. I'm checking with Trustin to find a more efficient way of reading a ranged byte[] (maybe buffering?).
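
One possible shape for that buffering, as a sketch under assumptions rather than the eventual fix: allocate the destination byte[] once when its length is known and fill it incrementally across decode passes, so partially-read bytes are never thrown away. This assumes a decoder that keeps its own partial-read state (for example Netty's ByteToMessageDecoder) instead of ReplayingDecoder, whose replay behaviour discards partial reads; the class and method names are hypothetical:

    import io.netty.buffer.ByteBuf;

    // Hypothetical helper: a decoder-side buffer filled incrementally, so the
    // large byte[] is allocated once and never re-read from scratch.
    final class RangedByteReader {
        private final byte[] dest;  // allocated once, when the declared length is known
        private int filled;         // bytes copied so far

        RangedByteReader(int length) {
            this.dest = new byte[length];
        }

        // Copies whatever has arrived; returns true once 'dest' is complete.
        // Safe to call again on the next decode() pass when more bytes arrive.
        boolean fill(ByteBuf in) {
            int chunk = Math.min(in.readableBytes(), dest.length - filled);
            in.readBytes(dest, filled, chunk);
            filled += chunk;
            return filled == dest.length;
        }

        byte[] result() {
            return dest;
        }
    }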

Comment 18 Richard Achmatowicz 2011-04-14 17:02:05 UTC
The first snapshot which Galder passed on to me reduced the garbage from about 30 MB to 26 MB. 
He then provided a second snapshot which reduced the garbage generated from 30 MB down to 6 MB. Very nice improvement, Galder!

Here are some jProfiler screenshots to look at, resulting from putting one key/value pair with a 1 MB value:

1. onemegvalue-hr-recordedobjects-fix.gif shows the recorded objects allocated during the put operation. The total allocated memory is about 6 MB.
2. onemegvalue-hr-referencedheapobjects-fix.gif shows the allocated objects after a garbage collection - most of the garbage has been collected and we can see the 1 MB byte[] array.
3. onemegvalue-hr-alloccalltree-fix.gif shows the methods doing the allocation of the recorded objects.
4. onemegvalue-hr-allochotspot-fix.gif shows the stack traces of the methods doing the byte[] allocations.

Comment 19 Richard Achmatowicz 2011-04-14 17:03:23 UTC
Attachment: Added: onemegvalue-hr-referencedheapobjects-fix.gif
Attachment: Added: onemegvalue-hr-recordedobjects-fix.gif
Attachment: Added: onemegvalue-hr-allochotspot-fix.gif
Attachment: Added: onemegvalue-hr-alloccalltree-fix.gif


Comment 20 Richard Achmatowicz 2011-04-14 17:04:30 UTC
Going to try using this build in the heap stress tests.

Comment 21 Galder Zamarreño 2011-04-18 07:28:35 UTC
Thx Richard! The screenshots look fine now. I'm gonna resolve the linked ISPN jira now so that it can be incorporated in 5.0.0.BETA2 which should be released later today.

Comment 22 Galder Zamarreño 2011-04-29 15:24:00 UTC
Link: Added: This issue relates to JBPAPP-6411


Comment 23 Galder Zamarreño 2011-05-03 09:48:17 UTC
I consider this to be closed since the associated ISPN JIRA is fixed.

Comment 24 Richard Achmatowicz 2011-05-03 13:53:40 UTC
Was verified as complete in an earlier comment