Bug 1198556

Summary: Optimize PutAll operations
Product: JBoss Data Grid 6
Component: Infinispan
Version: 6.5.0
Target Milestone: ER2
Target Release: 6.5.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Type: Enhancement
Doc Type: Bug Fix
Reporter: Pedro Zapata <pzapataf>
Assignee: William Burns <wburns>
QA Contact: Martin Gencur <mgencur>
CC: dmehra, dstahl, jdg-bugs, ttarrant, vjuranek, wburns
Clones: 1213367 (view as bug list)
Last Closed: 2015-06-23 12:24:57 UTC
Bug Blocks: 1186882, 1213367

Description Pedro Zapata 2015-03-04 11:29:23 UTC
We have an internal need to improve the performance of GetAll and PutAll for existing JDG features.

1) GetAll: ensure that there is a way to fetch multiple objects together in a single batch, both in Library mode and over Hot Rod.

2) PutAll: ensure that customers can load massive amounts of data into the grid quickly, both in Library mode and over Hot Rod.

Additionally, we have requests for this performance improvement from customers/prospects.
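
For illustration, a minimal sketch of both bulk operations over Hot Rod (assuming a server on localhost:11222 and the default cache; names here are illustrative, not taken from this report):

  import java.util.HashMap;
  import java.util.Map;

  import org.infinispan.client.hotrod.RemoteCache;
  import org.infinispan.client.hotrod.RemoteCacheManager;
  import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

  public class BulkOpsExample {
      public static void main(String[] args) {
          RemoteCacheManager rcm = new RemoteCacheManager(
                  new ConfigurationBuilder().addServer().host("127.0.0.1").port(11222).build());
          RemoteCache<String, String> cache = rcm.getCache();

          // PutAll: one bulk write instead of N individual put() calls
          Map<String, String> batch = new HashMap<>();
          for (int i = 0; i < 1000; i++) {
              batch.put("key-" + i, "value-" + i);
          }
          cache.putAll(batch);

          // GetAll: fetch a set of keys together in one batch
          Map<String, String> fetched = cache.getAll(batch.keySet());
          System.out.println("Fetched " + fetched.size() + " entries");

          rcm.stop();
      }
  }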

Comment 1 JBoss JIRA Server 2015-03-26 18:42:24 UTC
William Burns <wburns> updated the status of jira ISPN-5264 to Coding In Progress

Comment 2 JBoss JIRA Server 2015-03-31 14:52:27 UTC
William Burns <wburns> updated the status of jira ISPN-5265 to Coding In Progress

Comment 3 JBoss JIRA Server 2015-03-31 15:45:44 UTC
William Burns <wburns> updated the status of jira ISPN-5265 to Open

Comment 4 JBoss JIRA Server 2015-03-31 15:46:59 UTC
William Burns <wburns> updated the status of jira ISPN-5266 to Coding In Progress

Comment 5 JBoss JIRA Server 2015-04-07 09:34:08 UTC
Tristan Tarrant <ttarrant> updated the status of jira ISPN-5263 to Closed

Comment 6 JBoss JIRA Server 2015-04-09 14:21:55 UTC
William Burns <wburns> updated the status of jira ISPN-5264 to Reopened

Comment 7 William Burns 2015-04-15 17:24:43 UTC
PR for putAll https://github.com/infinispan/jdg/pull/613

Comment 8 William Burns 2015-04-20 11:38:39 UTC
This issue is to handle the putAll improvements for both embedded and remote caches.
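
For the embedded side, a minimal Library-mode sketch of the same bulk write (the setup is illustrative, not taken from the fix):

  import java.util.HashMap;
  import java.util.Map;

  import org.infinispan.Cache;
  import org.infinispan.manager.DefaultCacheManager;

  public class EmbeddedPutAllExample {
      public static void main(String[] args) {
          // Library mode: the cache runs in the same JVM as the application
          DefaultCacheManager cm = new DefaultCacheManager();
          Cache<String, byte[]> cache = cm.getCache();

          Map<String, byte[]> batch = new HashMap<>();
          for (int i = 0; i < 1000; i++) {
              batch.put("key-" + i, new byte[1004]);
          }
          cache.putAll(batch); // single bulk operation

          cm.stop();
      }
  }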

Comment 9 JBoss JIRA Server 2015-04-23 11:47:19 UTC
William Burns <wburns> updated the status of jira ISPN-5265 to Coding In Progress

Comment 10 Martin Gencur 2015-05-04 07:54:17 UTC
Here are some measurements:
Performance comparison of the putAll() operation between ER1 and ER2: it looks like it is fast when data are loaded on a single node. I have tried the following configurations (a sketch of the test-data shape follows the list):
  a) HR client - 500 MB of data, single node: ER2 is 3-4x faster than ER1. Each entry is 1024 bytes: 20 bytes for the key and the rest for the value.
  b) HR client - 200 MB of data, cluster of 2 nodes: NO PERFORMANCE IMPROVEMENT FOUND
    ON ER1 [Data loading took 0 hours-0 minutes-58 seconds-325 milliseconds]
    ON ER2 [Data loading took 0 hours-0 minutes-58 seconds-197 milliseconds]
  c) HR client - 500 MB, heap size increased to 3 GB: with ER2 the test fails with an OOM, so NO PERFORMANCE IMPROVEMENT
    ON ER1 [Data loading took 0 hours-1 minutes-55 seconds-678 milliseconds]
    ON ER2 OOM occurs
         ERROR [org.jgroups.protocols.UNICAST3] (OOB-21,shared=udp) JGRP000039: JDG2/clustered: failed to deliver OOB message [dst: JDG2/clustered, src: JDG1/clustered (3 headers), size=60000 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER]: java.lang.OutOfMemoryError: Java heap space
         ERROR [org.jgroups.protocols.UNICAST3] (OOB-16,shared=udp) JGRP000039: JDG1/clustered: failed to deliver OOB message [dst: JDG1/clustered, src: JDG2/clustered (3 headers), size=60000 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER]: java.lang.OutOfMemoryError: Java heap space
  d) Library mode with EAP, 200 MB: NO PERFORMANCE IMPROVEMENT
    ON ER1 [Data loading took 0 hours-0 minutes-0 seconds-869 milliseconds]
    ON ER2 [Data loading took 0 hours-0 minutes-0 seconds-897 milliseconds]
  e) Library mode with EAP, 500 MB: NO PERFORMANCE IMPROVEMENT
    ON ER1 [Data loading took 0 hours-0 minutes-2 seconds-930 milliseconds]
    ON ER2 [Data loading took 0 hours-0 minutes-2 seconds-853 milliseconds]
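
A minimal sketch of how test data of this shape could be generated and timed (my reconstruction, not the actual test harness):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Random;
  import java.util.concurrent.ConcurrentMap;

  public class PutAllTimer {
      // 1024-byte entries: a 20-character ASCII key and a 1004-byte value.
      // Works against both embedded Cache and RemoteCache, since both
      // implement ConcurrentMap.
      static void loadAndTime(ConcurrentMap<String, byte[]> cache, long totalBytes) {
          Map<String, byte[]> data = new HashMap<>();
          Random rnd = new Random(42);
          long entries = totalBytes / 1024;
          for (long i = 0; i < entries; i++) {
              String key = String.format("key-%016d", i); // "key-" + 16 digits = 20 chars
              byte[] value = new byte[1004];               // the remaining 1004 bytes
              rnd.nextBytes(value);
              data.put(key, value);
          }
          long start = System.nanoTime();
          cache.putAll(data); // one bulk operation for the whole data set
          System.out.printf("Data loading took %d ms%n",
                  (System.nanoTime() - start) / 1_000_000);
      }
  }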

Response from wburns:

If the Library mode test uses only 2 nodes, that is to be expected. The embedded changes should only improve performance for DIST caches when you have more nodes than numOwners.

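For context, numOwners is the number of cluster members that own a copy of each key. A minimal sketch of a DIST cache definition in embedded mode (illustrative, not taken from this report):

  import org.infinispan.configuration.cache.CacheMode;
  import org.infinispan.configuration.cache.ConfigurationBuilder;
  import org.infinispan.configuration.global.GlobalConfigurationBuilder;
  import org.infinispan.manager.DefaultCacheManager;

  public class DistConfigExample {
      public static void main(String[] args) {
          // With numOwners=2 on a 2-node cluster, every node owns every key,
          // so a putAll cannot be partitioned across owners; the embedded
          // improvement only kicks in once nodes > numOwners.
          DefaultCacheManager cm = new DefaultCacheManager(
                  GlobalConfigurationBuilder.defaultClusteredBuilder().build());
          cm.defineConfiguration("dist", new ConfigurationBuilder()
                  .clustering().cacheMode(CacheMode.DIST_SYNC)
                  .hash().numOwners(2)
                  .build());
          cm.getCache("dist"); // starts the distributed cache
      }
  }
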
The remote result is odd; it should be substantially faster in pretty much all cases once you have a reasonable number of entries (more than about 10).

The extra memory usage is, unfortunately, to be expected. ER1 sent each entry sequentially, so the per-message overhead was smaller, whereas ER2 has to hold the entire map in a single message. So 3 GB for both servers and the client, with 500 MB being inserted, is not going to be enough.

I will dig a little to see why it wasn't performing properly in the 200 MB range, though.

Martin:
We'll probably need to run a few more tests with the number of nodes > numOwners.

Comment 12 Vojtech Juranek 2015-05-11 08:55:04 UTC
Hi Will,
in order to proceed with verification of this BZ, could you please specify more precisely what the expected outcome of this issue is, ideally in terms of the circumstances under which performance should increase, and a rough estimate of the improvement you want to achieve?
Thanks
Vojta

Comment 17 William Burns 2015-05-12 12:02:06 UTC
(In reply to Martin Gencur from comment #10)
> [...]
> The extra memory usage is, unfortunately, to be expected. ER1 sent each
> entry sequentially, so the per-message overhead was smaller, whereas ER2
> has to hold the entire map in a single message. So 3 GB for both servers
> and the client, with 500 MB being inserted, is not going to be enough.

To further clarify the 3 GB note I made above: the issue was that both servers and the client were run in the same JVM, so in that case 3 GB for all of them combined is not enough. Normally these would be in separate JVMs, and a much smaller heap could be used. Just a bit more than double the target putAll map size should be sufficient (this allows for the map itself plus the message containing the fully serialized map).
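
(For illustration, by that rule of thumb: loading a 500 MB map would call for a bit over 1 GB of free heap in the client JVM, 500 MB for the map itself plus roughly the same again for its serialized form in the message, on top of normal JVM overhead.)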