Created attachment 750689 [details]
Log files from each node in the cluster

Description of problem:

I am running tests with MapReduce using RadarGun. I have a stage that takes a file path and a value size as parameters, then uses all active cluster nodes to write values to the cache.

Using a 100MB file and Infinispan 5.2.4.Final-redhat-1/JDG 6.1 GA, this operation takes under 45 seconds on every node:

12:23:13,259 DEBUG [org.radargun.Master] (main) Starting 'LoadFileStage' on 8 slave nodes. Details: LoadFile {bucket=null, exitBenchmarkOnSlaveFailure=false, filePath=/home/hudson/tmp/afield/william-shakespeare-100MB.txt, runOnAllSlaves=false, slaves=null, useSmartClassLoading=true, valueSize=1024 }
12:23:57,818 INFO [org.radargun.stages.LoadFileStage] (main) Received responses from all 8 slaves. Durations [0:31.83 seconds, 1:31.96 seconds, 2:31.9 seconds, 3:32.06 seconds, 4:44.53 seconds, 5:44.37 seconds, 6:44.5 seconds, 7:31.83 seconds]
12:23:57,819 INFO [org.radargun.stages.LoadFileStage] (main) --------------------
12:23:57,820 INFO [org.radargun.stages.LoadFileStage] (main) Size of file '/home/hudson/tmp/afield/william-shakespeare-100MB.txt' is 100623474 bytes
12:23:57,820 INFO [org.radargun.stages.LoadFileStage] (main) Value size is '1024' which will produce 98266 keys
12:23:57,821 INFO [org.radargun.stages.LoadFileStage] (main) Slave 0 wrote 12284 values to the cache with a total size of 12578816 bytes
12:23:57,821 INFO [org.radargun.stages.LoadFileStage] (main) Slave 1 wrote 12284 values to the cache with a total size of 12577906 bytes
12:23:57,821 INFO [org.radargun.stages.LoadFileStage] (main) Slave 2 wrote 12283 values to the cache with a total size of 12577792 bytes
12:23:57,822 INFO [org.radargun.stages.LoadFileStage] (main) Slave 3 wrote 12283 values to the cache with a total size of 12577792 bytes
12:23:57,822 INFO [org.radargun.stages.LoadFileStage] (main) Slave 4 wrote 12283 values to the cache with a total size of 12577792 bytes
12:23:57,822 INFO [org.radargun.stages.LoadFileStage] (main) Slave 5 wrote 12283 values to the cache with a total size of 12577792 bytes
12:23:57,823 INFO [org.radargun.stages.LoadFileStage] (main) Slave 6 wrote 12283 values to the cache with a total size of 12577792 bytes
12:23:57,823 INFO [org.radargun.stages.LoadFileStage] (main) Slave 7 wrote 12283 values to the cache with a total size of 12577792 bytes
12:23:57,823 INFO [org.radargun.stages.LoadFileStage] (main) --------------------

Using the same file and nodes with Infinispan 5.3.0.Beta2, the same operation takes about 22 minutes on every node:

12:24:48,332 DEBUG [org.radargun.Master] (main) Starting 'LoadFileStage' on 8 slave nodes. Details: LoadFile {bucket=null, exitBenchmarkOnSlaveFailure=false, filePath=/home/hudson/tmp/afield/william-shakespeare-100MB.txt, runOnAllSlaves=false, slaves=null, useSmartClassLoading=true, valueSize=1024 }
12:46:35,276 INFO [org.radargun.stages.LoadFileStage] (main) Received responses from all 8 slaves. Durations [0:21.73 minutes, 1:21.76 minutes, 2:21.78 minutes, 3:21.73 minutes, 4:21.73 minutes, 5:21.73 minutes, 6:21.78 minutes, 7:21.67 minutes]
12:46:35,278 INFO [org.radargun.stages.LoadFileStage] (main) --------------------
12:46:35,278 INFO [org.radargun.stages.LoadFileStage] (main) Size of file '/home/hudson/tmp/afield/william-shakespeare-100MB.txt' is 100623474 bytes
12:46:35,278 INFO [org.radargun.stages.LoadFileStage] (main) Value size is '1024' which will produce 98266 keys
12:46:35,279 INFO [org.radargun.stages.LoadFileStage] (main) Slave 0 wrote 12284 values to the cache with a total size of 12578816 bytes
12:46:35,279 INFO [org.radargun.stages.LoadFileStage] (main) Slave 1 wrote 12284 values to the cache with a total size of 12577906 bytes
12:46:35,279 INFO [org.radargun.stages.LoadFileStage] (main) Slave 2 wrote 12283 values to the cache with a total size of 12577792 bytes
12:46:35,279 INFO [org.radargun.stages.LoadFileStage] (main) Slave 3 wrote 12283 values to the cache with a total size of 12577792 bytes
12:46:35,280 INFO [org.radargun.stages.LoadFileStage] (main) Slave 4 wrote 12283 values to the cache with a total size of 12577792 bytes
12:46:35,280 INFO [org.radargun.stages.LoadFileStage] (main) Slave 5 wrote 12283 values to the cache with a total size of 12577792 bytes
12:46:35,280 INFO [org.radargun.stages.LoadFileStage] (main) Slave 6 wrote 12283 values to the cache with a total size of 12577792 bytes
12:46:35,281 INFO [org.radargun.stages.LoadFileStage] (main) Slave 7 wrote 12283 values to the cache with a total size of 12577792 bytes
12:46:35,281 INFO [org.radargun.stages.LoadFileStage] (main) --------------------

The attached logs show the duration of each PUT operation in the cache. On node 0, the average duration with Infinispan 5.2 is 2.35 milliseconds; with Infinispan 5.3.0.Beta2 it is 106.05 milliseconds, roughly 45x slower.

Infinispan 5.2 is using this configuration file:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/jdg-60/ispn-configs/infinispan52/dist-udp-no-tx.xml

Infinispan 5.3 is using this configuration file:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/jdg-60/ispn-configs/infinispan53/dist-udp-no-tx.xml

Both versions are using this JGroups configuration file:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/jdg-60/jgroups/jgroups-udp.xml

Version-Release number of selected component (if applicable):
5.2.4.Final-redhat-1 and 5.3.0.Beta2

How reproducible:
100%

Steps to Reproduce:
1. Start an 8-node cluster with Infinispan 5.3.0.Beta2 using the dist-udp-no-tx.xml and jgroups-udp.xml configurations linked above.
2. Run the RadarGun LoadFileStage with the 100MB file and valueSize=1024.
3. Repeat with Infinispan 5.2.4.Final-redhat-1 and compare the per-node durations.

Actual results:
Each node takes roughly 21-22 minutes to write its share of the file on 5.3.0.Beta2 (average PUT duration on node 0: 106.05 ms).

Expected results:
Each node finishes in well under a minute, as on 5.2.4.Final-redhat-1 (average PUT duration on node 0: 2.35 ms).

Additional info:
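The key counts in the logs follow directly from the file size and value size:

  ceil(100623474 bytes / 1024 bytes per value) = 98266 keys
  98266 keys / 8 slaves = 12283 keys each, remainder 2
  => slaves 0 and 1 write 12284 values; slaves 2-7 write 12283

This matches the per-slave counts printed by the stage in both runs.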
I just want to update this report with the correct interpretation of the durations in the log file:

12:23:57,818 INFO [org.radargun.stages.LoadFileStage] (main) Received responses from all 8 slaves. Durations [0:31.83 seconds, 1:31.96 seconds, 2:31.9 seconds, 3:32.06 seconds, 4:44.53 seconds, 5:44.37 seconds, 6:44.5 seconds, 7:31.83 seconds]

The number before the ":" is the node index, and is not part of the duration. The duration on node 0 was 31.83 seconds, the duration on node 1 was 31.96 seconds, and so on.
This is the source for the LoadFileStage as well:
https://github.com/radargun/radargun/blob/master/framework/src/main/java/org/radargun/stages/LoadFileStage.java
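For readers who don't want to chase the link, here is a minimal sketch of what the stage does. The class and method names are my own, and a plain Map stands in for the clustered Infinispan cache; the real implementation is at the URL above. The interleaved chunk layout and the "<slaveIndex>-<byteOffset>" key format are consistent with keys such as "7-7168" (slave 7's chunk starting at byte offset 7168 = 7 * 1024) visible in the write logs below:

import java.io.RandomAccessFile;
import java.util.Map;

public class LoadFileSketch {
    // Each slave writes every slaveCount-th chunk of the file:
    // slave i covers byte offsets i*valueSize, (i+slaveCount)*valueSize, ...
    static void loadFile(Map<String, byte[]> cache, String filePath,
                         int valueSize, int slaveIndex, int slaveCount)
            throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(filePath, "r")) {
            long fileSize = file.length();
            for (long offset = (long) slaveIndex * valueSize; offset < fileSize;
                 offset += (long) slaveCount * valueSize) {
                // The final chunk of the file may be shorter than valueSize.
                int len = (int) Math.min(valueSize, fileSize - offset);
                byte[] value = new byte[len];
                file.seek(offset);
                file.readFully(value);
                // Keys are "<slaveIndex>-<byteOffset>", e.g. "7-7168".
                cache.put(slaveIndex + "-" + offset, value);
            }
        }
    }
}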
This regression was caused by a stale value in the JGroups configuration file. The previous version of the file had UDP.bundler_type="old"; in the new version of the file, the value is UDP.bundler_type="new".

Here are the results with the new configuration file in place:

17:42:51,749 DEBUG [org.radargun.Master] (main) Starting 'LoadFileStage' on 8 slave nodes. Details: LoadFile {bucket=null, exitBenchmarkOnSlaveFailure=false, filePath=/qa/services/hudson/static_build_env/jdg/data/william-shakespeare-100MB.txt, printWriteStatistics=false, runOnAllSlaves=false, slaves=null, useSmartClassLoading=true, valueSize=1024 }
17:42:51,759 INFO [org.radargun.Slave] (pool-1-thread-1) Executing stage: LoadFile {bucket=null, exitBenchmarkOnSlaveFailure=false, filePath=/qa/services/hudson/static_build_env/jdg/data/william-shakespeare-100MB.txt, printWriteStatistics=false, runOnAllSlaves=false, slaves=null, useSmartClassLoading=true, valueSize=1024 }
17:42:51,761 INFO [org.radargun.stages.LoadFileStage] (pool-1-thread-1) Writing 1024 bytes to cache key: 7-7168 at position 8192
17:43:09,626 INFO [org.radargun.stages.LoadFileStage] (pool-1-thread-1) Writing 1024 bytes to cache key: 7-40967168 at position 40968192
17:43:17,487 INFO [org.radargun.stages.LoadFileStage] (pool-1-thread-1) Writing 1024 bytes to cache key: 7-81927168 at position 81928192
17:43:20,856 INFO [org.radargun.Slave] (pool-1-thread-1) Finished stage: LoadFile {bucket=null, exitBenchmarkOnSlaveFailure=false, filePath=/qa/services/hudson/static_build_env/jdg/data/william-shakespeare-100MB.txt, printWriteStatistics=false, runOnAllSlaves=false, slaves=null, useSmartClassLoading=true, valueSize=1024 }
17:43:20,857 INFO [org.radargun.Slave] (main) Ack successfully sent to the master
17:43:20,985 INFO [org.radargun.stages.LoadFileStage] (main) Received responses from all 8 slaves. Durations [0 = 29.16 seconds, 1 = 29.1 seconds, 2 = 29.2 seconds, 3 = 29.09 seconds, 4 = 29.13 seconds, 5 = 29.09 seconds, 6 = 29.22 seconds, 7 = 29.09 seconds]
17:43:20,987 INFO [org.radargun.stages.LoadFileStage] (main) --------------------
17:43:20,987 INFO [org.radargun.stages.LoadFileStage] (main) Size of file '/qa/services/hudson/static_build_env/jdg/data/william-shakespeare-100MB.txt' is 100623474 bytes
17:43:20,987 INFO [org.radargun.stages.LoadFileStage] (main) Value size is '1024' which will produce 98266 keys
17:43:20,987 INFO [org.radargun.stages.LoadFileStage] (main) Slave 0 wrote 12284 values to the cache with a total size of 12578816 bytes
17:43:20,988 INFO [org.radargun.stages.LoadFileStage] (main) Slave 1 wrote 12284 values to the cache with a total size of 12577906 bytes
17:43:20,988 INFO [org.radargun.stages.LoadFileStage] (main) Slave 2 wrote 12283 values to the cache with a total size of 12577792 bytes
17:43:20,988 INFO [org.radargun.stages.LoadFileStage] (main) Slave 3 wrote 12283 values to the cache with a total size of 12577792 bytes
17:43:20,988 INFO [org.radargun.stages.LoadFileStage] (main) Slave 4 wrote 12283 values to the cache with a total size of 12577792 bytes
17:43:20,989 INFO [org.radargun.stages.LoadFileStage] (main) Slave 5 wrote 12283 values to the cache with a total size of 12577792 bytes
17:43:20,989 INFO [org.radargun.stages.LoadFileStage] (main) Slave 6 wrote 12283 values to the cache with a total size of 12577792 bytes
17:43:20,989 INFO [org.radargun.stages.LoadFileStage] (main) Slave 7 wrote 12283 values to the cache with a total size of 12577792 bytes
17:43:20,989 INFO [org.radargun.stages.LoadFileStage] (main) --------------------

These values are consistent with JDG 6.1. Maybe JGroups 3.3.x should print a warning message when UDP.bundler_type="old" is being used?
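For anyone hitting the same regression, the fix amounts to one attribute on the UDP element of the JGroups configuration. Illustration only, with all other UDP attributes omitted; the full file is at the URL above:

<!-- stale value carried over from the older configuration (slow) -->
<UDP bundler_type="old" />

<!-- corrected value (fast) -->
<UDP bundler_type="new" />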
Pedro Ruivo <pedroruivo2> made a comment on jira ISPN-3126

In any case, I've sent an email to ispn-dev warning about the use of the new bundler.
Mircea Markus <mmarkus> made a comment on jira ISPN-3126

Can you please close it once you have confirmed that this is not a problem?
The performance degradation is not a bug, but JGroups should warn that the old bundler is enabled. I'll enter a new JIRA for that.
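A minimal sketch of what such a warning could look like. This is hypothetical, not actual JGroups code: the method and parameter names are my own, and java.util.logging stands in for JGroups' own logger:

import java.util.logging.Logger;

public class BundlerCheck {
    private static final Logger log =
            Logger.getLogger(BundlerCheck.class.getName());

    // Hypothetical: JGroups could run a check like this when the
    // transport initializes and bundler_type was set explicitly.
    static void warnIfOldBundler(String bundlerType) {
        if ("old".equals(bundlerType)) {
            log.warning("bundler_type=\"old\" is deprecated and much slower "
                    + "than the new bundler; remove the attribute or set it "
                    + "to \"new\"");
        }
    }
}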
Martin Gencur <mgencur> updated the status of jira ISPN-3126 to Resolved
Martin Gencur <mgencur> made a comment on jira ISPN-3126

This is not a bug. The performance regression was caused by incorrect JGroups settings.