Description of problem:

The glusterfs-hadoop plugin's default block size is 32 MB, while the HDFS default block size is 64 MB. This matters because the default block size is used to calculate the input split size in the default case:

    number of input splits = total file size / default block size

The glusterfs-hadoop plugin uses org.apache.hadoop.fs.FileSystem.getDefaultBlockSize():

    public long getDefaultBlockSize() {
        // default to 32MB: large enough to minimize the impact of seeks
        return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
    }

which results in a default block size of 32 MB. HDFS, by contrast, uses the dfs.block.size property set in hdfs-default.xml:

    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>
      <description>The default block size for new files.</description>
    </property>

To illustrate the problem, consider a 100 GB wordcount benchmark run with the following command:

    bin/hadoop jar hadoop-examples-*.jar wordcount bigdata.txt out-bigdata

where bigdata.txt is a 100 GB text file.

HDFS/Hadoop spawns 1504 map tasks, one for each input split of approximately 67108864 bytes:

    Kind    Total Tasks (successful+failed+killed)  Successful  Failed  Killed  Start Time            Finish Time
    Setup   1                                       1           0       0       19-Sep-2013 16:27:39  19-Sep-2013 16:27:40 (1sec)
    Map     1504                                    1504        0       0       19-Sep-2013 16:27:41  19-Sep-2013 16:33:25 (5mins, 44sec)
    Reduce  24                                      24          0       0       19-Sep-2013 16:27:58  19-Sep-2013 16:35:36 (7mins, 38sec)

RHS/Hadoop spawns 3008 map tasks, one for each input split of approximately 33,554,432 bytes:

    Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
    map     10.65%      3533       3134     32       367       0       0 / 0
    reduce  3.32%       1          0        1        0         0       0 / 0

Consequently, RHS/Hadoop spawns 2X the number of map tasks that HDFS/Hadoop spawns by default. Spawning these extra map task processes reduces RHS/Hadoop performance.
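The 2X difference follows directly from the split arithmetic above. A minimal sketch (assuming an input file of exactly 100 GiB; the real bigdata.txt is only approximately 100 GB, which is why the benchmark shows 1504 and 3008 tasks rather than the exact 1600 and 3200 computed here):

```java
public class SplitCount {
    // In the default case, one map task is spawned per input split:
    // number of splits = ceil(fileSize / blockSize)
    static long splits(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long fileSize = 100L * 1024 * 1024 * 1024;   // exactly 100 GiB (assumption)
        long hdfsDefault = 64L * 1024 * 1024;        // HDFS dfs.block.size default
        long glusterDefault = 32L * 1024 * 1024;     // glusterfs-hadoop default

        System.out.println("64 MB splits: " + splits(fileSize, hdfsDefault));    // 1600
        System.out.println("32 MB splits: " + splits(fileSize, glusterDefault)); // 3200, i.e. 2X
    }
}
```

Halving the block size exactly doubles the map task count, matching the roughly 2X ratio observed in the benchmark.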
Bigger input splits are better in most cases; hence the default block size should be larger than 32 MB.
Updated the default config to the larger size.
Patched here: https://github.com/gluster/glusterfs-hadoop/pull/84
This problem has reappeared in the High Touch Beta ISO and was reported by Mike Clark on April 16, 2014. The patch (a one-line fix that changes the default block size constant from 32 MB to 64 MB) did not make it into the High Touch Beta release. This should be fixed permanently, so that we don't waste time explaining the workaround.
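Until a patched build ships, the 32 MB default can be overridden per cluster, because getDefaultBlockSize() consults the fs.local.block.size property before falling back to its hard-coded constant. A hedged sketch of that workaround as a config fragment (the property name comes from the code above; placing it in core-site.xml is an assumption about the deployment):

```xml
<!-- core-site.xml: override the glusterfs-hadoop 32 MB default block size -->
<property>
  <name>fs.local.block.size</name>
  <value>67108864</value>
  <description>Set the default block size to 64 MB to match HDFS.</description>
</property>
```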
This solution is no longer available from Red Hat.