Bug 1010407

Summary: Glusterfs-hadoop plugin default block size is 32MB; HDFS default block size is 64MB. The default block size is used to calculate input split size in the default case, so Hadoop on RHS spawns 2X the map tasks by default, which is too much overhead.
Product: [Community] GlusterFS
Component: gluster-hadoop
Version: mainline
Status: CLOSED EOL
Severity: high
Priority: medium
Reporter: Diane Feddema <dfeddema>
Assignee: Bradley Childs <bchilds>
QA Contact: Martin Bukatovic <mbukatov>
CC: bchilds, bugs, chrisw, dahorak, eboyd, hcfs-gluster-bugs, matt, mbukatov, mkudlej, poelstra, rhs-bugs, swatt, vbellur
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-02-01 16:14:51 UTC
Bug Blocks: 1057253

Description Diane Feddema 2013-09-20 17:34:30 UTC
Description of problem:
The glusterfs-hadoop plugin's default block size is 32MB; the HDFS default
block size is 64MB. The default block size is used to calculate the input
split size in the default case:
number of input splits = total file size / default block size

The glusterfs-hadoop plugin uses org.apache.hadoop.fs.FileSystem.getDefaultBlockSize():

public long getDefaultBlockSize() {
    // default to 32MB: large enough to minimize the impact of seeks
    return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}

which results in a default block size of 32MB. HDFS, by contrast, uses the
default dfs.block.size property set in hdfs/hdfs-default.xml:

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>The default block size for new files.</description>
</property>
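
Until the default is changed, a deployment can raise the effective block size by setting fs.local.block.size, the key that the getDefaultBlockSize() method quoted above reads. The snippet below is only an illustrative sketch (the class name is made up, and in practice the property would be set in the cluster's core-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeOverride {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // raise the default from 32MB to 64MB, matching the HDFS dfs.block.size default
        conf.setLong("fs.local.block.size", 64L * 1024 * 1024);
        // the plugin's getDefaultBlockSize() would now see 67108864 instead of 33554432
        System.out.println(conf.getLong("fs.local.block.size", 32L * 1024 * 1024));
    }
}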

This is important because the default block size is used to calculate the
input split size in the default case.
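
Roughly, the split size is derived from the block size as in the sketch below (a paraphrase of Hadoop's FileInputFormat behavior, not code from the plugin; with the default minimum/maximum split-size settings the block size wins, so a smaller default block size directly means more and smaller splits):

public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSplit, long maxSplit) {
        // with the default settings below, this simply returns blockSize
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    public static void main(String[] args) {
        long minSplit = 1L;              // minimum split size, defaults to 1 byte
        long maxSplit = Long.MAX_VALUE;  // maximum split size, effectively unbounded by default
        System.out.println(computeSplitSize(32L * 1024 * 1024, minSplit, maxSplit)); // 33554432
        System.out.println(computeSplitSize(64L * 1024 * 1024, minSplit, maxSplit)); // 67108864
    }
}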

To illustrate the problem, consider the following Hadoop use case:

Run a 100GB wordcount benchmark using the following command:

bin/hadoop jar hadoop-examples-*.jar wordcount bigdata.txt out-bigdata

where bigdata.txt is a 100GB text file. 

HDFS will spawn 1504 map tasks, one for each input split of size 67108864 bytes (approximate split size)

Kind	Total Tasks(successful+failed+killed)	Successful tasks	Failed tasks	Killed tasks	Start Time	Finish Time
Setup 	1 	1 	0 	0 	19-Sep-2013 16:27:39 	19-Sep-2013 16:27:40 (1sec)
Map 	1504 	1504 	0 	0 	19-Sep-2013 16:27:41 	19-Sep-2013 16:33:25 (5mins, 44sec)
Reduce 	24 	24 	0 	0 	19-Sep-2013 16:27:58 	19-Sep-2013 16:35:36 (7mins, 38sec)

RHS/Hadoop will spawn 3008 map tasks, one for each input split of size 33,554,432 bytes (approximate split size)

Kind	% Complete	Num Tasks	Pending	Running	Complete	Killed	Failed/Killed Task Attempts
map	10.65%	3533	3134	32	367	0	0 / 0
reduce	3.32%	1	0	1	0	0	0 / 0


Consequently, RHS/Hadoop spawns 2X the number of map tasks that HDFS/Hadoop spawns
by default. Spawning the extra map task processes reduces RHS/Hadoop performance,
and bigger input splits are better in most cases. Hence the default block size
should be larger than 32MB.
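
As a rough sanity check (assuming exactly 100 GiB of input; the observed 1504 and 3008 task counts differ slightly because the real input size and split boundaries are only approximate), the split counts at the two defaults work out as follows:

public class SplitCountEstimate {
    static long splits(long fileBytes, long blockBytes) {
        // one map task per input split; split size defaults to the block size
        return (fileBytes + blockBytes - 1) / blockBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long fileBytes = 100L * 1024 * 1024 * 1024;                // ~100GB input
        System.out.println(splits(fileBytes, 64L * 1024 * 1024));  // 1600 map tasks
        System.out.println(splits(fileBytes, 32L * 1024 * 1024));  // 3200 map tasks, 2X as many
    }
}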

Comment 3 Bradley Childs 2013-12-19 16:06:28 UTC
Updated default config to the larger size.

Comment 6 Bradley Childs 2014-03-13 18:19:35 UTC
Patched here:

https://github.com/gluster/glusterfs-hadoop/pull/84

Comment 7 Diane Feddema 2014-04-25 18:20:55 UTC
This problem has reappeared in the High Touch Beta ISO and was reported by Mike Clark on April 16, 2014. The patch (a one-line fix) did not make it into the High Touch Beta release; it changes a constant from 32MB to 64MB. This should be fixed permanently so that we don't waste time explaining the workaround.
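
For reference, a one-line change of that sort would presumably look like the following against the getDefaultBlockSize() method quoted in the description (a sketch only, not the actual contents of the pull request above):

public long getDefaultBlockSize() {
    // default to 64MB to match the HDFS dfs.block.size default
    return getConf().getLong("fs.local.block.size", 64 * 1024 * 1024);
}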

Comment 9 Steve Watt 2016-02-01 16:14:51 UTC
This solution is no longer available from Red Hat.