Bug 1010407 - Glusterfs-hadoop plugin default block size is 32MB; HDFS default block size is 64MB. Default block size used to calculate input split size in the default case. Hence Hadoop on RHS has 2X map tasks by default. Too much overhead.
Summary: Glusterfs-hadoop plugin default block size is 32MB; HDFS default block size ...
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: gluster-hadoop
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Bradley Childs
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks: 1057253
 
Reported: 2013-09-20 17:34 UTC by Diane Feddema
Modified: 2016-02-01 16:14 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-01 16:14:51 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments

Description Diane Feddema 2013-09-20 17:34:30 UTC
Description of problem:
Glusterfs-hadoop plugin default block size = 32MB;  
HDFS default block size = 64MB.  
The default block size is used to calculate the input split size
in the default case: number of input splits = total file size / default block size.

The Glusterfs-hadoop plugin uses org.apache.hadoop.fs.FileSystem.getDefaultBlockSize():

public long getDefaultBlockSize() {
    // default to 32MB: large enough to minimize the impact of seeks
    return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}

which results in a default block size of 32MB. HDFS, by contrast, uses the
dfs.block.size property set in hdfs/hdfs-default.xml:

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>The default block size for new files.</description>
</property>

This is important because, in the default case, the input split size
is derived from the filesystem's default block size.
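
As a rough sanity check (a sketch added for illustration, not part of the original report): with default settings FileInputFormat's split size effectively equals the filesystem's default block size, so the expected number of map tasks is approximately the input size divided by the block size. A minimal Java sketch of that arithmetic:

public class SplitCountSketch {
    // Approximate map-task count when the split size falls back to the
    // filesystem's default block size (the default-settings case).
    static long splits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long file = 100L * 1024 * 1024 * 1024;                // 100GB input
        System.out.println(splits(file, 32L * 1024 * 1024));  // 3200 splits at the 32MB default
        System.out.println(splits(file, 64L * 1024 * 1024));  // 1600 splits at the HDFS 64MB default
    }
}

The measured task counts in the runs below (1504 vs roughly 3008) are in the same ballpark; the exact figures depend on the actual byte size of bigdata.txt.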

To illustrate the problem, consider the following Hadoop use case:

Run a 100GB wordcount benchmark using the following command:

bin/hadoop jar hadoop-examples-*.jar wordcount bigdata.txt out-bigdata

where bigdata.txt is a 100GB text file. 

HDFS will spawn 1504 map tasks, one for each input split of size 67108864 bytes (approximate split size)

Kind     Total Tasks (successful+failed+killed)   Successful tasks   Failed tasks   Killed tasks   Start Time             Finish Time
Setup    1                                         1                  0              0              19-Sep-2013 16:27:39   19-Sep-2013 16:27:40 (1sec)
Map      1504                                      1504               0              0              19-Sep-2013 16:27:41   19-Sep-2013 16:33:25 (5mins, 44sec)
Reduce   24                                        24                 0              0              19-Sep-2013 16:27:58   19-Sep-2013 16:35:36 (7mins, 38sec)

RHS/Hadoop will spawn 3008 map tasks, one for each input split of size 33,554,432 bytes (approximate split size)
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      10.65%       3533        3134      32        367        0        0 / 0
reduce   3.32%        1           0         1         0          0        0 / 0


Consequently, RHS/Hadoop spawns roughly 2X the number of map tasks that HDFS/Hadoop spawns by default.
Spawning extra map task processes reduces RHS/Hadoop performance, and bigger input splits are better
in most cases. Hence the default block size should be larger than 32MB.
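
Until the default itself is raised, one possible workaround (a sketch based on the getDefaultBlockSize() code above, not an officially documented plugin setting) is to set fs.local.block.size, which that method reads, to 64MB in the job configuration or in core-site.xml:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeWorkaround {
    // Sketch: override the 32MB fallback that FileSystem.getDefaultBlockSize()
    // reads from fs.local.block.size. The same value (67108864) can instead be
    // set in core-site.xml so that every job picks it up.
    public static Configuration withHdfsSizedBlocks() {
        Configuration conf = new Configuration();
        conf.setLong("fs.local.block.size", 64L * 1024 * 1024);
        return conf;
    }
}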




How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Bradley Childs 2013-12-19 16:06:28 UTC
Updated default config to the larger size.

Comment 6 Bradley Childs 2014-03-13 18:19:35 UTC
Patched here:

https://github.com/gluster/glusterfs-hadoop/pull/84

Comment 7 Diane Feddema 2014-04-25 18:20:55 UTC
This problem has reappeared in the High Touch Beta ISO and was reported by Mike Clark on April 16, 2014. The patch (a one-line fix) did not make it into the High Touch Beta release. The patch changes a constant from 32MB to 64MB. This should be fixed permanently, so that we don't waste our time explaining the workaround.
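
For reference, the one-line change described here is roughly of this shape (the class and field names below are hypothetical illustrations; the actual change is in the pull request linked in comment 6):

// Hypothetical sketch of the 32MB -> 64MB constant change; see
// https://github.com/gluster/glusterfs-hadoop/pull/84 for the real patch.
public class GlusterFsDefaults {
    // was: 32 * 1024 * 1024 (32MB)
    public static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024; // 64MB
}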

Comment 9 Steve Watt 2016-02-01 16:14:51 UTC
This solution is no longer available from Red Hat.

