Bug 1093838 - Brick-sorted order of filenames in RHS directory harms Hadoop mapreduce performance
Summary: Brick-sorted order of filenames in RHS directory harms Hadoop mapreduce perfo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhs-hadoop
Version: unspecified
Hardware: Unspecified
OS: Linux
medium
medium
Target Milestone: ---
: RHGS 3.1.0
Assignee: Bradley Childs
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks: 1202842 1235599
TreeView+ depends on / blocked
 
Reported: 2014-05-02 19:11 UTC by Hank Jakiela
Modified: 2023-09-14 02:07 UTC (History)
13 users (show)

Fixed In Version: org.gluster-glusterfs-hadoop-2.4.0-2
Doc Type: Bug Fix
Doc Text:
Previously, directory with many small files were lining up files in the listing by brick. As a consequence, there was a decrease in performance of Hadoop jobs as the files were processed in the order of the listing. The job was focusing on a single brick at a time. With this fix, the files are sorted by directory listing and not by brick to enhance the performance.
Clone Of:
Environment:
Last Closed: 2015-07-29 04:15:30 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1241160 0 unspecified CLOSED document fs.glusterfs.sort.directory.listing property of rhs-hadoop plugin 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2015:1517 0 normal SHIPPED_LIVE Red Hat Storage Server 3.1 Hadoop plug-in enhancement 2015-07-29 08:15:17 UTC
Red Hat Product Errata RHSA-2015:1495 0 normal SHIPPED_LIVE Important: Red Hat Gluster Storage 3.1 update 2015-07-29 08:26:26 UTC

Internal Links: 1241160

Description Hank Jakiela 2014-05-02 19:11:55 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Hank Jakiela 2014-05-02 19:52:42 UTC
When a Hadoop mapreduce job starts working on a list of input files in a directory, it expects files to be randomly scattered across the nodes in the cluster. Early in a large job it processes the first files in the list provided by the file system. Over the course of the map phase of the job, it works its way through the list. If the files are randomly scattered across all the nodes, then all the storage can be fairly busy all the time.

The normal way users specify the set of input files to a job is to name the directory they are in. The order that the job processes the files is the order they are delivered by the file system. This is the order of files returned by "ls -f" or by repeated calls readdir().

The default order of files returned by RHS is not random, but ordered by brick (or by pair of bricks if two-way replication is in use). The list of files begins with those on the first pair of bricks, then those on the next pair or bricks, and so on.

The consequence of this is that for a large mapreduce job, only one pair of file servers may be active at any one time, as the job works its way through the list of files. This causes a rolling I/O bottleneck.

A simple way to see the default unsorted order of files in a directory is with "ls -f" or by repeated calls to readdir().

One workaround is to specify the input files with a glob rather than a directory name. So, instead of "indir", use "indir/*". The glob causes the filenames to be returned in alpahabetical order. Because of GlusterFS hashing, alphabetical order is a random order in physical location. This spreads the I/O load across all the storage servers. This workaround is simple to do, but is not how users normally specify input files. Users need to be told how to use this workaround.

The performance impact can be significant. We ran the HiBench Wordcount benchmark on an 18-node cluster, using a 1 TB dataset (1000 files of 1GB each). With the input specified as a directory (brick-sorted filenames) the run took 45 minutes. With the input specified as "indir/*" (alphabetical by name, random by brick) the run took 30 minutes. The impact of confining the I/O load to only two servers at a time would be worse on larger clusters, as a smaller fraction of the servers would be active at any one time.

Comment 3 Bradley Childs 2014-05-28 19:23:31 UTC
I've added an optional parameter to sort the directory listing (insertion sort, so not as much overhead).  If undefined, directory listings are returned as received from the filesystem.  If enabled listings are alphabetized then returned.  

This should resolve issue around bunching tasks around the nodes holding the first directory listing elements. The code is upstream and will start moving down.

https://github.com/gluster/glusterfs-hadoop/pull/98

Comment 4 Hank Jakiela 2014-05-29 15:12:08 UTC
OK, +1.

I suspect the overhead of sorting the list of file names will be tiny compared to the cost of getting the list of file names in the first place.

Comment 5 Martin Kudlej 2014-06-27 11:43:54 UTC
I think this patch is not part of rhs-hadoop-2.3.2-3.noarch.rpm.
1) I don't see patch applied in src rpm.
2) This simple script doesn't return sorted list of files:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FilterFileSystem;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

for (i in fs.listStatus(new Path("/"))) {
  println(i.path);
}

-->Assigned

Comment 9 Hank Jakiela 2014-11-04 15:45:18 UTC
What does " rc -> --- " mean? Is this bug fixed in RHS 3.0.2?

Comment 12 Martin Bukatovic 2015-06-24 11:56:43 UTC
Checking with rhs-hadoop-2.4.0-4.el6.noarch

I have created directory in /mnt/glusterfs/HadoopVol1/tmp/wikipedia
with 10000 files of random name.

When I use default configuration, output of script from comment 5
listing files from wikipedia directory was unsorted:

$ head output.hadoop-api
glusterfs:/tmp/wikipedia/wikipages.sha1
glusterfs:/tmp/wikipedia/n57mnhttuhxxpq1nanak3zhmmmcl622.wikitext
glusterfs:/tmp/wikipedia/80eb0u0r4ebdp20dbvburoawhsqm4pv.wikitext
glusterfs:/tmp/wikipedia/7x092vk67vfmuaydgnv3iibesvoztpj.wikitext
glusterfs:/tmp/wikipedia/njf4p5cawnde84b6opfl0g4fgygvpxw.wikitext
glusterfs:/tmp/wikipedia/q2lguie57e7djt3z1u1m54hrv9mdsvb.wikitext
glusterfs:/tmp/wikipedia/1qsexspc1aljksosrk2e5ph9b7kwnjv.wikitext
glusterfs:/tmp/wikipedia/8bze59ph64196ms8d22i50nd6agy5yl.wikitext
glusterfs:/tmp/wikipedia/saxjmb9pr579nxn4jrmd3egncfavmlt.wikitext
glusterfs:/tmp/wikipedia/6y5mnxfe5tkop8l36pdovya92ypq4d2.wikitext

When I added property fs.glusterfs.sort.directory.listing into
core-site and set it to true (via ambari), the output of the
script was sorted:

$ head output.hadoop-api.listing-true 
glusterfs:/tmp/wikipedia/001ma1f8eb23ujutb9zfwy02odgb6ma.wikitext
glusterfs:/tmp/wikipedia/002xyr65wcnclrztswzioppcvjalwra.wikitext
glusterfs:/tmp/wikipedia/004rrmqfec4imjkr58woj1sy81zadhm.wikitext
glusterfs:/tmp/wikipedia/009pq5c77mqka3io72g74n4r77fcdoq.wikitext
glusterfs:/tmp/wikipedia/00biqy93vv616yhs1zebo9xigi9vncp.wikitext
glusterfs:/tmp/wikipedia/00iu2n81ybzbz5kzw09tmfgg45av48g.wikitext
glusterfs:/tmp/wikipedia/00skedh9bhdevhgtzxyql5b4010m88u.wikitext
glusterfs:/tmp/wikipedia/00ujpjhl6nwupmbz9j0xdio2ydbvisl.wikitext
glusterfs:/tmp/wikipedia/00wd87uti5x3sxtzoh95r1ro69nerpx.wikitext
glusterfs:/tmp/wikipedia/00zthzswjejxq8wcrtda9tl8zn3bsw0.wikitext

So the feature works.

Few questions that requires decision remains though:

 * should we enable this feature by default?
 * if yes, how? (via default core-site config or via hardcoded default?)
 * if no, how would our users know when and how to enable the feature?

In other words, we need to decide which default configuration makes
more sense and then how to document the feature.

Comment 13 Martin Bukatovic 2015-06-24 15:24:20 UTC
Additional information: Using directory with 10000 files of random name (from
comment 12), I tried to check increase of time required to list the directory
when enabling fs.glusterfs.sort.directory.listing. I executed 100 runs with
both configurations with the following results:

fs.glusterfs.sort.directory.listing = false, real time median: 13.59 s
fs.glusterfs.sort.directory.listing = true,  real time median: 20.66 s
increate of time required: 152.02 %

This is by no means real performance measurement, it's rather a first try
and/or little hint how it behaves in QE's minimal enviroment.

Did anyone tried to run real performance analysis of this feature?

Comment 14 Martin Bukatovic 2015-07-08 14:31:19 UTC
Answers from discussion during dev-qe sync on 2nd July 2015:

 * performance evaluation has been done by the requestor's team
 * the default value stays the same (so that the feature is disabled by default)

(so clearing the needinfo)

Based on this and my previous evaluation (see comment 12), I'm moving this to
verified state.

Comment 15 Divya 2015-07-22 10:24:51 UTC
Hi,

Please review the edited doc text and sign off.

Comment 17 errata-xmlrpc 2015-07-29 04:15:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-1517.html

Comment 18 Red Hat Bugzilla 2023-09-14 02:07:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days


Note You need to log in before you can comment on or make changes to this bug.