Description of problem:
When a Hadoop MapReduce job starts working on a list of input files in a directory, it expects the files to be randomly scattered across the nodes in the cluster. Early in a large job it processes the first files in the list provided by the file system, and over the course of the map phase it works its way through the list. If the files are randomly scattered across all the nodes, all the storage can be kept fairly busy all the time.

The normal way users specify the set of input files to a job is to name the directory they are in. The order in which the job processes the files is the order they are delivered by the file system; this is the order of files returned by "ls -f" or by repeated calls to readdir(). The default order of files returned by RHS is not random, but ordered by brick (or by pair of bricks if two-way replication is in use). The list of files begins with those on the first pair of bricks, then those on the next pair of bricks, and so on. The consequence is that for a large MapReduce job, only one pair of file servers may be active at any one time as the job works its way through the list of files. This causes a rolling I/O bottleneck.

One workaround is to specify the input files with a glob rather than a directory name: instead of "indir", use "indir/*". The glob causes the filenames to be returned in alphabetical order. Because of GlusterFS hashing, alphabetical order is effectively random with respect to physical location, so the I/O load is spread across all the storage servers. This workaround is simple to do, but it is not how users normally specify input files; users need to be told how to use it.

The performance impact can be significant. We ran the HiBench Wordcount benchmark on an 18-node cluster, using a 1 TB dataset (1000 files of 1 GB each).
With the input specified as a directory (brick-sorted filenames) the run took 45 minutes. With the input specified as "indir/*" (alphabetical by name, random by brick) the run took 30 minutes. The impact of confining the I/O load to only two servers at a time would be worse on larger clusters, as a smaller fraction of the servers would be active at any one time.
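To illustrate the effect, here is a small Python sketch (not part of the bug; hashing filenames to pick a brick pair is a simplification of GlusterFS's DHT placement, and the file names and cluster size are made up). It compares how many brick pairs serve the first files of a brick-ordered listing versus an alphabetical one:

```python
import hashlib

NUM_BRICK_PAIRS = 9  # e.g. an 18-node cluster with 2-way replication
FILES = [f"part-{i:05d}" for i in range(1000)]

def brick_of(name):
    # Stand-in for GlusterFS hashing: hash the filename to pick a brick pair.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_BRICK_PAIRS

# Default RHS listing order: files grouped by brick pair.
brick_order = sorted(FILES, key=brick_of)
# Glob workaround: alphabetical order, which is random with respect to bricks.
alpha_order = sorted(FILES)

def active_pairs(listing, window=50):
    """Distinct brick pairs serving the first `window` files of the job."""
    return len({brick_of(f) for f in listing[:window]})

print("brick-ordered listing:", active_pairs(brick_order), "pair(s) busy")
print("alphabetical listing: ", active_pairs(alpha_order), "pair(s) busy")
```

With the brick-ordered listing, the early map tasks all hit the same pair of servers; with the alphabetical listing, the same window of files is spread across many pairs.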
I've added an optional parameter to sort the directory listing (an insertion sort, so the overhead is small). If the parameter is undefined, directory listings are returned as received from the file system; if it is enabled, listings are alphabetized before being returned. This should resolve the issue of tasks bunching up on the nodes holding the first elements of the directory listing. The code is upstream and will start moving down. https://github.com/gluster/glusterfs-hadoop/pull/98
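The actual patch lives in the Java glusterfs-hadoop plugin (see the pull request above); as a rough illustration of the intended behavior only, here is a hedged Python sketch where the `sort_directory_listing` flag stands in for the fs.glusterfs.sort.directory.listing property:

```python
def list_status(raw_listing, sort_directory_listing=False):
    """Return a directory listing, optionally alphabetized.

    `raw_listing` stands in for the names the underlying file system
    delivers (brick order on RHS); the flag mirrors the new
    fs.glusterfs.sort.directory.listing property, which is off by default.
    """
    if not sort_directory_listing:
        return list(raw_listing)   # pass through as received
    return sorted(raw_listing)     # alphabetize before returning

raw = ["wikipages.sha1", "n57mn.wikitext", "80eb0.wikitext"]
print(list_status(raw))            # order as delivered by the file system
print(list_status(raw, sort_directory_listing=True))
```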
OK, +1. I suspect the overhead of sorting the list of file names will be tiny compared to the cost of getting the list of file names in the first place.
I think this patch is not part of rhs-hadoop-2.3.2-3.noarch.rpm.

1) I don't see the patch applied in the src rpm.
2) This simple script doesn't return a sorted list of files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FilterFileSystem;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
for (i in fs.listStatus(new Path("/"))) {
    println(i.path);
}

--> Assigned
What does " rc -> --- " mean? Is this bug fixed in RHS 3.0.2?
Checking with rhs-hadoop-2.4.0-4.el6.noarch.

I created a directory /mnt/glusterfs/HadoopVol1/tmp/wikipedia with 10000 randomly named files. With the default configuration, the output of the script from comment 5 listing the files of the wikipedia directory was unsorted:

$ head output.hadoop-api
glusterfs:/tmp/wikipedia/wikipages.sha1
glusterfs:/tmp/wikipedia/n57mnhttuhxxpq1nanak3zhmmmcl622.wikitext
glusterfs:/tmp/wikipedia/80eb0u0r4ebdp20dbvburoawhsqm4pv.wikitext
glusterfs:/tmp/wikipedia/7x092vk67vfmuaydgnv3iibesvoztpj.wikitext
glusterfs:/tmp/wikipedia/njf4p5cawnde84b6opfl0g4fgygvpxw.wikitext
glusterfs:/tmp/wikipedia/q2lguie57e7djt3z1u1m54hrv9mdsvb.wikitext
glusterfs:/tmp/wikipedia/1qsexspc1aljksosrk2e5ph9b7kwnjv.wikitext
glusterfs:/tmp/wikipedia/8bze59ph64196ms8d22i50nd6agy5yl.wikitext
glusterfs:/tmp/wikipedia/saxjmb9pr579nxn4jrmd3egncfavmlt.wikitext
glusterfs:/tmp/wikipedia/6y5mnxfe5tkop8l36pdovya92ypq4d2.wikitext

When I added the property fs.glusterfs.sort.directory.listing to core-site and set it to true (via Ambari), the output of the script was sorted:

$ head output.hadoop-api.listing-true
glusterfs:/tmp/wikipedia/001ma1f8eb23ujutb9zfwy02odgb6ma.wikitext
glusterfs:/tmp/wikipedia/002xyr65wcnclrztswzioppcvjalwra.wikitext
glusterfs:/tmp/wikipedia/004rrmqfec4imjkr58woj1sy81zadhm.wikitext
glusterfs:/tmp/wikipedia/009pq5c77mqka3io72g74n4r77fcdoq.wikitext
glusterfs:/tmp/wikipedia/00biqy93vv616yhs1zebo9xigi9vncp.wikitext
glusterfs:/tmp/wikipedia/00iu2n81ybzbz5kzw09tmfgg45av48g.wikitext
glusterfs:/tmp/wikipedia/00skedh9bhdevhgtzxyql5b4010m88u.wikitext
glusterfs:/tmp/wikipedia/00ujpjhl6nwupmbz9j0xdio2ydbvisl.wikitext
glusterfs:/tmp/wikipedia/00wd87uti5x3sxtzoh95r1ro69nerpx.wikitext
glusterfs:/tmp/wikipedia/00zthzswjejxq8wcrtda9tl8zn3bsw0.wikitext

So the feature works. A few questions that require a decision remain, though:

* should we enable this feature by default?
* if yes, how? (via the default core-site config or via a hardcoded default?)
* if no, how would our users know when and how to enable the feature?

In other words, we need to decide which default configuration makes more sense and then how to document the feature.
Additional information: using the directory with 10000 randomly named files (from comment 12), I tried to measure the increase in the time required to list the directory when enabling fs.glusterfs.sort.directory.listing. I executed 100 runs with each configuration, with the following results:

fs.glusterfs.sort.directory.listing = false, real time median: 13.59 s
fs.glusterfs.sort.directory.listing = true, real time median: 20.66 s
time with sorting enabled: 152.02 % of the unsorted time (a ~52 % increase)

This is by no means a real performance measurement; it's rather a first try and/or a little hint of how it behaves in QE's minimal environment. Has anyone tried to run a real performance analysis of this feature?
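For reference, the shape of the quick comparison above can be sketched in Python (a hypothetical stand-in only: it times an in-process copy versus an in-process sort of 10000 random names rather than the actual Hadoop listing over GlusterFS, so the absolute numbers are not comparable to the measurements above):

```python
import random
import string
import time
from statistics import median

# Generate 10000 random file names, similar to the test directory above.
random.seed(42)
names = ["".join(random.choices(string.ascii_lowercase + string.digits, k=31))
         + ".wikitext" for _ in range(10000)]

def timed_runs(func, runs=100):
    """Median wall-clock time of `runs` invocations of `func`, in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return median(samples)

unsorted_t = timed_runs(lambda: list(names))    # pass-through listing
sorted_t = timed_runs(lambda: sorted(names))    # listing with sorting enabled

print(f"pass-through median: {unsorted_t:.6f} s")
print(f"sorted median:       {sorted_t:.6f} s")
print(f"ratio: {sorted_t / unsorted_t:.2f}x")
```

The median over many runs is used for the same reason as in the measurement above: it is robust against occasional slow outliers.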
Answers from the discussion during the dev-qe sync on 2nd July 2015:

* performance evaluation has been done by the requestor's team
* the default value stays the same (so the feature is disabled by default)

(so clearing the needinfo)

Based on this and my previous evaluation (see comment 12), I'm moving this to the verified state.
Hi, Please review the edited doc text and sign off.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-1517.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days