Description of problem: The Apache Hadoop Distributed File System (org.apache.hadoop.hdfs.DistributedFileSystem) tracks all read and write operations against the file system using org.apache.hadoop.fs.Statistics.incrementReadOps and incrementWriteOps. The Gluster plugin should offer the same feature, since this is expected behavior for a Hadoop-compatible file system. Currently the Gluster plugin does not track these operations; it needs to be extended to track them in the same way.
For detail: here is an initial list of the methods that call incrementReadOps/incrementWriteOps in org.apache.hadoop.hdfs.DistributedFileSystem. Each one below that we also implement in GlusterFS will require the same modification:

  public BlockLocation[] getFileBlockLocations(Path p, ...)
  public FSDataInputStream open(Path f, int bufferSize) throws IOException
  public FSDataOutputStream append(Path f, int bufferSize, ...)
  public FSDataOutputStream create(Path f, FsPermission permission, ...)
  public FSDataOutputStream createNonRecursive(Path f, FsPermission permission, ...)
  public boolean setReplication(Path src, ...)
  public boolean rename(Path src, Path dst) throws IOException
  public void rename(Path src, Path dst, Options.Rename... options) throws IOException
  public boolean delete(Path f, boolean recursive) throws IOException
  public ContentSummary getContentSummary(Path f) throws IOException
  public boolean mkdir(Path f, FsPermission permission) throws IOException
  public boolean mkdirs(Path f, FsPermission permission) throws IOException
  public FsStatus getStatus(Path p) throws IOException
  public FileStatus getFileStatus(Path f) throws IOException
  public MD5MD5CRC32FileChecksum getFileChecksum(Path f) throws IOException
  public void setPermission(Path p, FsPermission permission, ...)
  public void setTimes(Path p, long mtime, long atime, ...)

For example:

  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    statistics.incrementReadOps(1);
    HdfsFileStatus fi = dfs.getFileInfo(getPathName(f));
    ...
  }
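To illustrate the pattern without pulling in Hadoop dependencies, here is a minimal, self-contained sketch. OpStatistics is a simplified stand-in for org.apache.hadoop.fs.FileSystem.Statistics, and SketchFileSystem is a hypothetical class (not the real plugin API); the point is only that each FileSystem entry point bumps the appropriate counter before doing its work:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified stand-in for org.apache.hadoop.fs.FileSystem.Statistics. */
class OpStatistics {
    private final AtomicLong readOps = new AtomicLong();
    private final AtomicLong writeOps = new AtomicLong();
    void incrementReadOps(int n)  { readOps.addAndGet(n); }
    void incrementWriteOps(int n) { writeOps.addAndGet(n); }
    long getReadOps()  { return readOps.get(); }
    long getWriteOps() { return writeOps.get(); }
}

/** Hypothetical sketch of the pattern the Gluster plugin would follow:
 *  increment the counter at the top of each FileSystem entry point,
 *  before delegating to the underlying gluster mount. */
class SketchFileSystem {
    final OpStatistics statistics = new OpStatistics();

    Object getFileStatus(String path) {
        statistics.incrementReadOps(1);   // metadata lookup is a read op
        return null;                      // real plugin would stat the file here
    }

    boolean mkdirs(String path) {
        statistics.incrementWriteOps(1);  // directory creation is a write op
        return true;                      // real plugin would create the dirs here
    }
}
```

With this in place, any MapReduce job that goes through these entry points would see nonzero "Number of read operations" and "Number of write operations" in its File System Counters.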
Fixed with redesign in this checkin: https://github.com/gluster/hadoop-glusterfs/commit/ce6325313dcb0df6cc73379248c1e07a9aa0b025
I've tried to test this and found that read/write operations are 0:

  FILE: Number of bytes read=226
  FILE: Number of bytes written=921439
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  GLUSTERFS: Number of bytes read=8605
  GLUSTERFS: Number of bytes written=215
  GLUSTERFS: Number of read operations=0
  GLUSTERFS: Number of large read operations=0
  GLUSTERFS: Number of write operations=0

Same example with HDFS:

  FILE: Number of bytes read=226
  FILE: Number of bytes written=913772
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=2870
  HDFS: Number of bytes written=215
  HDFS: Number of read operations=43
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=3

I've used:

  rhs-hadoop-2.1.5-1.noarch
  hadoop-2.2.0.2.0.6.0-76.el6.x86_64
  hadoop-client-2.2.0.2.0.6.0-76.el6.x86_64
  hadoop-yarn-2.2.0.2.0.6.0-76.el6.x86_64
  hadoop-mapreduce-2.2.0.2.0.6.0-76.el6.x86_64
  hadoop-libhdfs-2.2.0.2.0.6.0-76.el6.x86_64
  hadoop-lzo-0.5.0-1.x86_64
  hadoop-lzo-native-0.5.0-1.x86_64
  hadoop-hdfs-2.2.0.2.0.6.0-76.el6.x86_64
  glusterfs-3.4.0.44rhs-1.el6rhs.x86_64

---> ASSIGNED

Log from example run:

  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar pi 10 10
  Number of Maps = 10
  Samples per Map = 10
  14/01/23 13:25:46 INFO glusterfs.GlusterVolume: Initializing gluster volume..
  14/01/23 13:25:46 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
  14/01/23 13:25:46 INFO glusterfs.GlusterFileSystem: Initializing GlusterFS, CRC disabled.
  14/01/23 13:25:46 INFO glusterfs.GlusterFileSystem: GIT INFO={git.commit.id.abbrev=51e5108, git.commit.user.email=bchilds, git.commit.message.full=2.1.5 branch/build , git.commit.id=51e5108fbec0b50d921aeb00ba2489bbdbe3d6ff, git.commit.message.short=2.1.5 branch/build, git.commit.user.name=childsb, git.build.user.name=Unknown, git.commit.id.describe=2.1.4-21-g51e5108, git.build.user.email=Unknown, git.branch=master, git.commit.time=17.01.2014 @ 16:05:54 EST, git.build.time=21.01.2014 @ 02:19:28 EST}
  14/01/23 13:25:46 INFO glusterfs.GlusterFileSystem: GIT_TAG=2.1.4
  14/01/23 13:25:46 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
  14/01/23 13:25:46 INFO glusterfs.GlusterVolume: Initializing gluster volume..
  14/01/23 13:25:46 INFO glusterfs.GlusterVolume: Root of Gluster file system is /mnt/glusterfs
  14/01/23 13:25:46 INFO glusterfs.GlusterVolume: mapreduce/superuser daemon : null
  14/01/23 13:25:46 INFO glusterfs.GlusterVolume: Working directory is : glusterfs:/user/root
  14/01/23 13:25:46 INFO glusterfs.GlusterVolume: Write buffer size : 131072
  Wrote input for Map #0
  Wrote input for Map #1
  Wrote input for Map #2
  Wrote input for Map #3
  Wrote input for Map #4
  Wrote input for Map #5
  Wrote input for Map #6
  Wrote input for Map #7
  Wrote input for Map #8
  Wrote input for Map #9
  Starting Job
  14/01/23 13:25:50 INFO client.RMProxy: Connecting to ResourceManager at _machine_:8050
  14/01/23 13:25:50 INFO glusterfs.GlusterVolume: Initializing gluster volume..
  14/01/23 13:25:50 INFO glusterfs.GlusterVolume: Initializing gluster volume..
  14/01/23 13:25:50 INFO glusterfs.GlusterVolume: Root of Gluster file system is /mnt/glusterfs
  14/01/23 13:25:50 INFO glusterfs.GlusterVolume: mapreduce/superuser daemon : null
  14/01/23 13:25:50 INFO glusterfs.GlusterVolume: Working directory is : glusterfs:/user/root
  14/01/23 13:25:50 INFO glusterfs.GlusterVolume: Write buffer size : 131072
  14/01/23 13:25:56 INFO input.FileInputFormat: Total input paths to process : 10
  14/01/23 13:25:56 INFO mapreduce.JobSubmitter: number of splits:10
  14/01/23 13:25:56 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
  14/01/23 13:25:56 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
  14/01/23 13:25:56 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
  14/01/23 13:25:56 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
  14/01/23 13:25:56 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
  14/01/23 13:25:56 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
  14/01/23 13:25:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1390474574687_0001
  14/01/23 13:25:58 INFO impl.YarnClientImpl: Submitted application application_1390474574687_0001 to ResourceManager at _machine_:8050
  14/01/23 13:25:58 INFO mapreduce.Job: The url to track the job: _machine_:8088/proxy/application_1390474574687_0001/
  14/01/23 13:25:58 INFO mapreduce.Job: Running job: job_1390474574687_0001
  14/01/23 13:26:17 INFO mapreduce.Job: Job job_1390474574687_0001 running in uber mode : false
  14/01/23 13:26:17 INFO mapreduce.Job: map 0% reduce 0%
  14/01/23 13:28:06 INFO mapreduce.Job: map 50% reduce 0%
  14/01/23 13:29:07 INFO mapreduce.Job: map 100% reduce 0%
  14/01/23 13:29:21 INFO mapreduce.Job: map 100% reduce 100%
  14/01/23 13:29:25 INFO mapreduce.Job: Job job_1390474574687_0001 completed successfully
  14/01/23 13:29:27 INFO mapreduce.Job: Counters: 43
    File System Counters
      FILE: Number of bytes read=226
      FILE: Number of bytes written=921439
      FILE: Number of read operations=0
      FILE: Number of large read operations=0
      FILE: Number of write operations=0
      GLUSTERFS: Number of bytes read=8605
      GLUSTERFS: Number of bytes written=215
      GLUSTERFS: Number of read operations=0
      GLUSTERFS: Number of large read operations=0
      GLUSTERFS: Number of write operations=0
    Job Counters
      Launched map tasks=10
      Launched reduce tasks=1
      Data-local map tasks=10
      Total time spent by all maps in occupied slots (ms)=2480649
      Total time spent by all reduces in occupied slots (ms)=51608
    Map-Reduce Framework
      Map input records=10
      Map output records=20
      Map output bytes=180
      Map output materialized bytes=280
      Input split bytes=1350
      Combine input records=0
      Combine output records=0
      Reduce input groups=2
      Reduce shuffle bytes=280
      Reduce input records=20
      Reduce output records=0
      Spilled Records=40
      Shuffled Maps =10
      Failed Shuffles=0
      Merged Map outputs=10
      GC time elapsed (ms)=25269
      CPU time spent (ms)=35510
      Physical memory (bytes) snapshot=2833489920
      Virtual memory (bytes) snapshot=12698566656
      Total committed heap usage (bytes)=2426511360
    Shuffle Errors
      BAD_ID=0
      CONNECTION=0
      IO_ERROR=0
      WRONG_LENGTH=0
      WRONG_MAP=0
      WRONG_REDUCE=0
    File Input Format Counters
      Bytes Read=1180
    File Output Format Counters
      Bytes Written=97
  Job Finished in 217.403 seconds
  Estimated value of Pi is 3.20000000000000000000
According to the Hadoop documentation, the "large read operations" counter (in HDFS) should be incremented when listing the files under a large directory. Does this counter work the same way in glusterfs-hadoop as it does in HDFS? I've tried several MapReduce jobs on directories containing thousands of files but still got 0 large read operations on the counter. We need to know how this counter works in glusterfs-hadoop so we can properly test it.
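For context on what a matching implementation might look like: in HDFS, a directory listing that is too large for one response is fetched in batches, and the extra batches are what get counted as large read operations. The sketch below is a hypothetical, self-contained illustration of that paging pattern (ListingSketch and the BATCH constant are stand-ins, not the real HDFS or glusterfs-hadoop code). If the Gluster plugin lists a directory through its local mount in a single call, it would plausibly never increment this counter, which would explain the 0 seen above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of how a paged directory listing could count
 *  "large read operations": the first fetch is an ordinary read op,
 *  and each *additional* batch needed for a large directory is a
 *  large read op. BATCH is a stand-in for a listing-batch-size limit. */
class ListingSketch {
    static final int BATCH = 1000;                 // assumed batch size
    final AtomicLong readOps = new AtomicLong();
    final AtomicLong largeReadOps = new AtomicLong();

    List<String> listStatus(List<String> dirEntries) {
        readOps.incrementAndGet();                 // first fetch: read op
        int fetched = Math.min(BATCH, dirEntries.size());
        List<String> out = new ArrayList<>(dirEntries.subList(0, fetched));
        while (fetched < dirEntries.size()) {      // remaining batches
            largeReadOps.incrementAndGet();        // each one: large read op
            int next = Math.min(fetched + BATCH, dirEntries.size());
            out.addAll(dirEntries.subList(fetched, next));
            fetched = next;
        }
        return out;
    }
}
```

Under this model, listing a directory of 2500 entries would record 1 read operation and 2 large read operations, while any directory that fits in one batch records 0 large read operations no matter how many jobs you run.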
Because of the large number of bugs filed against it, the "mainline" version is ambiguous and about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.