Bug 765117 (GLUSTER-3385)

Summary: 'terasort' after 'teragen' gives lot warnings
Product: [Community] GlusterFS Reporter: M S Vishwanath Bhat <vbhat>
Component: HDFSAssignee: Venky Shankar <vshankar>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: pre-releaseCC: gluster-bugs, mzywusko, rabhat
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: hadoop-glusterfs Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description M S Vishwanath Bhat 2011-08-10 12:32:57 UTC
I ran 'terasort' after generating data sets using 'teragen' during the map jobs I get following warning.

11/08/10 04:07:22 INFO mapred.JobClient: Task Id : attempt_201108090104_0044_m_000009_0, Status : FAILED
java.io.IOException: File: glusterfs://ubuntu1:9000/mnt/glusterfs/ter-inp/_partition.lst#_partition.lst has changed on HDFS since job started
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:484)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:356)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:205)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)

java.io.IOException: File: glusterfs://ubuntu1:9000/mnt/glusterfs/ter-inp/_partition.lst#_partition.lst has changed on HDFS since job started
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:484)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:356)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:205)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)

11/08/10 04:07:22 WARN mapred.JobClient: Error reading task outputhttp://ubuntu2:50060/tasklog?plaintext=true&taskid=attempt_201108090104_0044_m_000009_0&filter=stdout
11/08/10 04:07:22 WARN mapred.JobClient: Error reading task outputhttp://ubuntu2:50060/tasklog?plaintext=true&taskid=attempt_201108090104_0044_m_000009_0&filter=stderr
11/08/10 04:07:23 INFO mapred.JobClient:  map 3% reduce 0%
11/08/10 04:07:27 INFO mapred.JobClient:  map 4% reduce 0%
11/08/10 04:07:31 INFO mapred.JobClient:  map 5% reduce 0%
11/08/10 04:07:31 INFO mapred.JobClient: Task Id : attempt_201108090104_0044_m_000014_0, Status : FAILED
java.io.IOException: File: glusterfs://ubuntu1:9000/mnt/glusterfs/ter-inp/_partition.lst#_partition.lst has changed on HDFS since job started
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:484)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:356)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:205)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)

java.io.IOException: File: glusterfs://ubuntu1:9000/mnt/glusterfs/ter-inp/_partition.lst#_partition.lst has changed on HDFS since job started
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:484)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:356)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:205)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)

11/08/10 04:07:31 WARN mapred.JobClient: Error reading task outputhttp://ubuntu2:50060/tasklog?plaintext=true&taskid=attempt_201108090104_0044_m_000014_0&filter=stdout
11/08/10 04:07:31 WARN mapred.JobClient: Error reading task outputhttp://ubuntu2:50060/tasklog?plaintext=true&taskid=attempt_201108090104_0044_m_000014_0&filter=stderr
11/08/10 04:07:34 INFO mapred.JobClient: Task Id : attempt_201108090104_0044_m_000015_0, Status : FAILED
java.io.IOException: File: glusterfs://ubuntu1:9000/mnt/glusterfs/ter-inp/_partition.lst#_partition.lst has changed on HDFS since job started
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:484)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:356)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:205)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)

java.io.IOException: File: glusterfs://ubuntu1:9000/mnt/glusterfs/ter-inp/_partition.lst#_partition.lst has changed on HDFS since job started
        at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:484)
        at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:356)
        at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:205)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)

11/08/10 04:07:34 WARN mapred.JobClient: Error reading task outputhttp://ubuntu2:50060/tasklog?plaintext=true&taskid=attempt_201108090104_0044_m_000015_0&filter=stdout
11/08/10 04:07:34 WARN mapred.JobClient: Error reading task outputhttp://ubuntu2:50060/tasklog?plaintext=true&taskid=attempt_201108090104_0044_m_000015_0&filter=stderr


Though I get warnings, job seems to run fine.

Comment 1 M S Vishwanath Bhat 2011-08-10 15:28:08 UTC
Even the job fails with following backtrace...


11/08/10 10:48:04 WARN mapred.JobClient: Error reading task outputhttp://ubuntu3:50060/tasklog?plaintext=true&taskid=attempt_201108101044_0001_r_000001_0&filter=stdout
11/08/10 10:48:04 WARN mapred.JobClient: Error reading task outputhttp://ubuntu3:50060/tasklog?plaintext=true&taskid=attempt_201108101044_0001_r_000001_0&filter=stderr
11/08/10 10:48:08 INFO mapred.JobClient: Job complete: job_201108101044_0001
11/08/10 10:48:08 INFO mapred.JobClient: Counters: 14
11/08/10 10:48:08 INFO mapred.JobClient:   Job Counters
11/08/10 10:48:08 INFO mapred.JobClient:     Launched reduce tasks=2
11/08/10 10:48:08 INFO mapred.JobClient:     Rack-local map tasks=15
11/08/10 10:48:08 INFO mapred.JobClient:     Launched map tasks=41
11/08/10 10:48:08 INFO mapred.JobClient:     Data-local map tasks=26
11/08/10 10:48:08 INFO mapred.JobClient:     Failed map tasks=1
11/08/10 10:48:08 INFO mapred.JobClient:   FileSystemCounters
11/08/10 10:48:08 INFO mapred.JobClient:     FILE_BYTES_READ=547609080
11/08/10 10:48:08 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1095217136
11/08/10 10:48:08 INFO mapred.JobClient:   Map-Reduce Framework
11/08/10 10:48:08 INFO mapred.JobClient:     Combine output records=0
11/08/10 10:48:08 INFO mapred.JobClient:     Map input records=5368710
11/08/10 10:48:08 INFO mapred.JobClient:     Spilled Records=10737420
11/08/10 10:48:08 INFO mapred.JobClient:     Map output bytes=536871000
11/08/10 10:48:08 INFO mapred.JobClient:     Map input bytes=536871000
11/08/10 10:48:08 INFO mapred.JobClient:     Combine input records=0
11/08/10 10:48:08 INFO mapred.JobClient:     Map output records=5368710
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:248)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:257)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Comment 2 Venky Shankar 2011-08-13 10:20:25 UTC
the warnings related to file timstamps may be due to inconsistency in the clock time of nodes in the cluster. I see ~5-10 minutes difference in clock in the server.

I will sync the clock and give a run again.

Also the exceptions are warnings - the job runs to completion for me and teravalidate gives positive output for the run.

Thanks,
-Venky

Comment 3 Venky Shankar 2011-08-17 04:15:01 UTC
timestamp of the _partition.lst file on master and slave ..

for i in {21,22,23,24,33,34,35,36}; do echo "Doing 10.1.11.$i"; ssh 10.1.11.$i ' ls -l /mnt/glusterfs/gen-dir/_partition.lst';done

Doing 10.1.11.21
-rw-r--r-- 1 root root 129 2011-08-17 00:10 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.22
-rw-r--r-- 1 root root 129 2011-08-17 00:10 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.23
-rw-r--r-- 1 root root 129 2011-08-17 00:10 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.24
-rw-r--r-- 1 root root 129 2011-08-17 00:14 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.33
-rw-r--r-- 1 root root 129 2011-08-17 00:14 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.34
-rw-r--r-- 1 root root 129 2011-08-17 00:10 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.35
-rw-r--r-- 1 root root 129 2011-08-17 00:10 /mnt/glusterfs/gen-dir/_partition.lst
Doing 10.1.11.36
-rw-r--r-- 1 root root 129 2011-08-17 00:14 /mnt/glusterfs/gen-dir/_partition.lst

As you see the timestamp of the same file on the hosts differs on the FUSE mount.

This cause the hadoop job to assume that the file has changed and throws an Ex eption regarding the same

Comment 4 Venky Shankar 2011-08-17 05:33:45 UTC
after syncing the clocks with ntpd - the job runs fine without any warnings. Closing the bug. re-open if you see any more.

Comment 5 Anand Avati 2011-08-17 06:23:38 UTC
CHANGE: http://review.gluster.com/241 (Change-Id: I71bc8a477155cff62c3218fee285a22a44aa54ba) merged in release-3.2 by Anand Avati (avati)

Comment 6 Raghavendra Bhat 2011-08-17 06:44:54 UTC
The above patch is for the bug 765116 and not this.

Comment 7 M S Vishwanath Bhat 2011-08-23 07:34:05 UTC
I see this issue only in the ubuntu vm's which have few seconds of time difference. But with AWS machines and bare metals I'm not seeing this issue. So moving to verified state.