Bug 1182628 - Mapreduce jobs can cause FileNotFoundException or Resource changed on src filesystem error.
Summary: Mapreduce jobs can cause FileNotFoundException or Resource changed on src filesystem error.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhs-hadoop
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Bradley Childs
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On: 1240739 1314508
Blocks: 1202842
 
Reported: 2015-01-15 15:25 UTC by Ondřej Komárek
Modified: 2018-11-19 04:00 UTC
14 users

Fixed In Version: org.gluster-glusterfs-hadoop-2.4.0-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-05 01:35:58 UTC
Embargoed:


Attachments (Terms of Use)
Example run of sqoop import which fails because of "Resource changed on src filesystem error". (8.86 KB, text/plain)
2015-06-25 09:43 UTC, Daniel Horák

Comment 3 Bradley Childs 2015-05-26 17:48:18 UTC
This bug is caused by clock sync issues when Hadoop is being extra picky on the timestamp synchronization.  The best fix is to get the clock skew fixed with NTP as close as possible.

If that doesn't work, then there's a new feature in the rhs-hadoop plugin which allows adjusting the precision of the timestamps.  Set fs.glusterfs.timestamp.trim to 5 in core-site.xml to truncate the 4 least significant digits on the timestamp.

Comment 5 Daniel Horák 2015-06-25 09:43:33 UTC
Created attachment 1043019 [details]
Example run of sqoop import which fails because of "Resource changed on src filesystem error".

I'm able to reproduce it quite reliably with the sqoop import/import-all-tables commands with fs.glusterfs.timestamp.trim=0 (unconfigured).

Environment:
  Red Hat Enterprise Linux Server release 6.7 Beta (Santiago)
  HDP 2.1
  glusterfs-3.7.1-4.el6rhs.x86_64
  hadoop-2.4.0.2.1.7.0-784.el6.x86_64
  rhs-hadoop-2.4.0-4.el6.noarch
  rhs-hadoop-install-2_54-1.el6rhs.noarch

Example log in attachment.

Comment 6 Daniel Horák 2015-06-25 09:48:21 UTC
When I run the same test with sqoop import/import-all-tables commands on the same/similar cluster with fs.glusterfs.timestamp.trim=5, it fails only once in 100 runs.

Without the fs.glusterfs.timestamp.trim=5 configuration (as described in comment 5 above), it fails a few times in 10 runs.

Comment 7 Martin Bukatovic 2015-06-29 19:54:03 UTC
First of all, I would like to reformulate the issue here. Please correct me
if anything in my explanation is wrong.

Timestamps on HDFS
==================

Basic features of HDFS:

 * while it stores data blocks in a distributed fashion, metadata are managed
   centrally by the NameNode daemon
 * timestamps are stored with millisecond precision

This means that no matter which node you are on, when you ask for the timestamp
of a file, you will see the same millisecond-precision value every time.
There is some caching to make this practical, but I haven't looked into such
details.

Example
-------

~~~
# hadoop fs -put /etc/passwd /tmp/passwd
# hadoop fs -stat '%Y' /tmp/passwd
1435327127796
~~~

Timestamps on GlusterFS Hadoop
==============================

Compared to HDFS, GlusterFS stores both data and metadata in a distributed
manner. Timestamps are stored directly on the bricks (like any other metadata),
so the underlying filesystem (XFS in our case) stores them and provides the
answer.

When a 2-way replicated volume is used and the glusterfs native client asks for
the timestamp of a given file, the timestamp will be taken from either brick1
or brick2 of the replicated pair. This means that 2 clients can receive
2 different answers (based on which glusterfs peer they ask). To mitigate this
problem, all peer nodes run ntpd, so the timestamps are usually reasonably
close to each other. On the other hand, we can be quite sure that such
timestamps won't be identical.

XFS stores timestamps with submillisecond precision, but timestamps reported by
the rhs-hadoop plugin have their 3 least significant digits trimmed, so the
timestamp effectively has second precision only.

Example
-------

On the master/management node I create a file and check its last-modification
timestamp:

~~~
# hadoop fs -put /etc/passwd /tmp/passwd
... output removed ...
# stat /mnt/glusterfs/HadoopVol1/tmp/passwd | grep ^Modify
Modify: 2015-06-29 21:11:09.462729840 +0200
# stat -c "%Y" /mnt/glusterfs/HadoopVol1/tmp/passwd
1435605069
# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435605069000
~~~

Note that:

 * glusterfs provides submillisecond timestamps (as provided by XFS)
 * the hadoop plugin trims the timestamp even without the new feature
   introduced in this BZ (the new fs.glusterfs.timestamp.trim property)

Now on each storage node which forms a replicated pair in 2-way replicated
gluster volume:

Node1:

~~~
[root@dhcp-37-194 ~]# stat /mnt/glusterfs/HadoopVol1/tmp/passwd | grep ^Modify
Modify: 2015-06-29 21:11:09.462729840 +0200
[root@dhcp-37-194 ~]# stat -c "%Y" /mnt/glusterfs/HadoopVol1/tmp/passwd
1435605069
[root@dhcp-37-194 ~]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435605069000
~~~

Node2:

~~~
[root@dhcp-37-195 ~]# stat /mnt/glusterfs/HadoopVol1/tmp/passwd | grep ^Modify
Modify: 2015-06-29 21:11:09.461607840 +0200
[root@dhcp-37-195 ~]# stat -c "%Y" /mnt/glusterfs/HadoopVol1/tmp/passwd
1435605069
[root@dhcp-37-195 ~]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435605069000
~~~

Note that:

 * ntpd runs properly on all glusterfs peer nodes
 * the timestamp on each brick really is different, but the difference is quite
   small (about 1 ms), so the second-precision timestamps (unix %Y) are the
   same and hadoop sees the same timestamp
 * the master/management node takes the timestamp value from the first node

In the previous example it all worked out in the end, but when this
few-millisecond difference falls on a second boundary, say like this:

 * brick1 on node1: 1435605070000 ms
 * brick2 on node2: 1435605069999 ms

the trimming doesn't help: one hadoop node would then see a value 1 second
larger/smaller than the other node.
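The boundary effect can be sketched in Python. This is an illustrative model of the trimming (assuming trim zeroes the N least significant decimal digits of the millisecond timestamp), not the actual rhs-hadoop code:

```python
# Illustrative model of the rhs-hadoop timestamp trimming (assumed
# semantics: zero out the N least significant decimal digits).
def trim(ts_ms, digits):
    """Zero out the `digits` least significant decimal digits of a
    millisecond timestamp."""
    return ts_ms - ts_ms % 10 ** digits

# Timestamps from the two bricks: only 1 ms apart, but straddling
# a second boundary.
brick1 = 1435605070000
brick2 = 1435605069999

# The default 3-digit trim (second precision) still yields different values:
print(trim(brick1, 3))  # 1435605070000
print(trim(brick2, 3))  # 1435605069000

# A larger trim only moves the boundary; here it even widens the gap to 10 s:
print(trim(brick1, 4))  # 1435605070000
print(trim(brick2, 4))  # 1435605060000
```

Note that no fixed trim value can remove the boundary entirely; it only makes hitting one less likely, which matches the measurements in the later comments.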

Conclusion
==========

Based on all this, I think that the following statement from comment 3 doesn't
hold:

> This bug is caused by clock sync issues when Hadoop is being extra picky on
> the timestamp synchronization.  The best fix is to get the clock skew fixed
> with NTP as close as possible.

I see this as a design problem which can't be easily fixed just by using ntp.

Comment 8 Martin Bukatovic 2015-06-29 20:09:14 UTC
The timestamp check Hadoop fails on in this BZ is part of local resource
management.

Before a Hadoop container can be executed on a given node, YARN copies all
required files (job metadata, the application jar file and required libraries)
from HDFS to the local filesystem (the exact place is configured via
yarn.nodemanager.local-dirs in yarn-site.xml).

As the description of this process states[1], YARN also stores the
last-modification timestamp of the resource on the remote file-system:

> LocalResource time-stamps
>
> As mentioned above, NodeManager tracks the last-modification timestamp of each
> LocalResource before container-start. Before downloading, NodeManager checks
> that the files haven’t changed in the interim. This is a way of giving
> a consistent view at the LocalResources – an application can use the very same
> file-contents all the time it runs without worrying about data corruption
> issues due to concurrent writers to the same file.
> 
> Once the file is copied from its remote location to one of the NodeManager’s
> local disks, it loses any connection to the original file other than the URL
> (used while copying). Any future modifications to the remote file are NOT
> tracked and hence if an external system has to update the remote resource – it
> should be done via versioning. YARN will fail containers that depend on
> modified remote resources to prevent inconsistencies.
> 
> Note that ApplicationMaster specifies the resource time-stamps to a NodeManager
> while starting any container on that node. Similarly, for the container running
> the ApplicationMaster itself, the client has to populate the time-stamps for
> all the resources that ApplicationMaster needs.
> 
> In case of a MapReduce application, the MapReduce JobClient determines the
> modification-timestamps of the resources needed by MapReduce ApplicationMaster.
> The ApplicationMaster itself then sets the timestamps for the resources needed
> by the MapReduce tasks.

In our case this means:

 * Hadoop really does need to check the full timestamp here
 * the check is done on each node, so when the timestamps differ enough,
   failure is inevitable (in other words, any difference in the
   second-precision timestamps is enough to trigger this BZ)

[1] http://hortonworks.com/blog/management-of-application-dependencies-in-yarn/
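The check described above can be modeled with a short illustrative sketch. All names here are hypothetical, not the actual Hadoop classes; the point is only to show how two nodes seeing different stat() answers for the same file produce the "Resource changed on src filesystem" failure:

```python
# Hypothetical model of YARN's LocalResource timestamp check (the real
# logic lives in Hadoop's resource localization code; these names are
# illustrative only).
def localize_resource(path, expected_ts_ms, stat_remote):
    """Refuse to localize a resource whose remote timestamp differs from
    the one recorded at job submission time."""
    actual_ts_ms = stat_remote(path)
    if actual_ts_ms != expected_ts_ms:
        raise IOError(
            "Resource %s changed on src filesystem"
            " (expected %d, was %d)" % (path, expected_ts_ms, actual_ts_ms))
    return path

# With glusterfs, the JobClient and a NodeManager may get their stat()
# answers from different bricks, so the same job can fail on one node only:
ts_seen_by_jobclient = 1435605070000   # stat answered from brick1
ts_seen_by_nodemanager = 1435605069000 # stat answered from brick2

try:
    localize_resource("/user/hive/.staging/job/libjars/a.jar",
                      ts_seen_by_jobclient,
                      lambda p: ts_seen_by_nodemanager)
except IOError as e:
    print(e)  # prints the "changed on src filesystem" style message
```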

Comment 9 Martin Bukatovic 2015-06-29 20:31:37 UTC
With the facts stated above in mind, I checked the new
fs.glusterfs.timestamp.trim property.

Since the plugin already trims the hadoop timestamp by 3 least significant
digits, values 0, 1, 2 and 3 of the new property have no effect.

fs.glusterfs.timestamp.trim = 0 or unconfigured:

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

fs.glusterfs.timestamp.trim = 1

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

fs.glusterfs.timestamp.trim = 2

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

fs.glusterfs.timestamp.trim = 3

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

So the first value which actually does something is 4:

fs.glusterfs.timestamp.trim = 4

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435077390000
~~~

But by doing this, we degrade the whole Hadoop filesystem to terrible
timestamp precision: we now have only 10 s precision.

Setting even larger values also works, e.g.:

fs.glusterfs.timestamp.trim = 7

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435320000000
~~~

fs.glusterfs.timestamp.trim = 13

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
0
~~~

But setting larger values doesn't make any sense to me, as it effectively
breaks the filesystem, and I consider it dangerous.

Moreover, when you set a really extreme value there, it obviously breaks the
filesystem completely:

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd #2>/dev/null
15/06/26 21:11:26 INFO glusterfs.GlusterVolume: Initializing gluster volume..
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: Initializing GlusterFS,  CRC disabled.
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: GIT INFO={git.commit.id.abbrev=3e043b8, git.commit.user.email=bchilds, git.commit.message.full=[update RPM spec file/changelog] - 2.4.0
, git.commit.id=3e043b8f230d995909675d9f0af1a52df8cc3f9e, git.commit.message.short=[update RPM spec file/changelog] - 2.4.0, git.commit.user.name=bchilds, git.build.user.name=Unknown, git.commit.id.describe=GA-25-g3e043b8-dirty, git.build.user.email=Unknown, git.branch=2.4.0, git.commit.time=16.06.2015 @ 16:16:25 EDT, git.build.time=17.06.2015 @ 09:31:07 EDT}
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: GIT_TAG=GA
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Initializing gluster volume..
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Gluster volume: HadoopVol1 at : /mnt/glusterfs/HadoopVol1
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Gluster volume: gv0 at : /mnt/gv0
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Gluster volume: HadoopVol2 at : /mnt/glusterfs/HadoopVol2
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Working directory is : glusterfs:/user/root
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Write buffer size : 131072
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Default block size : 67108864
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Directory list order : fs ordering
-stat: Fatal internal error
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "100000000000000000000000000000000000000000000000000000"
        at org.apache.hadoop.fs.glusterfs.GlusterVolume.setConf(GlusterVolume.java:205)
        at org.apache.hadoop.fs.RawLocalFileSystem.initialize(RawLocalFileSystem.java:92)
        at org.apache.hadoop.fs.glusterfs.GlusterVolume.initialize(GlusterVolume.java:90)
        at org.apache.hadoop.fs.FilterFileSystem.initialize(FilterFileSystem.java:90)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
        at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
        at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:308)
Caused by: java.lang.NumberFormatException: For input string: "100000000000000000000000000000000000000000000000000000"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:495)
        at java.lang.Integer.parseInt(Integer.java:527)
        at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1094)
        at org.apache.hadoop.fs.glusterfs.GlusterVolume.setConf(GlusterVolume.java:200)
        ... 20 more
~~~

But at least Ambari still works, so you are able to fix the problem and
reconfigure the property.

Conclusion
----------

To sum this up, I see only 2 reasonable values for this property: 0 and 4, and
even 4 is questionable because it degrades filesystem timestamps: chopping
off seconds is already noticeable to humans.

Comment 10 Martin Bukatovic 2015-06-29 20:50:11 UTC
To check whether the introduction of the fs.glusterfs.timestamp.trim property
actually fixed the problem, I executed the automated test `test_sqoop`
100 times (which takes about 25 hours).

First I executed 100 runs with fs.glusterfs.timestamp.trim set to 0 (undefined).
When I examine the results, I see 32 failures:

~~~
$ ./log-checker.py evidence/fs.glusterfs.timestamp.trim_0/test_sqoop.log.* 
1435240990000 1435240989000 1000 glusterfs:/user/hive/.staging/job_1435132870665_0782/libjars/mysql-connector-java-5.1.17.jar
1435247559000 1435247558000 1000 glusterfs:/user/hive/.staging/job_1435132870665_0991/libjars/mysql-connector-java.jar
1435252077000 1435252078000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1136/libjars/mysql-connector-java-5.1.17.jar
1435255991000 1435255990000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1263/libjars/ant-eclipse-1.0-jvm1.2.jar
1435256132000 1435256133000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1267/libjars/ant-eclipse-1.0-jvm1.2.jar
1435258439000 1435258438000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1321/libjars/mysql-connector-java.jar
1435261680000 1435261681000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1427/libjars/commons-io-1.4.jar
1435264639000 1435264638000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1523/libjars/ant-contrib-1.0b3.jar
1435272280000 1435272279000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1773/libjars/avro-1.5.3.jar
1435283500000 1435283499000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2106/libjars/avro-mapred-1.5.3.jar
1435284274000 1435284273000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2129/libjars/ant-eclipse-1.0-jvm1.2.jar
1435284341000 1435284342000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2131/libjars/avro-ipc-1.5.3.jar
1435284704000 1435284705000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2142/libjars/commons-io-1.4.jar
1435284949000 1435284950000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2149/libjars/snappy-java-1.0.3.2.jar
1435286091000 1435286092000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2184/libjars/jackson-mapper-asl-1.7.3.jar
1435288735000 1435288734000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2266/libjars/snappy-java-1.0.3.2.jar
1435289005000 1435289006000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2274/libjars/mysql-connector-java-5.1.17.jar
1435290240000 1435290239000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2311/libjars/jackson-core-asl-1.7.3.jar
1435293146000 1435293145000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2384/libjars/ant-contrib-1.0b3.jar
1435297500000 1435297499000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2514/libjars/jopt-simple-3.2.jar
1435299214000 1435299215000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2565/libjars/mysql-connector-java-5.1.17.jar
1435299301000 1435299302000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2567/libjars/ant-eclipse-1.0-jvm1.2.jar
1435301256000 1435301257000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2628/libjars/avro-mapred-1.5.3.jar
1435302379000 1435302378000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2662/libjars/avro-mapred-1.5.3.jar
1435302615000 1435302614000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2669/libjars/jackson-core-asl-1.7.3.jar
1435314229000 1435314230000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3011/job.xml
1435318057000 1435318056000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3129/libjars/avro-ipc-1.5.3.jar
1435329899000 1435329898000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3484/libjars/avro-ipc-1.5.3.jar
1435332210000 1435332211000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3555/libjars/paranamer-2.3.jar
1435335610000 1435335609000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3662/libjars/sqoop-1.4.4.2.1.7.0-784.jar
1435336793000 1435336792000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3700/libjars/mysql-connector-java.jar
1435343308000 1435343309000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3901/libjars/paranamer-2.3.jar
~~~

Note:

 * 1st column: expected timestamp
 * 2nd column: actual ("was") timestamp
 * 3rd column: difference of the timestamps
 * 4th column: affected file

As you can see, the difference is always 1 s and the affected file is part of
the libraries the job requires to run, or its metadata (as explained above).
So the more libraries a hadoop job requires, the more likely it is to fail
because of this problem.
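This intuition can be quantified with a rough back-of-the-envelope model (my own estimate, not taken from the bug data): if each localized file independently straddles a second boundary with some small probability, a job fails if any of its files does. The skew and file-count parameters below are assumptions, not measurements:

```python
# Rough failure model (assumptions: an effective per-file brick skew of
# `skew_ms`, uniformly placed within each `precision_ms` bucket, and
# independent files). A job localizing k resources fails if at least one
# of them hits a boundary mismatch.
def p_job_fails(k, skew_ms=1.0, precision_ms=1000.0):
    p_file = skew_ms / precision_ms      # one file straddles a boundary
    return 1 - (1 - p_file) ** k         # at least one of k files does

# At a 1 ms skew and ~10 libjars per job, this predicts roughly a 1 %
# failure rate per job:
print(p_job_fails(10))
```

For what it's worth, reproducing the observed ~27 failures in 100 runs would require an effective per-file skew of a few tens of ms (e.g. ~20 ms over ~15 files), so the real per-file mismatch probability is evidently higher than the naive 1 ms estimate.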

With fs.glusterfs.timestamp.trim set to 4 I see just a single failure:

~~~
$ ./log-checker.py evidence/fs.glusterfs.timestamp.trim_4/test_sqoop.log.* 
1435395730000 1435395740000 10000 glusterfs:/user/hive/.staging/job_1435132870665_5340/libjars/jackson-core-asl-1.7.3.jar
~~~

Note that the difference is 10 s now and that the trim property can't actually
fix this; it only makes this kind of failure less probable.

I haven't tried larger values, but as Daniel notes in comment 6, he sees
a failure even with fs.glusterfs.timestamp.trim set to 5. And as I noted
previously, I don't consider that a reasonable configuration to use.

Comment 11 Martin Bukatovic 2015-06-29 20:53:45 UTC
Based on the information I provided in previous comments, I am moving this
back to ASSIGNED and asking developers to re-evaluate.

Comment 12 Martin Bukatovic 2015-06-30 08:37:28 UTC
Small correction:

I made a mistake in comment 10: I executed more test runs in the trim = 0
case, but forgot to exclude those additional log files from the comparison.
The correct failure ratio, comparing the exact same number of test runs, is:

 * 27 failures (instead of 32 as stated in comment 10) when fs.glusterfs.timestamp.trim = 0
 * 1 failure when fs.glusterfs.timestamp.trim = 4

That said, it doesn't change any of the conclusions or other information I provided.

Comment 13 Martin Bukatovic 2015-06-30 08:41:23 UTC
Demonstration of fs.glusterfs.timestamp.trim on the timestamp:

$ ./timestamp-trim-example.py
trim 3: 1435653553000 -> 2015-06-30 10:39:13
trim 4: 1435653550000 -> 2015-06-30 10:39:10
trim 5: 1435653500000 -> 2015-06-30 10:38:20
trim 6: 1435653000000 -> 2015-06-30 10:30:00
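The original timestamp-trim-example.py is not attached to this BZ; a minimal sketch that reproduces output of this shape (assuming the trim zeroes the N least significant digits and a +0200 local timezone) could look like:

```python
from datetime import datetime, timedelta, timezone

# Minimal reconstruction of a trim demonstration (assumed semantics: zero
# out the N least significant decimal digits of the millisecond timestamp).
def trim(ts_ms, digits):
    return ts_ms - ts_ms % 10 ** digits

ts_ms = 1435653553123              # example millisecond timestamp (made up)
tz = timezone(timedelta(hours=2))  # +0200, as in the output above

for n in range(3, 7):
    trimmed = trim(ts_ms, n)
    human = datetime.fromtimestamp(trimmed / 1000, tz)
    print("trim %d: %d -> %s" % (n, trimmed,
                                 human.strftime("%Y-%m-%d %H:%M:%S")))
```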

Comment 16 Martin Bukatovic 2015-07-07 15:55:04 UTC
I have reported the root cause of this issue as BZ 1240739 to ensure that we
evaluate whether this can be fixed in gluster directly.

