This bug is caused by clock sync issues when Hadoop is being extra picky on the timestamp synchronization. The best fix is to get the clock skew fixed with NTP as close as possible. If that doesn't work, then there's a new feature in the rhs-hadoop plugin which allows adjusting the precision of the timestamps. Set fs.glusterfs.timestamp.trim to 5 in core-site.xml to truncate the 4 least significant digits on the timestamp.
Created attachment 1043019 [details]
Example run of sqoop import which fails because of the "Resource changed on src filesystem" error.

I'm able to reproduce it quite reliably with sqoop import/import-all-tables commands with fs.glusterfs.timestamp.trim=0 (unconfigured).

Environment:

Red Hat Enterprise Linux Server release 6.7 Beta (Santiago)
HDP 2.1
glusterfs-3.7.1-4.el6rhs.x86_64
hadoop-2.4.0.2.1.7.0-784.el6.x86_64
rhs-hadoop-2.4.0-4.el6.noarch
rhs-hadoop-install-2_54-1.el6rhs.noarch

Example log in attachment.
When I run the same test with sqoop import/import-all-tables commands on the same/similar cluster with fs.glusterfs.timestamp.trim=5, it fails only once in 100 runs. Without the fs.glusterfs.timestamp.trim=5 configuration (as described in comment 5 above), it fails a few times in 10 runs.
First of all, I would like to reformulate the issue here. Please correct me if there is something wrong with my explanation.

Timestamps on HDFS
==================

Basic features of HDFS:

* while it stores data blocks in a distributed fashion, metadata are managed centrally by the NameNode daemon
* timestamps are stored with millisecond precision

This means that no matter which node you are on, when you ask for the timestamp of a file, you will see the same millisecond-precision value every time. There is some caching to make this practical, but I haven't looked into such details.

Example
-------

~~~
# hadoop fs -put /etc/passwd /tmp/passwd
# hadoop fs -stat '%Y' /tmp/passwd
1435327127796
~~~

Timestamps on GlusterFS Hadoop
==============================

Compared to HDFS, GlusterFS stores both data and metadata in a distributed manner. Timestamps are stored directly on the bricks (like any other metadata), so the underlying filesystem (XFS in our case) stores them and provides the answer.

When a 2-way replicated volume is used and the glusterfs native client asks for the timestamp of a given file, the timestamp is taken either from brick1 or brick2 of the replicated pair. This means that 2 clients can receive 2 different answers, depending on which glusterfs peer they ask. To mitigate this problem, all peer nodes run ntpd, so the timestamps are usually reasonably close to each other. On the other hand, we can be quite sure that such timestamps won't be identical.

XFS stores timestamps with sub-millisecond precision, but timestamps reported by the rhs-hadoop plugin are trimmed of their 3 least significant digits, so the timestamp effectively has second precision only.

Example
-------

On the master/management node I create a file and check its last-modification timestamp:

~~~
# hadoop fs -put /etc/passwd /tmp/passwd
... output removed ...
# stat /mnt/glusterfs/HadoopVol1/tmp/passwd | grep ^Modify
Modify: 2015-06-29 21:11:09.462729840 +0200
# stat -c "%Y" /mnt/glusterfs/HadoopVol1/tmp/passwd
1435605069
# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435605069000
~~~

Note that:

* glusterfs provides sub-millisecond timestamps (as provided by XFS)
* the hadoop plugin trims the timestamp even without the new feature introduced in this BZ (the new property fs.glusterfs.timestamp.trim)

Now on each storage node which forms the replicated pair in the 2-way replicated gluster volume:

Node1:

~~~
[root@dhcp-37-194 ~]# stat /mnt/glusterfs/HadoopVol1/tmp/passwd | grep ^Modify
Modify: 2015-06-29 21:11:09.462729840 +0200
[root@dhcp-37-194 ~]# stat -c "%Y" /mnt/glusterfs/HadoopVol1/tmp/passwd
1435605069
[root@dhcp-37-194 ~]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435605069000
~~~

Node2:

~~~
[root@dhcp-37-195 ~]# stat /mnt/glusterfs/HadoopVol1/tmp/passwd | grep ^Modify
Modify: 2015-06-29 21:11:09.461607840 +0200
[root@dhcp-37-195 ~]# stat -c "%Y" /mnt/glusterfs/HadoopVol1/tmp/passwd
1435605069
[root@dhcp-37-195 ~]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435605069000
~~~

Note that:

* ntpd runs properly on all glusterfs peer nodes
* the timestamp on each brick is really different, but the difference is quite small (about 1 ms), so the second-precision timestamps (unix %Y) are the same and hadoop sees the same timestamp
* the master/management node takes the timestamp value from the first node

In the previous example it all worked out in the end, but when those few milliseconds of difference fall on a second boundary, say:

* brick on node1: 1435605070000 ms
* brick on node2: 1435605069999 ms

the trimming doesn't help. One hadoop node would then see a value 1 second larger/smaller than the other node.

Conclusion
==========

Based on all this, I think that the following statement from comment 3 doesn't hold:

> This bug is caused by clock sync issues when Hadoop is being extra picky on
> the timestamp synchronization. The best fix is to get the clock skew fixed
> with NTP as close as possible.

I see this as a design problem which can't be easily fixed just by using ntp.
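To make the boundary case above concrete, here is a tiny numeric sketch (my own illustration, not code from the rhs-hadoop plugin):

~~~
# Two brick timestamps only 1 ms apart can still land in different seconds,
# so trimming to second precision does not make them equal.
brick_node1_ms = 1435605070000
brick_node2_ms = 1435605069999

def to_seconds_precision(ts_ms):
    # drop the 3 least significant digits, as the plugin does by default
    return (ts_ms // 1000) * 1000

print(to_seconds_precision(brick_node1_ms))  # 1435605070000
print(to_seconds_precision(brick_node2_ms))  # 1435605069000 -> still 1 s apart
~~~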
The timestamp check Hadoop fails on in this BZ is part of local resource management. Before a Hadoop container can be executed on a given node, YARN copies all required files (job metadata, application jar file and required libraries) from HDFS into the local filesystem (the exact place is configured in yarn.nodemanager.local-dirs of yarn-site). As the description of this process states [1], YARN also stores the last-modification timestamp of the resource on the remote file-system:

> LocalResource time-stamps
>
> As mentioned above, NodeManager tracks the last-modification timestamp of each
> LocalResource before container-start. Before downloading, NodeManager checks
> that the files haven’t changed in the interim. This is a way of giving
> a consistent view at the LocalResources – an application can use the very same
> file-contents all the time it runs without worrying about data corruption
> issues due to concurrent writers to the same file.
>
> Once the file is copied from its remote location to one of the NodeManager’s
> local disks, it loses any connection to the original file other than the URL
> (used while copying). Any future modifications to the remote file are NOT
> tracked and hence if an external system has to update the remote resource – it
> should be done via versioning. YARN will fail containers that depend on
> modified remote resources to prevent inconsistencies.
>
> Note that ApplicationMaster specifies the resource time-stamps to a NodeManager
> while starting any container on that node. Similarly, for the container running
> the ApplicationMaster itself, the client has to populate the time-stamps for
> all the resources that ApplicationMaster needs.
>
> In case of a MapReduce application, the MapReduce JobClient determines the
> modification-timestamps of the resources needed by MapReduce ApplicationMaster.
> The ApplicationMaster itself then sets the timestamps for the resources needed
> by the MapReduce tasks.

In our case it means:

* Hadoop really needs to check the full timestamp here
* the check is done on each node, so when the timestamps differ enough, the failure is inevitable (in other words, any difference in the second-precision timestamps is enough for this BZ to be triggered); see the sketch below this comment

[1] http://hortonworks.com/blog/management-of-application-dependencies-in-yarn/
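For illustration, a minimal sketch of the check described above. This is not the actual YARN implementation; the function and parameter names are hypothetical and os.stat on a mounted path stands in for the remote filesystem call:

~~~
import os

def localize_resource(remote_path, expected_mtime_ms):
    # Hypothetical helper: re-read the remote modification time right before
    # copying the resource and compare it with the timestamp recorded earlier
    # by the JobClient/ApplicationMaster.
    actual_mtime_ms = int(os.stat(remote_path).st_mtime * 1000)
    if actual_mtime_ms != expected_mtime_ms:
        raise IOError("Resource %s changed on src filesystem "
                      "(expected %d, was %d)"
                      % (remote_path, expected_mtime_ms, actual_mtime_ms))
    # ...otherwise copy the file into one of yarn.nodemanager.local-dirs...
~~~

The important points are that the comparison uses the full (millisecond) timestamp value and that it runs independently on every node localizing the resource, so any node that sees a different value fails the container.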
Having in mind the facts stated above, I checked the new property fs.glusterfs.timestamp.trim. Since the plugin already trims the 3 least significant digits off the hadoop timestamp, values 0, 1, 2 and 3 of the new property have no effect.

fs.glusterfs.timestamp.trim = 0 or unconfigured:

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

fs.glusterfs.timestamp.trim = 1

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

fs.glusterfs.timestamp.trim = 2

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

fs.glusterfs.timestamp.trim = 3

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435324986000
~~~

So the first value which actually does something is 4:

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435077390000
~~~

But by doing this, we degrade timestamp precision for the whole Hadoop filesystem: we are down to 10 s precision now. Setting even larger values also works, e.g.:

fs.glusterfs.timestamp.trim = 7

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
1435320000000
~~~

fs.glusterfs.timestamp.trim = 13

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd 2>/dev/null
0
~~~

But setting larger values doesn't make any sense to me, as it effectively breaks the filesystem, and I consider it dangerous. Moreover, when you set a really absurd value there, it obviously breaks the filesystem completely:

~~~
[root@dhcp-37-182 tmp]# hadoop fs -stat "%Y" /tmp/passwd #2>/dev/null
15/06/26 21:11:26 INFO glusterfs.GlusterVolume: Initializing gluster volume..
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: Initializing GlusterFS, CRC disabled.
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: GIT INFO={git.commit.id.abbrev=3e043b8, git.commit.user.email=bchilds, git.commit.message.full=[update RPM spec file/changelog] - 2.4.0 , git.commit.id=3e043b8f230d995909675d9f0af1a52df8cc3f9e, git.commit.message.short=[update RPM spec file/changelog] - 2.4.0, git.commit.user.name=bchilds, git.build.user.name=Unknown, git.commit.id.describe=GA-25-g3e043b8-dirty, git.build.user.email=Unknown, git.branch=2.4.0, git.commit.time=16.06.2015 @ 16:16:25 EDT, git.build.time=17.06.2015 @ 09:31:07 EDT}
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: GIT_TAG=GA
15/06/26 21:11:26 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Initializing gluster volume..
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Gluster volume: HadoopVol1 at : /mnt/glusterfs/HadoopVol1
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Gluster volume: gv0 at : /mnt/gv0
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Gluster volume: HadoopVol2 at : /mnt/glusterfs/HadoopVol2
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Working directory is : glusterfs:/user/root
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Write buffer size : 131072
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Default block size : 67108864
15/06/26 21:11:27 INFO glusterfs.GlusterVolume: Directory list order : fs ordering
-stat: Fatal internal error
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "100000000000000000000000000000000000000000000000000000"
    at org.apache.hadoop.fs.glusterfs.GlusterVolume.setConf(GlusterVolume.java:205)
    at org.apache.hadoop.fs.RawLocalFileSystem.initialize(RawLocalFileSystem.java:92)
    at org.apache.hadoop.fs.glusterfs.GlusterVolume.initialize(GlusterVolume.java:90)
    at org.apache.hadoop.fs.FilterFileSystem.initialize(FilterFileSystem.java:90)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
    at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
    at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
    at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:308)
Caused by: java.lang.NumberFormatException: For input string: "100000000000000000000000000000000000000000000000000000"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:495)
    at java.lang.Integer.parseInt(Integer.java:527)
    at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1094)
    at org.apache.hadoop.fs.glusterfs.GlusterVolume.setConf(GlusterVolume.java:200)
    ... 20 more
~~~

But at least Ambari still works, so you are able to fix the problem and reconfigure the property.

Conclusion
----------

To sum this up, I see only 2 reasonable values for this property: 0 and 4, and even 4 is questionable because it breaks filesystem timestamps - chopping off seconds is already human-noticeable.
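Based purely on the values observed above (this is my reconstruction, not the actual rhs-hadoop plugin code), the trimming presumably behaves like this:

~~~
def reported_timestamp(raw_ms, trim):
    # The plugin always drops the 3 sub-second digits; trim values above 3
    # appear to drop additional digits of the already second-precision value.
    digits = max(trim, 3)            # values 0-3 behave like 3
    factor = 10 ** digits
    return (raw_ms // factor) * factor

raw = 1435324986123                  # hypothetical raw ms timestamp from XFS
for trim in (0, 3, 4, 7, 13):
    print("trim %2d -> %d" % (trim, reported_timestamp(raw, trim)))
# trim  0 -> 1435324986000
# trim  3 -> 1435324986000
# trim  4 -> 1435324980000
# trim  7 -> 1435320000000
# trim 13 -> 0
~~~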
To check whether the introduction of the property fs.glusterfs.timestamp.trim actually fixed the problem, I have executed the automated test `test_sqoop` 100 times (which takes about 25 hours).

First I executed 100 runs with fs.glusterfs.timestamp.trim set to 0 (undefined). When I examine the results, I see 32 failures:

~~~
$ ./log-checker.py evidence/fs.glusterfs.timestamp.trim_0/test_sqoop.log.*
1435240990000 1435240989000 1000 glusterfs:/user/hive/.staging/job_1435132870665_0782/libjars/mysql-connector-java-5.1.17.jar
1435247559000 1435247558000 1000 glusterfs:/user/hive/.staging/job_1435132870665_0991/libjars/mysql-connector-java.jar
1435252077000 1435252078000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1136/libjars/mysql-connector-java-5.1.17.jar
1435255991000 1435255990000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1263/libjars/ant-eclipse-1.0-jvm1.2.jar
1435256132000 1435256133000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1267/libjars/ant-eclipse-1.0-jvm1.2.jar
1435258439000 1435258438000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1321/libjars/mysql-connector-java.jar
1435261680000 1435261681000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1427/libjars/commons-io-1.4.jar
1435264639000 1435264638000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1523/libjars/ant-contrib-1.0b3.jar
1435272280000 1435272279000 1000 glusterfs:/user/hive/.staging/job_1435132870665_1773/libjars/avro-1.5.3.jar
1435283500000 1435283499000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2106/libjars/avro-mapred-1.5.3.jar
1435284274000 1435284273000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2129/libjars/ant-eclipse-1.0-jvm1.2.jar
1435284341000 1435284342000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2131/libjars/avro-ipc-1.5.3.jar
1435284704000 1435284705000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2142/libjars/commons-io-1.4.jar
1435284949000 1435284950000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2149/libjars/snappy-java-1.0.3.2.jar
1435286091000 1435286092000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2184/libjars/jackson-mapper-asl-1.7.3.jar
1435288735000 1435288734000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2266/libjars/snappy-java-1.0.3.2.jar
1435289005000 1435289006000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2274/libjars/mysql-connector-java-5.1.17.jar
1435290240000 1435290239000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2311/libjars/jackson-core-asl-1.7.3.jar
1435293146000 1435293145000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2384/libjars/ant-contrib-1.0b3.jar
1435297500000 1435297499000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2514/libjars/jopt-simple-3.2.jar
1435299214000 1435299215000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2565/libjars/mysql-connector-java-5.1.17.jar
1435299301000 1435299302000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2567/libjars/ant-eclipse-1.0-jvm1.2.jar
1435301256000 1435301257000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2628/libjars/avro-mapred-1.5.3.jar
1435302379000 1435302378000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2662/libjars/avro-mapred-1.5.3.jar
1435302615000 1435302614000 1000 glusterfs:/user/hive/.staging/job_1435132870665_2669/libjars/jackson-core-asl-1.7.3.jar
1435314229000 1435314230000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3011/job.xml
1435318057000 1435318056000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3129/libjars/avro-ipc-1.5.3.jar
1435329899000 1435329898000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3484/libjars/avro-ipc-1.5.3.jar
1435332210000 1435332211000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3555/libjars/paranamer-2.3.jar
1435335610000 1435335609000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3662/libjars/sqoop-1.4.4.2.1.7.0-784.jar
1435336793000 1435336792000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3700/libjars/mysql-connector-java.jar
1435343308000 1435343309000 1000 glusterfs:/user/hive/.staging/job_1435132870665_3901/libjars/paranamer-2.3.jar
~~~

Note:

* 1st column: expected timestamp
* 2nd column: actual ("was") timestamp
* 3rd column: difference of the timestamps
* 4th column: affected file

So as you can see, the difference is always 1 s, and the affected file is one of the libraries the job requires to run, or its metadata (as explained above). So the more libraries a hadoop job requires, the more likely it is to fail because of this problem.

With fs.glusterfs.timestamp.trim set to 4 I see just a single failure:

~~~
$ ./log-checker.py evidence/fs.glusterfs.timestamp.trim_4/test_sqoop.log.*
1435395730000 1435395740000 10000 glusterfs:/user/hive/.staging/job_1435132870665_5340/libjars/jackson-core-asl-1.7.3.jar
~~~

Note that the difference is 10 s now and that the trim property can't actually fix this - it only makes this kind of failure less probable. I haven't tried larger values, but as Daniel notes in comment 6, he can see one failure even with fs.glusterfs.timestamp.trim set to 5. And as I have noted previously, I don't consider that a reasonable configuration to use.
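For completeness, a rough sketch of what a checker like log-checker.py could do. The actual script is not attached, so both its structure and the exact YARN error wording assumed in the regular expression are assumptions on my part:

~~~
#!/usr/bin/env python
# Hypothetical reimplementation: scan test logs for YARN localization
# failures and print "expected was difference file", one record per line.
import re
import sys

# YARN reports localization failures roughly like:
#   Resource <url> changed on src filesystem (expected <ts>, was <ts>
PATTERN = re.compile(
    r"Resource (?P<url>\S+) changed on src filesystem "
    r"\(expected (?P<expected>\d+), was (?P<was>\d+)")

for log_file in sys.argv[1:]:
    with open(log_file) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                expected = int(m.group("expected"))
                was = int(m.group("was"))
                print("%d %d %d %s"
                      % (expected, was, abs(expected - was), m.group("url")))
~~~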
Based on the information I provided in the previous comments, I am moving this back to ASSIGNED, asking the developers to re-evaluate.
Small correction: I made a mistake in comment 10: I had executed more test runs for the trim = 0 case, but forgot to exclude those additional log files from the comparison. The correct failure ratio, comparing the exact same number of test runs, is:

* 27 failures (instead of 32 as stated in comment 10) when fs.glusterfs.timestamp.trim = 0
* 1 failure when fs.glusterfs.timestamp.trim = 4

That said, it doesn't change any conclusions or other information I provided.
Demonstration of the effect of fs.glusterfs.timestamp.trim on the timestamp:

$ ./timestamp-trim-example.py
trim 3: 1435653553000 -> 2015-06-30 10:39:13
trim 4: 1435653550000 -> 2015-06-30 10:39:10
trim 5: 1435653500000 -> 2015-06-30 10:38:20
trim 6: 1435653000000 -> 2015-06-30 10:30:00
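The script itself is not attached; a minimal sketch of what such a demonstration could look like (assuming the trim simply zeroes out the given number of least significant digits of the millisecond timestamp and prints the result in local time):

~~~
#!/usr/bin/env python
# Plausible reconstruction of timestamp-trim-example.py, not the real script.
import datetime

def trim_timestamp(ts_ms, trim):
    """Zero out the `trim` least significant digits of a ms timestamp."""
    factor = 10 ** trim
    return (ts_ms // factor) * factor

ts_ms = 1435653553123  # example ms timestamp; the sub-second part is made up
for trim in (3, 4, 5, 6):
    trimmed = trim_timestamp(ts_ms, trim)
    human = datetime.datetime.fromtimestamp(trimmed / 1000.0)
    print("trim %d: %d -> %s"
          % (trim, trimmed, human.strftime("%Y-%m-%d %H:%M:%S")))
~~~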
I have reported the root cause of this issue as BZ 1240739 to ensure that we evaluate if this can be fixed in gluster directly.