Description of problem:

While running the following copy loop (test32 is a 32 MB file):

$ for i in `seq 30`; do hadoop fs -copyFromLocal test32 dir1/test32_$i; done

I listed the directory on a different machine with the same GlusterFS volume mounted:

$ hadoop fs -ls dir1
14/02/19 10:31:13 INFO glusterfs.GlusterVolume: Initializing gluster volume..
14/02/19 10:31:13 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
14/02/19 10:31:13 INFO glusterfs.GlusterFileSystem: Initializing GlusterFS, CRC disabled.
14/02/19 10:31:13 INFO glusterfs.GlusterFileSystem: GIT INFO={git.commit.id.abbrev=7b04317, git.commit.user.email=jayunit100, git.commit.message.full=Merge pull request #80 from jayunit100/2.1.6_release_fix_sudoers include the sudoers file in the srpm, git.commit.id=7b04317ff5c13af8de192626fb40c4a0a5c37000, git.commit.message.short=Merge pull request #80 from jayunit100/2.1.6_release_fix_sudoers, git.commit.user.name=jay vyas, git.build.user.name=Unknown, git.commit.id.describe=2.1.6, git.build.user.email=Unknown, git.branch=master, git.commit.time=07.02.2014 @ 12:06:31 EST, git.build.time=07.02.2014 @ 13:58:44 EST}
14/02/19 10:31:13 INFO glusterfs.GlusterFileSystem: GIT_TAG=2.1.6
14/02/19 10:31:13 INFO glusterfs.GlusterFileSystem: Configuring GlusterFS
14/02/19 10:31:13 INFO glusterfs.GlusterVolume: Initializing gluster volume..
14/02/19 10:31:13 INFO glusterfs.GlusterVolume: Root of Gluster file system is /mnt/glusterfs
14/02/19 10:31:13 INFO glusterfs.GlusterVolume: mapreduce/superuser daemon : yarn
14/02/19 10:31:14 INFO glusterfs.GlusterVolume: Working directory is : glusterfs:/user/test1
14/02/19 10:31:14 INFO glusterfs.GlusterVolume: Write buffer size : 131072
Found 15 items
-ls: Fatal internal error
java.lang.RuntimeException: Error while running command to get file permissions : org.apache.hadoop.util.Shell$ExitCodeException: /bin/ls: cannot access /mnt/glusterfs/user/test1/dir1/test32_15._COPYING_: No such file or directory
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
	at org.apache.hadoop.util.Shell.run(Shell.java:379)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
	at org.apache.hadoop.fs.glusterfs.Util.execCommand(Util.java:38)
	at org.apache.hadoop.fs.glusterfs.GlusterFileStatus.loadPermissionInfo(GlusterFileStatus.java:87)
	at org.apache.hadoop.fs.glusterfs.GlusterFileStatus.getOwner(GlusterFileStatus.java:70)
	at org.apache.hadoop.fs.shell.Ls.adjustColumnWidths(Ls.java:130)
	at org.apache.hadoop.fs.shell.Ls.processPaths(Ls.java:101)
	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
	at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:89)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)
	at org.apache.hadoop.fs.glusterfs.GlusterFileStatus.loadPermissionInfo(GlusterFileStatus.java:110)
	at org.apache.hadoop.fs.glusterfs.GlusterFileStatus.getOwner(GlusterFileStatus.java:70)
	at org.apache.hadoop.fs.shell.Ls.adjustColumnWidths(Ls.java:130)
	at org.apache.hadoop.fs.shell.Ls.processPaths(Ls.java:101)
	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
	at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:89)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.44rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64
hadoop-2.2.0.2.0.6.0-101.el6.x86_64
hadoop-client-2.2.0.2.0.6.0-101.el6.x86_64
hadoop-hdfs-2.2.0.2.0.6.0-101.el6.x86_64
hadoop-libhdfs-2.2.0.2.0.6.0-101.el6.x86_64
hadoop-lzo-0.5.0-1.x86_64
hadoop-lzo-native-0.5.0-1.x86_64
hadoop-mapreduce-2.2.0.2.0.6.0-101.el6.x86_64
hadoop-yarn-2.2.0.2.0.6.0-101.el6.x86_64
hadoop-yarn-nodemanager-2.2.0.2.0.6.0-101.el6.x86_64
python-rhsm-1.8.17-1.el6_4.x86_64
redhat-storage-logos-60.0.17-1.el6rhs.noarch
rhs-hadoop-2.1.6-2.noarch
rhs-hadoop-install-0_65-2.el6rhs.noarch

How reproducible:
Intermittent (a few percent of runs).

Steps to Reproduce:
1. Run the copy loop above on one machine.
2. While it runs, repeatedly list the directory from another machine with the same GlusterFS volume mounted.

Actual results:
Listing files in the directory sometimes fails with the exception above while new files are being created in it.
Expected results:
No exception in the above scenario (more than 1,000,000 copies of small files, each a few MB).
Per bug triage between dev, PM, and QA, moving these out of Denali.
Per the Apr-02 bug triage meeting, granting both devel and pm acks.
Martin,

This sounds like a FUSE problem we were experiencing earlier with the plugin. I haven't been able to reproduce it on my setup. Could you verify which OS you are testing on? I know the FUSE patch is in RHS, but other OSes may not have it. Was the setup done with the rhs-hadoop-install script? There are some other namespace caching settings you may need to turn off that also affect consistency.

These two need to be in the fstab or gluster mount options:

entry-timeout=0,attribute-timeout=0

And these need to be set on the volume:

gluster volume set $VOLNAME quick-read off
gluster volume set $VOLNAME cluster.eager-lock on
gluster volume set $VOLNAME performance.stat-prefetch off
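For illustration, this is roughly how the two mount options above could be carried in /etc/fstab so they survive a reboot. The server name (rhs-node1) and volume name (HadoopVol) are placeholders, not values from this report; only the entry-timeout/attribute-timeout options come from the comment above.

```
# Hypothetical /etc/fstab entry; host and volume names are placeholders.
# entry-timeout=0,attribute-timeout=0 disable FUSE dentry/attribute caching
# so metadata changes made by other clients are seen immediately.
rhs-node1:/HadoopVol  /mnt/glusterfs  glusterfs  defaults,_netdev,entry-timeout=0,attribute-timeout=0  0 0
```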
Bradley,

For testing we use only OS images which have all the required settings and package versions. "entry-timeout=0,attribute-timeout=0" is set by the rhs-hadoop-install script. I did the additional setup too:

gluster volume set $VOLNAME quick-read off
gluster volume set $VOLNAME cluster.eager-lock on
gluster volume set $VOLNAME performance.stat-prefetch off

and then ran on one node:

$ for i in `seq 100`; do hadoop fs -copyFromLocal test32 dir1/test32_$i; done

and on a second node:

$ for i in `seq 150`; do hadoop fs -ls dir1; done

and 5 exceptions were raised.

For re-testing I used this ISO: http://download.eng.brq.redhat.com/nightly/latest-RHSS-2.1.bd/2.1.bd/RHS/x86_64/os (ISO for the HTB release), subscribed via $ rhn-channel to rhel-x86_64-server-6.4.z with all updates.
Created attachment 890447 [details] anaconda kickstart
BZs not targeted for Denali.
I've prepared machines for reproducing this BZ and sent an email to Bradley with the connection details. Have you been able to reproduce this issue, Bradley?
Thanks Martin, here's what's happening:

The Hadoop copy command writes data to a temporary file during the actual I/O, then renames the file once the copy completes successfully. The window between the copy finishing and the rename allows directory listings to change underneath a reader. The ls command that occasionally fails takes a directory listing at the start of a copy, then queries extra per-file info after the copy completes and the file is renamed. By then the directory contents it originally queried have changed, and the error is displayed.

This scheme is also quite inefficient for Gluster, which calculates file placement from a hash of the file name: after the rename the file has to be re-hashed, probably landing on another node.

Copy performance aside, a typical application would block file or directory access from another thread/system, but that's not an option in Hadoop: there is no file or directory locking. We could minimize the impact by skipping the lazy load of a file's permissions/ownership, but that would be a big performance hit, and most file ownership/permission information is never checked during the course of a job. It would also only shorten the timing window; it could not guarantee the bug is fixed, just reduce the frequency at which it occurs (which is already pretty low).

The best practice (and workaround) is for users to use the native Linux shell commands directly to view and manipulate the Gluster data. Users should avoid the hadoop fs commands, as they are subject to these timing issues and can't be optimized for Gluster's file handling.
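The copy-to-temp-then-rename behavior and the tolerant listing that would avoid this error can be sketched in Python. This is a hypothetical illustration, not Hadoop's actual Java implementation; only the "._COPYING_" suffix comes from the trace above, and the function names are invented for the example.

```python
import os
import shutil

def copy_from_local(src, dst):
    """Mimic hadoop fs -copyFromLocal: write to a temp file, then rename.

    The window between copyfile() returning and rename() completing is
    when a concurrent lister can see the ._COPYING_ entry."""
    tmp = dst + "._COPYING_"   # suffix seen in the exception above
    shutil.copyfile(src, tmp)
    os.rename(tmp, dst)        # atomic on a POSIX filesystem

def tolerant_ls(directory):
    """List a directory, skipping entries that vanish between the
    initial listing and the per-file stat (the failure mode above,
    where /bin/ls could no longer access the renamed temp file)."""
    entries = []
    for name in sorted(os.listdir(directory)):
        try:
            size = os.stat(os.path.join(directory, name)).st_size
        except FileNotFoundError:
            continue           # renamed or removed after listdir()
        entries.append((name, size))
    return entries
```

Hadoop's FsShell instead fails hard when the per-file metadata query raises, which is why the window is visible to users at all.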
You have changed the status of this bug to MODIFIED. Does that mean you've created a patch which fixes this issue?
Closing as WONTFIX, since the required change is in upstream Apache Hadoop core.