Red Hat Bugzilla – Bug 765491
"Stale NFS file handle" when transferring a large number of files
Last modified: 2011-11-09 23:54:21 EST
1 volume set up as replicate 2, on 2 servers (glusterhost1, glusterhost2), one brick each. Running Ubuntu oneiric and Gluster 3.2.4.
glusterhost1 has a local mount of the volume exported as NFS (mount -o vers=3,proto=tcp -t nfs localhost:/volume /mnt/share)
Rsyncing thousands of small files (under 50 KB) from source_server to glusterhost1 works. If I cancel the rsync, wait 0-10 minutes, and restart it, every file transfer results in a "Stale NFS file handle" error.
After I restart the glusterhost servers, I can repeat the same sequence again.
I've also tested this with 3.3beta-2 and 3.1.7.
Remounting is enough to repeat the sequence; there is no need to restart the servers.
I was using the ext4 filesystem on the two disks/bricks that make up the volume; after changing to ext3 the problem was gone...
Strike that, the problem just appeared with ext3 too :(
Sorry for the many comments, I'm trying to debug and gather some info.
Here is what I have so far: I ran rsync with more verbosity, and it turns out that not all files generate the "Stale NFS file handle" error, only files in directory structures deeper than 16 levels (including the file itself), e.g. /1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16.gif
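To check which files in a source tree would hit this pattern before running rsync, something like the following could be used. It is only a sketch: the function name `deep_files` is made up for illustration, `-mindepth` is assumed to be available (GNU and BSD find have it), and the default threshold of 16 is chosen so that the example path above (16 components including the file) is reported. Adjust the threshold if the real cut-off turns out to be different.

```shell
# Hypothetical helper, not part of GlusterFS or rsync.
# Prints every regular file that sits at least MIN_DEPTH path components
# below ROOT, counting the file itself as one component.
# Usage: deep_files /path/to/source/tree 16
deep_files() {
    root=${1:-.}
    min_depth=${2:-16}
    # find counts ROOT itself as depth 0, so ROOT/1/2/.../15/16.gif is depth 16.
    find "$root" -mindepth "$min_depth" -type f
}
```

Running it against the rsync source directory lists the candidates; an empty result would suggest the tree is shallow enough not to trigger the errors described here.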
I have seen something at least similar; for me it also seems to favor files that are fairly deep in the structure.
See http://bugs.gluster.com/show_bug.cgi?id=3712 for my report.
I can usually get my file if I try to get a little closer to it and stat() it again. Everything works using the FUSE client though, and that is probably no less secure, although the NFS client has an advantage when it comes to caching because it has access to the VFS cache.
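One possible reading of "get a little closer and stat() it again" is to stat each parent directory on the way down before stat-ing the file itself. The sketch below does that; `stat_walk` is a made-up name, and whether this actually avoids the ESTALE error on a given setup is an assumption, not something confirmed in this report.

```shell
# Hypothetical helper: stat every path prefix from the top down,
# then stat the target itself and return that final status.
# Usage: stat_walk /mnt/share/1/2/3/file.gif
stat_walk() {
    target=$1
    rest=${target#/}
    # Absolute paths start from "/", relative ones from ".".
    case $target in
        /*) current= ;;
        *)  current=. ;;
    esac
    while [ -n "$rest" ]; do
        part=${rest%%/*}
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   rest= ;;
        esac
        # Skip empty components produced by doubled slashes.
        [ -n "$part" ] || continue
        current=$current/$part
        # Errors on intermediate components are ignored on purpose.
        stat "$current" >/dev/null 2>&1
    done
    stat "$target" >/dev/null
}
```

The exit status is that of the last stat, so the function can be used in a retry loop or an `if` test.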
Rudy, Peter: yes, there is a limitation on the directory depth when NFS is used. This is because we have to encode the directory path in the file handle that the NFS client uses to communicate with the NFS server. The file handle can be at most 64 bytes (a limit set by the RFC). We are bringing in some changes to overcome this limitation, and the directory depth issue will be fixed in future releases.
When you run the NFS server in TRACE mode, you can grep its log for "Dir depth validation failed"; that failure is what results in ESTALE ("Stale NFS file handle") errors on the NFS client.
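A minimal sketch of that check follows. The function name `count_depth_failures` is made up, and the log file location is deliberately left as an argument: /var/log/glusterfs/nfs.log is a common location for the Gluster NFS server log, but that path is an assumption here and may differ per installation.

```shell
# Hypothetical helper: count how many times the depth check failed
# in the given NFS server log (the log must have been written at TRACE level).
# Usage: count_depth_failures /var/log/glusterfs/nfs.log   # path is an assumption
count_depth_failures() {
    log_file=$1
    # -F treats the message as a fixed string, -c prints the number of matching lines.
    grep -F -c "Dir depth validation failed" "$log_file"
}
```

A non-zero count after a failed rsync run would point to the directory depth limit as the cause of the ESTALE errors, rather than some other problem.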
Can you supply more information about which future release this is scheduled for? Will it be possible to backport the fix or otherwise make it available sooner?
I have a couple of projects on hold because of this, and I know other people who are affected by this limitation.
Can you please check with this volume setting:
gluster volume set <volname> nfs.mem-factor 40
Please note that this will lead to the Gluster NFS server consuming more physical memory.
It works! I've completed the same job that previously gave the error, without any hiccups.
Why 40? :) Can I rely on this change to solve the problem completely, or should I set the value to something higher?
I appreciate your and Krishna's efforts and, as Peter Linder mentioned in a related bug report, we here at Systime are also ready to sponsor any development that would favor our use of GlusterFS.
Rudi, you can let it remain at 40 if you no longer see the ESTALE problem. Note, though, that this workaround will not fix the problem seen by Peter.