765491 – (GLUSTER-3759) "Stale NFS file handle" when transferring large amount of files


Summary: "Stale NFS file handle" when transferring large amount of files

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	GLUSTER-3759
Product:	GlusterFS
Classification:	Community
Component:	nfs
Sub Component:
Version:	3.2.4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Krishna Srinivas
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-10-27 19:32 UTC by Rudi Meyer
Modified:	2011-11-10 04:54 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Rudi Meyer 2011-10-27 19:32:55 UTC

One volume set up as replicate 2 across two servers (glusterhost1, glusterhost2), one brick each, running Ubuntu oneiric and Gluster 3.2.4.
glusterhost1 has a local mount of the volume, exported over NFS (mount -o vers=3,proto=tcp -t nfs localhost:/volume /mnt/share).

Rsyncing thousands of small files (under 50 KB) from source_server to glusterhost1 works at first. If I cancel the rsync, wait 0-10 minutes, and restart it, every file transfer fails with a "Stale NFS file handle" error.

After I restart the glusterhost servers, I can repeat the same sequence again.

Comment 1 Rudi Meyer 2011-10-28 09:30:37 UTC

I've also tested this with 3.3beta-2 and 3.1.7.

I can repeat the process just by remounting; there is no need to restart.

Comment 2 Rudi Meyer 2011-10-28 15:44:32 UTC

I was using the ext4 filesystem on the two disks/bricks that make up the volume; after changing to ext3 the problem was gone...

Comment 3 Rudi Meyer 2011-10-28 15:45:39 UTC

Strike that, the problem just appeared with ext3 too :(

Comment 4 Rudi Meyer 2011-10-28 16:04:45 UTC

Sorry for the many comments, I'm trying to debug and gather some info.

Here is what I have so far: I ran rsync with more verbosity, and it turns out that not all files generate the "Stale NFS file handle" error, only files in directory structures deeper than 16 levels (including the file itself), e.g. /1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16.gif
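
To check which files in a tree might be affected, a minimal Python sketch follows. It counts path components including the file itself, as in the example above. The threshold of 16 and the "at or beyond" comparison are assumptions taken from this comment (the exact boundary is not clear from the report), and the function names are hypothetical.

```python
import os
from pathlib import PurePosixPath


def path_depth(path):
    """Count path components below the root, including the file itself."""
    return len([part for part in PurePosixPath(path).parts if part != "/"])


def paths_at_or_beyond(paths, threshold=16):
    """Return the paths whose depth is at least `threshold` (assumed boundary)."""
    return [p for p in paths if path_depth(p) >= threshold]


def deep_files(root, threshold=16):
    """Walk `root` and yield files whose depth relative to `root` is at least `threshold`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if path_depth(rel) >= threshold:
                yield rel
```

Running deep_files() against the rsync source directory would list the candidates; depth here is measured relative to the directory passed in, which is assumed to map to the volume root.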

Comment 5 Peter Linder 2011-10-30 11:05:47 UTC

I have seen something at least similar. For me it also seems to favor files that are fairly deep in the structure.

See http://bugs.gluster.com/show_bug.cgi?id=3712 for my report.

I can usually get my file if I try to get a little closer to it and stat() it again. Everything works using the fuse client though, and that is probably no less secure, although the nfs client has an advantage when it comes to caching because it has access to the vfs cache stuff.

Comment 6 Krishna Srinivas 2011-10-31 03:14:20 UTC

Rudi, Peter, yes, there is a limitation on the directory depth when NFS is used. This is because we have to encode the directory path in the file handle that the NFS client uses to communicate with the NFS server. The file handle can be 64 bytes at most (a limit imposed by the RFC). We are bringing in some changes to overcome this limitation, and the directory depth issue will be fixed in future releases.

When you run the NFS server in TRACE mode, you can grep for "Dir depth validation failed", which is the condition that results in ESTALE ("Stale NFS file handle") errors on the NFS client.
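
As a rough illustration of that check, the Python sketch below counts log lines containing the message quoted above. The message string comes from this comment; the log file location is not stated here, so the path is left as a parameter, and the function names are hypothetical.

```python
# Message quoted in the comment above; logged when the NFS server runs in TRACE mode.
MARKER = "Dir depth validation failed"


def count_depth_failures(lines):
    """Count the lines that contain the depth-validation message."""
    return sum(1 for line in lines if MARKER in line)


def count_in_log(log_path):
    """Count matching lines in a log file; `log_path` is supplied by the caller."""
    with open(log_path, errors="replace") as handle:
        return count_depth_failures(handle)
```

A non-zero count while the rsync is failing would suggest that the ESTALE errors come from the depth check rather than from some other cause.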

Comment 7 Rudi Meyer 2011-10-31 07:40:06 UTC

Can you supply more information on which future release this is scheduled for? Will it be possible to backport it or otherwise make it available sooner?
I have a couple of projects on hold because of this, and I know other people who are affected by this limitation.

Comment 8 Vijay Bellur 2011-10-31 07:44:49 UTC

Rudi,

Can you please check with this volume setting:

gluster volume set <volname> nfs.mem-factor 40

Please note that this would lead to the Gluster NFS server consuming more physical memory.

Thanks,
Vijay

Comment 9 Rudi Meyer 2011-10-31 08:04:32 UTC

It works! I've completed the same job that previously gave the error, without any hiccups.

Why 40? :) Can I rely on this change to solve the problem completely, or should I set the value to something higher?

I appreciate your and Krishna's efforts and, like Peter Linder mentioned in a related bug report, we here at Systime are also ready to sponsor any development that would favor our use of GlusterFS.

Comment 10 Krishna Srinivas 2011-10-31 08:48:52 UTC

Rudi, you can let it remain at 40 if you no longer see the ESTALE problem. Note, though, that this workaround will not fix the problem seen by Peter.
