Bug 762121 (GLUSTER-389)

Summary: auto-heal fails randomly and causes "Stale NFS file handle" errors
Product: [Community] GlusterFS
Reporter: John Leach <john>
Component: replicate
Assignee: Vikas Gorur <vikas>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Version: mainline
CC: gluster-bugs, vijay
Hardware: All
OS: Linux
Doc Type: Bug Fix
Bug Blocks: 762118

Description John Leach 2009-11-17 23:48:28 UTC
This is with glusterfs 3.0.0pre1 and also the latest git commit (e98020d5f6f), but patched with patch 2232 to fix a crash on 32-bit: http://patches.gluster.com/patch/2232/

Very basic test setup: two servers with posix->locks->iothreads and two clients with client->replicate->iothreads.

One server has a 64-bit kernel and 64-bit userspace; the other has a 64-bit kernel and 32-bit userspace.

Steps to reproduce:

1. Start a copy of one directory tree to another on the gluster mount on one client.
2. After a minute, stop one of the servers.
3. After a minute, start the server again.
4. Stop the copy and run "ls -lR >/dev/null".

I can see that the auto-healing is working by watching the size of the directory on the server I stopped and started, but eventually I see errors like:

ls: cannot access ./uploads/xx/forums/users/0000/9519/1.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9519/2.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9519/1.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9540/1.jpg: Stale NFS file handle

these are logged as:

[2009-11-17 23:16:08] W [fuse-bridge.c:562:fuse_entry_cbk] glusterfs-fuse: 17844: LOOKUP() /test-copy/uploads/xx/forums/users/0000/9519/1.jpg => -1 (Stale NFS file handle)
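
To collect the affected paths from a client log, something like the following could be used. It is only a sketch: the pattern is modeled on the single log line quoted above, so other GlusterFS versions or messages may need a different expression, and the names `STALE_LOOKUP` and `stale_lookup_paths` are made up for illustration.

```python
import re

# Pattern modeled on the one log line quoted in this report (an assumption:
# other releases may format the message differently).
STALE_LOOKUP = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] W \[fuse-bridge\.c:\d+:fuse_entry_cbk\] "
    r"glusterfs-fuse: \d+: LOOKUP\(\) (?P<path>.+) "
    r"=> -1 \(Stale NFS file handle\)$"
)


def stale_lookup_paths(lines):
    """Return the set of paths whose LOOKUP failed with a stale file handle."""
    paths = set()
    for line in lines:
        match = STALE_LOOKUP.match(line.strip())
        if match:
            paths.add(match.group("path"))
    return paths
```

Passing an open log file to `stale_lookup_paths` yields one entry per distinct path, even when the same file is reported several times.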

These recur on subsequent runs of "ls -lR". Different files are listed as stale on the two different clients. Restarting the gluster servers changes the list of files that are stale. Unmounting and remounting the gluster filesystem fixes the problem (though the two trees then appear to be different, so they were improperly auto-healed).
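
To compare the stale lists between clients or between runs, a small script can record exactly which paths fail instead of reading the "ls -lR" output by eye. The sketch below assumes that calling lstat() on each entry, roughly what "ls -l" does, is enough to surface the error; `find_stale_paths` is a hypothetical helper name, not part of GlusterFS.

```python
import errno
import os


def find_stale_paths(root):
    """Walk root and return the sorted paths that fail with ESTALE."""
    stale = set()

    def record(err):
        # os.walk reports directory-listing failures through this callback.
        if err.errno == errno.ESTALE and err.filename is not None:
            stale.add(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=record):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
            except OSError as err:
                if err.errno == errno.ESTALE:
                    stale.add(path)
    return sorted(stale)
```

Running this against the mount point on each client and diffing the two results would show which files each client considers stale.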

Comment 1 Vikas Gorur 2009-11-18 03:05:48 UTC
Thanks for testing the release and reporting this. We've seen this bug too, and are working on fixing it.

Comment 2 Anand Avati 2009-11-24 09:36:44 UTC
PATCH: http://patches.gluster.com/patch/2339 in master (cluster/afr: Fix handling of revalidate lookups.)

Comment 3 Vikas Gorur 2009-12-07 11:23:16 UTC
The patch above should fix this issue. I'm marking it as fixed; please re-open this if you see the bug with 3.0.0.