This is with glusterfs 3.0.0pre1 and also with the latest git commit (e98020d5f6f), patched with patch 2232 to fix a crash on 32-bit: http://patches.gluster.com/patch/2232/

Very basic test setup: two servers with posix->locks->iothreads and two clients with client->replicate->iothreads. One server has a 64-bit kernel and 64-bit userspace; the other has a 64-bit kernel and 32-bit userspace.

Steps to reproduce:

1. On one client, start a copy of one directory tree to another on the gluster mount.
2. After a minute, stop one of the servers.
3. After a minute, start the server again.
4. Stop the copy and run "ls -lR >/dev/null".

I can see that the auto-healing is working by watching the size of the directory on the server I stopped and started, but eventually I see errors like:

ls: cannot access ./uploads/xx/forums/users/0000/9519/1.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9519/2.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9519/1.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9540/1.jpg: Stale NFS file handle

These are logged as:

[2009-11-17 23:16:08] W [fuse-bridge.c:562:fuse_entry_cbk] glusterfs-fuse: 17844: LOOKUP() /test-copy/uploads/xx/forums/users/0000/9519/1.jpg => -1 (Stale NFS file handle)

The errors recur on subsequent runs of "ls -lR". Different files are listed as stale on the two clients. Restarting the gluster servers changes the list of files that are stale. Unmounting and remounting the gluster filesystem fixes the problem (though the two trees then appear to be different, so they were improperly auto-healed).
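For comparing the stale lists between the two clients, the "ls -lR" check can be approximated with a short script. This is only a rough sketch, assuming Python 3 is available on the client; find_stale_entries is a hypothetical helper name and is not part of glusterfs. It walks the mounted tree and collects every path whose lstat() fails with ESTALE, which is the errno that ls reports as "Stale NFS file handle".

```python
import errno
import os


def find_stale_entries(root):
    """Return the paths under root that fail with ESTALE.

    Hypothetical helper for illustration; not part of glusterfs.
    """
    stale = []

    def on_walk_error(err):
        # os.walk calls this when listing a directory itself fails.
        if err.errno == errno.ESTALE:
            stale.append(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat forces a lookup of the entry, much like "ls -l" does.
                os.lstat(path)
            except OSError as err:
                if err.errno == errno.ESTALE:
                    stale.append(path)
    return stale


# Example use (the mount point is an assumption):
#     for path in find_stale_entries("/mnt/gluster/test-copy"):
#         print(path)
```

Running this on both clients and diffing the output should show which files each client considers stale.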
Thanks for testing the release and reporting this. We've seen this bug too, and are working on fixing it.
PATCH: http://patches.gluster.com/patch/2339 in master (cluster/afr: Fix handling of revalidate lookups.)
The patch above should fix this issue. I'm marking it as fixed; please re-open this if you still see the bug with 3.0.0.