762121 – (GLUSTER-389) auto-heal fails randomly and causes "Stale NFS file handle" errors

Summary: auto-heal fails randomly and causes "Stale NFS file handle" errors

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	GLUSTER-389
Product:	GlusterFS
Classification:	Community
Component:	replicate
Sub Component:
Version:	mainline
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Vikas Gorur
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	GLUSTER-386

Reported:	2009-11-17 23:48 UTC by John Leach
Modified:	2009-12-07 14:23 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description John Leach 2009-11-17 23:48:28 UTC

This occurs with glusterfs 3.0.0pre1 and also with the latest git commit (e98020d5f6f), patched with patch 2232 to fix a crash on 32-bit: http://patches.gluster.com/patch/2232/

The test setup is very basic: two servers with posix->locks->iothreads and two clients with client->replicate->iothreads.

One server runs a 64-bit kernel with 64-bit userspace; the other runs a 64-bit kernel with 32-bit userspace.

Steps to reproduce:

Start a copy of one directory tree to another on the gluster mount on one client.

After a minute, stop one of the servers.

After a minute, start the server again.

Stop the copy and run "ls -lR >/dev/null".

I can see that the auto-healing is working by watching the size of the directory on the server I stopped and started, but eventually I see errors like:

ls: cannot access ./uploads/xx/forums/users/0000/9519/1.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9519/2.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9519/1.jpg: Stale NFS file handle
ls: cannot access ./uploads/xx/forums/users/0000/9540/1.jpg: Stale NFS file handle

These are logged as:

[2009-11-17 23:16:08] W [fuse-bridge.c:562:fuse_entry_cbk] glusterfs-fuse: 17844: LOOKUP() /test-copy/uploads/xx/forums/users/0000/9519/1.jpg => -1 (Stale NFS file handle)
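To get the full list of affected files out of a client log, something like the following could be used. It is only a sketch: the pattern is modelled on the single warning line quoted above, and the function name `stale_lookup_paths` is made up for illustration. Other GlusterFS versions may word the message differently.

```python
import re

# Pattern modelled on the one log line quoted in this report (an assumption:
# other builds may format the warning differently).
LOOKUP_FAILURE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+W\s+"
    r"\[fuse-bridge\.c:\d+:fuse_entry_cbk\]\s+glusterfs-fuse:\s+"
    r"\d+:\s+LOOKUP\(\)\s+(?P<path>.+?)\s+=>\s+-1\s+"
    r"\(Stale NFS file handle\)\s*$"
)


def stale_lookup_paths(log_lines):
    """Return the set of paths whose LOOKUP failed with a stale file handle."""
    paths = set()
    for line in log_lines:
        match = LOOKUP_FAILURE.match(line)
        if match:
            paths.add(match.group("path"))
    return paths
```

Feeding it the lines of each client's log gives one set of paths per client, which makes the lists easy to compare.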

The errors recur on subsequent runs of "ls -lR". Different files are listed as stale on each of the two clients. Restarting the gluster servers changes the list of stale files. Unmounting and remounting the gluster filesystem clears the errors (though the two trees then appear to be different, so they were improperly auto-healed).
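Since the stale list differs per client and changes after a server restart, it may help to capture it in a form that can be diffed. The sketch below walks a tree and records every entry whose lstat() fails with ESTALE, which is roughly what "ls -lR" trips over. The function name `find_stale_paths` is hypothetical, and the root directory is whatever mount point the setup uses (none is given in this report).

```python
import errno
import os


def find_stale_paths(root):
    """Walk root and return sorted relative paths that fail with ESTALE."""
    stale = []

    def on_walk_error(error):
        # os.walk reports directories it cannot list through this callback.
        if error.errno == errno.ESTALE and error.filename is not None:
            stale.append(os.path.relpath(error.filename, root))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            full_path = os.path.join(dirpath, name)
            try:
                os.lstat(full_path)  # per-entry metadata lookup
            except OSError as error:
                if error.errno == errno.ESTALE:
                    stale.append(os.path.relpath(full_path, root))
    return sorted(set(stale))
```

Running it against the mount point on each client, before and after a server restart, would give lists that can be compared directly.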

Comment 1 Vikas Gorur 2009-11-18 03:05:48 UTC

Thanks for testing the release and reporting this. We've seen this bug too, and are working on fixing it.

Comment 2 Anand Avati 2009-11-24 09:36:44 UTC

PATCH: http://patches.gluster.com/patch/2339 in master (cluster/afr: Fix handling of revalidate lookups.)

Comment 3 Vikas Gorur 2009-12-07 11:23:16 UTC

The patch above should fix this issue. I'm marking it as fixed; please re-open this if you see the bug with 3.0.0.
