Bug 157266

Summary: RHEL 4 needs ESTALE pathname resolution logic
Product: Red Hat Enterprise Linux 4 Reporter: Chuck Lever <cel>
Component: kernelAssignee: Peter Staubach <staubach>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0CC: bnocera, dmair, fleite, hgarcia, jlayton, jmbastia, jplans, steved, syeghiay, tao, xdl-redhat-bugzilla
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-27 16:49:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 176344, 198694, 391511, 409961, 461297    
Attachments:
Description Flags
patch to retry lookup if the stat call fails
none
updated patch, fix vfs_lstat the same way
none
patch against upstream, fix vfs_stat, vfs_lstat and sys_readlinkat
none
new patch for vfs_stat, vfs_lstat and sys_readlinkat none

Description Chuck Lever 2005-05-09 22:43:05 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

Description of problem:
A patch to retry pathname resolution if an ESTALE error occurs is now in the
2.6 mainline (2.6.12-rc1 or rc2).  This patch should be backported to RHEL 4's
2.6.9-based kernel.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. mount any NFS server
2. "ls" a directory on that mount
3. go to the server and remove the directory, then recreate it
4. "ls" again on the client
  

Actual Results:  ls: '.': Stale file handle

Expected Results:  'ls' should have returned the contents of the new directory.

Additional info:

I would provide the changeset information for the specific 2.6 patch, but we
aren't using BK anymore.

Comment 1 Steve Dickson 2005-05-10 19:23:41 UTC
Was the patch post to a list or is it on a website somewhere?

Comment 3 Steve Dickson 2005-09-06 12:28:47 UTC
This should be fixed in RHEL4-U2 kernel

Comment 4 Bastien Nocera 2006-04-11 13:59:18 UTC
Still happens with kernel 2.6.9-34.EL

Comment 13 Jeff Bastian 2006-08-28 20:31:13 UTC
I tested this on two RHEL4 U4 NFS clients with a NetApp filer for the NFS server
and this bug is still present in U4 (kernel 2.6.9-42.EL).

Comment 14 RHEL Program Management 2006-09-07 19:40:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 RHEL Program Management 2006-09-07 19:40:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 RHEL Program Management 2006-09-07 19:41:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 Jeff Layton 2006-11-08 14:00:22 UTC
I inserted a dump_stack() in __nfs_revalidate_inode when the getattr fails with
an ESTALE. Here's the result:

 [<e064196d>] __nfs_revalidate_inode+0x1d3/0x42e [nfs]
 [<c018557b>] dput+0x33/0x423
 [<c017c752>] link_path_walk+0x90/0xb9
 [<c01d39a2>] selinux_inode_getattr+0x48/0x50
 [<e06411fc>] nfs_getattr+0x45/0x69 [nfs]
 [<e06411b7>] nfs_getattr+0x0/0x69 [nfs]
 [<c0176ca5>] vfs_getattr+0x35/0x88
 [<c0176d20>] vfs_stat+0x28/0x3a
 [<c0134124>] do_sigaction+0x3e8/0x491
 [<c0177315>] sys_stat64+0xf/0x23
 [<c013462e>] sys_rt_sigaction+0x8f/0xf2
 [<c011d7e5>] do_page_fault+0x0/0x4dc
 [<c031a39b>] error_code+0x2f/0x38
 [<c0319903>] syscall_call+0x7/0xb

...so we apparently need to handle the ESTALE gracefully somewhere along here.


Comment 21 Jeff Layton 2006-11-10 18:37:17 UTC
Confirmed that problem exists in kernel.org 2.6.15 kernels as well. Looking now
to see if it's ever actually been fixed.


Comment 22 Jeff Layton 2006-11-10 18:44:18 UTC
Also exists in 2.6.18-rc7, and in Linville's wireless-dev git tree (which seems
to be ~2.6.19-rc4 for VFS/NFS stuff).



Comment 23 Jeff Layton 2006-11-10 20:14:12 UTC
Created attachment 140933 [details]
patch to retry lookup if the stat call fails

Here's what I think is happening.

The first lookup occurs and populates the dcache/icache. The dir is then
removed and recreated on the server, which changes the filehandle. The client
then makes a stat() call which ends up in vfs_stat. vfs_stat uses
user_path_walk() to populate the nameidata struct, but because the attribute
cache hasn't timed out, it presumes that the info it has is already good.
vfs_stat then calls vfs_getattr, which calls nfs_getattr, which forces an on
the wire lookup because the filesystem is mounted with atimes enabled. This
generates the ESTALE, and invalidates the inode.

The attached patch is one way to fix it. If user_path_walk() works, but the
next call ends up with a -ESTALE, then we can presume the above has occurred
and force a retry on the lookup. If that also fails with an -ESTALE then we'd
return with that error.

We'd probably also want to fix up vfs_lstat too (and maybe some other
functions, I'll troll through them).

Another possibility might be to set the LOOKUP_REVAL flag in the original
lookup to force a true on-the-wire revalidation, but that would probably mean
more otw operations overall.

Peter, what do you think about this approach?

Comment 24 Peter Staubach 2006-11-10 20:20:49 UTC
It seems like something like this will be required, but this will
require some careful analysis and checking to ensure that it will
not loop indefinitely.

We can't set LOOKUP_REVAL on the original lookup.  The performance cost
would be too high.

The set of system calls which need to be checked is the entire set
which take pathnames as their arguments.

Comment 25 Jeff Layton 2006-11-10 20:30:55 UTC
Created attachment 140934 [details]
updated patch, fix vfs_lstat the same way

This patch fixes vfs_lstat in the same way. I don't see any other calls that
would need to be fixed this way right offhand.

Also, my description of the problem was a little off. nfs_getattr doesn't force
an otw lookup, rather it forces an otw getattr call.

These patches assume that it's not possible for user_path_walk to repeatedly
succeed but to repeatedly get an ESTALE from the vfs_getattr call. Such a
situation would make an infinite loop, obviously, but I don't see how that
could occur.

Comment 26 Peter Staubach 2006-11-10 20:37:51 UTC
_All_ system calls which take pathnames as an argument _must_ be
fixed in this fashion.

The trick is to ensure that the lookup process eventually either
fails with ENOENT or continues to purge stale caches until the
ESTALE is discovered by validating the attributes on the root of
the file system.  Then, the error, ENOENT, should be returned
instead of ESTALE.

Without this trick, if the root of a mount point goes stale,
then the client will loop forever.

Comment 27 Peter Staubach 2006-11-10 20:43:56 UTC
The code also needs to take the current working directory of the process
into account as well.  If the current working directory becomes stale,
then a relative pathname lookup may also cause an infinite loop if this
situation is handled correctly.

Comment 28 Jeff Layton 2006-11-10 21:30:25 UTC
Ok, this issue is a quite a bit more complex than I had originally thought. :-)

Am I wrong that most of the code you describe is already present in
__link_path_walk? I thought that it will revalidate dentries all the way back to
the root of the fs when it hit an ESTALE (via the link_path_walk wrapper).
Though, I'm not certain how it handles the situation when the root of the fs is
itself stale.

If that is already handled, then we'd just need to patch up the syscalls a'la
the patch above, correct?



Comment 29 Jeff Layton 2006-11-13 13:51:38 UTC
Here's a list of syscalls from a recent git kernel that I identified that seem
to take pathnames as args. I built this list quickly, so it might be incomplete.
Some of these are just wrappers around common functions so the amount of work
might not be so bad. Still, with so many, perhaps it would be better to do this
in stages?

sys_truncate
sys_stat
sys_statfs
sys_statfs64
sys_lstat
sys_newstat
sys_newlstat
sys_stat64
sys_lstat64
sys_truncate64
sys_setxattr
sys_lsetxattr
sys_getxattr
sys_lgetxattr
sys_listxattr
sys_llistxattr
sys_removexattr
sys_lremovexattr
sys_pivot_root
sys_chroot
sys_mknod
sys_link
sys_symlink
sys_unlink
sys_rename
sys_chmod
sys_creat
sys_open
sys_access
sys_chown
sys_lchown
sys_chown16
sys_lchown16
sys_getcwd
sys_mkdir
sys_chdir
sys_rmdir
sys_swapon
sys_swapoff
sys_uselib
sys_mknodat
sys_mkdirat
sys_unlinkat
sys_symlinkat
sys_linkat
sys_renameat
sys_futimesat
sys_faccessat
sys_fchmodat
sys_fchownat
sys_openat
sys_newfstatat
sys_fstatat64
sys_readlinkat
compat_sys_futimesat
compat_sys_newfstatat
compat_sys_openat
kernel_execve


Comment 30 Jeff Layton 2006-11-13 14:44:28 UTC
Created attachment 141049 [details]
patch against upstream, fix vfs_stat, vfs_lstat and sys_readlinkat

Here's a slightly updated patch. Since the same problem exists upstream, it's
probably best to work on fixing it there first. This patches up vfs_stat,
vfs_lstat and vfs_readlinkat (since they all exist in fs/stat.c).

Basically, I'm doing the same thing as before, but also making sure that
LOOKUP_REVAL is set when we redo the lookup on an ESTALE. My thinking is that
the proper place to put the ESTALE handling (if it doesn't already exist there)
is in the pathname resolution, so fixing up the syscalls to just reenter that
code and force revalidation is probably what we want.

Thoughts?

Comment 31 Peter Staubach 2006-11-13 16:29:39 UTC
Thank you for the constructing that list of system calls which may need
to be modified!

The attached patch is a little buggy but seems along the right lines.
We need to ensure that LOOKUP_REVAL does the right thing so that the
pathname lookup portion works right.  We also need to ensure that it
is not possible for the kernel to loop forever if it is not possible
to recover.

Comment 32 Jeff Layton 2006-11-13 17:11:33 UTC
Created attachment 141073 [details]
new patch for vfs_stat, vfs_lstat and sys_readlinkat

Doh! Yeah, I see where I missed the label in the lstat call. I've not tested
any of the newer patches, so these shouldn't be taken too literally, more
RFC's.

This patch adds the missing label, and also adds a retried flag. If we retry
the lookup and still end up in the same codepath, then we won't retry again
(and hence won't loop forever). Is this what you were more meaning by ensuring
that we don't loop forever?

Also, I checked and the only place that seems to check the LOOKUP_REVAL flag
is:

nfs_lookup_verify_inode

no other filesystems seem to care about it. Not sure if that would be big deal
or not...

Comment 33 Peter Staubach 2006-11-13 18:11:02 UTC
I am mostly concerned about the pathname lookups which don't actually
require any over the wire lookups, like looking up "." or "".

Comment 45 RHEL Program Management 2007-05-09 11:19:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 50 Jeff Layton 2008-01-09 17:48:07 UTC
*** Bug 401551 has been marked as a duplicate of this bug. ***

Comment 67 RHEL Program Management 2008-09-03 13:07:26 UTC
Updating PM score.

Comment 73 RHEL Program Management 2009-01-27 16:49:28 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.