Description of problem:

When heavy concurrent file and directory access over NFS is coupled with server-side updates, the client sometimes misses a server attribute update and keeps persistent stale negative dentries for a long time. This leads to a file or directory being listed by a getdents call but having its existence denied when accessed directly.

Example on a local dir /local/foo NFS-exported to the localhost on /nfs/foo:

[root@dl145 ~]# ls -l /local/foo
total 4
drwxr-xr-x  3 root root 4096 Jan 30 00:25 bar
[root@dl145 ~]# ls -l /nfs/foo
total 0
?---------  ? ?    ?       ?            ? bar
[root@dl145 ~]# stat /nfs/foo/bar
stat: cannot stat `/nfs/foo/bar': No such file or directory
[root@dl145 ~]# stat /local/foo/bar
  File: `/local/foo/bar'
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 28180483    Links: 3
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-02-14 23:39:48.000000000 -0500
Modify: 2007-01-30 00:25:17.000000000 -0500
Change: 2007-02-14 23:39:48.000000000 -0500

This was obtained with the hit_nfs.py script attached, on a 4-way SMP box.

Version-Release number of selected component (if applicable):
I just reproduced on 2.6.9-42.ELsmp. I'll work on newer versions now.

How reproducible:
Needs the proper timing to hit what is likely a race, perhaps either in dcache handling or in link_path_walk (the -ESTALE case seems racy). The Python script attached reproduces the problem every few times it's run.

The script creates a thread that removes the directory contents and restores them (as if by an rm -fr followed by untarring a data tarball) every 5 seconds on the server side (/local/foo), while 300 threads on the client side (/nfs/foo) alternate between stating /nfs/foo/bar, stating /nfs/foo/bar/baz/file9 and opening/closing /nfs/foo/bar with O_DIRECTORY every 2 seconds. This may sound weird, but it was extracted from an actual usage pattern in an enterprise deployment, and the syscall pattern of tar, rm and the client programs was replicated in the Python script to create a "clean room" reproducer.

Steps to Reproduce:
1. export /local/foo via NFS to localhost
2. mount localhost:/local/foo /nfs/foo
3. ./hit_nfs.py
4. check /nfs/foo contents for problems, and goto 3 if the system recovers too fast to analyse

"Good" runs can leave the problematic ls -l output above in place for a few hours.

Actual results:
See above.

Expected results:
The stale negative dentries should be detected and cleared at some point.

Additional info:
Trond Myklebust's patches 007 and 008 from http://linux-nfs.org/Linux-2.6.x/2.6.20/ can clear up the stale negative dentries when nfs_readdir is called. A mixed version that applies cleanly to RHEL-4 U4 is attached. I still don't know the root cause, though, and those patches don't help for the case where files are accessed directly (i.e. without an ls on the directory or even tab-completion on bash to stat the file ;-).

Cheers!
Fábio
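For context, the client-side load is essentially many threads hammering stat() on the same paths. A minimal C analogue of one of those loops might look like this (a sketch only; the authoritative reproducer is the attached hit_nfs.py, and the path and thread count below are illustrative):

/* Sketch: a simplified C analogue of the stat_file threads in
 * hit_nfs.py. Path and thread count are illustrative, not the
 * script's. Build with: cc -o hammer hammer.c -lpthread */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 100

static void *stat_loop(void *arg)
{
	struct stat st;
	for (;;) {
		if (stat("/nfs/foo/bar/baz/file9", &st) < 0)
			fprintf(stderr, "stat: %s\n", strerror(errno));
		sleep(2);	/* each thread paces itself at ~2 seconds */
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, stat_loop, NULL);
	pthread_join(tids[0], NULL);	/* runs until interrupted */
	return 0;
}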
Created attachment 148096 [details] Python script to reproduce the bug.
Created attachment 148097 [details] Trond Myklebust's dentry pruning patches combined and ported to RHEL-4.
Let's try to shoot for 4.6?
I just reproduced this even without setting the mtime on the directories back to a point in time, like tar does. This probably means some race in path lookup? I'll attach the enhanced reproducer, that has a bit more synchronization between threads so that it does not trigger on false positives.
Created attachment 148440 [details] Enhanced python script that reproduces the problem.
I just reproduced it even when commenting out the code that creates the client open threads. I'm trying to reproduce again while capturing traffic.
Update! I just reproduced on RHEL-4 U4 by running only 100 stat_file threads on the python reproducer script (the creation of the other threads, stat_dir and open_dir, was commented out, as well as changing mtime on the server). It still produces this:

# ls -laR /nfs/foo
/nfs/foo:
total 8
drwxr-xr-x  3 root root 4096 Feb 20 20:16 .
drwxr-xr-x  3 root root 4096 Feb 14 13:49 ..
?---------  ? ?    ?       ?            ? bar

Notice the script was stating /nfs/foo/bar/baz/file9, and yet the bar directory, merely a component in the lookup, was affected.

Now we know that:
- changing the mtime on the server or not is irrelevant;
- having multiple concurrent sys_stat calls on the client while the server is being updated is enough to trigger the problem.
It was also reproduced on RHEL-5, although ls -l on the problematic directory seems to refresh it and get rid of the problem. If you just stat the directory or file directly, though, you still get a wrong -ENOENT back.

And to come full circle, it was also reproduced on RHEL-3, although RHEL-3 seems to recover by itself in about 60 seconds. The ls command on RHEL-3 actually prints a "No such file or directory" for the broken entries, instead of listing them with question marks. At the syscall level, though, it's still -ENOENT.
Given that the problem seems to be related to path lookups over dentries that point to stale NFS inodes, I thought of helping the system get rid of positive dentries that point to stale NFS inodes, or negative dentries whose parent dentry points to a stale NFS inode, as soon as possible. I added a few conditions to nfs_dentry_delete, with the patch below:

--- linux-2.6.9/fs/nfs/dir.c.original 2007-02-27 00:14:45.000000000 -0500
+++ linux-2.6.9/fs/nfs/dir.c 2007-02-27 09:50:37.000000000 -0500
@@ -793,6 +793,19 @@
 	 * files will be cleaned up during umount */
 		return 1;
 	}
+	if (dentry->d_inode && NFS_STALE(dentry->d_inode)) {
+		/* Unhash it if it points to a stale inode */
+		printk("Deleting dentry that points to stale NFS inode!\n");
+		printk(" %s/%s\n", dentry->d_parent->d_name.name, dentry->d_name.name);
+		return 1;
+	}
+	if (!dentry->d_inode && NFS_STALE(dentry->d_parent->d_inode)) {
+		/* Unhash it if it's a negative dentry for a stale parent */
+		printk("Deleting negative dentry on stale parent NFS inode!\n");
+		printk(" %s/%s\n", dentry->d_parent->d_name.name, dentry->d_name.name);
+		dump_stack();
+		return 1;
+	}
 	return 0;
 }

The problem still reproduces, even when I see the messages above in the kernel log (sorry for the big stack dumps):

SELinux: initialized (dev 0:14, type nfs), uses genfs_contexts
Deleting dentry that points to stale NFS inode!
 baz/file9
Deleting negative dentry on stale parent NFS inode!
 bar/baz
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166b19>] __link_path_walk+0x332/0xbb5
 [<c014e221>] do_anonymous_page+0x1c9/0x1db
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e221>] do_anonymous_page+0x1c9/0x1db
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 bar/baz
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166b19>] __link_path_walk+0x332/0xbb5
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting dentry that points to stale NFS inode!
 bar/baz
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 baz/file9
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting dentry that points to stale NFS inode!
 bar/baz
Deleting negative dentry on stale parent NFS inode!
 bar/baz
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166b19>] __link_path_walk+0x332/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134d46>] futex_wake+0xb5/0xc5
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134d46>] futex_wake+0xb5/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar

Even with those problematic dentries apparently being deleted, we have:

# ls -la /nfs/foo
total 16
drwxr-xr-x  3 root root 4096 Feb 27 11:52 .
drwxr-xr-x  3 root root 4096 Feb 26 23:55 ..
?---------  ? ?    ?       ?            ? bar

This is getting weirder the harder I look. :)
In another twist to this issue, most of the packet captures I have seen so far clearly show the NFS server returning NFS3ERR_NOENT for those directory entries, with the lookup operations intermixed with access/getattr operations on the parent directory of those entries, so we may actually be seeing a bad timing problem on the _server_ side. Perhaps in that case we can just blame NFSv3. I'm running out of ideas. :)
Yes, could be a server problem, but my money is still on a client side problem. :-)
Has this been reproduced using two different systems, one as client and the other as server?
Yes, the original reproducer was on two different systems, just like the original actual problem. It's just that when it was found that having client and server on the same system would reproduce the problem in seconds or minutes instead of hours or days, we focused on that case. Sorry about that. :)

In any case, a really nice analysis by a partner turned our attention to the fact that, since the usable mtime resolution on NFS seems to be only one second, it is entirely possible that a negative dentry for a file is created in the same second in which the directory is updated to actually contain that entry, and neither the mtime nor the cache_change_attribute will catch it. The mtime will be the same as the one we just checked, and we have not done any metadata-changing action on the server that would affect cache_change_attribute.

I took a look at nfs_verify_change_attribute in RHEL-4 U4, and applied this simple change:

--- linux-2.6.9/include/linux/nfs_fs.h.original 2007-02-28 21:58:01.000000000 -0500
+++ linux-2.6.9/include/linux/nfs_fs.h 2007-02-28 22:09:20.000000000 -0500
@@ -301,7 +301,7 @@ static inline long nfs_save_change_attri
 static inline int nfs_verify_change_attribute(struct inode *inode, unsigned long chattr)
 {
 	return !nfs_caches_unstable(inode)
-		&& chattr == NFS_I(inode)->cache_change_attribute;
+		&& chattr > NFS_I(inode)->cache_change_attribute;
 }
 
 /*

In upstream code this test is made with time_after_eq, in which case we'd just switch to using time_after. With this change, the problem no longer reproduces, at least within the time span it used to (it used to reproduce in under a minute; now I haven't seen it in over an hour of testing).

I don't completely get all the logic inversions in nfs_neg_need_reval and the functions called from it, and having a "generation counter" be tested with time_after_eq also puzzles me a bit. In any case, for now I just care that the change above seems to keep the problem from happening. Maybe it means that the code no longer trusts any NFS dentry, and that I've multiplied NFS traffic by ten or more. But it surely seems to keep the problem from happening. :)
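To see the granularity problem outside NFS entirely, here is a hedged userspace illustration (my own sketch, not from the reproducer): a remove and recreate within the same second leaves the parent directory's whole-second mtime untouched, so any cache keyed on it believes nothing happened:

/* Sketch: shows that whole-second mtime comparison cannot detect a
 * remove+recreate that happens within the same second. Illustrative
 * userspace code, not kernel code. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat before, after;

	system("mkdir -p /tmp/foo && touch /tmp/foo/bar");
	stat("/tmp/foo", &before);

	/* remove and recreate the entry within the same second */
	unlink("/tmp/foo/bar");
	system("touch /tmp/foo/bar");
	stat("/tmp/foo", &after);

	/* With one-second resolution these usually compare equal, so a
	 * cache keyed on st_mtime alone would believe nothing changed. */
	printf("mtime %s\n", before.st_mtime == after.st_mtime ?
	       "unchanged (the change is invisible)" : "changed");
	return 0;
}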
Forgot to mention the change above is being tested alone, without any of Trond's patches.
Well, that change is not the right one (the NFS traffic is a lot higher), but it shows that the spot to change is indeed near nfs_neg_need_reval (the problem has not reproduced in a few hours, as the code now always looks up at the server). I'll think this through and try to come up with The Right Thing.
OK, I think this is the closest to The Right Thing I can get. The idea is: don't create negative dentries in nfs_lookup() when the mtime of the directory is the current second, because we cannot be sure the entry won't appear milliseconds later, and we wouldn't be able to notice. This assumes the time is synced between server and client, which is a good thing to have anyway.

It generates extra NFS traffic only during the second in which we avoid caching the lookup answer with a negative dentry, and I think there's nothing we can do about that. A second after the directory change we can cache negative dentries to our heart's content, since then we _know_ we will notice the mtime change when those entries come into existence.

It has not reproduced the problem in over 30 minutes. The change looks a little gross, but it works. Peter Staubach, what do you think about it? Maybe adding some comments on top of it will make it more palatable.

--- linux-2.6.9/fs/nfs/dir.c.original 2007-02-27 00:14:45.000000000 -0500
+++ linux-2.6.9/fs/nfs/dir.c 2007-03-01 23:13:10.000000000 -0500
@@ -862,7 +862,10 @@ static struct dentry *nfs_lookup(struct
 		error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, &fhandle, &fattr);
 		if (error == -ENOENT)
-			goto no_entry;
+			if (dir->i_mtime.tv_sec == get_seconds())
+				goto out_unlock;
+			else
+				goto no_entry;
 		if (error != 0)
 			goto out_unlock;
 	}

I'll see about talking this through with Trond Myklebust as well.

Cheers!
Fábio
Assuming that the time is synchronized between the client and the server is a no-go. Please do not introduce dependencies like this.

An approach that I have seen used is to verify all negative entries with an over-the-wire GETATTR on the directory in which they reside. This is done for all negative entries on writable file systems, while the normal attribute cache timeout is used for read-only file systems. We assume that read-only file systems change slowly, if at all, so the looser semantics are acceptable.
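If I read that suggestion right, the policy boils down to roughly the following (a conceptual userspace sketch under my own naming; getattr_over_the_wire() is a hypothetical stand-in for issuing a real GETATTR on the parent directory):

/* Conceptual sketch of the policy described above, not kernel code.
 * getattr_over_the_wire() is a hypothetical stand-in for a GETATTR
 * on the directory holding the negative entry. */
#include <stdbool.h>
#include <time.h>

struct dir_state {
	time_t last_attr_check;	/* when we last revalidated this dir */
};

static bool getattr_over_the_wire(struct dir_state *dir)
{
	dir->last_attr_check = time(NULL);
	return true;	/* pretend the attributes still matched */
}

static bool negative_entry_usable(struct dir_state *dir, bool read_only,
				  time_t ac_timeout)
{
	if (!read_only)
		/* writable export: always go to the server */
		return getattr_over_the_wire(dir);
	/* read-only export: assume slow change, honor the attr cache */
	if (time(NULL) - dir->last_attr_check < ac_timeout)
		return true;
	return getattr_over_the_wire(dir);
}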
Just FYI, now that we fully understand the root cause, I'd like to keep this for the record:

[root@z05 ~]# mount /nfs/foo
[root@z05 ~]# touch /local/foo; stat /nfs/foo/bar; touch /local/foo/bar
stat: cannot stat `/nfs/foo/bar': No such file or directory
[root@z05 ~]# ls -la /nfs/foo
total 8
drwxr-xr-x  2 root root 4096 Mar  2 14:16 .
drwxr-xr-x  3 root root 4096 Feb 23 16:50 ..
?---------  ? ?    ?       ?            ? bar

Instant reproduction, as long as you're lucky enough to get all three commands to run within the same second. :)
OK, there are two ways to cause this problem. We've been reproducing the easier one, but there's another that is specific to the Linux NFS client.

1. mtime is the same

If the mtime on an NFS directory is the same as the previously checked value, the NFS client considers that there was no change to its contents. This can happen with our reproducer when operations happen within the same second (with Linux's default one-second timestamp resolution) or when the mtime is artificially set to a specific value on the server (via tar, for example, and this could happen even with finer-grained timestamps). This is fairly intrinsic to the way NFS operates, and the only thing that can be done about it is to expire the cached dentries, valid or not, after some period of time. This cause has to do with inodes and mtimes, so it is server-oriented in a way.

2. 32bit jiffy wrap-around

Anuwat recently demonstrated that it is fairly possible for code that uses 32bit jiffies to be exposed to jiffy wrap-around every ~50 days, with HZ set to 1000. The NFS client code has this problem, because it uses a filesystem-specific void * field (d_fsdata) in the dentry structure to keep a snapshot of the jiffies value. Since a pointer is 32 bits on i686, the NFS code cannot use the 64bit representation of the jiffies counter. Given that NFS dentries almost never really get freed (unless you rename an entry or unmount the volume), it is entirely possible that very old NFS dentries exist in the cache whose d_fsdata field holds a value higher than the current low 32bit portion of jiffies. As Anuwat determined, these dentries will not get freed by changing their parent directory, since their own "update time" is already in the "future" for a 32bit jiffy value.

STILL, they _will_ be revalidated if you use the touch command on them: on receiving -ENOENT, touch will try to _create_ those entries that the client thinks do not exist, the server will tell the client that those entries _do_ exist, and the NFS code will drop the existing negative dentries. So the proposed workaround of touching the problematic entries themselves should still work.

The fix for this cause of the problem seems to be to store a full 64bit jiffy count in the dentry object and change the timestamp check to use the full 64bit value. This cause has to do with dentries and jiffies, so it is specific to the Linux NFS client code.

Flavio Leitner and I are still working through the implications of having the Linux NFS client code point d_fsdata to an allocated structure instead of using it as a simple long value.

Peter, any comments? :)
It appears that the root cause for these problems is twofold? One is that the Linux server does not actually change the mtime on the directory when creating new entries, because the granularity is seconds? The other is the jiffy problem? I don't quite understand this one. It has to do with expiration of the time during which the client believes the dentry is valid before checking again with the server?
The jiffies problem is that:

a) NFS dentries have their u32 jiffy timestamp saved in d_fsdata, from the nfs_inode's u32 cache_change_attribute;
b) NFS dentries almost never get freed;
c) u32 jiffies wrap completely in ~50 days, and change sign in ~25 days.

The time_after{,_eq} constructs can tolerate _one_ sign flip (comparing a negative with a positive value) when comparing u32 jiffy counts. But they obviously cannot help when both values have the same sign and yet we _know_ that one of them is from before the latest 32bit overflow of the u32 jiffies.

So consider that you do some operations on an NFS mount, and then let it sit for four weeks, thus allowing the u32 jiffies to flip its most significant bit twice. Any timestamp that was "positive" before will, when compared to a "positive" recent jiffy count, appear to time_after{,_eq} to actually be after the most recent one (even though the recent count is numerically smaller, it lacks the 33rd bit). In the case of positive dentries, any operation on a possibly stale filehandle will receive ESTALE and they will be revalidated. In the case of negative dentries, it seems we just trust them.

Let's throw in some numbers. Imagine you have a negative NFS dentry that was created/updated around 20-some days after boot, so it gets a jiffy timestamp of about 2147482648 (0x7FFFFC18). Then it sits idle for 28 days, which is enough for the u32 jiffy count to be increased by 1000 * 60 * 60 * 24 * 28 == 2419200000 (0x90321000). At this moment, the parent directory of that dentry gets updated, as we do an ls -l on it, and the parent directory gets a jiffy timestamp in its cache_change_attribute of (2147482648 + 2419200000) % (2 ^ 32) (remember we're counting modulo 2^32), which is 271715352 (0x10320C18).

The current timestamp of 0x10320C18 is clearly "earlier" than the negative dentry's timestamp of 0x7FFFFC18 when you don't consider that the jiffies wrap around. So the code considers the negative dentry to still be valid, even when the server tells the client the entry exists in a readdir.

I hope to have gotten the full context in this one. :)

Regards,
Fábio
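For anyone who wants to check the arithmetic, the comparison can be replayed in userspace with the same numbers. A standalone demo follows; only the macro definitions are local copies of the kernel's 32bit jiffies comparisons, the rest is illustrative:

/* Standalone demo of the double-sign-flip failure, using the numbers
 * above and local copies of the kernel's jiffies comparison macros. */
#include <stdint.h>
#include <stdio.h>

#define time_after(a, b)    ((int32_t)((b) - (a)) < 0)
#define time_after_eq(a, b) ((int32_t)((a) - (b)) >= 0)

int main(void)
{
	uint32_t dentry_ts = 0x7FFFFC18u;	/* stamped ~20-some days after boot */
	uint32_t delta     = 2419200000u;	/* 28 idle days at HZ=1000 */
	uint32_t dir_ts    = dentry_ts + delta;	/* wraps to 0x10320C18 */

	printf("dir_ts = 0x%08X\n", dir_ts);
	/* Prints 1: the stale stamp still compares as "after" the parent
	 * directory's brand-new change attribute, so it is trusted. */
	printf("time_after_eq(dentry_ts, dir_ts) = %d\n",
	       time_after_eq(dentry_ts, dir_ts));
	return 0;
}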
This obviously affects only 32bit archs, so a workaround is to move to x86_64 and friends. :)

Perhaps the fix is to move from long to u64 and get_jiffies_64() in nfs_inode.cache_change_attribute and nfs_fattr.time_start, and to actually allocate a u64 from a slab cache and point dentry.d_fsdata to it, implementing an nfs_dentry_release operation to free up d_fsdata. Tricky tricky.
Created attachment 150805 [details] Keep track of the higher bits of jiffies64 in NFS dentries.

Proposed fix for the case where NFS dentries sit idle during two sign flips of 32bit jiffies, fooling time_after[_eq]. Basically, we abuse another field in struct dentry that is unused by NFS, d_cookie, to keep track of the higher part of jiffies64, so that we refresh very old negative dentries when accessing them. The code is made conditional on BITS_PER_LONG being < 64, since this only affects 32bit architectures.

A way to test this without waiting about 30 days is to reproduce the problem within 5 minutes of system boot. The initial jiffies value is chosen so that it overflows after 5 minutes of uptime, precisely to expose jiffies problems early. This ends up helping us, since the higher word of jiffies64 gets incremented at that point. I tested this by creating the persistent dentries by exploiting the mtime problem, and then refreshing them with the jiffies trigger: after 5 minutes of uptime the broken negative dentry went away.

Comments, please!
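To illustrate the idea (this is my own self-contained sketch with made-up names, not the attached patch itself): the dentry remembers both halves of jiffies64, and a stamp whose high word no longer matches the current one is refused outright rather than fed to time_after_eq():

/* Self-contained sketch; names are mine and do not match the patch.
 * On 32bit, d_fsdata can only hold the low word of a timestamp, so
 * the high word of jiffies64 is stashed in a second otherwise-unused
 * field (d_cookie in the patch). */
#include <stdbool.h>
#include <stdint.h>

struct nfs_dentry_stamp {
	uint32_t low;	/* low word, what d_fsdata effectively holds today */
	uint32_t high;	/* high word of jiffies64 at stamping time */
};

static bool stamp_same_epoch(const struct nfs_dentry_stamp *s, uint64_t now64)
{
	/* If the high word has moved on, the stamp predates at least one
	 * 32bit wrap: force a revalidation instead of trusting a
	 * time_after_eq() comparison of low words. */
	return (uint32_t)(now64 >> 32) == s->high;
}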
It looks like the u32 jiffy overflow problem can also affect positive dentries and file attributes, judging by the code in nfs_refresh_inode and the code flows from there. It would be interesting if someone more experienced in the NFS client code could comment on this.

Imagine that an NFS directory is accessed at about 20 days of uptime, when the 32bit jiffies count is (1000 * 60 * 60 * 24 * 20) == 1,728,000,000. Then this directory inode sits idle on the client for 30 days and is accessed again, when the 32bit jiffies count has overflowed to (1000 * 60 * 60 * 24 * 50) % (2 ^ 32) == 25,032,704. Even with all the cast magic in time_after[_eq], the jiffies at 50 days of uptime is obviously "earlier" than the jiffies at 20 days of uptime, and the code will flow in unintended ways.

Since nfs_fattr->time_start, nfs_inode->last_updated, nfs_inode->cache_change_attribute and some others are all compared in one place or another, the logic is far more difficult to change than in the simple negative dentry case. I was thinking of converting those fields to u64 and get_jiffies_64(), but then operating on those types will no longer be atomic on 32bit archs, and we might need extra locking.

So, if anyone more knowledgeable could comment, it would be great! :)

Thanks,
Fábio
(In reply to comment #32)
> 
> So, if anyone more knowledgeable could comment, it would be great! :)
> 

Definitely not more knowledgeable, but this sounds like an interesting issue and I'm having a little trouble understanding why we need to go to so much trouble to fix it. So perhaps you would be kind enough to humour me and answer a couple of questions.

First, does Trond's patch not work for all the cases?

Why are we so concerned about the cache validation checks when a negative, hashed dentry, as in this case, seems to be an obvious candidate for disposal (d_drop), as is done in Trond's patch?

Ian
(In reply to comment #33)
> why we need to go to so much trouble to fix it. So perhaps you
> would be kind enough to humour me and answer a couple of questions.

Mmm .. obviously not.

Anyway, I have an idea of a possible cause that is somewhat different to the analysis above. But I can't be sure that I understand the problem above or in what kernel versions the problem does and doesn't exist so I'll make the assumption that it doesn't happen in the latest kernel (perhaps 2.6.20 or 21 and above).

> 
> First, does Trond's patch not work for all the cases?

Assume that Trond's patch above prevents the listing of these invalid dentries by unhashing them if they are encountered.

> 
> Why are we so concerned about the cache validation checks when
> a negative, hashed dentry, as in this case, seems to be an
> obvious candidate for disposal (d_drop) as is done in Trond's
> patch?

Looking at this message
http://marc.info/?l=linux-nfsv4&m=115743251604662&w=2
which basically says that there's a race in the client during the mkdir system call. This patch isn't present in the RHEL4 kernel.

To explain, for a mkdir (system) call, the (VFS) lookup returns a dentry that is hashed and negative and it is later filled in in the (VFS) mkdir call. So if a (VFS) lookup happens between these two events the lookup will find the above negative dentry since it is hashed and can, given specific timing, return ENOENT from link_path_walk because the dentry is negative.

Am I at all close to what this bug is about, Fabio?

Ian
Sorry Ian, I wasn't looking at this BZ yesterday...

(In reply to comment #34)
> (In reply to comment #33)
> > why we need to go to so much trouble to fix it. So perhaps you
> > would be kind enough to humour me and answer a couple of questions.
> 
> Mmm .. obviously not.

I'll surely entertain anyone who wants to help. :)

> Anyway, I have an idea of a possible cause that is somewhat
> different to the analysis above. But I can't be sure that I
> understand the problem above or in what kernel versions
> the problem does and doesn't exist so I'll make the assumption
> that it doesn't happen in the latest kernel (perhaps 2.6.20 or
> 21 and above).

I'm not sure it has been fixed there. Unfortunately I don't have an environment with the latest stock kernel to test, but I'll see if I can create something.

> > First, does Trond's patch not work for all the cases?
> 
> Assume that Trond's patch above prevents the listing of
> these invalid dentries by unhashing them if they are
> encountered.

Mainly, Trond's patch above works around the issue in one place where it can be detected. It does not solve the underlying causes, so we'll have to hunt for other places where the issue hits and work around it there as well. Think of an application like a web server, which does not do getdents calls: it just stats/opens files directly and can hit stale dentries that will never be cleaned up.

> > Why are we so concerned about the cache validation checks when
> > a negative, hashed dentry, as in this case, seems to be an
> > obvious candidate for disposal (d_drop) as is done in Trond's
> > patch?

Because not all code paths will end up doing a readdir.

> Looking at this message
> http://marc.info/?l=linux-nfsv4&m=115743251604662&w=2
> which basically says that there's a race in the client during
> the mkdir system call. This patch isn't present in the RHEL4
> kernel.
> 
> To explain, for a mkdir (system) call, the (VFS) lookup returns
> a dentry that is hashed and negative and it is later filled
> in in the (VFS) mkdir call. So if a (VFS) lookup happens
> between these two events the lookup will find the above
> negative dentry since it is hashed and can, given specific
> timing, return ENOENT from link_path_walk because the dentry
> is negative.

But in that case I don't think it would be a persistent issue. You could hit ENOENT once, but on the next lookup the dentry would likely already be filled in. What we see here are negative dentries for files that do exist on the server, and you can check them from the client for hours and still get ENOENT.

> Am I at all close to what this bug is about, Fabio?

Close, yes. But I think there's one fundamental set of conditions we identified that is not yet fully understood, and I'd love to be proved wrong about them. They are:

- NFS uses jiffies timestamps on its structures to compare update/modification times;
- on 32bit platforms, jiffies wrap completely in around 50 days (thus the MSB flips in around 25 days);
- NFS dentries and inodes remain allocated in memory for as long as the filesystem is mounted, which can span months and thus many u32 jiffies flips;
- time_after[_eq] uses cast magic to tolerate sign flips when comparing jiffies, but it obviously cannot cope with two sign flips (it would still be comparing values with the same sign).

What seems to be happening is that when NFS dentries and inodes are left idle for several weeks, it can be that the jiffies value recorded as update time on them will seem "newer" when compared to timestamps of current NFS operations.
This can lead to old information being "renewed" without being properly refreshed from server data.

IMHO, this explains why we're seeing issues with NFS clients showing stale data in directories, for example, as well as stale negative dentries that don't go away even if you change the directory (the jiffies timestamp of the attributes of the directory, even being "now", are still "older" than the stored jiffies timestamp from two sign-flips ago).

Does that sound understandable?

Cheers!
Fábio
(In reply to comment #35)
> Sorry Ian, I wasn't looking at this BZ yesterday...

No prob. I'm interested because I've had to try to deal with the apparent directory corruption before.

snip ..

> What seems to be happening is that when NFS dentries and inodes are left idle
> for several weeks, it can be that the jiffies value recorded as update time on
> them will seem "newer" when compared to timestamps of current NFS operations.
> This can lead to old information being "renewed" without being properly
> refreshed from server data.
> 
> IMHO, this explains why we're seeing issues with NFS clients showing stale data
> in directories, for example, as well as stale negative dentries that don't go
> away even if you change the directory (the jiffies timestamp of the attributes
> of the directory, even being "now", are still "older" than the stored jiffies
> timestamp from two sign-flips ago).
> 
> Does that sound understandable?

Sure, I got most of that, but I thought, from the description, we were worried about hashed, negative dentries. I'll have to have a look through the code, but I don't think negative dentries can be trusted at all. It sounds, though, like we have non-negative aged dentries, which is different from what I thought the case was. There must be a way around this!

Ian
Indeed, the problem started out as a negative dentry problem, but then we perceived that one of the causes (jiffy overflow) of the persistent negative dentries could also be making us re-use old data without refreshing from newer attribute information from the server.
(In reply to comment #37)
> Indeed, the problem started out as a negative dentry problem, but then we
> perceived that one of the causes (jiffy overflow) of the persistent negative
> dentries could also be making us re-use old data without refreshing from newer
> attribute information from the server.

I'm curious about these "persistent negative dentries". If a dentry isn't hashed then it's invisible to the VFS, so it can't be causing a problem, as it is in the process of being disposed of. So we must be concerned with hashed dentries. I can't find anywhere in the code that makes a dentry negative and leaves it hashed. Can you point me to the code that led you to this conclusion, please?

Ian
Please read the code in nfs_lookup(). If the server returns -ENOENT, it will create a negative dentry and hash it, as it should. Otherwise this negative dentry will not speed up any subsequent lookups. Also, please look at nfs_dentry_delete(), and notice that NFS dentries almost never get unhashed, so they can survive many u32 jiffies overflows.
(In reply to comment #39)
> Please read the code in nfs_lookup(). If the server returns -ENOENT, it will
> create a negative dentry and hash it, as it should. Otherwise this negative
> dentry will not speed up any subsequent lookups.
> 
> Also, please look at nfs_dentry_delete(), and notice that NFS dentries almost
> never get unhashed, so they can survive many u32 jiffies overflows.

Yes, I see that now. But I guess that the problem is much the same for dentries whether they're positive or negative, except that the positive dentries won't cause the path walk to return ENOENT, but maybe incorrect status info.

I was also looking at NFS patches queued for 2.6.21 and saw this one:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3e7d950a528454ad749a264feef3c8bad3faa108;hp=ebf4d8ec02d8a13e1cbbda6fe746f80f3a4ca773

Something like this may help with the wraparound issue, but the code isn't similar enough to apply it without more investigation, and there appears to be another field with a similar problem in the RHEL4 NFS code. Maybe I'm wrong about this as well, as I haven't looked further yet.

Ian
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #34)
> (In reply to comment #33)
> > Why are we so concerned about the cache validation checks when
> > a negative, hashed dentry, as in this case, seems to be an
> > obvious candidate for disposal (d_drop) as is done in Trond's
> > patch?
> 
> Looking at this message
> http://marc.info/?l=linux-nfsv4&m=115743251604662&w=2
> which basically says that there's a race in the client during
> the mkdir system call. This patch isn't present in the RHEL4
> kernel.
> 
> To explain, for a mkdir (system) call, the (VFS) lookup returns
> a dentry that is hashed and negative and it is later filled
> in in the (VFS) mkdir call. So if a (VFS) lookup happens
> between these two events the lookup will find the above
> negative dentry since it is hashed and can, given specific
> timing, return ENOENT from link_path_walk because the dentry
> is negative.

That patch is related to bug 239625, where these erroneous negative dentries were created for files and not just directories. That problem was noticed when attempting to create files in O_CREAT|O_EXCL mode on a read-only filesystem. It wasn't a race; rather, the dentry was allocated but never filled in because the create fails on a read-only filesystem. If the file actually exists on the server, the client never notices, because it optimized away the LOOKUP.
Created attachment 157492 [details] Patch that changes how NFS dentries are validated

This patch is upstream and has been used in testing for quite a while. The first chunk addresses the stale negative dentry problem by changing the dentry timestamp comparison to an equality test against the parent inode's cache_change_attribute; time_after makes no sense here, as the dentry timestamp is always updated to be the same as the cache_change_attribute when the latter changes.

The other chunks help diminish stale attribute problems, but are not yet a complete fix. I'll open a new BZ for that and push it separately.
Created attachment 158367 [details] Newer patch to avoid stale negative dentries

This is the same patch re-spun for the latest RHEL-4 kernels.
committed in stream U6 build 55.23. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
changing to FAILS_QA based on comment #55
Moving to Assigned.
Can you make comment #55 public since it seems to indicate that this problem is not fixed?
Looking at the reproducer, I've been able to get this to fail too:

3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.010145 seconds, 310 MB/s
file create OK
exec stat /mnt/nfs/aaa/bbb
/usr/bin/stat: cannot stat `/mnt/nfs/aaa/bbb': No such file or directory

...network captures show that a LOOKUP call is done just before ssh'ing to the server, and that it returns NFS3ERR_NOENT. After the SSH traffic, the client does a GETATTR call on the 'aaa' dir, and then an ACCESS call. My guess is that it's checking the dir attributes to see if it has changed since the failed lookup. They haven't, so it doesn't retry the lookup.
Corresponding nfs_debug messages (1984503 is the aaa directory, 1755649 is its parent):

Aug 30 16:02:31 dhcp59-129 kernel: NFS call lookup bbb
Aug 30 16:02:31 dhcp59-129 kernel: NFS reply lookup: -2
Aug 30 16:02:31 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 0)
Aug 30 16:02:32 dhcp59-129 kernel: NFS: permission(0:11/1755649), mask=0x1, res=0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: revalidating (0:11/1984503)
Aug 30 16:02:32 dhcp59-129 kernel: NFS call getattr
Aug 30 16:02:32 dhcp59-129 kernel: NFS reply getattr. status=0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: nfs_update_inode(0:11/1984503 ct=1 info=0x6)
Aug 30 16:02:32 dhcp59-129 kernel: NFS: (0:11/1984503) revalidation complete
Aug 30 16:02:32 dhcp59-129 kernel: NFS call access
Aug 30 16:02:32 dhcp59-129 kernel: NFS reply access, status = 0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: permission(0:11/1984503), mask=0x1, res=0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 8)
Looking back over the logs from the testing I did yesterday, I think I see what the issue might be. For reference, here's the similar output from a previous attempt that did not fail:

Aug 30 16:02:30 dhcp59-129 kernel: NFS call lookup bbb
Aug 30 16:02:30 dhcp59-129 kernel: NFS reply lookup: -2
Aug 30 16:02:30 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 0)
Aug 30 16:02:31 dhcp59-129 kernel: NFS: permission(0:11/1755649), mask=0x1, res=0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: revalidating (0:11/1984503)
Aug 30 16:02:31 dhcp59-129 kernel: NFS call getattr
Aug 30 16:02:31 dhcp59-129 kernel: NFS reply getattr. status=0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: nfs_update_inode(0:11/1984503 ct=1 info=0x6)
Aug 30 16:02:31 dhcp59-129 kernel: NFS: mtime change on server for file 0:11/1984503
Aug 30 16:02:31 dhcp59-129 kernel: NFS: (0:11/1984503) data cache invalidated
Aug 30 16:02:31 dhcp59-129 kernel: NFS: nfs3_forget_cached_acls(0:11/1984503)
Aug 30 16:02:31 dhcp59-129 kernel: NFS: (0:11/1984503) revalidation complete
Aug 30 16:02:31 dhcp59-129 kernel: NFS call access
Aug 30 16:02:31 dhcp59-129 kernel: NFS reply access, status = 0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: permission(0:11/1984503), mask=0x1, res=0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 18)

The main difference here is that the working one shows an mtime change for the directory when the revalidation is done. This means that its data is invalidated properly, and that causes a new lookup to go out over the wire.

I believe the problem here is due to the fact that the underlying filesystem on the server does not support time resolution of less than 1 second. When things get timed in such a way that the file is deleted and then recreated on the server within the same second, the client has no way to tell that the directory has changed, and does not invalidate the directory's data.

Note that the noac option is working properly here. The directory's attributes are not cached: a getattr is done every time the client needs to revalidate this directory. The problem is that the revalidation logic has no way to tell that the directory has changed. If the server were serving from a filesystem that had sub-second mtime resolution, then I think this problem would not be occurring.
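For what it's worth, the granularity dependence can be expressed as a small check (a hedged userspace sketch, not client code; it only shows why a nanosecond field would help if the exported filesystem stored one, which ext3 here does not):

/* Sketch: a change detector keyed on whole-second st_mtime is blind to
 * a remove+recreate within one second; a nanosecond field would catch
 * it, but only if the server's filesystem actually stores nanoseconds. */
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>

static bool dir_changed(const struct stat *old, const struct stat *cur)
{
	if (old->st_mtime != cur->st_mtime)
		return true;	/* visible at one-second granularity */
	/* Same second: only sub-second timestamps can tell these apart. */
	return old->st_mtim.tv_nsec != cur->st_mtim.tv_nsec;
}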
Tested against a server with an XFS partition and the nfsbug.sh script ran 300 iterations without failing. I also tested a RHEL5 client and nfsbug.sh test also failed within a few tries when run against a server on an ext3 filesystem. I'm going to conclude that the report that this problem isn't fixed is wrong, and the problem that they're seeing is not related to the original issue.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html
*** Bug 327591 has been marked as a duplicate of this bug. ***
Hello. Does this issue still occur in RHEL5(u1)? Thanks.
Tommi, I saw the patches discussed here being evaluated as part of other updates to the NFS code in RHEL-5, so although this specific BZ has not been cloned to it, the problem should already have been taken care of there.

Cheers,
Fabio
The problem is reproduced on RHEL 4 with kernel 2.6.9-89.0.11.0.1.ELsmp. The erratum mentioned has the 2.6.9-67 kernel.
Hi Rituraj, can you share some more information about what was reproduced and the environment where it reproduced? There were several related issues discussed in this bugzilla, so please provide more context.
Hi Fabio,

Here is more info about the system:

[root@ora-stage-ebs-app-6 vmwstagecomn]# rpm -qa | grep -i nfs
nfs-utils-lib-1.0.6-10.el4
nfs-utils-1.0.6-93.EL4
[root@ora-stage-ebs-app-6 vmwstagecomn]# uname -a
Linux ora-stage-ebs-app-6.vmware.com 2.6.9-89.0.11.0.1.ELsmp #1 SMP Tue Sep 15 13:54:39 EDT 2009 i686 i686 i386 GNU/Linux
[root@ora-stage-ebs-app-6 vmwstagecomn]#
[root@ora-stage-ebs-app-6 vmwstagecomn]# cat /etc/issue
Enterprise Linux Enterprise Linux AS release 4 (October Update 8)
Kernel \r on an \m

Thanks for the quick response. As this bug is closed, I have opened another ticket for the issue:
https://bugzilla.redhat.com/show_bug.cgi?id=803648
There is a description of the problem there.

Thanks!
I am seeing this issue with the 5.X and 6.X kernels. I documented all of the steps at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1017720. The fix for me is to run the command below when using the 5.X kernel:

echo 2 > /proc/sys/vm/drop_caches

Are you able to help me resolve this issue?