Description of problem:

When heavy concurrent file and directory access over NFS is coupled with server-side updates, the client sometimes misses a server attribute update and keeps persistent stale negative dentries for a long time. This leads to a file or directory being listed by a getdents call but having its existence denied when accessed directly.

Example on a local dir /local/foo NFS-exported to the localhost on /nfs/foo:

[root@dl145 ~]# ls -l /local/foo
total 4
drwxr-xr-x  3 root root 4096 Jan 30 00:25 bar
[root@dl145 ~]# ls -l /nfs/foo
total 0
?---------  ? ?    ?       ?            ? bar
[root@dl145 ~]# stat /nfs/foo/bar
stat: cannot stat `/nfs/foo/bar': No such file or directory
[root@dl145 ~]# stat /local/foo/bar
  File: `/local/foo/bar'
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 28180483    Links: 3
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-02-14 23:39:48.000000000 -0500
Modify: 2007-01-30 00:25:17.000000000 -0500
Change: 2007-02-14 23:39:48.000000000 -0500

This was obtained with the hit_nfs.py script attached, on a 4-way SMP box.

Version-Release number of selected component (if applicable):
I just reproduced on 2.6.9-42.ELsmp. I'll work on newer versions now.

How reproducible:
Needs the proper timing to hit what is likely a race, perhaps either in dcache handling or in link_path_walk (the -ESTALE case seems racy). The Python script attached reproduces the problem every few times it's run.

The script creates a thread that removes the directory contents and restores them (as if by an rm -fr followed by untarring a data tarball) every 5 seconds on the server side (/local/foo), while 300 threads on the client side (/nfs/foo) alternate between stating /nfs/foo/bar, stating /nfs/foo/bar/baz/file9 and opening/closing /nfs/foo/bar with O_DIRECTORY every 2 seconds. This may sound weird, but it was extracted from an actual usage pattern in an enterprise deployment, and the syscall pattern of tar, rm and the client programs was replicated in the Python script to create a "clean room" reproducer.

Steps to Reproduce:
1. export /local/foo via NFS to localhost
2. mount localhost:/local/foo /nfs/foo
3. ./hit_nfs.py
4. check /nfs/foo contents for problems, and goto 3 if the system recovers too fast to analyse

"Good" runs can leave the problematic ls -l output above in place for a few hours.

Actual results:
See above.

Expected results:
The stale negative dentries should be detected and cleared at some point.

Additional info:
Trond Myklebust's patches 007 and 008 from http://linux-nfs.org/Linux-2.6.x/2.6.20/ can clear up the stale negative dentries when nfs_readdir is called. A mixed version that applies cleanly to RHEL-4 U4 is attached. I still don't know the root cause, though, and those patches don't help for the case where files are accessed directly (i.e. without an ls on the directory or even tab-completion on bash to stat the file ;-).

Cheers!
Fábio
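For context, the client-side load is essentially many threads hammering stat() on the same paths. A minimal C analogue of one of those loops might look like this (a sketch only; the authoritative reproducer is the attached hit_nfs.py, and the path and thread count below are illustrative):

/* Sketch: a simplified C analogue of the stat_file threads in
 * hit_nfs.py. Path and thread count are illustrative, not the
 * script's. Build with: cc -o hammer hammer.c -lpthread */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 100

static void *stat_loop(void *arg)
{
	struct stat st;
	for (;;) {
		if (stat("/nfs/foo/bar/baz/file9", &st) < 0)
			fprintf(stderr, "stat: %s\n", strerror(errno));
		sleep(2);	/* each thread paces itself at ~2 seconds */
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, stat_loop, NULL);
	pthread_join(tids[0], NULL);	/* runs until interrupted */
	return 0;
}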
Created attachment 148096 [details] Python script to reproduce the bug.
Created attachment 148097 [details] Trond Myklebust's dentry pruning patches combined and ported to RHEL-4.
Let's try to shoot for 4.6?
I just reproduced this even without setting the mtime on the directories back to a point in time, like tar does. This probably means some race in path lookup? I'll attach the enhanced reproducer, that has a bit more synchronization between threads so that it does not trigger on false positives.
Created attachment 148440 [details] Enhanced python script that reproduces the problem.
I just reproduced it even when commenting out the code that creates the client open threads. I'm trying to reproduce again while capturing traffic.
Update! I just reproduced on RHEL-4 U4 by running only 100 stat_file threads on the python reproducer script (the creation of the other threads, stat_dir and open_dir, was commented out, as well as changing mtime on the server). It still produces this:

# ls -laR /nfs/foo
/nfs/foo:
total 8
drwxr-xr-x  3 root root 4096 Feb 20 20:16 .
drwxr-xr-x  3 root root 4096 Feb 14 13:49 ..
?---------  ? ?    ?       ?            ? bar

Notice the script was stating /nfs/foo/bar/baz/file9, and yet the bar directory, merely a component in the lookup, was affected.

Now we know that:
- changing the mtime on the server or not is irrelevant;
- having multiple concurrent sys_stat calls on the client while the server is being updated is enough to trigger the problem.
It was also reproduced on RHEL-5, although ls -l on the problematic directory seems to refresh it and get rid of the problem. If you just stat the directory or file directly, though, you still get a wrong -ENOENT back.

And to come full circle, it was also reproduced on RHEL-3, although RHEL-3 seems to recover by itself in about 60 seconds. The ls command on RHEL-3 actually prints a "No such file or directory" for the broken entries, instead of listing them with question marks. At the syscall level, though, it's still -ENOENT.
Given that the problem seems to be related to path lookups over dentries that point to stale NFS inodes, I thought of helping the system get rid of positive dentries that point to stale NFS inodes, or negative dentries whose parent dentry points to a stale NFS inode, as soon as possible. I added a few conditions to nfs_dentry_delete, with the patch below:

--- linux-2.6.9/fs/nfs/dir.c.original 2007-02-27 00:14:45.000000000 -0500
+++ linux-2.6.9/fs/nfs/dir.c 2007-02-27 09:50:37.000000000 -0500
@@ -793,6 +793,19 @@
 	 * files will be cleaned up during umount */
 		return 1;
 	}
+	if (dentry->d_inode && NFS_STALE(dentry->d_inode)) {
+		/* Unhash it if it points to a stale inode */
+		printk("Deleting dentry that points to stale NFS inode!\n");
+		printk(" %s/%s\n", dentry->d_parent->d_name.name, dentry->d_name.name);
+		return 1;
+	}
+	if (!dentry->d_inode && NFS_STALE(dentry->d_parent->d_inode)) {
+		/* Unhash it if it's a negative dentry for a stale parent */
+		printk("Deleting negative dentry on stale parent NFS inode!\n");
+		printk(" %s/%s\n", dentry->d_parent->d_name.name, dentry->d_name.name);
+		dump_stack();
+		return 1;
+	}
 	return 0;
 }

The problem still reproduces, even when I see the messages above in the kernel log (sorry for the big stack dumps):

SELinux: initialized (dev 0:14, type nfs), uses genfs_contexts
Deleting dentry that points to stale NFS inode!
 baz/file9
Deleting negative dentry on stale parent NFS inode!
 bar/baz
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166b19>] __link_path_walk+0x332/0xbb5
 [<c014e221>] do_anonymous_page+0x1c9/0x1db
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e221>] do_anonymous_page+0x1c9/0x1db
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 bar/baz
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166b19>] __link_path_walk+0x332/0xbb5
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting dentry that points to stale NFS inode!
 bar/baz
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 baz/file9
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134c87>] wake_futex+0x3a/0x44
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting negative dentry on stale parent NFS inode!
 baz/file9
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166fef>] __link_path_walk+0x808/0xbb5
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c01506e4>] find_extend_vma+0x12/0x4f
 [<c0134b57>] get_futex_key+0x39/0x108
 [<c0134d30>] futex_wake+0x9f/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting dentry that points to stale NFS inode!
 bar/baz
Deleting negative dentry on stale parent NFS inode!
 bar/baz
 [<f8b5aea9>] nfs_dentry_delete+0x9e/0xae [nfs]
 [<c016f09e>] dput+0x62/0x1a7
 [<c016659e>] real_lookup+0x7b/0xd2
 [<c01667ae>] do_lookup+0x56/0x8f
 [<c0166b19>] __link_path_walk+0x332/0xbb5
 [<c014e73f>] handle_mm_fault+0x139/0x193
 [<c01673df>] link_path_walk+0x43/0xbe
 [<c0134d46>] futex_wake+0xb5/0xc5
 [<c0167774>] path_lookup+0x14b/0x17f
 [<c01678bc>] __user_walk+0x21/0x51
 [<c0162b41>] vfs_stat+0x14/0x3a
 [<c0134d46>] futex_wake+0xb5/0xc5
 [<c016314a>] sys_stat64+0xf/0x23
 [<c0135497>] do_futex+0x32/0x5a
 [<c01355c0>] sys_futex+0x101/0x10c
 [<c02d4703>] syscall_call+0x7/0xb
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar
Deleting dentry that points to stale NFS inode!
 //bar

Even with those problematic dentries apparently being deleted, we have:

# ls -la /nfs/foo
total 16
drwxr-xr-x  3 root root 4096 Feb 27 11:52 .
drwxr-xr-x  3 root root 4096 Feb 26 23:55 ..
?---------  ? ?    ?       ?            ? bar

This is getting weirder the harder I look. :)
In another twist to this issue, most of the packet captures I have seen so far clearly show the NFS server returning NFS3ERR_NOENT for those directory entries, with the lookup operations intermixed with access/getattr operations on the parent directory of those entries, so we may actually be seeing a bad timing problem on the _server_ side. Perhaps in that case we can just blame NFSv3. I'm running out of ideas. :)
Yes, could be a server problem, but my money is still on a client side problem. :-)
Has this been reproduced using two different systems, one as client and the other as server?
Yes, the original reproducer was on two different systems, just like the original actual problem. It's just that when it was found that having client and server on the same system would reproduce the problem in seconds or minutes instead of hours or days, we focused on that case. Sorry about that. :)

In any case, a really nice analysis by a partner turned our attention to the fact that, since the usable mtime resolution on NFS seems to be only one second, it is entirely possible that a negative dentry for a file is created in the same second in which the directory is updated to actually contain that entry, and neither the mtime nor the cache_change_attribute will catch it. The mtime will be the same as the one we just checked, and we have not done any metadata-changing action on the server that would affect cache_change_attribute.

I took a look at nfs_verify_change_attribute in RHEL-4 U4, and applied this simple change:

--- linux-2.6.9/include/linux/nfs_fs.h.original 2007-02-28 21:58:01.000000000 -0500
+++ linux-2.6.9/include/linux/nfs_fs.h 2007-02-28 22:09:20.000000000 -0500
@@ -301,7 +301,7 @@ static inline long nfs_save_change_attri
 static inline int nfs_verify_change_attribute(struct inode *inode, unsigned long chattr)
 {
 	return !nfs_caches_unstable(inode)
-		&& chattr == NFS_I(inode)->cache_change_attribute;
+		&& chattr > NFS_I(inode)->cache_change_attribute;
 }
 
 /*

In upstream code this test is made with time_after_eq, in which case we'd just switch to using time_after. With this change, the problem no longer reproduces, at least within the time span it used to (it used to reproduce in under a minute; now I haven't seen it in over an hour of testing).

I don't completely get all the logic inversions in nfs_neg_need_reval and the functions called from it, and having a "generation counter" be tested with time_after_eq also puzzles me a bit. In any case, for now I just care that the change above seems to keep the problem from happening. Maybe it means that the code no longer trusts any NFS dentry, and that I've multiplied NFS traffic by ten or more. But it surely seems to keep the problem from happening. :)
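To see the granularity problem outside NFS entirely, here is a hedged userspace illustration (my own sketch, not from the reproducer): a remove and recreate within the same second leaves the parent directory's whole-second mtime untouched, so any cache keyed on it believes nothing happened:

/* Sketch: shows that whole-second mtime comparison cannot detect a
 * remove+recreate that happens within the same second. Illustrative
 * userspace code, not kernel code. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat before, after;

	system("mkdir -p /tmp/foo && touch /tmp/foo/bar");
	stat("/tmp/foo", &before);

	/* remove and recreate the entry within the same second */
	unlink("/tmp/foo/bar");
	system("touch /tmp/foo/bar");
	stat("/tmp/foo", &after);

	/* With one-second resolution these usually compare equal, so a
	 * cache keyed on st_mtime alone would believe nothing changed. */
	printf("mtime %s\n", before.st_mtime == after.st_mtime ?
	       "unchanged (the change is invisible)" : "changed");
	return 0;
}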
Forgot to mention the change above is being tested alone, without any of Trond's patches.
Well, that change is not the right one (the NFS traffic is a lot higher), but it shows that the spot to change is indeed near nfs_neg_need_reval (the problem has not reproduced in a few hours, as the code now always looks up at the server). I'll think this through and try to come up with The Right Thing.
OK, I think this is the closest to The Right Thing I can get. The idea is: don't create negative dentries in nfs_lookup() when the mtime of the directory is the current second, because we cannot be sure the entry won't appear milliseconds later, and we wouldn't be able to notice. This assumes the time is synced between server and client, which is a good thing to have anyway.

It generates extra NFS traffic only during the second in which we avoid caching the lookup answer with a negative dentry, and I think there's nothing we can do about that. A second after the directory change we can cache negative dentries to our heart's content, since then we _know_ we will notice the mtime change when those entries come into existence.

It has not reproduced the problem in over 30 minutes. The change looks a little gross, but it works. Peter Staubach, what do you think about it? Maybe adding some comments on top of it will make it more palatable.

--- linux-2.6.9/fs/nfs/dir.c.original 2007-02-27 00:14:45.000000000 -0500
+++ linux-2.6.9/fs/nfs/dir.c 2007-03-01 23:13:10.000000000 -0500
@@ -862,7 +862,10 @@ static struct dentry *nfs_lookup(struct
 		error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, &fhandle, &fattr);
 		if (error == -ENOENT)
-			goto no_entry;
+			if (dir->i_mtime.tv_sec == get_seconds())
+				goto out_unlock;
+			else
+				goto no_entry;
 		if (error != 0)
 			goto out_unlock;
 	}

I'll see about talking this through with Trond Myklebust as well.

Cheers!
Fábio
Assuming that the time is synchronized between the client and the server is a no-go. Please do not introduce dependencies like this.

An approach that I have seen used is to verify all negative entries with an over-the-wire GETATTR on the directory in which they reside. This is done for all negative entries on writable file systems, while the normal attribute cache timeout is used for read-only file systems. We assume that read-only file systems change slowly, if at all, so the looser semantics are acceptable.
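If I read that suggestion right, the policy boils down to roughly the following (a conceptual userspace sketch under my own naming; getattr_over_the_wire() is a hypothetical stand-in for issuing a real GETATTR on the parent directory):

/* Conceptual sketch of the policy described above, not kernel code.
 * getattr_over_the_wire() is a hypothetical stand-in for a GETATTR
 * on the directory holding the negative entry. */
#include <stdbool.h>
#include <time.h>

struct dir_state {
	time_t last_attr_check;	/* when we last revalidated this dir */
};

static bool getattr_over_the_wire(struct dir_state *dir)
{
	dir->last_attr_check = time(NULL);
	return true;	/* pretend the attributes still matched */
}

static bool negative_entry_usable(struct dir_state *dir, bool read_only,
				  time_t ac_timeout)
{
	if (!read_only)
		/* writable export: always go to the server */
		return getattr_over_the_wire(dir);
	/* read-only export: assume slow change, honor the attr cache */
	if (time(NULL) - dir->last_attr_check < ac_timeout)
		return true;
	return getattr_over_the_wire(dir);
}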
Just FYI, now that we fully understand the root cause, I'd like to keep this for the record:

[root@z05 ~]# mount /nfs/foo
[root@z05 ~]# touch /local/foo; stat /nfs/foo/bar; touch /local/foo/bar
stat: cannot stat `/nfs/foo/bar': No such file or directory
[root@z05 ~]# ls -la /nfs/foo
total 8
drwxr-xr-x  2 root root 4096 Mar  2 14:16 .
drwxr-xr-x  3 root root 4096 Feb 23 16:50 ..
?---------  ? ?    ?       ?            ? bar

Instant reproduction, as long as you're lucky enough to get all three commands to run within the same second. :)
OK, there are two ways to cause this problem. We've been reproducing the easier one, but there's another that is specific to the Linux NFS client.

1. mtime is the same

If the mtime on an NFS directory is the same as the previously checked value, the NFS client considers that there was no change to its contents. This can happen with our reproducer when operations happen within the same second (with Linux's default one-second timestamp resolution) or when the mtime is artificially set to a specific value on the server (via tar, for example, and this could happen even with finer-grained timestamps). This is fairly intrinsic to the way NFS operates, and the only thing that can be done about it is to expire the cached dentries, valid or not, after some period of time. This cause has to do with inodes and mtimes, so it is server-oriented in a way.

2. 32bit jiffy wrap-around

Anuwat recently demonstrated that it is fairly possible for code that uses 32bit jiffies to be exposed to jiffy wrap-around every ~50 days, with HZ set to 1000. The NFS client code has this problem, because it uses a filesystem-specific void * field (d_fsdata) in the dentry structure to keep a snapshot of the jiffies value. Since a pointer is 32 bits on i686, the NFS code cannot use the 64bit representation of the jiffies counter. Given that NFS dentries almost never really get freed (unless you rename an entry or unmount the volume), it is entirely possible that very old NFS dentries exist in the cache whose d_fsdata field holds a value higher than the current low 32bit portion of jiffies. As Anuwat determined, these dentries will not get freed by changing their parent directory, since their own "update time" is already in the "future" for a 32bit jiffy value.

STILL, they _will_ be revalidated if you use the touch command on them: on receiving -ENOENT, touch will try to _create_ those entries that the client thinks do not exist, the server will tell the client that those entries _do_ exist, and the NFS code will drop the existing negative dentries. So the proposed workaround of touching the problematic entries themselves should still work.

The fix for this cause of the problem seems to be to store a full 64bit jiffy count in the dentry object and change the timestamp check to use the full 64bit value. This cause has to do with dentries and jiffies, so it is specific to the Linux NFS client code.

Flavio Leitner and I are still working through the implications of having the Linux NFS client code point d_fsdata to an allocated structure instead of using it as a simple long value.

Peter, any comments? :)
It appears that the root cause for these problems is twofold? One is that the Linux server does not actually change the mtime on the directory when creating new entries, because the granularity is seconds? The other is the jiffy problem? I don't quite understand this one. It has to do with expiration of the time during which the client believes the dentry is valid before checking again with the server?
The jiffies problem is that:

a) NFS dentries have their u32 jiffy timestamp saved in d_fsdata, from the nfs_inode's u32 cache_change_attribute;
b) NFS dentries almost never get freed;
c) u32 jiffies wrap completely in ~50 days, and change sign in ~25 days.

The time_after{,_eq} constructs can tolerate _one_ sign flip (comparing a negative with a positive value) when comparing u32 jiffy counts. But they obviously cannot help when both values have the same sign and yet we _know_ that one of them is from before the latest 32bit overflow of the u32 jiffies.

So consider that you do some operations on an NFS mount, and then let it sit for four weeks, thus allowing the u32 jiffies to flip its most significant bit twice. Any timestamp that was "positive" before will, when compared to a "positive" recent jiffy count, appear to time_after{,_eq} to actually be after the most recent one (even though the recent count is numerically smaller, it lacks the 33rd bit). In the case of positive dentries, any operation on a possibly stale filehandle will receive ESTALE and they will be revalidated. In the case of negative dentries, it seems we just trust them.

Let's throw in some numbers. Imagine you have a negative NFS dentry that was created/updated around 20-some days after boot, so it gets a jiffy timestamp of about 2147482648 (0x7FFFFC18). Then it sits idle for 28 days, which is enough for the u32 jiffy count to be increased by 1000 * 60 * 60 * 24 * 28 == 2419200000 (0x90321000). At this moment, the parent directory of that dentry gets updated, as we do an ls -l on it, and the parent directory gets a jiffy timestamp in its cache_change_attribute of (2147482648 + 2419200000) % (2 ^ 32) (remember we're counting modulo 2^32), which is 271715352 (0x10320C18).

The current timestamp of 0x10320C18 is clearly "earlier" than the negative dentry's timestamp of 0x7FFFFC18 when you don't consider that the jiffies wrap around. So the code considers the negative dentry to still be valid, even when the server tells the client the entry exists in a readdir.

I hope to have gotten the full context in this one. :)

Regards,
Fábio
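For anyone who wants to check the arithmetic, the comparison can be replayed in userspace with the same numbers. A standalone demo follows; only the macro definitions are local copies of the kernel's 32bit jiffies comparisons, the rest is illustrative:

/* Standalone demo of the double-sign-flip failure, using the numbers
 * above and local copies of the kernel's jiffies comparison macros. */
#include <stdint.h>
#include <stdio.h>

#define time_after(a, b)    ((int32_t)((b) - (a)) < 0)
#define time_after_eq(a, b) ((int32_t)((a) - (b)) >= 0)

int main(void)
{
	uint32_t dentry_ts = 0x7FFFFC18u;	/* stamped ~20-some days after boot */
	uint32_t delta     = 2419200000u;	/* 28 idle days at HZ=1000 */
	uint32_t dir_ts    = dentry_ts + delta;	/* wraps to 0x10320C18 */

	printf("dir_ts = 0x%08X\n", dir_ts);
	/* Prints 1: the stale stamp still compares as "after" the parent
	 * directory's brand-new change attribute, so it is trusted. */
	printf("time_after_eq(dentry_ts, dir_ts) = %d\n",
	       time_after_eq(dentry_ts, dir_ts));
	return 0;
}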
This obviously affects only 32bit archs, so a workaround is to move to x86_64 and friends. :)

Perhaps the fix is to move from long to u64 and get_jiffies_64() in nfs_inode.cache_change_attribute and nfs_fattr.time_start, and to actually allocate a u64 from a slab cache and point dentry.d_fsdata to it, implementing an nfs_dentry_release operation to free up d_fsdata. Tricky tricky.
Created attachment 150805 [details] Keep track of the higher bits of jiffies64 in NFS dentries.

Proposed fix for the case where NFS dentries sit idle during two sign flips of 32bit jiffies, fooling time_after[_eq]. Basically, we abuse another field in struct dentry that is unused by NFS, d_cookie, to keep track of the higher part of jiffies64, so that we refresh very old negative dentries when accessing them. The code is made conditional on BITS_PER_LONG being < 64, since this only affects 32bit architectures.

A way to test this without waiting about 30 days is to reproduce the problem within 5 minutes of system boot. The initial jiffies value is chosen so that it overflows after 5 minutes of uptime, precisely to expose jiffies problems early. This ends up helping us, since the higher word of jiffies64 gets incremented at that point. I tested this by creating the persistent dentries by exploiting the mtime problem, and then refreshing them with the jiffies trigger: after 5 minutes of uptime the broken negative dentry went away.

Comments, please!
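To illustrate the idea (this is my own self-contained sketch with made-up names, not the attached patch itself): the dentry remembers both halves of jiffies64, and a stamp whose high word no longer matches the current one is refused outright rather than fed to time_after_eq():

/* Self-contained sketch; names are mine and do not match the patch.
 * On 32bit, d_fsdata can only hold the low word of a timestamp, so
 * the high word of jiffies64 is stashed in a second otherwise-unused
 * field (d_cookie in the patch). */
#include <stdbool.h>
#include <stdint.h>

struct nfs_dentry_stamp {
	uint32_t low;	/* low word, what d_fsdata effectively holds today */
	uint32_t high;	/* high word of jiffies64 at stamping time */
};

static bool stamp_same_epoch(const struct nfs_dentry_stamp *s, uint64_t now64)
{
	/* If the high word has moved on, the stamp predates at least one
	 * 32bit wrap: force a revalidation instead of trusting a
	 * time_after_eq() comparison of low words. */
	return (uint32_t)(now64 >> 32) == s->high;
}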
It looks like the u32 jiffy overflow problem can also affect positive dentries and file attributes, judging by the code in nfs_refresh_inode and the code flows from there. It would be interesting if someone more experienced in the NFS client code could comment on this.

Imagine that an NFS directory is accessed at about 20 days of uptime, when the 32bit jiffies count is (1000 * 60 * 60 * 24 * 20) == 1,728,000,000. Then this directory inode sits idle on the client for 30 days and is accessed again, when the 32bit jiffies count has overflowed to (1000 * 60 * 60 * 24 * 50) % (2 ^ 32) == 25,032,704. Even with all the cast magic in time_after[_eq], the jiffies at 50 days of uptime is obviously "earlier" than the jiffies at 20 days of uptime, and the code will flow in unintended ways.

Since nfs_fattr->time_start, nfs_inode->last_updated, nfs_inode->cache_change_attribute and some others are all compared in one place or another, the logic is far more difficult to change than in the simple negative dentry case. I was thinking of converting those fields to u64 and get_jiffies_64(), but then operating on those types will no longer be atomic on 32bit archs, and we might need extra locking.

So, if anyone more knowledgeable could comment, it would be great! :)

Thanks,
Fábio
(In reply to comment #32)
> 
> So, if anyone more knowledgeable could comment, it would be great! :)
> 

Definitely not more knowledgeable, but this sounds like an interesting issue and I'm having a little trouble understanding why we need to go to so much trouble to fix it. So perhaps you would be kind enough to humour me and answer a couple of questions.

First, does Trond's patch not work for all the cases?

Why are we so concerned about the cache validation checks when a negative, hashed dentry, as in this case, seems to be an obvious candidate for disposal (d_drop), as is done in Trond's patch?

Ian
(In reply to comment #33)
> why we need to go to so much trouble to fix it. So perhaps you
> would be kind enough to humour me and answer a couple of questions.

Mmm .. obviously not.

Anyway, I have an idea of a possible cause that is somewhat different to the analysis above. But I can't be sure that I understand the problem above or in what kernel versions the problem does and doesn't exist so I'll make the assumption that it doesn't happen in the latest kernel (perhaps 2.6.20 or 21 and above).

> 
> First, does Trond's patch not work for all the cases?

Assume that Trond's patch above prevents the listing of these invalid dentries by unhashing them if they are encountered.

> 
> Why are we so concerned about the cache validation checks when
> a negative, hashed dentry, as in this case, seems to be an
> obvious candidate for disposal (d_drop) as is done in Trond's
> patch?

Looking at this message
http://marc.info/?l=linux-nfsv4&m=115743251604662&w=2
which basically says that there's a race in the client during the mkdir system call. This patch isn't present in the RHEL4 kernel.

To explain, for a mkdir (system) call, the (VFS) lookup returns a dentry that is hashed and negative and it is later filled in in the (VFS) mkdir call. So if a (VFS) lookup happens between these two events the lookup will find the above negative dentry since it is hashed and can, given specific timing, return ENOENT from link_path_walk because the dentry is negative.

Am I at all close to what this bug is about, Fabio?

Ian
Sorry Ian, I wasn't looking at this BZ yesterday...

(In reply to comment #34)
> (In reply to comment #33)
> > why we need to go to so much trouble to fix it. So perhaps you
> > would be kind enough to humour me and answer a couple of questions.
> 
> Mmm .. obviously not.

I'll surely entertain anyone who wants to help. :)

> Anyway, I have an idea of a possible cause that is somewhat
> different to the analysis above. But I can't be sure that I
> understand the problem above or in what kernel versions
> the problem does and doesn't exist so I'll make the assumption
> that it doesn't happen in the latest kernel (perhaps 2.6.20 or
> 21 and above).

I'm not sure it has been fixed there. Unfortunately I don't have an environment with the latest stock kernel to test, but I'll see if I can create something.

> > First, does Trond's patch not work for all the cases?
> 
> Assume that Trond's patch above prevents the listing of
> these invalid dentries by unhashing them if they are
> encountered.

Mainly, Trond's patch above works around the issue in one place where it can be detected. It does not solve the underlying causes, so we'll have to hunt for other places where the issue hits and work around it there as well. Think of an application like a web server, which does not do getdents calls: it just stats/opens files directly and can hit stale dentries that will never be cleaned up.

> > Why are we so concerned about the cache validation checks when
> > a negative, hashed dentry, as in this case, seems to be an
> > obvious candidate for disposal (d_drop) as is done in Trond's
> > patch?

Because not all code paths will end up doing a readdir.

> Looking at this message
> http://marc.info/?l=linux-nfsv4&m=115743251604662&w=2
> which basically says that there's a race in the client during
> the mkdir system call. This patch isn't present in the RHEL4
> kernel.
> 
> To explain, for a mkdir (system) call, the (VFS) lookup returns
> a dentry that is hashed and negative and it is later filled
> in in the (VFS) mkdir call. So if a (VFS) lookup happens
> between these two events the lookup will find the above
> negative dentry since it is hashed and can, given specific
> timing, return ENOENT from link_path_walk because the dentry
> is negative.

But in that case I don't think it would be a persistent issue. You could hit ENOENT once, but on the next lookup the dentry would likely already be filled in. What we see here are negative dentries for files that do exist on the server, and you can check them from the client for hours and still get ENOENT.

> Am I at all close to what this bug is about, Fabio?

Close, yes. But I think there's one fundamental set of conditions we identified that is not yet fully understood, and I'd love to be proved wrong about them. They are:

- NFS uses jiffies timestamps on its structures to compare update/modification times;
- on 32bit platforms, jiffies wrap completely in around 50 days (thus the MSB flips in around 25 days);
- NFS dentries and inodes remain allocated in memory for as long as the filesystem is mounted, which can span months and thus many u32 jiffies flips;
- time_after[_eq] uses cast magic to tolerate sign flips when comparing jiffies, but it obviously cannot cope with two sign flips (it would still be comparing values with the same sign).

What seems to be happening is that when NFS dentries and inodes are left idle for several weeks, it can be that the jiffies value recorded as update time on them will seem "newer" when compared to timestamps of current NFS operations.
This can lead to old information being "renewed" without being properly refreshed from server data.

IMHO, this explains why we're seeing issues with NFS clients showing stale data in directories, for example, as well as stale negative dentries that don't go away even if you change the directory (the jiffies timestamp of the attributes of the directory, even being "now", are still "older" than the stored jiffies timestamp from two sign-flips ago).

Does that sound understandable?

Cheers!
Fábio
(In reply to comment #35)
> Sorry Ian, I wasn't looking at this BZ yesterday...

No prob. I'm interested because I've had to try to deal with the apparent directory corruption before.

snip ..

> What seems to be happening is that when NFS dentries and inodes are left idle
> for several weeks, it can be that the jiffies value recorded as update time on
> them will seem "newer" when compared to timestamps of current NFS operations.
> This can lead to old information being "renewed" without being properly
> refreshed from server data.
> 
> IMHO, this explains why we're seeing issues with NFS clients showing stale data
> in directories, for example, as well as stale negative dentries that don't go
> away even if you change the directory (the jiffies timestamp of the attributes
> of the directory, even being "now", are still "older" than the stored jiffies
> timestamp from two sign-flips ago).
> 
> Does that sound understandable?

Sure, I got most of that, but I thought, from the description, we were worried about hashed, negative dentries. I'll have to have a look through the code, but I don't think negative dentries can be trusted at all. It sounds, though, like we have non-negative aged dentries, which is different from what I thought the case was. There must be a way around this!

Ian
Indeed, the problem started out as a negative dentry problem, but then we perceived that one of the causes (jiffy overflow) of the persistent negative dentries could also be making us re-use old data without refreshing from newer attribute information from the server.
(In reply to comment #37)
> Indeed, the problem started out as a negative dentry problem, but then we
> perceived that one of the causes (jiffy overflow) of the persistent negative
> dentries could also be making us re-use old data without refreshing from newer
> attribute information from the server.

I'm curious about these "persistent negative dentries". If a dentry isn't hashed then it's invisible to the VFS, so it can't be causing a problem, as it is in the process of being disposed of. So we must be concerned with hashed dentries. I can't find anywhere in the code that makes a dentry negative and leaves it hashed. Can you point me to the code that led you to this conclusion, please?

Ian
Please read the code in nfs_lookup(). If the server returns -ENOENT, it will create a negative dentry and hash it, as it should. Otherwise this negative dentry will not speed up any subsequent lookups. Also, please look at nfs_dentry_delete(), and notice that NFS dentries almost never get unhashed, so they can survive many u32 jiffies overflows.
(In reply to comment #39)
> Please read the code in nfs_lookup(). If the server returns -ENOENT, it will
> create a negative dentry and hash it, as it should. Otherwise this negative
> dentry will not speed up any subsequent lookups.
> 
> Also, please look at nfs_dentry_delete(), and notice that NFS dentries almost
> never get unhashed, so they can survive many u32 jiffies overflows.

Yes, I see that now. But I guess that the problem is much the same for dentries whether they're positive or negative, except that the positive dentries won't cause the path walk to return ENOENT, but maybe incorrect status info.

I was also looking at NFS patches queued for 2.6.21 and saw this one:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3e7d950a528454ad749a264feef3c8bad3faa108;hp=ebf4d8ec02d8a13e1cbbda6fe746f80f3a4ca773

Something like this may help with the wraparound issue, but the code isn't similar enough to apply it without more investigation, and there appears to be another field with a similar problem in the RHEL4 NFS code. Maybe I'm wrong about this as well, as I haven't looked further yet.

Ian
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #34)
> (In reply to comment #33)
> > Why are we so concerned about the cache validation checks when
> > a negative, hashed dentry, as in this case, seems to be an
> > obvious candidate for disposal (d_drop) as is done in Trond's
> > patch?
> 
> Looking at this message
> http://marc.info/?l=linux-nfsv4&m=115743251604662&w=2
> which basically says that there's a race in the client during
> the mkdir system call. This patch isn't present in the RHEL4
> kernel.
> 
> To explain, for a mkdir (system) call, the (VFS) lookup returns
> a dentry that is hashed and negative and it is later filled
> in in the (VFS) mkdir call. So if a (VFS) lookup happens
> between these two events the lookup will find the above
> negative dentry since it is hashed and can, given specific
> timing, return ENOENT from link_path_walk because the dentry
> is negative.

That patch is related to bug 239625, where these erroneous negative dentries were created for files and not just directories. That problem was noticed when attempting to create files in O_CREAT|O_EXCL mode on a read-only filesystem. It wasn't a race; rather, the dentry was allocated but never filled in because the create fails on a read-only filesystem. If the file actually exists on the server, the client never notices, because it optimized away the LOOKUP.
Created attachment 157492 [details] Patch that changes how NFS dentries are validated

This patch is upstream and has been used in testing for quite a while. The first chunk addresses the stale negative dentry problem by changing the dentry timestamp comparison to an equality test against the parent inode's cache_change_attribute; time_after makes no sense here, as the dentry timestamp is always updated to be the same as the cache_change_attribute when the latter changes.

The other chunks help diminish stale attribute problems, but are not yet a complete fix. I'll open a new BZ for that and push it separately.
Created attachment 158367 [details] Newer patch to avoid stale negative dentries

This is the same patch re-spun for the latest RHEL-4 kernels.
committed in stream U6 build 55.23. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
changing to FAILS_QA based on comment #55
Moving to Assigned.
Can you make comment #55 public since it seems to indicate that this problem is not fixed?
Looking at the reproducer, I've been able to get this to fail too:

3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.010145 seconds, 310 MB/s
file create OK
exec stat /mnt/nfs/aaa/bbb
/usr/bin/stat: cannot stat `/mnt/nfs/aaa/bbb': No such file or directory

...network captures show that a LOOKUP call is done just before ssh'ing to the server, and that it returns NFS3ERR_NOENT. After the SSH traffic, the client does a GETATTR call on the 'aaa' dir, and then an ACCESS call. My guess is that it's checking the dir attributes to see if it has changed since the failed lookup. They haven't, so it doesn't retry the lookup.
Corresponding nfs_debug messages (1984503 is the aaa directory, 1755649 is its parent):

Aug 30 16:02:31 dhcp59-129 kernel: NFS call lookup bbb
Aug 30 16:02:31 dhcp59-129 kernel: NFS reply lookup: -2
Aug 30 16:02:31 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 0)
Aug 30 16:02:32 dhcp59-129 kernel: NFS: permission(0:11/1755649), mask=0x1, res=0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: revalidating (0:11/1984503)
Aug 30 16:02:32 dhcp59-129 kernel: NFS call getattr
Aug 30 16:02:32 dhcp59-129 kernel: NFS reply getattr. status=0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: nfs_update_inode(0:11/1984503 ct=1 info=0x6)
Aug 30 16:02:32 dhcp59-129 kernel: NFS: (0:11/1984503) revalidation complete
Aug 30 16:02:32 dhcp59-129 kernel: NFS call access
Aug 30 16:02:32 dhcp59-129 kernel: NFS reply access, status = 0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: permission(0:11/1984503), mask=0x1, res=0
Aug 30 16:02:32 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 8)
Looking back over the logs from the testing I did yesterday, I think I see what the issue might be. For reference, here's the similar output from a previous attempt that did not fail:

Aug 30 16:02:30 dhcp59-129 kernel: NFS call lookup bbb
Aug 30 16:02:30 dhcp59-129 kernel: NFS reply lookup: -2
Aug 30 16:02:30 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 0)
Aug 30 16:02:31 dhcp59-129 kernel: NFS: permission(0:11/1755649), mask=0x1, res=0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: revalidating (0:11/1984503)
Aug 30 16:02:31 dhcp59-129 kernel: NFS call getattr
Aug 30 16:02:31 dhcp59-129 kernel: NFS reply getattr. status=0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: nfs_update_inode(0:11/1984503 ct=1 info=0x6)
Aug 30 16:02:31 dhcp59-129 kernel: NFS: mtime change on server for file 0:11/1984503
Aug 30 16:02:31 dhcp59-129 kernel: NFS: (0:11/1984503) data cache invalidated
Aug 30 16:02:31 dhcp59-129 kernel: NFS: nfs3_forget_cached_acls(0:11/1984503)
Aug 30 16:02:31 dhcp59-129 kernel: NFS: (0:11/1984503) revalidation complete
Aug 30 16:02:31 dhcp59-129 kernel: NFS call access
Aug 30 16:02:31 dhcp59-129 kernel: NFS reply access, status = 0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: permission(0:11/1984503), mask=0x1, res=0
Aug 30 16:02:31 dhcp59-129 kernel: NFS: dentry_delete(aaa/bbb, 18)

The main difference here is that the working one shows an mtime change for the directory when the revalidation is done. This means that its data is invalidated properly, and that causes a new lookup to go out over the wire.

I believe the problem here is due to the fact that the underlying filesystem on the server does not support time resolution of less than 1 second. When things get timed in such a way that the file is deleted and then recreated on the server within the same second, the client has no way to tell that the directory has changed, and does not invalidate the directory's data.

Note that the noac option is working properly here. The directory's attributes are not cached: a getattr is done every time the client needs to revalidate this directory. The problem is that the revalidation logic has no way to tell that the directory has changed. If the server were serving from a filesystem that had sub-second mtime resolution, then I think this problem would not be occurring.
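For what it's worth, the granularity dependence can be expressed as a small check (a hedged userspace sketch, not client code; it only shows why a nanosecond field would help if the exported filesystem stored one, which ext3 here does not):

/* Sketch: a change detector keyed on whole-second st_mtime is blind to
 * a remove+recreate within one second; a nanosecond field would catch
 * it, but only if the server's filesystem actually stores nanoseconds. */
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>

static bool dir_changed(const struct stat *old, const struct stat *cur)
{
	if (old->st_mtime != cur->st_mtime)
		return true;	/* visible at one-second granularity */
	/* Same second: only sub-second timestamps can tell these apart. */
	return old->st_mtim.tv_nsec != cur->st_mtim.tv_nsec;
}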
Tested against a server with an XFS partition and the nfsbug.sh script ran 300 iterations without failing. I also tested a RHEL5 client and nfsbug.sh test also failed within a few tries when run against a server on an ext3 filesystem. I'm going to conclude that the report that this problem isn't fixed is wrong, and the problem that they're seeing is not related to the original issue.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html
*** Bug 327591 has been marked as a duplicate of this bug. ***
Hello. Does this issue still occur in RHEL5(u1)? Thanks.
Tommi, I saw the patches discussed here being evaluated as part of other updates to the NFS code in RHEL-5, so although this specific BZ has not been cloned to it, the problem should already have been taken care of there.

Cheers,
Fabio
The problem is reproduced on RHEL 4 with kernel 2.6.9-89.0.11.0.1.ELsmp. The erratum mentioned has the 2.6.9-67 kernel.
Hi Rituraj, can you share some more information about what was reproduced and the environment where it reproduced? There were several related issues discussed in this bugzilla, so please provide more context.
Hi Fabio,

Here is more info about the system:

[root@ora-stage-ebs-app-6 vmwstagecomn]# rpm -qa | grep -i nfs
nfs-utils-lib-1.0.6-10.el4
nfs-utils-1.0.6-93.EL4
[root@ora-stage-ebs-app-6 vmwstagecomn]# uname -a
Linux ora-stage-ebs-app-6.vmware.com 2.6.9-89.0.11.0.1.ELsmp #1 SMP Tue Sep 15 13:54:39 EDT 2009 i686 i686 i386 GNU/Linux
[root@ora-stage-ebs-app-6 vmwstagecomn]#
[root@ora-stage-ebs-app-6 vmwstagecomn]# cat /etc/issue
Enterprise Linux Enterprise Linux AS release 4 (October Update 8)
Kernel \r on an \m

Thanks for the quick response. As this bug is closed, I have opened another ticket for the issue:
https://bugzilla.redhat.com/show_bug.cgi?id=803648
There is a description of the problem there.

Thanks!
I am seeing this issue with the 5.X and 6.X kernels. I documented all of the steps at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1017720. The fix for me is to run the command below when using the 5.X kernel:

echo 2 > /proc/sys/vm/drop_caches

Are you able to help me resolve this issue?