While investigating bug 435291, I noticed that the client occasionally invalidates the file's cache even when it is the only writer of the file. While that bug isn't really related to this proble, it would be nice if that didn't occur as often. The problem is that when there are multiple tasks doing I/O to the same file, write calls can overlap with a getattr that occurs when opening and closing a file (mostly, it seems to occur on open). It goes something like this: getattr call goes out on wire write call goes out on wire write reply (mtime is updated to new mtime) getattr reply (mtime has pre-write mtime) ...there are other races as well, but they basically boil down to the fact that when a getattr goes out on the wire, we need to have no writes in flight, and we need to wait until the getattr reply before we send any new writes. This is probably also a problem on upstream kernels, so I may start with some work there...
Created attachment 298904 [details] upstream patch First quick and dirty upstream patch for this. Adds a new rwsem to the nfs_inode struct, and has write calls get a read lock and getattrs get a write lock. This allows writes to overlap, but makes it so that when a getattr is done, that all writes must complete first and no new writes can occur until the getattr completes. Its tough to tell how much difference this makes, however. It seems like rawhide already didn't do many cache invalidations anyway with the mutex_lt3 reproducer from bug 432974. Still, this patch plus the upstream patches related to bug 436132 seem to pretty much eliminate cache invalidation on rawhide with this test. I'll plan to port something similar for rhel5 for testing since we do seem to get a lot more cache invalidations there with this test. There may be better/cleaner ways to achieve this result rather than using a rwsem, but it does seem to have the semantics we need for this even if we're having to take a read lock in the write codepath, etc...
Created attachment 298916 [details] rhel5 patch 1 First iteration of RHEL5 patch. This one uses an rwsem in the same way as the upstream/rawhide patch. I'm testing this on a RHEL5 development kernel (-86.el5) plus some extra patches from other BZ's. In particular: 436132 437544 ...this patch on top of the patches for those BZ's reduces cache invalidations to 0 on my test rig. I was able to run both testcases for bug 432974 for quite some time w/o failures. While this doesn't actually fix bug 432974, it certainly should make the problem harder to hit. This patch was hastily hacked together though, so I need to give it another look and some more testing before upstream submission.
Peter mentioned an alternative approach: 15:16 < ps> You could also just avoid using the results from the GETATTR for anything other than updating the cached attributes if there are any WRITE operations outstanding. ... 15:17 < ps> That way, you wouldn't have to serialize operations...
I've been playing around with Peter's suggestion (and also figured out the problem I was having with the semaphore code). I'm still seeing cache invalidations. The new reproducer I'm working with does posix locking on the file to coordinate access between different processes. do_setlk() does this: /* * Make sure we clear the cache whenever we try to get the lock. * This makes locking act as a cache coherency point. */ nfs_sync_mapping(filp->f_mapping); nfs_zap_caches(inode); ...that seems like a bit of a performance killer, but changing it could be dangerous. It would be nice if we only did this if the previous lockholder was a different machine, but I don't guess there's any way to tell that. I'll plan to respin my reproducer to use a different synchronization mechanism that avoids this codepath.
Testing new set of patches on rawhide kernels. This one tries to keep track of writes and has nfs_update_inode avoid invalidating the inode's data when it looks like the call that returned fattrs overlapped with a write. This patchset takes the cautious approach and doesn't invalidate the cache if the write request or reply falls within the same jiffy as "now" or when the call was started within the same jiffy as the return of the fattrs. Results (from rawhide -314 kernel) running: # time /opt/iozone/bin/iozone -ac -g 64M -f /mnt/netapp/iozone.testfile Unpatched: ------------------------------------------------------- iozone test complete. real 27m30.160s user 0m1.186s sys 10m57.557s [root@dhcp231-229 ~]# nfsstat -c Client rpc stats: calls retrans authrefrsh 270023 0 0 Client nfs v3: null getattr setattr lookup access readlink 0 0% 1851 0% 198 0% 199 0% 248 0% 0 0% read write create mkdir symlink mknod 74794 27% 192333 71% 198 0% 0 0% 0 0% 0 0% remove rmdir rename link readdir readdirplus 198 0% 0 0% 0 0% 0 0% 0 0% 0 0% fsstat fsinfo pathconf commit 0 0% 2 0% 1 0% 0 0% Patched: ------------------------------------------------------- iozone test complete. real 24m20.545s user 0m1.240s sys 9m15.268s [root@dhcp231-229 linux-2.6.26.x86_64]# nfsstat -c Client rpc stats: calls retrans authrefrsh 245454 0 0 Client nfs v3: null getattr setattr lookup access readlink 0 0% 1851 0% 198 0% 199 0% 246 0% 0 0% read write create mkdir symlink mknod 50189 20% 192371 78% 198 0% 0 0% 0 0% 0 0% remove rmdir rename link readdir readdirplus 198 0% 0 0% 0 0% 0 0% 0 0% 0 0% fsstat fsinfo pathconf commit 0 0% 2 0% 1 0% 0 0% ...the results here are not very compelling, unfortunately. It did seem to reduce the number of reads by ~33% or so, but only shaved around 3 mins off of the run time. Some of that may just be dumb luck, however. I'm not sure that the risk to cache consistency is worth any reward this would give us. That said, the mechanism I'm using to detect overlapping writes is somewhat primitive and could be better.
Updating PM score.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Reopening for consideration for 5.5
Looks like Trond may have already fixed this upstream, I suspect with commit a10ad17630024bf7aae8e7f18352f816ee483091. I don't see any cache invalidations on the simple iozone test with a rawhide kernel.