Bug 438676 - reduce false NFS cache invalidations due to overlapping write and getattr calls
Summary: reduce false NFS cache invalidations due to overlapping write and getattr calls
Keywords:
Status: CLOSED DUPLICATE of bug 601800
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
low
low
Target Milestone: rc
: ---
Assignee: Jeff Layton
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 533192
TreeView+ depends on / blocked
 
Reported: 2008-03-24 12:19 UTC by Jeff Layton
Modified: 2010-11-09 13:17 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-06-28 12:44:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
upstream patch (4.22 KB, patch)
2008-03-24 16:35 UTC, Jeff Layton
no flags Details | Diff
rhel5 patch 1 (4.05 KB, patch)
2008-03-24 18:35 UTC, Jeff Layton
no flags Details | Diff

Description Jeff Layton 2008-03-24 12:19:57 UTC
While investigating bug 435291, I noticed that the client occasionally
invalidates the file's cache even when it is the only writer of the file. While
that bug isn't really related to this proble, it would be nice if that didn't
occur as often.

The problem is that when there are multiple tasks doing I/O to the same file,
write calls can overlap with a getattr that occurs when opening and closing a
file (mostly, it seems to occur on open). It goes something like this:

getattr call goes out on wire
write call goes out on wire
write reply (mtime is updated to new mtime)
getattr reply (mtime has pre-write mtime)

...there are other races as well, but they basically boil down to the fact that
when a getattr goes out on the wire, we need to have no writes in flight, and we
need to wait until the getattr reply before we send any new writes.

This is probably also a problem on upstream kernels, so I may start with some
work there...

Comment 1 Jeff Layton 2008-03-24 16:35:06 UTC
Created attachment 298904 [details]
upstream patch 

First quick and dirty upstream patch for this. Adds a new rwsem to the
nfs_inode struct, and has write calls get a read lock and getattrs get a write
lock. This allows writes to overlap, but makes it so that when a getattr is
done, that all writes must complete first and no new writes can occur until the
getattr completes.

Its tough to tell how much difference this makes, however. It seems like
rawhide already didn't do many cache invalidations anyway with the mutex_lt3
reproducer from bug 432974. Still, this patch plus the upstream patches related
to bug 436132 seem to pretty much eliminate cache invalidation on rawhide with
this test. I'll plan to port something similar for rhel5 for testing since we
do seem to get a lot more cache invalidations there with this test.

There may be better/cleaner ways to achieve this result rather than using a
rwsem, but it does seem to have the semantics we need for this even if we're
having to take a read lock in the write codepath, etc...

Comment 2 Jeff Layton 2008-03-24 18:35:00 UTC
Created attachment 298916 [details]
rhel5 patch 1

First iteration of RHEL5 patch.

This one uses an rwsem in the same way as the upstream/rawhide patch. I'm
testing this on a RHEL5 development kernel (-86.el5) plus some extra patches
from other BZ's. In particular:

436132
437544

...this patch on top of the patches for those BZ's reduces cache invalidations
to 0 on my test rig. I was able to run both testcases for bug 432974 for quite
some time w/o failures.

While this doesn't actually fix bug 432974, it certainly should make the
problem harder to hit.

This patch was hastily hacked together though, so I need to give it another
look and some more testing before upstream submission.

Comment 3 Jeff Layton 2008-04-07 19:20:59 UTC
Peter mentioned an alternative approach:

15:16 < ps> You could also just avoid using the results from the GETATTR for 

            anything other than updating the cached attributes if there are any 

            WRITE operations outstanding.

...
15:17 < ps> That way, you wouldn't have to serialize operations...




Comment 4 Jeff Layton 2008-04-10 11:38:24 UTC
I've been playing around with Peter's suggestion (and also figured out the
problem I was having with the semaphore code). I'm still seeing cache
invalidations. The new reproducer I'm working with does posix locking on the
file to coordinate access between different processes. do_setlk() does this:

        /*
         * Make sure we clear the cache whenever we try to get the lock.
         * This makes locking act as a cache coherency point.
         */
        nfs_sync_mapping(filp->f_mapping);
        nfs_zap_caches(inode);

...that seems like a bit of a performance killer, but changing it could be
dangerous. It would be nice if we only did this if the previous lockholder was a
different machine, but I don't guess there's any way to tell that.

I'll plan to respin my reproducer to use a different synchronization mechanism
that avoids this codepath.


Comment 5 Jeff Layton 2008-09-09 16:55:35 UTC
Testing new set of patches on rawhide kernels. This one tries to keep track of writes and has nfs_update_inode avoid invalidating the inode's data when it looks like the call that returned fattrs overlapped with a write. This patchset takes the cautious approach and doesn't invalidate the cache if the write request or reply falls within the same jiffy as "now" or when the call was started within the same jiffy as the return of the fattrs. Results (from rawhide -314 kernel) running:

# time /opt/iozone/bin/iozone -ac -g 64M -f /mnt/netapp/iozone.testfile

Unpatched:
-------------------------------------------------------
iozone test complete.

real	27m30.160s
user	0m1.186s
sys	10m57.557s

[root@dhcp231-229 ~]# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
270023     0          0       

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 1851      0% 198       0% 199       0% 248       0% 0         0% 
read         write        create       mkdir        symlink      mknod        
74794    27% 192333   71% 198       0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
198       0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 2         0% 1         0% 0         0% 


Patched:
-------------------------------------------------------

iozone test complete.

real	24m20.545s
user	0m1.240s
sys	9m15.268s

[root@dhcp231-229 linux-2.6.26.x86_64]# nfsstat -c
Client rpc stats:
calls      retrans    authrefrsh
245454     0          0       

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 1851      0% 198       0% 199       0% 246       0% 0         0% 
read         write        create       mkdir        symlink      mknod        
50189    20% 192371   78% 198       0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
198       0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 2         0% 1         0% 0         0% 

...the results here are not very compelling, unfortunately. It did seem to reduce the number of reads by ~33% or so, but only shaved around 3 mins off of the run time. Some of that may just be dumb luck, however. I'm not sure that the risk to cache consistency is worth any reward this would give us.

That said, the mechanism I'm using to detect overlapping writes is somewhat primitive and could be better.

Comment 6 RHEL Program Management 2009-02-16 15:44:58 UTC
Updating PM score.

Comment 8 RHEL Program Management 2009-02-17 18:07:23 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 9 Jeff Layton 2009-02-17 21:03:21 UTC
Reopening for consideration for 5.5

Comment 10 Jeff Layton 2009-08-25 16:33:40 UTC
Looks like Trond may have already fixed this upstream, I suspect with commit a10ad17630024bf7aae8e7f18352f816ee483091. I don't see any cache invalidations on the simple iozone test with a rawhide kernel.


Note You need to log in before you can comment on or make changes to this bug.