An unscientific iozone test run on a RHEL4 FV Xen guest against a NetApp NFS server shows a roughly 30% slowdown in wallclock time (19m22s vs. 25m24s):

# time /opt/iozone/bin/iozone -caM -g 64M

==> /tmp/nfs3-iozone-2.6.9-42.0.10.ELsmp.out <==
real    19m22.589s
user    0m2.153s
sys     16m42.608s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3269    1% 99      0% 397     0% 175     0% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 191435 97% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 1       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%

==> nfs3-iozone-2.6.9-68.16.EL.jtltest.32smp.out <==
real    25m24.618s
user    0m2.168s
sys     17m38.152s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3168    1% 198     0% 366     0% 358     0% 0       0%
read       write      create     mkdir      symlink    mknod
115022 36% 191433 61% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%

Note that the U4 (2.6.9-42) kernel makes no READ calls at all. The problem seems to be that the current RHEL4 kernel is occasionally invalidating the cache even though it's the only writer. I suspect the problem was introduced with linux-2.6.9-nfs-mmap-cache.patch. The kernel being tested already has the patches for bug 435443, so that patchset is evidently not enough to fix this.
Just for giggles, I built a patched nfs.ko with the patchset from bz 321101 (I had to drop the patches from 435443 so that they'd apply). The results don't seem to be very good:

real    26m7.424s
user    0m2.396s
sys     19m28.232s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 2873    0% 198     0% 199     0% 359     0% 0       0%
read       write      create     mkdir      symlink    mknod
124260 38% 191434 59% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 1       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%

...the time is slightly worse, and it seems to cause even more READ calls. I'd say this problem likely won't be fixed by that patchset.
In an email Larry mentioned:

>> ==> iozone-2.6.9-68.16.EL.jtltest.31smp.out <==
>>
>> real    24m8.703s
>> user    0m2.186s
>> sys     20m7.830s
>
> This is ~85% system time, kernel profiles for both kernels might pinpoint
> the cause of the degradation...

My experience so far with kernel profiling on NFS problems hasn't been that great. Most of the time the samples land in sleeping routines, since we generally tend to be blocked waiting on network I/O, and that doesn't help me pinpoint much. That said, this situation may be different, and I may look at that after I get a bit more baseline info.
RHEL5 results:

iozone test complete.

real    48m50.905s
user    0m1.788s
sys     22m57.532s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1838    0% 198     0% 199     0% 286     0% 0       0%
read       write      create     mkdir      symlink    mknod
60765  23% 191423 75% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 2       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%

...this took almost an hour to run, and there's a large discrepancy between wallclock and system time. Most of the call counts are pretty close to the RHEL4 case, though the number of READ calls is definitely reduced. The client here was a RHEL5 FV xen guest running 2.6.18-84.el5. I've noticed some issues with timers on SMP xen guests, so I may drop this down to a 1-CPU guest and rerun the test. I wouldn't think this test should be CPU bound anyway...
Same test, this time booted with "nohpet", which seems to return sanity to the timers on FV SMP xen guests. Still, things look pretty much the same:

iozone test complete.

Timer resolution is poor. Some small transfers may have
reported inaccurate results. Sizes 512 Kbytes and below.

real    47m8.849s
user    0m1.740s
sys     23m36.606s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1840    0% 198     0% 199     0% 280     0% 0       0%
read       write      create     mkdir      symlink    mknod
54290  21% 191429 76% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 1       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%

...I'll probably return to working on RHEL4 for now. While I don't have an explanation for the longer wallclock time here, it does look like RHEL5 reduces the number of READs in this test. Still, they're not 0 like they were in RHEL4.4. When I get some more time, I plan to run some smaller-scale tests (maybe a single file and fewer I/O sizes) and see if I can replicate the situation where the cache is invalidated and we have to re-read the page data. If we can determine the conditions under which that occurs, we might be able to eliminate at least some cases of it and help performance.
Turning up a bit of nfs_debug. Here's one case from a smaller iozone run:

NFS: 7635 initiated write call (req 0:13/2845693, 32768 bytes @ offset 0)
NFS: 7636 initiated write call (req 0:13/2845693, 32768 bytes @ offset 32768)
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: mtime change on server for file 0:13/2845693
NFS: 7636 nfs_writeback_done (status 32768)
NFS: write (0:13/2845693 4096@32768) OK
NFS: write (0:13/2845693 4096@36864) OK
NFS: write (0:13/2845693 4096@40960) OK
NFS: write (0:13/2845693 4096@45056) OK
NFS: write (0:13/2845693 4096@49152) OK
NFS: write (0:13/2845693 4096@53248) OK
NFS: write (0:13/2845693 4096@57344) OK
NFS: write (0:13/2845693 4096@61440) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: mtime change on server for file 0:13/2845693
NFS: 7635 nfs_writeback_done (status 32768)

I'm pretty sure the nfs_update_inode call here is due to the nfs_post_op_update_inode_force_wcc in nfs3_write_done. It seems like that should have prevented this from occurring. I'll see about instrumenting it to make sure it's actually doing what we expect.

...so we had two cases where we saw an mtime change on the server. The interesting bit is that the second write call completed before the first.
The reason the pre_op_attr fields are not reliable when writing to a file is precisely what you have found. The server is free to process requests in any order it likes and can also send responses in any order it wishes. The client may also process responses in arbitrary orders, based on scheduling. If the client does not process responses in the same order that the server generates them (not sends them, by the way), then the pre_op_attr fields won't match the post_op_attr fields from the previous operation. Thus the file will appear to have been modified by a different entity, when in fact it was not.

Most client implementations ignore the pre_op_attr fields and just plug in the post_op_attr fields straight, without doing cache validation. Some strict consistency is clearly lost, but this is a well-known issue with NFS, and applications have long since taken it into account. This ordering issue can lead to large performance problems, such as you are seeing.
Yep, that seems to be the case. The issue is that I was doing this testing against a netapp, and it actually fills out the pre_op_attr fields. So when the ops complete out of order like this, it causes exactly what you describe. I'm going to rerun this test against my rawhide box with an xfs partition and see if the pre_op_attr fakeup code eliminates the cache invalidations.

> Most client implementations ignore the pre_op_attr fields and just
> plug in the post_op_attr fields straight, without doing cache
> validation. Some strict consistency is clearly lost, but this is a
> well known issue with NFS and applications have long since taken this
> into account.

...the code to fake up pre_op_attrs does not do this in any Linux version, AFAICT:

	if ((fattr->valid & NFS_ATTR_FATTR) != 0 &&
	    (fattr->valid & NFS_ATTR_WCC) == 0) {
		memcpy(&fattr->pre_ctime, &inode->i_ctime, sizeof(fattr->pre_ctime));
		memcpy(&fattr->pre_mtime, &inode->i_mtime, sizeof(fattr->pre_mtime));
		fattr->pre_size = inode->i_size;
		fattr->valid |= NFS_ATTR_WCC;
	}

...when the server returns actual pre_op_attrs, NFS_ATTR_WCC is already set and we don't force fake pre_op_attrs. Are you saying that this is wrong and that we should just toss out the pre_op_attrs here? If so, perhaps we should consider making this tunable somehow? I don't particularly like adding new knobs, but if we're going to bend the correctness rules for performance, we should probably make it something that customers can disable.
Testing against a rawhide box with an XFS partition showed similar results. I *still* get cache invalidations. Most of them though seemed to be occurring due to post_op_attr checks on commit calls. I suppose at that point, the data gets flushed to disk and that changes the mtime. As an experiment, I changed the nfs3 commit calls to use the force_wcc variant and things seem to be chugging along much faster. I see no read calls with that change. I need to ponder whether this is any worse than faking up the pre_op_attrs on a write call. It may be, in which case this isn't possible. Peter, thoughts?
Test results from my commit-fakes-up-preattrs kernel:

iozone test complete.

real    55m33.022s
user    0m2.169s
sys     23m23.886s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3169    1% 198     0% 397     0% 432     0% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 191438 97% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 727     0%

Roughly the same number of write calls, and 0 reads, which indicates that the cache was never invalidated. This is similar to how the 4.4 kernels behave. The question, of course, is whether this causes any cache-consistency regressions; it should probably go upstream first if so. Now that I think about it, though, doing this for COMMIT seems a bit more dangerous. It's also not clear to me why COMMIT calls cause the mtime to change this way -- the mtime should really change on the write, with the commit just making sure that everything is flushed to disk. So there may be a server-side component to this as well.

I may experiment with a patch to do what I think Peter was suggesting in comment #10 -- ignoring pre_op_attrs for all write calls. It would also be interesting to see whether COMMIT calls behave similarly on a netapp server. Do they tend to change the mtime as well?
Yes, I think that we should ignore all pre_op_attr returns from WRITE requests. Without the tunable. COMMIT is not supposed to cause the mtime to update. Is it?
...oh, and as for the timing on the test in comment #13 -- I'm not surprised that it was so slow. The server there is a rawhide xen guest; the interesting part is the number of calls.

> Yes, I think that we should ignore all pre_op_attr returns from WRITE
> requests. Without the tunable.

Ok, I hacked up a quick patch to test this and it's running now. It certainly cuts down the number of READ calls when testing against the netapp, but they're definitely not 0, so we are still getting some cache invalidations even with that:

---------------[snip]----------------
iozone test complete.

real    19m56.682s
user    0m2.165s
sys     17m22.103s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3172    1% 198     0% 371     0% 181     0% 0       0%
read       write      create     mkdir      symlink    mknod
7333    3% 191438 94% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 2       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%
---------------[snip]----------------

...still, it's much improved, and the timing is close to what we see on 4.4 kernels.

> COMMIT is not supposed to cause the mtime to update. Is it?

I wouldn't think so. It's possible that my interpretation is wrong here. Maybe these out-of-order writes are somehow skewing things? For instance:

Write Call 1
Write Call 2
Write Response 2 -- update inode's mtime with faked-up wcc attrs
Write Response 1 -- update mtime again, but it's actually *earlier* than the last response
Commit comes in -- wcc pre_op_attrs have the mtime from Write Response 2's post_op_attrs, and we invalidate the cache

I could probably do a patch that makes a printk pop whenever we overwrite the mtime with one that's older than what's there currently. I'll plan to play with that a little later...
The theory in comment #15 does not seem to be the case. I added a bit of debug code to print the mtimes whenever there was a cache invalidation due to an mtime change, and to dump the stack:

NFS: mtime change on server for file 0:13/2845693 inode=1204874137.937896000 fattr=1204874137.939896000
Call Trace:<ffffffffa01cd6fa>{:nfs:nfs_update_inode+477}
       <ffffffffa01ce804>{:nfs:__nfs_revalidate_inode+448}
       <ffffffff80136214>{autoremove_wake_function+0}
       <ffffffffa01cb9b4>{:nfs:nfs_file_flush+235}
       <ffffffff8017aefa>{filp_close+68}
       <ffffffff8017afa6>{sys_close+130}
       <ffffffff801102aa>{system_call+126}

NFS: mtime change on server for file 0:13/2845693 inode=1204874142.76232000 fattr=1204874142.78232000
Call Trace:<ffffffffa01cd6fa>{:nfs:nfs_update_inode+477}
       <ffffffffa01ce804>{:nfs:__nfs_revalidate_inode+448}
       <ffffffff80136214>{autoremove_wake_function+0}
       <ffffffffa01cb9b4>{:nfs:nfs_file_flush+235}
       <ffffffff8017aefa>{filp_close+68}
       <ffffffff8017afa6>{sys_close+130}
       <ffffffff801102aa>{system_call+126}

...the fattr mtimes are consistently later than the inode mtimes. I'm not certain these are from commit calls, but I suspect they are, since the stack comes from a close() syscall. This suggests there are mtime changes after the last write reply: either the commit itself is causing an mtime change, or the server is returning the last write response with a post_op_attr mtime that doesn't reflect its result. The other possibility is that some other call that doesn't return post_op_attrs is changing the mtime, but I can't think of what that would be in the context of an iozone run.

Whatever this is, it doesn't seem to be a netapp-specific problem. Running this against a linux server gives similar results.
Actually, I wasn't thinking about this correctly last night -- the above experiment is meaningless. This morning I did a patch to test whether the out-of-order writes were skewing things: if we had faked up the pre_op_attr mtime, and the post_op_attr mtime was before the inode mtime, I'd overwrite fattr->mtime with the current inode mtime and do a printk. This should prevent the mtime from ever going backward due to post-op attrs from our own writes. If the theory about out-of-order write replies were correct, I should have seen some of these printks; I never did, so that doesn't seem to be the probable cause.

On another note: Peter, what do you think the chances are of getting the behavior change you suggest in comment #14 upstream? Do you think there's a good enough argument for doing this even though it's against the letter of the RFC?
Created attachment 297190 [details]
binary capture from iozone test against netapp

This capture is from an iozone test against a netapp. The client kernel has the patch that makes it ignore pre_op_attrs on write replies. We're still getting some cache invalidations, so I looked through this capture for READ calls and had a look at what was happening on the wire just before them.

There are a couple of instances -- frames 2337 and 2408 have WRITE calls, but wireshark doesn't seem to be able to find their replies. A little while after those calls, we see some GETATTR calls. The mtime in the GETATTR reply in frame 2341 is later than the mtime in the last write reply we received (in frame 2306). Just after these calls we see some packet loss:

2339 3.040324 10.10.56.139 10.11.231.236 RPC [TCP ACKed lost segment] [TCP Previous segment lost] Continuation

...I don't see any COMMIT calls happening here, so I don't think they're involved after all. My guess is that packet loss on write requests is causing the occasional cache invalidation, since it makes us miss a post_op_attr mtime update.
Created attachment 297191 [details]
binary capture from iozone test against netapp
Of course, this brings us to the question: why aren't we seeing retransmissions when we get no response to the write? I canceled this last run after seeing the READ calls. Afterward, nfsstat shows:

Client rpc stats:
calls      retrans    authrefrsh
1331       0          0

...shouldn't the WRITE calls with no real responses have been retransmitted? Or am I putting too much stock in wireshark's NFS decoder?
Packet loss? With TCP? Visible to an application? That doesn't seem likely. While wireshark may be showing lost packets, and indeed they may be being lost, TCP will take care of retransmitting them. Such things will not be visible to the application unless an error is also returned from TCP.

With regard to RFC 1813, having written the specification and having designed the weak cache consistency part of the protocol, I would say that it is unfortunate that that part of the protocol didn't work out as well as I would have liked. It works very well for single-threaded access like directory operations, but doesn't work well at all with multi-threaded access like writing from multiple contexts. This can be made to work, by doing something like sorting responses based on the pre_op_attr mtimes and then applying the wcc_data information, but it is not clear that this wouldn't cause a performance degradation. It may be worth trying, though. Done right, we could get both the consistency and the performance.

The simple solution, though, is to ignore the pre_op_attr fields if they are returned and just use the post_op_attr fields to update the attributes without doing any cache validation. It would be good to ensure that time does not go backwards, though.

Acceptance upstream -- I suspect that Trond could be convinced, but it may or may not take some work.
I didn't mean to say that they were being lost for good. I meant that packet loss is delaying the write responses and allowing other calls to race in after the writes have actually completed on the server. So the GETATTR call sees the mtime change that happened as a result of the write whose response was delayed; we're not forcing the wcc updates on the GETATTR call, so we end up invalidating the cache. I do see quite a few GETATTR calls in this capture that may not be necessary. Reducing their number might help alleviate this, but that's really a separate problem...

> This can be made to work, by doing something like sorting responses
> based on the pre_op_attr mtimes and then applying the wcc_data information,
> but it not clear that this would not cause a performance degradation.
> It may be worth trying though. If done right, we could get the
> consistency and the performance.

This sounds tricky and would need some careful design work. I'm not sure I understand exactly what it would entail...

> The simple solution, though, is to ignore the pre_op_attr fields if
> they are returned and just use the post_op_attr fields to update the
> attributes without doing any cache validation. It would be good to
> ensure that time does not go backwards though.
>
> Acceptance upstream -- I suspect that Trond could be convinced, but
> it may or may not take some work.

That's probably the best place to start. I'll plan to write this up with a patch in the near future and send it to the list. I'm not comfortable making this change to RHEL without it going upstream first. If nothing else, we should at least be able to start some discussion on the problem there. Thanks!
Do we know what generated the GETATTR? Cache validation or something else? We can't safely do a GETATTR for cache validation while there are outstanding operations that modify the file. The returned attributes may show changes to the file before we are ready to correctly handle them.
It looks like it's close-to-open cache consistency. A bit of debug info with a dump_stack() added to __nfs_revalidate_inode:

NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x27)
NFS: 54365 nfs_writeback_done (status 32768)
NFS: write (0:13/2845693 4096@0) OK
NFS: write (0:13/2845693 4096@4096) OK
NFS: write (0:13/2845693 4096@8192) OK
NFS: write (0:13/2845693 4096@12288) OK
NFS: write (0:13/2845693 4096@16384) OK
NFS: write (0:13/2845693 4096@20480) OK
NFS: write (0:13/2845693 4096@24576) OK
NFS: write (0:13/2845693 4096@28672) OK
NFS: revalidating (0:13/2845693)
Call Trace:<ffffffffa01cda9a>{:nfs:__nfs_revalidate_inode+85}
       <ffffffff80136214>{autoremove_wake_function+0}
       <ffffffffa01d5916>{:nfs:nfs_flush_inode+183}
       <ffffffffa01d607a>{:nfs:nfs_commit_inode+136}
       <ffffffffa01cb9b4>{:nfs:nfs_file_flush+235}
       <ffffffff8017aefa>{filp_close+68}
       <ffffffff8017afa6>{sys_close+130}
       <ffffffff801102aa>{system_call+126}
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/2845693 ct=1 info=0x6)
NFS: mtime change on server for file 0:13/2845693
NFS: (0:13/2845693) data cache invalidated

...the nfs_update_inode call done from nfs_writeback_done did not invalidate the cache, but the getattr immediately following it *did*. I think that getattr was issued from nfs_file_flush(). I need to add a bit more instrumentation to dump out info about the mtime when we decide to invalidate the cache, and when it's updated as a result of wcc updates. Maybe that will make this clearer...
Did a debug patch that instrumented the mtime handling in nfs_update_inode and nfs_wcc_update_inode:

NFS: 54573 initiated write call (req 0:13/2845693, 12288 bytes @ offset 0)
NFS: 54574 initiated write call (req 0:13/2845693, 4096 bytes @ offset 16384)
NFS: 54575 initiated write call (req 0:13/2845693, 4096 bytes @ offset 24576)
NFS: 54576 initiated write call (req 0:13/2845693, 20480 bytes @ offset 32768)
NFS: 54577 initiated write call (req 0:13/2845693, 4096 bytes @ offset 57344)
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938324.267327000 new=1204938325.225912000
NFS: 54573 nfs_writeback_done (status 12288)
NFS: write (0:13/2845693 4096@0) OK
NFS: write (0:13/2845693 4096@4096) OK
NFS: write (0:13/2845693 4096@8192) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.225912000 new=1204938325.233912000
NFS: 54574 nfs_writeback_done (status 4096)
NFS: write (0:13/2845693 4096@16384) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.233912000 new=1204938325.240912000
NFS: 54575 nfs_writeback_done (status 4096)
NFS: write (0:13/2845693 4096@24576) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.240912000 new=1204938325.332912000
NFS: 54577 nfs_writeback_done (status 4096)
NFS: write (0:13/2845693 4096@57344) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.332912000 new=1204938325.330912000

^^^ the mtime is being reset to the past here.
NFS: 54576 nfs_writeback_done (status 20480)
NFS: write (0:13/2845693 4096@32768) OK
NFS: write (0:13/2845693 4096@36864) OK
NFS: write (0:13/2845693 4096@40960) OK
NFS: write (0:13/2845693 4096@45056) OK
NFS: write (0:13/2845693 4096@49152) OK
NFS: revalidating (0:13/2845693)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/2845693 ct=1 info=0x6)
NFS: mtime change on server for file 0:13/2845693 inode=1204938325.330912000 fattr=1204938325.332912000

^^^ so when the getattr is done, it returns the *real* mtime, which no longer matches the stale one our cache was reset to, and the cache is invalidated. I'm not sure why my earlier test patch didn't catch this -- it may have had a bug. I have an idea about how to fix this; I'll test it and see if it helps.
Created attachment 297250 [details]
patch 1 -- always fake pre_op_attrs

This is the patch suggested by Peter. It makes the code that fakes up pre_op_attrs apply even when the server has sent us valid pre_op_attrs.
Created attachment 297251 [details]
patch 2 -- add timespec_before helper inline function

A simple helper function to detect when one timespec is before another.
Created attachment 297252 [details]
patch 3 -- don't allow faked wcc attrs to do backward mtime updates

This patch seems to fix the remaining cache invalidations that I've seen. When we're faking pre_op_attrs, we also don't allow the code to overwrite the inode's mtime with an fattr mtime that's in the past; instead we overwrite the fattr's mtime with the inode's mtime to make sure we avoid invalidating the cache. I think this is a reasonable fix, but I haven't yet thought through all of the implications. It also needs regression testing.
With these 3 proposed patches, the reproducer shows 0 reads, which matches the behavior of the 4.4 kernels. None of these patches is upstream, and getting them in may be tough -- we're technically violating the RFC, though with the author's blessing ;-). I need to do a bit more testing; if it works OK, I'll plan to roll this up as a set and propose it upstream in the near future.
Testing on rawhide this morning in preparation for sending these upstream. Here are results on a kernel without any of these patches:

iozone test complete.

real    46m6.959s
user    0m1.509s
sys     23m40.880s

Client rpc stats:
calls      retrans    authrefrsh
281849     0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1867    0% 198     0% 199     0% 319     0% 0       0%
read       write      create     mkdir      symlink    mknod
83232  29% 195633 69% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 2       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%

...similar results to what we see on RHEL4 without any extra patches.
Results with the 2 extra patches ported for upstream. Upstream already has a timespec_compare function, so I'll probably respin the RHEL4 set to add the same function and change the patch here to use it instead.

real    33m57.375s
user    0m1.471s
sys     12m59.619s

Client rpc stats:
calls      retrans    authrefrsh
198642     0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1802    0% 198     0% 201     0% 206     0% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 195836 98% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%
Both patches posted upstream to linux-nfs list and sent to Trond directly.
Changing tagline to something more descriptive
The entire approach here has been pretty thoroughly shot down upstream. One problem is that it only helps with cache invalidation due to attributes returned on write calls; if a getattr or similar call overlaps with a write, you can still end up unnecessarily invalidating the cache. We also need to make sure that when we ignore a set of attributes like this, we set NFS_INO_INVALID_ATTR to indicate that the attrcache may not be right.

After talking this over with Peter at Connectathon, I'm going to try a different approach: track whether a call that returned attributes overlapped with a write. If so, we'll use those attributes to update the attrcache, but won't invalidate the data.
Created attachment 309488 [details]
patch1 - NFS: don't use post_op_attrs from WRITE replies for cache revalidation
Created attachment 309489 [details]
patch2 - NFS: track number of writes on the wire and when last one started/ended
Created attachment 309490 [details]
patch3 - NFS: change nfs_refresh_inode and __nfs_revalidate_inode to detect overlapping writes
New (very rough) cut of an upstream patchset. This should really be considered a proof of concept or RFC rather than an actual patch submission. The idea is to track three new pieces of information about write calls:

1) the number of writes outstanding on the wire
2) when the last write call was sent
3) when the last write reply was received

With this we should be able to detect whether a call that returned attributes overlapped with a write. I've done some very basic testing with this. With a straight, single-threaded iozone test, it eliminates cache invalidations. It also seems to eliminate cache invalidations when I run the mmap-repro.c test for bug 432974 (previously I'd often see cache invalidations from the CTO getattr calls with that reproducer).

This is probably not without races, though. It seems easily possible for a write to start after a getattr reply was received but before we could process it; with this logic we wouldn't invalidate the cache there when we probably should. So the logic in nfs_overlapping_write may need some further tweaking. Getting this wrong will probably mean some hard-to-detect and harder-to-track-down data corruption, so we need to be very careful here.
Did an updated patchset that adds this macro to jiffies.h and uses it in nfs_overlapping_write():

+#define time_between(a,b,c) \
+	(time_after(a,b) && \
+	 time_before(a,c))

...this makes that function return false on the boundary conditions. The results aren't quite as good. With an iozone test:

read       write
240     0% 194938 58%

...so there were some cache invalidations involved. On a short mmap-repro test (a 5-10 minute run), it still seems to pretty much completely eliminate cache invalidations. Still, this is a substantial improvement over the unpatched case, and it should be a bit safer for the edge cases.
Forgot to post the timing from the modified patch:

real    24m38.818s
user    0m0.993s
sys     8m13.286s

...not quite as fast as with the first patch, but still quite a bit better than the unpatched case.
Created attachment 309620 [details] 5 patch set Newer set of 5 concatenated patches. This has a more "mature" implementation of nfs_update_inode_attributes, makes it so that we do not match on edge cases in nfs_overlapping_write() and also removes the old nfs_post_op_update_inode_force_wcc() function since it's no longer used.
Results with this seem to be pretty good. I see a substantial performance gain on a straight iozone test, and no cache invalidations with the mmap-repro and mutex-lt3 reproducers.

To do: what this really needs is testing with multiple machines updating the same files without fcntl locking; that's the most likely way to tickle any bugs this may cause. I'll need to ponder how best to test that. Also, I'm not sure I have all of the logic correct in nfs_update_inode_attributes -- anyone looking over this should review it carefully. Finally, there's probably some opportunity for code consolidation here: maybe have nfs_update_inode call nfs_update_inode_attributes and then do the other revalidation logic. I'll probably try that before submitting upstream.
Updating PM score.
*** This bug has been marked as a duplicate of bug 427385 ***