An unscientific iozone test run on a RHEL4 FV Xen guest against a NetApp NFS server shows a roughly 30% slowdown in wallclock time (19m22s vs. 25m24s):

# time /opt/iozone/bin/iozone -caM -g 64M

==> /tmp/nfs3-iozone-2.6.9-42.0.10.ELsmp.out <==
real    19m22.589s
user    0m2.153s
sys     16m42.608s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3269    1% 99      0% 397     0% 175     0% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 191435 97% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 1       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%

==> nfs3-iozone-2.6.9-68.16.EL.jtltest.32smp.out <==
real    25m24.618s
user    0m2.168s
sys     17m38.152s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3168    1% 198     0% 366     0% 358     0% 0       0%
read       write      create     mkdir      symlink    mknod
115022 36% 191433 61% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%

Note that the U4 (2.6.9-42) kernel makes no READ calls at all. The problem seems to be that the current RHEL4 kernel is occasionally invalidating the cache even though it's the only writer. I suspect the problem was introduced with linux-2.6.9-nfs-mmap-cache.patch. The kernel being tested already has the patches for bug 435443, so that patchset is evidently not enough to fix this.
Just for giggles, I built a patched nfs.ko with the patchset from bz 321101 (I had to drop the patches from 435443 so that they'd apply). The results don't seem to be very good:

real    26m7.424s
user    0m2.396s
sys     19m28.232s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 2873    0% 198     0% 199     0% 359     0% 0       0%
read       write      create     mkdir      symlink    mknod
124260 38% 191434 59% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 1       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%

...the time is slightly worse, and it seems to cause even more READ calls. I'd say this problem likely won't be fixed by that patchset.
In an email Larry mentioned:

>> ==> iozone-2.6.9-68.16.EL.jtltest.31smp.out <==
>>
>> real    24m8.703s
>> user    0m2.186s
>> sys     20m7.830s
>
> This is ~85% system time, kernel profiles for both kernels might pinpoint
> the cause of the degradation...

My experience so far with kernel profiling on NFS problems hasn't been that great. Most of the time the samples land in sleeping routines, since we generally tend to be blocked waiting on network I/O, and that doesn't help me pinpoint much. That said, this situation may be different, and I may look at that after I get a bit more baseline info.
RHEL5 results:

iozone test complete.

real    48m50.905s
user    0m1.788s
sys     22m57.532s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1838    0% 198     0% 199     0% 286     0% 0       0%
read       write      create     mkdir      symlink    mknod
60765  23% 191423 75% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 2       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%

...this took almost an hour to run, and there's a large discrepancy between wallclock and system time. Most of the call counts are pretty close to the RHEL4 case, though the number of READ calls is definitely reduced. The client here was a RHEL5 FV xen guest running 2.6.18-84.el5. I've noticed some issues with timers on SMP xen guests, so I may drop this down to a 1-CPU guest and rerun the test. I wouldn't think this test should be CPU bound anyway...
Same test, this time booted with "nohpet", which seems to return sanity to the timers on FV SMP xen guests. Still, things look pretty much the same:

iozone test complete.

Timer resolution is poor. Some small transfers may have
reported inaccurate results. Sizes 512 Kbytes and below.

real    47m8.849s
user    0m1.740s
sys     23m36.606s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1840    0% 198     0% 199     0% 280     0% 0       0%
read       write      create     mkdir      symlink    mknod
54290  21% 191429 76% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 1       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%

...I'll probably return to working on RHEL4 for now. While I don't have an explanation for the longer wallclock time here, it does look like RHEL5 reduces the number of READs in this test. Still, they're not 0 like they were in RHEL4.4. When I get some more time, I plan to run some smaller-scale tests (maybe a single file and fewer I/O sizes) and see if I can replicate the situation where the cache is invalidated and we have to re-read the page data. If we can determine the conditions under which that occurs, we might be able to eliminate at least some cases of it and help performance.
Turning up a bit of nfs_debug. Here's one case from a smaller iozone run:

NFS: 7635 initiated write call (req 0:13/2845693, 32768 bytes @ offset 0)
NFS: 7636 initiated write call (req 0:13/2845693, 32768 bytes @ offset 32768)
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: mtime change on server for file 0:13/2845693
NFS: 7636 nfs_writeback_done (status 32768)
NFS: write (0:13/2845693 4096@32768) OK
NFS: write (0:13/2845693 4096@36864) OK
NFS: write (0:13/2845693 4096@40960) OK
NFS: write (0:13/2845693 4096@45056) OK
NFS: write (0:13/2845693 4096@49152) OK
NFS: write (0:13/2845693 4096@53248) OK
NFS: write (0:13/2845693 4096@57344) OK
NFS: write (0:13/2845693 4096@61440) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: mtime change on server for file 0:13/2845693
NFS: 7635 nfs_writeback_done (status 32768)

I'm pretty sure the nfs_update_inode call here is due to the nfs_post_op_update_inode_force_wcc in nfs3_write_done. It seems like that should have prevented this from occurring. I'll see about instrumenting it to make sure it's actually doing what we expect.

...so we had two cases where we saw an mtime change on the server. The interesting bit is that the second write call completed before the first.
The reason the pre_op_attr fields are not reliable when writing to a file is precisely what you have found. The server is free to process requests in any order it likes and can also send responses in any order it wishes. The client may also process responses in arbitrary orders, based on scheduling. If the client does not process responses in the same order that the server generates them (not sends them, by the way), then the pre_op_attr fields won't match the post_op_attr fields from the previous operation. Thus the file will appear to have been modified by a different entity, when in fact it was not.

Most client implementations ignore the pre_op_attr fields and just plug in the post_op_attr fields straight, without doing cache validation. Some strict consistency is clearly lost, but this is a well-known issue with NFS, and applications have long since taken it into account. This ordering issue can lead to large performance problems, such as you are seeing.
Yep, that seems to be the case. The issue is that I was doing this testing against a netapp, and it actually fills out the pre_op_attr fields. So when the ops complete out of order like this, it causes exactly what you describe. I'm going to rerun this test against my rawhide box with an xfs partition and see if the pre_op_attr fakeup code eliminates the cache invalidations.

> Most client implementations ignore the pre_op_attr fields and just
> plug in the post_op_attr fields straight, without doing cache
> validation. Some strict consistency is clearly lost, but this is a
> well known issue with NFS and applications have long since taken this
> into account.

...the code to fake up pre_op_attrs does not do this in any Linux version, AFAICT:

	if ((fattr->valid & NFS_ATTR_FATTR) != 0 &&
	    (fattr->valid & NFS_ATTR_WCC) == 0) {
		memcpy(&fattr->pre_ctime, &inode->i_ctime, sizeof(fattr->pre_ctime));
		memcpy(&fattr->pre_mtime, &inode->i_mtime, sizeof(fattr->pre_mtime));
		fattr->pre_size = inode->i_size;
		fattr->valid |= NFS_ATTR_WCC;
	}

...when the server returns actual pre_op_attrs, NFS_ATTR_WCC is already set and we don't force fake pre_op_attrs. Are you saying that this is wrong and that we should just toss out the pre_op_attrs here? If so, perhaps we should consider making this tunable somehow? I don't particularly like adding new knobs, but if we're going to bend the correctness rules for performance, we should probably make it something that customers can disable.
Testing against a rawhide box with an XFS partition showed similar results. I *still* get cache invalidations. Most of them though seemed to be occurring due to post_op_attr checks on commit calls. I suppose at that point, the data gets flushed to disk and that changes the mtime. As an experiment, I changed the nfs3 commit calls to use the force_wcc variant and things seem to be chugging along much faster. I see no read calls with that change. I need to ponder whether this is any worse than faking up the pre_op_attrs on a write call. It may be, in which case this isn't possible. Peter, thoughts?
Test results from my commit-fakes-up-preattrs kernel:

iozone test complete.

real    55m33.022s
user    0m2.169s
sys     23m23.886s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3169    1% 198     0% 397     0% 432     0% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 191438 97% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 727     0%

Roughly the same number of write calls, and 0 reads, which indicates that the cache was never invalidated. This is similar to how the 4.4 kernels behave. The question, of course, is whether this causes any cache-consistency regressions; it should probably go upstream first if so. Now that I think about it, though, doing this for COMMIT seems a bit more dangerous. It's also not clear to me why COMMIT calls cause the mtime to change this way -- the mtime should really change on the write, with the commit just making sure that everything is flushed to disk. So there may be a server-side component to this as well.

I may experiment with a patch to do what I think Peter was suggesting in comment #10 -- ignoring pre_op_attrs for all write calls. It would also be interesting to see whether COMMIT calls behave similarly on a netapp server. Do they tend to change the mtime as well?
Yes, I think that we should ignore all pre_op_attr returns from WRITE requests. Without the tunable. COMMIT is not supposed to cause the mtime to update. Is it?
...oh, and as for the timing on the test in comment #13 -- I'm not surprised that it was so slow. The server there is a rawhide xen guest; the interesting part is the number of calls.

> Yes, I think that we should ignore all pre_op_attr returns from WRITE
> requests. Without the tunable.

Ok, I hacked up a quick patch to test this and it's running now. It certainly cuts down the number of READ calls when testing against the netapp, but they're definitely not 0, so we are still getting some cache invalidations even with that:

---------------[snip]----------------
iozone test complete.

real    19m56.682s
user    0m2.165s
sys     17m22.103s

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 3172    1% 198     0% 371     0% 181     0% 0       0%
read       write      create     mkdir      symlink    mknod
7333    3% 191438 94% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 2       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 0       0%
---------------[snip]----------------

...still, it's much improved, and the timing is close to what we see on 4.4 kernels.

> COMMIT is not supposed to cause the mtime to update. Is it?

I wouldn't think so. It's possible that my interpretation is wrong here. Maybe these out-of-order writes are somehow skewing things? For instance:

Write Call 1
Write Call 2
Write Response 2 -- update inode's mtime with faked-up wcc attrs
Write Response 1 -- update mtime again, but it's actually *earlier* than the last response
Commit comes in -- wcc pre_op_attrs have the mtime from Write Response 2's post_op_attrs, and we invalidate the cache

I could probably do a patch that makes a printk pop whenever we overwrite the mtime with one that's older than what's there currently. I'll plan to play with that a little later...
The theory in comment #15 does not seem to be the case. I added a bit of debug code to print the mtimes whenever there was a cache invalidation due to an mtime change, and to dump the stack:

NFS: mtime change on server for file 0:13/2845693 inode=1204874137.937896000 fattr=1204874137.939896000
Call Trace:<ffffffffa01cd6fa>{:nfs:nfs_update_inode+477}
       <ffffffffa01ce804>{:nfs:__nfs_revalidate_inode+448}
       <ffffffff80136214>{autoremove_wake_function+0}
       <ffffffffa01cb9b4>{:nfs:nfs_file_flush+235}
       <ffffffff8017aefa>{filp_close+68}
       <ffffffff8017afa6>{sys_close+130}
       <ffffffff801102aa>{system_call+126}

NFS: mtime change on server for file 0:13/2845693 inode=1204874142.76232000 fattr=1204874142.78232000
Call Trace:<ffffffffa01cd6fa>{:nfs:nfs_update_inode+477}
       <ffffffffa01ce804>{:nfs:__nfs_revalidate_inode+448}
       <ffffffff80136214>{autoremove_wake_function+0}
       <ffffffffa01cb9b4>{:nfs:nfs_file_flush+235}
       <ffffffff8017aefa>{filp_close+68}
       <ffffffff8017afa6>{sys_close+130}
       <ffffffff801102aa>{system_call+126}

...the fattr mtimes are consistently later than the inode mtimes. I'm not certain these are from commit calls, but I suspect they are, since the stack comes from a close() syscall. This suggests there are mtime changes after the last write reply: either the commit itself is causing an mtime change, or the server is returning the last write response with a post_op_attr mtime that doesn't reflect its result. The other possibility is that some other call that doesn't return post_op_attrs is changing the mtime, but I can't think of what that would be in the context of an iozone run.

Whatever this is, it doesn't seem to be a netapp-specific problem. Running this against a linux server gives similar results.
Actually, I wasn't thinking about this correctly last night -- the above experiment is meaningless. This morning I did a patch to test whether the out-of-order writes were skewing things: if we had faked up the pre_op_attr mtime, and the post_op_attr mtime was before the inode mtime, I'd overwrite fattr->mtime with the current inode mtime and do a printk. This should prevent the mtime from ever going backward due to post-op attrs from our own writes. If the theory about out-of-order write replies were correct, I should have seen some of these printks; I never did, so that doesn't seem to be the probable cause.

On another note: Peter, what do you think the chances are of getting the behavior change you suggest in comment #14 upstream? Do you think there's a good enough argument for doing this even though it's against the letter of the RFC?
Created attachment 297190 [details]
binary capture from iozone test against netapp

This capture is from an iozone test against a netapp. The client kernel has the patch that makes it ignore pre_op_attrs on write replies. We're still getting some cache invalidations, so I looked through this capture for READ calls and had a look at what was happening on the wire just before them.

There are a couple of instances -- frames 2337 and 2408 have WRITE calls, but wireshark doesn't seem to be able to find their replies. A little while after those calls, we see some GETATTR calls. The mtime in the GETATTR reply in frame 2341 is later than the mtime in the last write reply we received (in frame 2306). Just after these calls we see some packet loss:

2339 3.040324 10.10.56.139 10.11.231.236 RPC [TCP ACKed lost segment] [TCP Previous segment lost] Continuation

...I don't see any COMMIT calls happening here, so I don't think they're involved after all. My guess is that packet loss on write requests is causing the occasional cache invalidation, since it makes us miss a post_op_attr mtime update.
Created attachment 297191 [details]
binary capture from iozone test against netapp
Of course, this brings us to the question: why aren't we seeing retransmissions when we get no response to the write? I canceled this last run after seeing the READ calls. Afterward, nfsstat shows:

Client rpc stats:
calls      retrans    authrefrsh
1331       0          0

...shouldn't the WRITE calls with no real responses have been retransmitted? Or am I putting too much stock in wireshark's NFS decoder?
Packet loss? With TCP? Visible to an application? That doesn't seem likely. While wireshark may be showing lost packets, and indeed they may be being lost, TCP will take care of retransmitting them. Such things will not be visible to the application unless an error is also returned from TCP.

With regard to RFC 1813, having written the specification and having designed the weak cache consistency part of the protocol, I would say that it is unfortunate that that part of the protocol didn't work out as well as I would have liked. It works very well for single-threaded access like directory operations, but doesn't work well at all with multi-threaded access like writing from multiple contexts. This can be made to work, by doing something like sorting responses based on the pre_op_attr mtimes and then applying the wcc_data information, but it is not clear that this wouldn't cause a performance degradation. It may be worth trying, though. Done right, we could get both the consistency and the performance.

The simple solution, though, is to ignore the pre_op_attr fields if they are returned and just use the post_op_attr fields to update the attributes without doing any cache validation. It would be good to ensure that time does not go backwards, though.

Acceptance upstream -- I suspect that Trond could be convinced, but it may or may not take some work.
I didn't mean to say that they were being lost for good. I meant that packet loss is delaying the write responses and allowing other calls to race in after the writes have actually completed on the server. So the GETATTR call sees the mtime change that happened as a result of the write whose response was delayed; we're not forcing the wcc updates on the GETATTR call, so we end up invalidating the cache. I do see quite a few GETATTR calls in this capture that may not be necessary. Reducing their number might help alleviate this, but that's really a separate problem...

> This can be made to work, by doing something like sorting responses
> based on the pre_op_attr mtimes and then applying the wcc_data information,
> but it not clear that this would not cause a performance degradation.
> It may be worth trying though. If done right, we could get the
> consistency and the performance.

This sounds tricky and would need some careful design work. I'm not sure I understand exactly what it would entail...

> The simple solution, though, is to ignore the pre_op_attr fields if
> they are returned and just use the post_op_attr fields to update the
> attributes without doing any cache validation. It would be good to
> ensure that time does not go backwards though.
>
> Acceptance upstream -- I suspect that Trond could be convinced, but
> it may or may not take some work.

That's probably the best place to start. I'll plan to write this up with a patch in the near future and send it to the list. I'm not comfortable making this change to RHEL without it going upstream first. If nothing else, we should at least be able to start some discussion on the problem there. Thanks!
Do we know what generated the GETATTR? Cache validation or something else? We can't safely do a GETATTR for cache validation while there are outstanding operations that modify the file. The returned attributes may show changes to the file before we are ready to correctly handle them.
It looks like it's close-to-open cache consistency. A bit of debug info with a dump_stack() added to __nfs_revalidate_inode:

NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x27)
NFS: 54365 nfs_writeback_done (status 32768)
NFS: write (0:13/2845693 4096@0) OK
NFS: write (0:13/2845693 4096@4096) OK
NFS: write (0:13/2845693 4096@8192) OK
NFS: write (0:13/2845693 4096@12288) OK
NFS: write (0:13/2845693 4096@16384) OK
NFS: write (0:13/2845693 4096@20480) OK
NFS: write (0:13/2845693 4096@24576) OK
NFS: write (0:13/2845693 4096@28672) OK
NFS: revalidating (0:13/2845693)
Call Trace:<ffffffffa01cda9a>{:nfs:__nfs_revalidate_inode+85}
       <ffffffff80136214>{autoremove_wake_function+0}
       <ffffffffa01d5916>{:nfs:nfs_flush_inode+183}
       <ffffffffa01d607a>{:nfs:nfs_commit_inode+136}
       <ffffffffa01cb9b4>{:nfs:nfs_file_flush+235}
       <ffffffff8017aefa>{filp_close+68}
       <ffffffff8017afa6>{sys_close+130}
       <ffffffff801102aa>{system_call+126}
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/2845693 ct=1 info=0x6)
NFS: mtime change on server for file 0:13/2845693
NFS: (0:13/2845693) data cache invalidated

...the nfs_update_inode call done from nfs_writeback_done did not invalidate the cache, but the getattr immediately following it *did*. I think that getattr was issued from nfs_file_flush(). I need to add a bit more instrumentation to dump out info about the mtime when we decide to invalidate the cache, and when it's updated as a result of wcc updates. Maybe that will make this clearer...
Did a debug patch that instrumented the mtime handling in nfs_update_inode and nfs_wcc_update_inode:

NFS: 54573 initiated write call (req 0:13/2845693, 12288 bytes @ offset 0)
NFS: 54574 initiated write call (req 0:13/2845693, 4096 bytes @ offset 16384)
NFS: 54575 initiated write call (req 0:13/2845693, 4096 bytes @ offset 24576)
NFS: 54576 initiated write call (req 0:13/2845693, 20480 bytes @ offset 32768)
NFS: 54577 initiated write call (req 0:13/2845693, 4096 bytes @ offset 57344)
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938324.267327000 new=1204938325.225912000
NFS: 54573 nfs_writeback_done (status 12288)
NFS: write (0:13/2845693 4096@0) OK
NFS: write (0:13/2845693 4096@4096) OK
NFS: write (0:13/2845693 4096@8192) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.225912000 new=1204938325.233912000
NFS: 54574 nfs_writeback_done (status 4096)
NFS: write (0:13/2845693 4096@16384) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.233912000 new=1204938325.240912000
NFS: 54575 nfs_writeback_done (status 4096)
NFS: write (0:13/2845693 4096@24576) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.240912000 new=1204938325.332912000
NFS: 54577 nfs_writeback_done (status 4096)
NFS: write (0:13/2845693 4096@57344) OK
NFS: nfs_update_inode(0:13/2845693 ct=2 info=0x7)
NFS: updating inode mtime old=1204938325.332912000 new=1204938325.330912000

^^^ the mtime is being reset to the past here.
NFS: 54576 nfs_writeback_done (status 20480)
NFS: write (0:13/2845693 4096@32768) OK
NFS: write (0:13/2845693 4096@36864) OK
NFS: write (0:13/2845693 4096@40960) OK
NFS: write (0:13/2845693 4096@45056) OK
NFS: write (0:13/2845693 4096@49152) OK
NFS: revalidating (0:13/2845693)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/2845693 ct=1 info=0x6)
NFS: mtime change on server for file 0:13/2845693 inode=1204938325.330912000 fattr=1204938325.332912000

^^^ so when the getattr is done, it returns the *real* mtime, which no longer matches the stale one our cache was reset to, and the cache is invalidated. I'm not sure why my earlier test patch didn't catch this -- it may have had a bug. I have an idea about how to fix this; I'll test it and see if it helps.
Created attachment 297250 [details]
patch 1 -- always fake pre_op_attrs

This is the patch suggested by Peter. It makes the code that fakes up pre_op_attrs apply even when the server has sent us valid pre_op_attrs.
Created attachment 297251 [details]
patch 2 -- add timespec_before helper inline function

A simple helper function to detect when one timespec is before another.
Created attachment 297252 [details]
patch 3 -- don't allow faked wcc attrs to do backward mtime updates

This patch seems to fix the remaining cache invalidations that I've seen. When we're faking pre_op_attrs, we also don't allow the code to overwrite the inode's mtime with an fattr mtime that's in the past; instead we overwrite the fattr's mtime with the inode's mtime to make sure we avoid invalidating the cache. I think this is a reasonable fix, but I haven't yet thought through all of the implications. It also needs regression testing.
With these 3 proposed patches, the reproducer shows 0 reads, which matches the behavior of the 4.4 kernels. None of these patches is upstream, and getting them in may be tough -- we're technically violating the RFC, though with the author's blessing ;-). I need to do a bit more testing; if it works OK, I'll plan to roll this up as a set and propose it upstream in the near future.
Testing on rawhide this morning in preparation for sending these upstream. Here are results on a kernel without any of these patches:

iozone test complete.

real    46m6.959s
user    0m1.509s
sys     23m40.880s

Client rpc stats:
calls      retrans    authrefrsh
281849     0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1867    0% 198     0% 199     0% 319     0% 0       0%
read       write      create     mkdir      symlink    mknod
83232  29% 195633 69% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 2       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%

...similar results to what we see on RHEL4 without any extra patches.
Results with the 2 extra patches ported for upstream. Upstream already has a timespec_compare function, so I'll probably respin the RHEL4 set to add the same function and change the patch here to use it instead.

real    33m57.375s
user    0m1.471s
sys     12m59.619s

Client rpc stats:
calls      retrans    authrefrsh
198642     0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1802    0% 198     0% 201     0% 206     0% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 195836 98% 198     0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
198     0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 0       0%
Both patches posted upstream to linux-nfs list and sent to Trond directly.
Changing tagline to something more descriptive
The entire approach here has been pretty thoroughly shot down upstream. One problem is that it only helps with cache invalidation due to attributes returned on write calls; if a getattr or similar call overlaps with a write, you can still end up unnecessarily invalidating the cache. We also need to make sure that when we ignore a set of attributes like this, we set NFS_INO_INVALID_ATTR to indicate that the attrcache may not be right.

After talking this over with Peter at Connectathon, I'm going to try a different approach: track whether a call that returned attributes overlapped with a write. If so, we'll use those attributes to update the attrcache, but won't invalidate the data.
Created attachment 309488 [details]
patch1 - NFS: don't use post_op_attrs from WRITE replies for cache revalidation
Created attachment 309489 [details]
patch2 - NFS: track number of writes on the wire and when last one started/ended
Created attachment 309490 [details]
patch3 - NFS: change nfs_refresh_inode and __nfs_revalidate_inode to detect overlapping writes
New (very rough) cut of an upstream patchset. This should really be considered a proof of concept or RFC rather than an actual patch submission. The idea is to track three new pieces of information about write calls:

1) the number of writes outstanding on the wire
2) when the last write call was sent
3) when the last write reply was received

With this we should be able to detect whether a call that returned attributes overlapped with a write. I've done some very basic testing with this. With a straight, single-threaded iozone test, it eliminates cache invalidations. It also seems to eliminate cache invalidations when I run the mmap-repro.c test for bug 432974 (previously I'd often see cache invalidations from the CTO getattr calls with that reproducer).

This is probably not without races, though. It seems easily possible for a write to start after a getattr reply was received but before we could process it; with this logic we wouldn't invalidate the cache there when we probably should. So the logic in nfs_overlapping_write may need some further tweaking. Getting this wrong will probably mean some hard-to-detect and harder-to-track-down data corruption, so we need to be very careful here.
Did an updated patchset that adds this macro to jiffies.h and uses it in nfs_overlapping_write():

+#define time_between(a,b,c) \
+	(time_after(a,b) && \
+	 time_before(a,c))

...this makes that function return false on the boundary conditions. The results aren't quite as good. With an iozone test:

read       write
240     0% 194938 58%

...so there were some cache invalidations involved. On a short mmap-repro test (a 5-10 minute run), it still seems to pretty much completely eliminate cache invalidations. Still, this is a substantial improvement over the unpatched case, and it should be a bit safer for the edge cases.
Forgot to post the timing from the modified patch:

real    24m38.818s
user    0m0.993s
sys     8m13.286s

...not quite as fast as with the first patch, but still quite a bit better than the unpatched case.
Created attachment 309620 [details] 5 patch set Newer set of 5 concatenated patches. This has a more "mature" implementation of nfs_update_inode_attributes, makes it so that we do not match on edge cases in nfs_overlapping_write() and also removes the old nfs_post_op_update_inode_force_wcc() function since it's no longer used.
Results with this seem to be pretty good. I see a substantial performance gain on a straight iozone test, and no cache invalidations with the mmap-repro and mutex-lt3 reproducers.

To do: what this really needs is testing with multiple machines updating the same files without fcntl locking; that's the most likely way to tickle any bugs this may cause. I'll need to ponder how best to test that. Also, I'm not sure I have all of the logic correct in nfs_update_inode_attributes -- anyone looking over this should review it carefully. Finally, there's probably some opportunity for code consolidation here: maybe have nfs_update_inode call nfs_update_inode_attributes and then do the other revalidation logic. I'll probably try that before submitting upstream.
Updating PM score.
*** This bug has been marked as a duplicate of bug 427385 ***