Opening this bug to determine the feasibility of backporting this patch to RHEL5:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7b159fc18d417980f57aef64cab3417ee6af70f8;hp=34901f70d119d88126e7390351b8c780646628e1
The first thing we need to do here, though, is come up with a reproducer that we can use to verify that this is working. Doing a simple:

    strace dd if=/dev/zero of=/nfs/mount bs=1M count=<something big>

should eventually show the write calls erroring out. The problem is that you need some way to make sure that the writes start getting flushed to the server before all of the writes have completed. Possibly a C program that:

- writes out more data than will fit on the partition
- does an fsync() and verifies that it gets an error
- then tries to write some more and sees if the writes error out
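Something along these lines might work as the C reproducer (a quick sketch only, not the attachment; the write count and sizes are placeholders):

    /* repro.c -- sketch of the proposed reproducer */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[4096];
        long i;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file on nearly-full NFS mount>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 0, sizeof(buf));

        /* dirty more pages than the server has space for (1G here) */
        for (i = 0; i < 262144; i++)
            if (write(fd, buf, sizeof(buf)) < 0)
                break;

        /* flushing should report the ENOSPC from the server */
        if (fsync(fd) < 0)
            perror("fsync (expected to fail)");

        /* with the patch, the next write should go synchronous and
           fail immediately instead of just dirtying another page */
        if (write(fd, buf, sizeof(buf)) < 0)
            perror("write after failed fsync (fails on a patched kernel)");
        else
            printf("write after failed fsync succeeded (buffered)\n");

        close(fd);
        return 0;
    }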
Created attachment 298747 [details]
reproducer program

I think this should work as a reproducer. Might need some cleaning up for RHTS, but it's probably suitable for that too. It definitely fails on RHEL5, and should pass on rawhide (testing there next).
Test passes on Fedora 8's current 2.6.24 kernel, so the reproducer seems to be good. I'll start having a look at the patch soon...
The test program here does not check that synchronous writes are disabled again once writes stop failing. I should consider modifying it to cover that, so that we test the transition back to async writes correctly.
Created attachment 299375 [details]
reproducer program

Respun reproducer. It also checks that async writes are re-enabled when writes start succeeding again.
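Conceptually, the new check just frees up space on the server and then verifies that writes start succeeding again (and that we drop back to async writeback). Continuing the earlier sketch (fd and buf as before; the filler path is hypothetical):

    /* free space on the server -- assumes the filler file is visible
       through the mount; path is hypothetical */
    unlink("/mnt/scratch/filler");

    /* with the patch, one successful synchronous write should clear
       the error state, and subsequent writes go back to async */
    if (write(fd, buf, sizeof(buf)) < 0)
        perror("write after freeing space (should now succeed)");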
I've got a start on this patch, and it seems to be basically doing the right thing, but I'm noticing a behavioral difference between RHEL5 and rawhide that affects the situation. On rawhide, when we call fsync() after all of the writes, there almost always seem to be a few writes still outstanding. That makes nfs_do_fsync exit with the error flag set, so the next write forces a sync write. On RHEL5, we never seem to hit the situation where writes are still outstanding when we call fsync(), so when nfs_do_fsync calls nfs_wb_all, it exits without an error and the flag is cleared.
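For reference, the upstream patch tracks the failure in the nfs_open_context; from memory, the core of the logic is roughly as follows (the RHEL5 backport may differ in the details):

    /* a failed background write records the error on the open context: */
    set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
    ctx->error = error;

    /* nfs_do_fsync() then reports the error and clears the flag: */
    have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
    status = nfs_wb_all(inode);
    have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
    if (have_error)
        ret = xchg(&ctx->error, 0);

    /* ...and the write path tests the bit to decide whether to force
       a stable (synchronous) write while the error is outstanding */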
The difference looks like it's probably in the nfs_writepages implementation: do_fsync() calls filemap_fdatawrite() itself before calling the filesystem's fsync operation. On RHEL5, this generally causes all of the dirty pages to be flushed to the server before nfs_fsync ever gets called. We may not be able to make this work without some changes to the NFS writeback logic.
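For context, the VFS do_fsync() in this era looks roughly like this (simplified from memory; locking and error checks omitted):

    static long do_fsync(struct file *file, int datasync)
    {
        struct address_space *mapping = file->f_mapping;
        int ret, err;

        /* writeback of dirty pages is kicked off *before* the
           filesystem's fsync op runs; on RHEL5 this usually pushes
           everything to the server first */
        ret = filemap_fdatawrite(mapping);

        err = file->f_op->fsync(file, file->f_dentry, datasync);
        if (!ret)
            ret = err;

        err = filemap_fdatawait(mapping);
        if (!ret)
            ret = err;
        return ret;
    }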
Created attachment 299887 [details]
patch -- fall back to synchronous writes when a background write errors out

Here's the current patch that I have. It doesn't currently work against the reproducer, seemingly due to the differences in nfs_writepages between RHEL5 and upstream.
I'll also note that this reproducer occasionally fails on rawhide too. If the timing is such that the fsync call doesn't actually have to flush any writes, we end up with the same effect. We could still consider this for RHEL5, but we'd have to note that the boundary case around an fsync() syscall may behave differently on RHEL5.
I've incorporated the patch in comment #9 into the RHEL5 test kernels on my people page if the customer (or anyone else) wishes to test them: http://people.redhat.com/jlayton ...it does *not* pass the test program I have for this due to changes in how nfs_writepages behaves. I'll also note, however, that the reproducer occasionally fails on rawhide as well whenever you do an fsync and all writes have already been flushed. I think that I can write a new reproducer that avoids fsync calls that will demonstrate that the patch works, but it may be a little while before I can get to it.
Created attachment 302444 [details]
reproducer 2

This is a simpler reproducer that does work on RHEL5, though at the expense of allowing the program to potentially dirty a lot more pages on a kernel without this patch...

To use it:

1) create a fairly small filesystem (I used a 512M ext3 filesystem)

2) write a large file to this filesystem to get it close to capacity, i.e.:

    # dd if=/dev/zero of=/export/filler bs=1M count=500

3) export the filesystem and mount it on the client. This test seems to work fine over localhost as well.

4) compile this program and run it with the filename and the number of bytes to write. It's probably best to have it attempt to write a file that is close to or greater than the size of the physical RAM on the box. That should ensure that the kernel attempts to write out some of the dirty pages before the program exits. My machine has 512M, so I did:

    # ./write-simple /mnt/scratch/testfile 536870912
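For reference, write-simple doesn't need to do much more than dirty a lot of pages and exit; a minimal sketch (the actual attachment may differ):

    /* write-simple.c -- sketch; usage: write-simple <file> <bytes> */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[4096];
        long long i, len;
        int fd;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <bytes>\n", argv[0]);
            return 1;
        }
        len = atoll(argv[2]);
        fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 0, sizeof(buf));
        for (i = 0; i < len; i += sizeof(buf)) {
            /* on a patched kernel, these writes turn synchronous once
               background writes hit ENOSPC and so fail quickly; on an
               unpatched kernel they just keep dirtying pages */
            if (write(fd, buf, sizeof(buf)) < 0) {
                perror("write");
                close(fd);
                return 1;
            }
        }
        close(fd);
        return 0;
    }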
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Fixed in kernel-2.6.18-107.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 472484 has been marked as a duplicate of this bug. ***
As Peter says, this problem is entirely a kernel issue. No userspace NFS components should make any difference here. How exactly were they testing this?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html