Opening this bug to determine the feasibility of backporting this patch to RHEL5.
The first thing we need to do, though, is come up with a reproducer that
we can use to verify that the patch is working. Doing a simple:
strace dd if=/dev/zero of=/nfs/mount bs=1M count=<something big>
should eventually show the write calls erroring out. The problem is that you
need some way to make sure that the writes start getting flushed to the server
before all of the writes have completed.
Possibly a C program that:
1) writes out more data than is on the partition
2) does an fsync() and verifies that it gets an error
3) then tries to write some more and sees if the writes error out
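A minimal sketch of the three steps above, for illustration only. This is not the attached reproducer; the names probe_write_errors, write_zeros and struct probe_result are hypothetical, and the caller is assumed to pick a fill size larger than the free space on the export.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Outcome of each phase: 0 means no error, otherwise the errno seen. */
struct probe_result {
    int fill_err;   /* first error from the initial (overfilling) writes */
    int fsync_err;  /* error from fsync() */
    int later_err;  /* first error from writes issued after the fsync() */
};

/* Write `total` zero bytes to fd in chunks of up to 1 MiB.
 * Returns 0 on success or the errno of the first failing write(). */
static int write_zeros(int fd, size_t total)
{
    static const char buf[1 << 20];

    while (total > 0) {
        size_t want = total < sizeof(buf) ? total : sizeof(buf);
        ssize_t n = write(fd, buf, want);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;
        }
        if (n == 0)
            return EIO; /* no progress; report it rather than spin */
        total -= (size_t)n;
    }
    return 0;
}

/* Step 1: write fill_bytes. Step 2: fsync(). Step 3: write extra_bytes.
 * Returns -1 if the file cannot be opened, 0 otherwise (see *res). */
static int probe_write_errors(const char *path, size_t fill_bytes,
                              size_t extra_bytes, struct probe_result *res)
{
    int fd;

    assert(path != NULL && res != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    res->fill_err = write_zeros(fd, fill_bytes);
    res->fsync_err = (fsync(fd) < 0) ? errno : 0;
    res->later_err = write_zeros(fd, extra_bytes);

    close(fd);
    return 0;
}
```

With fill_bytes larger than the free space on the server, the interesting fields are fsync_err and later_err: the test is whether both come back non-zero.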
Created attachment 298747
I think this should work as a reproducer. Might need some cleaning up for RHTS,
but it's probably suitable for that too.
It definitely fails on RHEL5, and should pass on rawhide (testing there next).
The test passes on Fedora 8's current 2.6.24 kernel, so the reproducer seems to
be good. I'll start having a look at the patch soon...
The test program here will not test whether synchronous writes are disabled when
writes stop failing. I may want to consider modifying it for that to make sure
that we test it correctly.
Created attachment 299375
Respun reproducer. Check also that async writes are reenabled when writes start
I've got a start on this patch, and it seems to be basically doing the right
thing, but I'm noticing a behavioral difference between RHEL5 and rawhide that's
affecting the situation.
On rawhide, when we call fsync() after all of the writes, there almost always
seem to be a few writes that are still outstanding. This makes nfs_do_fsync
exit with the error flag set, so that the next write is forced to be synchronous.
On RHEL5, we never seem to hit the situation where there are still outstanding
writes when we call fsync(). So when nfs_do_fsync calls nfs_wb_all, it exits
without an error and the flag is cleared.
The difference is probably in the nfs_writepages implementation:
do_fsync() calls filemap_fdatawrite() itself before calling the filesystem's
fsync operation. On RHEL5 this generally seems to flush all of the dirty
pages to the server before nfs_fsync ever gets called.
We may not be able to do this without some changes to the nfs writeback logic.
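To make the flag behavior described above easier to follow, here is a toy userspace model of it. This is not kernel code: struct toy_ctx and the toy_* functions are hypothetical names, and the logic is only an assumption based on the behavior observed in this comment (the flag is set when a background write fails, and after an fsync it remains set only if writes that were still outstanding failed during the flush).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the per-open-file state; not the kernel structure. */
struct toy_ctx {
    bool error_flag;  /* set when a background write has failed */
    int  saved_error; /* first error not yet reported to the caller */
};

/* Completion of a background write: status < 0 means it failed. */
static void toy_writeback_done(struct toy_ctx *ctx, int status)
{
    assert(ctx != NULL);
    if (status < 0) {
        ctx->error_flag = true;
        if (ctx->saved_error == 0)
            ctx->saved_error = status;
    }
}

/* Write path: should the next write be issued synchronously? */
static bool toy_write_must_sync(const struct toy_ctx *ctx)
{
    assert(ctx != NULL);
    return ctx->error_flag;
}

/* fsync: clear the flag, then flush whatever is still outstanding.
 * outstanding_status is 0 if nothing was left to flush (or it all
 * succeeded) and negative if an outstanding write failed. */
static int toy_fsync(struct toy_ctx *ctx, int outstanding_status)
{
    bool had_error;
    int err;

    assert(ctx != NULL);
    had_error = ctx->error_flag;
    ctx->error_flag = false;
    toy_writeback_done(ctx, outstanding_status);

    if (!had_error && !ctx->error_flag)
        return 0;

    err = ctx->saved_error;
    ctx->saved_error = 0;
    return err;
}
```

In this model, toy_fsync(&ctx, 0) after an earlier failure still returns the error but leaves the flag clear, so the next write goes back to being asynchronous. That is the RHEL5 case. toy_fsync(&ctx, -ENOSPC) leaves the flag set, which matches what we usually see on rawhide.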
Created attachment 299887
patch -- fall back to synchronous writes when a background write errors out
Here's the current patch that I have. It doesn't currently work against the
reproducer, seemingly due to the differences in nfs_writepages in RHEL5 and
I'll also note that this reproducer occasionally fails on rawhide too. If the
timing is such that the fsync call doesn't actually have to flush any writes, we
end up with the same effect.
We could still consider this for RHEL5, but we'd have to note that the boundary
case near an fsync() syscall may be different on RHEL5.
I've incorporated the patch in comment #9 into the RHEL5 test kernels on my
people page if the customer (or anyone else) wishes to test them:
...it does *not* pass the test program I have for this due to changes in how
nfs_writepages behaves. I'll also note, however, that the reproducer
occasionally fails on rawhide as well whenever you do an fsync and all writes
have already been flushed.
I think that I can write a new reproducer that avoids fsync calls and will
demonstrate that the patch works, but it may be a little while before I can get
Created attachment 302444
This is a simpler reproducer that does work on RHEL5, though at the expense of
allowing the program to potentially dirty a lot more pages on a kernel without
To use it:
1) create a fairly small filesystem (I used a 512M ext3 filesystem)
2) write a large file to this filesystem to get it close to capacity, e.g.:
# dd if=/dev/zero of=/export/filler bs=1M count=500
3) export the filesystem and mount it on the client. This test seems to work
fine over localhost as well.
4) compile this program and run it, with the filename and the number of bytes
to write. It's probably best to have it attempt to write a file that is close
to or greater than the size of the physical RAM on the box. That should ensure
that the kernel attempts to write out some of the dirty pages before the
My machine has 512M of RAM, so I did:
# ./write-simple /mnt/scratch/testfile 536870912
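For reference, a rough sketch of what an fsync-free reproducer of this shape could look like. This is not the contents of attachment 302444; write_until_error is a hypothetical name, and the only behavior assumed is "write the requested number of bytes, never call fsync(), and report where write() first fails".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `total` zero bytes to `path` without ever calling fsync().
 * Returns the byte offset at which write() first failed, or -1 if no
 * write failed. An open() failure is reported as a failure at offset 0.
 * *write_err and *close_err receive errno values (0 means no error). */
static long long write_until_error(const char *path, long long total,
                                   int *write_err, int *close_err)
{
    static const char buf[1 << 20];
    long long done = 0;
    long long failed_at = -1;
    int fd;

    assert(path != NULL && write_err != NULL && close_err != NULL);
    *write_err = 0;
    *close_err = 0;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        *write_err = errno;
        return 0;
    }

    while (done < total) {
        long long left = total - done;
        size_t want = left < (long long)sizeof(buf) ? (size_t)left
                                                    : sizeof(buf);
        ssize_t n = write(fd, buf, want);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            *write_err = errno;
            failed_at = done;
            break;
        }
        if (n == 0) {
            *write_err = EIO; /* no progress; report it rather than spin */
            failed_at = done;
            break;
        }
        done += n;
    }

    /* Without the patch, the error may only show up here, at close(). */
    if (close(fd) < 0)
        *close_err = errno;

    return failed_at;
}
```

The thing to look at is how far the returned offset is from the requested size. A kernel that falls back to synchronous writes should fail well short of it, while one that doesn't may accept every write and only report an error at close().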
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 472484 has been marked as a duplicate of this bug. ***
As Peter says, this problem is completely a kernel issue. No userspace nfs components should make any difference here. How exactly were they testing this?
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.