Bug 438423 - backport patch to RHEL5 to have it flip to synchronous writes when there is a write error
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel   
Version: 5.2
Hardware: All
OS: Linux
Target Milestone: rc
Assignee: Jeff Layton
QA Contact: Martin Jenner
Duplicates: 472484
Depends On:
Blocks: KernelPrio5.3
Reported: 2008-03-20 19:51 UTC by Jeff Layton
Modified: 2009-03-16 09:32 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-01-20 19:56:39 UTC

Attachments
reproducer program (1.27 KB, patch)
2008-03-20 20:40 UTC, Jeff Layton
reproducer program (1.95 KB, text/plain)
2008-03-27 18:19 UTC, Jeff Layton
patch -- fall back to synchronous writes when a background write errors out (6.41 KB, patch)
2008-04-01 13:14 UTC, Jeff Layton
reproducer 2 (1.63 KB, patch)
2008-04-15 12:38 UTC, Jeff Layton

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Comment 1 Jeff Layton 2008-03-20 19:53:07 UTC
Opening this bug to determine the feasibility of backporting this patch to RHEL5.


Comment 2 Jeff Layton 2008-03-20 19:59:20 UTC
The first thing we need to do here, though, is to come up with a reproducer that
we can use to verify that this is working. Doing a simple:

strace dd if=/dev/zero of=/nfs/mount bs=1M count=<something big>

should eventually show the write calls erroring out. The problem is that you
need some way to make sure that the writes start getting flushed to the server
before all of the writes have completed.

Possibly a C program that:

writes out more data than is on the partition
does an fsync() and verifies that it gets an error
then tries to write some more and sees if the writes error out
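
The three steps above could be sketched roughly as follows. This is only an
illustration of the idea, not the attached reproducer: the function names
(write_zeroes, check_sync_fallback), the chunk size and the return-code
convention are all assumptions made for this sketch.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 4096

/* Write `total` bytes of zeroes in CHUNK-sized pieces.
 * Returns 0 if every write() succeeded, else the errno of the first failure. */
int write_zeroes(int fd, size_t total)
{
    static const char buf[CHUNK];

    while (total > 0) {
        size_t want = total < CHUNK ? total : CHUNK;
        ssize_t n = write(fd, buf, want);

        if (n < 0)
            return errno;
        if (n == 0)
            return EIO;        /* no progress; avoid looping forever */
        total -= (size_t)n;
    }
    return 0;
}

/* Hypothetical test driver (not the attached program).
 * Returns  0 if writes issued after the first error also fail,
 *          1 if they still succeed (i.e. they were only buffered),
 *         -1 if no write error could be provoked at all. */
int check_sync_fallback(const char *path, size_t overfill, size_t extra)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int err, later;

    if (fd < 0)
        return -1;

    /* Step 1: dirty more data than the exported partition can hold.
     * With buffered writes these calls may all appear to succeed. */
    err = write_zeroes(fd, overfill);

    /* Step 2: fsync() should report the error if write() did not. */
    if (err == 0 && fsync(fd) < 0)
        err = errno;
    if (err == 0) {
        close(fd);
        return -1;
    }

    /* Step 3: write some more and see whether these writes error out. */
    later = write_zeroes(fd, extra);
    close(fd);
    return later != 0 ? 0 : 1;
}
```

A small main() would call check_sync_fallback() with a path on the NFS mount
and an overfill size larger than the free space on the export.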

Comment 3 Jeff Layton 2008-03-20 20:40:19 UTC
Created attachment 298747 [details]
reproducer program

I think this should work as a reproducer. Might need some cleaning up for RHTS,
but it's probably suitable for that too.

It definitely fails on RHEL5, and should pass on rawhide (testing there next).

Comment 4 Jeff Layton 2008-03-20 20:44:04 UTC
test passes on Fedora 8's current 2.6.24 kernel. The reproducer seems to be
good. I'll start having a look at the patch soon...

Comment 5 Jeff Layton 2008-03-24 18:14:04 UTC
The test program here will not test whether synchronous writes are disabled when
writes stop failing. I may want to consider modifying it for that to make sure
that we test it correctly.

Comment 6 Jeff Layton 2008-03-27 18:19:05 UTC
Created attachment 299375 [details]
reproducer program

Respun reproducer. Check also that async writes are reenabled when writes start

Comment 7 Jeff Layton 2008-04-01 11:46:39 UTC
I've got a start on this patch, and it seems to be basically doing the right
thing, but I'm noticing a behavioral difference between RHEL5 and rawhide that's
affecting the situation.

On rawhide, when we call fsync() after all of the writes there seem to almost
always be a few writes that are still outstanding. This is making nfs_do_fsync
exit with the error flag set, so that the next write forces a sync write.

On RHEL5, we never seem to hit the situation where there are still outstanding
writes when we call fsync(). So when nfs_do_fsync calls nfs_wb_all, it exits
without an error and the flag is cleared.
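
To make the difference concrete, here is a toy user-space model of the
behaviour described above. It is not kernel code: the struct, its fields and
both functions are invented for illustration, and the only assumption taken
from this comment is that a flush with failing outstanding writes leaves the
error flag set, while a flush with nothing left to fail clears it.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_file {
    bool error_flag;   /* set => the next write is forced synchronous */
    int  outstanding;  /* writes not yet flushed to the server */
    bool server_full;  /* server rejects any write it receives */
};

/* Models the flush done on the fsync path: if an outstanding write fails,
 * return -1 and leave the flag set; otherwise clear the flag and return 0. */
int toy_fsync(struct toy_file *f)
{
    bool failed = f->outstanding > 0 && f->server_full;

    f->outstanding = 0;
    f->error_flag = failed;
    return failed ? -1 : 0;
}

/* True if the next write would be issued synchronously. */
bool toy_next_write_is_sync(const struct toy_file *f)
{
    return f->error_flag;
}
```

In this model the rawhide case corresponds to outstanding > 0 at fsync time
(flag stays set), and the RHEL5 case to outstanding == 0 (flag is cleared even
though the server is still full).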

Comment 8 Jeff Layton 2008-04-01 12:31:05 UTC
The difference looks like it's probably in the nfs_writepages implementation:

do_fsync() calls filemap_fdatawrite() itself before calling the filesystem's
fsync operation. On RHEL5 this seems to be generally causing all of the dirty
pages to get flushed to the server before nfs_fsync ever gets called.

We may not be able to do this without some changes to the nfs writeback logic.

Comment 9 Jeff Layton 2008-04-01 13:14:13 UTC
Created attachment 299887 [details]
patch -- fall back to synchronous writes when a background write errors out

Here's the current patch that I have. It doesn't currently work against the
reproducer, seemingly due to the differences in nfs_writepages in RHEL5 and

Comment 10 Jeff Layton 2008-04-01 13:19:10 UTC
I'll also note that this reproducer occasionally fails on rawhide too. If the
timing is such that the fsync call doesn't actually have to flush any writes, we
end up with the same effect.

We could still consider this for RHEL5, but we'd have to note that the boundary
case near an fsync() syscall may be different on RHEL5.

Comment 11 Jeff Layton 2008-04-07 14:04:24 UTC
I've incorporated the patch in comment #9 into the RHEL5 test kernels on my
people page if the customer (or anyone else) wishes to test them:


...it does *not* pass the test program I have for this due to changes in how
nfs_writepages behaves. I'll also note, however, that the reproducer
occasionally fails on rawhide as well whenever you do an fsync and all writes
have already been flushed.

I think that I can write a new reproducer that avoids fsync calls that will
demonstrate that the patch works, but it may be a little while before I can get
to it.

Comment 12 Jeff Layton 2008-04-15 12:38:55 UTC
Created attachment 302444 [details]
reproducer 2

This is a simpler reproducer that does work on RHEL5, though at the expense of
allowing the program to potentially dirty a lot more pages on a kernel without
this patch...

To use it:
1) create a fairly small filesystem (I used a 512M ext3 filesystem)
2) write a large file to this filesystem to get it close to capacity. i.e.:

# dd if=/dev/zero of=/export/filler bs=1M count=500

3) export the filesystem and mount it on the client. This test seems to work
fine over localhost as well.

4) compile this program and run it, with the filename and the number of bytes
to write. It's probably best to have it attempt to write a file that is close
to or greater than the size of the physical RAM on the box. That should ensure
that the kernel attempts to write out some of the dirty pages before the
program exits.

My machine has 512M so I did:

# ./write-simple /mnt/scratch/testfile 536870912
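
For readers without access to the attachment, a program of this shape might
look roughly like the sketch below. It is a hypothetical stand-in, not the
attached write-simple source: the function name, the 64 KB buffer size and the
return convention are assumptions. It writes zeroes with plain write() calls,
never calls fsync(), and reports how many bytes were accepted before the first
error.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for the attached program.
 * Writes up to `nbytes` zero bytes to `path` without calling fsync().
 * Returns the number of bytes accepted before the first error, or -1 if the
 * file could not be opened. *first_errno is 0 when no error was seen. */
long long write_simple(const char *path, long long nbytes, int *first_errno)
{
    static const char buf[65536];
    long long done = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    *first_errno = 0;
    if (fd < 0) {
        *first_errno = errno;
        return -1;
    }

    while (done < nbytes) {
        long long left = nbytes - done;
        size_t want = left < (long long)sizeof(buf) ? (size_t)left : sizeof(buf);
        ssize_t n = write(fd, buf, want);

        if (n <= 0) {
            /* Record the first failure and stop writing. */
            *first_errno = (n < 0) ? errno : EIO;
            break;
        }
        done += n;
    }

    /* close() can also report a deferred write error. */
    if (close(fd) < 0 && *first_errno == 0)
        *first_errno = errno;
    return done;
}
```

The value to compare between kernels is the returned byte count: if writes
start failing once a background write has errored out, the count should stay
well below the requested size, whereas a kernel that keeps buffering would
accept far more data before any error surfaces.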

Comment 14 RHEL Product and Program Management 2008-05-22 11:10:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update release.

Comment 16 Don Zickus 2008-09-03 03:39:04 UTC
in kernel-2.6.18-107.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 20 Jeff Layton 2008-11-21 11:43:18 UTC
*** Bug 472484 has been marked as a duplicate of this bug. ***

Comment 25 Jeff Layton 2008-12-16 22:59:56 UTC
As Peter says, this problem is completely a kernel issue. No userspace nfs components should make any difference here. How exactly were they testing this?

Comment 28 errata-xmlrpc 2009-01-20 19:56:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

