Bug 137830
| Field | Value |
|---|---|
| Summary | worktodo does not support NFS aio |
| Product | Red Hat Enterprise Linux 3 |
| Component | kernel |
| Version | 3.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Chuck Lever <cel> |
| Assignee | Steve Dickson <steved> |
| QA Contact | Brian Brock <bbrock> |
| CC | bjohnson, greg.marsden, jmoyer, peterm, petrides, riel, steved, zach.brown |
| Doc Type | Bug Fix |
| Bug Blocks | 132991 |
| Last Closed | 2005-05-18 13:28:23 UTC |
Description (Chuck Lever, 2004-11-01 21:13:36 UTC):
Created attachment 106131 [details]: OraSim 4.2 job file to generate aio on NFS

This is important for making NFS AIO reliable, so it is a priority for us.
yes, i know OraSim 4.2 is old. you should be able to use any of the OraSim-6.1 sample job files that use the async drivers to reproduce this problem. the issue shows up with the following message (from memory):

`FATAL: Async I/O queue still jammed after 10000 milliseconds!`
ok, i've found a much better test case that tickles precisely this problem:

1. mount an NFS file system on a GbE-attached client.
2. use iozone to create a file that is about as large as local RAM, like this: `iozone -s 1g -i 0 -w` (where `1g` is replaced with the memory size of your client).
3. now run: `iozone -r 32 -i 1 -k 32`

this will trigger calls to the wtd stuff, which will vainly try to run the disk queue to flush iozone's requests.
Created attachment 106842 [details]: NFS support for worktodo implementation

here's a patch that addresses this bug. please review and comment.
> + int (*sync_page_delayed)(struct page *);

An address space op for doing the async variant of sync_page that aio is doing by hand (based on invalid assumptions) certainly seems reasonable. Can we get a big fat comment about what it does?

> +static inline int sync_page_delayed(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (mapping && mapping->a_ops && mapping->a_ops->sync_page_delayed)
> +		return mapping->a_ops->sync_page_delayed(page);
> +	schedule_task(&run_disk_tq);
> +	return 0;
> +}

There is space/tab damage here. I don't think it should always schedule the run_disk_tq, either. Instead the patch should just add a block_sync_page_delayed() which pairs with block_sync_page(). Then add it to the file systems we care about (ext3?).

> +static inline int sync_page(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> +		return mapping->a_ops->sync_page(page);
> +	run_task_queue(&tq_disk);
> +	return 0;
> +}

(Tab/space again.) Why not export the sync_page() from mm/filemap.c? Another sync_page() that unconditionally runs tq_disk is confusing.

tab damage now fixed in my version of the patch. sorry about that. i considered a block_sync_page_delayed, but thought that might be a little too much churn for a prototype. i can add it if people think this whole approach is a good idea. sync_page and sync_page_delayed are both local helper functions. especially if we take the block_sync_page_delayed approach, i guess neither of these helpers should explicitly run the disk queue.
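For reference, a minimal sketch of the block_sync_page_delayed() pairing suggested in the review above, assuming the 2.4-era task-queue APIs (tq_struct, schedule_task(), run_task_queue(), tq_disk). The helper and tq_struct names are illustrative; this is not code from any of the attached patches:

```c
#include <linux/mm.h>
#include <linux/blkdev.h>
#include <linux/tqueue.h>

/* deferred worker: actually run the disk task queue, from keventd */
static void wtd_run_tq_disk(void *unused)
{
	run_task_queue(&tq_disk);
}

/* hypothetical static work item handed to schedule_task() */
static struct tq_struct wtd_disk_tq = {
	.routine = wtd_run_tq_disk,
};

/* pairs with 2.4's block_sync_page(), which runs tq_disk inline;
 * here the queue run is deferred, so this is safe from contexts
 * (like aio retry paths) that must not unplug devices directly */
int block_sync_page_delayed(struct page *page)
{
	schedule_task(&wtd_disk_tq);
	return 0;
}
```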
Created attachment 106991 [details]: second revision of sync_page support in worktodo
this version addresses earlier comments, and fixes a race condition that
results in nfs_flushd not running often enough.
it appears that this patch by itself is enough to make NFS aio usable and
performant (the patch for 139582 may not be necessary, and in fact may result
in worse performance in some cases). i've tested with OraSim.
comments and review please.
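To make the NFS side of the approach concrete, here is a hypothetical sketch of what an NFS a_ops->sync_page_delayed could look like. The nfs_wake_flushd() helper is mentioned below in this bug, but its signature and this wiring are assumptions, not the attached patch:

```c
#include <linux/mm.h>

/* assumed helper from the NFS client; the name appears in the
 * discussion below, but the signature here is a guess */
extern void nfs_wake_flushd(void);

/* NFS pages are not backed by a block device, so running tq_disk does
 * nothing useful for them; wake the NFS flush daemon instead so queued
 * writes actually get pushed to the server */
static int nfs_sync_page_delayed(struct page *page)
{
	nfs_wake_flushd();
	return 0;
}
```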
Created attachment 110197 [details]: KABI friendlier patch

The patch proposed in comment #8 has major KABI issues, and it appears the entire patch may not be needed to make AIO work better over NFS. From my testing, using the two iozone commands Chuck suggested, it appears only the addition of the __sync_page() call to wtd_lock_page() is really needed to make things work better. Plus, there isn't any KABI issue with that part of the patch. Working with our AIO guy, I understand why the new sync_page_delayed() is a desired addition, but it's a KABI infraction, plus it's not really clear it's needed to explicitly address the bug, which is: make NFS work better with AIO. So the patch I'm proposing only introduces the __sync_page() call. Please test this to see if it truly fixes the problem.

what exactly is the KABI issue with the sync_page_delayed() part of the patch? i thought that part would still be acceptable. at this point, we need OraSim and real Oracle database testing. i will leave that to Greg. iozone testing is not enough here; we've found that Oracle still has some problems running with NFS aio, especially without sync_page_delayed.

The patch modifies the address_space_operations structure, which is exported. That most certainly breaks the kernel ABI. Instead of adding the sync_page_delayed function pointer, we could hack up __wtd_lock_page_waiter to check if a page is backed by an NFS file system, and call nfs_wake_flushd from there if so. It's an ugly hack, but it's arguably no worse than assuming every page is backed by a block device. What do you think?

Chuck, I agree with you that Greg will need to run OraSim to see if this new patch addresses the issues that he was seeing. I feel confident that the patch should help somewhat, since it does give AIO a direct path to NFS that it didn't have before... but only time (and testing) will tell.

I'll give Steve's patch a spin and see what happens...

re: comment #15: the sync_page part of the patch will help, but i don't feel that, by itself, it will be an adequate solution. OraSim appeared to run faster and more reliably with sync_page and sync_page_delayed; our TPC-C tests still don't run reliably over NFS aio even with the complete version of the patch.

re: comment #14: i couldn't find a way to check whether a file is an NFS file without using symbols that would have to be loaded with nfs.ko. if no NFS file systems are mounted, and nfs.ko is not loaded, how will worktodo find these symbols? i'm probably ignorant about how this would work. thanks for offline comments on how to help my patches be more acceptable to the KABI gods.

Had a chance to try out the patch from steved in attachment 110197 [details], and the OAST stress tests pass with ~25% performance improvement. I need to rerun this a couple of times to rule out network traffic and filer load from the equation, but it looks good.
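As an illustration of the comment #14 idea and the module-symbol objection raised against it, here is a hypothetical sketch that identifies NFS-backed pages by superblock magic and reaches the flush daemon through a registered callback. None of these names or this mechanism come from the actual patches; they only show one way the symbol-visibility problem could be sidestepped:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/nfs_fs.h>	/* NFS_SUPER_MAGIC */

/* hypothetical hook: left NULL until the NFS module loads and
 * registers its flush-daemon kicker, which avoids an unresolved
 * reference to nfs_wake_flushd() from core aio code */
static void (*nfs_flushd_kick)(void);

static void wtd_maybe_kick_nfs(struct page *page)
{
	struct address_space *mapping = page->mapping;

	/* identify NFS-backed pages by superblock magic, not symbols */
	if (mapping && mapping->host &&
	    mapping->host->i_sb->s_magic == NFS_SUPER_MAGIC &&
	    nfs_flushd_kick)
		nfs_flushd_kick();
}
```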
Steve, you mentioned yesterday that a revised version of this patch is slated for inclusion in U5?

A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.14.EL). Note that the fix in -27.14.EL has been slightly modified in -27.15.EL, although the functionality of the fix remains the same.

Can we get access to this latest patch, or the latest kernel with this patch? We want to do more testing. Thanks.

Van, the partner beta period for U5 is scheduled to begin Wednesday of this week. The kernel version is 2.4.21-29.EL (which has what you want).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html