Red Hat Bugzilla – Bug 137830
worktodo does not support NFS aio
Last modified: 2007-11-30 17:07:04 EST
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Description of problem:
The worktodo implementation in RHEL 3.0 kernels does not support NFS
aio. The problem is that logic in mm/wtd.c invokes
run_task_queue(&tq_disk) directly rather than using the
a_ops->sync_page interface. For local block-based file systems, this
is equivalent, but for NFS, sync_page is required because NFS does not
use tq_disk to drive its I/O.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I've been testing aio over NFS using OraSim with one of the sample
job files that sets up async I/O drivers.
Actual Results: Sometimes it works fine; sometimes OraSim reports
that I/O requests have been lost.
Expected Results: OraSim with async I/O on NFS files should always
complete without losing I/O requests.
Using the sync_page interface means that somehow the page that is
being TryLock'd has to be passed to __wtd_lock_page_waiter() so it
can call the correct sync_page ops on the correct page. Right now
__wtd_lock_page_waiter uses a global statically allocated structure
with a NULL data argument to pass work to keventd. To use sync_page,
it will need to be changed to use a dynamic tq_struct for each page
being waited upon.
This is important to making NFS AIO reliable, so it is a priority for us.
Created attachment 106131 [details]
OraSim 4.2 job file to generate aio on NFS
yes, i know OraSim 4.2 is old. you should be able to use any of the OraSim-6.1
sample job files that use the async drivers and reproduce this problem.
the issue shows up with the following message (from memory):
FATAL: Async I/O queue still jammed after 10000 milliseconds!
ok, i've found a much better test case that tickles precisely this bug:
1. mount an NFS file system on a GbE attached client
2. use iozone to create a file that is about as large as local RAM
like this: "iozone -s 1g -i 0 -w" (where 1g is replaced with
the memory size of your client).
3. now run: "iozone -r 32 -i 1 -k 32"
this will trigger calls to the wtd stuff, which will vainly try to
run the disk queue to flush iozone's requests.
Created attachment 106842 [details]
NFS support for worktodo implementation
here's a patch that addresses this bug. please review and comment.
> + int (*sync_page_delayed)(struct page *);
An address space op for doing the async variant of sync_page that aio
is doing by hand (based on invalid assumptions) certainly seems
reasonable. Can we get a big fat comment about what it does?
> +static inline int sync_page_delayed(struct page *page)
> + struct address_space *mapping = page->mapping;
> + if (mapping && mapping->a_ops && mapping->a_ops->sync_page_delayed)
> + return mapping->a_ops->sync_page_delayed(page);
> + schedule_task(&run_disk_tq);
> + return 0;
there is space/tab damage here. I don't think it should always
schedule the run_disk_tq, either. Instead the patch should just add a
block_sync_page_delayed() which pairs with block_sync_page(). Then
add it to the file systems we care about (ext3?).
> +static inline int sync_page(struct page *page)
> + struct address_space *mapping = page->mapping;
> + if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> + return mapping->a_ops->sync_page(page);
> + run_task_queue(&tq_disk);
> + return 0;
(tab/space again.) Why not export the sync_page() from mm/filemap.c?
Another sync_page() that unconditionally runs tq_disk is confusing.
tab damage now fixed in my version of the patch. sorry about that.
i considered a block_sync_page_delayed, but thought that might be a
little too much churn for a prototype. i can add it if people think
this whole approach is a good idea.
sync_page and sync_page_delayed are both local helper functions.
especially if we take the block_sync_page_delayed approach, i guess
neither of these helpers should explicitly run the disk queue.
Created attachment 106991 [details]
second revision of sync_page support in worktodo
this version addresses earlier comments, and fixes a race condition that
results in nfs_flushd not running often enough.
it appears that this patch by itself is enough to make NFS aio usable and
performant (the patch for 139582 may not be necessary, and in fact may result
in worse performance in some cases). i've tested with OraSim.
comments and review please.
Created attachment 110197 [details]
KABI friendlier patch
The patch proposed in comment #8 has major KABI issues,
and it appears the entire patch may not be needed to make AIO
work better over NFS.
From my testing, using the two iozone commands Chuck suggested,
it appears only the addition of the __sync_page() call to wtd_lock_page()
is really needed to make things work better. Plus there is no KABI issue
with that part of the patch.
Working with our AIO guy, I understand why the new sync_page_delayed()
is a desired addition, but it's a KABI infraction, plus it's not really clear
it's needed to explicitly address the bug, which is to make NFS work
better with AIO....
So the patch I'm proposing only introduces the __sync_page() call.
Please test this to see if it truly fixes the problem....
what exactly is the KABI issue with the sync_page_delayed() part of
the patch? i thought that part would still be acceptable.
at this point, we need OraSim and real Oracle database testing. i will
leave that to Greg. iozone testing is not enough here; we've found
that Oracle still has some problems running with NFS aio, especially
The patch modifies the address_space_operations structure which is exported.
That most certainly breaks the kernel ABI.
Instead of adding the sync_page_delayed function pointer, we could hack up
__wtd_lock_page_waiter to check if a page is backed by an NFS file system, and
call nfs_wake_flushd from there if so. It's an ugly hack, but it's arguably no
worse than assuming every page is backed by a block device.
What do you think?
I agree with you that Greg will need to run OraSim to see if this
new patch addresses the issues that he was seeing. I feel
confident that the patch should help somewhat since it does give
AIO a direct path to NFS that it didn't have before.... but only
time (and testing) will tell....
I'll give Steve's patch a spin and see what happens...
re: comment #15
the sync_page part of the patch will help. but i don't feel that, by
itself, it will be an adequate solution. OraSim appeared to run
faster and more reliably with sync_page and sync_page_delayed; our
TPC-C tests still don't run reliably over NFS aio even with the
complete version of the patch.
re: comment #14
i couldn't find a way to check whether a file is an NFS file without
using symbols that would have to be loaded with nfs.ko. if no NFS
file systems are mounted, and nfs.ko is not loaded, how will worktodo
find these symbols? i'm probably ignorant about how this would work.
thanks for offline comments on how to help my patches be more
acceptable to the KABI gods.
Had a chance to try out the patch from steved in attachment 110197 [details] and
the OAST stress tests pass with ~25% performance improvement. I need
to rerun this a couple of times to take network traffic and filer
load out of the equation, but it looks good.
Steve, you mentioned yesterday that a revised version of this patch is
slated for inclusion in U5?
A fix for this problem has just been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-27.14.EL).
Note that the fix in -27.14.EL has been slightly modified in -27.15.EL,
although the functionality of the fix remains the same.
Can we get access to this latest patch, or the latest kernel with this
patch? We want to do more testing. Thanks.
Van, the partner beta period for U5 is scheduled to begin Wednesday of
this week. The kernel version is 2.4.21-29.EL (which has what you want).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.