From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
The worktodo implementation in RHEL 3.0 kernels does not support NFS aio. The problem is that logic in mm/wtd.c invokes run_task_queue(&tq_disk) directly rather than using the a_ops->sync_page interface. For local block-based file systems these are equivalent, but for NFS, sync_page is required because NFS does not use tq_disk at all.

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.EL

How reproducible:
Sometimes

Steps to Reproduce:
I've been testing aio over NFS using OraSim with one of the sample job files that sets up async I/O drivers.

Actual Results:
Sometimes it works fine; sometimes OraSim reports that I/O requests have been lost.

Expected Results:
OraSim with async I/O on NFS files should always work.

Additional info:
Using the sync_page interface means that the page being TryLock'd somehow has to be passed to __wtd_lock_page_waiter() so it can call the correct sync_page op on the correct page. Right now __wtd_lock_page_waiter uses a global statically allocated structure with a NULL data argument to pass work to keventd. To use sync_page, it will need to be changed to use a dynamic tq_struct for each page being waited upon.
This is important for making NFS AIO reliable, so it is a priority for us.
Created attachment 106131 [details]
OraSim 4.2 job file to generate aio on NFS

Yes, I know OraSim 4.2 is old. You should be able to use any of the OraSim-6.1 sample job files that use the async drivers and reproduce this problem. The issue shows up with the following message (from memory):

FATAL: Async I/O queue still jammed after 10000 milliseconds!
OK, I've found a much better test case that tickles precisely this problem.

1. Mount an NFS file system on a GbE-attached client.
2. Use iozone to create a file that is about as large as local RAM, like this: "iozone -s 1g -i 0 -w" (where 1g is replaced with the memory size of your client).
3. Now run: "iozone -r 32 -i 1 -k 32"

This will trigger calls to the wtd stuff, which will vainly try to run the disk queue to flush iozone's requests.
Created attachment 106842 [details]
NFS support for worktodo implementation

Here's a patch that addresses this bug. Please review and comment.
> +	int (*sync_page_delayed)(struct page *);

An address space op for doing the async variant of sync_page that aio is doing by hand (based on invalid assumptions) certainly seems reasonable. Can we get a big fat comment about what it does?

> +static inline int sync_page_delayed(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (mapping && mapping->a_ops && mapping->a_ops->sync_page_delayed)
> +		return mapping->a_ops->sync_page_delayed(page);
> +	schedule_task(&run_disk_tq);
> +	return 0;
> +}

There is space/tab damage here. I don't think it should always schedule the run_disk_tq, either. Instead the patch should just add a block_sync_page_delayed() which pairs with block_sync_page(). Then add it to the file systems we care about (ext3?).

> +static inline int sync_page(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> +		return mapping->a_ops->sync_page(page);
> +	run_task_queue(&tq_disk);
> +	return 0;
> +}

(Tab/space again.) Why not export the sync_page() from mm/filemap.c? Another sync_page() that unconditionally runs tq_disk is confusing.
Tab damage now fixed in my version of the patch; sorry about that. I considered a block_sync_page_delayed, but thought that might be a little too much churn for a prototype. I can add it if people think this whole approach is a good idea. sync_page and sync_page_delayed are both local helper functions. Especially if we take the block_sync_page_delayed approach, I guess neither of these helpers should explicitly run the disk queue.
Created attachment 106991 [details]
second revision of sync_page support in worktodo

This version addresses earlier comments and fixes a race condition that resulted in nfs_flushd not running often enough. It appears that this patch by itself is enough to make NFS aio usable and performant (the patch for 139582 may not be necessary, and in fact may result in worse performance in some cases). I've tested with OraSim. Comments and review, please.
Created attachment 110197 [details]
KABI friendlier patch

The patch proposed in comment #8 has major KABI issues, and it appears the entire patch may not be needed to make AIO work better over NFS. From my testing, using the two iozone commands Chuck suggested, it appears only the addition of the __sync_page() call to wtd_lock_page() is really needed to make things work better. Plus, there is no KABI issue with that part of the patch. Working with our AIO guy, I understand why the new sync_page_delayed() is a desirable addition, but it is a KABI infraction, and it is not really clear it is needed to address the bug, which is to make NFS work better with AIO. So the patch I'm proposing only introduces the __sync_page() call. Please test it to see if it truly fixes the problem.
What exactly is the KABI issue with the sync_page_delayed() part of the patch? I thought that part would still be acceptable. At this point, we need OraSim and real Oracle database testing; I will leave that to Greg. iozone testing is not enough here: we've found that Oracle still has some problems running with NFS aio, especially without sync_page_delayed.
The patch modifies the address_space_operations structure which is exported. That most certainly breaks the kernel ABI. Instead of adding the sync_page_delayed function pointer, we could hack up __wtd_lock_page_waiter to check if a page is backed by an NFS file system, and call nfs_wake_flushd from there if so. It's an ugly hack, but it's arguably no worse than assuming every page is backed by a block device. What do you think?
Chuck, I agree with you that Greg will need to run OraSim to see if this new patch addresses the issues he was seeing. I feel confident that the patch should help somewhat, since it does give AIO a direct path to NFS that it didn't have before... but only time (and testing) will tell.
I'll give Steve's patch a spin and see what happens...
Re: comment #15: the sync_page part of the patch will help, but I don't feel that, by itself, it will be an adequate solution. OraSim appeared to run faster and more reliably with sync_page and sync_page_delayed; our TPC-C tests still don't run reliably over NFS aio even with the complete version of the patch.

Re: comment #14: I couldn't find a way to check whether a file is an NFS file without using symbols that would have to be loaded with nfs.ko. If no NFS file systems are mounted and nfs.ko is not loaded, how will worktodo find these symbols? I'm probably ignorant about how this would work. Thanks for the offline comments on how to make my patches more acceptable to the KABI gods.
Had a chance to try out the patch from steved in attachment 110197 [details], and the OAST stress tests pass with a ~25% performance improvement. I need to rerun this a couple of times to rule out network traffic and filer load, but it looks good.
Steve, you mentioned yesterday that a revised version of this patch is slated for inclusion in U5?
A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.14.EL).
Note that the fix in -27.14.EL has been slightly modified in -27.15.EL, although the functionality of the fix remains the same.
Can we get access to this latest patch, or the latest kernel with this patch? We want to do more testing. Thanks.
Van, the partner beta period for U5 is scheduled to begin Wednesday of this week. The kernel version is 2.4.21-29.EL (which has what you want).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html