Bug 137830
| Field | Value |
|---|---|
| Summary | worktodo does not support NFS aio |
| Product | Red Hat Enterprise Linux 3 |
| Component | kernel |
| Version | 3.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Chuck Lever <cel> |
| Assignee | Steve Dickson <steved> |
| QA Contact | Brian Brock <bbrock> |
| CC | bjohnson, greg.marsden, jmoyer, peterm, petrides, riel, steved, zach.brown |
| Doc Type | Bug Fix |
| Bug Blocks | 132991 |
| Last Closed | 2005-05-18 13:28:23 UTC |
Description (Chuck Lever, 2004-11-01 21:13:36 UTC):
Created attachment 106131 [details]: OraSim 4.2 job file to generate aio on NFS

This is important for making NFS AIO reliable, so it is a priority for us.
yes, i know OraSim 4.2 is old. you should be able to use any of the OraSim-6.1 sample job files that use the async drivers to reproduce this problem. the issue shows up with the following message (from memory):

`FATAL: Async I/O queue still jammed after 10000 milliseconds!`
ok, i've found a much better test case that tickles precisely this problem:

1. mount an NFS file system on a GbE-attached client.
2. use iozone to create a file that is about as large as local RAM, like this: `iozone -s 1g -i 0 -w` (where `1g` is replaced with the memory size of your client).
3. now run: `iozone -r 32 -i 1 -k 32`

this will trigger calls to the wtd stuff, which will vainly try to run the disk queue to flush iozone's requests.
Created attachment 106842 [details]: NFS support for worktodo implementation

here's a patch that addresses this bug. please review and comment.
> + int (*sync_page_delayed)(struct page *);

An address space op for doing the async variant of sync_page that aio is doing by hand (based on invalid assumptions) certainly seems reasonable. Can we get a big fat comment about what it does?

> +static inline int sync_page_delayed(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (mapping && mapping->a_ops && mapping->a_ops->sync_page_delayed)
> +		return mapping->a_ops->sync_page_delayed(page);
> +	schedule_task(&run_disk_tq);
> +	return 0;
> +}

There is space/tab damage here. I don't think it should always schedule the run_disk_tq, either. Instead the patch should just add a block_sync_page_delayed() which pairs with block_sync_page(). Then add it to the file systems we care about (ext3?).

> +static inline int sync_page(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> +		return mapping->a_ops->sync_page(page);
> +	run_task_queue(&tq_disk);
> +	return 0;
> +}

(Tab/space again.) Why not export the sync_page() from mm/filemap.c? Another sync_page() that unconditionally runs tq_disk is confusing.

tab damage now fixed in my version of the patch. sorry about that. i considered a block_sync_page_delayed, but thought that might be a little too much churn for a prototype. i can add it if people think this whole approach is a good idea. sync_page and sync_page_delayed are both local helper functions. especially if we take the block_sync_page_delayed approach, i guess neither of these helpers should explicitly run the disk queue.
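For reference, a minimal sketch of the block_sync_page_delayed() pairing suggested in the review above, assuming the 2.4-era task-queue APIs (tq_struct, schedule_task(), run_task_queue(), tq_disk). The helper and tq_struct names are illustrative; this is not code from any of the attached patches:

```c
#include <linux/mm.h>
#include <linux/blkdev.h>
#include <linux/tqueue.h>

/* deferred worker: actually run the disk task queue, from keventd */
static void wtd_run_tq_disk(void *unused)
{
	run_task_queue(&tq_disk);
}

/* hypothetical static work item handed to schedule_task() */
static struct tq_struct wtd_disk_tq = {
	.routine = wtd_run_tq_disk,
};

/* pairs with 2.4's block_sync_page(), which runs tq_disk inline;
 * here the queue run is deferred, so this is safe from contexts
 * (like aio retry paths) that must not unplug devices directly */
int block_sync_page_delayed(struct page *page)
{
	schedule_task(&wtd_disk_tq);
	return 0;
}
```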
Created attachment 106991 [details]: second revision of sync_page support in worktodo
this version addresses earlier comments, and fixes a race condition that
results in nfs_flushd not running often enough.
it appears that this patch by itself is enough to make NFS aio usable and
performant (the patch for 139582 may not be necessary, and in fact may result
in worse performance in some cases). i've tested with OraSim.
comments and review please.
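To make the NFS side of the approach concrete, here is a hypothetical sketch of what an NFS a_ops->sync_page_delayed could look like. The nfs_wake_flushd() helper is mentioned below in this bug, but its signature and this wiring are assumptions, not the attached patch:

```c
#include <linux/mm.h>

/* assumed helper from the NFS client; the name appears in the
 * discussion below, but the signature here is a guess */
extern void nfs_wake_flushd(void);

/* NFS pages are not backed by a block device, so running tq_disk does
 * nothing useful for them; wake the NFS flush daemon instead so queued
 * writes actually get pushed to the server */
static int nfs_sync_page_delayed(struct page *page)
{
	nfs_wake_flushd();
	return 0;
}
```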
Created attachment 110197 [details]: KABI friendlier patch

The patch proposed in comment #8 has major KABI issues, and it appears the entire patch may not be needed to make AIO work better over NFS. From my testing, using the two iozone commands Chuck suggested, it appears only the addition of the __sync_page() call to wtd_lock_page() is really needed to make things work better. Plus, there isn't any KABI issue with that part of the patch. Working with our AIO guy, I understand why the new sync_page_delayed() is a desired addition, but it's a KABI infraction, plus it's not really clear it's needed to explicitly address the bug, which is: make NFS work better with AIO. So the patch I'm proposing only introduces the __sync_page() call. Please test this to see if it truly fixes the problem.

what exactly is the KABI issue with the sync_page_delayed() part of the patch? i thought that part would still be acceptable. at this point, we need OraSim and real Oracle database testing. i will leave that to Greg. iozone testing is not enough here; we've found that Oracle still has some problems running with NFS aio, especially without sync_page_delayed.

The patch modifies the address_space_operations structure, which is exported. That most certainly breaks the kernel ABI. Instead of adding the sync_page_delayed function pointer, we could hack up __wtd_lock_page_waiter to check if a page is backed by an NFS file system, and call nfs_wake_flushd from there if so. It's an ugly hack, but it's arguably no worse than assuming every page is backed by a block device. What do you think?

Chuck, I agree with you that Greg will need to run OraSim to see if this new patch addresses the issues that he was seeing. I feel confident that the patch should help somewhat, since it does give AIO a direct path to NFS that it didn't have before... but only time (and testing) will tell.

I'll give Steve's patch a spin and see what happens...

re: comment #15: the sync_page part of the patch will help, but i don't feel that, by itself, it will be an adequate solution. OraSim appeared to run faster and more reliably with sync_page and sync_page_delayed; our TPC-C tests still don't run reliably over NFS aio even with the complete version of the patch.

re: comment #14: i couldn't find a way to check whether a file is an NFS file without using symbols that would have to be loaded with nfs.ko. if no NFS file systems are mounted, and nfs.ko is not loaded, how will worktodo find these symbols? i'm probably ignorant about how this would work. thanks for offline comments on how to help my patches be more acceptable to the KABI gods.

Had a chance to try out the patch from steved in attachment 110197 [details], and the OAST stress tests pass with ~25% performance improvement. I need to rerun this a couple of times to rule out network traffic and filer load from the equation, but it looks good.
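As an illustration of the comment #14 idea and the module-symbol objection raised against it, here is a hypothetical sketch that identifies NFS-backed pages by superblock magic and reaches the flush daemon through a registered callback. None of these names or this mechanism come from the actual patches; they only show one way the symbol-visibility problem could be sidestepped:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/nfs_fs.h>	/* NFS_SUPER_MAGIC */

/* hypothetical hook: left NULL until the NFS module loads and
 * registers its flush-daemon kicker, which avoids an unresolved
 * reference to nfs_wake_flushd() from core aio code */
static void (*nfs_flushd_kick)(void);

static void wtd_maybe_kick_nfs(struct page *page)
{
	struct address_space *mapping = page->mapping;

	/* identify NFS-backed pages by superblock magic, not symbols */
	if (mapping && mapping->host &&
	    mapping->host->i_sb->s_magic == NFS_SUPER_MAGIC &&
	    nfs_flushd_kick)
		nfs_flushd_kick();
}
```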
Steve, you mentioned yesterday that a revised version of this patch is slated for inclusion in U5?

A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.14.EL). Note that the fix in -27.14.EL has been slightly modified in -27.15.EL, although the functionality of the fix remains the same.

Can we get access to this latest patch, or the latest kernel with this patch? We want to do more testing. Thanks.

Van, the partner beta period for U5 is scheduled to begin Wednesday of this week. The kernel version is 2.4.21-29.EL (which has what you want).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html