Description of problem:
A dio write returns EIO when try_to_release_page fails because the bh is still referenced. This is caused by the race between freeing buffer(dio) and committing a transaction, and by the race between freeing buffer(dio) and background_writeout->ext3_ordered_writepage.

The patch fixing this race was merged to linus-tree.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6ccfa806a9cfbbf1cd43d5b6aa47ef2c0eb518fd;hp=344c790e3821dac37eb742ddd0b611a300f78b9a

Following is the patch for RHEL5. Please apply this patch to RHEL5.

diff -Nrup 2.6.18-92.el5.org/mm/filemap.c 2.6.18-92.el5.diofix/mm/filemap.c
--- 2.6.18-92.el5.org/mm/filemap.c	2008-09-04 11:55:58.000000000 +0900
+++ 2.6.18-92.el5.diofix/mm/filemap.c	2008-09-04 14:19:32.000000000 +0900
@@ -2633,13 +2633,20 @@ generic_file_direct_IO(int rw, struct ki
 	 * After a write we want buffered reads to be sure to go to disk to get
 	 * the new data.  We invalidate clean cached page from the region we're
 	 * about to write.  We do this *before* the write so that we can return
-	 * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+	 * without clobbering -EIOCBQUEUED from ->direct_IO().
 	 */
 	if (rw == WRITE && mapping->nrpages) {
 		retval = invalidate_inode_pages2_range(mapping,
 					offset >> PAGE_CACHE_SHIFT, end);
-		if (retval)
+		/*
+		 * If a page can not be invalidated, return 0 to fall back
+		 * to buffered write.
+		 */
+		if (retval) {
+			if (retval == -EBUSY)
+				retval = 0;
 			goto out;
+		}
 	}
 
 	retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
diff -Nrup 2.6.18-92.el5.org/mm/truncate.c 2.6.18-92.el5.diofix/mm/truncate.c
--- 2.6.18-92.el5.org/mm/truncate.c	2008-09-04 11:55:57.000000000 +0900
+++ 2.6.18-92.el5.diofix/mm/truncate.c	2008-09-04 14:13:04.000000000 +0900
@@ -323,7 +323,7 @@ failed:
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EBUSY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
@@ -383,7 +383,7 @@ int invalidate_inode_pages2_range(struct
 				if (!invalidate_complete_page2(mapping, page)) {
 					if (was_dirty)
 						set_page_dirty(page);
-					ret = -EIO;
+					ret = -EBUSY;
 				}
 				unlock_page(page);
 			}

Additional info:
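For anyone following the patch, its control flow can be modeled in user space. This is only a rough sketch and not kernel code: model_invalidate, model_direct_io, model_generic_file_direct_io and model_write are hypothetical names, and the buffered fallback in model_write is an assumption drawn from the patch comment ("return 0 to fall back to buffered write").

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for invalidate_inode_pages2_range(): reports
 * -EBUSY when a cached page cannot be invalidated (e.g. bh still held). */
int model_invalidate(int page_busy)
{
	return page_busy ? -EBUSY : 0;
}

/* Hypothetical stand-in for ->direct_IO(): pretends every byte is written. */
long model_direct_io(long len)
{
	return len;
}

/* Models the patched write path: -EBUSY from invalidation is turned
 * into 0 ("direct I/O wrote nothing") instead of an error. */
long model_generic_file_direct_io(int page_busy, long len)
{
	int retval = model_invalidate(page_busy);

	if (retval) {
		if (retval == -EBUSY)
			retval = 0;
		return retval;
	}
	return model_direct_io(len);
}

/* Assumed caller behaviour: whatever direct I/O did not write is
 * completed through the buffered path, so the caller sees no EIO. */
long model_write(int page_busy, long len, int *used_buffered)
{
	long written;

	assert(used_buffered != NULL);
	*used_buffered = 0;

	written = model_generic_file_direct_io(page_busy, len);
	if (written < 0)
		return written;
	if (written < len) {
		*used_buffered = 1;
		written = len; /* buffered write assumed to succeed */
	}
	return written;
}
```

With a busy page, model_write(1, 4096, &used) still reports 4096 bytes written and sets used to 1. Before the patch, the equivalent path would have returned -EIO to the application.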
Updating PM score.
hifumi hisashi, just for information, what was the pattern for fsstress?
$ fsstress -d testdir -l 0 -n 1000 -p 2000 -z -f creat=1 -f write=1 -f dwrite=1

To reproduce this, buffered write and dio write mixed workload is the key.
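To illustrate what "mixed" means here, the sketch below does one buffered write and one O_DIRECT write to the same block of the same file. It is not a reproducer for the race (that needs many processes and heavy load, as with fsstress). The function names and the 4096-byte BLK size are assumptions for illustration only, and O_DIRECT may be rejected on some filesystems.

```c
#define _GNU_SOURCE /* for O_DIRECT on Linux */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096 /* assumed alignment and I/O size; real devices may differ */

/* Buffered write of one block at offset 0, leaving the page in the
 * page cache. Returns bytes written or -errno. */
long buffered_write_block(const char *path)
{
	char buf[BLK];
	long ret;
	ssize_t n;
	int fd;

	assert(path != NULL);
	memset(buf, 'b', sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -errno;
	n = pwrite(fd, buf, sizeof(buf), 0);
	ret = n < 0 ? -errno : (long)n;
	close(fd);
	return ret;
}

/* O_DIRECT write over the same block; the kernel first has to
 * invalidate the cached page, which is the step that raced.
 * Returns bytes written or -errno. */
long direct_write_block(const char *path)
{
	void *buf;
	long ret;
	ssize_t n;
	int fd, err;

	assert(path != NULL);
	err = posix_memalign(&buf, BLK, BLK); /* O_DIRECT needs aligned memory */
	if (err)
		return -err;
	memset(buf, 'd', BLK);
	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0) {
		ret = -errno;
		free(buf);
		return ret;
	}
	n = pwrite(fd, buf, BLK, 0);
	ret = n < 0 ? -errno : (long)n;
	close(fd);
	free(buf);
	return ret;
}
```

Calling buffered_write_block() and then direct_write_block() on the same path from many processes at once approximates what the write=1 and dwrite=1 options do in the fsstress run above.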
(In reply to comment #3)
> $ fsstress -d testdir -l 0 -n 1000 -p 2000 -z -f creat=1 -f write=1 -f dwrite=1
>
> To reproduce this, buffered write and dio write mixed workload is the key.

Many thanks!
So, I've been running fsstress for almost 3 days now and haven't seen any -EIO. Are there any special conditions to trigger this bug?
It may be difficult to reproduce this bug. I think increasing the load on the HDD is important (increasing the write I/O size or the number of processes). In my environment, it took 2 days to get -EIO from dio after adding the above modification to fsstress.
Any Updates?
Event posted on 11-12-2009 09:22am EST by willg

Oracle has come back again and believes the problem to be an OS issue. The following was noted from Oracle logs on 10/30/09:

The error occurred on 10/30/09 at 04:41:17.

04:41:17.396007 write(2, "ORA-31693: Table data object "EB"..., 132) = 132
04:41:17.396151 write(2, "ORA-31644: unable to position to"..., 143) = 143
04:41:17.396195 write(2, "ORA-19502: write error on file ""..., 147) = 147
04:41:17.396233 write(2, "ORA-27072: File I/O error\n", 26) = 26

ORA-31693: Table data object "EBR_RPT"."CLM_FACT_B_SM_02_MV":"SYS_P383714" failed to load/unload and is being skipped due to error:
ORA-31644: unable to position to block number 821431 in dump file "/dbbackup/d245/exp/tsm_excluded/d245_expdp_full_tsm_strace_091029_2200.dmp"
ORA-19502: write error on file "/dbbackup/d245/exp/tsm_excluded/d245_expdp_full_tsm_strace_091029_2200.dmp", block number 821420 (block size=4096)
ORA-27072: File I/O error
Additional information: 4
Additional information: 821420

We are not seeing anything in messages that corresponds to these time periods. I'm attaching the tests that our DBAs have run to reproduce the problem. The problem appears to correspond to high I/O usage throughout the environment, as the machine names in the test case attachment are on different blade servers/OSes. Our DBAs are wondering if Red Hat and Oracle could get together to discuss troubleshooting steps we should put in place to better figure the issue out.

Here is another recent note from the DBAs:

I did not try to reproduce the error and we hit it on three different database dumps on lslcebr3n02 (dev) this morning. There was a database ETL load running on lslcebr5n01-4 (ppmo) this morning. The error seems to occur when there is a lot of I/O from the EBR dev, ppmo and prod nodes.

01:22:22.491166 write(2, "ORA-31693: Table data object "EB"..., 135) = 135
01:22:22.491322 write(2, "ORA-31644: unable to position to"..., 143) = 143
01:22:22.491405 write(2, "ORA-19502: write error on file ""..., 147) = 147
01:22:22.491481 write(2, "ORA-27072: File I/O error\n", 26) = 26
01:22:22.491555 write(2, "Additional information: 4\n", 26) = 26
01:22:22.491631 write(2, "Additional information: 500192\n", 31) = 31

02:29:43.889323 write(2, "ORA-31693: Table data object "EB"..., 134) = 134
02:29:43.889995 write(2, "ORA-31644: unable to position to"..., 143) = 143
02:29:43.890102 write(2, "ORA-19502: write error on file ""..., 147) = 147
02:29:43.890207 write(2, "ORA-27072: File I/O error\n", 26) = 26
02:29:43.890324 write(2, "Additional information: 4\n", 26) = 26
02:29:43.890770 write(2, "Additional information: 411823\n", 31) = 31

04:29:47.667546 write(2, "ORA-31693: Table data object "EB"..., 126) = 126
04:29:47.667701 write(2, "ORA-31644: unable to position to"..., 143) = 143
04:29:47.667818 write(2, "ORA-19502: write error on file ""..., 147) = 147
04:29:47.667910 write(2, "ORA-27072: File I/O error\n", 26) = 26
04:29:47.667998 write(2, "Additional information: 4\n", 26) = 26
04:29:47.668083 write(2, "Additional information: 683701\n", 31) = 31

This event sent from IssueTracker by willg issue 345418
I see no reason not to include the patch initially posted to this bugzilla. I'm a little confused about the last comment, though. Was the customer running with the patched kernel and still hit I/O errors?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
@hifumi.hisashi.co.jp, @GSS We need to confirm that there is third-party commitment to test for the resolution of this request during the RHEL 5.5 Beta Test Phase before we can approve it for acceptance into the release. RHEL 5.5 Beta Test Phase is expected to begin around February 2010. In order to avoid any unnecessary delays, please post a confirmation as soon as possible, including the contact information for testing engineers. Any additional information about alternative testing variations we could use to reproduce this issue in-house would be appreciated.
in kernel-2.6.18-181.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
bug state changed from ON_QA to VERIFIED Sanity only. Patch is actually being applied.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
Event posted on 06-29-2010 01:52pm EDT by jbrier I called this customer yesterday and asked for a status update on testing with the latest kernel. This event sent from IssueTracker by jbrier issue 345418
Event posted on 06-29-2010 03:00pm EDT by jbrier Customer responded with the following: === Edward Cheadle applied the kernel to one node of the RAC cluster on June 24th. We are still waiting on information from the DBA to determine if the problem is still being seen. Thanks. === I'll keep you updated when we hear back. This event sent from IssueTracker by jbrier issue 345418
Event posted on 08-06-2010 04:51pm EDT by jbrier Just got another update from the customer: === Update, we resolved the kernel panic issue. It was some old hp-ilo drivers that were causing the panic. Our DBA is having us apply this kernel to a second set of RAC servers before calling this issue resolved. === This event sent from IssueTracker by jbrier issue 345418
Customer finally closed the case. I assume they confirmed on their second host/cluster that the patch fixed the issue.

#243 Created By: Bedford, Brian (10/26/2010 10:47 AM)
This issue can be considered resolved. Thanks for all of your assistance.