Description of problem:

There is a race in 2.6.9's handling of bh submission in
__block_write_full_page() that will allow more than one request to be
submitted for a given bh at the same time. This causes
end_buffer_async_write() to blow up on the following BUG_ON():

  685         struct buffer_head *tmp;
  686         struct page *page;
  687
  688         BUG_ON(!buffer_async_write(bh));

This happens in the 2nd invocation of end_buffer_async_write(), as the
1st caller has already cleared the bit later on in the same function:

  705         spin_lock_irqsave(&page_uptodate_lock, flags);
  706         clear_buffer_async_write(bh);
  707         unlock_buffer(bh);
  708         tmp = bh->b_this_page;

Version-Release number of selected component (if applicable):
Originally reported on 2.6.9-34.EL but confirmed to also affect later
kernels.

How reproducible:
Difficult. Seen in production at a couple of sites every few months.

Steps to Reproduce:
Nick Piggin posted a reproducer to LKML, but we've been unable to
trigger the problem reliably with this test case:

  "When running fsstress -v -d $DIR/tmp -n 1000 -p 1000 -l 2 on an ext2
  filesystem with 1024 byte block size, on SMP i386 with 4096 byte page
  size over loopback to an image file on a tmpfs filesystem, I would
  very quickly hit BUG_ON(!buffer_async_write(bh)); in
  fs/buffer.c:end_buffer_async_write"

Actual results:
BUG

Expected results:
No BUG.

Additional info:
Discussed on LKML here:
http://lkml.org/lkml/2005/4/24/145 [[patch] __block_write_full_page bug]
http://lkml.org/lkml/2005/4/27/85 [[patch] fix the 2nd buffer race properly]
http://lkml.org/lkml/2005/4/30/181 [[patch] alternative fix for VFS race]
This was fixed upstream with the following three commits:

commit f0fbd5fc09b20f7ba7bc8c80be33e39925bb38e1
Author: Andrew Morton <akpm>
Date:   Thu May 5 16:15:48 2005 -0700

    [PATCH] __block_write_full_page() simplification

    The `last_bh' logic probably isn't worth much.  In those situations
    where only the front part of the page is being written out we will
    save some looping, but in the vastly more common case of an all-page
    writeout it just adds more code.

    Cc: Nick Piggin <nickpiggin.au>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

commit 05937baae9fc27b64bcd4378da7d2b14edf7931c
Author: Andrew Morton <akpm>
Date:   Thu May 5 16:15:47 2005 -0700

    [PATCH] __block_write_full_page speedup

    Remove all those get_bh()'s and put_bh()'s by extending lock_page()
    to cover the troublesome regions.  (get_bh() and put_bh() happen
    every time whereas contention on a page's lock in there happens
    basically never).

    Cc: Nick Piggin <nickpiggin.au>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

commit ad576e63e0c8b274a8558b8e05a6d0526e804dc0
Author: Nick Piggin <nickpiggin.au>
Date:   Thu May 5 16:15:46 2005 -0700

    [PATCH] __block_write_full_page race fix

    When running fsstress -v -d $DIR/tmp -n 1000 -p 1000 -l 2 on an ext2
    filesystem with 1024 byte block size, on SMP i386 with 4096 byte
    page size over loopback to an image file on a tmpfs filesystem, I
    would very quickly hit

    BUG_ON(!buffer_async_write(bh));

    in fs/buffer.c:end_buffer_async_write

    It seems that more than one request would be submitted for a given
    bh at a time.  What would happen is the following:

    2 threads doing __mpage_writepages on the same page.

    Thread 1 - lock the page first, and enter __block_write_full_page.
    Thread 1 - (eg.) mark_buffer_async_write on the first 2 buffers.
    Thread 1 - set page writeback, unlock page.
    Thread 2 - lock page, wait on page writeback
    Thread 1 - submit_bh on the first 2 buffers.
    => both requests complete, none of the page buffers are async_write,
       end_page_writeback is called.
    Thread 2 - wakes up. enters __block_write_full_page.
    Thread 2 - mark_buffer_async_write on (eg.) the last buffer
    Thread 1 - finds the last buffer has async_write set, submit_bh on
       that.
    Thread 2 - submit_bh on the last buffer.
    => oops.

    So change __block_write_full_page to explicitly keep track of the
    last bh we need to issue, so we don't touch anything after issuing
    the last request.

    Signed-off-by: Nick Piggin <nickpiggin.au>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

ad576e63e0c8b274a8558b8e05a6d0526e804dc0 is the fix; the remaining two
are cleanups of the surrounding code.
I grabbed a test machine to try to reproduce this, and I hit a
different BUG:

------------[ cut here ]------------
kernel BUG at fs/mpage.c:417!
invalid operand: 0000 [#1]
SMP
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 sunrpc
 cpufreq_powersave loop joydev button battery ac ehci_hcd uhci_hcd
 i5000_edac edac_mc hw_random bnx2 dm_snapshot dm_zero dm_mirror ext3
 jbd dm_mod ata_piix libata cciss sd_mod scsi_mod
CPU:    4
EIP:    0060:[<c017aed1>]    Not tainted VLI
EFLAGS: 00010202   (2.6.9-78.23.ELsmp)
EIP is at mpage_writepage+0xd0/0x56f
eax: 0000001c   ebx: c140d560   ecx: 0000000c   edx: 00000000
esi: d7a4ba70   edi: d7a4ba70   ebp: 00000000   esp: c8d8fdb4
ds: 007b   es: 007b   ss: 0068
Process fsstress (pid: 10153, threadinfo=c8d8f000 task=e1d72df0)
Stack: 00152afe 00000000 00000000 00000000 00000000 00000000 00000000 00000004
       00000001 00000001 00000004 0000000a e341ae50 e341af00 c0196e79 c140d560
       00000000 00000082 c203ad80 c011d31c 0000001b 00000089 0000022b 00000006
Call Trace:
 [<c0196e79>] ext2_get_block+0x0/0x2cb
 [<c011d31c>] try_to_wake_up+0x28e/0x299
 [<c01c5356>] radix_tree_gang_lookup_tag+0x3d/0x53
 [<c014187b>] find_get_pages_tag+0x28/0x64
 [<c01418b0>] find_get_pages_tag+0x5d/0x64
 [<c017b569>] mpage_writepages+0x1f9/0x314
 [<c0196e79>] ext2_get_block+0x0/0x2cb
 [<c0197ebb>] ext2_update_inode+0x30a/0x315
 [<c01463fe>] do_writepages+0x19/0x27
 [<c0179e70>] __sync_single_inode+0x5f/0x1bb
 [<c017a22d>] sync_sb_inodes+0x1a7/0x274
 [<c017a44a>] sync_inodes_sb+0x72/0x83
 [<c02df172>] __cond_resched+0x14/0x39
 [<c017a51c>] sync_inodes+0x16/0x6c
 [<c015dd3a>] do_sync+0x11/0x55
 [<c015dd88>] sys_sync+0xa/0xd
 [<c02e0c43>] syscall_call+0x7/0xb
 [<c02e007b>] __lock_text_end+0x60c/0x1071
Code: 8b 03 f6 c4 10 0f 84 f6 00 00 00 8b 03 f6 c4 10 75 08 0f 0b 9b 01 ec 60 2f c0 8b 44 24 3c 31 ed 8b 70 0c 89 f7 8b 07 a8 04 74 08 <0f> 0b a1 01 ec 60 2f c0 8b 07 a8 10 75 21 8b 07 a8 02 0f 85 1e
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

That bug is in
mpage_writepage():

	if (page_has_buffers(page)) {
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		/* If they're all mapped and dirty, do it */
		page_block = 0;
		do {
			BUG_ON(buffer_locked(bh));   <==================

This could be related, but I'll have to think about it a bit more. I'll
retry with a patched kernel to make sure the issue has been resolved.
I re-ran the test with a kernel that was patched with the above 3 commits and I still get this failure. I'll continue investigating.
I opened up a separate bugzilla (480404) for the problem I ran into. Once that's fixed, hopefully I can reproduce the bug reported here.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.44.EL. RPMs are available at
http://people.redhat.com/vgoyal/rhel4/
This is hard to reproduce. I followed the steps in comment 0 and let it
run for 8 hours on different hosts; I can hit bug 480404 but not this
one. Ran the reproducer on the 2.6.9-96 kernel over the weekend (about
60 hours) and no issue was found.

The following patches are applied correctly in the 2.6.9-96 kernel:

linux-2.6.9-fs-buffer-__block_write_full_page-race-fix.patch
linux-2.6.9-fs-buffer-__block_write_full_page-speedup-by-removin.patch
linux-2.6.9-fs-buffer-__block_write_full_page-simplification-by-.patch

Setting this to SanityOnly.
An advisory has been issued which should help the problem described in
this bug report. This report is therefore being closed with a
resolution of ERRATA. For more information on the solution and/or where
to find the updated files, please follow the link below. You may reopen
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html