Hide Forgot
=Comment: #0================================================= Mohammed Omar <mohd.omar.com> - 2007-11-21 09:30 EDT ---Problem description:-- System generated a "Call Trace and Badness at fs/buffer.c" while executing [bonnie++ dbench fs_inod fsstress fsx_linux IOZone postmark tiobench] against EXT2,EXT3,REISERFS,XFS file systems. Initially its generated due to tiotest, next time it shows due to fs_inod. -----/var/log/message ------------ Nov 18 13:08:07 p520b kernel: Badness at fs/buffer.c:570 Nov 18 13:08:07 p520b kernel: NIP: c000000000140c18 LR: c000000000140d18 CTR: 0000000000000000 Nov 18 13:08:07 p520b kernel: REGS: c00000004a102c50 TRAP: 0700 Not tainted (2.6.23.1-42.fc8) Nov 18 13:08:07 p520b kernel: MSR: 8000000000029032 <EE,ME,IR,DR> CR: 24002424 XER: 20000000 Nov 18 13:08:07 p520b kernel: TASK = c0000000339271a0[14832] 'tiotest' THREAD: c00000004a100000 CPU: 0 Nov 18 13:08:07 p520b kernel: GPR00: 0000000000000001 c00000004a102ed0 c00000000073a750 c00000001e1f8d68 Nov 18 13:08:07 p520b kernel: GPR04: c00000004a102fc0 0000000000100000 c00000001e1f8d68 c000000001157540 Nov 18 13:08:07 p520b kernel: GPR08: c00000003600fc18 c00000003600fc18 c00000001e1f8d68 c00000001e1f8db0 Nov 18 13:08:07 p520b kernel: GPR12: 0000000000000000 c00000000065c900 c00000000c2f1900 c00000004a103c50 Nov 18 13:08:07 p520b kernel: GPR16: 0000000000011dde c000000000709e48 0000000000000000 00000000000a01d2 Nov 18 13:08:07 p520b kernel: GPR20: 0000000000000020 0000000000000020 0000000000000000 0000000000000002 Nov 18 13:08:07 p520b kernel: GPR24: 0000000000000005 c00000004a103298 c00000004a1034b0 0000000000000000 Nov 18 13:08:07 p520b kernel: GPR28: c00000004a102fc0 c000000001157540 c0000000006ea688 c000000007430c60 Nov 18 13:08:07 p520b kernel: NIP [c000000000140c18] .drop_buffers+0xcc/0x14c Nov 18 13:08:07 p520b kernel: LR [c000000000140d18] .try_to_free_buffers+0x80/0xfc Nov 18 13:08:07 p520b kernel: Call Trace: Nov 18 13:08:07 p520b kernel: [c00000004a102ed0] [c00000000014068c] .free_buffer_head+0x4c/0x80 (unreliable) Nov 18 13:08:07 p520b kernel: [c00000004a102f50] [c000000000140d18] .try_to_free_buffers+0x80/0xfc Nov 18 13:08:07 p520b kernel: [c00000004a103000] [c0000000000d9540] .try_to_release_page+0x7c/0x90 Nov 18 13:08:07 p520b kernel: [c00000004a103080] [c0000000000e48d0] .shrink_page_list+0x64c/0x7cc Nov 18 13:08:07 p520b kernel: [c00000004a103220] [c0000000000e4bb0] .shrink_inactive_list+0x160/0x430 Nov 18 13:08:07 p520b kernel: [c00000004a103390] [c0000000000e4fb8] .shrink_zone+0x138/0x1a4 Nov 18 13:08:07 p520b kernel: [c00000004a103440] [c0000000000e60d4] .try_to_free_pages+0x1d4/0x338 Nov 18 13:08:07 p520b kernel: [c00000004a103540] [c0000000000deb84] .__alloc_pages+0x20c/0x36c Nov 18 13:08:07 p520b kernel: [c00000004a103630] [c000000000100084] .alloc_pages_current+0xd4/0xf8 Nov 18 13:08:07 p520b kernel: [c00000004a1036c0] [c0000000000d726c] .__page_cache_alloc+0xac/0xc8 Nov 18 13:08:07 p520b kernel: [c00000004a103740] [c0000000000e12b4] .__do_page_cache_readahead+0xd4/0x2f4 Nov 18 13:08:07 p520b kernel: [c00000004a1038a0] [c0000000000e1790] .ondemand_readahead+0x154/0x170 Nov 18 13:08:07 p520b kernel: [c00000004a103930] [c0000000000d7784] .do_generic_mapping_read+0x178/0x550 Nov 18 13:08:07 p520b kernel: [c00000004a103a70] [c0000000000d9728] .generic_file_aio_read+0x144/0x1d4 Nov 18 13:08:07 p520b kernel: [c00000004a103b50] [c00000000011233c] .do_sync_read+0xc4/0x124 Nov 18 13:08:07 p520b kernel: [c00000004a103cf0] [c000000000112d80] .vfs_read+0x120/0x208 Nov 18 13:08:07 p520b kernel: [c00000004a103d90] [c00000000011356c] .sys_read+0x4c/0x8c Nov 18 13:08:07 p520b kernel: [c00000004a103e30] [c000000000008548] syscall_exit+0x0/0x40 Nov 18 13:08:07 p520b kernel: Instruction dump: Nov 18 13:08:07 p520b kernel: 396a0048 e86a0008 7fa85800 419e0050 e92b0008 f9090000 f9280008 f96b0008 Nov 18 13:08:07 p520b kernel: f96a0048 e80a0058 7c000074 7800d182 <0b000000> e80a0000 7809afe3 4182001c --- "uname -a"----- Linux p520b.in.ibm.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30 13:05:49 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux ---Hardware Environment--- Machine type: p520b Cpu type :Power5 ---Is this reproducible?---- YES ----Steps to reproduce--------- 1.Create 4 Partitions of 5GB each. format it with EXT2, EXT3,REISERFS,XFS and mount them on /EXT2 , /EXT3 , /REISERFS, /XFS 2. Download bonnie++ dbench fs_inod fsstress fsx_linux IOZone postmark tiobench in /home/test/EXT2 , /home/test/EXT3, /home/test/REISERFS , /home/test/XFS and run this following script #!/bin/sh SUM_FS_TEST_HOME=/home/test/EXT3; export SUM_FS_TEST_HOME SUM_FS_TEST_TYPE=EXT3; export SUM_FS_TEST_TYPE SUM_FS_TEST_DIR=/EXT3; export SUM_FS_TEST_DIR ROOT_PATH=`pwd` cd bonnie++ ./bonnietest.sh >& Outputs/runtime_messages & cd $ROOT_PATH cd dbench ./dbenchtest.sh >& Outputs/runtime_messages & cd $ROOT_PATH cd fs_inod ./fs_inodtest.sh >& Outputs/runtime_messages & cd $ROOT_PATH cd fsstress ./fsstresstest.sh >& Outputs/runtime_messages & cd $ROOT_PATH cd fsx_linux ./fsx-linuxtest.sh >& Outputs/runtime_messages & cd $ROOT_PATH cd IOZone ./iozonetest.sh >& Outputs/runtime_messages & cd $ROOT_PATH cd postmark ./postmarktest.sh >& Outputs/runtime_message & cd $ROOT_PATH cd tiobench ./tiobenchtest.sh >& Outputs/runtime_messages & cd $ROOT_PATH ---Did the system produce an OOPS message on the console?--- NO --Is the system sitting in a debugger right now?-- NO --Additional Information------ bonnietest.sh fsx-linuxtest.sh tiobenchtest.sh fs_inodtest.sh... etc are the scripts which run respective binaries. --thanks omar =Comment: #1================================================= Mohammed Omar <mohd.omar.com> - 2007-11-21 09:38 EDT var-log-messages log attached =Comment: #2================================================= Supriya Kannery <skannery.com> - 2007-11-23 01:08 EDT Mohammed, How fast is the problem getting recreated after the scripts are started? =Comment: #3================================================= Mohammed Omar <mohd.omar.com> - 2007-11-23 04:20 EDT Hi supriya, > How fast is the problem getting recreated after the scripts are started? Its not generating immediately. Its taking 5-6 hrs to recreate it. --thanks omar =Comment: #6================================================= Supriya Kannery <skannery.com> - 2007-11-27 07:18 EDT fs/buffer.c:570 567 static inline void __remove_assoc_queue(struct buffer_head *bh) 568 { 569 list_del_init(&bh->b_assoc_buffers); 570 WARN_ON(!bh->b_assoc_map); <== Bug here 571 if (buffer_write_io_error(bh)) 572 set_bit(AS_EIO, &bh->b_assoc_map->flags); 573 bh->b_assoc_map = NULL; 574 } bh->b_assoc_map points to the address_space to which the buffer_head maps to. Before control path reached here, this had become NULL. Doubting that __bforget() on this buffer_head is called earlier due to which the b_assoc_map was made to NULL. Analysing further.. =Comment: #7================================================= Supriya Kannery <skannery.com> - 2007-11-29 05:18 EDT Mohammed, Attaching two files: 1. A debug patch which is just a printk added to get information related to the address_space pointed by bh->b_assoc_map and that pointed by page->mapping when problem happens. Please apply this debug patch and recreate the problem once again. Collect the dmesg output once problem is recreated and attach here. 2. A test patch which has got code change for addressing this issue. Please apply this test patch and verify whether the issue is addressed or not. Thanks, Supriya =Comment: #8================================================= Supriya Kannery <skannery.com> - 2007-11-29 05:20 EDT Debug patch for getting more info =Comment: #9================================================= Supriya Kannery <skannery.com> - 2007-11-29 05:31 EDT Test patch for avoiding the badness =Comment: #11================================================= Supriya Kannery <skannery.com> - 2007-12-03 05:22 EDT Mohammed is facing some network connection problem. Hence he passed on this info to me to update in bugzilla: He tried the second patch, it works fine, doesn't throw any call trace. He will be updating on the first patch test result later. =Comment: #15================================================= Mohammed Omar <mohd.omar.com> - 2007-12-13 04:51 EDT Supriya, I applied Debug Patch to F8 kernel, recompiled and restarted the tests. I have not noticed any messages or Call traces.As we had a discussion, printf statement might be suppressing the race condition. --Regards Omar =Comment: #16================================================= Supriya Kannery <skannery.com> - 2008-01-10 06:04 EDT Modifications to __remove_assoc_queue was done by Jan Kara for a patch related to fsync: http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-10/msg03431.html Hence I discussed with him to get his comments about the patch I prepared. He proposed another patch. Following is his response: There two possible races in handling of private_list in buffer cache. 1) When fsync_buffers_list() processes a private_list, it clears b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes, sees a buffer is on list so it calls __remove_assoc_queue() which complains about b_assoc_mapping being cleared (as it cannot propagate possible IO error). This race has been actually observed in the wild. 2) When fsync_buffers_list() processes a private_list, mark_buffer_dirty_inode() can be called on bh which is already on the private list of fsync_buffers_list(). As buffer is on some list (note that the check is performed without private_lock), it is not readded to the mapping's private_list and after fsync_buffers_list() finishes, we have a dirty buffer which should be on private_list but it isn't. This race has not been reported, probably because most (but not all) callers of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with fsync(). Fix these issues by rewriting fsync_buffers_list() to not move buffers to dedicated private list (which has an additional bonus of reducing code size) and scan original private_list instead (some care has to be taken to avoid racing with drop_buffers()). Also mark_buffer_dirty_inode() is modified to avoid race 2). Attaching the patch he has provided. =Comment: #17================================================= Supriya Kannery <skannery.com> - 2008-01-10 06:06 EDT Patch proposed by Jan =Comment: #18================================================= Supriya Kannery <skannery.com> - 2008-01-10 06:08 EDT Mohammed, Can you please verify this patch. Thanks & Rgds, Supriya =Comment: #19================================================= Mohammed Omar <mohd.omar.com> - 2008-01-11 01:59 EDT supriya, I tested the patch. Here the results.. Jan 11 00:10:33 js21 kernel: ------------[ cut here ]------------ Jan 11 00:10:33 js21 kernel: Badness at fs/buffer.c:561 Jan 11 00:10:33 js21 kernel: NIP: c000000000143b98 LR: c000000000143a78 CTR: c000000000143a24 Jan 11 00:10:33 js21 kernel: REGS: c000000262547940 TRAP: 0700 Not tainted (2.6.23.1-janpatch) Jan 11 00:10:33 js21 kernel: MSR: 8000000000029032 <EE,ME,IR,DR> CR: 24000428 XER: 20000000 Jan 11 00:10:33 js21 kernel: TASK = c000000263d271a0[2972] 'iozone' THREAD: c000000262544000 CPU: 0 Jan 11 00:10:33 js21 kernel: GPR00: 0000000000000001 c000000262547bc0 c00000000073a750 c000000002ec2250 Jan 11 00:10:33 js21 kernel: GPR04: c00000020ca9d0d0 0000000000000000 000000000000000e 0000000000000000 Jan 11 00:10:33 js21 kernel: GPR08: c0000001347301e8 c0000001347301e8 c0000001347301a0 c0000001347301e8 Jan 11 00:10:33 js21 kernel: GPR12: d000000000453288 c00000000065c900 0000000010050000 00000000ff85ec10 Jan 11 00:10:33 js21 kernel: GPR16: 0000000010050000 0000000000000000 0000000000000000 00000000f6e00000 Jan 11 00:10:33 js21 kernel: GPR20: 0000000000000003 0000000000000000 0000000000000010 0000000010050000 Jan 11 00:10:33 js21 kernel: GPR24: 0000000000000000 00000000ff85ec18 0000000000000000 0000000000000000 Jan 11 00:10:33 js21 kernel: GPR28: c000000224b69510 c000000002ec2250 c0000000006ec3f0 c000000224b69498 Jan 11 00:10:33 js21 kernel: NIP [c000000000143b98] .sync_mapping_buffers+0x174/0x37c Jan 11 00:10:33 js21 kernel: LR [c000000000143a78] .sync_mapping_buffers+0x54/0x37c Jan 11 00:10:33 js21 kernel: Call Trace: Jan 11 00:10:33 js21 kernel: [c000000262547bc0] [c000000000143a78] .sync_mapping_buffers+0x54/0x37c (unreliable) Jan 11 00:10:33 js21 kernel: [c000000262547c70] [d000000000448b18] .ext2_sync_file+0x28/0x84 [ext2] Jan 11 00:10:33 js21 kernel: [c000000262547d00] [c00000000013fb24] .do_fsync+0x90/0x10c Jan 11 00:10:33 js21 kernel: [c000000262547da0] [c00000000013fbd4] .__do_fsync+0x34/0x60 Jan 11 00:10:33 js21 kernel: [c000000262547e30] [c000000000008548] syscall_exit+0x0/0x40 Jan 11 00:10:33 js21 kernel: Instruction dump: Jan 11 00:10:33 js21 kernel: 40820068 e80a0000 7809ffe3 408201a8 e92b0008 f9090000 f9280008 f96b0008 Jan 11 00:10:33 js21 kernel: f96a0048 e80a0058 7c000074 7800d182 <0b000000> e80a0000 7808afe3 41820020 --Regards Omar M =Comment: #23================================================= Mohammed Omar <mohd.omar.com> - 2008-01-23 03:58 EDT Patch found at LKML Patch found at lkml as per supriya's comment.I will be testing this patch. --Regards Omar =Comment: #25================================================= Mohammed Omar <mohd.omar.com> - 2008-01-25 01:06 EDT Supriya, I have tested the above patch with F8 kernel. It works fine. --Regards Omar SUMMARY: Problem: System generated a "Call Trace and Badness at fs/buffer.c" while executing [bonnie++ dbench fs_inod fsstress fsx_linux IOZone postmark tiobench] against EXT2,EXT3,REISERFS,XFS file systems. Analysis: bh->b_assoc_map is not set in __remove_assoc_queue when called from drop_buffers(). Noticed that Jan kara had provided a patch to modify __remove_assoc_queue sometime back: http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-10/msg03431.html. Discussed this issue with Jan Solution: Jan Kara provided patch in LKML (second patch) at http://lkml.org/lkml/2008/1/14/262 which Omar successfully verified.
Created attachment 292919 [details] Second Patch from Jan in LKML
Created attachment 292920 [details] var-log-messages
------- Comment From mohd.omar.com 2008-02-18 08:50 EDT------- I could see the same "Call Traces" on F9Alpha Kernel.I think Redhat has not yet picked up the patch. --uname -a------------ Linux p520b.in.ibm.com 2.6.24-2.fc9 #1 SMP Fri Jan 25 12:55:54 EST 2008 ppc64 ppc64 ppc64 GNU/Linux -------log----------- ------------[ cut here ]------------ Badness at fs/buffer.c:583 NIP: c00000000014ee6c LR: c00000000014ef70 CTR: c000000000156160 REGS: c000000049e1ab80 TRAP: 0700 Not tainted (2.6.24-2.fc9) MSR: 8000000000029032 <EE,ME,IR,DR> CR: 24004444 XER: 00000000 TASK = c0000000349a3440[16529] 'tiotest' THREAD: c000000049e18000 CPU: 1 GPR00: 0000000000000001 c000000049e1ae00 c000000000756280 c000000040e99160 GPR04: c000000049e1aef0 0000000000200000 c000000040e99160 cf00000001399580 GPR08: c00000004809bc18 c0000000708add58 c000000040e99160 c000000040e991a8 GPR12: 0000000000000000 c00000000067a480 0000000000000000 c000000070f67158 GPR16: c000000070f672c8 d000000000613688 0000000000000000 00000000001200d2 GPR20: 0000000000000003 0000000000000020 0000000000000000 0000000000000000 GPR24: 000000000000000b c000000049e1b1a8 c000000049e1b3c0 0000000000020000 GPR28: c000000049e1aef0 cf00000001399580 c0000000007022c8 c000000070945518 NIP [c00000000014ee6c] .drop_buffers+0xcc/0x14c LR [c00000000014ef70] .try_to_free_buffers+0x84/0xec Call Trace: [c000000049e1ae00] [c0000000004569f4] ._spin_lock+0x10/0x24 (unreliable) [c000000049e1ae80] [c00000000014ef70] .try_to_free_buffers+0x84/0xec [c000000049e1af20] [c0000000000e978c] .try_to_release_page+0x7c/0x90 [c000000049e1afa0] [c0000000000f5e34] .shrink_page_list+0x660/0x7ec [c000000049e1b130] [c0000000000f60fc] .shrink_inactive_list+0x13c/0x3e4 [c000000049e1b2a0] [c0000000000f64c0] .shrink_zone+0x11c/0x174 [c000000049e1b350] [c0000000000f7694] .try_to_free_pages+0x1e0/0x344 [c000000049e1b450] [c0000000000efdb4] .__alloc_pages+0x270/0x434 [c000000049e1b540] [c0000000001109d4] .alloc_pages_current+0xd4/0xf8 [c000000049e1b5d0] [c0000000000e7450] .__page_cache_alloc+0xc0/0xdc [c000000049e1b650] [c0000000000e7620] .__grab_cache_page+0x44/0xa0 [c000000049e1b6f0] [d0000000005d4068] .reiserfs_write_begin+0xa0/0x2a8 [reiserfs] [c000000049e1b7d0] [c0000000000e8820] .generic_file_buffered_write+0x16c/0x6ec [c000000049e1b910] [c0000000000e9118] .__generic_file_aio_write_nolock+0x378/0x3ec [c000000049e1ba10] [c0000000000e920c] .generic_file_aio_write+0x80/0x114 [c000000049e1bad0] [c000000000122914] .do_sync_write+0xc4/0x124 [c000000049e1bc70] [d0000000005d5310] .reiserfs_file_write+0x5c/0x90 [reiserfs] [c000000049e1bcf0] [c0000000001231d0] .vfs_write+0xf8/0x1dc [c000000049e1bd90] [c000000000123c20] .sys_write+0x4c/0x8c [c000000049e1be30] [c000000000008748] syscall_exit+0x0/0x40 Instruction dump: 396a0048 e86a0008 7fa85800 419e0050 e92b0008 f9090000 f9280008 f96b0008 f96a0048 e80a0058 7c000074 7800d182 <0b000000> e80a0000 7809afe3 4182001c --Regards Omar
This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Didn't see this issue in fedora 10. We can close this bug.
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.