In trying to track down bz 231910 I hit this problem. Basically I hit the withdraw, rebooted both nodes, ran a gfs2_fsck on the filesystem, mounted the FS back up, rm -rf'ed the directory I was removing, and then rm -rf'ed the lost+found directory, and then I got this oops:

BUG: unable to handle kernel paging request at virtual address 080c4b10
 printing eip:
f8c34110
*pde = 32f60067
Oops: 0000 [#1]
SMP
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc sg iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_multipath video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport floppy i2c_piix4 i2c_core cfi_probe gen_probe pcspkr scb2_flash mtdcore chipreg tg3 serio_raw ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0060:[<f8c34110>] Not tainted VLI
EFLAGS: 00010296 (2.6.21-rc1 #16)
EIP is at compare_dents+0x9/0x58 [gfs2]
eax: f8bc00c0 ebx: 080c4b00 ecx: e3fdb79f edx: f8bc00c4
esi: da0acee8 edi: 000000bc ebp: 0000005c esp: e9692e20
ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
Process rm (pid: 3177, ti=e9692000 task=f0571560 task.ti=e9692000)
Stack: 000000bc 00000004 c04e199c f8bc0004 000ded88 0000005c 000000c4 f0571560
       d9fbc488 d9fbc488 da04cf60 c9341580 00000000 00000000 00000000 e9692f5c
       f8c33f2f f8c34107 c04e191c 00000000 da0aceb8 da0acee8 e9692f94 f8c329a8
Call Trace:
 [<c04e199c>] sort+0x5d/0x14d
 [<f8c33f2f>] do_filldir_main+0x2e/0x1e4 [gfs2]
 [<f8c34107>] compare_dents+0x0/0x58 [gfs2]
 [<c04e191c>] u32_swap+0x0/0xb
 [<f8c329a8>] gfs2_dirent_scan+0xa2/0x15f [gfs2]
 [<f8c344b3>] gfs2_dir_read+0x2fe/0x4d1 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<f8c42ee9>] gfs2_readdir+0x6f/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<f8c42ebd>] gfs2_readdir+0x43/0x90 [gfs2]
 [<c0478cc5>] vfs_readdir+0x63/0x8d
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478d52>] sys_getdents64+0x63/0xa5
 [<c0404e4c>] syscall_call+0x7/0xb
 [<c0610000>] rt_mutex_slowunlock+0x89/0x197
 =======================
Code: 53 0f b7 48 14 89 cb c1 eb 08 c1 e1 08 09 d9 0f b7 c9 01 c8 2b 42 08 39 42 04 5b 0f 94 c0 0f b6 c0 c3 56 53 8b 30 8b 1a 8b 4e 10 <8b> 43 10 0f c9 0f c8 39 c1 77 37 72 3c 0f b7 46 16 89 c2 c1 ea
EIP: [<f8c34110>] compare_dents+0x9/0x58 [gfs2] SS:ESP 0068:e9692e20
It looks like the problem is related to the way we count the number of entries in a directory during readdir. We are assuming that ip->i_di.di_entries is correct, which means that:

1. If there are more entries than that, we may run off the end of the buffer which is passed to sort().
2. If there are fewer entries, we might pass uninitialised pointers to sort() (the array of pointers is kmalloced), which appears to be what has happened here.

I suspect that there is a bug in fsck's routine for creating the lost+found directory, but either way GFS2 shouldn't panic like this. The fix for case 2 is trivial; fixing case 1 is not so easy.
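To make the two cases concrete, here is a minimal userspace sketch. It is not the GFS2 code: struct dent, compare_dents_sketch() and collect_and_sort() are hypothetical names, and qsort() stands in for the kernel's sort(). It assumes the pointer array was sized from the recorded count, refuses to write past it (case 1), and refuses to sort slots that were never filled in (case 2).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a directory entry; only the hash matters here. */
struct dent {
    uint32_t hash;
};

/* Order entries by hash; the array being sorted holds pointers to entries. */
static int compare_dents_sketch(const void *a, const void *b)
{
    const struct dent *da = *(const struct dent *const *)a;
    const struct dent *db = *(const struct dent *const *)b;

    if (da->hash < db->hash)
        return -1;
    if (da->hash > db->hash)
        return 1;
    return 0;
}

/*
 * `out` has room for `recorded` pointers (the untrusted on-disk count).
 * `found`/`nfound` are the entries actually present.
 * Returns the number of entries sorted, or -EIO if the counts disagree.
 */
static int collect_and_sort(struct dent *found, size_t nfound,
                            size_t recorded, const struct dent **out)
{
    size_t n = 0;

    for (size_t i = 0; i < nfound; i++) {
        if (n >= recorded)   /* case 1: would run off the end of the array */
            return -EIO;
        out[n++] = &found[i];
    }
    if (n != recorded)       /* case 2: remaining slots are uninitialised */
        return -EIO;

    qsort(out, n, sizeof(*out), compare_dents_sketch);
    return (int)n;
}
```

With three entries found, a recorded count of 3 sorts them, while a recorded count of 2 or 4 returns -EIO instead of touching memory it should not.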
Looking through the source for fsck, the most likely cause, and one I have not yet been able to rule out, is that the "updated" variable is not being set correctly somewhere. Otherwise it looks as if it's doing all the right things.
Created attachment 150028
Patch to detect and reject corrupt directories

This patch will detect and reject directories where the number of entries in the directory (in the inode for stuffed directories, or the leaf block for exhash) doesn't match the actual number of entries on that disk block. If this should happen, a warning message is printed pointing out the location of the error. If you still have the (corrupt) filesystem available, please test it with this patch to see if this will do the "right thing", i.e. return -EIO to readdir rather than crashing as it does currently.
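For illustration only, the kind of check described could look roughly like the userspace sketch below. This is not the attached patch: struct rec_hdr is a simplified, hypothetical record layout (host-endian, not the real GFS2 on-disk dirent), and check_block_entries() is a made-up name. It walks every record in a block, counts the live ones, and returns -EIO if that differs from the recorded count or if a record length is malformed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified record header; NOT the real GFS2 on-disk layout. */
struct rec_hdr {
    uint32_t inum;      /* 0 marks an unused slot */
    uint16_t rec_len;   /* total record length in bytes, header included */
    uint16_t name_len;  /* unused by this sketch */
};

/*
 * Count the live records in `block` and compare against `recorded`.
 * Returns 0 when they match, -EIO on a mismatch or a malformed record.
 */
static int check_block_entries(const unsigned char *block, size_t size,
                               uint32_t recorded)
{
    size_t off = 0;
    uint32_t live = 0;

    while (off < size) {
        struct rec_hdr h;

        if (size - off < sizeof(h))
            return -EIO;                     /* truncated header */
        memcpy(&h, block + off, sizeof(h));  /* avoid unaligned access */
        if (h.rec_len < sizeof(h) || h.rec_len > size - off)
            return -EIO;                     /* length would loop or overrun */
        if (h.inum != 0)
            live++;
        off += h.rec_len;
    }
    return live == recorded ? 0 : -EIO;
}
```

The caller would then pass the error back up to readdir rather than going on to build and sort the entry array.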
Josef, are you in a position to test this patch or does the offending fs image no longer exist? It seems to work ok for me when I tested the patch on correct filesystems, so I think it should be ok to apply, but it would be nice to know if it fixes the problem that you reported first.
Yes, I saved the bad image so Bob could poke around on it. I will test the patch sometime next week; my queue is overflowing atm.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Rob, Kevin, can you set the rest of the flags on this bug from ? to +? Thanks, Steve.
Created attachment 154453
First part of patch for RHEL 5.1

This is the first part of the patch for RHEL 5.1.
Created attachment 154454
Second part of patch

Here is part two of the patch.
In 2.6.18-19.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html