Bug 232107 - GFS2 panics if you try to rm -rf the lost+found directory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Don Zickus
QA Contact: Dean Jansa
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-03-13 21:26 UTC by Josef Bacik
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 19:44:08 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch to detect and reject corrupt directories (3.15 KB, patch)
2007-03-14 11:26 UTC, Steve Whitehouse
First part of patch for RHEL 5.1 (3.85 KB, patch)
2007-05-10 09:19 UTC, Steve Whitehouse
Second part of patch (1.86 KB, patch)
2007-05-10 09:21 UTC, Steve Whitehouse


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2007:0959 | 0 | normal | SHIPPED_LIVE | Updated kernel packages for Red Hat Enterprise Linux 5 Update 1 | 2007-11-08 00:47:37 UTC

Description Josef Bacik 2007-03-13 21:26:36 UTC
While trying to track down bz 231910 I hit this problem. Basically I hit the
withdraw, rebooted both nodes, ran gfs2_fsck on the filesystem, mounted the
FS back up, rm -rf'ed the directory I had been removing, and then rm -rf'ed the
lost+found directory, at which point I got this oops:

 BUG: unable to handle kernel paging request at virtual address 080c4b10
 printing eip:
f8c34110
*pde = 32f60067
Oops: 0000 [#1]
SMP 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm 
configfs sunrpc sg iptable_filter ip_tables ip6t_REJECT xt_tcpudp 
ip6table_filter ip6_tables x_tables ipv6 dm_multipath video sbs i2c_ec button 
battery asus_acpi ac parport_pc lp parport floppy i2c_piix4 i2c_core cfi_probe 
gen_probe pcspkr scb2_flash mtdcore chipreg tg3 serio_raw ide_cd cdrom 
dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc sd_mod scsi_mod 
ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<f8c34110>]    Not tainted VLI
EFLAGS: 00010296   (2.6.21-rc1 #16)
EIP is at compare_dents+0x9/0x58 [gfs2]
eax: f8bc00c0   ebx: 080c4b00   ecx: e3fdb79f   edx: f8bc00c4
esi: da0acee8   edi: 000000bc   ebp: 0000005c   esp: e9692e20
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process rm (pid: 3177, ti=e9692000 task=f0571560 task.ti=e9692000)
Stack: 000000bc 00000004 c04e199c f8bc0004 000ded88 0000005c 000000c4 f0571560 
       d9fbc488 d9fbc488 da04cf60 c9341580 00000000 00000000 00000000 e9692f5c 
       f8c33f2f f8c34107 c04e191c 00000000 da0aceb8 da0acee8 e9692f94 f8c329a8 
Call Trace:
 [<c04e199c>] sort+0x5d/0x14d
 [<f8c33f2f>] do_filldir_main+0x2e/0x1e4 [gfs2]
 [<f8c34107>] compare_dents+0x0/0x58 [gfs2]
 [<c04e191c>] u32_swap+0x0/0xb
 [<f8c329a8>] gfs2_dirent_scan+0xa2/0x15f [gfs2]
 [<f8c344b3>] gfs2_dir_read+0x2fe/0x4d1 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<f8c42ee9>] gfs2_readdir+0x6f/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<f8c42ebd>] gfs2_readdir+0x43/0x90 [gfs2]
 [<c0478cc5>] vfs_readdir+0x63/0x8d
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478d52>] sys_getdents64+0x63/0xa5
 [<c0404e4c>] syscall_call+0x7/0xb
 [<c0610000>] rt_mutex_slowunlock+0x89/0x197
 =======================
Code: 53 0f b7 48 14 89 cb c1 eb 08 c1 e1 08 09 d9 0f b7 c9 01 c8 2b 42 08 39 
42 04 5b 0f 94 c0 0f b6 c0 c3 56 53 8b 30 8b 1a 8b 4e 10 <8b> 43 10 0f c9 0f 
c8 39 c1 77 37 72 3c 0f b7 46 16 89 c2 c1 ea 
EIP: [<f8c34110>] compare_dents+0x9/0x58 [gfs2] SS:ESP 0068:e9692e20

Comment 1 Steve Whitehouse 2007-03-14 08:40:58 UTC
It looks like the problem is related to the way we count the number of entries
in a directory during readdir. We are assuming that ip->i_di.di_entries is
correct, which means that:
 1. If there are more entries than that, we may run off the end of the buffer
which is passed to sort().
 2. If there are fewer entries, we may pass uninitialised pointers to sort()
(the array of pointers is kmalloced), which appears to be what has happened
here.

I suspect that there is a bug in fsck's routine for creating the lost+found
directory, but either way GFS2 shouldn't panic like this. The fix for case 2 is
trivial; fixing case 1 is not so easy.
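
The two cases above can be illustrated with a small userspace sketch. This is
not the GFS2 code: struct dent, collect_and_sort() and the flat array standing
in for a directory block are all hypothetical, and qsort()/calloc() stand in
for the kernel's sort()/kmalloc(). It only shows the idea of sorting the
number of pointers actually filled in, rather than the number the stored
count claims, and of refusing to continue when more entries turn up than were
allocated for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a directory entry; not the GFS2 on-disk layout. */
struct dent {
    uint32_t hash;
};

/* qsort() callback. It dereferences both pointers, so every slot handed to
 * qsort() must point at a real entry. */
static int compare_dents(const void *a, const void *b)
{
    const struct dent *da = *(const struct dent *const *)a;
    const struct dent *db = *(const struct dent *const *)b;

    if (da->hash < db->hash)
        return -1;
    if (da->hash > db->hash)
        return 1;
    return 0;
}

/*
 * 'claimed' plays the role of the stored entry count; 'found' is how many
 * entries the block really holds. On success, stores a sorted pointer array
 * in *out (caller frees it) and returns the number of valid pointers.
 * Returns -1 on allocation failure or when found > claimed.
 */
static long collect_and_sort(struct dent *block, size_t found, size_t claimed,
                             const struct dent ***out)
{
    const struct dent **ptrs;
    size_t i;

    assert(block != NULL || found == 0);
    *out = NULL;

    if (found > claimed)
        return -1;      /* case 1: filling the array would overrun it */

    /* Zeroed allocation, so unused slots are NULL rather than garbage. */
    ptrs = calloc(claimed ? claimed : 1, sizeof(*ptrs));
    if (!ptrs)
        return -1;

    for (i = 0; i < found; i++)
        ptrs[i] = &block[i];

    /* case 2: sort only the slots that were filled, not 'claimed' slots. */
    qsort(ptrs, found, sizeof(*ptrs), compare_dents);

    *out = ptrs;
    return (long)found;
}
```

With a block of three entries and a claimed count of five, the sketch sorts
three pointers and never touches the two unused slots; with a claimed count
of two it returns -1 instead of writing past the array.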



Comment 2 Steve Whitehouse 2007-03-14 09:11:43 UTC
Looking through the source for fsck, the most likely remaining cause, which I
have not yet been able to rule out, is that the "updated" variable is not being
set correctly somewhere. Otherwise it looks as if it's doing all the right
things.



Comment 3 Steve Whitehouse 2007-03-14 11:26:43 UTC
Created attachment 150028
Patch to detect and reject corrupt directories

This patch will detect and reject directories where the number of entries in
the directory (in the inode for stuffed directories, or the leaf block for
exhash) doesn't match the actual number of entries on that disk block. If this
should happen a warning message is printed pointing out the location of the
error.

If you still have the (corrupt) filesystem available, please test it with this
patch to see if this will do the "right thing", i.e. return -EIO to readdir
rather than crashing as it does currently.
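
As a rough illustration of the check described in this comment (not the
attached patch itself), the sketch below counts the entries actually present
in a block and compares that with the stored count, printing a warning and
returning -EIO on a mismatch. The struct toy_dirent layout, the "inum == 0
means empty slot" convention, the function name check_dir_block() and the
message format are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical simplified entry: inum == 0 marks an unused slot.
 * This is not the real GFS2 on-disk format. */
struct toy_dirent {
    uint64_t inum;
};

/*
 * Compare the entry count stored for a block (in the inode for a stuffed
 * directory, or in the leaf block otherwise) with the number of live
 * entries found on that block. Returns 0 if they agree, -EIO if not.
 */
static int check_dir_block(const struct toy_dirent *slots, size_t nslots,
                           uint32_t stored_count, uint64_t block_no)
{
    uint32_t live = 0;
    size_t i;

    assert(slots != NULL || nslots == 0);

    for (i = 0; i < nslots; i++)
        if (slots[i].inum != 0)
            live++;

    if (live != stored_count) {
        /* Point at the location of the inconsistency, then reject. */
        fprintf(stderr,
                "toy-gfs2: block %llu: stored count %u, found %u entries\n",
                (unsigned long long)block_no,
                (unsigned)stored_count, (unsigned)live);
        return -EIO;
    }
    return 0;
}
```

A readdir-style caller would run this check before building and sorting its
pointer array, and pass the -EIO straight back to the caller instead of
continuing with a count it cannot trust.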

Comment 4 Steve Whitehouse 2007-03-23 10:53:29 UTC
Josef, are you in a position to test this patch or does the offending fs image
no longer exist? It seems to work ok for me when I tested the patch on correct
filesystems, so I think it should be ok to apply, but it would be nice to know
if it fixes the problem that you reported first.


Comment 5 Josef Bacik 2007-03-23 13:39:40 UTC
Yes, I saved the bad image so Bob could poke around on it. I will test the patch
sometime next week; my queue is overflowing atm.

Comment 7 RHEL Program Management 2007-04-13 16:23:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Steve Whitehouse 2007-04-20 10:51:09 UTC
Rob, Kevin, can you set the rest of the flags on this bug from ? to +. Thanks,
Steve.

Comment 9 Steve Whitehouse 2007-05-10 09:19:49 UTC
Created attachment 154453
First part of patch for RHEL 5.1

This is the first part of the patch for RHEL 5.1

Comment 10 Steve Whitehouse 2007-05-10 09:21:20 UTC
Created attachment 154454
Second part of patch

Here is part two of the patch.

Comment 11 Don Zickus 2007-05-11 22:08:00 UTC
Included in 2.6.18-19.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 errata-xmlrpc 2007-11-07 19:44:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


