232107 – GFS2 panics if you try to rm -rf the lost+found directory

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Don Zickus
QA Contact:	Dean Jansa
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-03-13 21:26 UTC by Josef Bacik
Modified:	2007-11-30 22:07 UTC
CC List:	7 users
Fixed In Version:	RHBA-2007-0959
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 19:44:08 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Patch to detect and reject corrupt directories (3.15 KB, patch) 2007-03-14 11:26 UTC, Steve Whitehouse	no flags
First part of patch for RHEL 5.1 (3.85 KB, patch) 2007-05-10 09:19 UTC, Steve Whitehouse	no flags
Second part of patch (1.86 KB, patch) 2007-05-10 09:21 UTC, Steve Whitehouse	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0959	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5 Update 1	2007-11-08 00:47:37 UTC

Description Josef Bacik 2007-03-13 21:26:36 UTC

While trying to track down bz 231910 I hit this problem.  Basically, I hit the
withdraw, rebooted both nodes, ran gfs2_fsck on the filesystem, mounted the
FS again, rm -rf'ed the directory I had been removing, then rm -rf'ed the
lost+found directory, and got this oops:

 BUG: unable to handle kernel paging request at virtual address 080c4b10
 printing eip:
f8c34110
*pde = 32f60067
Oops: 0000 [#1]
SMP 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm 
configfs sunrpc sg iptable_filter ip_tables ip6t_REJECT xt_tcpudp 
ip6table_filter ip6_tables x_tables ipv6 dm_multipath video sbs i2c_ec button 
battery asus_acpi ac parport_pc lp parport floppy i2c_piix4 i2c_core cfi_probe 
gen_probe pcspkr scb2_flash mtdcore chipreg tg3 serio_raw ide_cd cdrom 
dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc sd_mod scsi_mod 
ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<f8c34110>]    Not tainted VLI
EFLAGS: 00010296   (2.6.21-rc1 #16)
EIP is at compare_dents+0x9/0x58 [gfs2]
eax: f8bc00c0   ebx: 080c4b00   ecx: e3fdb79f   edx: f8bc00c4
esi: da0acee8   edi: 000000bc   ebp: 0000005c   esp: e9692e20
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process rm (pid: 3177, ti=e9692000 task=f0571560 task.ti=e9692000)
Stack: 000000bc 00000004 c04e199c f8bc0004 000ded88 0000005c 000000c4 f0571560 
       d9fbc488 d9fbc488 da04cf60 c9341580 00000000 00000000 00000000 e9692f5c 
       f8c33f2f f8c34107 c04e191c 00000000 da0aceb8 da0acee8 e9692f94 f8c329a8 
Call Trace:
 [<c04e199c>] sort+0x5d/0x14d
 [<f8c33f2f>] do_filldir_main+0x2e/0x1e4 [gfs2]
 [<f8c34107>] compare_dents+0x0/0x58 [gfs2]
 [<c04e191c>] u32_swap+0x0/0xb
 [<f8c329a8>] gfs2_dirent_scan+0xa2/0x15f [gfs2]
 [<f8c344b3>] gfs2_dir_read+0x2fe/0x4d1 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<f8c42ee9>] gfs2_readdir+0x6f/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<f8c42ebd>] gfs2_readdir+0x43/0x90 [gfs2]
 [<c0478cc5>] vfs_readdir+0x63/0x8d
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478d52>] sys_getdents64+0x63/0xa5
 [<c0404e4c>] syscall_call+0x7/0xb
 [<c0610000>] rt_mutex_slowunlock+0x89/0x197
 =======================
Code: 53 0f b7 48 14 89 cb c1 eb 08 c1 e1 08 09 d9 0f b7 c9 01 c8 2b 42 08 39 
42 04 5b 0f 94 c0 0f b6 c0 c3 56 53 8b 30 8b 1a 8b 4e 10 <8b> 43 10 0f c9 0f 
c8 39 c1 77 37 72 3c 0f b7 46 16 89 c2 c1 ea 
EIP: [<f8c34110>] compare_dents+0x9/0x58 [gfs2] SS:ESP 0068:e9692e20

Comment 1 Steve Whitehouse 2007-03-14 08:40:58 UTC

It looks like the problem is related to the way we count the number of entries
in a directory during readdir. We are making the assumption that
ip->i_di.di_entries is correct which means that:
 1. In the case that there are more entries than that, we may potentially run
off the end of our buffer which is passed to sort().
 2. In the case that there are fewer entries we might pass uninitialised
pointers to sort() (the array of pointers is kmalloced), which appears to be
what has happened here.

I suspect that there is a bug in fsck's routine for creating the lost+found
directory, but either way GFS2 shouldn't panic like this. The fix for case 2 is
trivial; fixing case 1 is not so easy.
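
The two cases above can be illustrated with a small userspace sketch. This is
not the GFS2 code: struct dent, compare_dent_ptrs() and gather_and_sort() are
hypothetical names, and qsort() stands in for the kernel's sort(). It assumes
the pointer array is sized from a claimed count (playing the role of
di_entries) while the block really holds a different number of entries. The
fill loop is bounded by the array size (case 1), and only the slots that were
actually filled are handed to the sort (case 2).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a directory entry; not the GFS2 on-disk layout. */
struct dent {
    uint32_t hash;
    const char *name;
};

/* Compare two slots of an array of pointers to entries, ordering by hash. */
static int compare_dent_ptrs(const void *a, const void *b)
{
    const struct dent *da = *(const struct dent *const *)a;
    const struct dent *db = *(const struct dent *const *)b;

    if (da->hash < db->hash)
        return -1;
    if (da->hash > db->hash)
        return 1;
    return 0;
}

/*
 * Gather pointers to the entries in `block` into an array sized from
 * `claimed`, then sort them. `actual` is how many entries the block really
 * holds. Returns the number of sorted slots, or -1 if allocation fails.
 * On success *out receives the array, which the caller must free().
 */
static long gather_and_sort(const struct dent *block, size_t actual,
                            size_t claimed, const struct dent ***out)
{
    const struct dent **ptrs;
    size_t found = 0;
    size_t i;

    assert(out != NULL);

    /* Like a kmalloc'ed array, this memory starts out uninitialised. */
    ptrs = malloc((claimed ? claimed : 1) * sizeof(*ptrs));
    if (ptrs == NULL)
        return -1;

    /* Case 1: never store more pointers than the array has room for. */
    for (i = 0; i < actual && found < claimed; i++)
        ptrs[found++] = &block[i];

    /* Case 2: sort only the filled slots, never the uninitialised tail. */
    qsort(ptrs, found, sizeof(*ptrs), compare_dent_ptrs);

    *out = ptrs;
    return (long)found;
}
```

Passing `claimed` instead of `found` to qsort() in this sketch would make the
comparison function dereference uninitialised pointers, which is the kind of
fault the compare_dents frame in the oops suggests.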

Comment 2 Steve Whitehouse 2007-03-14 09:11:43 UTC

Looking through the source for fsck, the most likely cause, and the one I have
not yet been able to eliminate, is that the "updated" variable is not being set
correctly somewhere. Otherwise it looks as if it's doing all the right
things.

Comment 3 Steve Whitehouse 2007-03-14 11:26:43 UTC

Created attachment 150028 [details]
Patch to detect and reject corrupt directories

This patch will detect and reject directories where the number of entries in
the directory (in the inode for stuffed directories, or the leaf block for
exhash) doesn't match the actual number of entries on that disk block. If this
should happen a warning message is printed pointing out the location of the
error.

If you still have the (corrupt) filesystem available, please test it with this
patch to see if this will do the "right thing", i.e. return -EIO to readdir
rather than crashing as it does currently.
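
The check described in this comment might look roughly like the following
userspace sketch. It is not the attached patch: struct dir_block, its fields,
count_live_entries() and check_dir_block() are all hypothetical, and an inode
number of 0 is assumed to mark an unused slot. The idea shown is only the
general one: count what is really in the block, compare it with the count the
header claims, and on a mismatch print a warning naming the block and return
-EIO instead of continuing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS_PER_BLOCK 8

/* Hypothetical directory slot; an inode number of 0 marks it unused. */
struct dir_slot {
    uint64_t inum;
};

/* Hypothetical directory block: a claimed entry count plus the slots. */
struct dir_block {
    uint64_t blkno;            /* block number, used only in the warning */
    uint16_t claimed_entries;  /* count recorded in the header */
    struct dir_slot slots[SLOTS_PER_BLOCK];
};

/* Count the slots that actually hold an entry. */
static unsigned int count_live_entries(const struct dir_block *b)
{
    unsigned int found = 0;
    size_t i;

    for (i = 0; i < SLOTS_PER_BLOCK; i++)
        if (b->slots[i].inum != 0)
            found++;
    return found;
}

/* Return 0 if the header count matches the block contents, -EIO if not. */
static int check_dir_block(const struct dir_block *b)
{
    unsigned int found;

    assert(b != NULL);

    found = count_live_entries(b);
    if (found != b->claimed_entries) {
        fprintf(stderr,
                "dir block %llu: header claims %u entries, found %u\n",
                (unsigned long long)b->blkno,
                (unsigned int)b->claimed_entries, found);
        return -EIO;
    }
    return 0;
}
```

A readdir path built on such a check would propagate the -EIO to the caller,
so rm would see an I/O error on the corrupt directory rather than the kernel
oopsing.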

Comment 4 Steve Whitehouse 2007-03-23 10:53:29 UTC

Josef, are you in a position to test this patch or does the offending fs image
no longer exist? It seems to work ok for me when I tested the patch on correct
filesystems, so I think it should be ok to apply, but it would be nice to know
if it fixes the problem that you reported first.

Comment 5 Josef Bacik 2007-03-23 13:39:40 UTC

Yes, I saved the bad image so Bob could poke around on it.  I will test the patch
sometime next week; my queue is overflowing at the moment.

Comment 7 RHEL Program Management 2007-04-13 16:23:59 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Steve Whitehouse 2007-04-20 10:51:09 UTC

Rob, Kevin, can you set the rest of the flags on this bug from ? to +. Thanks,
Steve.

Comment 9 Steve Whitehouse 2007-05-10 09:19:49 UTC

Created attachment 154453 [details]
First part of patch for RHEL 5.1

This is the first part of the patch for RHEL 5.1

Comment 10 Steve Whitehouse 2007-05-10 09:21:20 UTC

Created attachment 154454 [details]
Second part of patch

Here is part two of the patch.

Comment 11 Don Zickus 2007-05-11 22:08:00 UTC

Included in 2.6.18-19.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 errata-xmlrpc 2007-11-07 19:44:08 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
