286501 – ext3 fs corruption noted after fresh install of RHEL5.1-Server-20070906.0

Summary: ext3 fs corruption noted after fresh install of RHEL5.1-Server-20070906.0

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	ppc64
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Eric Sandeen
QA Contact:	Martin Jenner
Docs Contact:
URL:	http://rhts.lab.boston.redhat.com/cgi...
Whiteboard:
Duplicates (2):	247628 289711
Depends On:
Blocks:	311301

Reported:	2007-09-11 17:59 UTC by Mike Gahagan
Modified:	2007-11-30 22:07 UTC
CC List:	5 users
Fixed In Version:	RHBA-2007-0959
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 20:03:52 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0959	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5 Update 1	2007-11-08 00:47:37 UTC

Description Mike Gahagan 2007-09-11 17:59:21 UTC

Description of problem:

Found in rhts log from failed job (jobid 6782)

INIT: version 2.86 booting
		Welcome to Red Hat Enterprise Linux Server
		Press 'I' to enter interactive startup.
Setting clock  (utc): Tue Sep 11 10:41:25 EDT 2007 [  OK  ]
Starting udev: [  OK  ]
Setting hostname ibm-js20-04.lab.boston.redhat.com:  [  OK  ]
Setting up Logical Volume Management:   2 logical volume(s) in volume group
"VolGroup00" now active
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00 
/dev/VolGroup00/LogVol00: clean, 61045/9240576 files, 760688/9240576 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/hda2 
/boot: clean, 20/26104 files, 20547/104384 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling local filesystem quotas:  [  OK  ]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory
#4554864: directory entry across blocks - offset=0, inode=4608, rec_len=30720,
name_len=0
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb:
<4>__journal_remove_journal_head: freeing b_committed_data
Remounting filesystem read-only
rm: cannot remove `/var/run/utmp': Read-only file system
/etc/rc.d/rc.sysinit: line 844: /var/run/utmp: Read-only file system
touch: cannot touch `/var/log/wtmp': Read-only file system
chgrp: changing group of `/var/run/utmp': Read-only file system
chgrp: changing group of `/var/log/wtmp': Read-only file system

Version-Release number of selected component (if applicable):



How reproducible:

Has only happened one time so far.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Mike Gahagan 2007-09-11 18:09:56 UTC

requesting blocker due to fs corruption.

Comment 2 Eric Sandeen 2007-09-11 18:39:18 UTC

Can you make an image of this (corrupted) filesystem?

Comment 3 Eric Sandeen 2007-09-11 18:42:16 UTC

Hm, and are logs from the previous boot(s) and/or installs available?

Comment 5 Eric Sandeen 2007-09-11 19:00:29 UTC

Ok, if this was a one-time problem, and the corrupted fs image is no longer
available, and kernel messages from install-time aren't available... I don't see
how we can make any progress on this one, I'm afraid.

The corruption in question is that a directory entry claims to be larger than a
block size, i.e. 30720 bytes.  Which happens to be a nice even 0x7800 hex... but
past that, I'm fresh out of clues.  Saving the corrupted fs for examination
would probably be most helpful in these cases, if there is any way to do
that... then we could look for any other corruption, and see if there are more clues.

All we have to go on is one single bad value on the disk, which could just as
easily be attributed to a memory or hard disk error... or, a filesystem bug. 
But there's just not enough to go on.

If this crops up again, though, more datapoints will be helpful.
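
For reference, the "directory entry across blocks" message in the boot log comes from a sanity check on each directory entry. The sketch below paraphrases those checks from memory of the ext3 code of that era; it is not copied from the kernel source. The 4096-byte block size is an assumption (the report does not state it), and the inode total is the one printed on the fsck line in the boot log.

```python
# Rough sketch of the ext3 directory-entry sanity checks (paraphrased,
# not the kernel's actual code).

BLOCK_SIZE = 4096        # assumption: block size is not stated in the report
INODES_COUNT = 9240576   # inode total as printed by fsck in the boot log


def dir_rec_len(name_len):
    """Smallest record for a name: 8-byte header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(offset, inode, rec_len, name_len):
    """Return an error string for a bad entry, or None if it looks sane."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > BLOCK_SIZE:
        return "directory entry across blocks"
    if inode > INODES_COUNT:
        return "inode out of bounds"
    return None


# The entry reported in the boot log.
print(check_dir_entry(offset=0, inode=4608, rec_len=30720, name_len=0))
```

Under these assumptions the entry from the log fails only the block-boundary test: a 30720-byte record cannot fit in a 4096-byte block.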

Comment 6 Eric Sandeen 2007-09-12 16:19:49 UTC

247628 looks related, and makes me wonder if we have an endian problem...

-eric

Comment 7 Eric Sandeen 2007-09-12 23:58:09 UTC

There was a reproducer posted to the ext4 list a while back, which passed w/ no
response from anyone :(

(it was slightly different results, but hopefully same underlying cause)

Working on it now...

Comment 8 Eric Sandeen 2007-09-13 22:46:07 UTC

The more I look at the root cause from the reproducer, the less I feel like it
is accurately reproducing the original report, I'm afraid.  When I get 100% to
the bottom of it, I'll see if I can bridge that conceptual gap...

But in the meantime, if this crops up again, if there's any way to get an image
of the fs in question, that'd be great.

-Eric

Comment 10 Eric Sandeen 2007-09-14 18:14:49 UTC

The corruptions in this and the other bug are with records like this:

offset=0, inode=0, 		rec_len=0, 	name_len=0
offset=0, inode=2164326400, 	rec_len=0, 	name_len=5
offset=0, inode=5376, 		rec_len=2, 	name_len=0
offset=0, inode=570556416, 	rec_len=28161, 	name_len=111
offset=0, inode=4608, 		rec_len=30720, 	name_len=0

all at offset 0 in the directory, and the inode numbers are "interesting":

0x81010000, 0x1500, 0x22020000, 0x1200

pretty round numbers, there.  endian problems?  Did we get to a block that
doesn't actually contain dir entries?  Hmmm
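
A quick way to look at the endian question is to print each value in hex next to its byte-swapped form. This is only a sketch for eyeballing the numbers listed above: the helper names `swap32` and `swap16` are made up for illustration, and the output by itself does not settle whether byte order is involved.

```python
import struct

# (inode, rec_len) pairs copied from the records listed above.
records = [
    (0, 0),
    (2164326400, 0),
    (5376, 2),
    (570556416, 28161),
    (4608, 30720),
]


def swap32(value):
    """Reverse the byte order of a 32-bit unsigned value."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def swap16(value):
    """Reverse the byte order of a 16-bit unsigned value."""
    return struct.unpack("<H", struct.pack(">H", value))[0]


for inode, rec_len in records:
    print(
        f"inode {inode:#010x} swapped {swap32(inode):#010x} | "
        f"rec_len {rec_len:#06x} swapped {swap16(rec_len):#06x}"
    )
```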

Comment 11 Eric Sandeen 2007-09-14 18:51:02 UTC

Also, for what it's worth, this does not look like a regression in RHEL5.  I was
able to hit it on x86 on Kernel 2.6.18-2.el5

-Eric

Comment 12 Eric Sandeen 2007-09-14 18:51:48 UTC

(re: comment #11, hit it with the QE reproducer, that is - and I'm not yet
convinced that the reproducer is hitting the same root cause as the original report)

Comment 13 Eric Sandeen 2007-09-15 06:32:34 UTC

Ok, I have some code running now that survives the reproducer that was reported
on the ext4 list.  Need to clean it up & will send it upstream....

Comment 14 Eric Sandeen 2007-09-16 04:13:11 UTC

Sent a patch to linux-ext4 today for comment.

Comment 15 Eric Sandeen 2007-09-17 17:15:18 UTC

For now, I'm willing to chalk up the original error to the problem demonstrated
in the reproducer.  Due to the miscalculation, the memcpy of the new name will
overwrite the buffer & corrupt memory.  After that, all bets are off... let's
get the fix in and keep an eye out for any recurrence.

Thanks,

-Eric
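
This report does not spell out what was miscalculated. Purely as an illustration, the sketch below assumes the problem is of this kind: a full directory block is split by moving half of the entries by count, without checking how many bytes that actually frees, so a large new entry may not fit in the block it hashes to. The block size, the entry sizes, and all function names are hypothetical.

```python
BLOCK_SIZE = 4096  # assumption: block size is not stated in the report


def dir_rec_len(name_len):
    """Smallest record for a name: 8-byte header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def split_by_count(sizes):
    """Keep the first half of the entries, move the second half (by count)."""
    half = len(sizes) // 2
    return sizes[:half], sizes[half:]


def split_by_size(sizes):
    """Move trailing entries until about half of the used bytes have moved."""
    target = sum(sizes) // 2
    moved, cut = 0, len(sizes)
    while cut > 0 and moved + sizes[cut - 1] <= target:
        cut -= 1
        moved += sizes[cut]
    return sizes[:cut], sizes[cut:]


def fits(block, new_len):
    """True if a record of new_len bytes still fits in the block."""
    return BLOCK_SIZE - sum(block) >= new_len


# Hypothetical block, already sorted by hash: 16 long names, then 16 short ones.
entries = [dir_rec_len(232)] * 16 + [dir_rec_len(1)] * 16
new_entry = dir_rec_len(255)

kept, moved = split_by_count(entries)
print("by count:", fits(kept, new_entry), fits(moved, new_entry))

kept, moved = split_by_size(entries)
print("by size: ", fits(kept, new_entry), fits(moved, new_entry))
```

With these made-up sizes, the count-based split leaves the original block with 256 free bytes, which is too small for a 264-byte entry, while the size-based split leaves room in both blocks. Copying the new name into the original block without re-checking the space would then run past the end of the buffer.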

Comment 18 Eric Sandeen 2007-09-17 19:35:09 UTC

*** Bug 247628 has been marked as a duplicate of this bug. ***

Comment 19 Eric Sandeen 2007-09-17 19:48:02 UTC

*** Bug 289711 has been marked as a duplicate of this bug. ***

Comment 23 Eric Sandeen 2007-09-18 17:37:36 UTC

Taking out of beta-private group, no reason to restrict access AFAICS.

Patch now in -mm, btw, probably slated for .22 & .23.

Comment 24 Don Zickus 2007-09-18 19:23:31 UTC

in 2.6.18-48.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 26 Mike Gahagan 2007-09-26 19:44:38 UTC

verified using reproducer located at:

http://lists.openwall.net/linux-ext4/2007/06/01/1

corruption reproduces within seconds with the -47 kernel, no corruption noted
after about 30 min. of use with the -49 kernel.

Comment 28 errata-xmlrpc 2007-11-07 20:03:52 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html

Comment 29 Eric Sandeen 2007-11-16 15:18:15 UTC

http://qa.mandriva.com/show_bug.cgi?id=32547 is worrying me now; it looks
possible that this fix caused another regression... looking into it with a sense
of urgency.  Just a heads up...
