Bug 286501 - ext3 fs corruption noted after fresh install of RHEL5.1-Server-20070906.0
Summary: ext3 fs corruption noted after fresh install of RHEL5.1-Server-20070906.0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: ppc64
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Martin Jenner
URL: http://rhts.lab.boston.redhat.com/cgi...
Whiteboard:
Duplicates: 247628 289711
Depends On:
Blocks: 311301
 
Reported: 2007-09-11 17:59 UTC by Mike Gahagan
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 20:03:52 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2007:0959 | 0 | normal | SHIPPED_LIVE | Updated kernel packages for Red Hat Enterprise Linux 5 Update 1 | 2007-11-08 00:47:37 UTC

Description Mike Gahagan 2007-09-11 17:59:21 UTC
Description of problem:

Found in rhts log from failed job (jobid 6782)

INIT: version 2.86 booting
		Welcome to Red Hat Enterprise Linux Server
		Press 'I' to enter interactive startup.
Setting clock  (utc): Tue Sep 11 10:41:25 EDT 2007 [  OK  ]
Starting udev: [  OK  ]
Setting hostname ibm-js20-04.lab.boston.redhat.com:  [  OK  ]
Setting up Logical Volume Management:   2 logical volume(s) in volume group
"VolGroup00" now active
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00 
/dev/VolGroup00/LogVol00: clean, 61045/9240576 files, 760688/9240576 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/hda2 
/boot: clean, 20/26104 files, 20547/104384 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling local filesystem quotas:  [  OK  ]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory
#4554864: directory entry across blocks - offset=0, inode=4608, rec_len=30720,
name_len=0
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb:
<4>__journal_remove_journal_head: freeing b_committed_data
Remounting filesystem read-only
rm: cannot remove `/var/run/utmp': Read-only file system
/etc/rc.d/rc.sysinit: line 844: /var/run/utmp: Read-only file system
touch: cannot touch `/var/log/wtmp': Read-only file system
chgrp: changing group of `/var/run/utmp': Read-only file system
chgrp: changing group of `/var/log/wtmp': Read-only file system

Version-Release number of selected component (if applicable):



How reproducible:

Has only happened one time so far.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Mike Gahagan 2007-09-11 18:09:56 UTC
requesting blocker due to fs corruption.

Comment 2 Eric Sandeen 2007-09-11 18:39:18 UTC
Can you make an image of this (corrupted) filesystem?

Comment 3 Eric Sandeen 2007-09-11 18:42:16 UTC
Hm, and are logs from the previous boot(s) and/or installs available?

Comment 5 Eric Sandeen 2007-09-11 19:00:29 UTC
Ok, if this was a one-time problem, and the corrupted fs image is no longer
available, and kernel messages from install-time aren't available... I don't see
how we can make any progress on this one, I'm afraid.

The corruption in question is that a directory entry claims to be larger than a
block size, i.e. rec_len is 30720 bytes.  Which happens to be a nice even 0x7800
in hex... but past that, I'm fresh out of clues.  Saving the corrupted fs for
examination would probably be most helpful in these cases, if there is any way
to do that... then we could look for any other corruption, and see if there are
more clues.
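
For illustration only, here is a rough Python sketch of the kind of sanity
check that produces the "directory entry across blocks" message in the log
above.  It is modeled loosely on ext3's directory entry validation and is not
the kernel code.  The 4096-byte block size is an assumption (the report does
not state it), and every message string other than "directory entry across
blocks" is a placeholder of this sketch.

```python
# Assumed block size; the bug report does not say what the filesystem used.
BLOCK_SIZE = 4096


def dir_rec_len(name_len):
    """Smallest record length for a name: 8-byte header plus the name,
    rounded up to a multiple of 4."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(offset, rec_len, name_len, block_size=BLOCK_SIZE):
    """Return an error string for an implausible entry, or None if it passes."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        # The entry claims to extend past the end of its block.
        return "directory entry across blocks"
    return None


# Values taken from the EXT3-fs error in the description.
print(hex(30720), check_dir_entry(offset=0, rec_len=30720, name_len=0))
```

With the assumed 4 KiB block, rec_len=30720 passes the size and alignment
checks and fails only the last one, which matches the logged error.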

All we have to go on is one single bad value on the disk, which could just as
easily be attributed to a memory or hard disk error... or, a filesystem bug. 
But there's just not enough to go on.

If this crops up again, though, more datapoints will be helpful.

Comment 6 Eric Sandeen 2007-09-12 16:19:49 UTC
247628 looks related, and makes me wonder if we have an endian problem...

-eric

Comment 7 Eric Sandeen 2007-09-12 23:58:09 UTC
There was a reproducer posted to the ext4 list a while back, which passed w/ no
response from anyone :(

(it was slightly different results, but hopefully same underlying cause)

Working on it now...

Comment 8 Eric Sandeen 2007-09-13 22:46:07 UTC
The more I look at the root cause from the reproducer, the less I feel like it
is accurately reproducing the original report, I'm afraid.  When I get 100% to
the bottom of it, I'll see if I can bridge that conceptual gap...

But in the meantime, if this crops up again, if there's any way to get an image
of the fs in question, that'd be great.

-Eric

Comment 10 Eric Sandeen 2007-09-14 18:14:49 UTC
The corruptions in this and the other bug are with records like this:

offset=0, inode=0, 		rec_len=0, 	name_len=0
offset=0, inode=2164326400, 	rec_len=0, 	name_len=5
offset=0, inode=5376, 		rec_len=2, 	name_len=0
offset=0, inode=570556416, 	rec_len=28161, 	name_len=111
offset=0, inode=4608, 		rec_len=30720, 	name_len=0

all at offset 0 in the directory, and the inode numbers are "interesting" in hex:

0x0, 0x81010000, 0x1500, 0x22020000, 0x1200

pretty round numbers, there.  Endian problems?  Did we get to a block that
doesn't actually contain dir entries?  Hmmm
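
As a quick way to eyeball the endianness question, the following Python sketch
prints each reported inode and rec_len value in hex next to its byte-swapped
form.  The helper names swap32 and swap16 are made up for this example, and
the records are copied from the list above; nothing here shows what the
kernel actually read from disk.

```python
import struct


def swap32(value):
    """Reverse the byte order of a 32-bit unsigned value."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def swap16(value):
    """Reverse the byte order of a 16-bit unsigned value."""
    return struct.unpack("<H", struct.pack(">H", value))[0]


# (inode, rec_len, name_len) as reported in the error messages.
records = [
    (0, 0, 0),
    (2164326400, 0, 5),
    (5376, 2, 0),
    (570556416, 28161, 111),
    (4608, 30720, 0),
]

for inode, rec_len, name_len in records:
    print(
        f"inode={inode:#010x} swapped={swap32(inode):#010x}  "
        f"rec_len={rec_len:#06x} swapped={swap16(rec_len):#06x}  "
        f"name_len={name_len}"
    )
```

Byte-swapping alone does not prove anything, but it makes it easy to see
whether the swapped values look more plausible than the reported ones.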

Comment 11 Eric Sandeen 2007-09-14 18:51:02 UTC
Also, for what it's worth, this does not look like a regression in RHEL5.  I was
able to hit it on x86 on kernel 2.6.18-2.el5

-Eric

Comment 12 Eric Sandeen 2007-09-14 18:51:48 UTC
(re: comment #11, hit it with the QE reproducer, that is - and I'm not yet
convinced that the reproducer is hitting the same root cause as the original report)

Comment 13 Eric Sandeen 2007-09-15 06:32:34 UTC
Ok, I have some code running now that survives the reproducer that was reported
on the ext4 list.  Need to clean it up & will send it upstream....

Comment 14 Eric Sandeen 2007-09-16 04:13:11 UTC
Sent a patch to linux-ext4 today for comment.

Comment 15 Eric Sandeen 2007-09-17 17:15:18 UTC
For now, I'm willing to chalk up the original error to the problem demonstrated
in the reproducer.  Due to the miscalculation, the memcpy of the new name will
overwrite the buffer & corrupt memory.  After that, all bets are off... let's
get the fix in and keep an eye out for any recurrence.

Thanks,

-Eric

Comment 18 Eric Sandeen 2007-09-17 19:35:09 UTC
*** Bug 247628 has been marked as a duplicate of this bug. ***

Comment 19 Eric Sandeen 2007-09-17 19:48:02 UTC
*** Bug 289711 has been marked as a duplicate of this bug. ***

Comment 23 Eric Sandeen 2007-09-18 17:37:36 UTC
Taking out of beta-private group, no reason to restrict access AFAICS.

Patch now in -mm, btw, probably slated for .22 & .23.

Comment 24 Don Zickus 2007-09-18 19:23:31 UTC
in 2.6.18-48.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 26 Mike Gahagan 2007-09-26 19:44:38 UTC
verified using reproducer located at:

http://lists.openwall.net/linux-ext4/2007/06/01/1

corruption reproduces within seconds with the -47 kernel, no corruption noted
after about 30 min. of use with the -49 kernel.


Comment 28 errata-xmlrpc 2007-11-07 20:03:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


Comment 29 Eric Sandeen 2007-11-16 15:18:15 UTC
http://qa.mandriva.com/show_bug.cgi?id=32547 is worrying me now; it looks
possible that this fix caused another regression... looking into it with a sense
of urgency.  Just a heads up...

