Red Hat Bugzilla – Bug 286501
ext3 fs corruption noted after fresh install of RHEL5.1-Server-20070906.0
Last modified: 2007-11-30 17:07:47 EST
Description of problem:
Found in rhts log from failed job (jobid 6782)
INIT: version 2.86 booting
Welcome to Red Hat Enterprise Linux Server
Press 'I' to enter interactive startup.
Setting clock (utc): Tue Sep 11 10:41:25 EDT 2007 [ OK ]
Starting udev: [ OK ]
Setting hostname ibm-js20-04.lab.boston.redhat.com: [ OK ]
Setting up Logical Volume Management: 2 logical volume(s) in volume group
"VolGroup00" now active
[ OK ]
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00
/dev/VolGroup00/LogVol00: clean, 61045/9240576 files, 760688/9240576 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/hda2
/boot: clean, 20/26104 files, 20547/104384 blocks
[ OK ]
Remounting root filesystem in read-write mode: [ OK ]
Mounting local filesystems: [ OK ]
Enabling local filesystem quotas: [ OK ]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory
#4554864: directory entry across blocks - offset=0, inode=4608, rec_len=30720,
EXT3-fs error (device dm-0): ext3_journal_start_sb:
<4>__journal_remove_journal_head: freeing b_committed_data
Remounting filesystem read-only
rm: cannot remove `/var/run/utmp': Read-only file system
/etc/rc.d/rc.sysinit: line 844: /var/run/utmp: Read-only file system
touch: cannot touch `/var/log/wtmp': Read-only file system
chgrp: changing group of `/var/run/utmp': Read-only file system
chgrp: changing group of `/var/log/wtmp': Read-only file system
Version-Release number of selected component (if applicable):
How reproducible:
Has only happened one time so far.
Steps to Reproduce:
Requesting blocker due to fs corruption.
Can you make an image of this (corrupted) filesystem?
Hm, and are logs from the previous boot(s) and/or installs available?
Ok, if this was a one-time problem, and the corrupted fs image is no longer
available, and kernel messages from install-time aren't available... I don't see
how we can make any progress on this one, I'm afraid.
The corruption in question is that a directory entry claims to be larger than a
block: its rec_len is 30720 bytes, which happens to be a nice even 0x7800 in hex... but
past that, I'm fresh out of clues. Saving the corrupted fs for examination
would probably be most helpful in these cases, if there is any way to do
that... then we could look for any other corruption, and see if there are more clues.
All we have to go on is one single bad value on the disk, which could just as
easily be attributed to a memory or hard disk error... or, a filesystem bug.
But there's just not enough to go on.
If this crops up again, though, more datapoints will be helpful.
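For reference, the "directory entry across blocks" message in the log is one of the sanity checks applied to each directory entry. Below is a rough Python sketch of those checks. It assumes the standard ext2/ext3 on-disk entry header (little-endian 32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file type) and a 4096-byte block size, which this report does not state. `check_dirent` and `min_rec_len` are hypothetical helpers for illustration, not kernel code.

```python
import struct

BLOCK_SIZE = 4096  # assumption: the report does not state the block size

def min_rec_len(name_len):
    """Smallest legal record: 8-byte header plus name, rounded up to 4 bytes."""
    return (8 + name_len + 3) & ~3

def check_dirent(block, offset, block_size=BLOCK_SIZE):
    """Return a description of the first problem found, or None if the entry looks sane."""
    inode, rec_len, name_len, _ftype = struct.unpack_from("<IHBB", block, offset)
    if rec_len < min_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < min_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        return "directory entry across blocks"
    return None

# The entry from the log: offset=0, inode=4608, rec_len=30720, name_len=0
bad = struct.pack("<IHBB", 4608, 30720, 0, 0).ljust(BLOCK_SIZE, b"\0")
print(check_dirent(bad, 0))
```

With these assumptions, the entry from the log trips the last check, since a 30720-byte record starting at offset 0 cannot fit in a 4096-byte block.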
Bug 247628 looks related, and makes me wonder if we have an endian problem...
There was a reproducer posted to the ext4 list a while back, which passed w/ no
response from anyone :(
(it was slightly different results, but hopefully same underlying cause)
Working on it now...
The more I look at the root cause from the reproducer, the less I feel like it
is accurately reproducing the original report, I'm afraid. When I get 100% to
the bottom of it, I'll see if I can bridge that conceptual gap...
But in the meantime, if this crops up again, if there's any way to get an image
of the fs in question, that'd be great.
The corruptions in this and the other bug are with records like this:
offset=0, inode=0, rec_len=0, name_len=0
offset=0, inode=2164326400, rec_len=0, name_len=5
offset=0, inode=5376, rec_len=2, name_len=0
offset=0, inode=570556416, rec_len=28161, name_len=111
offset=0, inode=4608, rec_len=30720, name_len=0
all at offset 0 in the directory, and the nonzero inode numbers are "interesting":
0x81010000, 0x1500, 0x22020000, 0x1200
Pretty round numbers, there. Endian problems? Did we get to a block that
doesn't actually contain dir entries? Hmmm
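To make the decimal-to-hex step and the endianness question concrete, here is a small Python sketch that prints each reported inode number and the rec_len from the original log in hex, alongside the value with its bytes reversed. `swap32` and `swap16` are hypothetical helper names used only for illustration.

```python
import struct

def swap32(value):
    """Return a 32-bit value with its byte order reversed."""
    return struct.unpack(">I", struct.pack("<I", value))[0]

def swap16(value):
    """Return a 16-bit value with its byte order reversed."""
    return struct.unpack(">H", struct.pack("<H", value))[0]

# Nonzero inode numbers from the corrupted records listed above
inodes = [2164326400, 5376, 570556416, 4608]
for ino in inodes:
    print(f"inode {ino:>10} = 0x{ino:08x}, byte-swapped 0x{swap32(ino):08x} = {swap32(ino)}")

# rec_len from the original report
rec_len = 30720
print(f"rec_len {rec_len} = 0x{rec_len:04x}, byte-swapped 0x{swap16(rec_len):04x} = {swap16(rec_len)}")
```

This only shows what the values look like under a byte swap; it does not by itself establish that byte order is the cause.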
Also, for what it's worth, this does not look like a regression in RHEL5. I was
able to hit it on x86 on kernel 2.6.18-2.el5
(re: comment #11, hit it with the QE reproducer, that is - and I'm not yet
convinced that the reproducer is hitting the same root cause as the original report)
Ok, I have some code running now that survives the reproducer that was reported
on the ext4 list. Need to clean it up & will send it upstream....
Sent a patch to linux-ext4 today for comment.
For now, I'm willing to chalk up the original error to the problem demonstrated
in the reproducer. Due to the miscalculation, the memcpy of the new name will
overwrite the buffer & corrupt memory. After that, all bets are off... let's
get the fix in and keep an eye out for any recurrence.
*** Bug 247628 has been marked as a duplicate of this bug. ***
*** Bug 289711 has been marked as a duplicate of this bug. ***
Taking out of beta-private group, no reason to restrict access AFAICS.
Patch now in -mm, btw, probably slated for .22 & .23.
You can download this test kernel from http://people.redhat.com/dzickus/el5
verified using reproducer located at:
corruption reproduces within seconds with the -47 kernel, no corruption noted
after about 30 min. of use with the -49 kernel.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
http://qa.mandriva.com/show_bug.cgi?id=32547 is worrying me now; it looks
possible that this fix caused another regression... looking into it with a sense
of urgency. Just a heads up...