Description of problem:
Found in rhts log from failed job (jobid 6782)

INIT: version 2.86 booting
Welcome to Red Hat Enterprise Linux Server
Press 'I' to enter interactive startup.
Setting clock (utc): Tue Sep 11 10:41:25 EDT 2007 [ OK ]
Starting udev: [ OK ]
Setting hostname ibm-js20-04.lab.boston.redhat.com: [ OK ]
Setting up Logical Volume Management: 2 logical volume(s) in volume group "VolGroup00" now active
[ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00
/dev/VolGroup00/LogVol00: clean, 61045/9240576 files, 760688/9240576 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/hda2
/boot: clean, 20/26104 files, 20547/104384 blocks
[ OK ]
Remounting root filesystem in read-write mode: [ OK ]
Mounting local filesystems: [ OK ]
Enabling local filesystem quotas: [ OK ]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #4554864: directory entry across blocks - offset=0, inode=4608, rec_len=30720, name_len=0
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: <4>__journal_remove_journal_head: freeing b_committed_data
Remounting filesystem read-only
rm: cannot remove `/var/run/utmp': Read-only file system
/etc/rc.d/rc.sysinit: line 844: /var/run/utmp: Read-only file system
touch: cannot touch `/var/log/wtmp': Read-only file system
chgrp: changing group of `/var/run/utmp': Read-only file system
chgrp: changing group of `/var/log/wtmp': Read-only file system

Version-Release number of selected component (if applicable):

How reproducible:
Has only happened one time so far.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
requesting blocker due to fs corruption.
Can you make an image of this (corrupted) filesystem?
Hm, and are logs from the previous boot(s) and/or installs available?
Ok, if this was a one-time problem, and the corrupted fs image is no longer available, and kernel messages from install-time aren't available... I don't see how we can make any progress on this one, I'm afraid.
The corruption in question is that a directory entry claims to be larger than a block, i.e. its rec_len is 30720 bytes. Which happens to be a nice even 0x7800 hex... but past that, I'm fresh out of clues.
Saving the corrupted fs for examination would probably be most helpful in these cases, if there is any way to do that... then we could look for any other corruption, and see if there are more clues.
All we have to go on is one single bad value on the disk, which could just as easily be attributed to a memory or hard disk error... or a filesystem bug. But there's just not enough to go on. If this crops up again, though, more datapoints will be helpful.
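For anyone following along, here is a rough sketch of the kind of sanity check that produces the "directory entry across blocks" message. It is loosely modeled on the checks the ext3 driver applies to each directory entry, not copied from the kernel. The 4096-byte block size is an assumption (the report does not state it), the function names are made up for illustration, and the message strings other than the one seen in the log are illustrative.

```python
import struct

BLOCK_SIZE = 4096  # assumption: the report does not state the block size


def dir_rec_len(name_len):
    """Minimum record length for a name: 8-byte header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(block, offset, block_size=BLOCK_SIZE):
    """Return an error string for the entry at `offset`, or None if it looks sane."""
    # On-disk entry header (little-endian):
    # inode u32, rec_len u16, name_len u8, file_type u8
    inode, rec_len, name_len, _ftype = struct.unpack_from("<IHBB", block, offset)
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        return "directory entry across blocks"
    return None


# Rebuild the bad entry from the report: inode=4608, rec_len=30720, name_len=0
bad_block = bytearray(BLOCK_SIZE)
struct.pack_into("<IHBB", bad_block, 0, 4608, 30720, 0, 0)
print(check_dir_entry(bad_block, 0))
```

With the values from the log, the entry passes the minimum-size and alignment checks and only trips the last one, because 30720 runs past the end of the assumed 4096-byte block.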
247628 looks related, and makes me wonder if we have an endian problem... -eric
There was a reproducer posted to the ext4 list a while back, which passed w/ no response from anyone :( (it gave slightly different results, but hopefully has the same underlying cause) Working on it now...
The more I look at the root cause from the reproducer, the less I feel like it is accurately reproducing the original report, I'm afraid. When I get 100% to the bottom of it, I'll see if I can bridge that conceptual gap... But in the meantime, if this crops up again, if there's any way to get an image of the fs in question, that'd be great. -Eric
The corruptions in this and the other bug are with records like this:
offset=0, inode=0, rec_len=0, name_len=0
offset=0, inode=2164326400, rec_len=0, name_len=5
offset=0, inode=5376, rec_len=2, name_len=0
offset=0, inode=570556416, rec_len=28161, name_len=111
offset=0, inode=4608, rec_len=30720, name_len=0
all at offset 0 in the directory, and the inode numbers are "interesting:"
0x81010000, 0x1500, 0x22020000, 0x1200
Pretty round numbers, there. Endian problems? Did we get to a block that doesn't actually contain dir entries? Hmmm
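The decimal-to-hex conversions above are easy to double-check, and the same few lines can show what each value would look like with its bytes reversed, which is one quick way to eyeball the endian question. This is only an illustrative sketch: the helper name swap32 is made up, and the output says nothing by itself about whether endianness is actually the cause.

```python
import struct

# Non-zero inode values from the bad records quoted above
inodes = [2164326400, 5376, 570556416, 4608]


def swap32(value):
    """Reverse the byte order of a 32-bit unsigned value."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


for ino in inodes:
    swapped = swap32(ino)
    print(f"{ino:>10}  hex=0x{ino:08x}  byteswapped=0x{swapped:08x} ({swapped})")

# The rec_len from the original report, for comparison
rec_len = 30720
print(f"rec_len {rec_len} = 0x{rec_len:04x}")
```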
Also, for what it's worth, this does not look like a regression in RHEL5. I was able to hit it on x86 on Kernel 2.6.18-2.el5 -Eric
(re: comment #11, hit it with the QE reproducer, that is - and I'm not yet convinced that the reproducer is hitting the same root cause as the original report)
Ok, I have some code running now that survives the reproducer that was reported on the ext4 list. Need to clean it up & will send it upstream....
Sent a patch to linux-ext4 today for comment.
For now, I'm willing to chalk up the original error to the problem demonstrated in the reproducer. Due to the miscalculation, the memcpy of the new name will overwrite the buffer & corrupt memory. After that, all bets are off... let's get the fix in and keep an eye out for any recurrence. Thanks, -Eric
*** Bug 247628 has been marked as a duplicate of this bug. ***
*** Bug 289711 has been marked as a duplicate of this bug. ***
Taking out of beta-private group, no reason to restrict access AFAICS. Patch now in -mm, btw, probably slated for .22 & .23.
in 2.6.18-48.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Verified using the reproducer located at: http://lists.openwall.net/linux-ext4/2007/06/01/1
Corruption reproduces within seconds with the -47 kernel; no corruption noted after about 30 min. of use with the -49 kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html
http://qa.mandriva.com/show_bug.cgi?id=32547 is worrying me now; it looks possible that this fix caused another regression... looking into it with a sense of urgency. Just a heads up...