Bug 286501
Summary: | ext3 fs corruption noted after fresh install of RHEL5.1-Server-20070906.0 | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Mike Gahagan <mgahagan> |
Component: | kernel | Assignee: | Eric Sandeen <esandeen> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | low | Priority: | medium |
Version: | 5.1 | CC: | dzickus, esandeen, jburke, sct, smoser |
Hardware: | ppc64 | OS: | Linux |
URL: | http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=682570 | | |
Fixed In Version: | RHBA-2007-0959 | Doc Type: | Bug Fix |
Last Closed: | 2007-11-07 20:03:52 UTC | Bug Blocks: | 311301 |
Description
Mike Gahagan
2007-09-11 17:59:21 UTC
Requesting blocker due to fs corruption.

Can you make an image of this (corrupted) filesystem? Hm, and are logs from the previous boot(s) and/or installs available?

Ok, if this was a one-time problem, and the corrupted fs image is no longer available, and kernel messages from install-time aren't available... I don't see how we can make any progress on this one, I'm afraid. The corruption in question is that a directory entry claims to be larger than a block size, i.e. 30720 bytes. Which happens to be a nice even 0x7800 hex... but past that, I'm fresh out of clues.

Saving the corrupted fs for examination would probably be most helpful in these cases, if there is any way to do that... then we could look for any other corruption, and see if there are more clues. All we have to go on is one single bad value on the disk, which could just as easily be attributed to a memory or hard disk error... or a filesystem bug. But there's just not enough to go on. If this crops up again, though, more datapoints will be helpful.

247628 looks related, and makes me wonder if we have an endian problem... -eric

There was a reproducer posted to the ext4 list a while back, which passed w/ no response from anyone :( (it had slightly different results, but hopefully the same underlying cause). Working on it now...

The more I look at the root cause from the reproducer, the less I feel like it is accurately reproducing the original report, I'm afraid. When I get 100% to the bottom of it, I'll see if I can bridge that conceptual gap... But in the meantime, if this crops up again, if there's any way to get an image of the fs in question, that'd be great.
-Eric

The corruptions in this and the other bug are with records like this:

offset=0, inode=0, rec_len=0, name_len=0
offset=0, inode=2164326400, rec_len=0, name_len=5
offset=0, inode=5376, rec_len=2, name_len=0
offset=0, inode=570556416, rec_len=28161, name_len=111
offset=0, inode=4608, rec_len=30720, name_len=0

all at offset 0 in the directory, and the inode numbers are "interesting": 0x81010000, 0x1500, 0x22020000, 0x1200. Pretty round numbers, there. Endian problems? Did we get to a block that doesn't actually contain dir entries? Hmmm

Also, for what it's worth, this does not look like a regression in RHEL5. I was able to hit it on x86 on Kernel 2.6.18-2.el5 -Eric

(re: comment #11, hit it with the QE reproducer, that is - and I'm not yet convinced that the reproducer is hitting the same root cause as the original report)

Ok, I have some code running now that survives the reproducer that was reported on the ext4 list. Need to clean it up & will send it upstream....

Sent a patch to linux-ext4 today for comment. For now, I'm willing to chalk up the original error to the problem demonstrated in the reproducer. Due to the miscalculation, the memcpy of the new name will overwrite the buffer & corrupt memory. After that, all bets are off... let's get the fix in and keep an eye out for any recurrence. Thanks, -Eric

*** Bug 247628 has been marked as a duplicate of this bug. ***

*** Bug 289711 has been marked as a duplicate of this bug. ***

Taking out of beta-private group, no reason to restrict access AFAICS.

Patch now in -mm, btw, probably slated for .22 & .23.

in 2.6.18-48.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5

Verified using the reproducer located at: http://lists.openwall.net/linux-ext4/2007/06/01/1 Corruption reproduces within seconds with the -47 kernel; no corruption noted after about 30 min. of use with the -49 kernel.

An advisory has been issued which should help the problem described in this bug report.
This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html

http://qa.mandriva.com/show_bug.cgi?id=32547 is worrying me now; it looks possible that this fix caused another regression... looking into it with a sense of urgency. Just a heads up...
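
For anyone retracing the analysis in the comments above, here is a rough Python sketch that takes the corrupted records quoted in this bug, prints the inode and rec_len values in hex alongside their byte-swapped forms (to eyeball the endian question), and runs a few simplified sanity checks on each entry. This is not the kernel's code. The 4096-byte block size is an assumption (the report does not state it), and `swap32`, `swap16`, `min_rec_len` and `problems` are hypothetical helper names. The only layout fact relied on is that an ext2/ext3 directory entry starts with an 8-byte header (4-byte inode, 2-byte rec_len, 1-byte name_len, 1-byte file type) and that records are padded to a multiple of 4 bytes.

```python
import struct

# (inode, rec_len, name_len) copied from the records quoted in this bug.
RECORDS = [
    (0, 0, 0),
    (2164326400, 0, 5),
    (5376, 2, 0),
    (570556416, 28161, 111),
    (4608, 30720, 0),
]

BLOCK_SIZE = 4096  # ASSUMPTION: the report does not give the fs block size.


def swap32(value):
    """Return a 32-bit value with its four bytes reversed."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def swap16(value):
    """Return a 16-bit value with its two bytes reversed."""
    return struct.unpack("<H", struct.pack(">H", value))[0]


def min_rec_len(name_len):
    """8-byte header plus the name, rounded up to a multiple of 4."""
    return (8 + name_len + 3) & ~3


def problems(rec_len, name_len, offset=0, block_size=BLOCK_SIZE):
    """List the simplified sanity checks that this entry fails."""
    found = []
    if rec_len < min_rec_len(1):
        found.append("rec_len is smaller than minimal")
    if rec_len % 4 != 0:
        found.append("rec_len is not a multiple of 4")
    if rec_len < min_rec_len(name_len):
        found.append("rec_len is too small for name_len")
    if offset + rec_len > block_size:
        found.append("entry crosses the block boundary")
    return found


if __name__ == "__main__":
    for inode, rec_len, name_len in RECORDS:
        print(
            f"inode={inode:#010x} (swapped {swap32(inode):#010x})  "
            f"rec_len={rec_len:#06x} (swapped {swap16(rec_len):#06x})  "
            f"name_len={name_len}"
        )
        for issue in problems(rec_len, name_len):
            print(f"    {issue}")
```

The script only restates the numbers already in the report in a form that is easier to compare. It does not by itself distinguish between an endian bug, a block that never held directory entries, and the memory overwrite that the eventual fix addressed.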