Bug 500483
Summary: | GFS2: fsck.gfs2 sometimes needs to be run twice
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Robert Peterson <rpeterso>
Component: | gfs2-utils | Assignee: | Robert Peterson <rpeterso>
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list>
Severity: | high | Priority: | urgent
Version: | 5.3 | CC: | adas, bkahn, cdewolf, cward, dejohnso, edamato, everett.bennett, ffotorel, jcapel, jkortus, liko, sghosh, swhiteho, tao, tdunnon
Target Milestone: | rc | Keywords: | ZStream
Hardware: | All | OS: | Linux
Fixed In Version: | gfs2-utils-0.1.62-6.el5 | Doc Type: | Bug Fix
Clone Of: |
: | 509225 532691
Last Closed: | 2010-03-30 08:53:04 UTC
Bug Blocks: | 499522, 532691, 536842
Description
Robert Peterson
2009-05-12 21:12:41 UTC
Created attachment 345963 [details]
Preliminary patch
This preliminary patch solves all the problems I encountered on the
original set of metadata that caused me to open the bugzilla. There
are several distinct issues. Here is a list in the order of appearance
within the patch:
1. When duplicates were discovered, nothing was reported in the verbose
output, so I changed the appropriate log_debug to log_info.
2. When marking blocks as "data" blocks, it was not setting the rgrp
bitmap accordingly. That caused some repaired data blocks to be
treated as metadata blocks later, which caused them to be improperly
freed.
3. When invalid metadata was discovered, nothing was reported in the
verbose output, so I changed the appropriate log_debug to log_info.
4. The verbose output was putting out annoying "Done checking metatree"
messages that were more appropriately debug messages, so I changed
them from log_info to log_debug.
5. I ran into a duplicate reference error message that had the wrong
number of parameters. I changed it as a precaution.
6. Function check_dentry was resetting the "update" flag when examining
every directory entry. So the flag was set to "update" when an error
was discovered in a directory and reset back to "no_update" when it
processed the next "good" entry. The statement was removed.
7. When stale directory entries were removed, it was not reported.
I added a log_err message for consistency's sake.
8. In pass3, when stranded directories were moved to lost+found, it was
not updating the dinode. I changed the code so that it did the
appropriate fsck_inode_put call in the right places.
9. When bad directory entries were removed, the dinode di_entries count
was not being adjusted to reflect one less directory entry.
I added code to decrement it.
This version should work, but could probably stand to have more
testing. So far I've only tested it with the failing metadata set.
I'd like to run it through some of my other gfs2 metadata in my
collection to try to shake out other similar problems.
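Items 6 and 9 in the list above can be illustrated with a minimal sketch. This is not the fsck.gfs2 code: the structures, the `sketch_` names, and the `SK_` flag values are hypothetical stand-ins. It only shows the intended behaviour, namely that the "update" flag is raised when a bad entry is removed and is never reset by a later good entry, and that the dinode's entry count is decremented for each removed entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the "update" / "no_update" flag values. */
enum sketch_update { SK_NO_UPDATE = 0, SK_UPDATE = 1 };

/* Hypothetical, heavily simplified directory entry and directory. */
struct sketch_dirent {
    int valid;      /* nonzero if the entry passed its checks */
    int removed;    /* set once the checker has dropped the entry */
};

struct sketch_dir {
    unsigned int di_entries;        /* entry count kept in the dinode */
    struct sketch_dirent *ents;
    size_t nents;
};

/*
 * Check one entry. The flag is only ever raised here; a good entry
 * leaves it alone, so an earlier error is not forgotten (item 6).
 */
void sketch_check_dentry(struct sketch_dir *dir, struct sketch_dirent *de,
                         enum sketch_update *update)
{
    assert(dir != NULL && de != NULL && update != NULL);

    if (de->valid || de->removed)
        return;                     /* good entry: do not touch *update */

    de->removed = 1;
    if (dir->di_entries > 0)
        dir->di_entries--;          /* one less directory entry (item 9) */
    *update = SK_UPDATE;
}

/* Walk every entry; the result says whether the dinode must be rewritten. */
enum sketch_update sketch_check_dir(struct sketch_dir *dir)
{
    enum sketch_update update = SK_NO_UPDATE;
    size_t i;

    assert(dir != NULL);
    for (i = 0; i < dir->nents; i++)
        sketch_check_dentry(dir, &dir->ents[i], &update);
    return update;
}
```

With one bad entry between two good ones, the walk in this sketch still reports that an update is needed and the count drops by one, which is the outcome the removed reset statement used to prevent.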
I've been working on a replacement patch. I ran the previous patch against several of the file system metadata sets in my collection, and it shook out more problems. My latest and greatest patch fixes many of those problems, but there are still a few kinks to work out. If you want to try my latest patch, let me know and I'll attach it. Otherwise I'll keep working on it and post it when I'm closer.
This is getting to be a lot more involved than I ever dreamed it would be. I've flushed out and fixed several more bugs in fsck.gfs2 since my previous post, and my latest version has passed twelve tough fsck tests so far. I've still got one minor problem that I hope to fix today, and two more tests I'd like to run. The problem with fsck.gfs2 fixes is that they can take a very long time to test. Luckily, debugging problems is relatively easy, unlike the kernel code. Many of these fixes should be crosswritten to gfs1's fsck (fsck.gfs).
The bottom line: unless I find a major problem, I will attach my latest patch to the bugzilla within the next four hours.
Created attachment 347674 [details]
Hopefully the final patch
This is my latest and greatest. I ended up finding and fixing the
additional problem I mentioned, but it took longer than expected.
Therefore I haven't had time to re-test the new patch on my whole
metadata collection. If they all pass, then this is likely what
I'll use as the final patch.
Created attachment 347874 [details]
The Final Patch
This patch fixes another minor problem found during testing.
The good news is that this one was able to clean up all the
messed up file systems in my GFS2 metadata collection on the
first run. There were three sets of metadata that were too
large for my device, so I couldn't test them.
The code is basically the same as the previous patch, except
for the directory traversal code in metawalk.c. One of my damaged
metadata sets had an inode with a set of directory leaf pointers
that were zeroed at the beginning of the data. Those leading
zeroes were confusing the code because it had no "previous"
directory pointer to work with, so it didn't know how to
fix it. So I added logic to find the first viable directory
pointer and fill it in, in cases where there was no "previous".
There's no doubt that many of these fixes should be ported back
to gfs_fsck for GFS(1).
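The leading-zero repair described above might look roughly like the following sketch. It is an illustration under assumptions, not the metawalk.c code: the function name is hypothetical, "viable" is reduced to "nonzero" (a real check would also validate the block the pointer refers to), and filling a later zeroed slot from the previous pointer is assumed to be what the existing logic did.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Repair zeroed slots in a table of directory leaf pointers.
 * Returns the number of slots that were filled in.
 */
size_t sketch_fix_leaf_pointers(uint64_t *ptrs, size_t n)
{
    uint64_t prev = 0;
    size_t fixed = 0;
    size_t i;

    assert(ptrs != NULL || n == 0);

    /*
     * Leading zeroes have no "previous" pointer, so look ahead for the
     * first viable one and use it as the starting value.
     */
    for (i = 0; i < n; i++) {
        if (ptrs[i] != 0) {
            prev = ptrs[i];
            break;
        }
    }
    if (prev == 0)
        return 0;                   /* no viable pointer anywhere */

    for (i = 0; i < n; i++) {
        if (ptrs[i] == 0) {
            ptrs[i] = prev;         /* fill from the nearest earlier pointer */
            fixed++;
        } else {
            prev = ptrs[i];
        }
    }
    return fixed;
}
```

For a table such as `{0, 0, 100, 100, 0, 200}`, this sketch fills the two leading slots and the later gap with 100 and leaves 200 untouched. A table with no viable pointer at all is left alone, since there is nothing to copy from.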
Unfortunately, I began testing the upstream version of this patch on some different metadata, and it uncovered a complex new problem dealing with indirect extended attributes. I need some more time to revise and retest. If you're not using extended attributes, the previous patch should be fine.
Created attachment 348194 [details]
Replacement patch
This patch fixes several problems I found when testing on upstream
code, with different metadata. This patch hasn't had much testing,
but I'm doing final testing now, which takes several hours.
Created attachment 348195 [details]
Upstream version of the proposed patch
This is the upstream equivalent "replacement" patch.
The "replacement patch" passes and fixes all the metadata I have on the first try. However, due to the size of the patch, I'd like to get the customer (and anyone else who can, for that matter) to run the patch on whatever GFS2 file systems they have, before I push this to the repository.
I pushed the upstream patch to the master branch of the gfs2-utils git repository and the STABLE3 branch of the cluster git repository. I'd still like to hold off on RHEL5 until someone other than me tries the patch.
Created attachment 348598 [details]
Try 5 patch
This patch fixes another bug whereby the system was trying to
write to the file system when -n was specified. I've run it
against all but one of my metadata sets and it still passes.
I'll check the final one in the morning.
The previous patch was updating the bitmap for all data blocks.
That's not desired, especially when -n is specified. This version
checks the type in the existing bitmap first to see whether it
needs changing, and if so, it asks permission to do so.
This required the use of a function that was once in libgfs2 but
it eventually found its way to gfs2_edit. Now it's back in libgfs2
so multiple utils can use it.
Created attachment 348601 [details]
x86_64 binary
This is an x86_64 binary of the latest fsck.gfs2 if someone wants
to try it.
The last test passed as well, so again I'm waiting for other people to try it.
In order to satisfy the customer, I made the updates to the RHEL5 version of the code, but I still need to crosswrite those latest changes to the upstream code and test it there. That won't take much time.
Still waiting to hear back.
According to the customer, the fsck failed.
<Snip>
(pass1b.c:347) Checking inode 27005698 (0x19c1302)'s metatree for references to block 6758956 (0x67222c)
(pass1b.c:353) Done checking metatree
(pass1b.c:347) Checking inode 27005698 (0x19c1302)'s metatree for references to block 6758955 (0x67222b)
(pass1b.c:353) Done checking metatree
(pass1b.c:347) Checking inode 27005698 (0x19c1302)'s metatree for references to block 6758954 (0x67222a)
(pass1b.c:353) Done checking metatree
(pass1b.c:521) Scanning block 27005699 (0x19c1303) for inodes
(pass1b.c:521) Scanning block 27005700 (0x19c1304) for inodes
(pass1b.c:521) Scanning block 27005701 (0x19c1305) for inodes
(pass1b.c:521) Scanning block 27005702 (0x19c1306) for inodes
(pass1b.c:347) Checking inode 27005702 (0x19c1306)'s metatree for references to block 196609 (0x30001)
gfs2_fsck: bad seek: Invalid argument on line 129 of file buf.c
<\Snip>
Can they please run the following command and post the results here?
gfs2_edit -x -p 27005702 /dev/their/device
This may be a hardware problem, but I want to make sure fsck is doing the right thing here, and hopefully this will tell me, although this may lead me to other requests.
Hey Bob, just to let you know the output is attached... Thanks, Toure
The data from comment #21 probably means there is no hardware problem; it's more likely a bug in the patch. Can I get a copy of their metadata or can I get access to their system? I'll see if I can logically deduce what's going on in the meantime. The complete output from comment #19 would probably be enough, but it's likely to be very big.
I ported my patch from fsck.gfs2 to gfs_fsck, the gfs-1 version, and started testing the code changes against my collection of damaged gfs metadata. It uncovered a bunch more shortcomings with the patch. So I've been very busy fixing and porting back and forth from gfs2 to gfs and re-testing the changes. I've made a lot of progress. I may post another patch soon that supersedes the "try 5" patch, after I do some gfs2 testing on it.
*** Bug 506550 has been marked as a duplicate of this bug. ***
Created attachment 353921 [details]
Try 6 patch
This patch corrects some problems I found when testing the GFS crosswrite version for bug #509225. It also fixes the problem reported in bug #506550. It may still need some work, but I hope to be done with it soon. This has had minimal testing since the last change, so I'll likely need to spend several hours retesting it, with both gfs and gfs2 damaged metadata sets.
Created attachment 356614 [details]
Try 7 patch
This patch fixes an additional problem whereby directory sentinels
were being mistaken for corrupt blocks. It also adds the capability
for fsck.gfs2 to truncate a directory block (preserving as many
directory entries as possible) if the data is unrecoverable.
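A rough sketch of the two behaviours described above, under stated assumptions: the dirent layout is a simplified, host-byte-order stand-in (the real GFS2 on-disk entry is big-endian and has more fields), a sentinel is modelled as an entry whose inode number is zero, and truncation is modelled by stretching the last good entry's record length to the end of the block. All `sketch_` names are hypothetical and this is not the fsck.gfs2 implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified directory entry header. */
struct sketch_dirent {
    uint64_t inum;      /* 0 marks a sentinel (unused slot) */
    uint16_t rec_len;   /* bytes from this entry to the next one */
    uint16_t name_len;  /* bytes of name following the header */
};

#define SK_HDR sizeof(struct sketch_dirent)

/* Helper for building or rewriting an entry header inside a block. */
void sketch_put_dirent(uint8_t *blk, size_t off, uint64_t inum,
                       uint16_t rec_len, uint16_t name_len)
{
    struct sketch_dirent de;

    memset(&de, 0, sizeof(de));
    de.inum = inum;
    de.rec_len = rec_len;
    de.name_len = name_len;
    memcpy(blk + off, &de, sizeof(de));
}

/* A sentinel with a sane record length is valid, not corrupt. */
int sketch_dirent_ok(const struct sketch_dirent *de, size_t off, size_t blksize)
{
    if (de->rec_len < SK_HDR || off + de->rec_len > blksize)
        return 0;
    if (de->inum == 0)
        return 1;
    return SK_HDR + de->name_len <= de->rec_len;
}

/*
 * Walk one directory block. On the first unrecoverable entry, cut the
 * block there and keep everything before it. Returns the number of
 * live (non-sentinel) entries kept; *truncated reports whether a cut
 * was made.
 */
unsigned int sketch_check_dir_block(uint8_t *blk, size_t blksize, int *truncated)
{
    struct sketch_dirent de;
    size_t off = 0, prev_off = 0;
    unsigned int live = 0;
    int have_prev = 0;

    assert(blk != NULL && truncated != NULL);
    assert(blksize >= SK_HDR && blksize <= UINT16_MAX);

    *truncated = 0;
    while (off + SK_HDR <= blksize) {
        memcpy(&de, blk + off, sizeof(de));
        if (!sketch_dirent_ok(&de, off, blksize)) {
            if (have_prev) {
                /* Stretch the last good entry over the damaged tail. */
                memcpy(&de, blk + prev_off, sizeof(de));
                sketch_put_dirent(blk, prev_off, de.inum,
                                  (uint16_t)(blksize - prev_off), de.name_len);
            } else {
                /* Nothing usable: one sentinel covers the whole block. */
                sketch_put_dirent(blk, 0, 0, (uint16_t)blksize, 0);
            }
            *truncated = 1;
            break;
        }
        if (de.inum != 0)
            live++;
        prev_off = off;
        have_prev = 1;
        off += de.rec_len;
    }
    return live;
}
```

In this sketch a sentinel in the middle of a block is stepped over like any other entry, so it no longer triggers a repair, and a damaged tail costs only the entries at or after the damage.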
Created attachment 356680 [details]
Try 8 patch
This fixes a few more minor things I found in gfs2 cross-testing
and cross-checking.
I pushed my latest and greatest patch to the master branch of the gfs2-utils git tree, and the STABLE3, STABLE2 and RHEL5 branches of the cluster git tree, for inclusion into 5.5. I have tested the patch extensively with a variety of customer metadata and metadata mocked up using gfs2_edit. The tests were performed on systems kool and roth-01. Changing status to Modified.
Pushed to the RHEL55 branch of cluster.git. Changing status to POST.
Built according to the new procedure. Changing to Modified.
A regression was found by the upstream community (bug #521068). I already have a fix, so I just need to respin the patch. Changing status to FAILS_QA until that's done.
A "part2" patch has been added to distcvs and the build was successful. The patch ID for the RHEL55 branch in git is: 863037b. Changing status back to Modified and changing the build fields accordingly.
*** Bug 531771 has been marked as a duplicate of this bug. ***
*** Bug 499333 has been marked as a duplicate of this bug. ***
*** Bug 495799 has been marked as a duplicate of this bug. ***
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~
RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner.
Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.
If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2010-0287.html