Bug 500483

Summary: GFS2: fsck.gfs2 sometimes needs to be run twice
Product: Red Hat Enterprise Linux 5
Reporter: Robert Peterson <rpeterso>
Component: gfs2-utils
Assignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: urgent
Version: 5.3
CC: adas, bkahn, cdewolf, cward, dejohnso, edamato, everett.bennett, ffotorel, jcapel, jkortus, liko, sghosh, swhiteho, tao, tdunnon
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: gfs2-utils-0.1.62-6.el5
Doc Type: Bug Fix
Clones: 509225, 532691
Last Closed: 2010-03-30 08:53:04 UTC
Bug Blocks: 499522, 532691, 536842
Attachments:
Description                               Flags
Preliminary patch                         none
Hopefully the final patch                 none
The Final Patch                           none
Replacement patch                         none
Upstream version of the proposed patch    none
Try 5 patch                               none
x86_64 binary                             none
Try 6 patch                               none
Try 7 patch                               none
Try 8 patch                               none

Description Robert Peterson 2009-05-12 21:12:41 UTC
Description of problem:
While working on bug #496330 with the customer's metadata, I discovered
cases where fsck.gfs2 did not clean up all of the problems on its
first run.  See:

https://bugzilla.redhat.com/show_bug.cgi?id=496330#c4

I had to run fsck.gfs2 twice in order for the file system to come
through clean and not report any more errors.

Version-Release number of selected component (if applicable):
RHEL5

How reproducible:
Always

Steps to Reproduce:
1. Restore customer's metadata with duplicate block corruption
   from bug #496330.
2. fsck.gfs2 -y /dev/that/device
3. fsck.gfs2 -y /dev/that/device
  
Actual results:
More errors are flagged and fixed the second time fsck.gfs2 is run.

Expected results:
You should only need to run fsck.gfs2 once to clean up all the
errors.

Additional info:

Comment 2 Robert Peterson 2009-05-29 21:39:21 UTC
Created attachment 345963 [details]
Preliminary patch

This preliminary patch solves all the problems I encountered on the
original set of metadata that caused me to open the bugzilla.  There
are several distinct issues.  Here is a list in the order of appearance
within the patch:

1. When duplicates were discovered, nothing was reported in the verbose
   output, so I changed the appropriate log_debug to log_info.
2. When marking blocks as "data" blocks, it was not setting the rgrp
   bitmap accordingly.  That caused some repaired data blocks to be
   treated as metadata blocks later, which caused them to be improperly
   freed.
3. When invalid metadata was discovered, nothing was reported in the
   verbose output, so I changed the appropriate log_debug to log_info.
4. The verbose output was putting out annoying "Done checking metatree"
   messages that were more appropriately debug messages, so I changed
   them from log_info to log_debug.
5. I ran into a duplicate reference error message that had the wrong
   number of parameters.  I changed it as a precaution.
6. Function check_dentry was resetting the "update" flag when examining
   every directory entry.  So the flag was set to "update" when an error
   was discovered in a directory and reset back to "no_update" when it
   processed the next "good" entry.  The statement was removed.
7. When stale directory entries were removed, it was not reported.
   I added a log_err message for consistency's sake.
8. In pass3, when stranded directories were moved to lost+found, it was
   not updating the dinode.  I changed the code so that it did the
   appropriate fsck_inode_put call in the right places.
9. When bad directory entries were removed, the dinode di_entries count
   was not being adjusted to reflect one less directory entry.
   I added code to decrement it.
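
Items 6 and 9 can be illustrated with a small Python model.  This is
only a sketch of the logic described in the list, not the fsck.gfs2 C
code: the Dinode class, the check_directory function and the is_bad
callback are hypothetical names invented for the example.  It shows the
update flag staying set once any entry has been repaired, and the stored
entry count being decremented for each entry that is removed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Dinode:
    """Hypothetical stand-in for a directory dinode (not the real struct)."""
    entries: List[str] = field(default_factory=list)
    di_entries: int = 0  # entry count stored in the dinode


def check_directory(dinode: Dinode, is_bad: Callable[[str], bool]) -> bool:
    """Remove bad entries; return True if the dinode needs writing back."""
    update = False
    kept = []
    for name in dinode.entries:
        if is_bad(name):
            # Item 9: one fewer entry, so the stored count must follow.
            dinode.di_entries -= 1
            # Item 6: set the flag and never reset it on a later good entry.
            update = True
            continue
        # The behaviour described in item 6 amounts to "update = False" here,
        # which would lose the fact that an earlier entry was repaired.
        kept.append(name)
    dinode.entries = kept
    return update
```

With a directory holding "a", "stale" and "b" and a count of 3, removing
"stale" leaves the count at 2 and the function still reports that an
update is needed, even though the last entry examined was good.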

This version should work, but could probably stand to have more
testing.  So far I've only tested it with the failing metadata set.
I'd like to run it through some of my other gfs2 metadata in my
collection to try to shake out other similar problems.

Comment 4 Robert Peterson 2009-06-05 21:33:23 UTC
I've been working on a replacement patch.  I ran the previous patch
against several of the file system metadata sets I have in my collection
and it shook out more problems.  My latest and greatest patch fixes
many of those problems but there are still a few kinks to work out.
If you want to try my latest patch, let me know and I'll attach it.
Otherwise I'll keep working on it and post it when I'm closer.
This is getting to be a lot more involved than I ever dreamed it
would be.

Comment 6 Robert Peterson 2009-06-12 17:27:09 UTC
I've flushed out and fixed several more bugs in fsck.gfs2
since my previous post, and my latest version has passed twelve
tough fsck tests so far.  I've still got one minor problem that I
hope to fix today, and two more tests I'd like to run.

The problem with fsck.gfs2 fixes is that they can take a very long
time to test.  Luckily, debugging problems is relatively easy, unlike
the kernel code.

Many of these fixes should be crosswritten to gfs1's fsck (fsck.gfs).

The bottom line: Unless I find a major problem, I will attach my
latest patch to the bugzilla within the next four hours.

Comment 7 Robert Peterson 2009-06-12 21:45:05 UTC
Created attachment 347674 [details]
Hopefully the final patch

This is my latest and greatest.  I ended up finding and fixing the
additional problem I mentioned, but it took longer than expected.
Therefore I haven't had time to re-test the new patch on my whole
metadata collection.  If they all pass, then this is likely what
I'll use as the final patch.

Comment 8 Robert Peterson 2009-06-15 00:33:05 UTC
Created attachment 347874 [details]
The Final Patch

This patch fixes another minor problem found during testing.
The good news is that this one was able to clean up all the
messed up file systems in my GFS2 metadata collection on the
first run.  There were three sets of metadata that were too
large for my device, so I couldn't test them.

The code is basically the same as the previous patch, except
for the directory traversal code in metawalk.c.  One of my damaged
metadata sets had an inode with a set of directory leaf pointers
that were zeroed at the beginning of the data.  Those leading
zeroes were confusing the code because it had no "previous"
directory pointer to work with, so it didn't know how to
fix it.  So I added logic to find the first viable directory
pointer and fill it in, in cases where there was no "previous".
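
The leading-zero case can be sketched in Python as follows.  This is a
simplified model, not the metawalk.c code: the function name
repair_leaf_pointers is hypothetical, the pointer table is reduced to a
plain list of block numbers, and "viable" is assumed to mean simply
"nonzero".  A zeroed slot takes the previous good pointer when there is
one, and the first viable pointer when there is no "previous".

```python
from typing import List, Optional


def repair_leaf_pointers(ptrs: List[int]) -> List[int]:
    """Return a copy of ptrs with zeroed slots filled from a neighbour."""
    # Assumption: a pointer is "viable" when it is nonzero.
    first_viable: Optional[int] = next((p for p in ptrs if p != 0), None)
    if first_viable is None:
        return list(ptrs)  # nothing to copy from; leave for other repairs
    fixed: List[int] = []
    previous: Optional[int] = None
    for p in ptrs:
        if p == 0:
            # Use the previous good pointer, or the first viable one when
            # the zeroes come at the very start and no previous exists.
            p = previous if previous is not None else first_viable
        fixed.append(p)
        previous = p
    return fixed
```

For example, a table of 0, 0, 500, 500, 0, 700 comes back as
500, 500, 500, 500, 500, 700 under these assumptions.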

There's no doubt that many of these fixes should be ported back
to gfs_fsck for GFS(1).

Comment 9 Robert Peterson 2009-06-15 22:31:39 UTC
Unfortunately, I began testing the upstream version of this patch
on some different metadata and it uncovered a complex new problem
dealing with indirect extended attributes.  I need some more time
to revise and retest.  If you're not using extended attributes,
the previous patch should be fine.

Comment 10 Robert Peterson 2009-06-16 23:24:25 UTC
Created attachment 348194 [details]
Replacement patch

This patch fixes several problems I found when testing on upstream
code, with different metadata.  This patch hasn't had much testing,
but I'm doing final testing now, which takes several hours.

Comment 11 Robert Peterson 2009-06-16 23:26:40 UTC
Created attachment 348195 [details]
Upstream version of the proposed patch

This is the upstream equivalent "replacement" patch.

Comment 12 Robert Peterson 2009-06-17 03:19:21 UTC
The "replacement patch" passes and fixes all the metadata I have
on the first try.  However, due to the size of the patch, I'd like
to get the customer (and anyone else who can, for that matter) to
run the patch on whatever GFS2 file systems they have, before I
push this to the repository.

Comment 13 Robert Peterson 2009-06-17 13:37:18 UTC
I pushed the upstream patch to the master branch of the gfs2-utils
git repository and the STABLE3 branch of the cluster git repository.
I'd still like to hold off on RHEL5 until someone other than me
tries the patch.

Comment 15 Robert Peterson 2009-06-19 04:13:39 UTC
Created attachment 348598 [details]
Try 5 patch

This patch fixes another bug whereby the system was trying to
write to the file system when -n was specified.  I've run it
against all but one of my metadata sets and it still passes.
I'll check the final one in the morning.

The previous patch was updating the bitmap for all data blocks.
That's not desired, especially when -n is specified.  This version
checks the type in the existing bitmap first to see whether it
needs changing, and if so, it asks permission to do so.
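
The check-then-ask order described above can be sketched in Python.
This is an illustration only: mark_data_block, the dictionary used as a
bitmap, the state labels and the ask callback are all hypothetical and
do not correspond to libgfs2 names.  The point is that nothing is
written, and nobody is prompted, when the bitmap is already right or
when -n is in effect.

```python
from typing import Callable, Dict

# Illustrative state labels, not the on-disk bitmap values.
STATE_FREE, STATE_DATA, STATE_META = "free", "data", "meta"


def mark_data_block(bitmap: Dict[int, str], block: int, no_write: bool,
                    ask: Callable[[str], bool]) -> bool:
    """Set block to the data state only if needed and permitted.

    Returns True when the bitmap was changed.
    """
    current = bitmap.get(block, STATE_FREE)
    if current == STATE_DATA:
        return False  # already correct: no write, no prompt
    if no_write:
        return False  # -n was given: never modify the file system
    if not ask(f"Fix bitmap for block {block} ({current} -> data)?"):
        return False  # the user declined the change
    bitmap[block] = STATE_DATA
    return True
```

Passing the prompt in as a callback keeps the sketch testable; a real
tool would route it through whatever -y/-n handling it already has.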

This required the use of a function that was once in libgfs2 but
it eventually found its way to gfs2_edit.  Now it's back in libgfs2
so multiple utils can use it.

Comment 16 Robert Peterson 2009-06-19 04:19:37 UTC
Created attachment 348601 [details]
x86_64 binary

This is an x86_64 binary of the latest fsck.gfs2 if someone wants
to try it.

Comment 17 Robert Peterson 2009-06-19 11:14:23 UTC
The last test passed as well, so again I'm waiting for other people
to try it.  In order to satisfy the customer, I made the updates to
the RHEL5 version of the code, but I still need to crosswrite those
latest changes to the upstream code and test it there.  That won't
take much time.

Comment 18 Robert Peterson 2009-06-25 16:26:51 UTC
Still waiting to hear back.

Comment 19 Toure Dunnon 2009-06-30 15:35:58 UTC
According to the customer, the fsck failed.

<Snip>
(pass1b.c:347)  Checking inode 27005698 (0x19c1302)'s metatree for references to block 6758956 (0x67222c)
(pass1b.c:353)  Done checking metatree
(pass1b.c:347)  Checking inode 27005698 (0x19c1302)'s metatree for references to block 6758955 (0x67222b)
(pass1b.c:353)  Done checking metatree
(pass1b.c:347)  Checking inode 27005698 (0x19c1302)'s metatree for references to block 6758954 (0x67222a)
(pass1b.c:353)  Done checking metatree
(pass1b.c:521)  Scanning block 27005699 (0x19c1303) for inodes
(pass1b.c:521)  Scanning block 27005700 (0x19c1304) for inodes
(pass1b.c:521)  Scanning block 27005701 (0x19c1305) for inodes
(pass1b.c:521)  Scanning block 27005702 (0x19c1306) for inodes
(pass1b.c:347)  Checking inode 27005702 (0x19c1306)'s metatree for references to block 196609 (0x30001)
gfs2_fsck: bad seek: Invalid argument on line 129 of file buf.c
</Snip>

Comment 20 Robert Peterson 2009-06-30 16:13:11 UTC
Can they please do the following command and post the results here?

gfs2_edit -x -p 27005702 /dev/their/device

This may be a hardware problem, but I want to make sure fsck is
doing the right thing here, and hopefully this will tell me, although
this may lead me to other requests.

Comment 22 Toure Dunnon 2009-07-01 20:43:18 UTC
Hey Bob, just to let you know the output is attached...

Thanks,
Toure

Comment 23 Robert Peterson 2009-07-01 21:25:16 UTC
The data from comment #21 probably means there is no hardware
problem; it's more likely a bug in the patch.  Can I get a copy of
their metadata or can I get access to their system?  I'll see if
I can logically deduce what's going on in the meantime.

Comment 24 Robert Peterson 2009-07-01 21:31:29 UTC
The complete output from comment #19 would probably be enough, but
it's likely to be very big.

Comment 26 Robert Peterson 2009-07-08 18:48:36 UTC
I ported my patch from fsck.gfs2 to gfs_fsck, the gfs-1 version, and
started testing the code changes against my collection of damaged
gfs metadata.  It uncovered a bunch more shortcomings with the patch.

So I've been very busy fixing and porting back and forth from gfs2 to
gfs and re-testing the changes.  I've made a lot of progress.  I may
post another patch soon that supersedes the "try 5" patch, after I do
some gfs2 testing on it.

Comment 27 Robert Peterson 2009-07-15 22:49:01 UTC
*** Bug 506550 has been marked as a duplicate of this bug. ***

Comment 28 Robert Peterson 2009-07-15 22:55:29 UTC
Created attachment 353921 [details]
Try 6 patch

This patch corrects some problems I found when testing the GFS
crosswrite version for bug #509225.  It also fixes the problem
reported in bug #506550.  It may still need some work, but I
hope to be done with it soon.  This has had minimal testing
since the last change, so I'll likely need to spend several hours
retesting it, with both gfs and gfs2 damaged metadata sets.

Comment 29 Robert Peterson 2009-08-07 03:54:39 UTC
Created attachment 356614 [details]
Try 7 patch

This patch fixes an additional problem whereby directory sentinels
were being mistaken for corrupt blocks.  It also adds the capability
for fsck.gfs2 to truncate a directory block (preserving as many
directory entries as possible) if the data is unrecoverable.
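
One way to picture the truncation is the Python sketch below.  It is
a guess at the general idea, not the patch itself: the Dirent class,
truncate_dir_block, MIN_REC_LEN and the rule that a sentinel is a
record with inode number 0 are assumptions made for the example.
Records are kept up to the first one whose length is not sane, a
sentinel counts as a sane record, and the last kept record is
stretched so the lengths still add up to the block size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dirent:
    inum: int      # 0 marks a sentinel (unused slot), by assumption here
    rec_len: int   # bytes this record occupies in the block
    name: str = ""


MIN_REC_LEN = 8  # illustrative minimum record size, not the on-disk value


def truncate_dir_block(entries: List[Dirent], block_size: int) -> List[Dirent]:
    """Keep the leading run of sane records; drop everything after damage."""
    kept: List[Dirent] = []
    offset = 0
    for de in entries:
        sane = MIN_REC_LEN <= de.rec_len <= block_size - offset
        if not sane:
            break  # unrecoverable from here on: truncate
        # A sentinel has a sane rec_len and inum 0; it is not corruption.
        kept.append(de)
        offset += de.rec_len
    if kept and offset < block_size:
        # Stretch the last kept record over the discarded tail so the
        # record lengths still add up to the block size.
        last = kept[-1]
        kept[-1] = Dirent(last.inum, last.rec_len + block_size - offset, last.name)
    return kept
```

In this model, everything after the first damaged record is lost, which
is the trade-off implied by "preserving as many directory entries as
possible".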

Comment 30 Robert Peterson 2009-08-07 16:50:14 UTC
Created attachment 356680 [details]
Try 8 patch

This fixes a few more minor things I found in gfs2 cross-testing
and cross-checking.

Comment 31 Robert Peterson 2009-08-10 17:02:01 UTC
I pushed my latest and greatest patch to the master branch of the
gfs2-utils git tree, and the STABLE3, STABLE2 and RHEL5 branches
of the cluster git tree, for inclusion into 5.5.  I have tested
the patch extensively with a variety of customer metadata and
metadata mocked up using gfs2_edit.  The tests were performed on
systems kool and roth-01.  Changing status to Modified.

Comment 32 Robert Peterson 2009-08-17 16:19:34 UTC
Pushed to the RHEL55 branch of cluster.git.  Changing status to POST.

Comment 33 Robert Peterson 2009-08-19 14:10:57 UTC
Built according to the new procedure.  Changing to Modified.

Comment 34 Robert Peterson 2009-09-08 18:06:27 UTC
A regression was found by the upstream community (bug #521068).
I already have a fix, so I just need to respin the patch.
Changing status to FAILS_QA until that's done.

Comment 35 Robert Peterson 2009-09-08 19:12:54 UTC
A "part2" patch has been added to distcvs and the build was
successful.  The patch ID for the RHEL55 branch in git is:
863037b.  Changing status back to Modified and changing the
build fields accordingly.

Comment 39 Steve Whitehouse 2009-11-25 10:41:20 UTC
*** Bug 531771 has been marked as a duplicate of this bug. ***

Comment 40 Steve Whitehouse 2009-12-09 11:34:34 UTC
*** Bug 499333 has been marked as a duplicate of this bug. ***

Comment 41 Steve Whitehouse 2009-12-15 10:55:51 UTC
*** Bug 495799 has been marked as a duplicate of this bug. ***

Comment 45 Chris Ward 2010-02-11 10:27:29 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 48 errata-xmlrpc 2010-03-30 08:53:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0287.html