Bug 208836 - fatal: invalid metadata block
Summary: fatal: invalid metadata block
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Robert Peterson
QA Contact: GFS Bugs
URL:
Whiteboard:
Duplicates: 210367
Depends On:
Blocks: 216203
 
Reported: 2006-10-02 10:55 UTC by Stephen Willey
Modified: 2010-01-12 03:13 UTC (History)
4 users (show)

Fixed In Version: RHBA-2007-0139
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-10 21:00:00 UTC
Embargoed:


Attachments
gfs fsck standard output (43.85 KB, application/octet-stream)
2006-10-02 10:55 UTC, Stephen Willey
Experimental gfs_fsck to allow interrupt (12.93 KB, patch)
2006-10-09 23:21 UTC, Robert Peterson
Alternate patch for STABLE branch (19.97 KB, patch)
2006-10-21 19:46 UTC, Robert Peterson
gfs_fsck log from 2006/10/25 (223.54 KB, application/x-bzip2)
2006-10-25 10:14 UTC, James Braid
verbose gfs_fsck log from 2006/10/25 (365.41 KB, application/x-bzip2)
2006-10-25 18:55 UTC, James Braid
Patch to make gfs_fsck handle very big file systems better (480 bytes, patch)
2006-10-25 20:04 UTC, Robert Peterson
Patch for RHEL4U4 to bring gfs_fsck up to date (25.76 KB, patch)
2006-10-25 20:36 UTC, Robert Peterson
gfs_fsck log from 2006/10/31 (623.04 KB, application/x-bzip2)
2006-10-31 11:10 UTC, James Braid


Links
Red Hat Product Errata RHBA-2007:0139 (normal, SHIPPED_LIVE): GFS bug fix update, last updated 2007-05-10 20:59:07 UTC

Description Stephen Willey 2006-10-02 10:55:27 UTC
Description of problem:

We seem to have some filesystem corruption which was caused by a faulty RAID
enclosure. Basically, around 16MB worth of the RAID was inaccessible for a day or
two until we realised and swapped out the enclosure. The "missing" 16MB worth of
blocks were accessible again, but GFS did not deal with the whole thing very well.
We have tried running gfs_fsck twice over the last two weekends (it's a 48TB
filesystem!) but on both occasions it seems to get stuck during pass1b with
errors like "On disk count does not match..." (can't recall the exact message).
gfs_fsck would dump output and seem to be working its way through the filesystem
for about two days, and then the output would stop, a load of 1 would be observed
and gfs_fsck would take 99.9% of the CPU. After leaving the fsck for two days with
no further output I had to kill it. This filesystem is being used for nearline
storage, so it's better that we get some backups through than to leave gfs_fsck
running for a week!

The main issue is that we frequently get crashes like this:

GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
GFS: fsid=nearlineA:gfs1.0:   bh = 7139682389 (type: exp=6, found=5)
GFS: fsid=nearlineA:gfs1.0:   function = get_leaf
GFS: fsid=nearlineA:gfs1.0:   file =
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 437
GFS: fsid=nearlineA:gfs1.0:   time = 1159784525
GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster
GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O
GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=nearlineA:gfs1.0: withdrawn

I'm attaching the standard output from a gfs_fsck run, up until I killed it due
to 2 days without output. Unfortunately I didn't grab the std error.

What I want to know is: should we let gfs_fsck run and run even if it appears to
be doing very little, or is there a manual way to wipe the affected areas of the
filesystem? Even if we deallocate those areas and lose the data, that's fine as
long as it stops the crashes.


Additional info:

On our two-node cluster, here's the complete list of occurrences over the last few
days:

[root@ns1a ~]# grep -A 1 -B 1 "bh =" /var/log/messages*
/var/log/messages-Oct  2 01:22:56 ns1a kernel: GFS: fsid=nearlineA:gfs1.1:
fatal: invalid metadata block
/var/log/messages:Oct  2 01:22:56 ns1a kernel: GFS: fsid=nearlineA:gfs1.1:   bh
= 9230177608 (type: exp=4, found=5)
/var/log/messages-Oct  2 01:22:56 ns1a kernel: GFS: fsid=nearlineA:gfs1.1:  
function = gfs_get_meta_buffer
--
/var/log/messages.1-Sep 28 02:29:20 ns1a kernel: GFS: fsid=nearlineA:gfs1.0:
fatal: invalid metadata block
/var/log/messages.1:Sep 28 02:29:20 ns1a kernel: GFS: fsid=nearlineA:gfs1.0:  
bh = 7139682396 (type: exp=4, found=5)
/var/log/messages.1-Sep 28 02:29:20 ns1a kernel: GFS: fsid=nearlineA:gfs1.0:  
function = gfs_get_meta_buffer
--
/var/log/messages.1-Sep 29 10:25:02 ns1a kernel: GFS: fsid=nearlineA:gfs1.0:
fatal: invalid metadata block
/var/log/messages.1:Sep 29 10:25:02 ns1a kernel: GFS: fsid=nearlineA:gfs1.0:  
bh = 9230177565 (type: exp=4, found=6)
/var/log/messages.1-Sep 29 10:25:02 ns1a kernel: GFS: fsid=nearlineA:gfs1.0:  
function = gfs_get_meta_buffer


[root@ns2a ~]# grep -A 1 -B 1 "bh =" /var/log/messages*
/var/log/messages-Oct  2 00:14:18 ns2a kernel: GFS: fsid=nearlineA:gfs1.0:
fatal: invalid metadata block
/var/log/messages:Oct  2 00:14:18 ns2a kernel: GFS: fsid=nearlineA:gfs1.0:   bh
= 9230177555 (type: exp=4, found=6)
/var/log/messages-Oct  2 00:14:18 ns2a kernel: GFS: fsid=nearlineA:gfs1.0:  
function = gfs_get_meta_buffer
--
/var/log/messages-Oct  2 11:22:05 ns2a kernel: GFS: fsid=nearlineA:gfs1.0:
fatal: invalid metadata block
/var/log/messages:Oct  2 11:22:05 ns2a kernel: GFS: fsid=nearlineA:gfs1.0:   bh
= 7139682389 (type: exp=6, found=5)
/var/log/messages-Oct  2 11:22:05 ns2a kernel: GFS: fsid=nearlineA:gfs1.0:  
function = get_leaf
--
/var/log/messages.1-Sep 28 19:43:16 ns2a kernel: GFS: fsid=nearlineA:gfs1.1:
fatal: invalid metadata block
/var/log/messages.1:Sep 28 19:43:16 ns2a kernel: GFS: fsid=nearlineA:gfs1.1:  
bh = 5775125502 (type: exp=5, found=4)
/var/log/messages.1-Sep 28 19:43:16 ns2a kernel: GFS: fsid=nearlineA:gfs1.1:  
function = gfs_get_meta_buffer
--
/var/log/messages.1-Sep 29 04:36:32 ns2a kernel: GFS: fsid=nearlineA:gfs1.1:
fatal: invalid metadata block
/var/log/messages.1:Sep 29 04:36:32 ns2a kernel: GFS: fsid=nearlineA:gfs1.1:  
bh = 7139682389 (type: exp=6, found=5)
/var/log/messages.1-Sep 29 04:36:32 ns2a kernel: GFS: fsid=nearlineA:gfs1.1:  
function = get_leaf

Comment 1 Stephen Willey 2006-10-02 10:55:27 UTC
Created attachment 137533 [details]
gfs fsck standard output

Comment 3 Robert Peterson 2006-10-03 18:03:12 UTC
Without having the file system here to analyze, it's impossible to
say whether gfs_fsck is processing normally or in some kind of infinite
loop.  All indications are that it's processing normally.
The gfs_fsck output had 16752 messages from pass1b that said:
"Found dup block".  As long as that block number keeps going up,
gfs_fsck is not hung.  In pass1b, the code is trying to identify
metadata in the file system that points to the same physical blocks.
To identify duplicate blocks, it uses bitmap information in memory
that was built during pass1.  With a 48TB file system, it's likely
that your system doesn't have enough RAM to hold these bitmaps in
memory, and therefore has to swap to disk for the bitmap information.
This extra swapping will be especially time consuming during pass1b,
where it references the bitmaps.  Adding memory might make it go
faster, but I realize that may or may not be practical.
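
For illustration only, here is a minimal sketch (not the actual gfs_fsck code) of
how a packed per-block bitmap can flag blocks that are referenced more than once.
The two-bits-per-block encoding, the function names, and the memory estimate in
the comments are assumptions for the example; the real tool tracks more state per
block.

---

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Two bits of state per block: 0 = free, 1 = in use, 2 = duplicate.
 * Even at two bits per block, a 48TB file system with 4KB blocks has
 * roughly 12.9 billion blocks, so the map alone is about 3GB, which is
 * why it can spill into swap on a machine like this one. */
#define BLK_FREE 0
#define BLK_USED 1
#define BLK_DUP  2

struct blockmap {
    uint8_t  *bits;     /* packed 2-bit entries, 4 per byte */
    uint64_t  nblocks;  /* total blocks in the file system  */
};

static int blockmap_get(const struct blockmap *bm, uint64_t blk)
{
    return (bm->bits[blk / 4] >> ((blk % 4) * 2)) & 0x3;
}

static void blockmap_set(struct blockmap *bm, uint64_t blk, int state)
{
    int shift = (int)(blk % 4) * 2;

    bm->bits[blk / 4] &= (uint8_t)~(0x3 << shift);
    bm->bits[blk / 4] |= (uint8_t)(state << shift);
}

/* Called for every block reference found while walking the metadata in
 * pass 1; a second reference marks the block as a duplicate, which is
 * what pass 1b later has to resolve. */
static void mark_block(struct blockmap *bm, uint64_t blk)
{
    if (blockmap_get(bm, blk) == BLK_FREE) {
        blockmap_set(bm, blk, BLK_USED);
    } else {
        blockmap_set(bm, blk, BLK_DUP);
        printf("Found dup block %llu\n", (unsigned long long)blk);
    }
}

int main(void)
{
    struct blockmap bm;

    bm.nblocks = 1000000;                     /* toy size for the demo */
    bm.bits = calloc((size_t)(bm.nblocks + 3) / 4, 1);
    if (!bm.bits)
        return 1;

    mark_block(&bm, 42);
    mark_block(&bm, 42);                      /* prints "Found dup block 42" */
    free(bm.bits);
    return 0;
}

---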


Comment 4 Stephen Willey 2006-10-03 18:10:10 UTC
True, it is chewing 11 gigs of swap. The 16752 messages about the "dup" blocks
stopped in that gfs_fsck output, and then the fsck continued running for another
2 days without adding any more output. That's when I got suspicious and killed it
off. I did an strace -p of the process first and it did seem to be doing lots of
lseeks and reads, but it seemed to be working on the same block over and over
again for around a minute before moving on to another block.

We will just have to limp through the rest of this week and try the filesystem
check again over the weekend. I'll make sure to record the std error this time too.

Comment 5 Stephen Willey 2006-10-09 16:14:10 UTC
Okay, we have now been running the fsck for 5 days, but this time with "-v". It
appears that after finding lots of DUP or bad blocks it then looks at each inode
looking for a reference to each block. As there are quite a few blocks, it is
taking around a minute to go from one inode to the next. We are now up to inode
145,000,000! It looks like this is looking at all inodes, used and unused, on the
filesystem. In a 48TB GFS filesystem how many inodes are there likely to be?
Should we wait for this fsck to finish or is there something else we can do to
skip/speed up this operation?

Comment 6 Robert Peterson 2006-10-09 23:21:08 UTC
Created attachment 138096 [details]
Experimental gfs_fsck to allow interrupt

I have attached a patch you can try.  There are three improvements:

1. This patch will allow you to interrupt the different passes and
   skip to the next.  It won't allow you to interrupt pass 1 because 
   the other passes rely upon information from pass 1.
   Since pass 1b is duplicate block checking, it shouldn't be
   harmful if you interrupt it and skip to the next pass.
2. I've combined two passes through the file system in pass1b,
   allowing it to do its job faster.  This should be considered
   experimental until we can do further testing here.
3. It also reports percentage complete for more of the passes.

The patch is not elegant, but I wanted to get something to you
as quickly as possible.
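
To illustrate the interrupt handling described in improvement 1 above, here is a
minimal, hypothetical sketch of catching ctrl-c (SIGINT) between work units and
then asking whether to abort, continue, or skip the rest of the pass. This is not
the attached patch; the function names and prompt wording are made up for the
example.

---

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static volatile sig_atomic_t interrupted;

static void interrupt_handler(int sig)
{
    (void)sig;
    interrupted = 1;          /* just set a flag; ask the user later */
}

/* Called between blocks so a pass is never stopped mid-update.
 * Returns 1 if the user chose to skip the rest of the pass. */
static int skip_rest_of_pass(const char *pass, unsigned long long block,
                             unsigned long long total)
{
    char answer[8];

    if (!interrupted)
        return 0;
    interrupted = 0;

    printf("\n%s interrupted at block %llu of %llu.\n", pass, block, total);
    printf("(a)bort, (c)ontinue, or (s)kip the rest of this pass? ");
    fflush(stdout);
    if (!fgets(answer, sizeof(answer), stdin))
        return 0;
    if (answer[0] == 'a')
        exit(1);
    return answer[0] == 's';
}

int main(void)
{
    signal(SIGINT, interrupt_handler);

    for (unsigned long long blk = 0; blk < 1000000ULL; blk++) {
        if (skip_rest_of_pass("pass1b", blk, 1000000ULL))
            break;
        /* ... per-block duplicate processing would go here ... */
    }
    return 0;
}

---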

I have a few additional ideas for improvements, such as
using dynamic memory compression of the bitmap memory if gfs_fsck
runs into excessive memory pressure.  This might make gfs_fsck
run faster, if we can keep it from swapping to disk, provided
the file system usage is not extremely fragmented.  On the other
hand, it might make it run slower because it would take more cpu
time to decompress the bitmaps on the fly.  It would depend on what
the bitmaps look like for the file system and how well they compress.  
Unfortunately, that kind of improvement will take some time to 
implement.

Comment 7 James Braid 2006-10-11 10:33:09 UTC
FYI the original gfs_fsck from about a week ago is still running. We're just
seeing loads of:

Done checking inode x's metatree for references to block x

messages from gfs_fsck (it's running with -v). It's up to about 270M inodes so
far. In theory all those inodes should be actual files (not hardlinks), correct?
This filesystem does have a huge number of hardlinks because it's used for doing
rsync hardlink snapshots.

At the moment we're just going to let the fsck run to completion unless you
think we should stop it and use the patched gfs_fsck?

Comment 8 Robert Peterson 2006-10-11 12:57:47 UTC
The hard links are only directory entries in the file system, with
no extra inodes.  The inodes are for actual files, directories,
devices and such.  At this point, you're probably better off letting
it run to completion.  However, be aware that gfs_fsck runs slower
in verbose (-v) mode.


Comment 9 James Braid 2006-10-19 12:48:07 UTC
(In reply to comment #6)
> Created an attachment (id=138096) [edit]
> Experimental gfs_fsck to allow interrupt
> 
> I have attached a patch you can try.  There are three improvements:
> 

Hey Robert - I'm trying to patch this into the latest tarball release (1.03.00),
and it fails. Any chance you could make a patch against that for us?



Comment 10 Lenny Maiorani 2006-10-19 16:27:32 UTC
I am trying to use this patch against RHEL4 Update4 and that won't work either.
Can you create a patch for that?

Comment 11 Robert Peterson 2006-10-21 19:46:23 UTC
Created attachment 139068 [details]
Alternate patch for STABLE branch

This patch should work for both RHEL4U4 and STABLE branches,
since they are identical code in cvs.

Comment 12 James Braid 2006-10-24 17:50:40 UTC
(In reply to comment #11)
> This patch should work for both RHEL4U4 and STABLE branches,
> since they are identical code in cvs.
 
Hey Robert,

We ended up patching the latest tarball release (1.03.00) and restarting the
fsck. It died with the following output:

---

  RG 46205 at block 0xFFF05700 intact [length 0x3b8cf]
  RG 46206 at block 0xFFF40FCF intact [length 0x3b8cf]
  RG 46207 at block 0xFFF7C89E intact [length 0x3b8cf]
  RG 46208 at block 0xFFFB816D intact [length 0x3b8cf]
  RG 46209 at block 0xFFFF3A3C intact [length 0x3b8cf]
* RG 46210 at block 0x10002F30B *** DAMAGED *** [length 0x3b8cf]
* RG 46211 at block 0x10006ABDA *** DAMAGED *** [length 0x3b8cf]
* RG 46212 at block 0x1000A64A9 *** DAMAGED *** [length 0x3b8cf]
* RG 46213 at block 0x1000E1D78 *** DAMAGED *** [length 0x3b8cf]
Error: too many bad RGs.
Unable to fill in resource group information.

---

Is there anything we can do to recover from this?

Comment 13 James Braid 2006-10-24 17:53:57 UTC
One more note - prior to using the patched 1.03.00 gfs_fsck, we'd been using
version 6.1.5.

I think that's why we didn't see these messages earlier; the 6.1.5 fsck
doesn't seem to do as many checks when it starts up.



Comment 14 Robert Peterson 2006-10-24 18:51:22 UTC
The more recent versions of gfs_fsck go out of their way to verify that
the resource group information for the GFS file system is in good shape.
These changes were done for bugzilla bug 179069.  Now gfs_fsck has the
ability to detect (and in most cases fix) these kinds of corruption.
These messages indicate that several of the RGs in your FS are corrupt.
Earlier versions of gfs_fsck didn't detect this kind of corruption.

I really wish I had access to your system so I could use gfs_edit to
poke around and investigate the file system by hand.  It's possible that
gfs_fsck just couldn't adequately figure out where the RGs were
due to corruption in the rg index or something.

Unfortunately, this isn't easy to figure out except by hand.

For now I recommend running this version of gfs_fsck in verbose mode
and collecting the output.  That will tell me where it thinks the RGs
should be, and why it can't rebuild it.


Comment 15 James Braid 2006-10-25 10:14:59 UTC
Created attachment 139331 [details]
gfs_fsck log from 2006/10/25

Comment 16 James Braid 2006-10-25 10:16:16 UTC
I've attached the verbose output from the patched gfs_fsck. We can probably
arrange to give you access to the machine if it would help.

Comment 17 James Braid 2006-10-25 10:17:56 UTC
Oops... it seems that output wasn't very verbose. I'm running the fsck again.

Comment 18 James Braid 2006-10-25 18:55:09 UTC
Created attachment 139382 [details]
verbose gfs_fsck log from 2006/10/25

Comment 19 James Braid 2006-10-25 18:55:51 UTC
I've attached the log from a gfs_fsck -vvv run to the bug.

Comment 20 Robert Peterson 2006-10-25 19:11:58 UTC
Hi James,

Well, I've got some (sort of) good news for you.  
(1) The info that you attached last time was good enough to tell
me what's happening.  If you look at the output, you'll see it
complaining about an RG at block #0x646a7331.  That's not a valid
RG entry.  The closest RG is #0x646aa05b.  It just needs to figure
that out properly in order to fix it.

(2) It seems you've encountered a new bug in gfs_fsck in the code that
tries to repair RGs.  I'll get it fixed as quickly as I can, and
post it here.  Shouldn't be a difficult fix.  Unfortunately, your
file system had been extended once with gfs_grow, and that makes it a
lot harder for gfs_fsck to figure out where the RGs are supposed to
be and repair that kind of damage.  The "Too many corrupt entries"
message is supposed to be a failsafe if gfs_fsck loses its way when
repairing rg index damage, so it doesn't accidentally make its own
corruption.  It's got a minor problem that I should be able to fix
easily.

(3) The corruption in your file system is definitely one corrupt entry
in your rg index.  Hopefully the damage doesn't go beyond that, and
once I get gfs_fsck fixed, it should repair the damage.  That means
the rest of your data is probably intact, which is better than I had
feared.

I'll see if I can get you a better gfs_fsck and attach it to the
bugzilla record.  Hopefully even today.
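
As a rough illustration of the kind of check involved in deciding whether an rg
index entry is valid, the sketch below verifies that the block an entry points at
begins with something that looks like a resource group header. The structure,
constants and simulated data here are simplified stand-ins, not the real GFS
on-disk definitions or the logic of the attached patches.

---

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for a GFS metadata header; the real layout and
 * constant values live in the gfs headers, and these are assumptions
 * made for the example. */
#define EXAMPLE_MAGIC        0x01161970u
#define EXAMPLE_METATYPE_RG  2u

struct meta_header {
    uint32_t magic;
    uint32_t type;
};

/* An rgindex entry is trusted only if the block it points at starts
 * with a resource group header; an entry like the one pointing at
 * 0x646a7331 (instead of the real RG at 0x646aa05b) would fail this. */
static int rg_header_valid(const struct meta_header *mh)
{
    return mh->magic == EXAMPLE_MAGIC && mh->type == EXAMPLE_METATYPE_RG;
}

int main(void)
{
    /* Pretend these were read from disk at the two addresses above:
     * a genuine RG header and an ordinary data block. */
    struct meta_header good = { EXAMPLE_MAGIC, EXAMPLE_METATYPE_RG };
    struct meta_header bad  = { 0x64617461u, 0u };

    printf("entry at 0x646aa05b valid: %d\n", rg_header_valid(&good));
    printf("entry at 0x646a7331 valid: %d\n", rg_header_valid(&bad));
    return 0;
}

---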


Comment 21 Lenny Maiorani 2006-10-25 20:03:17 UTC
(In reply to comment #11)
> Created an attachment (id=139068) [edit]
> Alternate patch for STABLE branch
> 
> This patch should work for both RHEL4U4 and STABLE branches,
> since they are identical code in cvs.
> 

RHEL4U4 and STABLE are not exactly the same. That patch will not work for
RHEL4U4. Can you provide one for RHEL4U4? 



Comment 22 Robert Peterson 2006-10-25 20:04:11 UTC
Created attachment 139399 [details]
Patch to make gfs_fsck handle very big file systems better

This patch fixes a problem I found with gfs_fsck using 32-bit block
numbers where it should use 64-bit.  This was what caused gfs_fsck
to incorrectly report that RG blocks with block numbers > 0xffffffff
were corrupt when they were not, and then abort.  This doesn't, however,
fix the problem with the corrupt RG that was reported.  This patch
should allow gfs_fsck to repair the damaged RG index properly, though.

This could affect other GFS user-space tools such as gfs_mkfs as well.
Please apply this patch to gfs/include/osi_user.h, then do "make clean"
and "make install" from the cluster/gfs directory to make sure all the
tools pick it up.  Then re-run gfs_fsck and hopefully it will repair
the RG index properly.  Once the RG index is repaired, any further
damage (if any) should be detected and fixed.

The logs were very helpful: they told me that the RG index is
incorrectly pointing to an RG at block 0x646a7331 whereas the closest
one was really at 0x646aa05b.  None of this tells us anything about
how the original RG index got corrupted.  That is still unknown.

Comment 23 James Braid 2006-10-25 20:31:23 UTC
(In reply to comment #22)
> Created an attachment (id=139399) [edit]
> Patch to make gfs_fsck handle very big file systems better
> 
> This patch fixes a problem I found with gfs_fsck using 32-bit block
> numbers where it should use 64-bit.  This was what caused gfs_fsck
> to incorrectly report that RG blocks with block numbers > 0xffffffff
> were corrupt when they were not, and aborting.	This doesn't, however,
> fix the problem with the corrupt RG that was reported.	This patch
> should allow gfs_fsck to repair the damaged RG index properly though.

Thanks for that, I've applied the patch and am re-running the fsck now.


Comment 24 Robert Peterson 2006-10-25 20:36:16 UTC
Created attachment 139404 [details]
Patch for RHEL4U4 to bring gfs_fsck up to date

This patch includes all fixes for this bz, against RHEL4U4.
This also includes the fixes for bz 156009.
My apologies, Lenny, for not attaching this earlier.

Comment 25 James Braid 2006-10-25 21:26:56 UTC
(In reply to comment #22)
> Created an attachment (id=139399) [edit]
> Patch to make gfs_fsck handle very big file systems better

I'm guessing that change from 32 to 64-bit block numbers is going to increase
the memory usage? This system has 16GB of memory and 2GB of swap (48TB
filesystem) and it just died trying to allocate a bitmap. I've added more swap
and am re-running the fsck... more news soon :)


Comment 26 James Braid 2006-10-25 21:34:05 UTC
(In reply to comment #25)
> I'm guessing that change from 32 to 64-bit block numbers is going to increase
> the memory usage? This system has 16GB of memory and 2GB of swap (48TB
> filesystem) and it just died trying to allocate a bitmap. I've added more swap
> and am re-running the fsck... more news soon :)

More swap didn't help.  It's still dying in the same place.  Here's the error
message; I'm running a verbose fsck now, so I'll probably get the log to you
tomorrow (about to go home).

Initializing fsck
Unable to allocate bitmap of size 1601697153



Comment 27 Robert Peterson 2006-10-25 22:14:09 UTC
What is the architecture of your cluster nodes?  An "unsigned long"
should evaluate to 8 bytes (uint64_t) on a 64-bit machine, so it
theoretically shouldn't matter there.  On a 32-bit node, "unsigned long"
evaluates to 4 bytes.  The maximum size of a GFS file system on
a cluster with 32-bit nodes is only 16TB, due to the 32-bit block
numbers.

See: http://sources.redhat.com/cluster/faq.html#gfs_maxsize

This is true for both kernel and user space.  If you have
32-bit (e.g. x86) nodes in the cluster, they will most likely corrupt
data if they try to read or write data stored beyond 16TB.
That's both gfs_fsck and the GFS file system itself.
A cluster of all 64-bit nodes (e.g. x86_64) should not have had a
problem because it should already be using a 64-bit block number.  
With a file system over 16TB, gfs_fsck should only be run from a
64-bit node.
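
For reference, here is a quick sketch of where the 16TB figure comes from,
assuming the default 4KB GFS block size; it also prints the size of "unsigned
long" on the machine it is run on.

---

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* With 32-bit block numbers and the default 4KB GFS block size,
     * the largest addressable file system is 2^32 * 4096 bytes. */
    uint64_t block_size = 4096;
    uint64_t max_blocks = (uint64_t)1 << 32;
    uint64_t limit      = max_blocks * block_size;

    printf("sizeof(unsigned long) on this machine: %zu bytes\n",
           sizeof(unsigned long));
    printf("32-bit block-number limit: %llu bytes (%llu TB)\n",
           (unsigned long long)limit,
           (unsigned long long)(limit >> 40));
    return 0;
}

---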


Comment 28 James Braid 2006-10-25 22:20:17 UTC
Doh! I've accidentally built this latest gfs_fsck on an x86 machine instead of an
x86_64 one. Your note reminded me of that.

Sorry for the false alarm. Will rebuild on x86_64 tomorrow.

Comment 29 James Braid 2006-10-31 11:08:18 UTC
(In reply to comment #22)
> The logs were very helpful: they told me that the RG index is
> incorrectly pointing to an RG at block 0x646a7331 whereas the closest
> one was really at 0x646aa05b.  None of this tells us anything about
> how the original RG index got corrupted.  That is still unknown.

The original RG probably got corrupted when we had one of the underlying RAID
devices fail.

gfs_fsck's been sitting at 1% of this pass for over 24 hours now (I just tried
ctrl-c'ing it to invoke the "skip pass" bits, but it would seem it just killed
gfs_fsck instead :()

Here's a snip of the logs; I'll attach the full ones.

---

Leaf(254937628) entry count in directory 254937621 doesn't match number of entries found - is 35, found 0
Leaf entry count updated
Leaf(254937638) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937631) entry count in directory 254937621 doesn't match number of entries found - is 47, found 0
Leaf entry count updated
Leaf(254937645) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937626) entry count in directory 254937621 doesn't match number of entries found - is 38, found 0
Leaf entry count updated
Leaf(254937639) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937632) entry count in directory 254937621 doesn't match number of entries found - is 48, found 0
Leaf entry count updated
Leaf(254937641) entry count in directory 254937621 doesn't match number of entries found - is 44, found 0
Leaf entry count updated
Leaf(254937627) entry count in directory 254937621 doesn't match number of entries found - is 39, found 0
Leaf entry count updated
Leaf(254937642) entry count in directory 254937621 doesn't match number of entries found - is 41, found 0
Leaf entry count updated
Leaf(254937648) entry count in directory 254937621 doesn't match number of entries found - is 21, found 0
Leaf entry count updated
Leaf(254937629) entry count in directory 254937621 doesn't match number of entries found - is 29, found 0
Leaf entry count updated
Leaf(254937644) entry count in directory 254937621 doesn't match number of entries found - is 47, found 0

---

If we give you access to the machine with the filesystem, do you think it would
be possible to nuke the corrupted RG, or do something (sorry, I don't know much
about the GFS layout or internals) so the repair can complete quicker? Or
perhaps there are some commands we could run?

We're not too worried about losing a single (or even a few) RGs' worth of data.
At the moment we just need to get the filesystem back online so some data (about
8TB of the 48TB) can be archived, before we blow the FS away and start again.


Comment 30 James Braid 2006-10-31 11:10:59 UTC
Created attachment 139831 [details]
gfs_fsck log from 2006/10/31

Comment 31 Robert Peterson 2006-11-06 16:06:44 UTC
I can't think of a way to do this faster or better.
The problem is that there are multiple directories in good areas
of the volume that are referencing data that was on the area of
volume that was replaced.  The gfs_fsck program has to go through
them one by one and fix each one; something it can do much faster
than patching by hand.

Also, if ctrl-c ended the gfs_fsck program, then you probably
aren't running with the patch I attached last.  Perhaps you compiled
it with "make" but didn't do "make install" or are running the
original version for some reason.


Comment 32 Ryan O'Hara 2006-11-14 17:09:09 UTC
*** Bug 210367 has been marked as a duplicate of this bug. ***

Comment 33 Robert Peterson 2006-11-16 15:50:45 UTC
Since there have been no reported problems with the patch, I'm planning
to commit the changes to gfs_fsck.  Here's a list of what this fix
actually does in more detail:

1. Pass 1b had a memory leak where it never returned memory allocated
   to duplicate blocks that were discovered.  If you had a bad RG with
   lots of duplicate blocks, it could leak a significant amount of 
   memory.  The fix frees up that memory at the end of pass 1b, which
   might theoretically result in less swapping to disk due to memory
   pressure, which in turn might mean a slight performance increase
   for later passes, such as the lengthy pass 5.
2. When processing duplicate blocks, Pass 1b was making two passes
   through the file system:  (1) one to look for inodes referencing 
   the duplicate block, and (2) one to look for directory entries
   referencing each duplicate block.  I've combined those two passes
   into one, which should cut the amount of time spent on each 
   duplicate in half.  It adds up in a multi-terabyte file system.
3. In comment #22 of this bz, I referenced a "bug" that restricted
   gfs_fsck from processing very large file systems properly.
   In fact, this is not a bug at all, because the value in question
   is plenty big enough to handle it if gfs_fsck is run from a 64-bit
   machine, and it's a well-known fact that gfs_fsck shouldn't be
   run from 32-bit machines if the file system is over 16TB.
   Increasing the block numbers to 64-bit values would increase
   memory requirements on the 32-bit machine, possibly pushing us into
   swapping and therefore worse performance.  Plus, there are likely
   to be similar problems we haven't yet discovered when running 
   gfs_fsck on a file system bigger than 16TB from a 32-bit machine.
   Therefore, I put the value back to "long int" so that it expands
   correctly the way it used to for each architecture.  However, I
   added a check to make sure the system architecture can handle the 
   file system (see the sketch after this list).  So with this fix, if
   you run gfs_fsck from a 32-bit machine on a file system bigger than
   16TB, it will give you an error message and exit.
4. The new gfs_fsck allows the users to interrupt processing by hitting
   ctrl-c in the middle.  It will now tell you where it is in the
   processing and give you the option to a=abort, c=continue or s=skip 
   the rest of that pass and go on to the next pass.  Pass 1 is an 
   exception to this rule and cannot be skipped.
   That way, if users get into a very long pass, such as the duplicate
   processing in pass 1b, they can instruct gfs_fsck to skip that pass
   and go on to the next.  Although we can't ensure file system
   integrity when passes are skipped, it will allow users to get their 
   critical file systems back running with minimal exposure and then
   schedule down time later to redo the pass they skipped over.  Also,
   since interrupting it reports on its progress, it will allow the 
   users to make sure it's still working and not hung up in some 
   infinite loop.
5. I added more "percent complete" messages to the other passes to 
   reassure users of the progress on fixing the file system.
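
A minimal sketch of the kind of architecture check described in item 3, assuming
a 4KB block size; the function name and the message wording are made up for
illustration and do not reflect the committed code.

---

#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical version of the check in item 3: refuse to run when the
 * file system has more blocks than an "unsigned long" can represent on
 * this architecture (i.e. 32-bit machines and file systems over 16TB). */
static void check_architecture(uint64_t fs_total_blocks)
{
    if (fs_total_blocks > (uint64_t)ULONG_MAX) {
        fprintf(stderr, "This file system is too large to be checked "
                "from this machine.\nPlease run gfs_fsck from a "
                "64-bit node.\n");
        exit(1);
    }
}

int main(void)
{
    /* A 48TB file system with 4KB blocks: about 12.9 billion blocks. */
    uint64_t nblocks = (48ULL << 40) / 4096;

    check_architecture(nblocks);
    printf("Architecture check passed for %llu blocks.\n",
           (unsigned long long)nblocks);
    return 0;
}

---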

Note that this does not fix whatever caused the file system to become
corrupted in the first place.  That may be impossible to figure out
unless we're given a reliable way to recreate the corruption in our lab.


Comment 34 Robert Peterson 2006-11-17 17:11:44 UTC
The fix was committed to CVS for RHEL4, STABLE, HEAD and RHEL5, but not 
RHEL50.  The exclusion of RHEL50 means that the fix won't go into the
GA release of RHEL5, unless necessary, but it should make the first 
RHEL5 update.


Comment 40 Red Hat Bugzilla 2007-05-10 21:00:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0139.html


