Bug 208836
Summary: | fatal: invalid metadata block | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Stephen Willey <stephen.willey> |
Component: | gfs | Assignee: | Robert Peterson <rpeterso> |
Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs> |
Severity: | urgent | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | daire.byrne, jamesb, lenny, rkenna |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0139 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-05-10 21:00:00 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 216203 | ||
Attachments: |
Description
Stephen Willey
2006-10-02 10:55:27 UTC
Created attachment 137533 [details]
gfs fsck standard output
Without having the file system here to analyze, it's impossible to say whether gfs_fsck is processing normally or stuck in some kind of infinite loop. All indications are that it's processing normally. The gfs_fsck output had 16752 messages from pass1b that said "Found dup block". As long as that block number keeps going up, gfs_fsck is not hung. In pass1b, the code is trying to identify metadata in the file system that points to the same physical blocks. To identify duplicate blocks, it uses bitmap information in memory that was built during pass1. With a 48TB file system, it's likely that your system doesn't have enough RAM to hold these bitmaps in memory, and therefore has to swap to disk for the bitmap information. This extra swapping will be especially time consuming during pass1b, where the bitmaps are referenced. Adding memory might make it go faster, but I realize that may or may not be practical.

True, it is chewing 11GB of swap. The 16752 messages about the "dup" blocks stopped in that gfs_fsck output, and then the fsck continued running for another 2 days without adding any more output. That's when I got suspicious and killed it off. I did an strace -p of the process first, and it did seem to be doing lots of lseek and read calls, but it appeared to work on the same block over and over again for around a minute before moving on to another block. We will just have to limp through the rest of this week and try the filesystem check again over the weekend. I'll make sure to record the standard error this time too.

Okay, we have now been running the fsck for 5 days, but this time with "-v". It appears that after finding lots of DUP or bad blocks, it then examines each inode looking for a reference to each block. As there are quite a few blocks, it is taking around a minute to go from inode 1 to 2, and so on. We are now up to inode 145,000,000! It looks like this is examining all inodes, used and unused, on the filesystem. In a 48TB GFS filesystem, how many inodes are there likely to be?
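The pass 1 / pass 1b bookkeeping described above can be modeled roughly as follows. This is an illustrative sketch only, not gfs_fsck's actual code; the `BlockMap` class and its method names are invented for the example. Pass 1 marks every referenced block in a bitmap, and a second reference to an already-marked block is what gets reported as a "dup block".

```python
class BlockMap:
    """Tracks per-block reference state with two in-memory bitmaps:
    'seen' (block referenced at least once) and 'dup' (referenced twice+).
    For a 48TB file system these bitmaps are large, which is why the real
    tool can be pushed into swap."""

    def __init__(self, num_blocks):
        self.seen = bytearray((num_blocks + 7) // 8)
        self.dup = bytearray((num_blocks + 7) // 8)

    def _get(self, bm, blk):
        return (bm[blk >> 3] >> (blk & 7)) & 1

    def _set(self, bm, blk):
        bm[blk >> 3] |= 1 << (blk & 7)

    def mark_reference(self, blk):
        """Record one metadata reference to blk.
        Returns True if blk was already referenced (a duplicate)."""
        if self._get(self.seen, blk):
            self._set(self.dup, blk)
            return True
        self._set(self.seen, blk)
        return False

    def duplicates(self):
        """Blocks that pass 1b would have to resolve."""
        return [b for b in range(len(self.seen) * 8) if self._get(self.dup, b)]
```

Pass 1b then walks the file system again, checking each inode and directory entry against the `dup` bitmap, which is why its run time scales with the inode count.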
Should we wait for this fsck to finish, or is there something else we can do to skip or speed up this operation?

Created attachment 138096 [details]
Experimental gfs_fsck to allow interrupt
I have attached a patch you can try. There are three improvements:
1. This patch will allow you to interrupt the different passes and
skip to the next. It won't allow you to interrupt pass 1 because
the other passes rely upon information from pass 1.
Since pass 1b is duplicate block checking, it shouldn't be
harmful if you interrupt it and skip to the next pass.
2. I've combined two passes through the file system in pass1b,
allowing it to do its job faster. This should be considered
experimental until we can do further testing here.
3. It also reports percentage complete for more of the passes.
The patch is not elegant, but I wanted to get something to you
as quickly as possible.
I have a few additional ideas for improvements, such as
dynamically compressing the bitmap memory when gfs_fsck
runs into excessive memory pressure. This might make gfs_fsck
run faster, if we can keep it from swapping to disk, provided
the file system usage is not extremely fragmented. On the other
hand, it might make it run slower because it would take more cpu
time to decompress the bitmaps on the fly. It would depend on what
the bitmaps look like for the file system and how well they compress.
Unfortunately, that kind of improvement will take some time to
implement.
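The compression idea is only floated here, but the intuition is easy to show: allocation bitmaps tend to be dominated by long runs of identical bytes (all-free or all-used regions), which run-length encoding shrinks dramatically, while the CPU cost comes from having to decode runs on every lookup. This is a hypothetical sketch, not anything gfs_fsck implements.

```python
def rle_encode(bitmap: bytes):
    """Run-length encode a bitmap as [value, run_length] pairs."""
    runs = []
    for b in bitmap:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_decode(runs) -> bytes:
    """Inverse of rle_encode; a real implementation would decode lazily,
    paying CPU on each bitmap lookup instead of holding it all expanded."""
    return b"".join(bytes([v]) * n for v, n in runs)

# A mostly-empty bitmap with one allocated region compresses to 3 runs.
bitmap = b"\x00" * 100_000 + b"\xff" * 64 + b"\x00" * 100_000
compressed = rle_encode(bitmap)
```

As the comment above notes, a heavily fragmented file system would produce many short runs, and the encoding could gain little while still costing decode time.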
FYI, the original gfs_fsck from about a week ago is still running. We're just seeing loads of "Done checking inode x's metatree for references to block x" messages from gfs_fsck (it's running with -v). It's up to about 270M inodes so far. In theory all those inodes should be actual files (not hardlinks), correct? This filesystem does have a huge number of hardlinks because it's used for doing rsync hardlink snapshots. At the moment we're just going to let the fsck run to completion, unless you think we should stop it and use the patched gfs_fsck?

The hard links are only directory entries in the file system, with no extra inodes. The inodes are for actual files, directories, devices and such. At this point, you're probably better off letting it run to completion. However, be aware that gfs_fsck runs slower in verbose (-v) mode.

(In reply to comment #6)
> Created an attachment (id=138096) [edit]
> Experimental gfs_fsck to allow interrupt
>
> I have attached a patch you can try. There are three improvements:

Hey Robert - I'm trying to patch this into the latest tarball release (1.03.00), and it fails. Any chance you could make a patch against that for us?

I am trying to use this patch against RHEL4 Update 4 and that won't work either. Can you create a patch for that?

Created attachment 139068 [details]
Alternate patch for STABLE branch
This patch should work for both RHEL4U4 and STABLE branches,
since they are identical code in cvs.
(In reply to comment #11)
> This patch should work for both RHEL4U4 and STABLE branches,
> since they are identical code in cvs.

Hey Robert, we ended up patching the latest tarball release (1.03.00) and restarting the fsck. It died with the following output:
---
RG 46205 at block 0xFFF05700 intact [length 0x3b8cf]
RG 46206 at block 0xFFF40FCF intact [length 0x3b8cf]
RG 46207 at block 0xFFF7C89E intact [length 0x3b8cf]
RG 46208 at block 0xFFFB816D intact [length 0x3b8cf]
RG 46209 at block 0xFFFF3A3C intact [length 0x3b8cf]
* RG 46210 at block 0x10002F30B *** DAMAGED *** [length 0x3b8cf]
* RG 46211 at block 0x10006ABDA *** DAMAGED *** [length 0x3b8cf]
* RG 46212 at block 0x1000A64A9 *** DAMAGED *** [length 0x3b8cf]
* RG 46213 at block 0x1000E1D78 *** DAMAGED *** [length 0x3b8cf]
Error: too many bad RGs. Unable to fill in resource group information.
---
Is there anything we can do to recover from this?

One more note - prior to using the patched 1.03.00 gfs_fsck, we'd been using version 6.1.5, which I think is why we didn't see these messages earlier - the 6.1.5 fsck doesn't seem to do as many checks when it starts up.

The more recent versions of gfs_fsck go out of their way to verify that the resource group information for the GFS file system is in good shape. These changes were done for bugzilla bug 179069. Now gfs_fsck has the ability to detect (and in most cases fix) these kinds of corruption. These messages indicate that several of the RGs in your file system are corrupt. Earlier versions of gfs_fsck didn't detect this kind of corruption.

I really wish I had access to your system so I could use gfs_edit to poke around and investigate the file system by hand. It's possible that gfs_fsck just couldn't adequately figure out where the RGs were due to corruption in the rg index or something. Unfortunately, this isn't easy to figure out except by hand. For now I recommend running this version of gfs_fsck in verbose mode and collecting the output. That will tell me where it thinks the RGs should be, and why it can't rebuild it.

Created attachment 139331 [details]
gfs_fsck log from 2006/10/25
I've attached the verbose output from the patched gfs_fsck. We can probably arrange to give you access to the machine if it would help.

Oops... it seems that output wasn't very verbose. I'm running the fsck again.

Created attachment 139382 [details]
verbose gfs_fsck log from 2006/10/25
I've attached the log from a gfs_fsck -vvv run to the bug.

Hi James, well, I've got some (sort of) good news for you.

(1) The info that you attached last time was good enough to tell me what's happening. If you look at the output, you'll see it complaining about an RG at block #0x646a7331. That's not a valid RG entry. The closest RG is #0x646aa05b. It just needs to figure that out properly in order to fix it.

(2) It seems you've encountered a new bug in gfs_fsck in the code that tries to repair RGs. I'll get it fixed as quickly as I can, and post it here. It shouldn't be a difficult fix. Unfortunately, your file system had been extended once with gfs_grow, and that makes it a lot harder for gfs_fsck to figure out where the RGs are supposed to be and repair that kind of damage. The "Too many corrupt entries" message is supposed to be a failsafe if gfs_fsck loses its way when repairing rg index damage, so it doesn't accidentally introduce its own corruption. It's got a minor problem that I should be able to fix easily.

(3) The corruption in your file system is definitely one corrupt entry in your rg index. Hopefully the damage doesn't go beyond that, and once I get gfs_fsck fixed, it should repair the damage. That means the rest of your data is probably intact, which is better than I had feared. I'll see if I can get you a better gfs_fsck and attach it to the bugzilla record. Hopefully even today.

(In reply to comment #11)
> Created an attachment (id=139068) [edit]
> Alternate patch for STABLE branch
>
> This patch should work for both RHEL4U4 and STABLE branches,
> since they are identical code in cvs.

RHEL4U4 and STABLE are not exactly the same. That patch will not work for RHEL4U4. Can you provide one for RHEL4U4?

Created attachment 139399 [details]
Patch to make gfs_fsck handle very big file systems better
This patch fixes a problem I found with gfs_fsck using 32-bit block
numbers where it should use 64-bit. This was what caused gfs_fsck
to incorrectly report that RG blocks with block numbers > 0xffffffff
were corrupt when they were not, and aborting. This doesn't, however,
fix the problem with the corrupt RG that was reported. This patch
should allow gfs_fsck to repair the damaged RG index properly though.
This could affect other GFS user-space tools such as gfs_mkfs as well.
Please apply this patch to gfs/include/osi_user.h then do "make clean"
and "make install" from the cluster/gfs directory to make sure all the
tools pick it up. Then re-run gfs_fsck and hopefully it will repair
the RG index properly. Once the RG index is repaired, any further
damage (if any) should hopefully be detected and fixed.
The logs were very helpful: they told me that the RG index is
incorrectly pointing to an RG at block 0x646a7331 whereas the closest
one was really at 0x646aa05b. None of this tells us anything about
how the original RG index got corrupted. That is still unknown.
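The repair strategy implied above can be illustrated with the RG listing from the earlier fsck output, where intact RGs sit a fixed 0x3b8cf blocks apart: an rg index entry that does not fall on the expected stride (like 0x646a7331, versus the real 0x646aa05b) can be flagged as bogus. This is a simplified sketch with invented function names; real GFS spacing is not always uniform, especially after gfs_grow, which is exactly what made this repair harder.

```python
RG_LENGTH = 0x3b8cf  # per-RG length reported in the fsck output above

def expected_rg_starts(first_start, count, rg_length=RG_LENGTH):
    """Predict RG start blocks assuming uniform spacing."""
    return [first_start + i * rg_length for i in range(count)]

def suspect_entries(index_entries, first_start, rg_length=RG_LENGTH):
    """Return rg index entries that don't land on the expected stride."""
    return [b for b in index_entries if (b - first_start) % rg_length != 0]

# Reproduce the intact entries from the log, RG 46205 through 46209,
# starting from the first listed block.
starts = expected_rg_starts(0xFFF05700, 5)
```

A checker along these lines can both validate an rg index and propose corrected positions for entries that drifted, which is the kind of rebuild the fix enables.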
(In reply to comment #22)
> Created an attachment (id=139399) [edit]
> Patch to make gfs_fsck handle very big file systems better
>
> This patch fixes a problem I found with gfs_fsck using 32-bit block
> numbers where it should use 64-bit. This was what caused gfs_fsck
> to incorrectly report that RG blocks with block numbers > 0xffffffff
> were corrupt when they were not, and aborting. This doesn't, however,
> fix the problem with the corrupt RG that was reported. This patch
> should allow gfs_fsck to repair the damaged RG index properly though.

Thanks for that, I've applied the patch and am re-running the fsck now.

Created attachment 139404 [details]
Patch for RHEL4U4 to bring gfs_fsck up to date

This patch includes all fixes for this bz, against RHEL4U4. This also includes the fixes for bz 156009. My apologies, Lenny, for not attaching this earlier.

(In reply to comment #22)
> Created an attachment (id=139399) [edit]
> Patch to make gfs_fsck handle very big file systems better

I'm guessing that the change from 32 to 64-bit block numbers is going to increase the memory usage? This system has 16GB of memory and 2GB of swap (48TB filesystem) and it just died trying to allocate a bitmap. I've added more swap and am re-running the fsck... more news soon :)

(In reply to comment #25)
> I'm guessing that change from 32 to 64-bit block numbers is going to increase
> the memory usage? This system has 16GB of memory and 2GB of swap (48TB
> filesystem) and it just died trying to allocate a bitmap. I've added more swap
> and am re-running the fsck... more news soon :)

More swap didn't help. It's still dying in the same place. Here's the error message; I'm running a verbose fsck now, so I'll probably get the log to you tomorrow (about to go home):
Initializing fsck
Unable to allocate bitmap of size 1601697153

What is the architecture of your cluster nodes? An "unsigned long" should evaluate to 8 bytes (uint64_t) on a 64-bit machine, so it theoretically shouldn't matter there.
On a 32-bit node, "unsigned long" evaluates to 4 bytes. The maximum size of a GFS file system on a cluster with 32-bit nodes is only 16TB, due to the 32-bit block numbers. See: http://sources.redhat.com/cluster/faq.html#gfs_maxsize This is true for both kernel and user space. If you have 32-bit (e.g. x86) nodes in the cluster, they will most likely corrupt data if they try to read or write data stored beyond 16TB. That applies to both gfs_fsck and the GFS file system itself. A cluster of all 64-bit nodes (e.g. x86_64) should not have had a problem, because it should already be using a 64-bit block number. With a file system over 16TB, gfs_fsck should only be run from a 64-bit node.

Doh! I've accidentally built this latest gfs_fsck on an x86 machine instead of an x86_64 one. Your note reminded me of that. Sorry for the false alarm. Will rebuild on x86_64 tomorrow.

(In reply to comment #22)
> The logs were very helpful: they told me that the RG index is
> incorrectly pointing to an RG at block 0x646a7331 whereas the closest
> one was really at 0x646aa05b. None of this tells us anything about
> how the original RG index got corrupted. That is still unknown.

The original RG probably got corrupted when we had one of the underlying RAID devices fail.
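The 16TB ceiling quoted above follows directly from the arithmetic, assuming a 4 KiB GFS block size (an assumption for this sketch): 2^32 blocks of 4 KiB is exactly 16 TiB. The same arithmetic also lands close to the failed 1601697153-byte allocation, at roughly one bit per block for a ~48TB file system.

```python
BLOCK_SIZE = 4096            # assumed 4 KiB GFS block size
MAX_32BIT_BLOCKS = 2 ** 32   # highest block count addressable in 32 bits

# 32-bit block numbers cap the file system at exactly 16 TiB:
max_fs_bytes = MAX_32BIT_BLOCKS * BLOCK_SIZE

# A 48 TiB file system needs about 3x that many block numbers,
# hence the requirement to run gfs_fsck on a 64-bit node:
blocks_48tb = (48 * 2 ** 40) // BLOCK_SIZE

# At one bit per block, the bitmap alone is ~1.6 GB, in the same
# ballpark as the "Unable to allocate bitmap of size 1601697153" failure:
bitmap_bytes = blocks_48tb // 8
```

This is why the later fix simply refuses to run on a 32-bit machine against a >16TB file system rather than widening every block number there.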
gfs_fsck has been sitting at 1% of this pass for over 24 hours now (I just tried ctrl-c'ing it to invoke the "skip pass" bits, but it would seem it just killed gfs_fsck instead :() Here's a snippet of the logs; I'll attach the full ones:
---
Leaf(254937628) entry count in directory 254937621 doesn't match number of entries found - is 35, found 0
Leaf entry count updated
Leaf(254937638) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937631) entry count in directory 254937621 doesn't match number of entries found - is 47, found 0
Leaf entry count updated
Leaf(254937645) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937626) entry count in directory 254937621 doesn't match number of entries found - is 38, found 0
Leaf entry count updated
Leaf(254937639) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937632) entry count in directory 254937621 doesn't match number of entries found - is 48, found 0
Leaf entry count updated
Leaf(254937641) entry count in directory 254937621 doesn't match number of entries found - is 44, found 0
Leaf entry count updated
Leaf(254937627) entry count in directory 254937621 doesn't match number of entries found - is 39, found 0
Leaf entry count updated
Leaf(254937642) entry count in directory 254937621 doesn't match number of entries found - is 41, found 0
Leaf entry count updated
Leaf(254937648) entry count in directory 254937621 doesn't match number of entries found - is 21, found 0
Leaf entry count updated
Leaf(254937629) entry count in directory 254937621 doesn't match number of entries found - is 29, found 0
Leaf entry count updated
Leaf(254937644) entry count in directory 254937621 doesn't match number of entries found - is 47, found 0
---

If we give you access to the machine with the filesystem, do you think it would be possible to nuke the corrupted RG, or do something (sorry - I don't know much about the GFS layout or internals) so the repair can complete quicker? Or perhaps there are some commands we could run? We're not too worried about losing a single (or even a few) RGs' worth of data. At the moment we just need to get the filesystem back online so some data (about 8TB of the 48TB) can be archived, before we blow the FS away and start again.

Created attachment 139831 [details]
gfs_fsck log from 2006/10/31
I can't think of a way to do this faster or better. The problem is that there are multiple directories in good areas of the volume that are referencing data that was on the area of the volume that was replaced. The gfs_fsck program has to go through them one by one and fix each one; something it can do much faster than patching by hand.

Also, if ctrl-c ended the gfs_fsck program, then you probably aren't running with the patch I attached last. Perhaps you compiled it with "make" but didn't do "make install", or are running the original version for some reason.

*** Bug 210367 has been marked as a duplicate of this bug. ***

Since there have been no reported problems with the patch, I'm planning to commit the changes to gfs_fsck. Here's a list of what this fix actually does, in more detail:

1. Pass 1b had a memory leak where it never returned memory allocated to duplicate blocks that were discovered. If you had a bad RG with lots of duplicate blocks, it could leak a significant amount of memory. The fix frees up that memory at the end of pass 1b, which might theoretically result in less swapping to disk due to memory pressure, which in turn might mean a slight performance increase for later passes, such as the lengthy pass 5.

2. When processing duplicate blocks, pass 1b was making two passes through the file system: (1) one to look for inodes referencing each duplicate block, and (2) one to look for directory entries referencing each duplicate block. I've combined those two passes into one, which should cut the amount of time spent on each duplicate in half. It adds up in a multi-terabyte file system.

3. In comment #22 of this bz, I referenced a "bug" that restricted gfs_fsck from processing very large file systems properly. In fact, this is not a bug at all, because the value in question is plenty big enough to handle it if gfs_fsck is run from a 64-bit machine, and it's a well-known fact that gfs_fsck shouldn't be run from 32-bit machines if the file system is over 16TB. If I increased the block numbers to 64-bit values, it would increase memory requirements on the 32-bit machine, possibly pushing us into swapping and therefore worse performance. Plus, there are likely to be similar problems we haven't yet discovered when running gfs_fsck on a file system bigger than 16TB from a 32-bit machine. Therefore, I put the value back to "long int" so that it expands correctly the way it used to for each architecture. However, I added a check to make sure the system architecture can handle the file system. So with this fix, if you run gfs_fsck from a 32-bit machine on a file system bigger than 16TB, it will give you an error message and exit.

4. The new gfs_fsck allows users to interrupt processing by hitting ctrl-c in the middle. It will now tell you where it is in the processing and give you the option to a=abort, c=continue, or s=skip the rest of that pass and go on to the next pass. Pass 1 is an exception to this rule and cannot be skipped. That way, if users get into a very long pass, such as the duplicate processing in pass 1b, they can instruct gfs_fsck to skip that pass and go on to the next. Although we can't ensure file system integrity when passes are skipped, it will allow users to get their critical file systems back up and running with minimal exposure, and then schedule downtime later to redo the pass they skipped over. Also, since interrupting it reports its progress, it lets users make sure it's still working and not hung up in some infinite loop.

5. I added more "percent complete" messages to the other passes to reassure users of the progress on fixing the file system.

Note that this does not fix whatever caused the file system to become corrupted in the first place. That may be impossible to figure out unless we're given a reliable way to recreate the corruption in our lab.

The fix was committed to CVS for RHEL4, STABLE, HEAD and RHEL5, but not RHEL50.
The exclusion of RHEL50 means that the fix won't go into the GA release of RHEL5 unless necessary, but it should make the first RHEL5 update.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2007-0139.html