Description of problem: We have some filesystem corruption caused by a faulty RAID enclosure. Around 16MB worth of the RAID was inaccessible for a day or two until we realised and swapped out the enclosure. The "missing" 16MB worth of blocks became accessible again, but GFS did not deal with the whole thing very well.

We have tried running gfs_fsck twice over the last two weekends (it's a 48TB filesystem!) but on both occasions it seemed to get stuck during pass1b with errors like "On disk count does not match..." (can't recall the exact message). gfs_fsck would dump output and seem to be working its way through the filesystem for about two days, then the output would stop, a load of 1 would be observed, and gfs_fsck would take 99.9% of the CPU. After leaving the fsck for two days with no further output I had to kill it. This filesystem is used for nearline storage, so it's better that we get some backups through than leave gfs_fsck running for a week!

The main issue is that we frequently get crashes like this:

GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
GFS: fsid=nearlineA:gfs1.0: bh = 7139682389 (type: exp=6, found=5)
GFS: fsid=nearlineA:gfs1.0: function = get_leaf
GFS: fsid=nearlineA:gfs1.0: file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 437
GFS: fsid=nearlineA:gfs1.0: time = 1159784525
GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster
GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O
GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=nearlineA:gfs1.0: withdrawn

I'm attaching the standard output from a gfs_fsck run up until I killed it after two days without output. Unfortunately I didn't grab the stderr.

What I want to know is: should we let gfs_fsck run and run even though it appears to be doing very little, or is there a manual way to wipe the affected areas of the filesystem? Even if we deallocate those areas and lose the data, that's fine as long as it stops the crashes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
On our two-node cluster, here's the complete list of occurrences over the last few days:

[root@ns1a ~]# grep -A 1 -B 1 "bh =" /var/log/messages*
/var/log/messages-Oct 2 01:22:56 ns1a kernel: GFS: fsid=nearlineA:gfs1.1: fatal: invalid metadata block
/var/log/messages:Oct 2 01:22:56 ns1a kernel: GFS: fsid=nearlineA:gfs1.1: bh = 9230177608 (type: exp=4, found=5)
/var/log/messages-Oct 2 01:22:56 ns1a kernel: GFS: fsid=nearlineA:gfs1.1: function = gfs_get_meta_buffer
--
/var/log/messages.1-Sep 28 02:29:20 ns1a kernel: GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
/var/log/messages.1:Sep 28 02:29:20 ns1a kernel: GFS: fsid=nearlineA:gfs1.0: bh = 7139682396 (type: exp=4, found=5)
/var/log/messages.1-Sep 28 02:29:20 ns1a kernel: GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer
--
/var/log/messages.1-Sep 29 10:25:02 ns1a kernel: GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
/var/log/messages.1:Sep 29 10:25:02 ns1a kernel: GFS: fsid=nearlineA:gfs1.0: bh = 9230177565 (type: exp=4, found=6)
/var/log/messages.1-Sep 29 10:25:02 ns1a kernel: GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer

[root@ns2a ~]# grep -A 1 -B 1 "bh =" /var/log/messages*
/var/log/messages-Oct 2 00:14:18 ns2a kernel: GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
/var/log/messages:Oct 2 00:14:18 ns2a kernel: GFS: fsid=nearlineA:gfs1.0: bh = 9230177555 (type: exp=4, found=6)
/var/log/messages-Oct 2 00:14:18 ns2a kernel: GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer
--
/var/log/messages-Oct 2 11:22:05 ns2a kernel: GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
/var/log/messages:Oct 2 11:22:05 ns2a kernel: GFS: fsid=nearlineA:gfs1.0: bh = 7139682389 (type: exp=6, found=5)
/var/log/messages-Oct 2 11:22:05 ns2a kernel: GFS: fsid=nearlineA:gfs1.0: function = get_leaf
--
/var/log/messages.1-Sep 28 19:43:16 ns2a kernel: GFS: fsid=nearlineA:gfs1.1: fatal: invalid metadata block
/var/log/messages.1:Sep 28 19:43:16 ns2a kernel: GFS: fsid=nearlineA:gfs1.1: bh = 5775125502 (type: exp=5, found=4)
/var/log/messages.1-Sep 28 19:43:16 ns2a kernel: GFS: fsid=nearlineA:gfs1.1: function = gfs_get_meta_buffer
--
/var/log/messages.1-Sep 29 04:36:32 ns2a kernel: GFS: fsid=nearlineA:gfs1.1: fatal: invalid metadata block
/var/log/messages.1:Sep 29 04:36:32 ns2a kernel: GFS: fsid=nearlineA:gfs1.1: bh = 7139682389 (type: exp=6, found=5)
/var/log/messages.1-Sep 29 04:36:32 ns2a kernel: GFS: fsid=nearlineA:gfs1.1: function = get_leaf
Created attachment 137533 [details]
gfs_fsck standard output
Without having the file system here to analyze, it's impossible to say whether gfs_fsck is processing normally or stuck in some kind of infinite loop. All indications are that it's processing normally. The gfs_fsck output had 16752 messages from pass1b that said "Found dup block". As long as that block number keeps going up, gfs_fsck is not hung.

In pass1b, the code is trying to identify metadata structures in the file system that point to the same physical blocks. To identify duplicate blocks, it uses bitmap information in memory that was built during pass1. With a 48TB file system, it's likely that your system doesn't have enough RAM to hold these bitmaps in memory, and therefore has to swap to disk for the bitmap information. This extra swapping will be especially time-consuming during pass1b, where it's referencing the bitmaps. Adding memory might make it go faster, but I realize that may or may not be practical.
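[Editorial note: for a sense of scale, here is a minimal sketch of the memory a per-block bitmap of this kind implies. It is illustrative only; the block size is an assumption, and gfs_fsck's real bitmaps track more state per block than the single bit used here.]

---
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Assumed 4KiB blocks: a 48TB filesystem then has ~12.9 billion
     * blocks, so even one bit per block costs ~1.6GB of RAM. */
    uint64_t fs_bytes    = 48ULL << 40;
    uint64_t block_size  = 4096;
    uint64_t nblocks     = fs_bytes / block_size;
    uint64_t bitmap_size = (nblocks + 7) / 8;   /* one bit per block */

    printf("blocks: %llu, bitmap: %llu bytes\n",
           (unsigned long long)nblocks,
           (unsigned long long)bitmap_size);

    unsigned char *bitmap = calloc(1, bitmap_size);
    if (!bitmap) {
        fprintf(stderr, "cannot allocate bitmap\n");
        return 1;
    }
    free(bitmap);
    return 0;
}
---

Several such maps, or more state bits per block, quickly outgrow physical memory and force the process into swap.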
True, it is chewing 11GB of swap. The 16752 messages about the "dup" blocks stopped in that gfs_fsck output, and then the fsck continued running for another two days without adding any more output. That's when I got suspicious and killed it off. I did an strace -p on the process first, and it did seem to be doing lots of lseeks and reads, but it appeared to work on the same block over and over again for around a minute before moving on to another block. We will just have to limp through the rest of this week and try the filesystem check again over the weekend. I'll make sure to record the stderr this time too.
Okay, we have now been running the fsck for five days, but this time with "-v". It appears that after finding lots of DUP or bad blocks, it then looks at each inode searching for a reference to each block. As there are quite a few blocks, it is taking around a minute to go from inode 1 to inode 2, and so on. We are now up to inode 145,000,000! It looks like this pass examines all inodes on the filesystem, used or unused. In a 48TB GFS filesystem, how many inodes are there likely to be? Should we wait for this fsck to finish, or is there something else we can do to skip or speed up this operation?
Created attachment 138096 [details]
Experimental gfs_fsck to allow interrupt

I have attached a patch you can try. There are three improvements:

1. This patch will allow you to interrupt the different passes and skip to the next. It won't allow you to interrupt pass 1, because the other passes rely upon information from pass 1. Since pass 1b is duplicate block checking, it shouldn't be harmful if you interrupt it and skip to the next pass.

2. I've combined two passes through the file system in pass 1b, allowing it to do its job faster. This should be considered experimental until we can do further testing here.

3. It also reports percentage complete for more of the passes.

The patch is not elegant, but I wanted to get something to you as quickly as possible. I have a few additional ideas for improvements, such as using dynamic compression of the bitmap memory if gfs_fsck runs into excessive memory pressure. This might make gfs_fsck run faster, if we can keep it from swapping to disk, provided the file system usage is not extremely fragmented. On the other hand, it might make it run slower because it would take more CPU time to decompress the bitmaps on the fly. It would depend on what the bitmaps look like for the file system and how well they compress. Unfortunately, that kind of improvement will take some time to implement.
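[Editorial note: for readers unfamiliar with how an interruptible pass like improvement 1 typically works, here is a minimal sketch. All names are assumptions; this is not the patch code. A SIGINT handler only sets a flag, and the pass loop checks it at safe points and prompts.]

---
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static volatile sig_atomic_t interrupted;

static void on_sigint(int sig)
{
    (void)sig;
    interrupted = 1;  /* just set a flag; act on it at a safe point */
}

/* Hypothetical helper, called periodically inside a pass.
 * Returns nonzero if the rest of the pass should be skipped. */
static int check_interrupt(const char *pass, unsigned long long block)
{
    int c;

    if (!interrupted)
        return 0;
    interrupted = 0;
    printf("\n%s interrupted at block %llu: (a)bort, (c)ontinue, (s)kip? ",
           pass, block);
    fflush(stdout);
    c = getchar();
    if (c == 'a')
        exit(1);
    return c == 's';
}

int main(void)
{
    signal(SIGINT, on_sigint);
    /* ... each pass loop would call check_interrupt() regularly ... */
    return 0;
}
---

Pass 1 would simply never offer the skip option, since later passes depend on its results.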
FYI, the original gfs_fsck from about a week ago is still running. We're just seeing loads of:

Done checking inode x's metatree for references to block x

messages from gfs_fsck (it's running with -v). It's up to about 270M inodes so far. In theory all those inodes should be actual files (not hard links), correct? This filesystem does have a huge number of hard links because it's used for rsync hard-link snapshots. At the moment we're just going to let the fsck run to completion, unless you think we should stop it and use the patched gfs_fsck?
The hard links are only directory entries in the file system, with no extra inodes. The inodes are for actual files, directories, devices and such. At this point, you're probably better off letting it run to completion. However, be aware that gfs_fsck runs slower in verbose (-v) mode.
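[Editorial note: a quick way to see that distinction from user space, with hypothetical paths: link() adds only a directory entry, so both names resolve to the same inode and the inode's link count increases.]

---
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat a, b;

    /* Creates a second name for the same inode; no new inode is used. */
    if (link("/gfs/snapshots/file", "/gfs/snapshots/alias") != 0)
        perror("link");

    if (stat("/gfs/snapshots/file", &a) == 0 &&
        stat("/gfs/snapshots/alias", &b) == 0)
        printf("same inode: %s, link count now: %lu\n",
               a.st_ino == b.st_ino ? "yes" : "no",
               (unsigned long)a.st_nlink);
    return 0;
}
---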
(In reply to comment #6)
> Created an attachment (id=138096) [edit]
> Experimental gfs_fsck to allow interrupt
>
> I have attached a patch you can try. There are three improvements:

Hey Robert - I'm trying to patch this into the latest tarball release (1.03.00), and it fails. Any chance you could make a patch against that for us?
I am trying to use this patch against RHEL4 Update 4 and it won't work either. Can you create a patch for that?
Created attachment 139068 [details]
Alternate patch for STABLE branch

This patch should work for both RHEL4U4 and STABLE branches, since they are identical code in cvs.
(In reply to comment #11)
> This patch should work for both RHEL4U4 and STABLE branches,
> since they are identical code in cvs.

Hey Robert,

We ended up patching the latest tarball release (1.03.00) and restarting the fsck. It died with the following output:

---
RG 46205 at block 0xFFF05700 intact [length 0x3b8cf]
RG 46206 at block 0xFFF40FCF intact [length 0x3b8cf]
RG 46207 at block 0xFFF7C89E intact [length 0x3b8cf]
RG 46208 at block 0xFFFB816D intact [length 0x3b8cf]
RG 46209 at block 0xFFFF3A3C intact [length 0x3b8cf]
* RG 46210 at block 0x10002F30B *** DAMAGED *** [length 0x3b8cf]
* RG 46211 at block 0x10006ABDA *** DAMAGED *** [length 0x3b8cf]
* RG 46212 at block 0x1000A64A9 *** DAMAGED *** [length 0x3b8cf]
* RG 46213 at block 0x1000E1D78 *** DAMAGED *** [length 0x3b8cf]
Error: too many bad RGs.
Unable to fill in resource group information.
---

Is there anything we can do to recover from this?
One more note: prior to using the patched 1.03.00 gfs_fsck, we'd been using version 6.1.5, which I think is why we didn't see these messages earlier - the 6.1.5 fsck doesn't seem to do as many checks when it starts up.
The more recent versions of gfs_fsck go out of their way to verify that the resource group information for the GFS file system is in good shape. These changes were done for bugzilla bug 179069. Now gfs_fsck has the ability to detect (and in most cases fix) these kinds of corruption. These messages indicate that several of the RGs in your FS are corrupt. Earlier versions of gfs_fsck didn't detect this kind of corruption.

I really wish I had access to your system so I could use gfs_edit to poke around and investigate the file system by hand. It's possible that gfs_fsck just couldn't adequately figure out where the RGs were due to corruption in the rg index or something. Unfortunately, this isn't easy to figure out except by hand.

For now I recommend running this version of gfs_fsck in verbose mode and collecting the output. That will tell me where it thinks the RGs should be, and why it can't rebuild it.
Created attachment 139331 [details]
gfs_fsck log from 2006/10/25
I've attached the verbose output from the patched gfs_fsck. We can probably arrange to give you access to the machine if it would help.
Oops... it seems that output wasn't very verbose. I'm running the fsck again.
Created attachment 139382 [details]
verbose gfs_fsck log from 2006/10/25
I've attached the log from a gfs_fsck -vvv run to the bug.
Hi James,

Well, I've got some (sort of) good news for you.

(1) The info that you attached last time was good enough to tell me what's happening. If you look at the output, you'll see it complaining about an RG at block #0x646a7331. That's not a valid RG entry; the closest RG is #0x646aa05b. gfs_fsck just needs to figure that out properly in order to fix it.

(2) It seems you've encountered a new bug in gfs_fsck, in the code that tries to repair RGs. I'll get it fixed as quickly as I can and post it here; it shouldn't be a difficult fix. Unfortunately, your file system had been extended once with gfs_grow, and that makes it a lot harder for gfs_fsck to figure out where the RGs are supposed to be and repair that kind of damage. The "Too many corrupt entries" message is supposed to be a failsafe in case gfs_fsck loses its way when repairing rg index damage, so it doesn't accidentally introduce corruption of its own. It's got a minor problem that I should be able to fix easily.

(3) The corruption in your file system is definitely one corrupt entry in your rg index. Hopefully the damage doesn't go beyond that, and once I get gfs_fsck fixed, it should repair the damage. That means the rest of your data is probably intact, which is better than I had feared.

I'll see if I can get you a better gfs_fsck and attach it to the bugzilla record, hopefully even today.
(In reply to comment #11)
> Created an attachment (id=139068) [edit]
> Alternate patch for STABLE branch
>
> This patch should work for both RHEL4U4 and STABLE branches,
> since they are identical code in cvs.

RHEL4U4 and STABLE are not exactly the same. That patch will not work for RHEL4U4. Can you provide one for RHEL4U4?
Created attachment 139399 [details]
Patch to make gfs_fsck handle very big file systems better

This patch fixes a problem I found with gfs_fsck using 32-bit block numbers where it should use 64-bit. This was what caused gfs_fsck to incorrectly report that RG blocks with block numbers > 0xffffffff were corrupt when they were not, and to abort. This doesn't, however, fix the problem with the corrupt RG that was reported; it should allow gfs_fsck to repair the damaged RG index properly, though. This could affect other GFS user-space tools, such as gfs_mkfs, as well.

Please apply this patch to gfs/include/osi_user.h, then do "make clean" and "make install" from the cluster/gfs directory to make sure all the tools pick it up. Then re-run gfs_fsck, and hopefully it will repair the RG index properly. Once the RG index is repaired, any further damage (if any) should hopefully be detected and fixed.

The logs were very helpful: they told me that the RG index is incorrectly pointing to an RG at block 0x646a7331, whereas the closest one was really at 0x646aa05b. None of this tells us anything about how the original RG index got corrupted. That is still unknown.
(In reply to comment #22)
> Created an attachment (id=139399) [edit]
> Patch to make gfs_fsck handle very big file systems better
>
> This patch fixes a problem I found with gfs_fsck using 32-bit block
> numbers where it should use 64-bit. This was what caused gfs_fsck
> to incorrectly report that RG blocks with block numbers > 0xffffffff
> were corrupt when they were not, and aborting. This doesn't, however,
> fix the problem with the corrupt RG that was reported. This patch
> should allow gfs_fsck to repair the damaged RG index properly though.

Thanks for that, I've applied the patch and am re-running the fsck now.
Created attachment 139404 [details]
Patch for RHEL4U4 to bring gfs_fsck up to date

This patch includes all fixes for this bz, against RHEL4U4. It also includes the fixes for bz 156009. My apologies, Lenny, for not attaching this earlier.
(In reply to comment #22)
> Created an attachment (id=139399) [edit]
> Patch to make gfs_fsck handle very big file systems better

I'm guessing that change from 32 to 64-bit block numbers is going to increase the memory usage? This system has 16GB of memory and 2GB of swap (48TB filesystem), and it just died trying to allocate a bitmap. I've added more swap and am re-running the fsck... more news soon :)
(In reply to comment #25)
> I'm guessing that change from 32 to 64-bit block numbers is going to increase
> the memory usage? This system has 16GB of memory and 2GB of swap (48TB
> filesystem) and it just died trying to allocate a bitmap. I've added more swap
> and am re-running the fsck... more news soon :)

More swap didn't help; it's still dying in the same place. Here's the error message. I'm running a verbose fsck now, so I'll probably get the log to you tomorrow (about to go home):

Initializing fsck
Unable to allocate bitmap of size 1601697153
What is the architecture of your cluster nodes? An "unsigned long" evaluates to 8 bytes (uint64_t) on a 64-bit machine, so in theory it shouldn't matter there. On a 32-bit node, "unsigned long" evaluates to 4 bytes.

The maximum size of a GFS file system on a cluster with 32-bit nodes is only 16TB, due to the 32-bit block numbers. See: http://sources.redhat.com/cluster/faq.html#gfs_maxsize

This is true for both kernel and user space. If you have 32-bit (e.g. x86) nodes in the cluster, they will most likely corrupt data if they try to read or write data stored beyond 16TB. That applies to both gfs_fsck and the GFS file system itself. A cluster of all 64-bit nodes (e.g. x86_64) should not have had a problem, because it should already be using a 64-bit block number. With a file system over 16TB, gfs_fsck should only be run from a 64-bit node.
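[Editorial note: a sketch of the "unsigned long" point and of the kind of architecture guard a fsck could apply. The names are assumptions, not the actual gfs_fsck code. At 4KiB blocks, 16TB is exactly 2^32 blocks, the limit of a 4-byte block number.]

---
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative: can this build address every block of the filesystem?
 * "unsigned long" is 4 bytes on x86 and 8 bytes on x86_64, so block
 * numbers past 2^32 are only safe on a 64-bit build. */
static int arch_can_handle(uint64_t total_blocks)
{
    return total_blocks <= (uint64_t)ULONG_MAX;
}

int main(void)
{
    uint64_t blocks_48tb = (48ULL << 40) / 4096;  /* ~12.9e9 blocks */

    printf("unsigned long is %zu bits on this build\n",
           sizeof(unsigned long) * 8);
    if (!arch_can_handle(blocks_48tb)) {
        fprintf(stderr, "file system too large for this build; "
                        "run from a 64-bit (x86_64) node\n");
        return 1;
    }
    printf("ok: %llu blocks addressable\n",
           (unsigned long long)blocks_48tb);
    return 0;
}
---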
Doh! I accidentally built this latest gfs_fsck on an x86 machine instead of an x86_64 one; your note reminded me of that. Sorry for the false alarm. Will rebuild on x86_64 tomorrow.
(In reply to comment #22)
> The logs were very helpful: they told me that the RG index is
> incorrectly pointing to an RG at block 0x646a7331 whereas the closest
> one was really at 0x646aa05b. None of this tells us anything about
> how the original RG index got corrupted. That is still unknown.

The original RG probably got corrupted when we had one of the underlying RAID devices fail.

gfs_fsck has been sitting at 1% of this pass for over 24 hours now (I just tried ctrl-c'ing it to invoke the "skip pass" bits, but it would seem it just killed gfs_fsck instead :() Here's a snippet of the logs; I'll attach the full ones:

---
Leaf(254937628) entry count in directory 254937621 doesn't match number of entries found - is 35, found 0
Leaf entry count updated
Leaf(254937638) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937631) entry count in directory 254937621 doesn't match number of entries found - is 47, found 0
Leaf entry count updated
Leaf(254937645) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937626) entry count in directory 254937621 doesn't match number of entries found - is 38, found 0
Leaf entry count updated
Leaf(254937639) entry count in directory 254937621 doesn't match number of entries found - is 42, found 0
Leaf entry count updated
Leaf(254937632) entry count in directory 254937621 doesn't match number of entries found - is 48, found 0
Leaf entry count updated
Leaf(254937641) entry count in directory 254937621 doesn't match number of entries found - is 44, found 0
Leaf entry count updated
Leaf(254937627) entry count in directory 254937621 doesn't match number of entries found - is 39, found 0
Leaf entry count updated
Leaf(254937642) entry count in directory 254937621 doesn't match number of entries found - is 41, found 0
Leaf entry count updated
Leaf(254937648) entry count in directory 254937621 doesn't match number of entries found - is 21, found 0
Leaf entry count updated
Leaf(254937629) entry count in directory 254937621 doesn't match number of entries found - is 29, found 0
Leaf entry count updated
Leaf(254937644) entry count in directory 254937621 doesn't match number of entries found - is 47, found 0
---

If we give you access to the machine with the filesystem, do you think it would be possible to nuke the corrupted RG, or do something (sorry, I don't know much about the GFS layout or internals) so the repair can complete quicker? Or perhaps there are some commands we could run? We're not too worried about losing a single (or even a few) RGs' worth of data. At the moment we just need to get the filesystem back online so some data (about 8TB of the 48TB) can be archived, before we blow the FS away and start again.
Created attachment 139831 [details]
gfs_fsck log from 2006/10/31
I can't think of a way to do this faster or better. The problem is that there are multiple directories in good areas of the volume that reference data that was on the area of the volume that was replaced. The gfs_fsck program has to go through them one by one and fix each one, something it can do much faster than patching by hand.

Also, if ctrl-c ended the gfs_fsck program, then you probably aren't running with the patch I attached last. Perhaps you compiled it with "make" but didn't do "make install", or you are running the original version for some reason.
*** Bug 210367 has been marked as a duplicate of this bug. ***
Since there have been no reported problems with the patch, I'm planning to commit the changes to gfs_fsck. Here's a list of what this fix actually does, in more detail:

1. Pass 1b had a memory leak where it never returned the memory allocated for duplicate blocks that were discovered. If you had a bad RG with lots of duplicate blocks, it could leak a significant amount of memory. The fix frees that memory at the end of pass 1b, which might result in less swapping to disk due to memory pressure, which in turn might mean a slight performance increase for later passes, such as the lengthy pass 5.

2. When processing duplicate blocks, pass 1b was making two passes through the file system: (1) one to look for inodes referencing the duplicate block, and (2) one to look for directory entries referencing each duplicate block. I've combined those two passes into one, which should cut the amount of time spent on each duplicate in half. That adds up in a multi-terabyte file system.

3. In comment #22 of this bz, I referenced a "bug" that restricted gfs_fsck from processing very large file systems properly. In fact, this is not a bug at all, because the value in question is plenty big enough if gfs_fsck is run from a 64-bit machine, and it's a well-known fact that gfs_fsck shouldn't be run from 32-bit machines if the file system is over 16TB. If I increased the block numbers to 64-bit values, it would increase memory requirements on 32-bit machines, possibly pushing us into swapping and therefore worse performance. Plus, there are likely to be similar problems we haven't yet discovered when running gfs_fsck on a file system bigger than 16TB from a 32-bit machine. Therefore, I put the value back to "long int" so that it expands correctly the way it used to for each architecture. However, I added a check to make sure the system architecture can handle the file system. With this fix, if you run gfs_fsck from a 32-bit machine on a file system bigger than 16TB, it will give you an error message and exit.

4. The new gfs_fsck allows users to interrupt processing by hitting ctrl-c in the middle. It will now tell you where it is in the processing and give you the option to a=abort, c=continue or s=skip the rest of that pass and go on to the next pass. Pass 1 is an exception to this rule and cannot be skipped. That way, if users get into a very long pass, such as the duplicate processing in pass 1b, they can instruct gfs_fsck to skip that pass and go on to the next. Although we can't ensure file system integrity when passes are skipped, it will allow users to get their critical file systems back running with minimal exposure and then schedule down time later to redo the pass they skipped. Also, since interrupting it reports on its progress, it lets users confirm it's still working and not hung up in some infinite loop.

5. I added more "percent complete" messages to the other passes to reassure users of the progress on fixing the file system.

Note that this does not fix whatever caused the file system to become corrupted in the first place. That may be impossible to figure out unless we're given a reliable way to recreate the corruption in our lab.
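[Editorial note: point 1 above is the classic pattern of releasing a pass's worklist once the pass completes. A generic sketch follows; the structure and function names are hypothetical, not gfs_fsck's.]

---
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one duplicate block found in pass 1b. */
struct dup_block {
    uint64_t blkno;
    struct dup_block *next;
};

/* Walk the list and return every node to the allocator at the end of
 * the pass, so later passes (like the lengthy pass 5) aren't pushed
 * into swap by memory that is no longer needed. */
static void free_dup_list(struct dup_block **head)
{
    struct dup_block *cur = *head;

    while (cur) {
        struct dup_block *next = cur->next;
        free(cur);
        cur = next;
    }
    *head = NULL;
}
---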
The fix was committed to CVS for RHEL4, STABLE, HEAD and RHEL5, but not RHEL50. The exclusion of RHEL50 means that the fix won't go into the GA release of RHEL5, unless necessary, but it should make the first RHEL5 update.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0139.html