Bug 442271
Summary: | GFS: gfs_fsck bugs found in rindex repair code
---|---
Product: | Red Hat Enterprise Linux 5
Component: | gfs-utils
Version: | 5.3
Hardware: | All
OS: | Linux
Status: | CLOSED ERRATA
Severity: | low
Priority: | low
Target Milestone: | rc
Reporter: | Robert Peterson <rpeterso>
Assignee: | Robert Peterson <rpeterso>
QA Contact: | GFS Bugs <gfs-bugs>
CC: | cfeist, edamato, jkortus
Doc Type: | Bug Fix
Last Closed: | 2009-09-02 11:01:21 UTC
Description
Robert Peterson
2008-04-13 17:10:54 UTC
Created attachment 302383 [details]
Proposed Patch
This is the first prototype I wrote about in the description. It
passes my "gfs_fsck_hellfire" test. I should test it on some more
shredded rindex files though.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

I got some customer feedback that indicated it would be worth adding a special parameter to gfs_fsck to force a rindex rebuild. As it stands today, gfs_fsck doesn't always detect some kinds of rindex corruption.

Note that since "normal" gfs2's RGs are always the same size, even after a gfs2_grow operation, we don't need to port this patch to gfs2_fsck. However, this does expose a bigger issue: file systems that are converted from gfs to gfs2 via gfs2_convert will have their RGs in the same locations as they were in gfs. Today's rindex and rgrp repair algorithms in gfs2_fsck are grossly inadequate to handle that situation. gfs's RG repair algorithm is very involved and complex, whereas the gfs2 version is much simplified and cleaner, relying on the fact that RGs will always be on nice neat boundaries.

Seeing as how rindex and rgrp corruption are very rare to begin with, there should be minimal customer exposure. However, if need be we can lift the entire RG repair algorithm from gfs_fsck and make a special repair case for file systems converted from gfs. That port might be a big effort for very little payback. Still, the first customer to have RG damage on a gfs1-converted file system won't be in a very good mood.

One saving grace is that the instructions in gfs2_convert tell the customers to run gfs_fsck before converting the file system, so hopefully RG and rindex damage won't be carried over across the convert. So in theory the problem should only be for RG damage that occurs after the conversion takes place. Time will tell if it's worth doing.
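To illustrate why "nice neat boundaries" make repair simpler, here is a minimal sketch in Python. It is not the algorithm used by gfs_fsck or gfs2_fsck; the function names `uniform_stride` and `predict_start` are hypothetical. The block numbers are the RG start blocks printed under "Section 1" of the ia64 gfs_fsck log quoted later in this report. When every gap between RGs is the same, a damaged RG's location can simply be recomputed. When the gaps differ, as they can on a gfs file system, that shortcut is not available.

```python
def uniform_stride(rg_starts):
    """Return the common gap between consecutive RG start blocks,
    or None when the gaps differ (or fewer than two RGs are given)."""
    gaps = {b - a for a, b in zip(rg_starts, rg_starts[1:])}
    return gaps.pop() if len(gaps) == 1 else None


def predict_start(first_start, stride, n):
    """Predict the start block of the n-th RG (0-based) under uniform spacing."""
    return first_start + n * stride


# RG start blocks from "Section 1" of the ia64 gfs_fsck log in this report.
section1 = [0x11, 0x6F44, 0xDE74, 0x14DA4, 0x1BCD4]

# RGs 2 through 5 are evenly spaced, so any one of them can be recomputed
# from the first of them plus a multiple of the stride.
stride = uniform_stride(section1[1:])
print(hex(stride))
print(hex(predict_start(section1[1], stride, 3)))

# Including RG 1 breaks uniformity: its gap to RG 2 differs from the others,
# so the function returns None.
print(uniform_stride(section1))
```

In that log the first RG is three blocks longer than the ones after it, and the RGs in "Section 2" use yet another length, which is the kind of layout a uniform-spacing assumption cannot describe.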
If and when I do this fix, I should crosswrite patch 768d7f6 from gfs2 to gfs. This is the problem where RG blocks inside a journal look just like a real RG block and confuse the RG repair algorithm, making it think the RGs have improper sizes. In theory, this should cause gfs_fsck to determine improper block locations for the RGs, and then, after discovering those blocks aren't really RGs, it quits with "Error: too many bad RGs."

Created attachment 317593 [details]
Latest patch
This patch fixes some problems uncovered by a destroyed rindex file
that came in from a user. With this patch, he was able to recover
his 2TB gfs file system successfully, even though the rindex was very
badly destroyed. I still need to run this version through some tests but
this is what I'm planning to ship unless I find bugs.
Equivalent patches were pushed to the master, STABLE2, STABLE3 and RHEL5 branches of the cluster git tree for inclusion into 5.4. The RHEL5 patch was tested on system roth-01 with a variety of scenarios including my gfs_fsck_hellfire test case. I also tested the patch against some user metadata from previous bugs. Changing status to Modified.

gfs2-utils-2.03.11-1.fc9, cman-2.03.11-1.fc9 and rgmanager-2.03.11-1.fc9 have been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.

The tool behaves differently on x86_64 and on ia64: on x86_64 it works as expected, but on ia64 it does not fix the FS. Tested with GFS fsck 0.1.19 (built May 4 2009 19:35:05). Similar corruption (rindex entry corruption, part of gfs_fsck_hellfire) in gfs2 ends in a segfault on both archs. This is not a regression, as I think the fix never really worked on ia64 (tested previous 5.x releases).

Example of a successful run on x86_64:

x86_64 - rindex entry #1 -> 0x00
gfs_fsck -y /dev/VolGroup00/GFS
Initializing fsck
Invalid length 0 found in rindex.
number_of_rgs = 8.
rgindex #1 ri_addr discrepancy: index 0x0 != expected: 0x11
rgindex #1 ri_length discrepancy: index 0x0 != expected: 0x4
rgindex #1 ri_data1 discrepancy: index 0x0 != expected: 0x15
rgindex #1 ri_data discrepancy: index 0x0 != expected: 0xebf8
rgindex #1 ri_bitbytes discrepancy: index 0x0 != expected: 0x3afe
Clearing journals (this may take a while).
Journals cleared.
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Pass4 complete
Starting pass5
Pass5 complete
Writing changes to disk

Note that rgindex #1 (the only one corrupted) is fixed with correct values.

Now the same scenario on ia64:

ia64 - rindex entry #1 -> 0x00 (repeats, uncorrectable)
gfs_fsck -y /dev/sdc1
Initializing fsck
Invalid length 0 found in rindex.
The middle RG is not on an even boundary (fs has grown?)
Section 1: 0x11 - 0x1bcdf
RG 1 at block 0x11 intact [length 0x6f33]
RG 2 at block 0x6F44 intact [length 0x6f30]
RG 3 at block 0xDE74 intact [length 0x6f30]
RG 4 at block 0x14DA4 intact [length 0x6f30]
RG 5 at block 0x1BCD4 intact [length 0x6f30]
Section 2: 0x1fcc0 - 0x3b99f
* RG 6 at block 0x1FCC0 *** DAMAGED *** [length 0x6f4c]
RG 7 at block 0x26C0C intact [length 0x6f34]
RG 8 at block 0x2DB40 intact [length 0x6f34]
RG 9 at block 0x34A74 intact [length 0x6f34]
Section 3: 0x3b9a0 - 0x3b9a7
Unable to use rindex; doing block-by-block search.
This will be slow, so be patient.
* RG 10 at block 0x3B9A0 *** DAMAGED *** [length 0x1]
rgindex #1 ri_addr discrepancy: index 0x0 != expected: 0x11
rgindex #1 ri_length discrepancy: index 0x0 != expected: 0x2
rgindex #1 ri_data1 discrepancy: index 0x0 != expected: 0x13
rgindex #1 ri_data discrepancy: index 0x0 != expected: 0x6f30
rgindex #1 ri_bitbytes discrepancy: index 0x0 != expected: 0x1bcc
rgindex #5 ri_addr discrepancy: index 0x1fcd5 != expected: 0x1bcd4
rgindex #5 ri_length discrepancy: index 0x2 != expected: 0x1
rgindex #5 ri_data1 discrepancy: index 0x1fcd7 != expected: 0x1bcd5
rgindex #5 ri_data discrepancy: index 0x6f34 != expected: 0x0
rgindex #5 ri_bitbytes discrepancy: index 0x1bcd != expected: 0x0
rgindex #6 ri_addr discrepancy: index 0x26c0c != expected: 0x1fcc0
rgindex #6 ri_length discrepancy: index 0x2 != expected: 0x1
rgindex #6 ri_data1 discrepancy: index 0x26c0e != expected: 0x1fcc1
rgindex #6 ri_data discrepancy: index 0x6f30 != expected: 0x6f48
rgindex #6 ri_bitbytes discrepancy: index 0x1bcc != expected: 0x1bd2
rgindex #7 ri_addr discrepancy: index 0x2db40 != expected: 0x26c0c
rgindex #7 ri_data1 discrepancy: index 0x2db42 != expected: 0x26c0e
rgindex #8 ri_addr discrepancy: index 0x34a74 != expected: 0x2db40
rgindex #8 ri_data1 discrepancy: index 0x34a76 != expected: 0x2db42
Resource group count discrepancy. Index says 8. Should be 10.
Block #130240 (0x1fcc0) (1 of 1) is neither GFS_METATYPE_RB nor GFS_METATYPE_RG.
Attempting to repair the RG.
Clearing journals (this may take a while).
Journals cleared.
Starting pass1
Block #130240 (0x1fcc0) (1 of 1) is neither GFS_METATYPE_RB nor GFS_METATYPE_RG.
Resource group or index is corrupted.

gfs_fsck reports several errors that do not actually exist in the FS. Please note that the disk was zeroed before mkfs_gfs (mkfs.gfs -O -t a3cluster:a3gfs2 -p lock_nolock -j 2 -J 32 /dev/sdc1). The errors repeat on each run and, from the second run on, are always the same. I was unable to fix the FS on ia64 with gfs_fsck.

And last, an example backtrace of gfs2_fsck after zeroing the first rgindex:

ia64 coredump:
Core was generated by `gfs2_fsck -y /dev/sdc1'.
Program terminated with signal 11, Segmentation fault.
[New process 9558]
#0 0x4000000000049fe0 in gfs2_rgrp_read (sdp=0x60000fffff879220, rgd=0x6000000000057290) at rgrp.c:148
148 gfs2_rgrp_in(&rgd->rg, rgd->bh[0]->b_data);
(gdb) bt full
#0 0x4000000000049fe0 in gfs2_rgrp_read (sdp=0x60000fffff879220, rgd=0x6000000000057290) at rgrp.c:148
x = 0
length = 0
#1 0x40000000000185a0 in rg_repair ()
No symbol table info available.
#2 0x4000000000005580 in initialize ()
No symbol table info available.
#3 0x4000000000003170 in main ()
No symbol table info available.
(gdb)

Verified with gfs-utils-0.1.20-1.el5. The fix works, apart from the setup described in bug 512722.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1336.html
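For readers following the rgindex discrepancy lines in the logs above, here is a small Python sketch of two relations that hold for every "expected" value set printed there with a nonzero ri_length: ri_data1 equals ri_addr plus ri_length, and ri_data equals four times ri_bitbytes. The factor of four is an assumption inferred from those numbers (it would correspond to two bitmap bits per block). The `RindexEntry` class and `check_entry` function are hypothetical names for illustration, not part of gfs_fsck.

```python
from dataclasses import dataclass

# Assumption inferred from the logged values: one bitmap byte covers 4 blocks.
BLOCKS_PER_BITMAP_BYTE = 4


@dataclass
class RindexEntry:
    """The five rindex fields named in the gfs_fsck discrepancy messages."""
    ri_addr: int
    ri_length: int
    ri_data1: int
    ri_data: int
    ri_bitbytes: int


def check_entry(e):
    """Return a list of internal inconsistencies found in one entry."""
    problems = []
    if e.ri_length == 0:
        problems.append("ri_length is zero")
    if e.ri_data1 != e.ri_addr + e.ri_length:
        problems.append("ri_data1 != ri_addr + ri_length")
    if e.ri_data != e.ri_bitbytes * BLOCKS_PER_BITMAP_BYTE:
        problems.append("ri_data != 4 * ri_bitbytes")
    return problems


# The zeroed entry #1 from both test runs.
print(check_entry(RindexEntry(0, 0, 0, 0, 0)))

# Expected values for entry #1 on x86_64 and on ia64, taken from the logs.
print(check_entry(RindexEntry(0x11, 0x4, 0x15, 0xEBF8, 0x3AFE)))
print(check_entry(RindexEntry(0x11, 0x2, 0x13, 0x6F30, 0x1BCC)))

# The on-disk ("index") values for entry #5 in the ia64 run.
print(check_entry(RindexEntry(0x1FCD5, 0x2, 0x1FCD7, 0x6F34, 0x1BCD)))
```

The last case is worth noting: the on-disk values for entry #5 pass both checks, yet gfs_fsck on ia64 still reported them as discrepancies against its own expectations. Internal consistency of an entry is therefore at best a necessary condition. It says nothing about whether the entry agrees with where the repair code believes the RGs are.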