Red Hat Bugzilla – Bug 495774
gfs_fsck segfaults while fixing 'EA leaf block type' problem.
Last modified: 2010-10-23 04:58:10 EDT
Description of problem:
Running `gfs_fsck -v -n <block device>' on a GFS filesystem with inconsistencies reports an EA leaf block problem:
Initializing special inodes...
Validating Resource Group index.
Level 1 check.
3798 resource groups found.
Setting block ranges...
Creating a block list of size 249167869...
Checking metadata in Resource Group 0
Checking metadata in Resource Group 1
Checking metadata in Resource Group 3065
Checking metadata in Resource Group 3066
Checking metadata in Resource Group 3067
Checking metadata in Resource Group 3068
EA leaf block has incorrect type.
gfs_fsck then aborts.
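For context, the "incorrect type" message concerns the metadata header at the start of the block, which carries a magic number and a block type. The sketch below shows roughly what such a check could look like. It is not taken from the gfs_fsck source: the function names are hypothetical, and the magic and type constants are assumptions believed to match gfs_ondisk.h, so verify them before relying on them.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, believed to match gfs_ondisk.h; not verified here. */
#define SKETCH_GFS_MAGIC       0x01161970u
#define SKETCH_GFS_METATYPE_EA 10u

/* Decode a big-endian 32-bit value from a raw block buffer. */
uint32_t sketch_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Return 1 if the buffer starts with a metadata header whose magic and
 * type fields identify an EA leaf block, 0 otherwise.
 * Assumed layout: bytes 0..3 = magic, bytes 4..7 = type.
 */
int sketch_is_ea_leaf(const unsigned char *block, size_t len)
{
    if (block == NULL || len < 8)
        return 0;
    if (sketch_be32(block) != SKETCH_GFS_MAGIC)
        return 0;
    return sketch_be32(block + 4) == SKETCH_GFS_METATYPE_EA;
}
```

A block that an inode points to as an EA leaf but that fails a check like this is what would be reported as having an incorrect type.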
When gfs_fsck is run without the '-n' flag so that it can fix the problem, it crashes with a double free detected by glibc.
# gfs_fsck -y <block device>
Clearing journals (this may take a while)....
13 percent complete.
25 percent complete.
37 percent complete.
48 percent complete.
49 percent complete.
61 percent complete.
72 percent complete.
EA leaf block has incorrect type.
*** glibc detected *** gfs_fsck: double free or corruption (fasttop): 0x00000000059aaf20 ***
======= Backtrace: =========
======= Memory map: ========
00400000-00422000 r-xp 00000000 fd:00 8053142 /sbin/gfs_fsck
00622000-00623000 rw-p 00022000 fd:00 8053142 /sbin/gfs_fsck
03d2f000-059ba000 rw-p 03d2f000 00:00 0 [heap]
3ca0400000-3ca041a000 r-xp 00000000 fd:00 1146035 /lib64/ld-2.5.so
3ca061a000-3ca061b000 r--p 0001a000 fd:00 1146035 /lib64/ld-2.5.so
3ca061b000-3ca061c000 rw-p 0001b000 fd:00 1146035 /lib64/ld-2.5.so
3ca0800000-3ca094a000 r-xp 00000000 fd:00 1146049 /lib64/libc-2.5.so
3ca094a000-3ca0b49000 ---p 0014a000 fd:00 1146049 /lib64/libc-2.5.so
3ca0b49000-3ca0b4d000 r--p 00149000 fd:00 1146049 /lib64/libc-2.5.so
3ca0b4d000-3ca0b4e000 rw-p 0014d000 fd:00 1146049 /lib64/libc-2.5.so
3ca0b4e000-3ca0b53000 rw-p 3ca0b4e000 00:00 0
3ca4000000-3ca400d000 r-xp 00000000 fd:00 1146061 /lib64/libgcc_s-4.1.2-20080102.so.1
3ca400d000-3ca420d000 ---p 0000d000 fd:00 1146061 /lib64/libgcc_s-4.1.2-20080102.so.1
3ca420d000-3ca420e000 rw-p 0000d000 fd:00 1146061 /lib64/libgcc_s-4.1.2-20080102.so.1
2adb93530000-2adb93531000 rw-p 2adb93530000 00:00 0
2adb93540000-2adba0532000 rw-p 2adb93540000 00:00 0
2adba4000000-2adba4021000 rw-p 2adba4000000 00:00 0
2adba4021000-2adba8000000 ---p 2adba4021000 00:00 0
7fff17535000-7fff1757a000 rw-p 7ffffffba000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
This is followed by a core dump.
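The glibc message above is the usual symptom of the same pointer being passed to free() twice, often once in an error path and again in common cleanup. The sketch below is not gfs_fsck code, and every name in it is hypothetical. It only illustrates one defensive pattern: release through a helper that clears the pointer, so a second release is a no-op.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a cached block buffer; not a gfs_fsck type. */
struct sketch_buf {
    unsigned char *data;
    size_t len;
};

/* Allocate a zeroed buffer. Returns 0 on success, -1 on failure. */
int sketch_buf_alloc(struct sketch_buf *b, size_t len)
{
    b->data = calloc(1, len);
    b->len = b->data ? len : 0;
    return b->data ? 0 : -1;
}

/*
 * Release the buffer and clear the pointer. A repeated call is harmless
 * because free(NULL) is defined to do nothing.
 */
void sketch_buf_release(struct sketch_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = 0;
}

/*
 * Both the error path and the common cleanup release the buffer.
 * With a plain free() and no NULL reset, the error case would free
 * the same pointer twice.
 */
int sketch_check_block(struct sketch_buf *b, int type_ok)
{
    int rc = 0;

    if (!type_ok) {
        sketch_buf_release(b);  /* error path drops the buffer */
        rc = -1;
    }
    sketch_buf_release(b);      /* common cleanup, safe after the error path */
    return rc;
}
```

Whether the actual crash follows this shape cannot be told from the report alone, since the backtrace section is empty.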
Version-Release number of selected component (if applicable):
How reproducible:
Every time, on this particular damaged filesystem.
Steps to Reproduce:
1. Run the command as shown above.
Actual results:
gfs_fsck segfaults and exits.
Expected results:
gfs_fsck fixes the filesystem.
Created attachment 339902
A similar problem was fixed in gfs2_fsck. This patch is a
gfs-crosswrite from the gfs2 patch. It is completely untested.
I'm waiting to get the customer's metadata so I can test it
properly to make sure it fixes the problem.
Setting NEEDINFO flag until I can get the metadata in to make sure
it will fix the file system properly. Since the patch has not been
run even once, it's likely to need a few changes before it ships.
I do not recommend running the patch on a production machine until
the testing of the patch is complete.
This one seems to be bad too. Setting NEEDINFO again until I can
get a clean copy.
This copy of the metadata is perfect. I ran the patch I posted
with comment #13 on it, and it correctly fixes the file system.
I'll start the process of getting this into RHEL5 asap.
The patch was pushed to the master branch of the gfs1-utils git
tree, and the STABLE2, STABLE3 and RHEL5 branches of the cluster
git tree for inclusion into 5.4. It was tested on system roth-01
using the customer's metadata that failed before the patch.
Changing status to Modified.
Created attachment 341575
The previous patch forgot to actually write the changes to disk.
This was an oversight on my part, mainly because I made incorrect
assumptions based on how gfs2_fsck (from which the patch came)
operates. Hopefully this fixes it.
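As a generic illustration of that pitfall (this is not the attached patch, and the function name and header layout are assumptions made for the example): correcting a field in an in-memory copy of a block changes nothing on disk until the buffer is explicitly written back.

```c
#include <stdint.h>
#include <stdio.h>

#define SKETCH_HDR_LEN 8

/*
 * Rewrite the 32-bit big-endian type field of the block header found at
 * `offset`. Assumed layout: bytes 4..7 of the header hold the type.
 * Returns 0 on success, -1 on I/O error.
 */
int sketch_fix_block_type(FILE *dev, long offset, uint32_t new_type)
{
    unsigned char hdr[SKETCH_HDR_LEN];

    if (fseek(dev, offset, SEEK_SET) != 0)
        return -1;
    if (fread(hdr, 1, sizeof hdr, dev) != sizeof hdr)
        return -1;

    /* Step 1: fix the in-memory copy. On its own this never reaches disk. */
    hdr[4] = (unsigned char)(new_type >> 24);
    hdr[5] = (unsigned char)(new_type >> 16);
    hdr[6] = (unsigned char)(new_type >> 8);
    hdr[7] = (unsigned char)new_type;

    /* Step 2: the easily forgotten part, write the copy back and flush. */
    if (fseek(dev, offset, SEEK_SET) != 0)
        return -1;
    if (fwrite(hdr, 1, sizeof hdr, dev) != sizeof hdr)
        return -1;
    return fflush(dev) == 0 ? 0 : -1;
}
```

Dropping step 2 leaves a function that still returns success, which is why this kind of omission is easy to miss without re-reading the block from disk afterwards.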
Thanks for the good news. The addendum patch was pushed to the master
branch of the gfs1-utils git tree, and the STABLE2, STABLE3 and RHEL5
branches of the cluster git tree for inclusion into 5.4.
Using customer metadata, I have determined that bug #507775 was
caused by a regression introduced with this bug's patch. I have
an addendum patch that corrects the problem and allows gfs_fsck
to repair both sets of gfs metadata from bug #507775. I will
post the addendum patch immediately and start the process of
respinning this fix for all the appropriate releases. Temporarily
changing the status to FAILS_QA, but I should be able to push the
addendum fix today.
Created attachment 350176
Addendum patch for the 507775 problem.
This patch fixes both sets of corrupt metadata from bug #507775.
We can't commit this fix unless the blocker or exception flag is set for this bug.
Chris, the flags were set when I originally did the commit for
RHEL5.4, and that previous commit is defective. We can't ship
a defective fix, so we really have no choice. Do I really need
the exception flag? If so, I can likely get it.
The addendum has been pushed now to master in gfs1-utils, and
STABLE3, STABLE2, RHEL5 and RHEL54 branches of the cluster.git
repository. It was tested on system roth-01.
*** Bug 507775 has been marked as a duplicate of this bug. ***
Requesting the exception flag.
Verified with gfs-utils-0.1.20-1.el5.
To fully fix the filesystem, gfs_fsck has to be run twice. This applies until bug 509225 is fixed.
Passed eatype test on x86_64 and ia64.
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.