Bug 495774

Summary: gfs_fsck segfaults while fixing 'EA leaf block type' problem.
Product: Red Hat Enterprise Linux 5
Component: gfs-utils
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Eduardo Damato <edamato>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: cfeist, cward, edamato, hlawatschek, jkortus, rrottmann, tao
Target Milestone: rc
Fixed In Version: gfs-utils-0.1.20-1.el5
Doc Type: Bug Fix
Clones: 510758
Last Closed: 2009-09-02 11:01:00 UTC
Attachments:
  Preliminary patch (flags: none)
  Addendum patch (flags: none)
  Addendum patch for the 507775 problem (flags: none)

Description Eduardo Damato 2009-04-14 18:21:52 UTC
Description of problem:

Running `gfs_fsck -v -n <block device>' on a GFS filesystem with inconsistencies reports an EA leaf block problem:

Initializing fsck
Initializing lists...
Initializing special inodes...
Validating Resource Group index.
Level 1 check.
3798 resource groups found.
(passed)
Setting block ranges...
Creating a block list of size 249167869...
Starting pass1
Checking metadata in Resource Group 0
Checking metadata in Resource Group 1
...
Checking metadata in Resource Group 3065
Checking metadata in Resource Group 3066
Checking metadata in Resource Group 3067
Checking metadata in Resource Group 3068
EA leaf block has incorrect type.

gfs_fsck then aborts.

When gfs_fsck is run without the '-n' flag to fix the problem, it crashes with a double free detected by glibc:

# gfs_fsck -y <block device>
Initializing fsck 
Clearing journals (this may take a while).... 
Journals cleared. 
Starting pass1 
13 percent complete. 
25 percent complete. 
37 percent complete. 
48 percent complete. 
49 percent complete. 
61 percent complete. 
72 percent complete. 
EA leaf block has incorrect type. 
*** glibc detected *** gfs_fsck: double free or corruption (fasttop): 0x00000000059aaf20 *** 
======= Backtrace: ========= 
/lib64/libc.so.6[0x3ca0871634] 
/lib64/libc.so.6(cfree+0x8c)[0x3ca0874c5c] 
gfs_fsck[0x416546] 
gfs_fsck[0x4166ed] 
gfs_fsck[0x403c9c] 
gfs_fsck[0x404119] 
gfs_fsck[0x404370] 
gfs_fsck[0x40148a] 
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3ca081d8b4] 
gfs_fsck[0x4010e9] 
======= Memory map: ======== 
00400000-00422000 r-xp 00000000 fd:00 8053142                            /sbin/gfs_fsck 
00622000-00623000 rw-p 00022000 fd:00 8053142                            /sbin/gfs_fsck 
03d2f000-059ba000 rw-p 03d2f000 00:00 0                                  [heap] 
3ca0400000-3ca041a000 r-xp 00000000 fd:00 1146035                        /lib64/ld-2.5.so 
3ca061a000-3ca061b000 r--p 0001a000 fd:00 1146035                        /lib64/ld-2.5.so 
3ca061b000-3ca061c000 rw-p 0001b000 fd:00 1146035                        /lib64/ld-2.5.so 
3ca0800000-3ca094a000 r-xp 00000000 fd:00 1146049                        /lib64/libc-2.5.so 
3ca094a000-3ca0b49000 ---p 0014a000 fd:00 1146049                        /lib64/libc-2.5.so 
3ca0b49000-3ca0b4d000 r--p 00149000 fd:00 1146049                        /lib64/libc-2.5.so 
3ca0b4d000-3ca0b4e000 rw-p 0014d000 fd:00 1146049                        /lib64/libc-2.5.so 
3ca0b4e000-3ca0b53000 rw-p 3ca0b4e000 00:00 0 
3ca4000000-3ca400d000 r-xp 00000000 fd:00 1146061                        /lib64/libgcc_s-4.1.2-20080102.so.1 
3ca400d000-3ca420d000 ---p 0000d000 fd:00 1146061                        /lib64/libgcc_s-4.1.2-20080102.so.1 
3ca420d000-3ca420e000 rw-p 0000d000 fd:00 1146061                        /lib64/libgcc_s-4.1.2-20080102.so.1 
2adb93530000-2adb93531000 rw-p 2adb93530000 00:00 0 
2adb93540000-2adba0532000 rw-p 2adb93540000 00:00 0 
2adba4000000-2adba4021000 rw-p 2adba4000000 00:00 0 
2adba4021000-2adba8000000 ---p 2adba4021000 00:00 0 
7fff17535000-7fff1757a000 rw-p 7ffffffba000 00:00 0                      [stack] 
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso] 
Aborted 
 
followed by a core dump.
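
For context: glibc prints "double free or corruption (fasttop)" when free() is handed the chunk that is already at the head of a fastbin free list, which usually means the same small allocation was released twice in a row. The sketch below is not the gfs_fsck code. It only illustrates, with hypothetical names (struct buf, buf_get, buf_put and check_leaf are all assumptions), how an error path that releases a buffer the caller also releases can trigger this abort, and one common defensive pattern in which the release function clears the caller's pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an in-memory block buffer; this is NOT the
 * real gfs_fsck buffer structure. */
struct buf {
    char *data;
    size_t len;
};

/* Allocate a zero-filled buffer of `len` bytes. Returns NULL on failure. */
struct buf *buf_get(size_t len)
{
    struct buf *b = malloc(sizeof(*b));
    if (b == NULL)
        return NULL;
    b->data = calloc(1, len);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->len = len;
    return b;
}

/* Release a buffer and clear the caller's pointer. If a later cleanup
 * path calls buf_put() on the same variable, it sees NULL and returns,
 * instead of passing an already-freed pointer to free(). */
void buf_put(struct buf **bp)
{
    if (bp == NULL || *bp == NULL)
        return;
    free((*bp)->data);
    free(*bp);
    *bp = NULL;
}

/* Hypothetical check: the first byte of the block is treated as its type.
 * On a mismatch the error path releases the buffer and returns -1.
 * With a plain free(b) here and another free(b) in the caller, glibc
 * would abort with a double-free report. */
int check_leaf(struct buf **bp, int expected_type)
{
    assert(bp != NULL && *bp != NULL);
    if ((*bp)->data[0] != expected_type) {
        buf_put(bp);
        return -1;
    }
    return 0;
}
```

A caller can then run its normal cleanup (`buf_put(&b);`) unconditionally after `check_leaf(&b, ...)`, whether or not the check already released the buffer.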


Version-Release number of selected component (if applicable):

kernel-2.6.18-120.el5.bz470074.0
gfs2-utils-0.1.44-1.el5_2.1
glibc-common-2.5-24
cman-2.0.98-1.el5
kernel-headers-2.6.18-121.el5
kernel-doc-2.6.18-92.1.18.el5
kernel-2.6.18-134.el5
glibc-2.5-24
gfs-utils-0.1.18-1.el5

How reproducible:

Every time, on this particular damaged filesystem.

Steps to Reproduce:

1. Run the command as shown above.
  
Actual results:

gfs_fsck aborts on a glibc double-free error and dumps core.

Expected results:

gfs_fsck fixes the filesystem.

Additional info:

Comment 13 Robert Peterson 2009-04-16 19:03:21 UTC
Created attachment 339902
Preliminary patch

A similar problem was fixed in gfs2_fsck.  This patch is a port
(a "crosswrite") of the gfs2 patch to gfs.  It is completely untested.
I'm waiting to get the customer's metadata so I can test it
properly to make sure it fixes the problem.

Comment 14 Robert Peterson 2009-04-16 19:05:26 UTC
Setting NEEDINFO flag until I can get the metadata in to make sure
it will fix the file system properly.  Since the patch has not been
run even once, it's likely to need a few changes before it ships.
I do not recommend running the patch on a production machine until
the testing of the patch is complete.

Comment 16 Robert Peterson 2009-04-20 17:06:50 UTC
This copy of the metadata seems to be bad too.  Setting NEEDINFO
again until I can get a clean copy.

Comment 18 Robert Peterson 2009-04-21 22:03:44 UTC
This copy of the metadata is perfect.  I ran the patch I posted
with comment #13 on it, and it correctly fixes the file system.
I'll start the process of getting this into RHEL5 asap.

Comment 19 Robert Peterson 2009-04-22 15:07:42 UTC
The patch was pushed to the master branch of the gfs1-utils git
tree, and the STABLE2, STABLE3 and RHEL5 branches of the cluster
git tree for inclusion into 5.4.  It was tested on system roth-01
using the customer's metadata that failed before the patch.
Changing status to Modified.

Comment 21 Robert Peterson 2009-04-28 14:10:06 UTC
Created attachment 341575
Addendum patch

The previous patch forgot to actually write the changes to disk.
This was an oversight on my part, mainly because I made incorrect
assumptions based on how gfs2_fsck (from which the patch came)
operates.  Hopefully this fixes it.
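
The oversight described above is a common one in tools that repair metadata through an in-memory copy of a block: the fix only takes effect once the buffer is written back to the device. A minimal C sketch of that read-modify-write cycle follows, assuming a 4096-byte block size. The helpers block_read, block_write and repair_slot are hypothetical and are not gfs_fsck functions.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread/pwrite */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096  /* assumed block size for this sketch */

/* Read block `blkno` from `fd` into `buf`. Returns 0 on success, -1 on error. */
int block_read(int fd, uint64_t blkno, unsigned char *buf)
{
    ssize_t n = pread(fd, buf, BLOCK_SIZE, (off_t)(blkno * BLOCK_SIZE));
    return n == BLOCK_SIZE ? 0 : -1;
}

/* Write `buf` to block `blkno` on `fd`. Returns 0 on success, -1 on error. */
int block_write(int fd, uint64_t blkno, const unsigned char *buf)
{
    ssize_t n = pwrite(fd, buf, BLOCK_SIZE, (off_t)(blkno * BLOCK_SIZE));
    return n == BLOCK_SIZE ? 0 : -1;
}

/* Hypothetical repair: zero an 8-byte pointer slot at `offset` in a block.
 * With write_back == 0 the change stays in the in-memory copy only and the
 * on-disk block is untouched; with write_back != 0 it is persisted. */
int repair_slot(int fd, uint64_t blkno, size_t offset, int write_back)
{
    unsigned char *buf;
    int rc;

    assert(offset + sizeof(uint64_t) <= BLOCK_SIZE);
    buf = malloc(BLOCK_SIZE);
    if (buf == NULL)
        return -1;

    rc = block_read(fd, blkno, buf);
    if (rc == 0) {
        memset(buf + offset, 0, sizeof(uint64_t));  /* fix the copy */
        if (write_back)
            rc = block_write(fd, blkno, buf);       /* make it durable */
    }
    free(buf);
    return rc;
}
```

Calling `repair_slot(fd, blkno, offset, 0)` reports success even though a later read of the same block still returns the old contents, which is the kind of silent no-op the addendum patch addresses.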

Comment 23 Robert Peterson 2009-04-29 21:37:12 UTC
Thanks for the good news.  The addendum patch was pushed to the master
branch of the gfs1-utils git tree, and the STABLE2, STABLE3 and RHEL5
branches of the cluster git tree for inclusion into 5.4.

Comment 25 Robert Peterson 2009-07-01 19:27:47 UTC
Using customer metadata, I have determined that bug #507775 was
caused by a regression introduced with this bug's patch.  I have
an addendum patch that corrects the problem and allows gfs_fsck
to repair both sets of gfs metadata from bug #507775.  I will
post the addendum patch immediately and start the process of
respinning this fix for all the appropriate releases.  Temporarily
changing the status to FAILS_QA, but I should be able to push the
addendum fix today.

Comment 26 Robert Peterson 2009-07-01 19:29:45 UTC
Created attachment 350176
Addendum patch for the 507775 problem.

This patch fixes both sets of corrupt metadata from bug #507775.

Comment 27 Chris Feist 2009-07-01 19:36:46 UTC
We can't commit this fix unless the blocker or exception flag is set for this bug.

Comment 28 Robert Peterson 2009-07-01 19:55:01 UTC
Chris, the flags were set when I originally did the commit for
RHEL5.4, and that previous commit is defective.  We can't ship
a defective fix, so we really have no choice.  Do I really need
the exception flag?  If so, I can likely get it.

The addendum has been pushed now to master in gfs1-utils, and
STABLE3, STABLE2, RHEL5 and RHEL54 branches of the cluster.git
repository.  It was tested on system roth-01.

Comment 29 Robert Peterson 2009-07-01 20:03:24 UTC
*** Bug 507775 has been marked as a duplicate of this bug. ***

Comment 31 Robert Peterson 2009-07-01 21:06:09 UTC
Requesting the exception flag.

Comment 35 Jaroslav Kortus 2009-07-20 16:10:32 UTC
Verified with gfs-utils-0.1.20-1.el5.

To fully fix the filesystem, gfs_fsck has to be run twice. This applies until bug 509225 is fixed.
	
Passed eatype test on x86_64 and ia64.

Comment 38 errata-xmlrpc 2009-09-02 11:01:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1336.html