Bug 507775

Summary: fsck.gfs segfaults when repairing a corrupt GFS volume. Data gets lost!
Product: Red Hat Enterprise Linux 5 Reporter: Reiner Rottmann <rrottmann>
Component: gfs-utils Assignee: Robert Peterson <rpeterso>
Status: CLOSED DUPLICATE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5 CC: edamato, hlawatschek, jkortus
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-07-01 20:03:24 UTC Type: ---
Attachments:
Description Flags
Rough Cut Patch none

Description Reiner Rottmann 2009-06-24 07:17:28 UTC
Description of problem:
A small GFS volume (20GB, 3GB used) was corrupted when a server was shut down uncleanly.

Running fsck.gfs on the volume ends in a segfault.

After that, when the device is mounted with lockproto=lock_nolock on a single machine, the filesystem is reported as holding 3GB of data, but only around 500MB of it is still accessible. Most files on the filesystem seem to be missing.
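The gap between reported and accessible data can be quantified with a short script. This is a minimal sketch, not part of the original report: the function name used_vs_reachable and the mount point in the usage note are hypothetical, and it compares apparent file sizes against the used space reported by the filesystem, so small differences (metadata, block rounding) are expected even on a healthy volume.

```python
import os
import shutil
import stat


def used_vs_reachable(mount_point):
    """Return (used_bytes, reachable_bytes) for the tree under mount_point.

    used_bytes is what the filesystem itself reports as allocated.
    reachable_bytes is the sum of the apparent sizes of all regular files
    that can still be reached by walking the directory tree.
    """
    used = shutil.disk_usage(mount_point).used
    reachable = 0
    for root, dirs, files in os.walk(mount_point):
        # Do not descend into other filesystems mounted below this one.
        dirs[:] = [d for d in dirs
                   if not os.path.ismount(os.path.join(root, d))]
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                # Damaged entries may be unreadable; skip them.
                continue
            if stat.S_ISREG(st.st_mode):
                reachable += st.st_size
    return used, reachable
```

Called as used_vs_reachable("/mnt/gfs") (an assumed mount point), a result in which the first value is several times larger than the second would match the symptom described above.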

Version-Release number of selected component (if applicable):
RHEL 5 U3 with gfs-utils-0.1.18-1.el5

How reproducible:
Always with this filesystem

Steps to Reproduce:
1. Run fsck.gfs on the device.
2. Wait a few minutes.
3. Observe the segfault.
  
Actual results:

fsck.gfs -y /dev/mapper/vg-gfs_lv-gfs
Initializing fsck
Clearing journals (this may take a while)....
Journals cleared.
Starting pass1
Found unused inode marked in-use
Pass1 complete
Starting pass1b
Found dup block at 724461
Block 724461 has 2 inodes referencing it fora total of 2 duplicate references
Inode (null) has 1 reference(s) to block 724461
Clearing...
Jun 22 19:26:55 (none) kernel: fsck.gfs[30737]: segfault at 0000000000000018 rip 0000000000416406 rsp 00007fff232c09b0 error 4
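The kernel segfault line above can be decoded mechanically. The following is a minimal sketch, assuming the x86 page-fault error code layout (bit 0: page was present, bit 1: write access, bit 2: fault came from user mode) and that the kernel prints all values in hex; the helper name decode_segfault is hypothetical and not part of any GFS tool.

```python
import re

# Matches the 2.6.18-era x86_64 message format seen in this report.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Parse a kernel segfault log line into a dict, or return None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),      # faulting address
        "rip": int(m.group("rip"), 16),        # instruction pointer
        "rsp": int(m.group("rsp"), 16),        # stack pointer
        "page_present": bool(err & 0x1),       # 0 means page not mapped
        "write": bool(err & 0x2),              # 0 means read access
        "user_mode": bool(err & 0x4),          # 1 means fault in user space
    }
```

For the line in this report, the result describes a user-mode read of an unmapped page at address 0x18. A fault address that small is consistent with reading a structure member through a NULL pointer, which would also fit the "Inode (null)" line printed just before the crash, though that is an interpretation rather than something the log proves. If matching debug symbols are available, the rip value can be mapped to a source line with addr2line -e against the gfs_fsck binary.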

Expected results:
fsck.gfs should not segfault.
Errors should be corrected.

Additional info:
# uname -a
Linux C2N1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qf /sbin/gfs_fsck
gfs-utils-0.1.18-1.el5

Comment 1 Robert Peterson 2009-06-24 15:01:26 UTC
This is very likely to be the result of one of the bugs I found
in gfs2's fsck in bug #500483.  I stated in this comment:

https://bugzilla.redhat.com/show_bug.cgi?id=500483#c8

that many of the bugs I found and fixed should be back-ported to
gfs's fsck since gfs2's fsck was based on gfs_fsck.  So I'll use
this bug to do that back-porting.  However, in order to make sure
that the problem is fixed, I'd like to get a copy of the GFS
metadata that recreated this segfault, if that's possible.

Comment 2 Robert Peterson 2009-06-26 19:57:14 UTC
Created attachment 349598 [details]
Rough Cut Patch

This is my first crack at a crosswrite patch from bug #500483
that I hope will fully fix the problem.  It is COMPLETELY untested
(and likely dangerous), but I wanted to save what I've got so far.
I need to debug this and run it through all the GFS metadata
I've got in my collection, like I did for gfs2's fsck, and that's
likely to take several days.

I am still waiting to get the original metadata so I can make
sure it fixes the problem.  There is still the possibility that
this bug is the same as 506550, in which case there is still
a fair amount of work to do.  Either way, this preliminary port
will be needed before I can get to that, and either way, I need
the original metadata that shows the problem.

Comment 3 Reiner Rottmann 2009-06-30 13:22:36 UTC
We were able to recreate this error in a test environment, where we created a complete dump of the metadata.

The dump is available here: http://www.files.to/get/718921/7t711aftr0

See the attachment fsck.gfs.dump for the complete output of the fsck.gfs command.

Comment 4 Jaroslav Kortus 2009-06-30 20:07:39 UTC
Could this be related to bug 493727? I ran into the very same issue during my testing there.

Comment 5 Robert Peterson 2009-06-30 20:26:03 UTC
I doubt that it is the same problem, but I just received their
metadata today, so I can likely run the different versions of
gfs_fsck against the metadata and find out.
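Comparing several gfs_fsck builds against the same metadata can be scripted. This is a minimal sketch, not the procedure actually used here: the helper name run_and_classify, the binary paths, and the device name in the usage comment are all hypothetical. It relies on the documented behavior of Python's subprocess module, which reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess


def run_and_classify(cmd):
    """Run cmd and report whether it exited normally or died on a signal."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT)
    if proc.returncode < 0:
        # A negative return code is the negated signal number.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exit status %d" % proc.returncode


# Hypothetical usage against a scratch copy of the restored metadata:
#
#   for fsck in ["/tmp/gfs_fsck.orig", "/tmp/gfs_fsck.patched"]:
#       print(fsck, run_and_classify([fsck, "-y", "/dev/loop0"]))
```

Because fsck modifies the filesystem, each build would need its own fresh copy of the metadata for the results to be comparable.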

Comment 6 Reiner Rottmann 2009-07-01 06:49:31 UTC
Here is the metadata of the original filesystem where the bug was first encountered:

http://www.files.to/get/719467/3f4fb20cqu

Comment 7 Robert Peterson 2009-07-01 20:03:24 UTC
I have determined that this problem was caused by a regression
introduced by the patch for bug #495774.  That bug record now
contains an addendum patch that fixes the segfault.  Therefore,
I'm closing this one as a duplicate of 495774.

I have opened a new bug #509225 for the gfs crosswrite work from
gfs2 bug #500483 (GFS2: fsck.gfs2 sometimes needs to be run twice).

*** This bug has been marked as a duplicate of bug 495774 ***