Bug 179069

Summary: gfs_fsck unable to fix file system
Product: [Retired] Red Hat Cluster Suite
Reporter: Henry Harris <henry.harris>
Component: gfs
Assignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium    
Version: 4   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: RHBA-2006-0560
Doc Type: Bug Fix
Last Closed: 2006-08-10 21:28:44 UTC
Bug Blocks: 180185    
Attachments:
Description                          Flags
Output from running gfs_fsck         none
Patch to fix the problem             none
Better patch to fix the problem      none

Description Henry Harris 2006-01-26 23:00:36 UTC
Description of problem: We ran gfs_fsck, but it was unable to fix the file system.
See the attachment gfs_fsck.out. The nodes in the cluster had crashed, as
described under Additional info below. We were getting I/O errors when trying to
access the GFS file system, so we ran gfs_fsck.


How reproducible:
Happened only once, on all three nodes in the cluster.

Additional info:
1) The process and stack were identical in both crashes: the process was
   'o5_wait_for_sys', whose parent process is 'hotplug'.
 
2) The crash occurred in the read() system call, and the stack
   looked like:
 
show_cfsmnt()
seq_read()
vfs_read()
sys_read()  <---- crash with null pointer

Comment 1 Henry Harris 2006-01-26 23:02:46 UTC
Created attachment 123750
Output from running gfs_fsck

Comment 2 Robert Peterson 2006-01-30 22:51:58 UTC
The gfs_fsck messages indicate corruption in the GFS resource group (RG)
information for the file system. It's nearly impossible to say whether the crash
caused the corruption or whether the corruption caused the crash.

Is there any way I can get a copy of the corrupted filesystem to examine?
I'd like to see the corruption first-hand, if possible.  Sometimes the only way
to find a smoking gun is to find the embedded bullet and follow the trail of
smoke backward to its source.


Comment 3 Henry Harris 2006-01-31 16:05:24 UTC
Unfortunately, the file system has been recreated and is no longer in the 
corrupted state.

Comment 4 Robert Peterson 2006-03-31 21:54:08 UTC
Created attachment 127163
Patch to fix the problem

Attached is an extensive patch that attempts to fix corrupted RGs and
corrupted RG index entries. Several rudimentary tests have been run
under a variety of conditions in which RGs and rgindex entries were
purposely corrupted. The patch seems to work properly in all cases
tested.

Comment 5 Robert Peterson 2006-04-19 14:10:27 UTC
Created attachment 127984
Better patch to fix the problem

This patch is much better.  Several code problems from the previous
patch were found and fixed.  This version has passed a newly designed
battery of test cases that use gfs_fsck to fix 245 different 
variations of:

(1) filesystem size, (2) number of journals, (3) location of RG 
corruption, (4) location of RG index corruption, (5) filesystem
resizing by gfs_grow, and (6) RG size and number of RGs.

I won't promise that it can fix all forms of RG and RG index
corruption, but it does pretty well.

Comment 8 Red Hat Bugzilla 2006-08-10 21:28:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0560.html