Bug 179069 - gfs_fsck unable to fix file system

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Robert Peterson
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	180185

Reported:	2006-01-26 23:00 UTC by Henry Harris
Modified:	2010-01-12 03:09 UTC
CC List:	0 users
Fixed In Version:	RHBA-2006-0560
Clone Of:
Environment:
Last Closed:	2006-08-10 21:28:44 UTC
Embargoed:

Attachments
Output from running gfs_fsck (308 bytes, text/plain) 2006-01-26 23:02 UTC, Henry Harris
Patch to fix the problem (42.17 KB, patch) 2006-03-31 21:54 UTC, Robert Peterson
Better patch to fix the problem (52.03 KB, patch) 2006-04-19 14:10 UTC, Robert Peterson

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0560	0	normal	SHIPPED_LIVE	GFS bug fix update	2006-08-10 04:00:00 UTC

Description Henry Harris 2006-01-26 23:00:36 UTC

Description of problem: Ran gfs_fsck, but it was unable to fix the file system.
See attachment gfs_fsck.out.  The nodes in the cluster had crashed as
described in Additional info below.  We were getting I/O errors trying to
access the GFS file system, so we ran gfs_fsck.


Version-Release number of selected component (if applicable):


How reproducible:
Only happened once, on all three nodes in the cluster

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
1)  The crashing process and stack were identical in both crashes: a process
     called 'o5_wait_for_sys' whose parent process is 'hotplug'.
 
2) The crash occurred in the read() system call; the stack
    looked like:
 
show_cfsmnt()
seq_read()
vfs_read()
sys_read()  <---- crash w/ null pointer

Comment 1 Henry Harris 2006-01-26 23:02:46 UTC

Created attachment 123750
Output from running gfs_fsck

Comment 2 Robert Peterson 2006-01-30 22:51:58 UTC

The gfs_fsck messages indicate corruption in the gfs resource group information
for the filesystem.  It's nearly impossible to say whether the crash caused the
corruption or whether the corruption caused the crash.

Is there any way I can get a copy of the corrupted filesystem to examine?
I'd like to see the corruption first-hand, if possible.  Sometimes the only way
to find a smoking gun is to find the embedded bullet and follow the trail of
smoke backward to its source.

Comment 3 Henry Harris 2006-01-31 16:05:24 UTC

Unfortunately, the file system has been recreated and is no longer in the 
corrupted state.

Comment 4 Robert Peterson 2006-03-31 21:54:08 UTC

Created attachment 127163
Patch to fix the problem

Attached is an extensive patch that attempts to fix corrupted RGs and
corrupted RG Index entries.  Several rudimentary tests have been run
on a variety of conditions under which rgs and rgindex entries were
purposely corrupted.  The patch seems to work properly in all cases
tested.

Comment 5 Robert Peterson 2006-04-19 14:10:27 UTC

Created attachment 127984
Better patch to fix the problem

This patch is much better.  Several code problems from the previous
patch were found and fixed.  This version has passed a newly designed
battery of test cases that use gfs_fsck to fix 245 different 
variations of:

(1) filesystem size, (2) number of journals, (3) location of RG 
corruption, (4) location of RG index corruption, (5) filesystem
resizing by gfs_grow, and (6) RG size and number of RGs.

I won't promise that it can fix all forms of RG and RG index
corruption, but it does pretty well.
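
Editorial illustration (not part of the attached patch or its test suite): a
battery that varies the six dimensions listed above can be enumerated as a
cross product. The sketch below is a minimal example under stated assumptions.
Every dimension name and value in it is hypothetical, since the report does
not give the values actually tested, and it does not try to reproduce the
245 variations mentioned above.

```python
from itertools import product

# Hypothetical values for each of the six dimensions named in the comment.
# The values used in the original test battery are not given in this report.
DIMENSIONS = {
    "fs_size_gb": [1, 10],
    "journals": [1, 3],
    "rg_corruption": ["none", "first", "last"],
    "rgindex_corruption": ["none", "first", "last"],
    "grown_by_gfs_grow": [False, True],
    "rg_size_mb": [32, 256],
}


def build_cases(dimensions):
    """Return one dict per combination of dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


cases = build_cases(DIMENSIONS)
print(len(cases), "test cases generated")
```

Each generated dict could then drive one run: create a file system with those
parameters, damage it as described, run gfs_fsck, and check the result. Those
steps are left out here because they depend on the test environment.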

Comment 8 Red Hat Bugzilla 2006-08-10 21:28:45 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0560.html
