Description of problem: Ran gfs_fsck, but it was unable to fix the file system.
See attachment gfs_fsck.out. The nodes in the cluster had crashed as
described in additional info below. We were getting I/O errors trying to
access the GFS file system, so we ran gfs_fsck.
Version-Release number of selected component (if applicable):
Only happened once on all three nodes in cluster
Steps to Reproduce:
1) The process and stack were identical in both crashes: a process
called 'o5_wait_for_sys' whose parent process is 'hotplug'
2) The crashing function in the stack was in the read() system call
and looked like:
sys_read() <---- crash w/ null pointer
Created attachment 123750
Output from running gfs_fsck
The gfs_fsck messages indicate corruption in the gfs resource group information
for the filesystem. It's nearly impossible to say whether the crash caused the
corruption or whether the corruption caused the crash.
Is there any way I can get a copy of the corrupted filesystem to examine?
I'd like to see the corruption first-hand, if possible. Sometimes the only way
to find a smoking gun is to find the embedded bullet and follow the trail of
smoke backward to its source.
Unfortunately, the file system has been recreated and is no longer in the
Created attachment 127163
Patch to fix the problem
Attached is an extensive patch that attempts to fix corrupted RGs and
corrupted RG Index entries. Several rudimentary tests have been run
on a variety of conditions under which RGs and RG index entries were
purposely corrupted. The patch seems to work properly in all cases.
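For anyone wanting to repeat this kind of deliberate-corruption test, here is a minimal sketch of how a region of a scratch filesystem image might be overwritten before running gfs_fsck against it. This is not the harness used for the patch: the helper name `corrupt_region`, the fill byte, and any offsets passed to it are hypothetical, and it should only be pointed at a disposable copy of an image, never a live device.

```python
import os


def corrupt_region(image_path, offset, length, fill=b"\x00"):
    """Overwrite `length` bytes at `offset` in a scratch image file.

    Returns the original bytes so a test can compare or restore them.
    """
    if len(fill) != 1:
        raise ValueError("fill must be a single byte")
    size = os.path.getsize(image_path)
    if offset < 0 or length <= 0 or offset + length > size:
        raise ValueError("region falls outside the image")
    with open(image_path, "r+b") as img:
        # Save what was there before damaging it.
        img.seek(offset)
        original = img.read(length)
        # Overwrite the same region in place.
        img.seek(offset)
        img.write(fill * length)
    return original
```

Returning the original bytes makes it easy to check afterwards whether a repair restored the damaged region or rebuilt it differently.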
Created attachment 127984
Better patch to fix the problem
This patch is much better. Several code problems from the previous
patch were found and fixed. This version has passed a newly designed
battery of test cases that use gfs_fsck to fix 245 different
(1) filesystem size, (2) number of journals, (3) location of RG
corruption, (4) location of RG index corruption, (5) filesystem
resizing by gfs_grow, and (6) RG size and number of RGs.
I won't promise that it can fix all forms of RG and RG index
corruption, but it does pretty well.
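As a rough illustration of how a test battery over these six dimensions could be enumerated, here is a small sketch. It is not the actual test suite: the dimension names and every value listed are hypothetical placeholders (the real values are not given in this report), and dimension (6) is reduced to RG size alone for brevity.

```python
from itertools import product

# Hypothetical values for each dimension; the real battery's values
# are not stated in this bug report.
DIMENSIONS = {
    "fs_size_gb": [1, 10],
    "journals": [1, 3],
    "rg_corruption": ["none", "first", "middle", "last"],
    "rgindex_corruption": ["none", "first", "middle", "last"],
    "grown_by_gfs_grow": [False, True],
    "rg_size_mb": [32, 256],
}


def build_cases(dimensions):
    """Return one dict per combination, skipping cases with no corruption."""
    names = list(dimensions)
    cases = []
    for values in product(*(dimensions[name] for name in names)):
        case = dict(zip(names, values))
        # A case with nothing corrupted gives the repair code nothing to do.
        if (case["rg_corruption"] == "none"
                and case["rgindex_corruption"] == "none"):
            continue
        cases.append(case)
    return cases
```

Each resulting dict describes one filesystem to create, corrupt, and then hand to gfs_fsck. Taking the full cross product keeps interactions between dimensions (for example, RG index corruption on a filesystem that was grown) from being missed.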
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.