Description of problem:
Ran gfs_fsck, but it was unable to fix the file system. See attachment gfs_fsck.out. The nodes in the cluster had crashed as described in the additional info below. We were getting I/O errors when trying to access the GFS file system, so we ran gfs_fsck.

Version-Release number of selected component (if applicable):

How reproducible:
Only happened once, on all three nodes in the cluster.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
1) The crashing process and stack were identical on both crashes: a process called 'o5_wait_for_sys', whose parent process is 'hotplug'.
2) The crash occurred in the read() system call, and the stack looked like:
show_cfsmnt()
seq_read()
vfs_read()
sys_read() <---- crash w/ null pointer
Created attachment 123750 [details] Output from running gfs_fsck
The gfs_fsck messages indicate corruption in the gfs resource group information for the filesystem. It's nearly impossible to say whether the crash caused the corruption or whether the corruption caused the crash. Is there any way I can get a copy of the corrupted filesystem to examine? I'd like to see the corruption first-hand, if possible. Sometimes the only way to find a smoking gun is to find the embedded bullet and follow the trail of smoke backward to its source.
Unfortunately, the file system has been recreated and is no longer in the corrupted state.
Created attachment 127163 [details] Patch to fix the problem
Attached is an extensive patch that attempts to fix corrupted RGs and corrupted RG index entries. Several rudimentary tests have been run under a variety of conditions in which RGs and RG index entries were purposely corrupted. The patch seems to work properly in all cases tested.
Created attachment 127984 [details] Better patch to fix the problem
This patch is much better. Several code problems from the previous patch were found and fixed. This version has passed a newly designed battery of test cases that use gfs_fsck to fix 245 different variations of:
(1) filesystem size,
(2) number of journals,
(3) location of RG corruption,
(4) location of RG index corruption,
(5) filesystem resizing by gfs_grow, and
(6) RG size and number of RGs.
I won't promise that it can fix all forms of RG and RG index corruption, but it does pretty well.
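For anyone who wants to build a similar battery, here is a minimal sketch of how a test matrix over those six dimensions could be enumerated. It is not the harness used for the patch: every value in DIMENSIONS is a hypothetical placeholder, the names are made up for illustration, and the resulting count is not meant to reproduce the 245 variations mentioned above. Actually creating, corrupting, and checking each filesystem is left out.

```python
from itertools import product

# Hypothetical placeholder values for each dimension; these are NOT the
# values used to test the patch.
DIMENSIONS = {
    "fs_size_gb": [1, 10],
    "journals": [1, 3],
    "rg_corruption": ["none", "first", "middle", "last"],
    "rgindex_corruption": ["none", "first", "last"],
    "grown_with_gfs_grow": [False, True],
    "rg_size_mb": [32, 256],
}


def build_cases(dimensions):
    """Return one dict per combination of dimension values."""
    names = list(dimensions)
    cases = []
    for values in product(*(dimensions[name] for name in names)):
        case = dict(zip(names, values))
        # Skip combinations with nothing corrupted: there is nothing to repair.
        if (case["rg_corruption"] == "none"
                and case["rgindex_corruption"] == "none"):
            continue
        cases.append(case)
    return cases


cases = build_cases(DIMENSIONS)
print(len(cases), "test cases")
```

Each resulting dict describes one filesystem to create, corrupt, and then hand to gfs_fsck. Keeping the dimensions in a single table makes it easy to add a new corruption location without touching the enumeration logic.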
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0560.html