Description of problem:
The GFS2 filesystem reported consistency errors and can no longer be used. It is not possible to read from or write to it. The messages are:

Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: fatal: filesystem consistency error
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: RG = 130666
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: function = gfs2_setbit, file = /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/rgrp.c, line = 97
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: about to withdraw this file system
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: telling LM to withdraw

The output of gfs2_fsck is attached.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Linux server 2.6.18-128.4.1.el5 #1 SMP Thu Jul 23 19:59:19 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The filesystem shows consistency errors after gfs2_fsck is executed.

Expected results:
gfs2_fsck repairs the GFS2 filesystem and determines whether the consistency error is caused by a bug.

Additional info:
This new bug was opened as suggested in Bug #490136 (internal note). It's a production system.
The output of fsck.gfs2 does not seem to be attached, contrary to what the description states. I recommend they download and untar this file:
http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2_fsck_edit.tgz
It contains new versions of fsck.gfs2 and gfs2_edit. I recommend first saving off their metadata with this version of gfs2_edit, using a command like this:
    gfs2_edit savemeta /dev/their/device /tmp/519721.meta
Then perhaps they can run this version of fsck.gfs2 to see whether it fixes their file system.
The consistency error is caused by file system corruption. There are many ways the file system may become corrupted. Some of them are due to hardware problems, such as defective storage or Host Bus Adapter. Some of them are due to user error, such as running fsck.gfs2 while the file system is mounted on a different node in the cluster. Some of them are due to bugs in the GFS2 file system. It is nearly impossible to say what caused the file system corruption at this time because there is almost no information here to analyze. File system corruption problems are very difficult to solve unless we have a scenario we can use to recreate the corruption reliably, starting with a "clean" file system.
This may be related to bug #519049. I'm hoping to attach a patch for that bug later today. Perhaps the customer would be willing to try it.
These two newest attachments have syslogs that are nearly a month old. They indicate the customer is still using the obsolete GFS2 overlay module. That needs to be removed. They should be able to remove it with this command:
    rpm -e gfs2-kmod
The syslog also indicates possible hardware problems with their storage (or possibly multipath problems). I don't see further evidence of GFS2 doing the wrong thing, but that gfs2-kmod rpm needs to be removed in favor of a newer GFS2 module that is built into the kernel.
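If it helps, the removal can be guarded so that it only runs when the package is actually installed. This is just a sketch: the function name remove_overlay is made up for illustration, and the package name is the one given in the comment above.

```shell
# Hypothetical helper: remove the overlay package only if rpm reports it
# as installed. Defaults to the package name used in the comment above.
remove_overlay() {
    pkg=${1:-gfs2-kmod}
    if rpm -q "$pkg" >/dev/null 2>&1; then
        echo "removing $pkg"
        rpm -e "$pkg"
    else
        # Also reached if rpm itself is unavailable on this host.
        echo "$pkg is not installed"
    fi
}
```

Run it as root on each node, for example: remove_overlay gfs2-kmod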
Setting NEEDINFO until I get feedback
Hot off the press, this is my latest and greatest 5.5 version of fsck.gfs2:
http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2
It has passed all the tests I've run so far and it's faster than the official version. I recommend they save their gfs2 metadata and run this version, saving the output. In other words:
(1) Save this version of fsck.gfs2 into some directory like ~/Download
(2) Unmount the file system from all nodes:
    umount /mnt/gfs2
(3) On one node:
    gfs2_edit savemeta /dev/your/device ~/519721.savemeta
    cd ~/Download (or the directory where you have the new fsck.gfs2)
    ./fsck.gfs2 -y /dev/your/device &> /tmp/fsck.gfs2.output
(4) Re-mount the file system as normal
(5) Please post the output (/tmp/fsck.gfs2.output) to the bugzilla.
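For convenience, step (3) above could be wrapped in a small script along these lines. Treat it as a rough sketch rather than a tested tool: the names run, check_gfs2 and DRY_RUN are made up for illustration, and the mount check only looks at /proc/mounts on the local node, so it may miss device aliases (multipath or device-mapper names) and it says nothing about the other nodes in the cluster. By default it only prints the commands it would run.

```shell
# Sketch only. Helper names and the DRY_RUN switch are hypothetical;
# the gfs2_edit and fsck.gfs2 invocations are the ones from the steps above.

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# usage: check_gfs2 DEVICE FSCK_BINARY META_FILE OUTPUT_FILE
check_gfs2() {
    dev=$1; fsck=$2; meta=$3; out=$4

    # Best-effort check that the device is not mounted on THIS node.
    # The other cluster nodes still have to be checked by hand.
    if grep -q "^$dev " /proc/mounts 2>/dev/null; then
        echo "error: $dev is still mounted on this node" >&2
        return 1
    fi

    # Save the metadata first, so there is a copy from before the repair.
    run gfs2_edit savemeta "$dev" "$meta" || return 1

    # Run the new fsck.gfs2 and capture stdout and stderr for the bugzilla.
    echo "+ $fsck -y $dev > $out 2>&1"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$fsck" -y "$dev" > "$out" 2>&1
    fi
}
```

Example, after unmounting the file system from all nodes: check_gfs2 /dev/your/device ~/Download/fsck.gfs2 ~/519721.savemeta /tmp/fsck.gfs2.output prints the two commands. Set DRY_RUN=0 to actually execute them.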