Description of problem:
A GFS2 filesystem reported consistency errors and can no longer be used. It is not possible to read from it or write to it.
The message is:
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: fatal: filesystem consistency error
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: RG =
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: function = gfs2_setbit, file = /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/rgrp.c, line = 97
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: about to withdraw this file system
Aug 23 12:34:33 server kernel: GFS2: fsid=cluster01:gfsv2vg04vol02.0: telling LM to withdraw
The output of gfs2_fsck is attached.
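To pull the relevant kernel messages out of syslog for a report like this, something along these lines might help. It is only a sketch: the function name `gfs2_errors` is hypothetical, and `/var/log/messages` is assumed to be where kernel messages are logged on this system.

```shell
#!/bin/bash
# Hypothetical helper: print GFS2 kernel messages that mention a fatal
# error or a withdraw.
# Assumption: kernel messages go to /var/log/messages unless another
# file is given as the first argument.
gfs2_errors() {
    local logfile="${1:-/var/log/messages}"
    grep -E 'GFS2: fsid=.*(fatal|withdraw)' "$logfile"
}

# Usage (commented out so that sourcing this file has no side effects):
#   gfs2_errors /var/log/messages
```

Note that lines such as the `function = ...` one are not matched by this pattern, so the full log is still worth attaching.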
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Linux server 2.6.18-128.4.1.el5 #1 SMP Thu Jul 23 19:59:19 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
Steps to Reproduce:
The filesystem shows filesystem consistency errors after executing gfs2_fsck.
gfs2_fsck should repair the GFS2 filesystem and determine whether the consistency error is caused by a bug.
This new bug was opened as suggested in bug #490136 (internal note). It is a production system.
The output of fsck.gfs2 does not seem to be attached as stated
in the description. I recommend they download and untar this
It contains new versions of fsck.gfs2 and gfs2_edit.
I recommend first saving off their metadata with this
version of gfs2_edit using a command like this:
gfs2_edit savemeta /dev/their/device /tmp/519721.meta
Then perhaps they can run this version of fsck.gfs2 to see if it
fixes their file system.
The consistency error is caused by file system corruption.
There are many ways the file system may become corrupted.
Some of them are due to hardware problems, such as defective
storage or Host Bus Adapter. Some of them are due to user
error, such as running fsck.gfs2 while the file system is
mounted on a different node in the cluster. Some of them are
due to bugs in the GFS2 file system. It is nearly impossible
to say what caused the file system corruption at this time
because there is almost no information here to analyze.
File system corruption problems are very difficult to solve
unless we have a scenario we can use to recreate the corruption
reliably, starting with a "clean" file system.
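One of the user errors mentioned above, running fsck.gfs2 against a mounted file system, can be partly guarded against with a check like the following. This is a sketch only: `is_mounted_here` is a hypothetical helper, it assumes `/proc/mounts` is available, and it only sees mounts on the local node, so every other node in the cluster still has to be checked separately.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the given device appears as a
# mounted device in the mounts table, fail (exit 1) otherwise.
# Assumption: the device name is compared literally, so a symlink or
# device-mapper alias of the same volume will not be detected.
is_mounted_here() {
    local device="$1"
    local mounts="${2:-/proc/mounts}"
    # The first field of each line in /proc/mounts is the mounted device.
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$mounts"
}

# Usage (commented out so that sourcing this file has no side effects):
#   if is_mounted_here /dev/your/device; then
#       echo "still mounted on this node; unmount before fsck.gfs2" >&2
#   fi
```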
This may be related to bug #519049. I'm hoping to attach a patch
for that bug later today. Perhaps the customer would be willing
to try it.
These two newest attachments have syslogs that are nearly a
month old. They indicate the customer is still using the
obsolete GFS2 overlay module. That needs to be removed.
They should be able to remove it with this command:
rpm -e gfs2-kmod
The syslog also indicates possible hardware problems with their
storage (or possibly multipath problems). I don't see further
evidence of GFS2 doing the wrong thing, but that gfs2-kmod rpm
needs to be removed in favor of a newer GFS2 module that is
built into the kernel.
Setting NEEDINFO until I get feedback
Hot off the press, this is my latest and greatest 5.5 version.
It has passed all the tests I've run so far and it's faster than
the official version. I recommend they save their gfs2 metadata
and run this version, saving the output. In other words:
(1) Save this version of fsck.gfs2 into some directory like ~/Download
(2) Unmount the file system from all nodes.
(3) On one node:
gfs2_edit savemeta /dev/your/device ~/519721.savemeta
cd ~/Download (or the directory where you have the new fsck.gfs2)
./fsck.gfs2 -y /dev/your/device &> /tmp/fsck.gfs2.output
(4) Re-mount the file system as normal.
(5) Please post the output (/tmp/fsck.gfs2.output) to the bugzilla.
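The commands in step (3) could be wrapped in a small script such as the sketch below. The function name `run_fsck_steps` and the `DRY_RUN` variable are illustrative only and are not part of any GFS2 tool; the commands themselves are the ones listed above. With `DRY_RUN=1` the commands are printed instead of executed, which allows checking the paths before touching the device.

```shell
#!/bin/bash
# Sketch of step (3): save the metadata, then run the new fsck.gfs2.
# run_fsck_steps and DRY_RUN are hypothetical names used for illustration.
# Assumption: the file system is already unmounted on every node.
run_fsck_steps() {
    local device="$1"                      # e.g. /dev/your/device
    local fsck_dir="${2:-$HOME/Download}"  # where the new fsck.gfs2 was saved
    local meta="$HOME/519721.savemeta"
    local output="/tmp/fsck.gfs2.output"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run.
        echo "gfs2_edit savemeta $device $meta"
        echo "$fsck_dir/fsck.gfs2 -y $device &> $output"
        return 0
    fi

    # Save the metadata first; do not run fsck if that fails.
    gfs2_edit savemeta "$device" "$meta" || return 1
    # Run the new fsck.gfs2 and capture stdout and stderr for the bugzilla.
    # The exit status of fsck.gfs2 is passed back to the caller.
    "$fsck_dir/fsck.gfs2" -y "$device" &> "$output"
}

# Usage (commented out so that sourcing this file has no side effects):
#   DRY_RUN=1 run_fsck_steps /dev/your/device
#   run_fsck_steps /dev/your/device
```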