From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
File systems get corrupted after a large amount of I/O on a GFS file system. The problem seems to be related to the file system locking mechanism: the nodes appear to get out of sync on their shared access to the common GFS file system. One node appears to be writing data while the other node does not recognize that data has been written. When a node sees that the data on disk is different from what it expects, it inhibits access to the file system. At this point, the file system is corrupted. We have seen conditions where the corruption has occurred but one of the nodes is still allowed to perform I/O, furthering the corruption.

Here are some messages from the /var/log/messages file on both systems of the cluster:

Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: fatal: invalid metadata block
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: bh = 764919 (magic)
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: function = gfs_get_meta_buffer
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: file = /usr/src/build/615130-x86_64/BUILD/gfs-kernel-2.6.9-42/smp/src/gfs/dio.c, line = 1223
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: time = 1133992943
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: about to withdraw from the cluster
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: waiting for outstanding I/O
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: telling LM to withdraw
Dec 7 17:02:26 nsx-1-3 kernel: lock_dlm: withdraw abandoned memory
Dec 7 17:02:26 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: withdrawn

-----------------------------------------
Here are some of the messages on the system that is performing the writes:

Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Trying to acquire journal lock...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Looking at journal...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Acquiring the transaction lock...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Replaying journal...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Replayed 244 of 6780 blocks
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: replays = 244, skips = 6388, sames = 148
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Journal replayed in 1s
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Done
Dec 7 17:05:01 nsx-1-2 crond(pam_unix)[31577]: session opened for user root by (uid=0)
Dec 7 17:05:01 nsx-1-2 crond(pam_unix)[31577]: session closed for user root
Dec 7 17:05:57 nsx-1-2 kernel: GFS: Trying to join cluster "lock_dlm", "nsx_cluster1:gfs3"
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: Joined cluster. Now mounting FS...
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=0: Trying to acquire journal lock...
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=0: Looking at journal...
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=0: Done

(These last three lines are repeated many times.)

I noticed some of the same error messages in bug ID #173951, which was submitted by a Red Hat employee a few weeks ago. There have been no responses to that bug.
Version-Release number of selected component (if applicable):
GFS-kernel-smp-2.6.9-42.1

How reproducible:
Always

Steps to Reproduce:
1. Have both nodes produce a lot of I/O on the GFS file system at the same time.

Actual Results:
Error messages are in the Description field above.

Additional info:
This does appear to be a dup of bz 173951; however, I'll let a GFS developer make that call. I'll also add a note in bz 173951 that this issue has been seen outside of Red Hat.
The problem can be replicated by doing the following (a rough sketch of this workload is shown below):
- On node-1, I spawned 8 processes that ran in an infinite "for loop" doing no file system I/O. This was to put a load on the system (we have 8 cores on the system).
- On node-2, I ran a program to continuously create large files on the shared GFS file system. The files were an estimated 15 MB each.
- On node-1, I ran commands such as ls, df, and wc to get information about the files being created by node-2.
- Eventually, I encountered the I/O errors on node-1.
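A rough sketch of that workload, under the assumption that /nsx_direct/test is the shared directory (the programs actually used are not attached here, so treat this only as an approximation):

# Node 1: 8 CPU-burning loops, one per core, doing no file system I/O.
for i in $(seq 1 8); do
    ( while true; do :; done ) &
done

# Node 2: continuously create files of roughly 15 MB on the shared GFS mount.
n=0
while true; do
    dd if=/dev/zero of=/nsx_direct/test/file.$n bs=1M count=15 2>/dev/null
    n=$((n + 1))
done

# Node 1: periodically inspect what node 2 is writing.
while true; do
    ls -l /nsx_direct/test
    df -k /nsx_direct
    wc -l /nsx_direct/test/*
    sleep 5
done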
Please describe your hardware and storage components. Also, provide the message logs and the cluster.conf file.
Created attachment 122291 [details] Messages files and command output
I passed along the messages files from both systems along with the dmesg and modinfo output. There is no cluster.conf file. These systems are not clustered; they are just using the GFS piece.

The hardware configuration is:
2 Sun Fire V40z servers
Emulex LP10000DC-S Fibre Channel cards
Direct connection to a Sun StorEdge 9985 (OEM from Hitachi)

By the way, here is also a copy of the /proc/scsi/scsi file:

bash-2.05$ more scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: LSILOGIC Model: 1030 IM IM       Rev: 1000
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: SDR      Model: GEM318P          Rev: 1
  Type:   Processor                        ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 01
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 02
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 03
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 04
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03

----------------------------------------------------------
A copy of the /etc/fstab file:

# This file is edited by fstab-sync - see 'man fstab-sync' for details
LABEL=/1          /               ext3     defaults        1 1
none              /dev/pts        devpts   gid=5,mode=620  0 0
none              /dev/shm        tmpfs    defaults        0 0
none              /proc           proc     defaults        0 0
none              /sys            sysfs    defaults        0 0
LABEL=SWAP-sda2   swap            swap     defaults        0 0
/dev/sdc1         /nsx_direct     gfs      noauto          0 0
/dev/hdc          /media/cdrom    auto     pamconsole,exec,noauto,managed 0 0
/dev/fd0          /media/floppy   auto     pamconsole,exec,noauto,managed 0 0

--------------------------------------
And finally, the output of fdisk -l for /dev/sdc:

Disk /dev/sdc: 142.9 GB, 142992211968 bytes
255 heads, 63 sectors/track, 17384 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       17384   139636948+  83  Linux
GFS requires that the cluster components be installed and running in order to maintain data consistency between multiple nodes. If you didn't specify lock_dlm or lock_gulm as the lock protocol at mkfs time, you will not be able to use GFS in a multiple-node environment.
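For reference, a minimal sketch of creating and mounting a GFS file system with a cluster-aware lock protocol; the cluster name, file system name, journal count, device, and mount point below are examples only:

# One journal per node that will mount the file system; lock table is
# <clustername>:<fsname> and must match cluster.conf.
gfs_mkfs -p lock_dlm -t nsx_cluster1:gfs3 -j 2 /dev/sdc1
mount -t gfs /dev/sdc1 /nsx_direct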
OK - I got the cluster.conf file. I'll see if the file systems were created with one of the options mentioned.

<?xml version="1.0" ?>
<cluster config_version="3" name="nsx_cluster1">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
        <clusternodes>
                <clusternode name="nsx-1-2" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="nsx-1-3" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="man_fence"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
The message files do not have entries for the time when the errors were occurring; they stop on Dec 5. Also, from your initial entries, you are using lock_dlm. Can you provide the full message logs, and is there any further information in them beyond what you reported initially?
Created attachment 122303 [details] Messages files

Here are messages files that go back a little further. Sorry about that.
The "-p lock_dlm" option was used to create the file system in the current environment.. We also experienced the corruption problem with the lock_gulm locking protocol.
I've been trying to recreate this bug using the method described in Comment #2, with no success. Can you give me more information on how you can reproduce this bug? Specifically, can you tell me what program you used to create the files, and if it's simply a test program, can you attach it to the bugzilla? Also how long does it usually take to reproduce? Are you entering the ls, wc, df, etc. commands by hand, or do you have a script for that? If there is a script, can you attach that too?
*** Bug 173951 has been marked as a duplicate of this bug. ***
*** Bug 144110 has been marked as a duplicate of this bug. ***
I was able to reproduce the problem with my own scripts. Attached is a summary of the steps along with copies of the scripts, diagnostic data, and script(1) output of my terminal sessions while recreating the problem.

1. Create a new GFS file system (gfs_mkfs -j4 -p lock_dlm -t nsx_cluster1:gfs2 /dev/sdc1).
2. Mount the GFS file system from both systems (nsx-1-2, nsx-1-3) on /nsx_direct.
3. Extract a tar file consisting of customer data to the shared mount point. Note that this data is not used (executed/read/written) other than for a directory listing (ls -lR) later during the problem recreation. I would expect that you could create any large directory of data in the shared area to mimic this step. FYI, the customer data totals about 500 MB.
4. Create the /nsx_direct/test directory and create "sharedfile".
5. Run gfs_tool gettune /nsx_direct to collect GFS statistics.
6. Start running the file creation program on nsx-1-2. This consists of two scripts: x.sh and x.pl. I've let this program run for about 2 minutes before corruption occurs. Because the corruption is "silent", the problem may have occurred even earlier.
   [ Note: The "x.sh" shell script essentially runs indefinitely, calling the x.pl script (10) times per iteration, recreating (10) new files on the shared GFS file system, and waiting for completion before starting again. The script also appends the date to "sharedfile" and runs "gfs_tool stat" on "sharedfile" after each iteration. The "x.pl" perl script basically creates stdout data for the file creation in the x.sh script. ]
7. On the 2nd system, nsx-1-3, start the y.sh script. Basically this script tails "sharedfile"; runs "du -sk" and "df -k" on the common GFS directory, "wc -l" on the files in the shared directory, and "gfs_tool stat" on "sharedfile"; sleeps for 5 seconds; and repeats.
8. After about 2 minutes, I interrupted the y.sh script and ran an "ls -alR" command from the root of the GFS file system (/nsx_direct). This is when data corruption is first noticed.

(A rough sketch of the x.sh and y.sh loops appears below, after the file list.)

Here's a description of the files included in the attached 0104.data.tar.gz:

nsx2/nsx-1-2.gfstool.out - session information capturing "gfstool" commands and output leading up to the problem
nsx2/nsx-1-2.gfs_problems.out - session information creating the environment, outputs of scripts
nsx2/nsx-1-2.gfs_counters - output of "gfs_tool counters /nsx_direct -c"
nsx2/nsx-1-2.sharedfile.stat.xxx - output of "gfs_tool stat sharedfile"
nsx2/nsx-1-2.x.sh - shell script to create files and I/O on the GFS file system and collect gfs_tool stat data on "sharedfile"
nsx2/nsx-1-2.x.pl - perl script to create unique files on the GFS file system
nsx2/nsx-1-2.gettune - gettune stat data on "sharedfile"
nsx2/nsx-1-2.gfs_fsck.out - output of gfs_fsck after corruption; the file system was unmounted from both nodes
nsx3/nsx-1-3.script.out - session information capturing commands and output leading up to the problem. Notice the end of this file, when the "ls -alR" command is manually issued to show corruption of the file system.
nsx3/nsx-1-3.log.out - output of the y.sh script, including output of the ls, df, du, and gfs_tool commands
nsx3/y.sh - shell script to "read" I/O on the shared file system
nsx3/nsx-1-3.sharedfile.tail - "tail" output from the y.sh script

ATTACHMENT: 0104.data.tar.gz
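For readers without the attachment, here is a very rough sketch of what the x.sh and y.sh loops do as described in steps 6 and 7. The real scripts are in 0104.data.tar.gz; the paths and file names here are approximations only.

# x.sh (on nsx-1-2): recreate 10 files per iteration, then update and stat sharedfile.
while true; do
    for i in 1 2 3 4 5 6 7 8 9 10; do
        ./x.pl > /nsx_direct/test/file.$i &
    done
    wait
    date >> /nsx_direct/test/sharedfile
    gfs_tool stat /nsx_direct/test/sharedfile
done

# y.sh (on nsx-1-3): read-side loop.
while true; do
    tail /nsx_direct/test/sharedfile
    du -sk /nsx_direct/test
    df -k /nsx_direct
    wc -l /nsx_direct/test/*
    gfs_tool stat /nsx_direct/test/sharedfile
    sleep 5
done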
Created attachment 122838 [details] summary of the steps along with copies of the scripts, diagnostic data, as well as script(1) output
In nsx2/nsx-1-2.gfstool.out, a script named "gfs_stuff" is run. What does this do? Is it simply 'gfs_tool counters /nsx_direct -c'?
I was able to reproduce the problem in a much simpler manner:

1. On node 1, I continually untar a 500 MB file into a directory on the shared GFS file system.
2. On node 2, I continually run "ls -alR | wc -l" on the shared GFS file system.
3. Eventually, on node 1, the tar command errors as it tries to write data. On node 2, the ls -alR continues to run, suggesting no problems.
4. I am able to unmount and mount the GFS file system on node 1 without reported errors. However, running ls -alR indicates input/output errors.
5. On node 2, I unmount and remount the file system, and I/O errors are present.

So in this environment, all writes were performed on node 1 and only reads on node 2. Errors were first noticed on node 1, with errors eventually found on node 2 as well. (A minimal sketch of the two loops follows.)
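A minimal sketch of those two loops; the tarball name and target directory are examples, not the exact files used:

# Node 1: repeatedly untar ~500 MB of data onto the shared GFS file system.
while true; do
    rm -rf /nsx_direct/untar
    mkdir -p /nsx_direct/untar
    tar -xf /root/data.tar -C /nsx_direct/untar
done

# Node 2: continuously walk the same tree.
while true; do
    ls -alR /nsx_direct | wc -l
done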
--------From Comment #17--------------
In nsx2/nsx-1-2.gfstool.out, a script named "gfs_stuff" is run. What does this do? Is it simply 'gfs_tool counters /nsx_direct -c'?
--------------------------------------

That's right, just gathering gfs_tool data. The output should have been included in the tar file sent.
*** This bug has been marked as a duplicate of 176352 ***
Bug #176352 is protected. We'll never know whether or how this will be fixed.
Reopening so that there is a public-facing bug for this problem
We experienced this bug last night using rsync from cron:

Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: fatal: invalid metadata block
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: bh = 13241964 (magic)
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: function = buf_build_bhlist
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: file = fs/gfs/lops.c, line = 453
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: time = 1139526706
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: about to withdraw from the cluster
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: waiting for outstanding I/O
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: telling LM to withdraw
Feb 9 16:11:49 node01 kernel: lock_dlm: withdraw abandoned memory
Feb 9 16:11:49 node01 kernel: GFS: fsid=node:data2.1: withdrawn
The problem that this bugzilla was originally opened for turned out to be completely unrelated to GFS. After a certain amount of I/O to the block device, future reads stopped returning the correct data. To my knowledge, there were no errors reported by the Emulex driver being used, the SCSI layer, or the storage array. Once the device was in this state, the corruption could be noticed simply by dd'ing data to and from the block device and comparing it. Unloading and reloading the Emulex driver clears up the problem temporarily.

I will leave this bugzilla open because it's being used to track the occurrence of this message, which could possibly be the result of GFS-caused corruption in some cases. However, much of the initial data in this bugzilla, including the steps to reliably reproduce this bug, is not valid for any GFS bug.

I would also like to urge everyone who does see these messages to try to verify that the problem is not in the block layer, because in this case there were no error messages from below the filesystem layer to indicate that there were problems. This can be done by doing I/O directly to the block device after GFS has withdrawn, and verifying that what was written to the device can be correctly read back multiple times. DOING THIS WILL CORRUPT THE FILESYSTEM, FORCING YOU TO REMAKE IT AFTERWARDS! DON'T DO IT ON A PRODUCTION FILESYSTEM!
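A minimal sketch of that block-layer check; the device name and sizes below are examples, and again, this destroys the file system on the device:

# Write a known pattern directly to the block device, then read it back
# several times and compare. Run only on a device you are willing to re-mkfs.
dd if=/dev/urandom of=/tmp/pattern bs=1M count=64
dd if=/tmp/pattern of=/dev/sdc1 bs=1M count=64
sync
for i in 1 2 3; do
    dd if=/dev/sdc1 of=/tmp/readback.$i bs=1M count=64
    cmp /tmp/pattern /tmp/readback.$i || echo "MISMATCH on read $i"
done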