From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
File systems get corrupted after a large amount of I/O on a GFS file system. The problem seems to be related to the file system locking mechanism: the nodes appear to get out of sync on their shared access to the common GFS file system. One node appears to be writing data while the other node does not recognize that data has been written. When a node sees that the data on disk is different from what it expects, it inhibits access to the file system. At this point, the file system is corrupted. We have seen conditions where the corruption has occurred but one of the nodes is still allowed to perform I/O, furthering the corruption.

Here are some messages from the /var/log/messages file on both systems of the cluster:

Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: fatal: invalid metadata block
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: bh = 764919 (magic)
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: function = gfs_get_meta_buffer
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: file = /usr/src/build/615130-x86_64/BUILD/gfs-kernel-2.6.9-42/smp/src/gfs/dio.c, line = 1223
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: time = 1133992943
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: about to withdraw from the cluster
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: waiting for outstanding I/O
Dec 7 17:02:23 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: telling LM to withdraw
Dec 7 17:02:26 nsx-1-3 kernel: lock_dlm: withdraw abandoned memory
Dec 7 17:02:26 nsx-1-3 kernel: GFS: fsid=nsx_cluster1:gfs3.1: withdrawn

-----------------------------------------
Here are some of the messages on the system that is performing the writes:

Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Trying to acquire journal lock...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Looking at journal...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Acquiring the transaction lock...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Replaying journal...
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Replayed 244 of 6780 blocks
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: replays = 244, skips = 6388, sames = 148
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Journal replayed in 1s
Dec 7 17:02:11 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=1: Done
Dec 7 17:05:01 nsx-1-2 crond(pam_unix)[31577]: session opened for user root by (uid=0)
Dec 7 17:05:01 nsx-1-2 crond(pam_unix)[31577]: session closed for user root
Dec 7 17:05:57 nsx-1-2 kernel: GFS: Trying to join cluster "lock_dlm", "nsx_cluster1:gfs3"
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: Joined cluster. Now mounting FS...
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=0: Trying to acquire journal lock...
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=0: Looking at journal...
Dec 7 17:05:59 nsx-1-2 kernel: GFS: fsid=nsx_cluster1:gfs3.0: jid=0: Done

(These last three lines are repeated many times.)

I noticed some of the same error messages in bug ID #173951, which was submitted by a Red Hat employee a few weeks ago. There have been no responses to that bug.
Version-Release number of selected component (if applicable):
GFS-kernel-smp-2.6.9-42.1

How reproducible:
Always

Steps to Reproduce:
1. Have both nodes produce a lot of I/O on the GFS file system at the same time.

Actual Results:
Error messages are in the Description field above.

Additional info:
This does appear to be a dup of bz 173951; however, I'll let a GFS developer make that call. I'll also add a note in bz 173951 that this issue has been seen outside of Red Hat.
The problem can be replicated by doing the following (a rough sketch of this workload is shown below):
- On node-1, I spawned 8 processes that ran in an infinite "for loop" doing no file system I/O. This was to put a load on the system (we have 8 cores on the system).
- On node-2, I ran a program to continuously create large files on the shared GFS file system. The files were an estimated 15 MB each.
- On node-1, I ran commands such as ls, df, and wc to get information about the files being created by node-2.
- Eventually, I encountered the I/O errors on node-1.
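A rough sketch of that workload, under the assumption that /nsx_direct/test is the shared directory (the programs actually used are not attached here, so treat this only as an approximation):

# Node 1: 8 CPU-burning loops, one per core, doing no file system I/O.
for i in $(seq 1 8); do
    ( while true; do :; done ) &
done

# Node 2: continuously create files of roughly 15 MB on the shared GFS mount.
n=0
while true; do
    dd if=/dev/zero of=/nsx_direct/test/file.$n bs=1M count=15 2>/dev/null
    n=$((n + 1))
done

# Node 1: periodically inspect what node 2 is writing.
while true; do
    ls -l /nsx_direct/test
    df -k /nsx_direct
    wc -l /nsx_direct/test/*
    sleep 5
done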
Please describe your hardware and storage components. Also, provide the message logs and the cluster.conf file.
Created attachment 122291 [details] Messages files and command output
I passed along the messages files from both systems along with the dmesg and modinfo output. There is no cluster.conf file. These systems are not clustered; they are just using the GFS piece.

The hardware configuration is:
2 Sun Fire V40z servers
Emulex LP10000DC-S Fibre Channel cards
Direct connection to a Sun StorEdge 9985 (OEM from Hitachi)

By the way, here is also a copy of the /proc/scsi/scsi file:

bash-2.05$ more scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: LSILOGIC Model: 1030 IM IM       Rev: 1000
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: SDR      Model: GEM318P          Rev: 1
  Type:   Processor                        ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 01
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 02
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 03
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 04
  Vendor: HITACHI  Model: OPEN-V           Rev: 5004
  Type:   Direct-Access                    ANSI SCSI revision: 03

----------------------------------------------------------
A copy of the /etc/fstab file:

# This file is edited by fstab-sync - see 'man fstab-sync' for details
LABEL=/1          /               ext3     defaults        1 1
none              /dev/pts        devpts   gid=5,mode=620  0 0
none              /dev/shm        tmpfs    defaults        0 0
none              /proc           proc     defaults        0 0
none              /sys            sysfs    defaults        0 0
LABEL=SWAP-sda2   swap            swap     defaults        0 0
/dev/sdc1         /nsx_direct     gfs      noauto          0 0
/dev/hdc          /media/cdrom    auto     pamconsole,exec,noauto,managed 0 0
/dev/fd0          /media/floppy   auto     pamconsole,exec,noauto,managed 0 0

--------------------------------------
And finally, the output of fdisk -l for /dev/sdc:

Disk /dev/sdc: 142.9 GB, 142992211968 bytes
255 heads, 63 sectors/track, 17384 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       17384   139636948+  83  Linux
GFS requires that the cluster components be installed and running in order to maintain data consistency between multiple nodes. If you didn't specify lock_dlm or lock_gulm as the lock protocol at mkfs time, you will not be able to use GFS in a multiple-node environment.
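For reference, a minimal sketch of creating and mounting a GFS file system with a cluster-aware lock protocol; the cluster name, file system name, journal count, device, and mount point below are examples only:

# One journal per node that will mount the file system; lock table is
# <clustername>:<fsname> and must match cluster.conf.
gfs_mkfs -p lock_dlm -t nsx_cluster1:gfs3 -j 2 /dev/sdc1
mount -t gfs /dev/sdc1 /nsx_direct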
OK - I got the cluster.conf file. I'll see if the file systems were created with one of the options mentioned.

<?xml version="1.0" ?>
<cluster config_version="3" name="nsx_cluster1">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
        <clusternodes>
                <clusternode name="nsx-1-2" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="nsx-1-3" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="man_fence"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
The message files do not have entries for the time when the errors were occurring; they stop on Dec 5. Also, from your initial entries, you are using lock_dlm. Can you provide the full message logs, and is there any further information in them beyond what you reported initially?
Created attachment 122303 [details] Messages files

Here are messages files that go back a little further. Sorry about that.
The "-p lock_dlm" option was used to create the file system in the current environment.. We also experienced the corruption problem with the lock_gulm locking protocol.
I've been trying to recreate this bug using the method described in Comment #2, with no success. Can you give me more information on how you can reproduce this bug? Specifically, can you tell me what program you used to create the files, and if it's simply a test program, can you attach it to the bugzilla? Also how long does it usually take to reproduce? Are you entering the ls, wc, df, etc. commands by hand, or do you have a script for that? If there is a script, can you attach that too?
*** Bug 173951 has been marked as a duplicate of this bug. ***
*** Bug 144110 has been marked as a duplicate of this bug. ***
I was able to reproduce the problem with my own scripts. Attached is a summary of the steps along with copies of the scripts, diagnostic data, and script(1) output of my terminal sessions while recreating the problem.

1. Create a new GFS file system (gfs_mkfs -j4 -p lock_dlm -t nsx_cluster1:gfs2 /dev/sdc1).
2. Mount the GFS file system from both systems (nsx-1-2, nsx-1-3) on /nsx_direct.
3. Extract a tar file consisting of customer data to the shared mount point. Note that this data is not used (executed/read/written) other than for a directory listing (ls -lR) later during the problem recreation. I would expect that you could create any large directory of data in the shared area to mimic this step. FYI, the customer data totals about 500 MB.
4. Create the /nsx_direct/test directory and create "sharedfile".
5. Run gfs_tool gettune /nsx_direct to collect GFS statistics.
6. Start running the file creation program on nsx-1-2. This consists of two scripts: x.sh and x.pl. I've let this program run for about 2 minutes before corruption occurs. Because the corruption is "silent", the problem may have occurred even earlier.
   [ Note: The "x.sh" shell script essentially runs indefinitely, calling the x.pl script (10) times per iteration, recreating (10) new files on the shared GFS file system, and waiting for completion before starting again. The script also appends the date to "sharedfile" and runs "gfs_tool stat" on "sharedfile" after each iteration. The "x.pl" perl script basically creates stdout data for the file creation in the x.sh script. ]
7. On the 2nd system, nsx-1-3, start the y.sh script. Basically this script tails "sharedfile"; runs "du -sk" and "df -k" on the common GFS directory, "wc -l" on the files in the shared directory, and "gfs_tool stat" on "sharedfile"; sleeps for 5 seconds; and repeats.
8. After about 2 minutes, I interrupted the y.sh script and ran an "ls -alR" command from the root of the GFS file system (/nsx_direct). This is when data corruption is first noticed.

(A rough sketch of the x.sh and y.sh loops appears below, after the file list.)

Here's a description of the files included in the attached 0104.data.tar.gz:

nsx2/nsx-1-2.gfstool.out - session information capturing "gfstool" commands and output leading up to the problem
nsx2/nsx-1-2.gfs_problems.out - session information creating the environment, outputs of scripts
nsx2/nsx-1-2.gfs_counters - output of "gfs_tool counters /nsx_direct -c"
nsx2/nsx-1-2.sharedfile.stat.xxx - output of "gfs_tool stat sharedfile"
nsx2/nsx-1-2.x.sh - shell script to create files and I/O on the GFS file system and collect gfs_tool stat data on "sharedfile"
nsx2/nsx-1-2.x.pl - perl script to create unique files on the GFS file system
nsx2/nsx-1-2.gettune - gettune stat data on "sharedfile"
nsx2/nsx-1-2.gfs_fsck.out - output of gfs_fsck after corruption; the file system was unmounted from both nodes
nsx3/nsx-1-3.script.out - session information capturing commands and output leading up to the problem. Notice the end of this file, when the "ls -alR" command is manually issued to show corruption of the file system.
nsx3/nsx-1-3.log.out - output of the y.sh script, including output of the ls, df, du, and gfs_tool commands
nsx3/y.sh - shell script to "read" I/O on the shared file system
nsx3/nsx-1-3.sharedfile.tail - "tail" output from the y.sh script

ATTACHMENT: 0104.data.tar.gz
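For readers without the attachment, here is a very rough sketch of what the x.sh and y.sh loops do as described in steps 6 and 7. The real scripts are in 0104.data.tar.gz; the paths and file names here are approximations only.

# x.sh (on nsx-1-2): recreate 10 files per iteration, then update and stat sharedfile.
while true; do
    for i in 1 2 3 4 5 6 7 8 9 10; do
        ./x.pl > /nsx_direct/test/file.$i &
    done
    wait
    date >> /nsx_direct/test/sharedfile
    gfs_tool stat /nsx_direct/test/sharedfile
done

# y.sh (on nsx-1-3): read-side loop.
while true; do
    tail /nsx_direct/test/sharedfile
    du -sk /nsx_direct/test
    df -k /nsx_direct
    wc -l /nsx_direct/test/*
    gfs_tool stat /nsx_direct/test/sharedfile
    sleep 5
done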
Created attachment 122838 [details] summary of the steps along with copies of the scripts, diagnostic data, as well as script(1) output
In nsx2/nsx-1-2.gfstool.out, a script named "gfs_stuff" is run. What does this do? Is it simply 'gfs_tool counters /nsx_direct -c'?
I was able to reproduce the problem in a much simpler manner:

1. On node 1, I continually untar a 500 MB file into a directory on the shared GFS file system.
2. On node 2, I continually run "ls -alR | wc -l" on the shared GFS file system.
3. Eventually, on node 1, the tar command errors as it tries to write data. On node 2, the ls -alR continues to run, suggesting no problems.
4. I am able to unmount and mount the GFS file system on node 1 without reported errors. However, running ls -alR indicates input/output errors.
5. On node 2, I unmount and remount the file system, and I/O errors are present.

So in this environment, all writes were performed on node 1 and only reads on node 2. Errors were first noticed on node 1, with errors eventually found on node 2 as well. (A minimal sketch of the two loops follows.)
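A minimal sketch of those two loops; the tarball name and target directory are examples, not the exact files used:

# Node 1: repeatedly untar ~500 MB of data onto the shared GFS file system.
while true; do
    rm -rf /nsx_direct/untar
    mkdir -p /nsx_direct/untar
    tar -xf /root/data.tar -C /nsx_direct/untar
done

# Node 2: continuously walk the same tree.
while true; do
    ls -alR /nsx_direct | wc -l
done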
--------From Comment #17--------------
In nsx2/nsx-1-2.gfstool.out, a script named "gfs_stuff" is run. What does this do? Is it simply 'gfs_tool counters /nsx_direct -c'?
--------------------------------------

That's right, just gathering gfs_tool data. The output should have been included in the tar file sent.
*** This bug has been marked as a duplicate of 176352 ***
Bug #176352 is protected. We'll never know whether or how this will be fixed.
Reopening so that there is a public-facing bug for this problem
We experienced this bug last night using rsync from cron:

Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: fatal: invalid metadata block
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: bh = 13241964 (magic)
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: function = buf_build_bhlist
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: file = fs/gfs/lops.c, line = 453
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: time = 1139526706
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: about to withdraw from the cluster
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: waiting for outstanding I/O
Feb 9 16:11:46 node01 kernel: GFS: fsid=node:data2.1: telling LM to withdraw
Feb 9 16:11:49 node01 kernel: lock_dlm: withdraw abandoned memory
Feb 9 16:11:49 node01 kernel: GFS: fsid=node:data2.1: withdrawn
The problem that this bugzilla was originally opened for turned out to be completely unrelated to GFS. After a certain amount of I/O to the block device, future reads stopped returning the correct data. To my knowledge, there were no errors reported by the Emulex driver being used, the SCSI layer, or the storage array. Once the device was in this state, the corruption could be noticed simply by dd'ing data to and from the block device and comparing it. Unloading and reloading the Emulex driver clears up the problem temporarily.

I will leave this bugzilla open because it's being used to track the occurrence of this message, which could possibly be the result of GFS-caused corruption in some cases. However, much of the initial data in this bugzilla, including the steps to reliably reproduce this bug, is not valid for any GFS bug.

I would also like to urge everyone who does see these messages to try to verify that the problem is not in the block layer, because in this case there were no error messages from below the filesystem layer to indicate that there were problems. This can be done by doing I/O directly to the block device after GFS has withdrawn, and verifying that what was written to the device can be correctly read back multiple times. DOING THIS WILL CORRUPT THE FILESYSTEM, FORCING YOU TO REMAKE IT AFTERWARDS! DON'T DO IT ON A PRODUCTION FILESYSTEM!
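A minimal sketch of that block-layer check; the device name and sizes below are examples, and again, this destroys the file system on the device:

# Write a known pattern directly to the block device, then read it back
# several times and compare. Run only on a device you are willing to re-mkfs.
dd if=/dev/urandom of=/tmp/pattern bs=1M count=64
dd if=/tmp/pattern of=/dev/sdc1 bs=1M count=64
sync
for i in 1 2 3; do
    dd if=/dev/sdc1 of=/tmp/readback.$i bs=1M count=64
    cmp /tmp/pattern /tmp/readback.$i || echo "MISMATCH on read $i"
done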