Bug 455696
Summary: NFS over GFS issue (fatal: assertion "!bd->bd_pinned && !buffer_busy(bh)" failed)
Product: [Retired] Red Hat Cluster Suite
Component: GFS-kernel
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Mark Hlawatschek <hlawatschek>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: adas, ben.yarwood, bernhard.furtmueller, bkahn, bmarzins, bstevens, cfeist, edamato, grimme, hklein, jplans, mchristi, merz, michael.hagmann, pcfe, ra, rdassen, revers, rpeterso, rrottmann, swhiteho, tao
Target Milestone: rc
Keywords: Reopened, ZStream
Doc Type: Bug Fix
Clones: 515267 (view as bug list)
Bug Blocks: 491369, 491928, 494835, 515267, 674403
Last Closed: 2009-08-03 15:01:36 UTC
Description
Mark Hlawatschek
2008-07-17 08:01:05 UTC
Reiner created a nice C program (gen-data.c) to create files with predefined sizes and distribution. BTW, we are using the following parameters during our tests:

./gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfstest/dir{1,2}

Created attachment 312021 [details]
stress test tool
*** Bug 445000 has been marked as a duplicate of this bug. ***

Created attachment 317629 [details]
Proposed patch

Here is the patch I'm currently testing.

The test ran for two hours and six runs, but it finally failed. The patch definitely makes the failure harder to recreate, so I think I'm on the right track at least.

Created attachment 319176 [details]
Latest patch
This patch fixes a bunch of bugs I've found. It turns out that
either NFS or the VFS can, and occasionally does, set the buffer's
dirty bit while we're in the middle of function ail_empty_gl. I
took GREAT pains to make sure the buffer head's dirty flag was
not twiddled by some other aberrant path through the gfs code.
Therefore, I changed the assert from checking "buffer_busy" to
checking only "buffer_locked." However, if the other timing
windows/bugs I found are not closed/fixed, the dirty buffer is not
always processed properly by the subsequent code, and therefore the
file system may still be corrupted and may take an "invalid
metadata block" assert later.
With this patch, I've been able to run the failing scenario
six (going on seven) times with no failures.
The code changes in this patch are as follows:
1. In gfs_aspace_releasepage, I set bd->bd_bh to NULL so that the
bh pointer can never be reused by accident. I also moved the
zeroing of bh->b_private up before the storage is released
with kmem_cache_free. That closes a timing window whereby
another process can sneak in and reuse the bd for a different
bh, and we would nuke the pointer. I also changed it to its
simpler form to get rid of a compiler warning.
2. Function ail_empty_gl's assert was changed due to NFS/VFS
re-setting the buffer head's Dirty flag at a critical time,
as stated above.
3. I changed function gfs_attach_bufdata so that it returns the
pointer to the gfs_bufdata it finds. This comes in useful
later.
4. I moved the decrementing of bd_pinned to the bottom of function
gfs_dunpin as stated in an earlier comment. This prevents two
processes from racing to get into the code that marks the
buffer dirty, interfering with one another.
5. Function dir_split_leaf was writing into an inode buffer
without adding it to the transaction. That can cause the
inode buffer to not be journaled correctly, and the buffer
to therefore not be accounted for correctly in its attempt
to flush metadata buffers before writing them to their final
in-place location.
6. Function gfs_trans_add_bh was checking to see if a buffer
head already had a buffer descriptor attached, but it was
checking the pointer without the page locked. That created a
timing window whereby a process could reuse a bd when it
should not. That's why I needed to change gfs_attach_bufdata
to protect this value.
Created attachment 319283 [details]
Better patch
I was able to break the previous patch on about the ninth run. It
broke with another Invalid Metadata assert in the directory code.
Digging around, I discovered three more places in dir.c that modify
a dinode buffer without adding the buffer to the transaction. That
may account for the problem, so this patch adds the necessary calls to
gfs_trans_add_bh. I've run this version almost as long as the previous
patch (several hours) without hitting any problems. Still, testing
continues.
Created attachment 319380 [details]
Oct 3 patch
Unfortunately, I did encounter a problem with the 4th generation
patch, but only after about ten runs of the failing scenario.
When it failed, it gave me an "invalid metadata block" assert.
That may be due to the circumstances of my file system, but it's
hard to tell. I'd like to zero out my test file system completely
with dd from /dev/zero, then mkfs, then restart.
This patch is the same as the previous patch, except I fixed one thing.
I discovered that when I got the above "invalid metadata block" assert,
the system got into an infinite loop and eventually gave the
message "stuck in gfs_releasepage()." This patch fixes that loop.
So this patch is a whole lot better than the original code, and fixes a
lot of bugs. However, there is still the concern over the invalid
metadata block message. In this case, the bad block looked like GFS
metadata, but the block type was zero. As far as I can tell, that can
only happen through a call to function clump_alloc. Everywhere else
that GFS initializes a buffer, it sets both the magic number and the
block type.
Last note: this problem may very well not occur if the default block size of 4K is used.

I ran the test for 10 iterations with blocksize=1024 after zeroing out the device with a dd from /dev/zero followed by a mkfs. The 10 iterations ran to completion without asserting, but I did see the "invalid metadata block" assert after the filesystem was idle for some time (a couple of hours or so). I ran the test for 20 iterations with the default block size (4096). The test ran through 20 iterations without a failure. The filesystem has been idle for a couple of hours now and I don't see any asserts on the console.

This is the assert I'm seeing. On shell-01:

nfs_update_inode: inode number mismatch expected (0:14/0x28592fa), got (0:14/0x0)
nfs_update_inode: inode number mismatch expected (0:14/0x87), got (0:14/0xffffffffa02f5488)
nfs_update_inode: inode number mismatch expected (0:14/0x87), got (0:14/0x10214c42000)
[the line above repeats many more times]
nfs_update_inode: inode number mismatch expected (0:14/0x87), got (0:14/0x1021b604000)
GFS: fsid=shell:gfs.0: jid=1: Trying to acquire journal lock...
GFS: fsid=shell:gfs.0: jid=1: Looking at journal...
GFS: fsid=shell:gfs.0: jid=1: Acquiring the transaction lock...
GFS: fsid=shell:gfs.0: jid=1: Replaying journal...
GFS: fsid=shell:gfs.0: jid=1: Replayed 36780 of 37250 blocks
GFS: fsid=shell:gfs.0: jid=1: replays = 36780, skips = 224, sames = 246
GFS: fsid=shell:gfs.0: jid=1: Journal replayed in 6s
GFS: fsid=shell:gfs.0: jid=1: Done
nfs_statfs: statfs error = 116

On shell-02:

GFS: fsid=shell:gfs.1: fatal: invalid metadata block
GFS: fsid=shell:gfs.1: bh = 3470969 (type: exp=6, found=0)
GFS: fsid=shell:gfs.1: function = get_leaf
GFS: fsid=shell:gfs.1: file = /home/bob/cluster/gfs-kernel/src/gfs/dir.c, lin8
GFS: fsid=shell:gfs.1: time = 1223601261
GFS: fsid=shell:gfs.1: about to withdraw from the cluster
GFS: fsid=shell:gfs.1: waiting for outstanding I/O
GFS: fsid=shell:gfs.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=shell:gfs.1: withdrawn

Comment #16: Did this assertion result from testing Robert's latest patch?

Yes. It took about 10 iterations of the test.

I've been trying to narrow down the aforementioned "invalid metadata block" problem with my latest patch. The biggest problem is that it takes a very long time to recreate the failure. First, I used instrumentation to determine that the data corruption is not related to the directory leaf manipulation code. At least it's very unlikely. It appears to be a corrupt block in the file system. In all the cases of corruption, the block type is 0, which is only used in function "clump_alloc". Until today, the error was always detected when traversing the directory leaf blocks resulting from the find command. So I've been focusing my efforts on searching the directory code for problems.
However, today, I was able to recreate a failure with a different block type:

GFS: fsid=shell:gfs.0: fatal: invalid metadata block
GFS: fsid=shell:gfs.0: bh = 35512638 (type: exp=7, found=0)
GFS: fsid=shell:gfs.0: function = gfs_get_data_buffer
GFS: fsid=shell:gfs.0: file = /home/bob/cluster/gfs-kernel/src/gfs/dio.c, line = 1280

This was basically checking for a journaled block, type GFS_METATYPE_JD, but once again, it found type 0, which is from clump_alloc. Note that in previous cases, there were not multiple clump blocks near the corrupt block. So it looks like the data may have been written into the wrong block or something. In previous cases, the block looked to be all zeros except for the GFS metadata headers. In one case it was not all zeros. In fact, I couldn't identify what the block was. It didn't look like any GFS data structure. Also, it didn't look like directory data, nor the single-byte data the test program writes to files. It looked like weird database stuff. Also note that the corruption apparently happened a "long time" prior to its being detected by the find. In other words, there are absolutely no references to the corrupt block in any of the journals.

I tried increasing the journal size to 2GB, which makes the mkfs take a very long time. Unfortunately, I was not able to recreate the failure this way. But I also didn't try for more than a few hours. Lately I've been doing some experiments to try to get the recreation time down to something more reasonable. First, I tried reducing the block size from 1K to 512 bytes. No effect. Next, I tried increasing the block size to 2K. No effect. Thanks to Abhi's experiment of zeroing the device before mkfs, it seems more likely that the problem has something to do with the clump_alloc code. Since the "clump_alloc" code that deals with these zero-type blocks kicks in when the code runs out of blocks in a given RG, I decided to try changing the RG size from its default to 32MB.

That did seem to make the problem occur quicker. The problem recreated on the first run of the scenario rather than the second or third. However, I've only done that once. The nearby blocks looked like other clump blocks (unlike previous recreations) except for the third block on the same page, which looked like a directory leaf. At least this time there weren't any inodes on the same physical page. Once again, data surrounding the corrupt clump block was all zeros, but the block on the page boundary was also a clump block, and it looked almost like a byte-shifted part of an inode:

000000087784F000 01161970 00000000 00000000 00000000 [...p............]
000000087784F010 00000000 00000000 00000000 00000000 [................]
000000087784F020 00000000 00000001 00000001 000081A4 [................]
000000087784F030 00000001 00000000 00000000 00000000 [................]
000000087784F040 00000101 00000000 00000400 00000000 [................]
000000087784F050 00000000 00000000 00000001 00000000 [................]
000000087784F060 021DE040 490623F0 00000000 490623F0 [...@I.#.....I.#.]
000000087784F070 00000000 490623F0 00000000 00000001 [....I.#.........]
000000087784F080 00000000 49062056 00096B5F 00000000 [....I. V..k_....]

(with the rest all 0x00). The value 021DE040 at offset 0x60 is a GFS disk block number of a disk inode, and that inode contains the value 490623F0 as its time stamps. So I'm continuing my analysis, focusing on the clump_alloc code.

Created attachment 321732 [details]
Oct 28 patch
This version of the patch seemingly works better. I still have some
investigation to do, so this is likely to be revised again. But so
far I've seen no failures with it. This is good news, considering I
found ways to increase my chances of hitting the problem (more later).
First, let it be known that I'm focusing my efforts on the "Invalid Metadata Block" problem described earlier, which only appears with my newer patches, for example the Oct 03 patch. I still believe that patch is beneficial and allows the test to get much farther, but the user's tests are uncovering this second bug in the code, which I'm now pursuing.

I've done several things today. I figured out that if I maximize the number of RGs and minimize the size of the journals, I'm much more likely to hit the Invalid Metadata Block error with the Oct 03 patch. My best chances seem to be when I use this mkfs command:

mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -t shell:gfs /dev/shell-vg/lv01

After this mkfs command, I've been able to recreate the failure on almost every run of the test program, although sometimes it takes up to ten to fifteen minutes. That's still much better than yesterday, when it was taking up to three hours to recreate.

I verified that the "Invalid Metadata" problem always complains about a block written by function clump_alloc. In other words, I changed the clump_alloc block type to 0x77, and when the failure occurred, it reported the wrong block as having type 0x77. I've also determined that in the failing case, node 1 is trying to write the block complained about, and the "find" command on node 2 is trying to read the block, then complaining. I've gone through all the code that uses the clump_alloc blocks and verified that the logic looks correct. There aren't any blatant, glaring problems there.

Lastly, I've created the previously posted patch. This patch is the same as the October 3 patch, but it adds some calls to "mark_buffer_dirty", which hopefully causes any blocks fetched from the previously allocated "clumps" to be written to disk. This seems to have a positive effect on the code. This patch seems to make the failure much less likely.
Unfortunately, I did receive another failure with that patch, but only after several runs of the failing scenario. I believe that another carefully placed call to mark_buffer_dirty would eliminate this failing case as well. I need to discuss this with Steve Whitehouse in the morning, because it's likely that "mark_buffer_dirty" is circumventing the problem rather than solving it.

I removed the "mark_buffer_dirty" calls because they were not solving the problem, they were only closing the window a little bit. In other words, I was able to recreate the problem with the calls there, so it's better not to mask the symptom at this point in time. Next, I added more instrumentation to the code to try to narrow down what's happening some more. A big concern of mine, given the previous symptoms, is that we might be hitting a scenario like this:

1. The RG runs out of free metadata blocks, so GFS goes to set aside a clump of free data blocks as free metadata blocks.
2. Function clump_alloc changes the bitmap settings for a clump of blocks from FREE to FREEMETA. Then another process gets control.
3. The other process wants to allocate a metadata block, so it scans the bitmap for FREEMETA and finds one of the blocks being manipulated by clump_alloc. The process writes its metadata header into the buffer associated with that block.
4. Function clump_alloc gains control again and requests that the block be overwritten with the dummy header that keeps popping up in our errors. This overwrites the good data with what we're seeing.

So first I verified that the proper rgrp glocks are in place in the clump_alloc code. Next, I verified that the meta header cache is not being checked at the same time a new clump is being added to it. Unfortunately, every time I change the instrumentation, I change the symptoms. There now seems to be some cluster coherency problem.
With the smaller mkfs, and if the timing is just right, I am able to recreate this error:

GFS: fsid=shell:gfs.0: fatal: invalid metadata block
GFS: fsid=shell:gfs.0: bh = 34933213 (magic)
GFS: fsid=shell:gfs.0: function = get_leaf
GFS: fsid=shell:gfs.0: file = /home/bob/cluster/gfs-kernel/src/gfs/dir.c, line = 442

This is a completely different symptom. The corrupt block does not look like a metadata cache-allocated "clump" block in this case. With my latest instrumentation, I haven't been seeing that problem lately. So get_leaf discovered a block that should be a directory leaf, but it sees a block that is not even GFS metadata (it seems to be just data). With instrumentation, I verified that the block was NOT fetched from the meta header cache. I also verified that the corrupt block was actually read with ll_rw_block(READ, 1, &bh) in function gfs_dreread, and the wait_on_buffer(bh) function must have been called.

None of these symptoms seem to occur if I make the journal big. But without a fair amount of data in the journal, it's tough to tell the history of the blocks that are being reported as corrupt. I need to do two things: (1) make sure the scenario I gave above is not actually happening, and (2) examine the journal log code for problems relating to cluster coherency. I may need to recreate my glock history patch to solve this, but the timing is so sensitive that I highly doubt the problems will reoccur with that patch in place. So I can only move forward on this problem an inch or two at a time. I may need to do more research into making it recreate faster, such as increasing or decreasing the meta header cache size and such.

Well, I verified that the scenario above is not happening. I added instrumentation to detect whether any process is writing a buffer that's corrupt. It never triggered, but somehow the buffer was read back corrupt anyway.
That means that when GFS tried to write a directory leaf block (METATYPE_LF) or a journaled data block (METATYPE_JD), the buffer was correct when it was submitted for IO. I also verified that when the problem is detected, the buffer was NOT previously used to write a (1) leaf block, (2) journaled data block, or (3) metadata clump, nor was the buffer fetched from the meta header cache.

I spent a fair amount of time trying to get the problem to recreate faster with a bigger journal size so I can get a history of the corrupt block. I was not successful. One thing I tried was running four simultaneous copies of the gen-data program on each server. No luck.

At one point, the "METATYPE_JD" error triggered on, of all things, the quota file's inode. In other words, GFS claimed that the quota file's inode was not a metadata block when, in fact, it was. This made me take a step back and question whether the problem is somewhere in the vfs rather than GFS. The kernel I was running was quite old, so I upgraded my kernel to 2.6.9-78-16, which is for pre-release 4.7. Then I had to recompile the cluster suite for the new kernel, and I had several issues with that. Finally I got things working again, but I hit the problem again regardless.

I also verified the problem is not specific to the special code that we use to process nfs directory requests. The error happens regardless of whether the nfs or normal readdir is used. Whatever is going wrong is happening at a very low level. Observe this set of events:

[root@shell-02 ../gfs-kernel/src/gfs]# mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -X -t shell:gfs /dev/shell-vg/lv01
Expert mode: on
Device: /dev/shell-vg/lv01
Blocksize: 1024
Filesystem Size: 104730448
Journals: 3
Resource Groups: 3198
Locking Protocol: lock_dlm
Lock Table: shell:gfs
Syncing...
All Done
[root@shell-02 ../gfs-kernel/src/gfs]# sync
[root@shell-02 ../gfs-kernel/src/gfs]# mount -t gfs /dev/shell-vg/lv01 /mnt/gfs/
mount: /dev/shell-vg/lv01: can't read superblock
[root@shell-02 ../gfs-kernel/src/gfs]# dmesg | tail -8
GFS: Trying to join cluster "lock_dlm", "shell:gfs"
GFS: fsid=shell:gfs.0: Joined cluster. Now mounting FS...
GFS: fsid=shell:gfs.0: fatal: invalid metadata block
GFS: fsid=shell:gfs.0: bh = 74 (magic)
GFS: fsid=shell:gfs.0: function = gfs_get_meta_buffer
GFS: fsid=shell:gfs.0: file = /home/bob/cluster/gfs-kernel/src/gfs/dio.c, line = 1241
GFS: fsid=shell:gfs.0: time = 1226017322
GFS: fsid=shell:gfs.0: can't get journal index inode: -5
[root@shell-02 ../gfs-kernel/src/gfs]# gfs2_edit -x -p 74 /dev/shell-vg/lv01
Block #74 (0x4a) of 104857600 (0x6400000) (disk inode)
-------------------- Journal Index ------------------
00012800 01161970 00000004 00000000 00000000 [...p............]
00012810 00000190 00000000 00000000 0000004A [...............J]
00012820 00000000 0000004A 00000180 00000000 [.......J........]
00012830 00000000 00000001 00000000 000000F0 [................]
00012840 00000000 00000001 00000000 49138A0B [............I...]
00012850 00000000 49138A0B 00000000 49138A0B [....I.......I...]
00012860 00000000 00000000 00000000 00000000 [................]
00012870 00000000 00000000 00000000 00000000 [................]
00012880 00000001 000003E8 00010000 00000000 [................]
00012890 00000000 00000000 00000000 00000000 [................]
000128A0 00000000 00000000 00000000 00000000 [................]
000128B0 00000000 00000000 00000000 00000000 [................]
000128C0 00000000 00000000 00000000 00000000 [................]
000128D0 00000000 00000000 00000000 00000000 [................]
000128E0 00000000 00000000 00000000 031F4000 [..............@.]
(snip)
[root@shell-02 ../gfs-kernel/src/gfs]# mount -t gfs /dev/shell-vg/lv01 /mnt/gfs/
mount: /dev/shell-vg/lv01: can't read superblock

So the page on disk, and/or as seen from user space, actually DOES contain the magic number, but GFS still sees the wrong page. I would have thought that the gfs2_edit program would have caused the vfs to reload the page. So I can only conclude that something in memory is being corrupted, and once it's corrupted, it remains corrupt in the eyes of the GFS kernel module. Userland can still see the correct data, though.

Well, the symptom from the previous post was an anomaly caused by a problem with my instrumentation. Once I fixed the problem, it started behaving as I'd expect. However, it doesn't shed any light on the problem at hand. I've gone to great lengths to check that the proper read requests are being sent to the vfs layer, and that the buffer_heads are in the proper state. I've also made sure the radix tree is finding the proper page, the page is not locked, and the block sizes are not being corrupted. So it seems as if memory is being corrupted somehow. That theory is supported by the following scenario that occurred today. This instrumented gfs function produced the output that follows it:

void gfs_logbh_init(struct gfs_sbd *sdp, struct buffer_head *bh,
                    uint64_t blkno, char *data)
{
	memset(bh, 0, sizeof(struct buffer_head));
	bh->b_state = (1 << BH_Mapped) | (1 << BH_Uptodate) | (1 << BH_Lock);
	atomic_set(&bh->b_count, 1);
	set_bh_page(bh, virt_to_page(data), ((unsigned long)data) & (PAGE_SIZE - 1));
	if (blkno < sdp->sd_jindex[0].ji_addr ||
	    blkno > (sdp->sd_jindex[2].ji_addr +
		     (sdp->sd_jindex[2].ji_nsegment * 16))) {
		printk("Bob: Great. Another bug.\n");
		printk("blkno = 0x%llx\n", blkno);
		printk("j0=0x%llx\n", sdp->sd_jindex[0].ji_addr);
		printk("j2=0x%llx\n", sdp->sd_jindex[2].ji_addr);
		printk("j2s=0x%x\n", sdp->sd_jindex[2].ji_nsegment * 16);
		gfs_assert_warn(sdp, blkno >= sdp->sd_jindex[0].ji_addr &&
				blkno <= (sdp->sd_jindex[2].ji_addr +
					  (sdp->sd_jindex[2].ji_nsegment * 16)));
		dump_stack();
	}
	bh->b_blocknr = blkno;
	bh->b_size = sdp->sd_sb.sb_bsize;
	bh->b_bdev = sdp->sd_vfs->s_bdev;
	init_buffer(bh, logbh_end_io, NULL);
	INIT_LIST_HEAD(&bh->b_assoc_buffers);
}

Bob: Great. Another bug.
blkno = 0x3203701
j0=0x31f4000
j2=0x3204000
j2s=0x8000
Call Trace: <ffffffffa0371d8a>{:gfs:gfs_logbh_init+426}
<ffffffffa038727c>{:gfs:gfs_log_get_buf+230}
<ffffffffa03892f9>{:gfs:buf_build_bhlist+114}
<ffffffff8031838c>{wait_for_completion+231}
<ffffffff8031838c>{wait_for_completion+231}
<ffffffffa0387552>{:gfs:disk_commit+170}
<ffffffffa038a727>{:gfs:quota_trans_size+33}
<ffffffffa0387993>{:gfs:log_refund+111}
<ffffffffa0387eaa>{:gfs:log_flush_internal+510}
<ffffffffa0391b78>{:gfs:gfs_fsync+96}
<ffffffff8015b898>{__filemap_fdatawrite_range+95}
<ffffffffa02ebaa5>{:nfsd:nfsd_sync_dir+53}
<ffffffffa02ed7ab>{:nfsd:nfsd_create_v3+759}
<ffffffffa02f3acf>{:nfsd:nfsd3_proc_create+307}
<ffffffffa02e87dc>{:nfsd:nfsd_dispatch+219}
<ffffffffa01914ac>{:sunrpc:svc_process+1227}
<ffffffff801347cf>{default_wake_function+0}
<ffffffffa02e831b>{:nfsd:nfsd+0}
<ffffffffa02e8553>{:nfsd:nfsd+568}
<ffffffff8013351c>{schedule_tail+55}
<ffffffff80110fd3>{child_rip+8}
<ffffffffa02e831b>{:nfsd:nfsd+0}
<ffffffffa02e831b>{:nfsd:nfsd+0}
<ffffffff80110fcb>{child_rip+0}

But the logic is correct. The "if" test passed, flagging "blkno" as being outside the bounds of the journals, but the values that printed are, in fact, correct, and blkno is entirely within journal1, which is what I expect. The "assert" that follows is testing the same values, but it found them to be correct (with reverse logic).
The "blkno" parameter passed in to the function should not change inside the function. Therefore, either memory was corrupted and later "fixed", or else I'm having some kind of hardware problem; maybe my machine's memory or motherboard (or both) are flaky. Either that, or some kernel process is changing my memory.

Time for a lengthy update on this bug record. The last patch that I posted to this bugzilla record fixes the fatal assertion issue reported with this bugzilla. The patch allows GFS to get further, but I'm now experiencing three additional problems. The problems are all highly timing-dependent and hard to recreate. I've never seen any of them when I use the default block size of 4K on mkfs. Making the journals bigger (like the default of 128MB) makes the problems extremely rare, but not necessarily impossible. I'm using the minimum size of 32MB in order to recreate the problem. The problems do not appear to be related to the media: neither the SAN, disks, nor HBA are giving any kind of error codes back to GFS. These three additional problems take about a half-hour to recreate, and I never know which problem is going to recreate, so I'm trying to instrument and solve two problems at the same time. I'm constantly having to switch my focus from one to the other, and that's not easy. The three additional problems are as follows:

========================================
= PROBLEM 1 - The "Leaf Block" problem =
========================================

In about 49% of the recreations, I get an error that has this symptom:

GFS: fsid=shell:gfs.1: fatal: invalid metadata block
GFS: fsid=shell:gfs.1: bh = 148893 (magic)
GFS: fsid=shell:gfs.1: function = get_leaf
GFS: fsid=shell:gfs.1: file = /home/bob/cluster/gfs-kernel/src/gfs/dir.c, line = 440
GFS: fsid=shell:gfs.1: time = 1226431699

Ignore the line number, because my instrumentation throws that off. In this case, GFS function get_leaf is trying to fetch a leaf block for a directory.
Instead of leaf block data, it finds a bunch of random junk from memory. In actual fact, the data on the file system is a correct leaf block, which you can prove by printing out the block in hex, like so:

gfs2_edit -x -p <block> /dev/your/device

Doing so doesn't change the situation, however. If GFS tries to re-read the block ten times, it still gets the same random trash. I went to some trouble to prove that the data is being written correctly, and that the blocks and block sizes are all being calculated correctly. The problem is on the read side of things. The problem does not appear to be related to GFS's i_cache, because I can invalidate/destroy the i_cache and it still sees the same trash. However, if GFS brute-forces a clear_buffer_uptodate of the bh being used to fetch the data, subsequent reads will work and see the correct data. (I just found this today.) This tells me that some process is clearing the buffer_uptodate flag when it shouldn't, or else the flag is not being set when the data is read. I've painstakingly gone through all the places where GFS itself twiddles the buffer_uptodate bit, and I don't think it's GFS doing the bit twiddling. That's not to say that it's not GFS's fault; perhaps something in the GFS code path is causing VFS or NFS to twiddle the bit. Or perhaps NFS is at fault. It's a difficult thing to prove either way. I do know that the bh is given to ll_rw_block(READ...) with the bh properly locked. I also know that particular bh was not used for an unstuff, a truncate, or a metadata clump allocation. I also know the buffer was not uptodate when we submitted it for I/O.
==============================
= PROBLEM 2 the "JD" problem =
==============================

Another 49% of the recreations, I get a problem with this symptom:

GFS: fsid=shell:gfs.1: fatal: invalid metadata block
GFS: fsid=shell:gfs.1:   bh = 126058 (type: exp=7, found=0)
GFS: fsid=shell:gfs.1:   function = gfs_get_data_buffer
GFS: fsid=shell:gfs.1:   file = /home/bob/cluster/gfs-kernel/src/gfs/dio.c, line = 1297
GFS: fsid=shell:gfs.1:   time = 1226423645

Again, ignore the line number. Although it has a similar symptom, this is seemingly a completely different problem. In this scenario, gfs_dir_read is trying to read a block full of indirect pointers at height 1 from a directory inode. (Each of the indirect pointers contained in the block will point to a leaf block.) Instead of a GFS block of type 7 (GFS_METATYPE_JD), function gfs_get_data_buffer is seeing a block that looks just like a metadata allocation clump block. In this case, the metadata is actually corrupt on disk, so no amount of reading or invalidating is going to fix it: the file system is corrupted. It's not always a metadata clump block, but that seems to be a lot more common than any other type of corruption.

In almost all the times I've recreated this, the journals contain absolutely no information about the corrupt block, nor about the directory inode that contains it. Yesterday, I managed to finally recreate the problem with references left in the journal. The journal indicated that the inode was rewritten a bunch of times, but always with the correct data. Metadata clump allocation blocks are not journaled, but I do know for a fact that the bh associated with that page of memory was not used to write any of the 1000 previously allocated metadata clump blocks. Unfortunately, my journal information was obscured by a journal bug in the "withdraw" code that I had fixed in GFS in RHEL5.3. I had to port that patch back to RHEL4 and hope to recreate the problem again.
As in problem #1, the bh had the correct block number and block size. The bh was not used for a recent unstuff or a truncate. (I targeted unstuff and truncate in both cases because they are some of the few GFS functions that mess with pages manually.) I've added additional instrumentation to narrow down this problem, and I'm waiting for it to hit again.

==========================================
= PROBLEM 3 the "journal bounds" problem =
==========================================

This problem occurs the remaining 2% of my recreations. The symptom is an assert that I added as instrumentation to ensure GFS data integrity. I wanted to guarantee that journal writes were never getting outside of the journals. In my recreation scenario, I have three journals, so I added this code to function gfs_logbh_init:

if (blkno < sdp->sd_jindex[0].ji_addr ||
    blkno > (sdp->sd_jindex[2].ji_addr +
             (sdp->sd_jindex[2].ji_nsegment * 16))) {
        printk("Bob: Great. Another bug.\n");
        printk("blkno = 0x%llx\n", blkno);
        printk("j0=0x%llx\n", sdp->sd_jindex[0].ji_addr);
        printk("j2=0x%llx\n", sdp->sd_jindex[2].ji_addr);
        printk("j2s=0x%x\n", sdp->sd_jindex[2].ji_nsegment * 16);
        gfs_assert_warn(sdp, blkno >= sdp->sd_jindex[0].ji_addr &&
                        blkno <= (sdp->sd_jindex[2].ji_addr +
                                  (sdp->sd_jindex[2].ji_nsegment * 16)));
        dump_stack();
}

This function initializes a "fake" buffer head to be used to write into one of the journals. The assert checks to make sure the block number is not before the first journal and not after the last journal. I've had this problem recreate a couple of times with JUST the gfs_assert_warn part but not the other code. I've only seen it occur ONCE with the rest of the code in there. But here's the strange thing: when the instrumented code hit, it printed "Bob: Great. Another bug." as I expected, but the values it subsequently printed for blkno, j0, j2 and j2s all indicated the values were properly within the bounds of the test.
Even more strange, the subsequent call to gfs_assert_warn did NOT trigger, which means the values somehow fixed themselves and were inside the bounds of the test! I've only seen problems like this with faulty hardware: flaky memory or motherboards. Either that, or memory corruption. Either way, this is the most bizarre problem yet, and I have no explanation for it.

That's all I have at the moment; I'm continuing to pursue these bugs.

Created attachment 325912 [details]
Dec 05 patch
This is my latest and greatest patch. I don't know if all the code
changes are needed, and I can't yet guarantee that it fixes all the
problems; more testing is required. At the very least, it makes the
failure very difficult (if not impossible) to recreate. I have not
tested the patch for performance impacts yet.
After extensive testing, problems 1 and 2 are both still there, but harder to hit. I'm continuing to narrow them down. Part of the problem is that the read requests that find the corruption happen a long time after any writes to the corrupt blocks occurred: the writes happened more than 8 million read/write requests prior to discovering the corruption.

BTW, I still think that problem #3 (the journal bounds problem) was due to my faulty instrumentation in the code. I think I was checking the log bounds at the same time the values were being initialized, and therefore they always "fixed themselves" in a matter of microseconds.

I think I've successfully proven that, in the case of problem #1, the "corrupt block" was never written at all! So now I've got to figure out why it didn't write the block when it should have. It was an indirect block that was supposed to point to leaf blocks. It's quite difficult to figure out why it didn't write the block, and it's also difficult to devise a way to detect when it's doing it wrong. So I'm now trying to narrow down the sequence of events on the directory that was written out corrupt, and that's no small matter either. My latest and greatest instrumentation changes the timing too much, and the problem doesn't recreate with it. So now I've got to optimize it to get the timing of the instrumented code closer to normal again.

Created attachment 328294 [details]
Jan 06 patch

This is my latest and greatest patch. I believe I've fixed the "JD problem" with it. The problem was caused by some code that was improperly calculating block extents for directories. The Leaf Block problem can still be recreated with this patch; however, I have now collected a better set of data to analyze that problem and will look at it today.

Detailed explanation of this patch, in order of appearance:

1. In bmap.c I added an AND clause so that gfs_block_map() does not try to calculate block extents for directories.
The calculation was based on inode height, but height has a different meaning for directories because of the hash table that's considered data. This fixes the invalid metadata block problems.

2. In gfs_aspace_releasepage(), I initialize bd_ail_gl_list so that leftover values from previous usage aren't accidentally reused to cause possible corruption. Also, since we're severing the link here between bh and bd, we should zero out the references to each other for the same reason, and we should do it before the cache is freed to avoid the possibility of use-after-free. The bh2bd(bh) = NULL is a deprecated usage that's been updated to the proper form: "bd->bd_bh = NULL;"

3. In ail_empty_gl() I changed the check from !buffer_busy to !buffer_locked because, as reported in the original problem and noted in comment #11, the buffer may still be dirty at this point.

4. In getbuf() I removed the gfs_assert_warn clause. This was just burning precious cpu time, because if bh->b_blocknr == blkno it doesn't need to be mapped.

5. In getbuf() I added mark_page_accessed because that keeps the vfs from freeing the buffer we're about to use. This is also in gfs2. This keeps us from having to remap pages and such.

6. In gfs_dreread() I added wait_on_buffer to make sure the buffer we're going to use does not have any pending IO.

7. I changed function gfs_attach_bufdata() so that it returns the bd pointer, allocating a new bd if necessary. This closes a race condition in function gfs_trans_add_bh.

8. In gfs_attach_bufdata() I initialize the bd_ail_gl_list pointer so we can't reference glocks from previous uses of the memory.

9. In gfs_is_pinned I initialize variable "data" to NULL. There was another timing window whereby two callers that wanted to pin the same page at the same time would interfere with each other.
If the bd_pinned counter is zero and we need to allocate memory for data, the code was first unlocking the buffer head, which gave another process an opportunity to mess with the bh, such as calling gfs_dpin again. To avoid this, I unlock the buffer head, allocate the memory, then go back to the start, where the code reacquires the lock and rechecks the need for pinning. That way, multiple processes can be pinning at the same time and they'll each do the right thing.

10. In gfs_dpin() I changed the code to only set bd once the buffer lock has been acquired on the bh. The way it was done before, the bd value could be changed while we were waiting for the bh lock.

11. In gfs_dunpin() we only need to unlock and relock the buffer head if there's a transaction. If we don't have a transaction, we're only opening up a timing window where another process can mess with the buffer we're working on. Also, we shouldn't decrement the bd_pinned counter until the end of the function; otherwise another process could assume the buffer was unpinned while we're still in the process of unpinning it and messing with the active items list.

12. In dio.h I needed to change the function prototype for gfs_attach_bufdata to reflect that it now returns the bd.

13. In get_leaf() the code was not releasing the buffer if an error is encountered, which results in problems unmounting. So I added the appropriate brelse() call. Also, function dir_split_leaf() was not checking the return code from the call to gfs_dirent_alloc. It really should.

14. Function dir_split_leaf() was not adding the modified dibh dinode buffer to the transaction before it modified its contents. So it might have been possible for buffer changes to be submitted to the dinode without the glock needed to ensure cluster coherency in this code path.

15. Function dir_double_exhash() had the same problem as #14.

16. Function dir_e_add() was not checking the return code from its call to function gfs_dirent_alloc. Naughty naughty.

17.
Function dir_l_add() had the same problem as #14.

18. Function dir_l_del() had the same problem as #14.

19. Function sync_trans() was not checking for SHUTDOWN conditions, which meant that the journals could be slightly corrupted (more like obscured, really) when an error occurs. This is a crosswrite fix from bug #452274.

20. Function commit_trans() had the same issue, and is also a crosswrite fix from bug #452274.

21. Function gfs_log_shutdown() had the same issue, and is also a crosswrite fix from bug #452274.

22. Function gfs_commit_write() was not marking the inode dirty when it changed the file size, so vfs thought it was okay to free it up if more memory was desired.

23. Function gfs_unstuffer_page() was creating empty buffers with the uptodate bit set.

24. Function gfs_find_jhead() was not locking the journal log when finding the journal head. It should do that so the log head doesn't move around while we're looking, lest we get into an infinite loop.

25. In function gfs_trans_add_bh() I call gfs_attach_bufdata with the new calling method.

After a full day of running, I still have not seen the JD problem with my latest patch. Hopefully that means there is only one problem remaining: the corrupt leaf block buffer issue. From my latest instrumentation I can tell this:

1. The problem is not fixed by a newer kernel. I am now running on a pre-release 2.6.9-78.24.ELsmp kernel and the problem is still there.

2. If function dir_e_read() does clear_buffer_uptodate(bh) and retries the read operation, the buffer comes back normal with the correct contents (a directory leaf block).

3. None of the buffer_head variables have changed since the first submit of the IO. For example, nobody has changed the b_data pointer, the b_private variable, the b_state is the same, etc.

4. When the failing read occurs, the code path is always the same, and here is exactly what it does:

   1. getbuf: The create bit is on, so page = grab_cache_page()
   2.
page has no buffers, so it creates some.
   3. page is not mapped, so it maps the block we want.
   4. DIO_START is set coming in to gfs_dreread.
   5. Before the read is issued, buffer_uptodate is 0, so it definitely calls ll_rw_block(READ) on the bh. The buffer_uptodate is still 0 immediately after the read is issued with ll_rw_block(READ).
   6. b_count was 1 (not 0) when the IO was issued.
   7. buffer_locked was not set prior to the ll_rw_block read request, but it is set afterward (hopefully by ll_rw_block, but I should test this to make sure), so vfs should have done the READ.
   8. The DIO_WAIT flag is on, so it calls wait_on_buffer().
   9. When it comes back from the wait_on_buffer, the buffer is now uptodate.
   10. After the read returns, the buffer has 0 in b_data[7], which means the buffer is corrupt as soon as it is read. But it's not always 0; sometimes it's 4 (dinode). It looks like random trash from memory.

5. The "corrupt" directory leaf block was written by the other node, and it has been a leaf block for a very long time.

6. Previous reads of the directory did not find this leaf block to be corrupt.

AFAICT, no other processes on the failing node think that the corrupt block is theirs for any reason, and it doesn't look like there has been any fighting over the page, block or buffer_head. The buffer_uptodate bit was not set by any special condition in GFS. To me it really seems like some kind of vfs bug.

Point 6 in comment #37 seems as if it might be an issue. I'd expect b_count to be 2, since the I/O completion will drop a ref count on the buffer, and if b_count were then 0, it would be invalid to continue to access the bh in question. Since you are making changes here, I'd suggest dropping the (obsolete) ll_rw_block and replacing it with submit_bh. It might make things easier to trace, since there will be fewer possible races.
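For what it's worth, the usual 2.6-era open-coded replacement for a synchronous ll_rw_block(READ, 1, &bh) looks roughly like the sketch below. This is only an illustration of the pattern being suggested, not the actual patch, and it obviously cannot compile outside a kernel tree; the function name read_bh_sync is my own.

```c
/* Sketch only: synchronous read of one buffer_head via submit_bh()
 * instead of ll_rw_block(), as typically open-coded in 2.6-era kernels. */
static int read_bh_sync(struct buffer_head *bh)
{
	if (buffer_uptodate(bh))
		return 0;
	lock_buffer(bh);
	if (buffer_uptodate(bh)) {	/* raced: someone read it for us */
		unlock_buffer(bh);
		return 0;
	}
	get_bh(bh);			/* hold a ref across the I/O */
	bh->b_end_io = end_buffer_read_sync; /* unlocks bh, drops the ref */
	submit_bh(READ, bh);
	wait_on_buffer(bh);
	return buffer_uptodate(bh) ? 0 : -EIO;
}
```

The advantage over ll_rw_block for tracing is that the lock/uptodate checks are explicit at the call site rather than buried inside the helper, so there are fewer places where a racing process can slip in unnoticed.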
Created attachment 328592 [details]
09 Jan 2009 patch

Here is my current theory regarding the "Leaf Block" problem:
--------------------------------------------------------------
When you're talking about a journaled block, such as a directory leaf block, there are really two instances of the data: (1) the "fake" copy that's written to the journal, and (2) the "real" data that gets written to the disk in its proper on-disk location. The gfs kernel code modifies the buffers appropriately, then ensures the "fake" (1) buffers are written to the journal. After the "fake" data is synced to the journal on disk, the code calls log_free_buf() to free the "fake" buffers created for the commit. In function log_free_buf(), the "real" (2) buffers are left to their own devices as soon as the block is committed to the journal. Although the "real" (2) buffers have been submitted to be written to their on-disk locations on node-1, nothing actually forces them to disk at the time the glock is released. If the timing is just right, and the media (SAN, etc.) is kept busy enough, node-2 can acquire the glock (in order to parse the directory leaf blocks for the "find" command) before the "real" buffers have landed on disk. They do, however, eventually land on disk, so any subsequent read will pick up the correct data. That's my theory. I've attempted to fix the problem by waiting for the I/O to complete on the "real" buffers, as late as we can in the process, in function log_free_buf(). Hopefully it won't impact performance, but that's something we have to test.

Here is my current theory regarding the "JD" problem:
--------------------------------------------------------------
I believe the problem was caused when the directory leaf blocks were modified by function gfs_writei but not added to the journal transaction.
Here is how this patch differs from the 06 Jan 2009 patch:
--------------------------------------------------------------
1. In function gfs_block_map, I got rid of the GFS_FILE_REG check. After careful study and scrutiny, I determined that it's perfectly okay for a directory to have block extents, so there should be no need to check for regular files (in theory). My new solution to the "JD Problem" is listed under (3) below.

2. As per Steve's suggestion in comment #38, I replaced the obsolete call to ll_rw_block with some code calling submit_bh. This did not affect either problem, but it's cleaner. I did not replace the other calls to ll_rw_block, which are used for writing; I only changed the READ call.

3. In function gfs_writei() the modified buffer was not being added to the journal transaction. I believe this was the cause of the "JD Problem," so I changed the code to add the buffer to the journal transaction.

4. This is my attempt to fix the Leaf Block problem by waiting for I/O to complete when we're freeing the "fake" buffers.

So far I've run the failing scenario four times without hitting any problems on this patch. Previous patch versions have made it a very long time, so a lot more testing is needed. If this version tests out and cannot be broken, I need to start pruning out the unnecessary code. Although I think they're all good changes, they're probably out of the scope of this fix. The patch is pretty big, so the job of pruning may be a big one.

I've all but proven that at least one of these data corruption problems, and possibly both, can't be GFS's fault. Adding Mike Christie to the cc list. Mike told me on the phone he has a patch that might help.

Rob, you mentioned seeing "no sense" messages.
Were they like this:

Apr 30 11:14:03 linc01a kernel: Info fld=0x0
Apr 30 11:14:03 linc01a kernel: sdi: Current: sense key: No Sense
Apr 30 11:14:03 linc01a kernel: Additional sense: No additional sense

I think this might be the silent corruption we see in this bz:
https://bugzilla.redhat.com/show_bug.cgi?id=468088

Give me a second to port this to RHEL4.

Created attachment 329133 [details]
fix handling of no sense check condition
Patch was made against the RHEL 4.7 kernel.
The "no sense" messages are like this:

Info fld=0x0, Current sdb: sense key No Sense
Info fld=0x0, Current sdb: sense key No Sense
Info fld=0x0, Current sdb: sense key No Sense
Info fld=0x0, Current sdb: sense key No Sense

I'll try the patch.

It's still too early to tell (it needs many more test runs), but Mike's patch seems to have eliminated the corrupt "Leaf Block" problem. My fingers are crossed. Mike's bug is the only explanation I've found that makes sense for that problem, given what the instrumentation tells me (today's instrumentation tells me that the leaf block was written out correctly but was corrupt when it was read back in). That's possibly good news.

I've been hitting the corrupt "Leaf Block" problem exclusively for more than a week now. Today, with Mike's patch, the "JD problem" reared its ugly head once more. So the "JD problem" is still alive and well. (A JD block is considered "data" for the directory, but the "data" is journaled; it is a hash table that gives leaf block locations.) This time I have proof that GFS on node-1 did not write that block, and node-2 only wrote it out as a metadata "clump block". Later, node-2 thought, for some reason, that it should be a hash table block for a given directory. That fits what I've seen before regarding the JD problem. The question is now why that particular block got inserted into the directory hash table if the block was never written. That's exactly what I was trying to determine with an older version of my instrumentation, so I need to go back to that older version. My fear is that the older instrumentation somehow changed the timing and masked the JD problem, which is why I haven't seen it for so long.

I spent all morning running the failing scenario over and over. With this version of the instrumentation, I used to be able to recreate the "Leaf Block" problem in about a half-hour, but with Mike Christie's patch in place, the problem no longer seems to recreate.
That's a very good sign. Which leaves me trying to figure out the "JD problem". It's very hard to recreate now with all my patches in place. After running all morning, I was finally able to collect some data on that problem. In this particular recreation, the "corrupt" block was, in fact, a valid JD block for a very long time. Then, 76 log entries after it was last found to be valid, it was found corrupt, on the same node that was modifying that directory. I'm looking at the instrumentation data now.

Robert, could I hop on the machine real quick, or could you just give me the output of:

cat /sys/block/sdX/device/vendor
cat /sys/block/sdX/device/model

[root@shell-01 ../gfs-kernel/src/gfs]# cat /sys/block/sdb/device/vendor
APPLE
[root@shell-01 ../gfs-kernel/src/gfs]# cat /sys/block/sdb/device/model
Xserve RAID

Thanks, do you still have issues? Do you want me to send the patch I made under this bugzilla, or should I make a new one?

(In reply to comment #50)
> Thanks, do you still have issues? Do you want me to send the patch I made under
> this bugzilla or should I make a new one?

Ignore this. It looks like you still have issues. Let me make a new bz for the scsi issue.

For the scsi issue I made a new bugzilla here:
https://bugzilla.redhat.com/show_bug.cgi?id=480666

Just an update. I spent almost all of Tuesday 20 Jan trying to recreate the "JD problem" with my instrumented code. The problem would not recreate. Curious whether the problem had somehow gone away with recent code changes, I went back to the non-instrumented patch and ran nearly all of Wednesday 21 Jan without a failure. Then on Wednesday evening, the JD problem recreated, but since my instrumentation wasn't there, I collected no information about it. So the problem is still there, but extremely hard to recreate. I restarted testing on the instrumented version again. Finally, late Wednesday night, I caught the failure with instrumentation.
The instrumentation told me two things: (1) Either node 1 did not write the corrupt block at all, or else it had been a VERY long time since it wrote it. I suspect the latter: based on previous data collected, I suspect that the block had been written as a JD block by node-1 at a much earlier time. (2) Node 2 did, in fact, allocate the block as a META_CLUMP block, although it's uncertain when that occurred.

This leads me to believe something is corrupting the block allocation bitmaps in memory. Fortunately for me, this is RHEL4, which does NOT have the new, faster bitmap code, so that cannot possibly be at fault. If my suspicion about a corrupt bitmap is correct, it's a needle in a proverbial haystack. But at least I have a direction to go: I will begin changing my instrumentation to tell me whether this really is bitmap corruption by adding bitmap integrity checking. It's going to be tricky to figure out a good way to do this while still preserving the performance, so that the problem will still recreate for me.

Created attachment 329889 [details]
23 Jan 2009 patch
I spent the last twelve or more hours trying to recreate the "JD
problem" with instrumentation, and I finally caught it. I'm not
sure I believe what it's telling me. It might be a problem with the
instrumentation, but that portion of the code is not complex, so it
is hard to believe there's a problem there either. Basically, the
instrumentation told me that not only was the corrupt block pointer
never written out; neither was the block that contained the pointer.
There are only two reasonable explanations for that: either (a) my
instrumentation has a problem, or (b) the directory data itself was
modified outside of a transaction. So I backed up one pointer-level
prior to the bad one. In this particular case, the data could only
have come from dir_l_add. Looking at that function, I did discover
that data is indeed modified outside of a transaction: there was a
call to function dirent_alloc which modifies the dinode buffer (dibh)
BEFORE the buffer is added to the transaction. (Before my patches,
dir_l_add was not even doing a trans_add_bh, so my patches were
better anyway, just not perfect).
So this patch is basically the same as the previous one, but I moved
the call to gfs_trans_add_bh to before the call to gfs_dirent_alloc.
Hopefully this will solve the problem, and protect the integrity of
the journaled data block. In other words, hopefully this is enough
to prevent the JD problem from happening. I'm hopeful but of course
there are no guarantees.
Once I've found an absolutely stable patch, I need to start removing
the code I added that's no longer needed. That could also be a long
tedious process, since there are so many. I'll probably take a
minimalist approach and start with the original code and add back
only the pieces I absolutely know I need to fix the problem.
Robert, Rob Evers was testing the patch I attached and discovered that it can still corrupt data, so the problems you are still hitting might be the scsi layer's fault. I cc'd Rob Evers here, and we will add an update when we know more.

The problem I was seeing is due to a bug in my test code. The patch Mike provided worked as far as I know.

I verified my kernel has the patch Mike Christie posted earlier. However, I am still seeing the problem, albeit rarely, with my patch. Since putting the patch on 16 January, with tons of testing, I have seen the corrupt metadata leaf block problem twice, the most recent of which was on Tuesday 27 January. I have also seen the "JD problem" on other occasions, which I still believe to be a GFS problem lurking deep in the shadows. With my latest patch I ran all last weekend (when I was available, which was about 14 hours of testing) before hitting the JD problem, so it's now very rare, but not as rare as the corrupt leaf block problem (which, as I said, I've only hit twice).

My concern is that I'm running on the 2.6.9-18 kernel, and some of this might be fixed in more recent kernels. I'm not able to move to newer kernels because of an NFS bug I found, bug #479728. Well, I could, but I would have to reboot after every two runs, which would slow down the testing process to an unacceptable level. It would take twice as long, likely four days, to recreate the more common JD problem and a month to recreate the leaf block problem. Of course, if the newer kernel fixes the problem, it would never recreate.

Based on info I got from today's instrumentation, the problem may boil down to the fact that function gfs_sync_page() only operates on regular files, not directories. Its job is to sync all "data" pages associated with an inode. It currently ignores calls for inodes that aren't "regular" files, because only regular files would have data blocks, right? Wrong.
Directory inodes also have data blocks: the directory hash tables that are being corrupted. Of course, there's also the scsi device driver problem, but that's a side issue. I'm going to do some experimenting with fixing only that function (i.e. stripping out all the other potential fixes I've crafted) and see how it fares.

Well, the previous theory didn't hold true. Tracing the code back carefully, I now believe that the original code in ail_empty_gl was correct to assert-exit if "buffer_busy(bh)". I think the fact that the buffer is still dirty after the sync is somehow causing the JD problem. Therefore I've reinstated the original buffer_busy check.

The code path is as follows: function inode_go_sync is called, and in theory this should be the meta && data case, because the directory has both metadata (the directory hash table) and data (the leaf blocks). The function essentially does this:

gfs_sync_page(gl, flags | DIO_START);
gfs_log_flush_glock(gl);
gfs_sync_buf(gl, flags | DIO_START | DIO_WAIT | DIO_CHECK);
gfs_sync_page(gl, flags | DIO_WAIT);

The call to gfs_sync_buf does this:

filemap_fdatawrite(mapping) - known to have been called
filemap_fdatawait(mapping)  - known to have been called
ail_empty_gl                - known to have been called

The filemap_fdatawrite page range and inode i_size have both been proven correct (0xffffffffffffffff). Yet ail_empty_gl discovers that some buffer_head is still dirty, even after the filemap_fdatawrite/wait was proven to have happened. I've also proven that the buffer_head in question was on the list of gl_ail_bufs when the filemap_fdatawrite/wait were called. The buffer_head is removed from the page buffers shortly thereafter, and the next read assigns a new buffer to the page, but after that read, the data comes back invalid, triggering the JD problem (that is, if the assert isn't coded to stop it first). So now I need to figure out why vfs isn't syncing all the buffers to disk when we've asked it to.
Last note: it seems the problem occurs more readily when I'm running the scenario late at night. Perhaps it has something to do with the lighter network traffic with all the glocks flying around.

I've been staying on a much older 2.6.9-78.18 kernel due to bug #479728, so my fear had been that the problem was already known and fixed in a newer kernel's vfs. Today I circumvented the 479728 bug, which allowed me to move up to a shiny new 2.6.9-80 kernel. The problem still exists there, so the bug is definitely still out there. So I guess I start digging into the vfs to find out why my sync left a buffer dirty.

One more note: I seem to be able to recreate the problem more consistently if I wait until the last "find" ends before I start another run with another find.

I've finally been able to catch red-handed the process responsible for setting the dirty bit on the buffer in question. The call trace looks like this:

Call Trace:
<ffffffff80193969>{mark_buffer_dirty+39}
<ffffffffa02b5efe>{:gfs:gfs_dunpin+237
<ffffffffa02d2cc9>{:gfs:buf_add_to_ail+22}
<ffffffffa02d14b8>{:gfs:log_flush_intern
<ffffffffa02dc308>{:gfs:gfs_fsync+95}
<ffffffffa03320e8>{:nfsd:nfsd_sync_dir+53}
<ffffffffa0333fec>{:nfsd:nfsd_create_v3+758}
<ffffffffa033a54f>{:nfsd:nfsd3_proc_cr
<ffffffffa032eb35>{:nfsd:nfsd_dispatch+220}
<ffffffffa01affb5>{:sunrpc:svc_process+
<ffffffffa032e409>{:nfsd:nfsd+0}
<ffffffffa032e734>{:nfsd:nfsd+811}
<ffffffff80111693>{child_rip+8}
<ffffffffa032e409>{:nfsd:nfsd+0}
<ffffffffa032e409>{:nfsd:nfsd+0}
<ffffffff8011168b>{child_rip+0}

So there are two processes.
One process is doing this:

gfs_glock_xmote_th - if gl state EXCLUSIVE
  inode_go_sync (glops.c) - takes no locks
    gfs_log_flush_glock - does nothing but calls:
      log_flush_internal - takes log lock the whole time
        add_trans_to_ail
          buf_add_to_ail
            gfs_dunpin
    gfs_sync_buf
      filemap_fdatawrite
      ail_empty_gl

When this process gets to ail_empty_gl, one of its buffers has been dirtied by another process doing this:

nfs
  gfs_fsync - file sync - holds the glock in SHARED (ops_file.c)
    gfs_log_flush_glock - does nothing but calls:
      log_flush_internal - takes log lock the whole time
        add_trans_to_ail
          buf_add_to_ail
            gfs_dunpin

The second process, which does the gfs_fsync, takes a shared glock for the inode. The first process is apparently demoting an exclusive glock, trying to sync out the data. Assuming there isn't cross-glock contamination or buffer mixups, I think it's a matter of how to fix it. Which is a big assumption, given the elusive history of this bug.

This is my latest theory as to what is going wrong:

1. One process (pid-a) holds an inode glock and is unlocking it. Meanwhile, another process (pid-b) is trying to fsync the same file from nfs.

2. Pid-b calls gfs_fsync, which requests a "shared" glock. This holder gets put onto the waiters3 queue, waiting for a glock promotion to occur.

3. Pid-a calls run_queue to process its demote request. It checks to see if there are any waiters1 requests, and there aren't. There aren't any waiters2 requests either, and the HIF_DEMOTE bit is on, so it calls rq_demote to demote the glock. It doesn't care about pid-b's promote request on gl_waiters3 at this point. That's likely to keep one node from monopolizing the glock.

4. Pid-a calls go_drop_th because pid-a's holder is unlocked and the glock state is not exclusive.

5. gfs_glock_drop_th calls gfs_log_flush_glock, which takes the log lock, then adds the transaction's buffers to the active items list (ail). Then the log lock is released.
Then it calls gfs_sync_buf, which tells the vfs to sync the data and wait for it to be done.

6. Pid-b then calls gfs_log_flush_glock, which calls log_flush_internal. That runs the transaction's glock buffer list again, and again, it tries to add them to the active items list (ail). Function add_trans_to_ail calls buf_add_to_ail, which calls gfs_dunpin, which marks the buffer dirty a second time. But this marking happens after the vfs has synced the buffer to disk.

7. Pid-a's gfs_sync_buf calls ail_empty_gl, but much to its astonishment, it finds a buffer on the ail list that's still dirty, knowing full well that it had just requested all of them to be synced to disk. So it asserts.

My latest attempt to fix the problem is simply this:

diff --git a/gfs-kernel/src/gfs/glock.c b/gfs-kernel/src/gfs/glock.c
index 1014ea2..670810f 100644
--- a/gfs-kernel/src/gfs/glock.c
+++ b/gfs-kernel/src/gfs/glock.c
@@ -723,8 +723,9 @@ rq_demote(struct gfs_holder *gh)
 	set_bit(GLF_LOCK, &gl->gl_flags);
 	spin_unlock(&gl->gl_spin);
 
-	if (gh->gh_state == LM_ST_UNLOCKED ||
-	    gl->gl_state != LM_ST_EXCLUSIVE)
+	if ((gh->gh_state == LM_ST_UNLOCKED ||
+	     gl->gl_state != LM_ST_EXCLUSIVE) &&
+	    list_empty(&gl->gl_waiters3))
 		/* Unlock */
 		glops->go_drop_th(gl);
 	else

I'm testing the patch now and so far, no failures, but it's still very early in the process.

Hi Ray,

The patch mentioned in comment #63 did not work properly; it caused other problems. However, I believe I am still on the right track. Unfortunately, an nfs kernel issue caused me to lend my cluster to an NFS developer for a few days, so I lost some time there. I tried another fix this morning, and although it seemed to work, it had an unacceptable performance impact. I'm trying another fix as I type this note. Because the problem is so hard to hit, I won't be able to say for sure whether it works properly for a few days.

Created attachment 333550 [details]
27 Feb 2009 patch
This is my latest patch. It seems to work perfectly and has no
ill performance impacts. In fact, it seems to make the recreation
scenario run somewhat faster, often running fourteen minutes rather
than fifteen. Given the history and complexity of this problem, I
obviously need to do a lot more testing before I'll call it good.
I'll also want it thoroughly reviewed by my peers. Frankly, I'm
worried about making sure we maintain cluster coherency with the
patch.
The patch passed the cluster coherency test:

Testcase                  Result
--------                  ------
buffered-write-read       PASS
buffered-write-readv      PASS
buffered-write-pread      PASS
buffered-write-mmread     PASS
buffered-writev-read      PASS
buffered-writev-readv     PASS
buffered-writev-pread     PASS
buffered-writev-mmread    PASS
buffered-pwrite-read      PASS
buffered-pwrite-readv     PASS
buffered-pwrite-pread     PASS
buffered-pwrite-mmread    PASS
buffered-mmwrite-read     PASS
buffered-mmwrite-readv    PASS
buffered-mmwrite-pread    PASS
buffered-mmwrite-mmread   PASS
sync-write-read           PASS
sync-write-readv          PASS
sync-write-pread          PASS
sync-write-mmread         PASS
sync-writev-read          PASS
sync-writev-readv         PASS
sync-writev-pread         PASS
sync-writev-mmread        PASS
sync-pwrite-read          PASS
sync-pwrite-readv         PASS
sync-pwrite-pread         PASS
sync-pwrite-mmread        PASS
sync-mmwrite-read         PASS
sync-mmwrite-readv        PASS
sync-mmwrite-pread        PASS
sync-mmwrite-mmread       PASS
direct-write-read         PASS
direct-write-readv        PASS
direct-write-pread        PASS
direct-writev-read        PASS
direct-writev-readv       PASS
direct-writev-pread       PASS
direct-pwrite-read        PASS
direct-pwrite-readv       PASS
direct-pwrite-pread       PASS
direct-mmwrite-read       PASS
direct-mmwrite-readv      PASS
direct-mmwrite-pread      PASS
=================================================
Total Tests Run: 44
Total PASS:      44
Total FAIL:      0
Total TIMEOUT:   0
Total KILLED:    0
Total STOPPED:   0

Update on the latest: Once again, through rigorous testing, I managed to break the patch. I've made several revisions based on similar code paths that also expose the problem. Now I've got a new version that incorporates some of the fixes posted in previous patches. I had been hoping this was a "simple fix", but it seems there are several things going on at the same time. In other words, I was hoping I could forget the apparent problems I found earlier and fixed with previous patches in favour of a patch that pre-empts the other problems.
(In other words, "If it's not broken, don't fix it.") Unfortunately, it seems as if at least some of the other fixes from previous patches are needed anyway. I have not broken my latest patch, after many hours of testing. Unfortunately, in order to catch any possible "misconduct" by the patch, I've kept a minimal amount of instrumentation in there for the time being. So the new patch is not yet ready for general consumption. Tomorrow morning, I'll do some more testing and if I can't break it, I'll strip out all the instrumentation and test it again. Then I need to redo the cluster coherency test. I'll likely attach it here before starting the coherency test, since that test takes so long. Created attachment 334226 [details]
05 March 2009 patch
This is my latest patch. Testing is in progress.
Sunday night update: I've got a patch that works, but it severely impacts performance. The recreation scenario goes from 15 minutes to 25 minutes; clearly unacceptable. I've tried several ways to redo the logic so that it doesn't impact performance, but either I lose cluster coherency or the assert problem comes back. And of course, it takes hours of testing to determine that any given attempt is bad. I need to find a clever way to rearrange the logic to avoid the race without impacting performance.

For GFS gurus: The problem is that several instances of gfs_sync_buf are racing with one another. To simplify the logic (and make it faster), I've incorporated gfs_sync_buf into log_flush_internal. That function is protected against races, for the most part, by the gfs_log_lock. The gfs_sync_buf, however, is not. The race is caused by another process entering the function and calling add_trans_to_ail, which causes some buffer_heads on the ail_gl_list to be set dirty again. This often happens because the wait for disk IO takes so long compared to the cpu. The add_trans_to_ail is synchronized under the gfs_log_lock, but since the gfs_sync_buf portion calls disk io functions (filemap_fdatawrite and filemap_fdatawait), I clearly cannot hold the log lock during that time or performance will be killed.

Previous patch versions have moved the gl_ail_bufs list to a temporary list using list_splice_init, and ail_empty_gl then operates on the temporary list only. This seems to cause cluster coherency issues. Serializing the whole thing by using, for instance, a GLF_LFLUSH bit in the glock also seems to kill performance: processes end up waiting for long periods of time for the log_lock. If I avoid the log_lock serialization, the assert problem comes back.

So it's a three-way tradeoff: one way kills performance, another kills cluster coherency, and the third has the assert. I need to find an elegant way to do this. I guess I'll sleep on it.
Created attachment 334890 [details]
11 March 2009 patch
This patch is closer, but still has instrumentation (and problems).
It utilizes a new "buffer_handsoff" bit to flag buffer heads that
have been tagged for synchronization. (The bit isn't included in
my patch because I've got a kernel patch as well, which will NOT be
part of the final version). This one works for the most part, but
I was able to break it with the cluster coherency test, after hours
of testing. Basically, it hit an "Invalid Metadata Block" error.
I suspect that's due to a problem with the patch, and I'd like to get
Steve Whitehouse's idea of what could be going wrong. The patch isn't
as big as it would seem; the size is mostly due to my moving three
functions from dio.c to log.c, but there were some changes to them.
I'm not sure that I follow why you've added a rwsem, as it appears that you only use it in write mode. Since the mapping in gfs_inval_buf() is the meta mapping, the atomic_inc/dec of i_writecount is almost certainly nonsense, since that is only checked at the higher layers of the code. I suspect it's a hangover from the use of some earlier function - I've just removed it in gfs2 and it can probably go from here too.

What makes me worried though is that you seem to have added protection from journal flushes in gfs_inval_buf(). The issue is that if we get to this point and there are still dirty buffers or pages, then something has gone wrong. All the data should long since have been written back to disk by the ->go_sync function. I don't follow why the new handsoff bit is required... the dirty bit is there to indicate which buffers need to be synced back to disk.

Hi Steve, and thanks for your comments. You're right; a normal semaphore is sufficient. I fixed it. The i_writecount code was just moved; I hadn't changed it. But with your recommendation, I've now removed the vestigial code as you did with gfs2. The protection in gfs_inval_buf was mostly because I designed sd_log_flush_lock to protect against all callers of ail_empty_gl. It may not be necessary in this case, but I'll have to investigate that.

The new "handsoff" bit was designed to protect the buffer_heads that are set up to be synced by log_flush_glock. This is the timing window I described in comment #74. Basically, two processes are operating on the same glock, often during glock handoffs. One process ("A") calls log_flush_glock, which calls log_flush_internal followed by gfs_sync_buf.
When gfs_sync_buf is called, there is really no race protection, and it does a vfs sync to disk, at which time another process ("B") can get control and call functions like gfs_fsync, which ends up slipping more buffer_heads onto the ail: B's fsync starts another gfs_log_flush_glock, then B's dunpin marks the buffer_head dirty again before ail_empty_gl checks the failing assert (i.e., asserting that the buffer_head is NOT dirty). The handsoff bit allows ail_empty_gl to empty (and sanity check) only the buffer_heads on the ail list that were found while the log_lock was held. Other buffer_heads that were added to the ail by subsequent gfs_fsync calls remain on the ail list, to be synced and removed by that process. Even with the current patch's protection in place, there are buffer_heads that get added to the ail without the handsoff bit set while gfs_sync_buf is waiting for the vfs to sync to disk.

The handsoff bit started out as instrumentation so I could have the kernel tell me what process (and calling sequence) was dirtying the buffer_head while it was being synced to disk. I might be able to simplify the logic by just using the dirty bit for that purpose, especially now that I've determined that the sd_log_flush_lock semaphore does not seem to impact performance, and since changes to the ail lists are protected by the sd_ail_lock spin_lock. Again, that will take serious investigation.

I've found that this code is extremely complex and sensitive, and the slightest code change can have a major impact on either performance or cluster coherency. The big problem is that some of the code traverses the same buffers by the "transaction ail list", other code traverses them by the "glock ail list", and still other code moves buffer_heads around to different transactions and such. It's all very convoluted. I wish it were as simple as gfs2, which does things more sanely.

Hmmm.
I might have found Nate's grow issue hiding in the code:

	error = gfs_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE,
				  LM_FLAG_PRIORITY | GL_SYNC |
				  GL_NOCANCEL_OTHER, &i_gh);

Looks like it's setting GL_SYNC. The other place GL_SYNC gets set is when we update the attributes for an inode - it is not at all clear why. In neither case does it make any sense to force a sync so far as I can see. I can only guess that there was some other bug found and this was supposed to work around it in some way or other...

With my latest and greatest patch (slightly revised from the previously posted patch, as suggested by Steve), I can run the cluster coherency test and a number of normally-failing scenario iterations without failure. That's the good news. The bad news is that I now seem to be occasionally hitting bug #457557, which is a gfs2 bug. The gfs1 and gfs2 code are similar enough that I seem to be triggering it on gfs1 now somehow. I need to add more instrumentation into the code to determine what the problem is. Also, I've removed the protection in gfs_inval_buf as Steve suggested in comment #78, and it hasn't seemed to break anything. I've got new instrumentation to hopefully tell me whether or not it should be safe to remove the "handsoff" bit in the buffer_head. So far it has not triggered that instrumentation after several hours of testing.

Based partly on instrumentation in the code and partly on detective work, it seems to me that this is happening with my latest patch: NFS calls into function do_write_direct to write to a file, which determines that an allocation is not required. For some strange reason, it takes a SHARED lock on the transaction glock, aka sd_trans_gl. After the write is complete, another write on a different node wants to do the same thing. The sd_trans_gl gets demoted and the callback sets the state to LM_ST_UNLOCKED.
Shortly thereafter, trans_go_xmote_th gets called on the sd_trans_gl and, because the state is LM_ST_UNLOCKED, it forgoes the call to gfs_log_shutdown. Later, when it goes to promote the glock, function trans_go_xmote_bh sees that the gl_state is now not LM_ST_UNLOCKED anymore, but it detects that the gfs_log_shutdown was never done, and takes this error:

	if (!(head.lh_flags & GFS_LOG_HEAD_UNMOUNT))
		gfs_consist(sdp);

This is partly speculation on my part, but it seems to fit. The big mystery to me is why do_write_direct needs to hold the sd_trans_gl in the first place.

Scratch that. If the gl_state was not LM_ST_UNLOCKED at the top half (trans_go_xmote_th), it should still have been LM_ST_UNLOCKED at the bottom half (trans_go_xmote_bh). So at the top half, it should have done the gfs_log_shutdown, and it should have found it at the bottom half. Which can only mean that another process added a log header between the trans_go_xmote_th and the gfs_find_jhead. So perhaps the problem here is that the log lock is not adequate to protect that from happening.

I don't follow comment #83 regarding the demotion of the trans glock upon receiving a callback. Since the trans glock is always grabbed in shared mode, except when freezing the fs (which we know is not happening here), there should be no callbacks sent, as the other nodes will always be sending compatible lock requests. It is however quite possible that one of the daemons is demoting the trans gl periodically if it appears to be unused for some length of time, or if something locally sets GL_NOCACHE before releasing the lock for whatever reason. If a log header is being added while the trans gl is not held, then the #1 suspect has to be gfs_logd, which has no idea whether the trans glock is being held or not.

Created attachment 335529 [details]
17 March 2009 patch
This patch takes a slightly new approach that makes more sense.
There were several places in the code that tried to remove buffers
from the two active items lists (transaction and glock ail) but each
one was using a different set of criteria, none of which were
consistent. This patch introduces a new function,
gfs_bd_ail_tryremove, that checks the condition of the buffer using
uniform criteria before removing it from the ail. All the places
that were doing this haphazardly now call the new function.
So really, not counting the scsi device driver bug, there were five
main gfs bugs found with this one scenario:
1. There was a timing window whereby a process could mark a buffer
dirty while it was being synced to disk. This was fixed by
introducing a new semaphore, sd_log_flush_lock, which keeps
that from happening.
2. Buffers were being taken off the ail list at different times
using different criteria. That was fixed by the new function
as mentioned above.
3. Some buffers were not being added to the transaction, especially
in cases where the files were journaled, which happens mostly
with directory hash table data. That was fixed by the
introduction of the necessary calls to gfs_trans_add_bh.
4. The transaction glock was prematurely being released when the
glocks hit a capacity watermark. That's why it often took so
long to recreate some of these problems. To prevent this, new
code was added to function clear_glock so that it would only
release the transaction glock at unmount time. I'm using the
number of glockd daemons to determine whether the call was made
during normal operations or at unmount time. So in order to
accommodate that change, I had to fix a bit of code where the
number of glockd daemons was going negative.
5. When finding its place in the journal, function gfs_find_jhead
was not holding the journal log lock. That caused another
nasty timing window in which the journal could be changed while
the proper journal location was being located.
This patch passes the cluster coherency test and I've run it
through the failing scenario several times with no ill effects.
Current status summary: Engineering has developed a new patch set for this issue, which we have determined is due to several underlying bugs.

There were several places in the code that tried to remove buffers from the two active items lists (transaction and glock ail), but there were consistency issues in their criteria. The patch introduces a new function, gfs_bd_ail_tryremove, that checks the condition of the buffer using uniform criteria before removing it from the ail, and calls this function from all the places identified previously.

Apart from the SCSI device driver bug, there were five main GFS bugs found with this one scenario:

1. There was a timing window whereby a process could mark a buffer dirty while it was being synced to disk.
2. Buffers were being taken off the ail list at different times using different criteria.
3. Some buffers were not being added to the transaction, especially in cases where the files were journaled, which happens mostly with directory hash table data.
4. The transaction glock was prematurely being released when the glocks hit a capacity watermark. This is why it typically took so long to recreate some of these problems.
5. When finding its place in the journal, function gfs_find_jhead was not holding the journal log lock. That caused another timing window where the journal was being changed while the proper journal location was being located.

The current patch passes the cluster coherency test and has been run through the failing scenario several times with no ill effects.

The patch from comment #86 was pushed to the RHEL4 and RHEL48 branches of the cluster git tree. It was tested, and tested, and tested on the shell-0{1,2,3} cluster using the cluster coherency test and the scenario known to cause the problem within minutes on a system without the patch. Setting status to Modified.
For the sake of completeness, I'll document the exact test scenario that uncovered so many problems.

Reboot the entire shell cluster and paste in these commands:

modprobe cman
modprobe lock_dlm
insmod /home/bob/cluster/gfs-kernel/src/gfs/gfs.ko
service cman start
service fenced start
service clvmd start

On one node, paste in:

mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -X -t shell:gfs /dev/shell-vg/lv02

On all nodes, paste:

mount -o debug -t gfs /dev/shell-vg/lv02 /mnt/gfs/
service portmap start ; service nfs start ; exportfs -va

On shell-01, paste but don't hit enter on the second line:

mount -o debug -t gfs /dev/shell-vg/lv02 /mnt/gfs/
service portmap start ; service nfs start ; exportfs -va

On shell-02, paste but don't hit enter on the second line:

mount shell-01:/mnt/gfs /mnt/nfs
time /root/gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfs/dir2

On another shell-02 session, paste but don't hit enter on the second line:

whileloop.sh

which is:

[root@shell-01 ../gfs-kernel/src/gfs]# cat `which whileloop.sh`
#!/bin/bash
while /bin/true; do
    echo "Finding"
    find /mnt/gfs -type f &> /dev/null
done

On shell-03, paste but don't hit enter on the second line:

time for i in `seq 970 -1 0` ; do echo $i ; sleep 1; done ; qarsh shell-02 "killall whileloop.sh"

(You may need to substitute ssh for qarsh). Next:

1. Hit enter on shell-01, wait one second
2. Hit enter on shell-02, wait one second
3. Hit enter on the second shell-02 (whileloop.sh), wait one second
4. Hit enter on shell-03

This triggers the problem pretty quickly on a non-patched system.

Correction to comment #89: On shell-01, after the file system is mounted, this is what should be pasted (not a second GFS mount attempt):

mount shell-02:/mnt/gfs /mnt/nfs
time /root/gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfs/dir1

I hit a problem and am investigating. I suspect the patch caused the problem, so I'm changing the status to FAILS_QA until I can resolve it.

Created attachment 337029 [details]
Addendum patch
The problem I uncovered related to a specific piece of the patch I
had previously pushed out. This patch reverts that piece and fixes
the new problem.
The addendum patch has been pushed to the RHEL4 and RHEL47 branches of the cluster git tree. So I'm changing the status back to Modified.

Bug 495514 was discovered after this fix went in. Moving back to ASSIGNED to resolve the new regression. I also ran into problems running revolver, where recovery of two out of three file systems would hang during the first iteration.

An addendum patch has been pushed to the RHEL4, RHEL47 and RHEL48 branches of the cluster git tree. Changing back to Modified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1045.html

Hi, I'm sorry that I have to report that the errata leads to a new problem with the described test scenario:

GFS: fsid=axqa01:gfs.0: fatal: invalid metadata block
GFS: fsid=axqa01:gfs.0:   bh = 1499994 (type: exp=7, found=0)
GFS: fsid=axqa01:gfs.0:   function = gfs_get_data_buffer
GFS: fsid=axqa01:gfs.0:   file = /builddir/build/BUILD/gfs-kernel-2.6.9-85/smp/src/gfs/dio.c, line = 1161
GFS: fsid=axqa01:gfs.0:   time = 1243456240
GFS: fsid=axqa01:gfs.0: about to withdraw from the cluster
GFS: fsid=axqa01:gfs.0: waiting for outstanding I/O
GFS: fsid=axqa01:gfs.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=axqa01:gfs.0: withdrawn

We repeated the test with the proposed errata RPMs.
We did the following steps to produce the error:

server1# mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -t axqa01:lv_gfstest /dev/mapper/vg_gfstest-lv_gfstest
server{1,2}# mount -t gfs /dev/mapper/vg_gfstest-lv_gfstest /mnt/gfstest
server1# mount -t nfs server2:/mnt/gfstest /mnt/nfstest
server2# mount -t nfs server1:/mnt/gfstest /mnt/nfstest
server{1,2}# while /bin/true; do find /mnt/gfstest -type f -exec ls -l {} \; ; done
server1# ./gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfstest/dir1

For the sake of completeness:

# modinfo gfs
filename:       /lib/modules/2.6.9-89.ELsmp/kernel/fs/gfs/gfs.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System 2.6.9-85.2.el4
depends:        lock_harness,cman
vermagic:       2.6.9-89.ELsmp SMP gcc-3.4

# uname -a
Linux realserver9 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:33:05 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qi GFS-kernel-smp
Name        : GFS-kernel-smp                Relocations: (not relocatable)
Version     : 2.6.9                         Vendor: Red Hat, Inc.
Release     : 85.2.el4                      Build Date: Tue May 5 16:21:44 2009
Install Date: Wed May 27 08:04:22 2009      Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Kernel     Source RPM: GFS-kernel-2.6.9-85.2.el4.src.rpm
Size        : 742008                        License: GPL
Signature   : DSA/SHA1, Wed May 6 02:13:32 2009, Key ID 219180cddb42a60e
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Summary     : GFS-kernel-smp - The Global File System kernel modules
Description :
GFS - The Global File System is a symmetric, shared-disk, cluster file system.

I cannot understand how that can still be happening, unless there is a hardware problem.
Can you please verify whether there were any messages in dmesg regarding hardware errors given before the failure noted above, especially any that look like this:

Apr 30 11:14:03 linc01a kernel: Info fld=0x0
Apr 30 11:14:03 linc01a kernel: sdi: Current: sense key: No Sense
Apr 30 11:14:03 linc01a kernel: Additional sense: No additional sense

Event posted on 2009-06-16 16:31 CEST by rdassen

Hello Jürgen, hello Mark,

As the erratum for bugzilla #455696 has already been released, and the connection between the new problem and the problem for which the erratum was developed is at present not clearly understood, I would recommend that Atix open a new bugzilla report for this new problem (referencing #455696) and work with Red Hat Engineering on it as per the Atix - Red Hat relationship.

Kind regards,
Ray

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
This event sent from IssueTracker by rdassen, issue 187880

In talking with Steve Whitehouse this morning, we decided it would be a good idea if Atix could enable SMART on their SAN and see whether they can get any diagnostic information on possible hardware problems. Also, I'd like to verify that they have rebooted the machines in question since they installed that version of the gfs kernel module. In the past, I've seen cases where customers report that a problem is not fixed, but it turns out that they haven't unloaded and reloaded the kernel module, so although the new kernel module is on the system, they still might have the old one in memory. I'd just like to have them double-check. In this case, a reboot would be best because there are two fixes that need to be running: the gfs kernel module, and the scsi driver fix mentioned in Mike Christie's comments above.

I was able to recreate the problem as described in comment #112. So now I have something to investigate. I do, however, favor the idea of treating this as a new problem and opening a new bugzilla record.
Hi, in my opinion, this bugzilla cannot be closed successfully. The testcase described in comment #1 still causes a similar problem (the filesystem is unmounted and the test fails). Nevertheless, I'd agree with the idea of creating a new bugzilla entry that blocks this one. -Mark

Created attachment 348699 [details]
Patch to try
With this patch, my system no longer recreates the problem.
I need to do further research and analysis, but can Atix try
this and see if it clears up the problem?
Since the customer has not responded in a month and a half regarding my new patch, and since they said in comment #117 that they are not opposed to a new bugzilla to address the problem, I have opened new bug #515267. Since an erratum has already been issued for this bug, and no other customers have reported problems, I'm closing this one again as ERRATA.

We were not able to reproduce the issue using the newest Red Hat-provided RPMs for RHEL4, so the problem seems to be fixed.

Little add-on to my comment #123: With the "patch to try" and the newest Red Hat-provided packages, we were not able to reproduce the issue.