Bug 455696

Summary: NFS over GFS issue (fatal: assertion "!bd->bd_pinned && !buffer_busy(bh)" failed)
Product:          [Retired] Red Hat Cluster Suite
Component:        GFS-kernel
Reporter:         Mark Hlawatschek <hlawatschek>
Assignee:         Robert Peterson <rpeterso>
QA Contact:       Cluster QE <mspqa-list>
Status:           CLOSED ERRATA
Severity:         high
Priority:         urgent
Version:          4
Target Milestone: rc
Keywords:         Reopened, ZStream
Hardware:         All
OS:               Linux
Doc Type:         Bug Fix
CC:               adas, ben.yarwood, bernhard.furtmueller, bkahn, bmarzins, bstevens, cfeist, edamato, grimme, hklein, jplans, mchristi, merz, michael.hagmann, pcfe, ra, rdassen, revers, rpeterso, rrottmann, swhiteho, tao
Cloned by:        515267 (view as bug list)
Bug Blocks:       491369, 491928, 494835, 515267, 674403
Last Closed:      2009-08-03 15:01:36 UTC
Attachments (all flags: none):
  stress test tool
  Proposed patch
  Latest patch
  Better patch
  Oct 3 patch
  Oct 28 patch
  Dec 05 patch
  Jan 06 patch
  09 Jan 2009 patch
  fix handling of no sense check condition
  23 Jan 2009 patch
  27 Feb 2009 patch
  05 March 2009 patch
  11 March 2009 patch
  17 March 2009 patch
  Addendum patch
  Patch to try

Description Mark Hlawatschek 2008-07-17 08:01:05 UTC
Description of problem:

During some stress tests with NFS over GFS, we observed a strange problem.

The test setup consists of two GFS cluster nodes (node1, node2), 
both serving the same NFS exports (/mnt/gfstest).

both nodes# mount -t gfs /dev/XYZ /mnt/gfstest

# exportfs -v 
/mnt/gfstest     	<world>(rw,wdelay,no_root_squash)

The NFS exports are cross-mounted between the cluster nodes: node1 mounts
the NFS export from node2, and node2 mounts the NFS export from node1.

node1 # mount node2:/mnt/gfstest /mnt/nfstest
node2 # mount node1:/mnt/gfstest /mnt/nfstest

During the stress test, node1 creates files in dir1 over NFS and node2
creates files in dir2 on the same GFS, while a loop continuously reads
back the files created by both nodes:

node1 # gen-data [...] /mnt/nfstest/dir1
node2 # gen-data [...] /mnt/nfstest/dir2
node2 # while /bin/true; do find /mnt/gfstest -type f -exec cat {} > /dev/null \; ; done

After some time (about 10 minutes) the following error occurs on node1:

GFS: fsid=axqa01:gfstest.0: fatal: assertion "!bd->bd_pinned && !buffer_busy(bh)" failed
GFS: fsid=axqa01:gfstest.0:   function = ail_empty_gl
GFS: fsid=axqa01:gfstest.0:   file = /builddir/build/BUILD/gfs-kernel-2.6.9-75/smp/src/gfs/dio.c, line = 383
GFS: fsid=axqa01:gfstest.0:   time = 1216216523
GFS: fsid=axqa01:gfstest.0: about to withdraw from the cluster
GFS: fsid=axqa01:gfstest.0: waiting for outstanding I/O
GFS: fsid=axqa01:gfstest.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=axqa01:gfstest.0: withdrawn


Version-Release number of selected component (if applicable):
RHEL4.6
GFS-kernel-smp-2.6.9-75.11

How reproducible:
most times

Steps to Reproduce: 
1. Create a GFS file system with a 1k block size
2. Mount the GFS on both nodes at /mnt/gfstest
3. Export /mnt/gfstest via NFS on both nodes:
/mnt/gfstest     	<world>(rw,wdelay,no_root_squash)

4. mount NFS export on both nodes

node1 # mount node2:/mnt/gfstest /mnt/nfstest
node2 # mount node1:/mnt/gfstest /mnt/nfstest

5. start stressing the filesystem:

node1 # gen-data [...] /mnt/nfstest/dir1
node2 # gen-data [...] /mnt/nfstest/dir2
node2 # while /bin/true; do find /mnt/gfstest -type f -exec cat {} > /dev/null \; ; done

Actual results:

fatal: assertion "!bd->bd_pinned && !buffer_busy(bh)" failed

NFS reports I/O errors

Expected results:

No error should occur.

Additional info:

Comment 1 Mark Hlawatschek 2008-07-17 09:20:12 UTC
Reiner created a nice C program (gen-data.c) to create the files with a
predefined size and distribution.

BTW, we are using the following parameters during our tests:

./gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfstest/dir{1,2}

Comment 2 Mark Hlawatschek 2008-07-17 09:28:51 UTC
Created attachment 312021 [details]
stress test tool

Comment 3 Abhijith Das 2008-07-21 15:54:33 UTC
*** Bug 445000 has been marked as a duplicate of this bug. ***

Comment 8 Robert Peterson 2008-09-24 22:42:47 UTC
Created attachment 317629 [details]
Proposed patch

Here is the patch I'm currently testing.

Comment 9 Robert Peterson 2008-09-25 01:10:21 UTC
The test ran for two hours and six runs, but it finally failed.
The patch definitely makes it harder to recreate, so I think I'm
on the right track, at least.

Comment 11 Robert Peterson 2008-10-01 23:07:35 UTC
Created attachment 319176 [details]
Latest patch

This patch fixes a bunch of bugs I've found.  It turns out that
either NFS or vfs can, and occasionally does, set the buffer's dirty
bit while we're in the middle of function ail_empty_gl.  I
took GREAT pains to make sure the buffer head's dirty flag was
not twiddled by some other aberrant path through the gfs code.

Therefore, I changed the assert from checking "buffer_busy" to
checking only "buffer_locked."  However, if the other timing
windows/bugs I found are not closed/fixed, the dirty buffer is not
always processed properly by the subsequent code, and therefore the
file system may still be corrupted and may take an "invalid
metadata block" assert later.

With this patch, I've been able to run the failing scenario
six times (going on seven) with no failures.

The code changes in this patch are as follows:

1. In gfs_aspace_releasepage, I set bd->bd_bh to NULL so that the
   bh pointer can never be reused by accident.  I also moved the
   zeroing of bh->b_private up before the storage is released
   with kmem_cache_free.  That closes a timing window whereby
   another process can sneak in and reuse the bd for a different
   bh, and we would nuke the pointer.  I also changed it to its
   simpler form to get rid of a compiler warning.  
2. Function ail_empty_gl's assert was changed due to NFS/VFS
   re-setting the buffer head's Dirty flag at a critical time,
   as stated above.
3. I changed function gfs_attach_bufdata so that it returns the
   pointer to the gfs_bufdata it finds.  This comes in useful
   later.
4. I moved the decrementing of bd_pinned to the bottom of function
   gfs_dunpin, as stated in an earlier comment.  This prevents two
   processes from racing to get into the code that marks the
   buffer dirty and interfering with one another (see the sketch
   after this list).
5. Function dir_split_leaf was writing into an inode buffer
   without adding it to the transaction.  That can cause the
   inode buffer to not be journaled correctly, and the buffer
   to therefore not be accounted for correctly in its attempt
   to flush metadata buffers before writing them to their final
   in-place location.
6. Function gfs_trans_add_bh was checking to see if a buffer
   head already had a buffer descriptor attached, but it was
   checking the pointer without the page locked.  That created a
   timing window whereby a process could reuse a bd when it
   should not.  That's why I needed to change gfs_attach_bufdata
   to protect this value.
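
For illustration, here is a minimal sketch of the ordering described in
item 4, assuming bd_pinned is a simple counter that only changes while
the buffer head is locked; this is not the actual GFS source:

#include <linux/buffer_head.h>

struct sketch_bufdata {
	struct buffer_head *bd_bh;
	unsigned int bd_pinned;	/* pin count; only changed with the bh locked here */
};

static void sketch_dunpin(struct sketch_bufdata *bd)
{
	struct buffer_head *bh = bd->bd_bh;

	lock_buffer(bh);
	if (bd->bd_pinned == 1)
		mark_buffer_dirty(bh);	/* last unpin: buffer may be written back */

	/* ... relink the buffer onto the appropriate AIL/transaction list ... */

	/* Decrement only after the work above is done, so a racing unpinner
	 * cannot see bd_pinned == 0 and treat the buffer as unpinned while
	 * this process is still marking it dirty and relinking it. */
	bd->bd_pinned--;
	unlock_buffer(bh);
}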

Comment 12 Robert Peterson 2008-10-02 19:01:52 UTC
Created attachment 319283 [details]
Better patch

I was able to break the previous patch on about the ninth run.  It
broke with another Invalid Metadata assert in the directory code.
Digging around, I discovered three more places in dir.c that modify
a dinode buffer without adding the buffer to the transaction.  That
may account for the problem, so this patch adds the necessary calls to
gfs_trans_add_bh.  I've run this version almost as long as the previous
patch (several hours) without hitting any problems.  Still, testing
continues.

Comment 13 Robert Peterson 2008-10-03 15:18:33 UTC
Created attachment 319380 [details]
Oct 3 patch

Unfortunately, I did encounter a problem with the 4th generation
patch, but only after about ten runs of the failing scenario.
When it failed, it gave me an "invalid metadata block" assert.
That may be due to the circumstances of my file system, but it's
hard to tell.  I'd like to zero out my test file system completely
with dd from /dev/zero, then mkfs, then restart.

This patch is the same as the previous patch, except I fixed one thing.
I discovered that when I got the above "invalid metadata block" assert,
the system got into an infinite loop and eventually gave the
message "stuck in gfs_releasepage()."  This patch fixes that loop.

So this patch is a whole lot better than the original code, and fixes a
lot of bugs.  However, there is still the concern over the invalid
metadata block message.  In this case, the bad block looked like GFS
metadata, but the block type was zero.  As far as I can tell, that can
only happen through a call to function clump_alloc.  All other places
in GFS where it initializes the buffer, it sets both the magic number
and the block type.

Comment 14 Robert Peterson 2008-10-03 15:19:30 UTC
Last note: this problem may very well not occur if the default block
size of 4K is used.

Comment 15 Abhijith Das 2008-10-07 17:20:35 UTC
I ran the test for 10 iterations with blocksize=1024 after zeroing out the device with a dd from /dev/zero followed by a mkfs. The 10 iterations ran to completion without asserting, but I did see the "invalid metadata block" assert after the filesystem had been idle for some time (a couple of hours or so).

I ran the test for 20 iterations with the default block size (4096). The test ran through 20 iterations without a failure. The filesystem has been idle for a couple of hours now and I don't see any asserts on the console.

Comment 16 Abhijith Das 2008-10-10 16:22:38 UTC
This is the assert I'm seeing

On shell-01

nfs_update_inode: inode number mismatch
expected (0:14/0x28592fa), got (0:14/0x0)
nfs_update_inode: inode number mismatch
expected (0:14/0x87), got (0:14/0xffffffffa02f5488)
nfs_update_inode: inode number mismatch
expected (0:14/0x87), got (0:14/0x10214c42000)
(... the previous two lines repeat 26 more times ...)
nfs_update_inode: inode number mismatch
expected (0:14/0x87), got (0:14/0x1021b604000)
GFS: fsid=shell:gfs.0: jid=1: Trying to acquire journal lock...
GFS: fsid=shell:gfs.0: jid=1: Looking at journal...
GFS: fsid=shell:gfs.0: jid=1: Acquiring the transaction lock...
GFS: fsid=shell:gfs.0: jid=1: Replaying journal...
GFS: fsid=shell:gfs.0: jid=1: Replayed 36780 of 37250 blocks
GFS: fsid=shell:gfs.0: jid=1: replays = 36780, skips = 224, sames = 246
GFS: fsid=shell:gfs.0: jid=1: Journal replayed in 6s
GFS: fsid=shell:gfs.0: jid=1: Done
nfs_statfs: statfs error = 116


shell-02

GFS: fsid=shell:gfs.1: fatal: invalid metadata block
GFS: fsid=shell:gfs.1:   bh = 3470969 (type: exp=6, found=0)
GFS: fsid=shell:gfs.1:   function = get_leaf
GFS: fsid=shell:gfs.1:   file = /home/bob/cluster/gfs-kernel/src/gfs/dir.c, lin8
GFS: fsid=shell:gfs.1:   time = 1223601261
GFS: fsid=shell:gfs.1: about to withdraw from the cluster
GFS: fsid=shell:gfs.1: waiting for outstanding I/O
GFS: fsid=shell:gfs.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=shell:gfs.1: withdrawn

Comment 17 Mark Hlawatschek 2008-10-13 14:53:10 UTC
Comment #16: Did this assertion result from testing Robert's latest patch?

Comment 18 Abhijith Das 2008-10-13 15:05:48 UTC
Yes. It took about 10 iterations of the test.

Comment 19 Robert Peterson 2008-10-27 21:22:30 UTC
I've been trying to narrow down the aforementioned "invalid metadata block"
problem with my latest patch.  The biggest problem is that it takes a very
long time to recreate the failure.

First, I used instrumentation to determine that the data corruption is
not related to the directory leaf manipulation code.  At least it's very
unlikely.  It appears to be a corrupt block in the file system.  In all
the cases of corruption, the block type is 0, which is only used in
function "clump_alloc".  Until today, the error was always detected when
traversing the directory leaf blocks resulting from the find command.
So I've been focusing my efforts on searching the directory code for
problems.  However, today, I was able to recreate a failure with a
different block type:

GFS: fsid=shell:gfs.0: fatal: invalid metadata block
GFS: fsid=shell:gfs.0:   bh = 35512638 (type: exp=7, found=0)
GFS: fsid=shell:gfs.0:   function = gfs_get_data_buffer
GFS: fsid=shell:gfs.0:   file = /home/bob/cluster/gfs-kernel/src/gfs/dio.c, line = 1280

This was basically checking for a journaled block, type GFS_METATYPE_JD, but
once again, it found type 0, which is from clump_alloc.

Note that in previous cases, there were not multiple clump blocks near the
corrupt block.  So it looks like the data may have been written into the
wrong block or something.  In previous cases, the block looked to be all
zeros except for the GFS metadata headers.  In one case it was
not all zeros.  In fact, I couldn't identify what the block was.
It didn't look like any GFS data structure.  Also, it didn't look like
directory data, nor the single-byte data the test program writes to files.
It looked like weird database stuff.

Also note that the corruption apparently happened a "long time" prior to
its being detected by the find.  In other words, there are absolutely no
references to the corrupt block in any of the journals.

I tried increasing the journal size to 2GB, which makes the mkfs take
a very long time.  Unfortunately, I was not able to recreate the
failure this way.  But I also didn't try for more than a few hours.

Lately I've been doing some experiments to try to get the recreation
time down to a more reasonable time.  First, I tried reducing the block
size from 1K to 512 bytes.  No effect.  Next, I tried increasing the
block size to 2K.  No effect.

Thanks to Abhi's experiment of zeroing the device before mkfs, it seems
more likely that the problem has something to do with the clump_alloc
code.

Since the "clump_alloc" code that deals with these zero-type blocks
kicks in when the code runs out of blocks in a given RG, I decided to
try changing the RG size from its default to 32MB.  That did seem to
make the problem occur quicker.  The problem recreated on the first
run of the scenario rather than the second or third.  However, I've
only done that once.  The nearby blocks looked like other clump blocks
(unlike previous recreations) except for the third block on the same
page, which looked like a directory leaf.  At least this time there
weren't any inodes on the same physical page.

Once again, data surrounding the corrupt clump block was all zeros,
but the block on the page boundary was also a clump block, and it
looked almost like byte-shifted part of an inode:

000000087784F000 01161970 00000000 00000000 00000000 [...p............]
000000087784F010 00000000 00000000 00000000 00000000 [................]
000000087784F020 00000000 00000001 00000001 000081A4 [................]
000000087784F030 00000001 00000000 00000000 00000000 [................]
000000087784F040 00000101 00000000 00000400 00000000 [................]
000000087784F050 00000000 00000000 00000001 00000000 [................]
000000087784F060 021DE040 490623F0 00000000 490623F0 [...@I.#.....I.#.]
000000087784F070 00000000 490623F0 00000000 00000001 [....I.#.........]
000000087784F080 00000000 49062056 00096B5F 00000000 [....I. V..k_....]
(with the rest all 0x00).

The value 021DE040 at offset 0x60 is a GFS disk block number of a disk
inode and that inode contains the value 490623F0 as its time stamps.

So I'm continuing my analysis, focusing on the clump_alloc code.

Comment 20 Robert Peterson 2008-10-28 22:43:28 UTC
Created attachment 321732 [details]
Oct 28 patch

This version of the patch seemingly works better.  I still have some
investigation to do, so this is likely to be revised again.  But so
far I've seen no failures with it.  This is good news, considering I
found ways to increase my chances of hitting the problem (more later).

Comment 21 Robert Peterson 2008-10-28 23:02:19 UTC
First, let it be known that I'm focusing my efforts on the "Invalid
Metadata Block" problem described earlier, which only appears with
my newer patches.  For example, the Oct 03 patch.  I still believe
that patch is beneficial and allows the test to get much farther, but
the user's tests are uncovering this second bug in the code, which I'm
now pursuing.

I've done several things today.  

I figured out that if I maximize the number of RGs and minimize the
size of the journals, I'm much more likely to hit the Invalid Metadata
Block error with the Oct 03 patch.  My best chances seem to be when I
use this mkfs command:

mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -t shell:gfs /dev/shell-vg/lv01

After this mkfs command, I've been able to recreate the failure on almost
every run of the test program, although sometimes it takes up to ten to
fifteen minutes to recreate.  That's still much better than yesterday,
when it was taking up to three hours to recreate.

I verified that the "Invalid Metadata" problem always complains about a
block written by function clump_alloc.  In other words, I changed the
clump_alloc block type to be 0x77, and when the failure occurred, it
reported the wrong block as having type 0x77.

I've also determined that in the failing case, node 1 is trying to
write the block complained about, and the "find" command of node 2
is trying to read the block, then complaining.

I've gone through all the code that uses the clump_alloc blocks and
verified that the logic looks correct.  There aren't any blatant glaring
problems there.

Lastly, I've created the previously posted patch.  This patch is the
same as the October 3 patch, but it adds some calls to "mark_buffer_dirty"
which hopefully causes any blocks fetched from the previously allocated
"clumps" to be written to disk.  This seems to have a positive affect
on the code.  This patch seems to make the failure much less likely.
Unfortunately, I did receive another failure with that patch, but
only after several runs of the failing scenario.  I believe that another
carefully placed call to mark_buffer_dirty would eliminate this failing
case as well.

I need to discuss this with Steve Whitehouse in the morning because it's
likely that "mark_buffer_dirty" is circumventing the problem rather than
solving the problem.

Comment 22 Robert Peterson 2008-10-30 15:04:57 UTC
I removed the "mark_buffer_dirty" calls because they were not solving the
problem, they were only closing the window a little bit.  In other words,
I was able to recreate the problem with the calls there, so it's better
to not mask the symptom at this point in time.  Next, I added more
instrumentation to the code to try to narrow down what's happening some
more.

A big concern of mine, given the previous symptoms, is that we might
be hitting a scenario like this:

1. The rg runs out of free metadata blocks, so GFS goes to set aside
   a clump of free data blocks as free metadata blocks.
2. Function clump_alloc changes the bitmap settings for a clump of
   blocks from FREE to FREEMETA.  Then another process gets control.
3. The other process wants to allocate a metadata block, so scans the
   bitmap for FREEMETA and finds one of the blocks being manipulated
   by clump_alloc.  The process writes its metadata header into the
   buffer associated with that block.
4. Function clump_alloc gains control again and requests the block be
   overwritten with the dummy header that seems to be popping up in
   our errors.  This overwrites the good data with what we're seeing.

So first I verified that the proper rgrp glocks are in place in the
clump_alloc code.

Next, I verified that the meta header cache is not being checked at the
same time a new clump is being added to it.

Unfortunately, every time I change the instrumentation, I change the
symptoms.  There now seems to be some cluster coherency problem.
With the smaller mkfs, and if the timing is just right, I am able to
recreate this error:

GFS: fsid=shell:gfs.0: fatal: invalid metadata block
GFS: fsid=shell:gfs.0:   bh = 34933213 (magic)
GFS: fsid=shell:gfs.0:   function = get_leaf
GFS: fsid=shell:gfs.0:   file = /home/bob/cluster/gfs-kernel/src/gfs/dir.c, line = 442

This is a completely different symptom.  The corrupt block does not
look like a metadata cache-allocated "clump" block in this case.  With
my latest instrumentation, I haven't been seeing that problem lately.

So get_leaf discovered a block that should be a directory leaf, but it
sees a block that is not even GFS metadata (it seems to be just data).
With instrumentation, I verified that the block was NOT fetched from the
meta header cache.  I also verified that the corrupt block was actually
read with ll_rw_block(READ, 1, &bh) in function gfs_dreread, and the
wait_on_buffer(bh) function must have been called.

None of these symptoms seem to occur if I make the journal big.
But without a fair amount of data in the journal, it's tough to tell
the history of the blocks that are being reported as corrupt.

I need to do two things: (1) make sure the scenario I gave above is
not actually happening.  (2) examine the journal log code for problems
relating to cluster coherency.  I may need to recreate my glock history
patch to solve this, but the timing is so sensitive that I highly doubt
the problems will reoccur with that patch in place.  So I can only move
forward on this problem an inch or two at a time.  I may need to do
more research into making it recreate faster, such as increasing or
decreasing the meta header cache size and such.

Comment 23 Robert Peterson 2008-10-31 21:26:09 UTC
Well, I verified that the scenario above is not happening.

I added instrumentation to detect if any process is writing a buffer
that's corrupt.  It never triggered, but somehow the buffer was read
back corrupt anyway.  That means that when GFS tried to write a
directory leaf block (METATYPE_LF) or a journaled data block (METATYPE_JD)
the buffer was correct when it submitted the buffer for IO.

I also verified that when the problem is detected, the buffer was NOT
previously used to write either a (1) leaf block, (2) Journaled data
block, (3) metadata clump, nor was the buffer fetched from the meta
header cache.

I spent a fair amount of time trying to get the problem to recreate
faster with a bigger journal size so I can get a history of the corrupt
block.  I was not successful.  One thing I tried was running four
simultaneous copies of the gen-data program on each server.  No luck.

At one point, the "METATYPE_JD" error triggered on, of all things, the
quota file's inode.  In other words, GFS claimed that the quota file's
inode was not a metadata block when, in fact, it was.

This made me take a step back and question whether the problem is somewhere
in the vfs rather than in GFS.  The kernel I was running was quite old, so I
upgraded my kernel to 2.6.9-78-16, which is for pre-release 4.7.
Then I had to recompile the cluster suite for the new kernel, and I had
several issues with that.  Finally I got things working again, but I hit
the problem again regardless.

I also verified the problem is not specific to the special code that
we use to process nfs directory requests.  The error happens regardless
of whether the nfs or normal readdir is used.

Comment 24 Robert Peterson 2008-11-07 00:35:46 UTC
Whatever is going wrong is happening at a very low level.  Observe this
set of events:

[root@shell-02 ../gfs-kernel/src/gfs]# mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -X -t shell:gfs /dev/shell-vg/lv01
Expert mode:               on
Device:                    /dev/shell-vg/lv01
Blocksize:                 1024
Filesystem Size:           104730448
Journals:                  3
Resource Groups:           3198
Locking Protocol:          lock_dlm
Lock Table:                shell:gfs

Syncing...
All Done
[root@shell-02 ../gfs-kernel/src/gfs]# sync
[root@shell-02 ../gfs-kernel/src/gfs]# mount -t gfs /dev/shell-vg/lv01 /mnt/gfs/
mount: /dev/shell-vg/lv01: can't read superblock
[root@shell-02 ../gfs-kernel/src/gfs]# dmesg | tail -8
GFS: Trying to join cluster "lock_dlm", "shell:gfs"
GFS: fsid=shell:gfs.0: Joined cluster. Now mounting FS...
GFS: fsid=shell:gfs.0: fatal: invalid metadata block
GFS: fsid=shell:gfs.0:   bh = 74 (magic)
GFS: fsid=shell:gfs.0:   function = gfs_get_meta_buffer
GFS: fsid=shell:gfs.0:   file = /home/bob/cluster/gfs-kernel/src/gfs/dio.c, line = 1241
GFS: fsid=shell:gfs.0:   time = 1226017322
GFS: fsid=shell:gfs.0: can't get journal index inode: -5
[root@shell-02 ../gfs-kernel/src/gfs]# gfs2_edit -x -p 74 /dev/shell-vg/lv01
Block #74    (0x4a) of 104857600 (0x6400000) (disk inode)
-------------------- Journal Index ------------------
00012800 01161970 00000004 00000000 00000000 [...p............] 
00012810 00000190 00000000 00000000 0000004A [...............J] 
00012820 00000000 0000004A 00000180 00000000 [.......J........] 
00012830 00000000 00000001 00000000 000000F0 [................] 
00012840 00000000 00000001 00000000 49138A0B [............I...] 
00012850 00000000 49138A0B 00000000 49138A0B [....I.......I...] 
00012860 00000000 00000000 00000000 00000000 [................] 
00012870 00000000 00000000 00000000 00000000 [................] 
00012880 00000001 000003E8 00010000 00000000 [................] 
00012890 00000000 00000000 00000000 00000000 [................] 
000128A0 00000000 00000000 00000000 00000000 [................] 
000128B0 00000000 00000000 00000000 00000000 [................] 
000128C0 00000000 00000000 00000000 00000000 [................] 
000128D0 00000000 00000000 00000000 00000000 [................] 
000128E0 00000000 00000000 00000000 031F4000 [..............@.] 
(snip)
[root@shell-02 ../gfs-kernel/src/gfs]# mount -t gfs /dev/shell-vg/lv01 /mnt/gfs/
mount: /dev/shell-vg/lv01: can't read superblock

So the page on disk and/or as seen from user space actually DOES
contain the magic number, but GFS still sees the wrong page.
I would have thought that the gfs2_edit program would have caused
the vfs to reload the page.

So I can only conclude that something in memory is being corrupted
and once it's corrupted, it remains corrupt in the eyes of the GFS
kernel module.  Userland can still see the correct data though.

Comment 25 Robert Peterson 2008-11-07 23:17:02 UTC
Well, the symptom from the previous post was an anomaly caused by a
problem with my instrumentation.  Once I fixed the problem it started
behaving as I'd expect it.  However, it doesn't shed any light on the
problem at hand.

I've gone to great lengths to check for the proper read requests being
sent properly to the vfs layer, and the buffer_heads are in the proper
state.  I've also made sure the radix tree is finding the proper page,
the page is not locked, and the block sizes are not being corrupted.

So it seems as if memory is being corrupted somehow.  That theory is
supported by the following scenario that occurred today:

This instrumented gfs function produced the output that follows it:

void
gfs_logbh_init(struct gfs_sbd *sdp, struct buffer_head *bh,
	       uint64_t blkno, char *data)
{
	memset(bh, 0, sizeof(struct buffer_head));
	bh->b_state = (1 << BH_Mapped) | (1 << BH_Uptodate) | (1 << BH_Lock);
	atomic_set(&bh->b_count, 1);
	set_bh_page(bh, virt_to_page(data), ((unsigned long)data) & (PAGE_SIZE - 1));
	if (blkno < sdp->sd_jindex[0].ji_addr ||
	    blkno > (sdp->sd_jindex[2].ji_addr +
		     (sdp->sd_jindex[2].ji_nsegment * 16))) {
		printk("Bob: Great.  Another bug.\n");
		printk("blkno = 0x%llx\n", blkno);
		printk("j0=0x%llx\n", sdp->sd_jindex[0].ji_addr);
		printk("j2=0x%llx\n", sdp->sd_jindex[2].ji_addr);
		printk("j2s=0x%x\n", sdp->sd_jindex[2].ji_nsegment * 16);
		gfs_assert_warn(sdp, blkno >= sdp->sd_jindex[0].ji_addr &&
				blkno <= (sdp->sd_jindex[2].ji_addr +
					  (sdp->sd_jindex[2].ji_nsegment * 16)));
		dump_stack();
	}
	bh->b_blocknr = blkno;
	bh->b_size = sdp->sd_sb.sb_bsize;
	bh->b_bdev = sdp->sd_vfs->s_bdev;
	init_buffer(bh, logbh_end_io, NULL);
	INIT_LIST_HEAD(&bh->b_assoc_buffers);
}

Bob: Great.  Another bug.
blkno = 0x3203701
j0=0x31f4000
j2=0x3204000
j2s=0x8000

Call Trace:<ffffffffa0371d8a>{:gfs:gfs_logbh_init+426} <ffffffffa038727c>{:gfs:gfs_log_get_buf+230} 
       <ffffffffa03892f9>{:gfs:buf_build_bhlist+114} <ffffffff8031838c>{wait_for_completion+231} 
       <ffffffff8031838c>{wait_for_completion+231} <ffffffffa0387552>{:gfs:disk_commit+170} 
       <ffffffffa038a727>{:gfs:quota_trans_size+33} <ffffffffa0387993>{:gfs:log_refund+111} 
       <ffffffffa0387eaa>{:gfs:log_flush_internal+510} <ffffffffa0391b78>{:gfs:gfs_fsync+96} 
       <ffffffff8015b898>{__filemap_fdatawrite_range+95} <ffffffffa02ebaa5>{:nfsd:nfsd_sync_dir+53} 
       <ffffffffa02ed7ab>{:nfsd:nfsd_create_v3+759} <ffffffffa02f3acf>{:nfsd:nfsd3_proc_create+307} 
       <ffffffffa02e87dc>{:nfsd:nfsd_dispatch+219} <ffffffffa01914ac>{:sunrpc:svc_process+1227} 
       <ffffffff801347cf>{default_wake_function+0} <ffffffffa02e831b>{:nfsd:nfsd+0} 
       <ffffffffa02e8553>{:nfsd:nfsd+568} <ffffffff8013351c>{schedule_tail+55} 
       <ffffffff80110fd3>{child_rip+8} <ffffffffa02e831b>{:nfsd:nfsd+0} 
       <ffffffffa02e831b>{:nfsd:nfsd+0} <ffffffff80110fcb>{child_rip+0} 

But the logic is correct.  The "if" test passed, flagging "blkno" as
being outside the bounds of the journals, but the values that printed
are, in fact, correct, and blkno is entirely within journal1, which is
what I expect.  The "assert" that follows is testing the same values
but it found them to be correct (with reverse logic).  The "blkno"
parameter passed in to the function should not change inside the
function.  Therefore, either memory was corrupt and later "fixed" or
else I'm having some kind of hardware problem; maybe my machine's
memory or motherboard (or both) are flaky.  Either that or some kernel
process is changing my memory.

Comment 26 Robert Peterson 2008-11-11 20:47:24 UTC
Time for a lengthy update on this bug record.  The last patch that
I posted to this bugzilla record fixes the fatal assertion issue
reported with this bugzilla.  The patch allows GFS to get further, but
I'm now experiencing three additional problems.

The problems are all highly timing-dependent and hard to recreate.
I've never seen any of them when I use the default block size of 4K on mkfs.
Making the journals bigger (like the default of 128MB) makes the
problems extremely rare, but not necessarily impossible.  I'm using the
minimum size of 32MB in order to recreate the problem.

The problems do not appear to be related to the media.  Neither the
SAN, disks nor HBA are giving any kind of error codes back to GFS.

These three additional problems take about a half-hour to recreate,
and I never know which problem is going to recreate, so I'm trying
to instrument and solve two problems at the same time.  I'm constantly
having to switch my focus from one to the other, and that's not easy.

The three additional problems are as follows:

========================================
= PROBLEM 1 - The "Leaf Block" problem =
========================================
In about 49% of the recreations, I get an error with this
symptom:

GFS: fsid=shell:gfs.1: fatal: invalid metadata block
GFS: fsid=shell:gfs.1:   bh = 148893 (magic)
GFS: fsid=shell:gfs.1:   function = get_leaf
GFS: fsid=shell:gfs.1:   file = /home/bob/cluster/gfs-kernel/src/gfs/dir.c, line = 440
GFS: fsid=shell:gfs.1:   time = 1226431699

Ignore the line number because my instrumentation throws that off.

In this case, GFS function get_leaf is trying to fetch a leaf block
for a directory.  Instead of leaf block data, it finds a bunch of
random junk from memory.  In actual fact, the data on the file
system is a correct leaf block, which you can prove by printing out
the block in hex, like so:

gfs2_edit -x -p <block> /dev/your/device

Doing so doesn't change the situation, however.  If GFS tries to
re-read the block ten times, it still gets the same random trash.
I went to some trouble to prove that the data is being written
correctly, the blocks and block sizes are all being calculated
correctly.  The problem is on the read side of things.

The problem does not appear to be related to GFS's i_cache, because
I can invalidate/destroy the i_cache and it still sees the same
trash.  However, if GFS brute-forces a clear_buffer_uptodate of
the bh being used to fetch the data, subsequent reads will work
and see the correct data.  (I just found this today)

This tells me that some process is clearing the buffer_uptodate
flag when they shouldn't, or else it is not being set when the
data is read.  I've painstakingly gone through all the places
where GFS itself twiddles the buffer_uptodate bit, and I don't
think it's GFS doing the bit twiddling.  That's not to say that
it's not GFS's fault.  Perhaps something in the GFS code path is
causing VFS or NFS to twiddle the bit.  Or perhaps NFS is at
fault.  It's a difficult thing to prove either way.  I do know
that the bh is given to ll_rw_block(READ...) with the bh properly
locked.  I also know that particular bh was not used for an
unstuff, a truncate or for a metadata clump allocation.  I also
know the buffer was not uptodate when we submitted it for I/O.
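
For reference, the forced re-read described above amounts to something
like the following minimal sketch (standard buffer-head calls only; not
the actual GFS code, and the caller is assumed to have just failed the
metadata-type check on bh):

#include <linux/buffer_head.h>

static void sketch_force_reread(struct buffer_head *bh)
{
	clear_buffer_uptodate(bh);	/* discard the cached (bad) contents */
	ll_rw_block(READ, 1, &bh);	/* ll_rw_block re-issues the read because
					 * the buffer is no longer uptodate */
	wait_on_buffer(bh);		/* on return, bh holds the on-disk data */
}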

==============================
= PROBLEM 2 the "JD" problem =
==============================
In another 49% of the recreations, I get a problem with this
symptom:

GFS: fsid=shell:gfs.1: fatal: invalid metadata block
GFS: fsid=shell:gfs.1:   bh = 126058 (type: exp=7, found=0)
GFS: fsid=shell:gfs.1:   function = gfs_get_data_buffer
GFS: fsid=shell:gfs.1:   file = /home/bob/cluster/gfs-kernel/src/gfs/dio.c, line = 1297
GFS: fsid=shell:gfs.1:   time = 1226423645

Again, ignore the line number.

Although it has a similar symptom, this is seemingly a completely
different problem.  In this scenario, gfs_dir_read is trying to
read a block full of indirect pointers at height 1 from a directory
inode.  (Each of these indirect pointers contained in the block
will point to a leaf block).  Instead of a GFS block of type 7
GFS_METATYPE_JD, function gfs_get_data_buffer is seeing a block
that looks just like a metadata allocation clump block.

In this case, the metadata is actually corrupt on disk, so no
amount of reading or invalidating is going to fix that.  The file
system is corrupted.  It's not always a metadata clump block, but
that seems to be a lot more common than any other type of corruption.

In almost all the times I've recreated this, the journals contain
absolutely no information about the corrupt block, nor the directory
inode that contains the corrupt block.  Yesterday, I managed to
finally recreate the problem with references left in the journal.
The journal indicated that the inode was rewritten a bunch of times,
but always with the correct data.  Metadata clump allocation
blocks are not journaled, but I do know for a fact that the bh
associated with that page of memory was not used to write any of
the 1000 previously allocated metadata clump blocks.

Unfortunately, my journal information was obscured by a journal bug
in the "withdraw" code that I had fixed in GFS in RHEL5.3.
I had to port the patch back to RHEL4 and hope to recreate it again.

As in problem #1, the bh had the correct block number and block
size.  The bh was not used for a recent unstuff, or a truncate
(I targeted unstuff and truncate in both cases because they are some
of the few GFS functions that mess with pages manually.)

I've added additional instrumentation to narrow down this problem,
and I'm waiting for it to hit again.

==========================================
= PROBLEM 3 the "journal bounds" problem =
==========================================
This problem occurs in the remaining 2% of my recreations.  The symptom
is an assert that I added as instrumentation to ensure GFS data
integrity.  I wanted to guarantee that journal writes were never
getting outside of the journals.  In my recreation scenario, I have
three journals, so I added this code into function gfs_logbh_init:

	if (blkno < sdp->sd_jindex[0].ji_addr ||
	    blkno > (sdp->sd_jindex[2].ji_addr +
		     (sdp->sd_jindex[2].ji_nsegment * 16))) {
		printk("Bob: Great.  Another bug.\n");
		printk("blkno = 0x%llx\n", blkno);
		printk("j0=0x%llx\n", sdp->sd_jindex[0].ji_addr);
		printk("j2=0x%llx\n", sdp->sd_jindex[2].ji_addr);
		printk("j2s=0x%x\n", sdp->sd_jindex[2].ji_nsegment * 16);
		gfs_assert_warn(sdp, blkno >= sdp->sd_jindex[0].ji_addr &&
				blkno <= (sdp->sd_jindex[2].ji_addr +
					  (sdp->sd_jindex[2].ji_nsegment * 16)));
		dump_stack();
	}

This function initializes a "fake" buffer head to be used to write
into one of the journals.  The assert checks to make sure the block
number is not before the first journal and not after the last journal.
I've had this problem recreate a couple of times with JUST the
gfs_assert_warn part but not the other code.  I've only seen it
occur ONCE with the rest of the code in there.  But here's the strange
thing: When the instrumented code hit, it printed "Bob: Great.
Another bug." as I expected, but the subsequent values it printed
for J0, J2 and j2s all indicated the values were properly within the
bounds of the test.  Even more strange, the subsequent call to
gfs_assert_warn did NOT trigger, which means the values somehow fixed
themselves and were inside the bounds of the test!

I've only seen problems like this with faulty hardware: flaky memory
or motherboards.  Either that or memory corruption.  Either way, this
is the most bizarre problem yet and I have no explanation for it.

That's all I have at the moment; I'm continuing to pursue these bugs.

Comment 32 Robert Peterson 2008-12-05 21:32:11 UTC
Created attachment 325912 [details]
Dec 05 patch

This is my latest and greatest patch.  I don't know if all the code
changes are needed.  Also, I don't guarantee all the problems are
fixed with it.  I need to do more testing to see whether or not the
problems are fixed.  At the very least, it's very difficult (if not
impossible) to recreate the failure with this patch in place.  I have
not tested the patch for performance impacts yet.

Comment 33 Robert Peterson 2008-12-17 23:06:09 UTC
After extensive testing, both problems 1 and 2 are still there but
harder to hit.  I'm continuing to narrow it down.  Part of the
problem is that the read requests that are finding the corruption
happen a long time after any writes had occurred to the corrupt
blocks.  The writes happened more than 8 million read/write
requests prior to discovering the corruption.  BTW, I still think
that problem #3 (the journal blocks problem) was due to my faulty
instrumentation in the code.  I think I was checking the log bounds
at the same time the values were being initialized and therefore
they always fixed themselves in a matter of microseconds.

Comment 34 Robert Peterson 2008-12-19 22:45:19 UTC
I think I've successfully proven that, in the case of problem #1,
the "corrupt block" was never written at all!  So now I've got to
figure out why it didn't write the block when it should have.
It was an indirect block that was supposed to point to leaf blocks.

It's quite difficult to figure out why it didn't write the block.
It's also difficult to write any way to detect when it's doing it
wrong.  So I'm now trying to narrow down what the sequence of events
was on the directory that was written out corrupt, and that's no
small matter either.  My latest and greatest instrumentation changes
the timing too much and the problem doesn't recreate with it.  So now
I've got to optimize it to get the timing of the instrumented code
closer to normal again.

Comment 36 Robert Peterson 2009-01-06 16:40:15 UTC
Created attachment 328294 [details]
Jan 06 patch

This is my latest and greatest patch.  I believe I've fixed the
"JD problem" with this patch.  The problem was caused by some code
that was improperly calculating block extents for directories.

The Leaf Block problem can still be recreated with this patch.
However, I have now collected a better set of data to analyze that
problem and will look at it today.

Detailed explanation of this patch, in order of appearance:

1. In bmap.c I added an and clause so that gfs_block_map() does not
   try to calculate block extents for directories.  The calculation
   was based on inode height, but height has a different meaning
   for directories because of the hash table that's considered data.
   This fixes the invalid metadata block problems.
2. In gfs_aspace_releasepage(), I initialize bd_ail_gl_list so that
   leftover values from previous usage aren't accidentally reused
   to cause possible corruption.  Also, since we're severing the
   link here between bh and bd, we should zero out the references
   to each other for the same reason, and we should do it before
   the cache is freed to avoid the possibility of use-after-free.
   The bh2bd(bh) = NULL is a deprecated usage that's been updated
   to the proper form: "bd->bd_bh = NULL;"
3. In ail_empty_gl() I changed the check from !buffer_busy to
   !buffer_locked because, as reported in the original problem,
   and noted in comment #11, the buffer may still be dirty at this
   point.
4. In getbuf() I removed the gfs_assert_warn clause.  This was
   just burning precious cpu time because if bh->b_blocknr == blkno
   it doesn't need to be mapped.
5. In getbuf() I added mark_page_accessed because that keeps the
   vfs from freeing the buffer we're about to use.  This is also
   in gfs2.  This keeps us from having to remap pages and such.
6. In gfs_dreread() I added wait_on_buffer to make sure the buffer
   we're going to use does not have any pending IO.
7. I changed function gfs_attach_bufdata() so that it returns the
   bd pointer, allocating a new bd if necessary.  This closes a
   race condition in function gfs_trans_add_bh.
8. In gfs_attach_bufdata() I initialize the bd_ail_gl_list pointer
   so we can't reference glocks from previous uses of the memory.
9. In gfs_is_pinned I initialize variable "data" to NULL.
   There was another timing window whereby two callers that wanted
   to pin the same page at the same time would interfere with each
   other.  If the bd_pinned count is zero and we need to allocate
   memory for "data", the code was first unlocking the buffer head,
   which gave another process an opportunity to mess with the bh,
   such as calling gfs_dpin again.  To avoid this, I unlock the
   buffer head, allocate the memory, then go back to the start,
   where the code reacquires the lock and checks the need for
   pinning again.  That way, multiple processes can be pinning at
   the same time and they'll each do the right thing (see the
   sketch after this list).
10. In gfs_dpin() I changed to code to only set bd once the
   buffer lock has been acquired on the bh.  The way it was done
   before, the bd value could be changed while we were waiting
   for the bh lock.
11. In gfs_dunpin() we only need to unlock and relock the buffer
   head if there's a transaction.  If we don't have a transaction
   we're only opening up a timing window where another process can
   mess with the buffer we're working on.  Also, we shouldn't
   decrement the bd_pinned counter until the end of the function
   otherwise another process could assume the buffer was unpinned
   while we're still in the process of unpinning it and messing
   with the active items list.
12. In dio.h I needed to change the function prototype for
   gfs_attach_bufdata to reflect that it now returns the bd.
13. In get_leaf() the code was not releasing the buffer if an
   error is encountered.  That results in problems unmounting.
   So I added the appropriate brelse() call.
14. Function dir_split_leaf() was not checking the return code
   from the call to gfs_dirent_alloc.  It really should.
15. Function dir_split_leaf() was not adding the modified dibh
   dinode buffer to the transaction before it modified its contents.
   So it might have been possible for buffer changes to be
   submitted to the dinode without the glock needed to ensure
   cluster coherency in this code path.
16. Function dir_double_exhash() had the same problem as 15.
17. Function dir_e_add() was not checking the return code from its
   call to function gfs_dirent_alloc.  Naughty naughty.
18. Function dir_l_add() had the same problem as #15.
19. Function dir_l_del() had the same problem as #15.
20. Function sync_trans() was not checking for SHUTDOWN conditions,
   which meant that the journals could be slightly corrupted
   (more like obscured, really) when an error occurs.  This is a
   crosswrite fix from bug #452274.
21. Function commit_trans() had the same issue, and is also a
   crosswrite fix from bug #452274.
22. Function gfs_log_shutdown() had the same issue, and is also a
   crosswrite fix from bug #452274.
23. Function gfs_commit_write() was not marking the inode dirty
   when it changed the file size, so vfs thought it was okay to
   free it up if more memory was desired.
24. Function gfs_unstuffer_page() was creating empty buffers with
   the uptodate bit set.
25. Function gfs_find_jhead() was not locking the journal log
   when finding the journal head.  It should do that so the log
   head doesn't move around while we're looking, lest we get into
   an infinite loop.
26. In function gfs_trans_add_bh() I call gfs_attach_bufdata with
   the new calling method.
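
A minimal sketch of the unlock/allocate/retry pattern from item 9
follows.  It is not the gfs_dpin code; the "needs pinning" test is a
hypothetical stand-in, and the point is only that the state is
re-checked after the buffer lock is reacquired:

#include <linux/buffer_head.h>
#include <linux/slab.h>
#include <linux/string.h>

static char *sketch_copy_for_pin(struct buffer_head *bh, unsigned int bsize)
{
	char *data = NULL;

retry:
	lock_buffer(bh);
	if (!buffer_dirty(bh)) {	/* hypothetical "no longer needs pinning" test */
		unlock_buffer(bh);
		kfree(data);		/* kfree(NULL) is a no-op */
		return NULL;
	}
	if (!data) {
		/* The allocation is done with the bh unlocked, as in the
		 * original code; the fix is to go back to the top afterwards
		 * so the pin state is re-checked under the lock instead of
		 * assuming nothing changed while the lock was dropped. */
		unlock_buffer(bh);
		data = kmalloc(bsize, GFP_NOFS);
		if (!data)
			return NULL;
		goto retry;
	}
	memcpy(data, bh->b_data, bsize);	/* lock held: contents are stable */
	unlock_buffer(bh);
	return data;
}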

Comment 37 Robert Peterson 2009-01-06 23:04:46 UTC
After a full day of running, I still have not seen the JD problem with
my latest patch.  Hopefully that means there is only one problem
remaining: The corrupt leaf block buffer issue.  From my latest
instrumentation I can tell this:

1. The problem is not fixed by a newer kernel.  I am now running on
   a pre-release 2.6.9-78.24.ELsmp kernel and the problem is still there.
2. If function dir_e_read() does clear_buffer_uptodate(bh) and retries
   the read operation again, the buffer comes back normal with the
   correct contents (a directory leaf block).
3. None of the buffer_head variables have changed from the first
   submit of the IO.  For example, nobody has changed the b_data
   pointer, the b_private variable, the b_state is the same, etc.
4. When the failing read occurs, the code path is always the same;
   here is exactly what it does (a condensed sketch follows this list):

    1. getbuf: The create bit is on, so page = grab_cache_page()
    2. page has no buffers, so it creates some.
    3. page is not mapped, so it maps the block we want.
    4. DIO_START is set coming in to gfs_dreread.
    5. Before the read is issued, buffer_uptodate is 0 so it
       definitely calls ll_rw_block(READ) on the bh.
       The buffer_uptodate is still 0 immediately after the read is
       issued with ll_rw_block(READ).
    6. b_count was 1 (not 0) when the IO was issued.
    7. buffer_locked was not set prior to the ll_rw_block read request
       but it is set afterward (hopefully by ll_rw_block but I should
       test this to make sure) so vfs should have done the READ.
    8. DIO_WAIT flag is on so it calls the wait_on_buffer().
    9. When it comes back from the wait_on_buffer the buffer is
       now uptodate.
   10. The buffer had 0 in b_data[7], after the read returns, which
       means the buffer is corrupt as soon as it is read.  But it's
       not always 0.  Sometimes it's 4 (dinode).  Looks like random
       trash from memory.
5. The "corrupt" directory leaf block was written by the other node,
   and it has been a leaf block for a very long time.
6. Previous reads of the directory did not find this leaf block to be
   corrupt.
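
Condensed into code, the path above looks roughly like the following.
This is a simplified sketch built from the 2.6.9-era calls named in the
steps (grab_cache_page, create_empty_buffers, map_bh, ll_rw_block,
wait_on_buffer), not the actual getbuf/gfs_dreread source, and most
error handling is omitted:

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/pagemap.h>

static struct buffer_head *sketch_read_meta(struct address_space *mapping,
					    struct super_block *sb,
					    sector_t blkno)
{
	unsigned int bsize = sb->s_blocksize;
	unsigned long index = blkno >> (PAGE_CACHE_SHIFT - sb->s_blocksize_bits);
	unsigned int bufnum = blkno & ((PAGE_CACHE_SIZE / bsize) - 1);
	struct page *page;
	struct buffer_head *bh;
	unsigned int n;

	page = grab_cache_page(mapping, index);		/* step 1: page comes back locked */
	if (!page)
		return NULL;
	if (!page_has_buffers(page))
		create_empty_buffers(page, bsize, 0);	/* step 2 */
	for (n = 0, bh = page_buffers(page); n < bufnum; n++)
		bh = bh->b_this_page;			/* walk to the wanted block */
	if (!buffer_mapped(bh))
		map_bh(bh, sb, blkno);			/* step 3 */
	get_bh(bh);
	unlock_page(page);
	page_cache_release(page);

	if (!buffer_uptodate(bh)) {			/* step 5 */
		ll_rw_block(READ, 1, &bh);		/* issue the read */
		wait_on_buffer(bh);			/* step 8 */
	}
	/* Steps 9-10: the buffer is now uptodate; its contents should be the
	 * on-disk block, but in the failing case they are random junk. */
	return bh;
}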

AFAICT, no other processes on the failing node think that the corrupt
block is theirs for any reason, and it doesn't look like there has
been any fighting over the page, block or buffer_head.  The
buffer_uptodate bit was not set on by any special condition in GFS.

To me it really seems like some kind of vfs bug.

Comment 38 Steve Whitehouse 2009-01-07 12:53:08 UTC
Point 6 in comment #37 seems as if it might be an issue. I'd expect b_count to be 2, since the I/O completion will drop a reference count on the buffer, and if b_count were then 0, it would be invalid to continue accessing the bh in question.

Since you are making changes here, I'd suggest dropping the (obsolete) ll_rw_block and replacing it with submit_bh. It might make things easier to trace since there will be fewer possible races.
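
For what it's worth, the replacement Steve suggests would look roughly
like this for the single synchronous read; a hedged sketch using the
standard 2.6 buffer-head calls, not the eventual patch:

#include <linux/fs.h>
#include <linux/buffer_head.h>

static int sketch_read_bh_sync(struct buffer_head *bh)
{
	lock_buffer(bh);
	if (buffer_uptodate(bh)) {	/* someone else read it while we waited */
		unlock_buffer(bh);
		return 0;
	}
	get_bh(bh);			/* hold a reference across the I/O */
	bh->b_end_io = end_buffer_read_sync;	/* unlocks bh and drops the ref */
	submit_bh(READ, bh);
	wait_on_buffer(bh);
	return buffer_uptodate(bh) ? 0 : -EIO;
}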

Comment 41 Robert Peterson 2009-01-09 21:01:58 UTC
Created attachment 328592 [details]
09 Jan 2009 patch

Here is my current theory regarding the "Leaf Block" problem:
--------------------------------------------------------------
When you're talking about a journaled block, such as directory
leaf blocks, there are really two instances of the data:
(1) The "fake" copy that's written to the journal, and (2)
The "real" data that gets written to the disk in its proper
on-disk location.

The gfs kernel code modifies the buffers appropriately, then it
ensures the "fake" (1) buffers are written to the journal.
After the "fake" data is synced on disk to the journal, the code
calls log_free_buf() to free the "fake" buffers created for the
commit to disk.

In function log_free_buf(), the "real" (2) buffers are left to
their own devices just as soon as the block is committed to the
journal.

Although the "real" (2) buffers have been submitted to be written
to their on-disk locations on node-1, nothing actually forces the
"real" (2) buffers to disk at the time the glock is released.

If the timing is just right, and the media (SAN, etc.) is kept
busy enough, node-2 can acquire the glock (in order to parse the
directory leaf blocks for the "find" command) before the "real"
buffers have landed on-disk.

They do, however, eventually land on disk, so any subsequent
read will pick up the correct data.  That's my theory.

I've attempted to fix the problem by waiting for the I/O to
complete on the "real" buffers, hopefully as late as we can in the
process, in function log_free_buf().  Hopefully it won't impact
performance, but that's something we have to test.
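
Sketched out, the attempted fix looks something like this; it is
illustrative only, and real_bh is a placeholder for however
log_free_buf() reaches the in-place buffer head for the block being
freed:

#include <linux/buffer_head.h>

static void sketch_log_free_buf(struct buffer_head *fake_bh,
				struct buffer_head *real_bh)
{
	/* The in-place write has already been submitted; wait for it to land
	 * so the other node cannot read a stale block once the glock drops. */
	wait_on_buffer(real_bh);

	/* ... the original log_free_buf() teardown of fake_bh follows here ... */
	(void)fake_bh;
}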


Here is my current theory regarding the "JD" problem:
--------------------------------------------------------------
I believe the problem was caused when the directory leaf blocks
were modified by function gfs_writei without the blocks being
added to the journal transaction.

Here is how this patch differs from the 06 Jan 2009 patch:
--------------------------------------------------------------

1. In function gfs_block_map, I got rid of the GFS_FILE_REG
   check.  After careful study and scrutiny, I determined that
   it's perfectly okay for a directory to have block extents.
   So there should be no need to check for regular files (in theory).
   My new solution to the "JD Problem" is listed under (3) below.
2. As per Steve's suggestion in comment #38, I replaced the
   obsolete call to ll_rw_block with some code for calling submit_bh.
   This did not affect either problem, but it's cleaner.
   I did not replace the other calls to ll_rw_block, which are
   used for writing.  I only changed the READ call.
3. In function gfs_writei() the modified buffer was not being
   added to the journal transaction.  I believe this was the cause
   of the "JD Problem".  So I changed the code to add the buffer
   to the journal transaction.
4. This is my attempt to fix the Leaf Block problem by waiting
   for I/O to complete when we're freeing the "fake" buffers.

So far I've run the failing scenario four times without hitting
any problems on this patch.  Previous patch versions have made
it a very long time, so a lot more testing is needed.

If this version tests out and cannot be broken, I need to start
pruning out the unnecessary code.  Although I think they're all
good changes, they're probably out of the scope of this fix.
The patch is pretty big, so the job of pruning may be a big one.

Comment 42 Robert Peterson 2009-01-15 19:09:32 UTC
I've all but proven that at least one of these data corruption problems,
and possibly both, cannot be GFS's fault.  Adding Mike Christie to
the cc list.  Mike told me on the phone he has a patch that might help.

Comment 43 Mike Christie 2009-01-15 19:26:06 UTC
Rob, you mentioned seeing "no sense" messages. Were they like this:

Apr 30 11:14:03 linc01a kernel: Info fld=0x0
Apr 30 11:14:03 linc01a kernel: sdi: Current: sense key: No Sense
Apr 30 11:14:03 linc01a kernel:     Additional sense: No additional sense

I think this might be the silent corruption we see in this bz:
https://bugzilla.redhat.com/show_bug.cgi?id=468088

Give me a second to port this to RHEL4.

Comment 44 Mike Christie 2009-01-15 19:37:41 UTC
Created attachment 329133 [details]
fix handling of no sense check condition

Patch was made against the RHEL 4.7 kernel.

Comment 45 Robert Peterson 2009-01-15 20:15:30 UTC
The "no sense" messages are like this:

Info fld=0x0, Current sdb: sense key No Sense
Info fld=0x0, Current sdb: sense key No Sense
Info fld=0x0, Current sdb: sense key No Sense
Info fld=0x0, Current sdb: sense key No Sense

I'll try the patch.

Comment 46 Robert Peterson 2009-01-15 21:56:50 UTC
It's still too early to tell (it needs many more test runs), but
Mike's patch seems to have eliminated the corrupt "Leaf Block"
problem.  My fingers are crossed.  Mike's bug is the only explanation
I've found that makes sense for that problem, given what the
instrumentation tells me (today's instrumentation tells me that the
leaf block was written out correctly, but was corrupt when it was
read back in).  That's possibly good news.

I've been hitting the corrupt "Leaf Block" problem exclusively for
more than a week now.  Today, with Mike's patch, the "JD problem" reared
its ugly head once more.  So the "JD problem" is still alive and well.
(A JD block is considered "data" for the directory, but the "data" is
journaled, and it is a hash table that gives leaf block locations).

This time I have proof that GFS on node-1 did not write that block and
node-2 only wrote it out as a metadata "clump block".  Later, node-2
thought, for some reason, that it should be a hash table block for a
given directory.  That fits what I've seen before regarding the JD
problem.

The question is now why that particular block got inserted into the
directory hash table if the block was never written.  That's exactly
what I was trying to determine with an older version of my
instrumentation.  So I need to go back to that older version of my
instrumentation.  My fear is that the older instrumentation somehow
changed the timing and masked the JD problem, which is why I haven't
seen it for so long.

Comment 47 Robert Peterson 2009-01-16 18:06:10 UTC
I spent all morning running the failing scenario over and over.
With this version of the instrumentation, I used to be able to
recreate the "Leaf Block" problem in about a half-hour, but with
Mike Christie's patch in place, the problem no longer seems to
recreate.  That's a very good sign.  

Which leaves me trying to figure out the "JD problem".  It's very
hard to recreate it now with all my patches in place.  After
running all morning, I was finally able to collect some data on
that problem.   In this particular recreation, the "corrupt"
block was, in fact, a valid JD block for a very long time.  Then,
76 log entries after it was last found to be valid, it was found
corrupt, on the same node that was modifying that directory.
I'm looking at the instrumentation data now.

Comment 48 Mike Christie 2009-01-16 19:15:54 UTC
Robert, could I hop on the machine real quick or could you just give me the output of

cat /sys/block/sdX/device/vendor
cat /sys/block/sdX/device/model

?

Comment 49 Robert Peterson 2009-01-16 20:37:58 UTC
[root@shell-01 ../gfs-kernel/src/gfs]# cat /sys/block/sdb/device/vendor
APPLE   
[root@shell-01 ../gfs-kernel/src/gfs]# cat /sys/block/sdb/device/model 
Xserve RAID

Comment 50 Mike Christie 2009-01-19 18:43:23 UTC
Thanks, do you still have issues? Do you want me to send the patch I made under this bugzilla or should I make a new one?

Comment 51 Mike Christie 2009-01-19 18:46:57 UTC
(In reply to comment #50)
> Thanks, do you still have issues? Do you want me to send the patch I made under
> this bugzilla or should I make a new one?

Ignore this. It looks like you still have issues. Let me make a new bz for the scsi issue.

Comment 52 Mike Christie 2009-01-19 18:54:02 UTC
For the scsi issue I made a new bugzilla here
https://bugzilla.redhat.com/show_bug.cgi?id=480666

Comment 53 Robert Peterson 2009-01-22 14:49:33 UTC
Just an update.  I spent almost all of Tuesday 20 Jan trying to
recreate the "JD problem" with my instrumented code.  The problem
would not recreate.  Curious if the problem had somehow gone away
with recent code changes, I went back to the non-instrumented patch
and ran nearly all of Wednesday 21 Jan without a failure.  Then on
Wednesday evening, the JD problem recreated, but since my
instrumentation wasn't there, I collected no information about it.
So the problem is still there, but extremely hard to recreate.

I restarted testing on the instrumented version again.  Finally, late
Wednesday night, I caught the failure with instrumentation.  The
instrumentation told me two things: (1) Either node 1 did not write
the corrupt block at all or else it had been a VERY long time since
it wrote it.  I suspect the latter.  Based on previous data collected,
I suspect that the block had been written as a JD block by node-1
at a much earlier time.  (2) Node 2 did, in fact, allocate the block
as a META_CLUMP block, although it's uncertain when that occurred.

This leads me to believe something is corrupting the block allocation
bitmaps in memory.  Fortunately for me, this is RHEL4, which does NOT
have the new, faster bitmap code, so that cannot possibly be at fault.
If my suspicions about the corrupt bitmap are correct, it's a needle in
a proverbial haystack.  But at least I have a direction to go: I will
begin changing my instrumentation to tell me if this really is
bitmap corruption by adding bitmap integrity checking.  It's going to
be tricky to figure out a good way to do this while still preserving
performance, so that the problem will still recreate for me.
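As a first cut, the integrity check would look something like the
sketch below.  All of the names (structure fields and the
bitmap_state() helper) are hypothetical stand-ins, not the real GFS
code; the idea is simply to compare what the in-core bitmap says is
free against the count the resource group header claims.

   static void check_rgrp_bitmap(struct gfs_rgrpd *rgd)
   {
           unsigned int blk, free_seen = 0;

           /* walk every block the resource group covers */
           for (blk = 0; blk < rgd->rd_data_blocks; blk++)
                   if (bitmap_state(rgd, blk) == GFS_BLKST_FREE)
                           free_seen++;

           /* a mismatch means the in-memory bitmap was corrupted */
           if (free_seen != rgd->rd_free_count)
                   printk(KERN_WARNING
                          "GFS: rgrp bitmap mismatch: %u free in bitmap, "
                          "%u in header\n", free_seen, rgd->rd_free_count);
   }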

Comment 54 Robert Peterson 2009-01-23 23:39:14 UTC
Created attachment 329889 [details]
23 Jan 2009 patch

I spent the last twelve or more hours trying to recreate the "JD
problem" with instrumentation, and I finally caught it.  I'm not
sure I believe what it's telling me.  It might be a problem with the
instrumentation, but that portion of the code is not complex, so it
is hard to believe there's a problem there either.  Basically, the
instrumentation told me that not only was the corrupt block pointer
never written out, but neither was the block that contained the pointer.

There are only two reasonable explanations for that: either (a) my
instrumentation has a problem, or (b) the directory data itself was
modified outside of a transaction.  So I backed up one pointer-level
prior to the bad one.  In this particular case, the data could only
have come from dir_l_add.  Looking at that function, I did discover
that data is indeed modified outside of a transaction: there was a
call to function dirent_alloc which modifies the dinode buffer (dibh)
BEFORE the buffer is added to the transaction.  (Before my patches,
dir_l_add was not even doing a trans_add_bh, so my patches were
better anyway, just not perfect).

So this patch is basically the same as the previous one, but I moved
the call to gfs_trans_add_bh to before the call to gfs_dirent_alloc.
Hopefully this will solve the problem, and protect the integrity of
the journaled data block.  In other words, hopefully this is enough
to prevent the JD problem from happening.  I'm hopeful but of course
there are no guarantees.
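
For illustration, the ordering change looks roughly like this (a
sketch only, not the literal patch hunk; the argument lists are
approximations of the GFS-internal calls named above):

   /* Old order (broken): gfs_dirent_alloc() modified the dinode
    * buffer before it was part of the transaction, so the change
    * could escape the journal. */
   error = gfs_dirent_alloc(dip, dibh, name_len, &dent);
   gfs_trans_add_bh(dip->i_gl, dibh);      /* too late */

   /* New order: protect the buffer first, then modify it. */
   gfs_trans_add_bh(dip->i_gl, dibh);
   error = gfs_dirent_alloc(dip, dibh, name_len, &dent);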

Once I've found an absolutely stable patch, I need to start removing
the code I added that's no longer needed.  That could also be a long
tedious process, since there are so many.  I'll probably take a
minimalist approach and start with the original code and add back
only the pieces I absolutely know I need to fix the problem.

Comment 55 Mike Christie 2009-01-26 19:15:05 UTC
Robert,

Rob Evers was testing the patch I attached, and discovered that it still can corrupt data, so the problems you are still hitting might be the scsi layer's fault. I've cc'd Rob Evers here and we'll add an update when we know more.

Comment 56 Rob Evers 2009-01-28 16:25:18 UTC
The problem I was seeing is due to a bug in my test code.  The patch Mike provided worked as far as I know.

Comment 57 Robert Peterson 2009-01-29 15:54:47 UTC
I verified my kernel has the patch Mike Christie posted earlier.
However, I am still seeing the problem, albeit, rarely, with my
patch.  Since putting the patch on 16 January, with tons of testing
I have seen the corrupt metadata leaf block problem twice, the most
recent of which was on Tuesday 27 January.  I have also seen the
"JD problem" on other occasions, which I still believe to be a GFS
problem lurking deep in the shadows.  With my latest patch I ran
all last weekend (when I was available, which was about 14 hours of
testing) before hitting the JD problem, so it's now very rare, but
not as rare as the corrupt leaf block problem (which, as I said, I've
only hit twice).

My concern is that I'm running on the 2.6.9-78.18 kernel, and some of
this might be fixed in more recent kernels.  I'm not able to move to
newer kernels because of an NFS bug I found, bug #479728.  Well, I
could, but I would have to reboot after every two runs, which would
slow down the testing process to an unacceptable level.  It would
take twice as long: likely four days to recreate the more common
JD problem and a month to recreate the leaf block problem.  Of course,
if the newer kernel fixes the problem, it would never recreate.

Comment 58 Robert Peterson 2009-01-30 21:14:39 UTC
Based on info I got from today's instrumentation, the problem may
boil down to the fact that function gfs_sync_page() only operates on
regular files, and not directories.  Its job is to sync all "data"
pages associated with an inode.  It currently ignores calls for
inodes that aren't "regular" files because only regular files would
have data blocks, right?  Wrong.  Directory inodes also have data
blocks: the directory hash tables that are being corrupted.
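To make the suspicion concrete (paraphrased, not the actual source;
the condition shown here is only the hypothesis being tested):

   /* Suspected early return in gfs_sync_page(): only regular files
    * get their pages synced... */
   if (!S_ISREG(inode->i_mode))
           return;   /* ...so directory hash-table pages are skipped */

   /* ...whereas a directory's hash table lives in data pages too,
    * so the check would have to allow directories as well: */
   if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
           return;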

Of course, there's also the scsi device driver problem, but that's
a side issue.  I'm going to do some experimenting with fixing
only that function (i.e. stripping out all the other potential fixes
I've crafted) and see how it fares.

Comment 59 Robert Peterson 2009-02-05 05:56:30 UTC
Well, the previous theory didn't hold true.

Tracing the code back carefully, I now believe that the original
code in ail_empty_gl was correct to assert-exit if "buffer_busy(bh)".
I think that the fact that the buffer is still dirty after the
sync is somehow causing the JD problem.  Therefore I've reinstated
the original buffer_busy check.  The code path is as follows:

Function inode_go_sync is called, and in theory, this should be the
meta && data case because the directory has both metadata (the
directory hash table) and data (the leaf blocks).  The function
does this essentially:

   gfs_sync_page(gl, flags | DIO_START);
   gfs_log_flush_glock(gl);
   gfs_sync_buf(gl, flags | DIO_START | DIO_WAIT | DIO_CHECK);
   gfs_sync_page(gl, flags | DIO_WAIT);

The call to gfs_sync_buf does this:

   filemap_fdatawrite(mapping) - known to have been called
   filemap_fdatawait(mapping) - known to have been called
   ail_empty_gl - known to have been called

The filemap_fdatawrite page range and inode i_size have both been
proven correct (0xffffffffffffffff).

Yet the ail_empty_gl discovers that some buffer_head is still
dirty even after the filemap_fdatawrite/wait was proven to have
happened.

I've also proven that the buffer_head in question was on the list
of gl_ail_bufs when the filemap_fdatawrite/wait were called.
The buffer_head is removed from the page buffers shortly thereafter
and the next read assigns a new buffer to the page, but after the
next read, the data comes back invalid, triggering the jd problem
(that is, if the assert isn't coded to stop it first).

So now I need to figure out why vfs isn't syncing out all the
buffers to disk when we've asked it to.
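
For reference, the shape of the check that fires is roughly this (a
paraphrase of the dio.c code; the macro, list, and field names are
approximations):

   struct gfs_bufdata *bd;

   /* after filemap_fdatawrite()/filemap_fdatawait() have completed */
   list_for_each_entry(bd, &gl->gl_ail_bufs, bd_ail_gl_list) {
           struct buffer_head *bh = bd->bd_bh;

           /* nothing on this list should still be pinned or busy,
            * yet one buffer keeps turning up dirty here */
           gfs_assert(sdp, !bd->bd_pinned && !buffer_busy(bh));
   }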

Last note: It seems like the problem occurs more readily when I'm
running the scenario late at night.  Perhaps it has something to
do with the lighter network traffic and all the glocks flying
around.

Comment 60 Robert Peterson 2009-02-05 20:35:39 UTC
I've been staying on a much older 2.6.9-78.18 kernel due to
bug #479728, so my fear had been that the problem was already
known and fixed in a newer kernel's vfs.  Today I circumvented the
479728 bug, which allowed me to move up to a shiny new 2.6.9-80
kernel.  The problem still exists there, so the bug is definitely
still out there.  So I guess I start digging into the vfs to find
out why my sync left a buffer dirty.

Comment 61 Robert Peterson 2009-02-05 23:27:24 UTC
One more note: I seem to be able to recreate the problem more
consistently if I wait until the last "find" ends before I start
another run with another find.

Comment 62 Robert Peterson 2009-02-16 23:22:28 UTC
I've finally been able to catch red-handed the process responsible
for setting the dirty bit for the buffer in question.  The call trace
looks like this:

Call Trace:<ffffffff80193969>{mark_buffer_dirty+39} <ffffffffa02b5efe>{:gfs:gfs_dunpin+237 
       <ffffffffa02d2cc9>{:gfs:buf_add_to_ail+22} <ffffffffa02d14b8>{:gfs:log_flush_intern 
       <ffffffffa02dc308>{:gfs:gfs_fsync+95} <ffffffffa03320e8>{:nfsd:nfsd_sync_dir+53} 
       <ffffffffa0333fec>{:nfsd:nfsd_create_v3+758} <ffffffffa033a54f>{:nfsd:nfsd3_proc_cr 
       <ffffffffa032eb35>{:nfsd:nfsd_dispatch+220} <ffffffffa01affb5>{:sunrpc:svc_process+ 
       <ffffffffa032e409>{:nfsd:nfsd+0} <ffffffffa032e734>{:nfsd:nfsd+811} 
       <ffffffff80111693>{child_rip+8} <ffffffffa032e409>{:nfsd:nfsd+0} 
       <ffffffffa032e409>{:nfsd:nfsd+0} <ffffffff8011168b>{child_rip+0} 

So there are two processes.  One process is doing this:

gfs_glock_xmote_th - if gl state EXCLUSIVE
   inode_go_sync (glops.c) - takes no locks
      gfs_log_flush_glock - does nothing but calls:
            log_flush_internal - takes log lock the whole time
               add_trans_to_ail
                  buf_add_to_ail
                     gfs_dunpin
      gfs_sync_buf
         filemap_fdatawrite
         ail_empty_gl

When this process gets to the ail_empty_gl, one of its buffers has
been dirtied by another process doing this:

nfs
   gfs_fsync - file sync - holds the glock in SHARED (ops_file.c)
      gfs_log_flush_glock - does nothing but calls:
         log_flush_internal - takes log lock the whole time
            add_trans_to_ail
               buf_add_to_ail
                  gfs_dunpin

The second process, which does the gfs_fsync, takes a shared glock
for the inode.  The first process is apparently demoting an exclusive
glock, trying to sync out the data.  Assuming there isn't cross-glock
contamination or a buffer mixup (a big assumption, given the elusive
history of this bug), I think it's now a matter of deciding how to fix it.

Comment 63 Robert Peterson 2009-02-18 23:19:00 UTC
This is my latest theory as to what is going wrong:

1. One process (pid-a) holds an inode glock and is unlocking it.
   Meanwhile, another process (pid-b) is trying to fsync the same
   file from nfs.
2. Pid-b calls gfs_fsync which requests a "shared" glock.  This
   holder gets put onto the waiters3 queue, waiting for a glock
   promotion to occur.
3. Pid-a calls run_queue to process its demote request.  It checks
   to see if there are any waiters1, and there aren't any.
   There aren't any waiters2 requests either, and the HIF_DEMOTE
   bit is on, so it calls rq_demote to demote the glock.  It doesn't
   care at this point about pid-b's promote request on gl_waiters3.
   That's likely to keep one node from monopolizing
   the glock.
4. Pid-a calls go_drop_th because the pid-a's holder is unlocked
   and the glock state is not exclusive.
5. gfs_glock_drop_th calls gfs_log_flush_glock, which takes the log
   lock, then adds the transactions buffers to the active items list
   (ail).  Then the log lock is released.  Then it calls gfs_sync_buf
   which tells the vfs to sync the data and wait for it to be done.
6. Pid-b then calls gfs_log_flush_glock, which calls log_flush_internal.
   That walks the transaction's glock buffer list again and, once more,
   tries to add the buffers to the active items list (ail).  Function
   add_trans_to_ail calls buf_add_to_ail, which calls gfs_dunpin,
   which marks the buffer dirty a second time.  But this marking
   happens after the vfs has synced the buffer to disk.
7. Pid-a's gfs_sync_buf calls ail_empty_gl, but much to its
   astonishment, it finds a buffer on the ail list that's still dirty,
   knowing full well that it had just requested all of them to be
   synced to disk.  So it asserts.

My latest attempt to fix the problem is simply this:

diff --git a/gfs-kernel/src/gfs/glock.c b/gfs-kernel/src/gfs/glock.c
index 1014ea2..670810f 100644
--- a/gfs-kernel/src/gfs/glock.c
+++ b/gfs-kernel/src/gfs/glock.c
@@ -723,8 +723,9 @@ rq_demote(struct gfs_holder *gh)
                set_bit(GLF_LOCK, &gl->gl_flags);
                spin_unlock(&gl->gl_spin);
 
-               if (gh->gh_state == LM_ST_UNLOCKED ||
-                   gl->gl_state != LM_ST_EXCLUSIVE)
+               if ((gh->gh_state == LM_ST_UNLOCKED ||
+                   gl->gl_state != LM_ST_EXCLUSIVE) &&
+                   list_empty(&gl->gl_waiters3))
                        /* Unlock */
                        glops->go_drop_th(gl);
                else

I'm testing the patch now and so far, no failures, but it's still
very early in the process.

Comment 66 Robert Peterson 2009-02-27 19:06:53 UTC
Hi Ray,

The patch mentioned in comment #63 did not work properly; it caused
other problems.  However, I believe I am still on the right track.
Unfortunately, an nfs kernel issue caused me to lend my cluster to
an NFS developer for a few days, so I lost some time there.  I tried
another fix this morning, and although it seemed to work, it had an
unacceptable performance impact.  I'm trying another fix as I type
this note.  Because the problem is so hard to hit, I won't be able to
say for sure whether it works properly for a few days.

Comment 67 Robert Peterson 2009-02-27 22:53:50 UTC
Created attachment 333550 [details]
27 Feb 2009 patch

This is my latest patch.  It seems to work perfectly and has no
ill performance impacts.  In fact, it seems to make the recreation
scenario run somewhat faster, often running fourteen minutes rather
than fifteen.  Given the history and complexity of this problem, I
obviously need to do a lot more testing before I'll call it good.
I'll also want it thoroughly reviewed by my peers.  Frankly, I'm
worried about making sure we maintain cluster coherency with the
patch.

Comment 69 Robert Peterson 2009-03-02 20:24:11 UTC
The patch passed the cluster coherency test:

Testcase                                 Result    
--------                                 ------    
buffered-write-read                      PASS      
buffered-write-readv                     PASS      
buffered-write-pread                     PASS      
buffered-write-mmread                    PASS      
buffered-writev-read                     PASS      
buffered-writev-readv                    PASS      
buffered-writev-pread                    PASS      
buffered-writev-mmread                   PASS      
buffered-pwrite-read                     PASS      
buffered-pwrite-readv                    PASS      
buffered-pwrite-pread                    PASS      
buffered-pwrite-mmread                   PASS      
buffered-mmwrite-read                    PASS      
buffered-mmwrite-readv                   PASS      
buffered-mmwrite-pread                   PASS      
buffered-mmwrite-mmread                  PASS      
sync-write-read                          PASS      
sync-write-readv                         PASS      
sync-write-pread                         PASS      
sync-write-mmread                        PASS      
sync-writev-read                         PASS      
sync-writev-readv                        PASS      
sync-writev-pread                        PASS      
sync-writev-mmread                       PASS      
sync-pwrite-read                         PASS      
sync-pwrite-readv                        PASS      
sync-pwrite-pread                        PASS      
sync-pwrite-mmread                       PASS      
sync-mmwrite-read                        PASS      
sync-mmwrite-readv                       PASS      
sync-mmwrite-pread                       PASS      
sync-mmwrite-mmread                      PASS      
direct-write-read                        PASS      
direct-write-readv                       PASS      
direct-write-pread                       PASS      
direct-writev-read                       PASS      
direct-writev-readv                      PASS      
direct-writev-pread                      PASS      
direct-pwrite-read                       PASS      
direct-pwrite-readv                      PASS      
direct-pwrite-pread                      PASS      
direct-mmwrite-read                      PASS      
direct-mmwrite-readv                     PASS      
direct-mmwrite-pread                     PASS      
=================================================
Total Tests Run: 44
Total PASS:      44
Total FAIL:      0
Total TIMEOUT:   0
Total KILLED:    0
Total STOPPED:   0

Comment 71 Robert Peterson 2009-03-04 03:34:41 UTC
Update on the latest: Once again, through rigorous testing, I managed
to break the patch.  I've made several revisions based on similar
code paths that also expose the problem.  Now I've got a new version
that incorporates some of the fixes posted in previous patches.

I had been hoping this was a "simple fix", but it seems there are
several things going on at the same time.  I was hoping I could set
aside the apparent problems I found earlier and fixed with previous
patches in favour of a patch that pre-empts the other problems
("if it's not broken, don't fix it").  Unfortunately, it seems as if
at least some of the other fixes from previous patches are needed
anyway.

I have not broken my latest patch, after many hours of testing.
Unfortunately, in order to catch any possible "misconduct" by the patch,
I've kept a minimal amount of instrumentation in there for the time
being.  So the new patch is not yet ready for general consumption.
Tomorrow morning, I'll do some more testing and if I can't break it,
I'll strip out all the instrumentation and test it again.  Then I need
to redo the cluster coherency test.  I'll likely attach it here before
starting the coherency test, since that test takes so long.

Comment 73 Robert Peterson 2009-03-05 22:14:32 UTC
Created attachment 334226 [details]
05 March 2009 patch

This is my latest patch.  Testing is in progress.

Comment 74 Robert Peterson 2009-03-09 01:35:34 UTC
Sunday night update: I've got a patch that works, but it severely
impacts performance.  The recreation scenario goes from 15 minutes
to 25 minutes; clearly unacceptable.

I've tried several ways to redo the logic so that it doesn't impact
performance, but either I lose cluster coherency or the assert
problem comes back.  And of course, it takes hours of testing to
determine that any given attempt is bad.  I need to find a clever
way to rearrange the logic to avoid the race without impacting
performance.

For GFS gurus: The problem is that several instances of gfs_sync_buf
are racing with one another.  To simplify the logic (and make it
faster), I've incorporated gfs_sync_buf into log_flush_internal.
That function is protected against races, for the most part, by the
gfs_log_lock.  The gfs_sync_buf, however, is not.  The race problem
is caused by another process entering the function, calling
add_trans_to_ail, which causes some buffer_heads on the ail_gl_list
to be set dirty again.  This often happens because the wait for disk
IO takes so long, compared to the cpu.  The add_trans_to_ail is
synchronized under the gfs_log_lock, but since the gfs_sync_buf
portion calls disk io functions (filemap_fdatawrite and
filemap_fdatawait) I clearly cannot hold the log lock during that
time or performance will be killed.

Previous patch versions have moved the gl_ail_bufs list to a temporary
list using list_splice_init, and ail_empty_gl then operates on the
temporary list only.  This seems to cause cluster coherency issues.
The act of making sure this is serialized by using, for instance, a
GLF_LFLUSH bit in the glock, also seems to kill performance.
They end up waiting for long periods of time for the log_lock.
If I avoid the log_lock serialization, the assert problem comes back.
So it's a three-way tradeoff: One way kills performance, another way
kills cluster coherency, and the third way has the assert.  So I need
to find an elegant way to do this.  I guess I'll sleep on it.
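
For reference, the "temporary list" variant mentioned above looks
roughly like this (a sketch with approximate lock and field names, not
the actual patch; "mapping" stands for the glock's inode mapping):

   LIST_HEAD(tmp);

   spin_lock(&sdp->sd_ail_lock);
   list_splice_init(&gl->gl_ail_bufs, &tmp);  /* steal the current ail */
   spin_unlock(&sdp->sd_ail_lock);

   filemap_fdatawrite(mapping);               /* disk I/O without the log lock */
   filemap_fdatawait(mapping);

   /* ail_empty_gl() then walks only "tmp", so buffers added to
    * gl_ail_bufs by a racing gfs_fsync() are left for that process.
    * As noted above, this variant appeared to hurt cluster coherency,
    * which is why it was not kept. */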

Comment 77 Robert Peterson 2009-03-12 03:26:48 UTC
Created attachment 334890 [details]
11 March 2009 patch

This patch is closer, but still has instrumentation (and problems).
It utilizes a new "buffer_handsoff" bit to flag buffer heads that
have been tagged for synchronization.  (The bit isn't included in
my patch because I've got a kernel patch as well, which will NOT be
part of the final version).  This one works for the most part, but
I was able to break it with the cluster coherency test, after hours
of testing.  Basically, it hit an "Invalid Metadata Block" error.
I suspect that's due to a problem with the patch, and I'd like to get
Steve Whitehouse's idea of what could be going wrong.  The patch isn't
as big as it would seem; the size is mostly due to my moving three
functions from dio.c to log.c, but there were some changes to them.

Comment 78 Steve Whitehouse 2009-03-12 10:38:38 UTC
I'm not sure that I follow why you've added a rwsem as it appears that you only use it in write mode.

Since the mapping in gfs_inval_buf() is the meta mapping, the atomic_inc/dec of i_writecount is almost certainly nonsense, since that is only checked at the higher layers of the code. I suspect it's a hangover from the use of some earlier function - I've just removed it in gfs2 and it can probably go from here too.

What makes me worried though is that you seem to have added protection from journal flushes in gfs_inval_buf(). The issue is that if we get to this point and there are still dirty buffers or pages then something has gone wrong. All the data should long since have been written back to disk by the ->go_sync function.

I don't follow why the new handsoff bit is required... the dirty bit is there to indicate which buffers need to be synced back to disk.

Comment 79 Robert Peterson 2009-03-12 14:18:41 UTC
Hi Steve and thanks for your comments.

You're right; a normal semaphore is sufficient.  I fixed it.

The i_writecount code was just moved; I hadn't changed it.  But with
your recommendation, I've now removed the vestigial code as you did
with gfs2.

The protection in gfs_inval_buf was mostly because I designed
sd_log_flush_lock to protect against all callers of ail_empty_gl.
It may not be necessary in this case, but I'll have to investigate
that.

The new "handsoff" bit was designed to protect the buffer_heads
that are set up to be synced by log_flush_glock.  This is the timing
window I described in comment #74.  Basically, two processes are
operating on the same glock, often during glock handoffs.  One
process ("A") calls log_flush_glock, which calls log_flush_internal
followed by gfs_sync_buf.  When gfs_sync_buf is called, there is
really no race protection, and it does a vfs sync to disk, at which
time another process ("B") can get control, and call functions like
gfs_fsync, which ends up slipping more buffer_heads onto the ail:
B's fsync starts another gfs_log_flush_glock, then B's dunpin marks
the buffer_head dirty again before ail_empty_gl checks the failing
assert (i.e. asserting the buffer_head is NOT dirty).

The handsoff bit allows ail_empty_gl to empty (and sanity check) the
ail list of ONLY the buffer_heads that were found while the log_lock
was held.  Other buffer_heads that were added to the ail by subsequent
gfs_fsync calls remain on the ail list, to be synced and removed by
that process.  Even with the current patch's protection in place,
there are buffer_heads that get added to the ail that don't have the
handsoff bit set, while gfs_sync_buf is waiting for the vfs to sync
to disk.
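
In code terms, the idea is roughly the following (the flag helpers are
hypothetical names standing in for the kernel-patch bit mentioned in
comment #77; this is a sketch, not the patch itself):

   struct gfs_bufdata *bd, *next;

   /* with the log lock held: tag the buffers this flush owns */
   list_for_each_entry(bd, &gl->gl_ail_bufs, bd_ail_gl_list)
           set_buffer_handsoff(bd->bd_bh);

   /* ...log lock released, vfs sync runs... */

   /* ail_empty_gl() removes and sanity-checks only tagged buffers;
    * anything a racing gfs_fsync() added meanwhile is skipped and
    * left for the process that added it */
   list_for_each_entry_safe(bd, next, &gl->gl_ail_bufs, bd_ail_gl_list) {
           if (!buffer_handsoff(bd->bd_bh))
                   continue;
           clear_buffer_handsoff(bd->bd_bh);
           /* ...remove bd from the ail and assert !pinned && !busy... */
   }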

The handsoff bit started out as instrumentation so I could have the
kernel tell me what process (and calling sequence) was dirtying the
buffer_head while it was being synced to disk.  I might be able to
simplify the logic by just using the dirty bit for that purpose,
especially now that I've determined that the sd_log_flush_lock
semaphore does not seem to impact performance, and since changes to
the ail lists are protected by the sd_ail_lock spin_lock.
Again, that will take serious investigation.

I've found that this code is extremely complex and sensitive and the
slightest code change can have a major impact, either on performance
or cluster coherency.  The big problem is that some of the code
traverses the same buffers by the "transaction ail list" while other
code traverses the same buffers by the "glock ail list", and still
others that move buffer_heads around to different transactions and
such.  It's all very convoluted.  I wish it were as simple as gfs2,
which does things more sanely.

Comment 80 Steve Whitehouse 2009-03-12 16:14:41 UTC
Hmmm. I might have found Nate's grow issue hiding in the code:

        error = gfs_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE,
                                 LM_FLAG_PRIORITY | GL_SYNC | GL_NOCANCEL_OTHER,
                                 &i_gh);

Looks like it's setting GL_SYNC. The other place GL_SYNC gets set is when we update the attributes for an inode - it is not at all clear why. In neither case does it make any sense to force a sync so far as I can see. I can only guess that there was some other bug found and this was supposed to work around it in some way or other...

Comment 81 Robert Peterson 2009-03-12 22:43:20 UTC
With my latest and greatest patch (slightly revised from the previously
posted patch, as suggested by Steve), I can run the cluster coherency
test and a number of normally-failing scenario iterations without
failure.  That's the good news.  The bad news is that I now seem to
be occasionally hitting bug #457557, which is a gfs2 bug.  The gfs1
and gfs2 code are similar enough that I seem to be triggering it on
gfs1 now somehow.  I need to add more instrumentation into the code
to determine what the problem is.

Also, I've removed the protection in gfs_inval_buf as Steve suggested
in comment #78 and it hasn't seemed to break anything.  I've got new
instrumentation to hopefully tell me whether or not it should be safe
to remove the "handsoff" bit in the buffer_head.  So far it has not
triggered that instrumentation after several hours of testing.

Comment 83 Robert Peterson 2009-03-13 22:06:12 UTC
Based partly on instrumentation in the code and partly on detective
work, it seems to me that this is happening with my latest patch:

NFS calls into function do_write_direct to write to a file, which
determines that an allocation is not required.  For some strange
reason, it takes a SHARED lock on the transaction glock, aka
sd_trans_gl.  After the write is complete, another write on a
different node wants to do the same thing.  The sd_trans_gl gets
demoted and the callback sets the state to LM_ST_UNLOCKED.
Shortly thereafter, trans_go_xmote_th gets called on the sd_trans_gl
and because the state is LM_ST_UNLOCKED, it foregoes the call to
gfs_log_shutdown.  Later, when it goes to promote the glock, function
trans_go_xmote_bh sees that the gl_state is now not LM_ST_UNLOCKED
anymore but it detects that the gfs_log_shutdown was never done, and
takes this error:

		if (!(head.lh_flags & GFS_LOG_HEAD_UNMOUNT))
			gfs_consist(sdp);

This is partly speculation on my part, but it seems to fit.  The big
mystery to me is why do_write_direct needs to hold the sd_trans_gl
in the first place.

Comment 84 Robert Peterson 2009-03-13 22:15:03 UTC
Scratch that.  If the gl_state was not LM_ST_UNLOCKED at the top
half (trans_go_xmote_th) it should still have been LM_ST_UNLOCKED
at the bottom half (trans_go_xmote_bh).  So at the top half, it
should have done the gfs_log_shutdown, and it should have found it
at the bottom half.  Which can only mean that another process added
a log header between the trans_go_xmote_th and the gfs_find_jhead.
So perhaps the problem here is that the log lock is not adequate
to protect that from happening.

Comment 85 Steve Whitehouse 2009-03-16 10:13:02 UTC
I don't follow comment #83 regarding the demotion of the trans glock upon receiving a callback. Since the trans glock is always grabbed in shared mode, except when freezing the fs (which we know is not happening here) there should be no callbacks sent as the other nodes will always be sending compatible lock requests.

It is however quite possible that one of the daemons is demoting the trans gl periodically if it appears to be unused for some length of time, or if something locally sets GL_NOCACHE before releasing the lock for whatever reason.

If a log header is being added while the trans gl is not held, then the #1 suspect has to be gfs_logd which has no idea whether the trans glock is being held or not.

Comment 86 Robert Peterson 2009-03-17 14:30:10 UTC
Created attachment 335529 [details]
17 March 2009 patch

This patch takes a slightly new approach that makes more sense.
There were several places in the code that tried to remove buffers
from the two active items lists (transaction and glock ail) but each
one was using a different set of criteria, none of which were
consistent.  This patch introduces a new function,
gfs_bd_ail_tryremove, that checks the condition of the buffer using
uniform criteria before removing it from the ail.  All the places
that were doing this haphazardly now call the new function.
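
A rough sketch of the helper's shape (field names and the exact
criteria here are approximations; the real function is in the attached
patch):

   /* Remove bd from the ail lists only if the buffer is truly done
    * with: not pinned and not busy.  Every caller now makes the same
    * decision through this one function. */
   static int gfs_bd_ail_tryremove(struct gfs_sbd *sdp, struct gfs_bufdata *bd)
   {
           struct buffer_head *bh = bd->bd_bh;

           if (bd->bd_pinned || buffer_busy(bh))
                   return 0;                 /* leave it on the ail */

           list_del_init(&bd->bd_ail_tr_list);  /* transaction ail */
           list_del_init(&bd->bd_ail_gl_list);  /* glock ail */
           /* ...drop whatever reference the ail held on bh... */
           return 1;
   }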

So really, not counting the scsi device driver bug, there were five
main gfs bugs found with this one scenario:

1. There was a timing window whereby a process could mark a buffer
   dirty while it was being synced to disk.  This was fixed by
   introducing a new semaphore, sd_log_flush_lock, which keeps
   that from happening.
2. Buffers were being taken off the ail list at different times
   using different criteria.  That was fixed by the new function
   as mentioned above.
3. Some buffers were not being added to the transaction, especially
   in cases where the files were journaled, which happens mostly
   with directory hash table data.  That was fixed by the
   introduction of the necessary calls to gfs_trans_add_bh.
4. The transaction glock was prematurely being released when the
   glocks hit a capacity watermark.  That's why it often took so
   long to recreate some of these problems.  To prevent this, new
   code was added to function clear_glock so that it would only
   release the transaction glock at unmount time.  I'm using the
   number of glockd daemons to determine whether the call was made
   during normal operations or at unmount time.  So in order to
   accommodate that change, I had to fix a bit of code where the
   number of glockd daemons was going negative.
5. When finding its place in the journal, function gfs_find_jhead
   was not holding the journal log lock.  That caused another
   nasty timing window where the journal was being changed while
   its proper starting location was being looked up.

This patch passes the cluster coherency test and I've run it
through the failing scenario several times with no ill effects.

Comment 87 J.H.M. Dassen (Ray) 2009-03-17 16:04:47 UTC
Current status summary:

Engineering has developed a new patch set for this issue, which we have
determined is due to several underlying bugs.

There were several places in the code that tried to remove buffers
from the two active items lists (transaction and glock ail), but there were
consistency issues in their criteria.
The patch introduces a new function, gfs_bd_ail_tryremove, that checks the
condition of the buffer using uniform criteria before removing it from the
ail, and calls this function from all the places identified previously.

Apart from the SCSI device driver bug, there were five main GFS bugs found
with this one scenario:

1. There was a timing window whereby a process could mark a buffer
   dirty while it was being synced to disk.
2. Buffers were being taken off the ail list at different times
   using different criteria.
3. Some buffers were not being added to the transaction, especially
   in cases where the files were journaled, which happens mostly
   with directory hash table data.
4. The transaction glock was prematurely being released when the
   glocks hit a capacity watermark.
   This is why it typically took so long to recreate some of these problems.
5. When finding its place in the journal, function gfs_find_jhead
   was not holding the journal log lock. That caused another timing window
   where the journal was being changed while the proper journal location was
   being located.

The current patch passes the cluster coherency test and has been run through
the failing scenario several times with no ill effects.

Comment 88 Robert Peterson 2009-03-17 16:39:57 UTC
The patch from comment #86 was pushed to the RHEL4 and RHEL48 branches
of the cluster git tree.  It was tested, and tested, and tested on
the shell-0{1,2,3} cluster using the cluster coherency test and the
scenario known to cause the problem within minutes on a system without
the patch.  Setting status to Modified.

Comment 89 Robert Peterson 2009-03-17 16:46:54 UTC
For the sake of completeness, I'll document the exact test scenario
that uncovered so many problems:

Reboot the entire shell cluster and paste in these commands:
modprobe cman
modprobe lock_dlm
insmod /home/bob/cluster/gfs-kernel/src/gfs/gfs.ko
service cman start
service fenced start
service clvmd start

On one node, paste in:
mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -X -t shell:gfs /dev/shell-vg/lv02

On all nodes, paste:
mount -o debug -t gfs /dev/shell-vg/lv02 /mnt/gfs/
service portmap start ; service nfs start ; exportfs -va

On shell-01, paste but don't hit enter on the second line:
mount -o debug -t gfs /dev/shell-vg/lv02 /mnt/gfs/
service portmap start ; service nfs start ; exportfs -va

On shell-02, paste but don't hit enter on the second line:
mount shell-01:/mnt/gfs /mnt/nfs
time /root/gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfs/dir2

On another shell-02 session, paste but don't hit enter on the second line:
whileloop.sh which is:

[root@shell-01 ../gfs-kernel/src/gfs]# cat `which whileloop.sh`
#!/bin/bash
while /bin/true; do
   echo "Finding"
   find /mnt/gfs -type f &> /dev/null
done

On shell-03, paste but don't hit enter on the second line:

time for i in `seq 970 -1 0` ; do echo $i ; sleep 1; done ; qarsh shell-02 "killall whileloop.sh"

(You may need to substitute ssh for qarsh).

Next:
1. Hit enter on shell-01, wait one second
2. Hit enter on shell-02, wait one second
3. Hit enter on the second shell-02 (whileloop.sh), wait one second
4. Hit enter on shell-03

This triggers the problem pretty quickly on a non-patched system.

Comment 90 Robert Peterson 2009-03-17 16:50:35 UTC
Correction to comment #89: On shell-01, after the file system is
mounted, this is what should be pasted (not a second GFS mount attempt):

mount shell-02:/mnt/gfs /mnt/nfs
time /root/gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfs/dir1

Comment 97 Robert Peterson 2009-03-26 23:00:28 UTC
I hit a problem and am investigating.  I suspect the patch caused
the problem, so I'm changing the status to FAILS_QA until I can
resolve it.

Comment 99 Robert Peterson 2009-03-27 15:53:31 UTC
Created attachment 337029 [details]
Addendum patch

The problem I uncovered related to a specific piece of the patch I
had previously pushed out.  This patch reverts that piece and fixes
the new problem.

Comment 100 Robert Peterson 2009-03-27 15:54:44 UTC
The addendum patch has been pushed to the RHEL4 and RHEL47 branches
of the cluster git tree.  So I'm changing the status back to Modified.

Comment 104 Nate Straz 2009-04-15 14:43:52 UTC
Bug 495514 was discovered after this fix went in.  Moving back to ASSIGNED to resolve the new regression.

Comment 105 Nate Straz 2009-04-16 14:13:36 UTC
I also ran into problems running revolver where recovery of two out of three file systems would hang during the first iteration.

Comment 106 Robert Peterson 2009-04-17 16:00:21 UTC
An addendum patch has been pushed to the RHEL4, RHEL47 and RHEL48
branches of the cluster git tree.  Changing back to Modified.

Comment 110 errata-xmlrpc 2009-05-18 21:09:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1045.html

Comment 112 Mark Hlawatschek 2009-06-16 07:41:19 UTC
Hi,

I'm sorry to have to report that the erratum leads to a new problem with the described test scenario:

GFS: fsid=axqa01:gfs.0: fatal: invalid metadata block
GFS: fsid=axqa01:gfs.0:   bh = 1499994 (type: exp=7, found=0)
GFS: fsid=axqa01:gfs.0:   function = gfs_get_data_buffer
GFS: fsid=axqa01:gfs.0:   file = /builddir/build/BUILD/gfs-kernel-2.6.9-85/smp/src/gfs/dio.c, line = 1161
GFS: fsid=axqa01:gfs.0:   time = 1243456240
GFS: fsid=axqa01:gfs.0: about to withdraw from the cluster
GFS: fsid=axqa01:gfs.0: waiting for outstanding I/O
GFS: fsid=axqa01:gfs.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=axqa01:gfs.0: withdrawn


We repeated the test with the proposed errata RPMs. We did the following steps to produce the error:

server1# mkfs.gfs -O -j3 -J32 -p lock_dlm -b 1024 -r32 -t axqa01:lv_gfstest /dev/mapper/vg_gfstest-lv_gfstest
server{1,2}# mount -t gfs /dev/mapper/vg_gfstest-lv_gfstest /mnt/gfstest
server1# mount -t nfs server2:/mnt/gfstest /mnt/nfstest
server2# mount -t nfs server1:/mnt/gfstest /mnt/nfstest
server{1,2}# while /bin/true; do find /mnt/gfstest -type f -exec ls -l {} \; ;done
server1# ./gen-data -a 256 -b 393216 -c 76800 -n 200000 -p /mnt/nfstest/dir1

For the sake of completeness:
# modinfo gfs
filename:       /lib/modules/2.6.9-89.ELsmp/kernel/fs/gfs/gfs.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System 2.6.9-85.2.el4
depends:        lock_harness,cman
vermagic:       2.6.9-89.ELsmp SMP gcc-3.4

# uname -a
Linux realserver9 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:33:05 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qi GFS-kernel-smp
Name        : GFS-kernel-smp               Relocations: (not relocatable)
Version     : 2.6.9                             Vendor: Red Hat, Inc.
Release     : 85.2.el4                      Build Date: Tue May  5 16:21:44 2009
Install Date: Wed May 27 08:04:22 2009      Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Kernel     Source RPM: GFS-kernel-2.6.9-85.2.el4.src.rpm
Size        : 742008                           License: GPL
Signature   : DSA/SHA1, Wed May  6 02:13:32 2009, Key ID 219180cddb42a60e
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Summary     : GFS-kernel-smp - The Global File System kernel modules
Description :
GFS - The Global File System is a symmetric, shared-disk, cluster file
system.

Comment 113 Robert Peterson 2009-06-16 14:28:42 UTC
I cannot understand how that can still be happening, unless there
is a hardware problem.  Can you please verify whether there were
any messages in dmesg regarding hardware errors before the failure
noted above, especially any that look like this:

Apr 30 11:14:03 linc01a kernel: Info fld=0x0
Apr 30 11:14:03 linc01a kernel: sdi: Current: sense key: No Sense
Apr 30 11:14:03 linc01a kernel:     Additional sense: No additional sense

Comment 114 Issue Tracker 2009-06-16 14:31:07 UTC
Event posted on 2009-06-16 16:31 CEST by rdassen

Hello Jürgen, hello Mark,

As the erratum for bugzilla #455696 has already been released and the
connection between the new problem and the problem for which the erratum
was developed is at present not clearly understood, I would recommend for
Atix to open a new bugzilla report for this new problem (referencing
#455696) and to work with Red Hat Engineering on it as per the Atix - Red
Hat relationship.

Kind regards,
Ray


Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by rdassen 
 issue 187880

Comment 115 Robert Peterson 2009-06-16 15:45:17 UTC
In talking with Steve Whitehouse this morning, we decided it would
be a good idea if Atix can enable SMART on their SAN and see if they
can get any diagnostic information on possible hardware problems.

Also, I'd like to verify that they have rebooted the machines in
question since they installed that version of the gfs kernel module.

In the past, I've seen cases where customers report that a problem
is not fixed but it turns out that they haven't unloaded and reloaded
the kernel module, so although the new kernel module is on the system
they still might have the old one in memory.  I'd just like to have
them double-check.  In this case, a reboot would be best because there
are two fixes that need to be running: the gfs kernel module, and the
scsi driver fix mentioned in Mike Christie's comments above.

Comment 116 Robert Peterson 2009-06-17 16:55:38 UTC
I was able to recreate the problem as described in comment #112.
So now I have something to investigate.  I do, however, favor the
idea of treating this as a new problem and opening a new bugzilla
record.

Comment 117 Mark Hlawatschek 2009-06-19 18:26:56 UTC
Hi,

in my opinion, this bugzilla cannot be closed successfully. The test case described in comment #1 still causes a similar problem (the filesystem is unmounted and the test fails).
Nevertheless, I'd agree with the idea of creating a new bugzilla entry that blocks this one.

-Mark

Comment 118 Robert Peterson 2009-06-19 19:06:09 UTC
Created attachment 348699 [details]
Patch to try

With this patch, my system no longer recreates the problem.
I need to do further research and analysis, but can Atix try
this and see if it clears up the problem?

Comment 120 Robert Peterson 2009-08-03 15:01:36 UTC
Since the customer has not responded in a month and a half regarding
my new patch, and since they said in comment #117 that they are not
opposed to a new bugzilla to address the problem, I have opened new
bug #515267.  Since an erratum has already been issued for this bug,
and no other customers have reported problems, I'm closing this one
again as ERRATA.

Comment 123 Thomas Merz 2010-06-07 13:01:33 UTC
We were not able to reproduce the issue using the newest Red Hat-provided RPMs for RHEL4, so the problem seems to be fixed.

Comment 124 Thomas Merz 2010-06-11 14:02:59 UTC
A small addendum to my comment #123:
With the "patch to try" and the newest Red Hat-provided packages, we were not able to reproduce the issue.