Bug 251053

Summary: GFS2 panic at function = do_strip, file = fs/gfs2/bmap.c, line = 764
Product: Red Hat Enterprise Linux 5
Version: 5.0
Component: kernel
Hardware: All
OS: Linux
Severity: medium
Priority: high
Status: CLOSED ERRATA
Reporter: Wendy Cheng <nobody+wcheng>
Assignee: Don Zickus <dzickus>
QA Contact: GFS Bugs <gfs-bugs>
CC: bmarson, lwang, nstraz, rkenna, swhiteho
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Fixed In Version: RHBA-2007-0959
Last Closed: 2007-11-07 19:58:02 UTC
Attachments (all flagged "none"):
- Add missing bh transaction
- Upstream version of changes
- patch used last night
- today's corruption
- dinode.panic
- Tentative patch that seems to fix the issue
- Update patch for this issue
- the test patch used
- Updated patch
- panic after do_touch patch removed with identical stack trace
- Upstream patch for comment #54

Description Wendy Cheng 2007-08-06 18:59:57 UTC
Description of problem:

Filesystem corruption detected with performance group's NFS SPECsfs run:
GFS2: fsid=notaclu:sdt.0:   inode = 59563 420454
GFS2: fsid=notaclu:sdt.0:   function = do_strip, file = fs/gfs2/bmap.c, line = 764
GFS2: fsid=notaclu:sdt.0: about to withdraw this file system
GFS2: fsid=notaclu:sdt.0: telling LM to withdraw
GFS2: fsid=notaclu:sdt.0: withdrawn

Call Trace:
[<ffffffff88563d86>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
[<ffffffff88567a41>] :gfs2:gfs2_pin+0xca/0x10d
[<ffffffff885668c5>] :gfs2:buf_lo_add+0x23/0xf2
[<ffffffff88573ef0>] :gfs2:gfs2_statfs_change+0xc3/0xd2
[<ffffffff88571c01>] :gfs2:rgblk_free+0x139/0x158
[<ffffffff885756eb>] :gfs2:gfs2_consist_inode_i+0x43/0x48
[<ffffffff88558921>] :gfs2:do_strip+0x271/0x350
[<ffffffff88557840>] :gfs2:recursive_scan+0xf4/0x177
[<ffffffff8856121a>] :gfs2:gfs2_glmutex_lock+0xaa/0xb3
[<ffffffff8855795c>] :gfs2:trunc_dealloc+0x99/0xc1
[<ffffffff885586b0>] :gfs2:do_strip+0x0/0x350
[<ffffffff88570000>] :gfs2:gfs2_quota_init+0x2a0/0x398
[<ffffffff88558fb8>] :gfs2:gfs2_truncatei+0x5b8/0x5f6
[<ffffffff8856c7d3>] :gfs2:gfs2_setattr+0xa6/0x2ce
[<ffffffff8856c75d>] :gfs2:gfs2_setattr+0x30/0x2ce
[<ffffffff80062536>] __down_write_nested+0x12/0x92
[<ffffffff8002c63d>] notify_change+0x14b/0x2e1
[<ffffffff88514438>] :nfsd:nfsd_setattr+0x349/0x4ce
[<ffffffff88515b08>] :nfsd:nfsd_create_v3+0x34b/0x4ea
[<ffffffff8851b755>] :nfsd:nfsd3_proc_create+0x12f/0x140
[<ffffffff885101c4>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff88442518>] :sunrpc:svc_process+0x44d/0x70b
[<ffffffff800625cf>] __down_read+0x12/0x92
[<ffffffff8851054d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff885106fb>] :nfsd:nfsd+0x1ae/0x2db
[<ffffffff8005bfb1>] child_rip+0xa/0x11
[<ffffffff8851054d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8851054d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8005bfa7>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):
Found in both the 36.el5 and 37.el5 kernels with the performance fix from
bugzilla 249905.

How reproducible:
Seen on the performance group's "bigi" system with 56 filesystems running
(lock_nolock) during NFS SPECsfs runs.

Additional info:
Could be a dup of bz 248176.

Comment 3 Wendy Cheng 2007-08-06 19:25:24 UTC
Just borrowed the machine from Barry - will kick off the debugging runs around
4:30 pm today. Barry said it hardly ever happens on 36.el5 unless:

1. scand is extended to 300 seconds; or
2. profiling is being done on the gfs2 code.


Comment 4 Steve Whitehouse 2007-08-07 08:18:06 UTC
So far I have noticed that the combination of do_strip() and recursive_scan() is
O(N^2) with respect to the height of the metadata tree. I've not spotted anything
that might cause a problem yet.


Comment 5 Steve Whitehouse 2007-08-07 08:28:14 UTC
recursive_scan() should be fixed to iterate over the blocks. Currently it has 8
arguments, which are all 64 bits (on x86_64) except one (which will probably be
aligned to a 64-bit boundary), plus a 64-bit return address, making 9 words in
total. Since the recursion depth is at most 10, that gives 9 * 8 * 10 = 720
bytes of stack. Not enough to overflow it, but certainly a lot.

Making this iterative would also solve the issue in comment #4 above.
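
Purely as an illustration of this suggestion (not GFS2 code - the constants and
helpers below are all hypothetical, and buffer release is omitted for brevity),
the general shape of replacing a recursive metadata-tree walk with an iteration
that keeps its own fixed-size frame stack would be something like:

#include <stdint.h>

#define MAX_HEIGHT     10   /* maximum metadata tree height */
#define PTRS_PER_BLOCK 64   /* pointers per indirect block (illustrative) */

/* Hypothetical helpers standing in for the real buffer handling. */
extern const uint64_t *load_block(uint64_t blkno);
extern void visit_data_block(uint64_t blkno);

struct walk_frame {
	const uint64_t *ptrs;   /* pointers held by this indirect block */
	unsigned int pos;       /* next slot to examine */
};

/* Walk every block below 'root' without recursion; stack usage is one
 * fixed-size array rather than one call frame per tree level. */
static void walk_iterative(const uint64_t *root, unsigned int height)
{
	struct walk_frame stack[MAX_HEIGHT];
	int level = 0;

	if (!height)
		return;

	stack[0].ptrs = root;
	stack[0].pos = 0;

	while (level >= 0) {
		struct walk_frame *f = &stack[level];

		if (f->pos == PTRS_PER_BLOCK) {	/* block exhausted: pop */
			level--;
			continue;
		}

		uint64_t blkno = f->ptrs[f->pos++];
		if (!blkno)
			continue;

		if ((unsigned int)level + 1 == height) {
			visit_data_block(blkno);	/* bottom level */
		} else {				/* descend one level */
			level++;
			stack[level].ptrs = load_block(blkno);
			stack[level].pos = 0;
		}
	}
}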

Comment 6 Steve Whitehouse 2007-08-07 13:12:04 UTC
Actually my calculation earlier didn't take into account the local variables
either. Actual max stack usage could be as much as approx 1.2k, although we'd
need a maximum-sized file (not all blocks need to be populated, truncate would do
the trick) along with a 512-byte block size to get the highest usage. Of course if
there were disk corruption, then we could potentially just run out of stack
here.

Consider what would happen if a pointer in one of the indirect blocks were
corrupted so that it pointed back to somewhere higher up the tree, for example...



Comment 7 Wendy Cheng 2007-08-07 15:37:25 UTC
No good news yet. Updating the status here in case anyone is curious.

We were in do_strip's "for (p = top; p < bottom; p++)" loop releasing
blocks. Before the loop could complete, ip->i_di.di_blocks had reached 0.
I moved the di_blocks sanity check up to the beginning of the loop,
which makes gfs2 stop immediately if di_blocks prematurely reaches
zero. After that, we no longer see disk corruption. Before this change,
the state was so messed up that we could hardly get any useful info.

When this happened, the benchmark had just finished a sub-run and was in
the middle of restarting by creating new test files (the kernel turned
those creates into truncates of the old files, since the files had the
same names). This could be a good debugging lead.

Barry is helping to restructure the benchmark write patterns (since I
seemed to see "deadbeef" inside the glock structure - but I forgot to save
the debugging screen last night).

Comment 8 Wendy Cheng 2007-08-07 20:08:59 UTC
With millions of glocks hanging around, "cat /sys/kernel/debug/glocks" seems
to take forever (more than half an hour and still not returning). I lost
patience and went to read the glock memory instead. The i_gl associated
with the problem inode looks like the following:

ffff8103697fa728:  ffff810259dd6068 ffff8103aad804e8   h`.Y............
ffff8103697fa738:  0000000000000020 00000000000182f3    ...............
                                             ^^^^^^^
                   ** from Wendy: this is indeed the problem ino
                    
ffff8103697fa748:  ffffffff00000002 0000000100000007   ................
ffff8103697fa758:  0000517300000001 deadbeef00000000   ....sQ..........
ffff8103697fa768:  000000010027ba28 deadbeef000025dc   (.'......%......
                                             ^^^^^^^
                 *** from Wendy: is this 25dc the gl_owner_pid ?

This thread's pid is 0x25ef and it is supposed to be holding an 
exclusive glock during the truncate operations. 

So today's conclusion: it looks to me like the glock code is getting screwed up
here. This may well explain why 37.el5 triggers the issue more easily (it has
a bunch of glock changes). I'll add printk(s) tomorrow to confirm.

The glock dump tool is simply not usable with so many locks in the system.
Did I read the memory right?

Comment 9 Wendy Cheng 2007-08-07 21:30:15 UTC
Interesting!!! I went to read the code - the gl_owner_pid is actually the thread
that releases the lock (glock_dq)? This is more than counter-intuitive.

So apparently nothing is wrong here.





Comment 10 Wendy Cheng 2007-08-07 22:09:43 UTC
The whole issue comes down to this: (with height=0) the memory that hosts this
particular metadata block (the inode) has an extra "block number" between its top
and bottom:

ffff8100d044a0e8:  0000000000000000 0000000000000000   ................
ffff8100d044a0f8:  0000000000000000 0000000000000000   ................
ffff8100d044a108:  0000000000000000 0000000000000000   ................
ffff8100d044a118:  d7d0050000000000 0000000000000000   ................
ffff8100d044a128:  0000000000000000 0000000000000000   ................
ffff8100d044a138:  0000000000000000 0000000000000000   ................
ffff8100d044a148:  0000000000000000 0000000000000000   ................
ffff8100d044a158:  0000000000000000 0000000000000000   ................
 
I have no idea whatsoever how that d7d0050000000000 got written into
this piece of memory. I'll use gfs2_edit tomorrow to see whether the
corresponding on-disk block has it too.

Comment 11 Steve Whitehouse 2007-08-08 09:38:48 UTC
Regarding comment #8, the format which we use for dumping glocks is very
inefficient. We also copy things too many times during glock dumping, so I'm not
surprised that it's slow. Maybe we should eventually have a netlink-based
interface, and better still, one that can capture the state from both gfs2
and the dlm at the same time.

In comment #9, the gl_owner_pid is set if the process is holding the glmutex
lock. This is basically the lock which protects against state changes (i.e. the
GLF_LOCK bit), although for some reason that bit is directly manipulated in
various places, so even that is not 100% accurate. It is still sometimes
useful in debugging though.

In comment #10, as well as looking at the actual disk block, you could also look
at the rgrp's allocation bitmap to see what kind of block the rgrp thinks it
is. That might be easier than trying to guess from the block's contents in case
it's not obvious or the contents don't match what the rgrp thinks it is.


Comment 12 Wendy Cheng 2007-08-08 13:53:11 UTC
The di_blocks count and the actual blocks stored in the inode are always off by 2
when this issue occurs. I tentatively added a workaround patch that ignores
the di_blocks count. The benchmark runs through *without* any visible trouble.

I haven't checked the extended attribute code yet, but I can't find any other place
that could cause the di_blocks count and block allocation to get out of sync.

Comment 14 Wendy Cheng 2007-08-08 21:04:50 UTC
Barry, the issue has not surfaced at all today. Did you change any selinux
settings on bigi after you tested out kdump?

Comment 15 Wendy Cheng 2007-08-08 21:46:43 UTC
fyi .. trying to figure out whether gfs2's extended attribute code has anything
to do with this corruption. The "EA" code only runs with selinux enabled. On
bigi, selinux is enabled in the configuration but disabled via a kernel boot
param in grub. So we need to know it was enabled yesterday - I seemed to be able
to consistently hit it yesterday but not today.

Comment 16 Wendy Cheng 2007-08-08 21:48:18 UTC
s/we need to know it was enabled yesterday/we need to know whether it was
enabled yesterday/

Comment 17 Wendy Cheng 2007-08-09 01:35:52 UTC
Barry, no worries. I hit it without selinux, and none of the printk(s) added to
the EA code got executed. So this implies we can rule out the EA code.

Since I could release these extra disk blocks without issues, it implies they
were (most likely) real disk blocks, not random junk.

Comment 18 Barry Marson 2007-08-09 01:40:00 UTC
The only kit updated was the kexec-kdump* packages, due to a problem getting
kexec (and thus kdump) to work.  While doing the test I was told to boot up with
virtually no options, so I just deleted them on the command line of the grub
interface ...  This did include SELINUX being turned on, and I noted that it was
doing something with all the files in the system at boot time.  Probably the
first time in eons SELINUX was enabled.  Other than that .. nothing else has changed.

You could horse around with scand.  Say, elevate it to 300 and see if it reoccurs ...

Barry

Comment 19 Wendy Cheng 2007-08-10 00:24:23 UTC
Steve found a place where GFS2 splits a directory leaf without adding the
new bh to the journal transaction. This makes the di_blocks count less than the
actual number of blocks allocated. I have added the fix to the test kernel. So
far so good.

Hopefully we'll have good news by tomorrow morning. 
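
For illustration only (this is not the attached patch): in GFS2 of this era, a
buffer whose contents are about to change inside a transaction has to be handed
to the journal first, roughly along the lines below. modify_new_leaf() is a
hypothetical wrapper added here just to show the pattern the leaf-split path
was missing.

/* Illustration only, not the actual fix.  The new leaf's buffer must be
 * registered with the current transaction before its contents are changed,
 * so the journal sees the modification and di_blocks stays in step with
 * the blocks actually allocated. */
static void modify_new_leaf(struct gfs2_inode *dip, struct buffer_head *bh)
{
	gfs2_trans_add_bh(dip->i_gl, bh, 1);	/* 1 => metadata buffer */
	/* ... now it is safe to modify bh->b_data ... */
}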


Comment 20 Steve Whitehouse 2007-08-10 09:07:42 UTC
Comment #19 sounds very hopeful. I'm bumping the priority on this since it's a
corruption issue.

Kevin, Nate, please can we have some ACKs on this one?

Comment 21 Wendy Cheng 2007-08-10 12:07:43 UTC
It ran through last night without any issues. I will do a clean build for Barry
(so he can continue) and explain what has been observed when I get to the office
today.


Comment 22 Wendy Cheng 2007-08-10 17:12:09 UTC
Barry, your machine is installed with the modified gfs2.ko (I didn't end up
doing a new RPM build - it took too long). Please give it a try and see how it
goes.

The fix makes sense:
1. It occurs under memory pressure, where the on-disk di_blocks doesn't get
   correctly updated due to the lost transaction.
2. These "extra" disk blocks showing up in the inode (both memory and on-disk
   versions) look very legitimate and can be deallocated without issues.

However, I'm still puzzled by a few things, but I will let Barry be the judge.
In short, the machine is returned to Barry and let's see how it goes.

Comment 23 Wendy Cheng 2007-08-10 20:07:34 UTC
Discussed with Barry and we decided to run SPECsfs non-stop over the
weekend. It has been running well since yesterday. Still monitoring it.

I didn't include any patch from bz 248176 in the build. Not sure whether
we'll hit it or not.



Comment 24 Wendy Cheng 2007-08-11 20:17:14 UTC
Just checked - the benchmark still runs well.

None of the problems described in bz 248176 (which was originally thought to be
the cause of this issue) show up at all.

Comment 25 Wendy Cheng 2007-08-11 23:01:38 UTC
Created attachment 161128 [details]
Add missing bh transaction

- Add the missing bh transaction while splitting a directory leaf.
- Move the do_strip sanity check up. This avoids further disk corruption
  as soon as memory corruption is found. This change greatly simplified
  our debugging effort while troubleshooting this issue.

Comment 26 Wendy Cheng 2007-08-11 23:04:24 UTC
The patch in comment #25 is the RHEL5 patch, based on the 37.el5 kernel.

Comment 27 Wendy Cheng 2007-08-12 00:54:41 UTC
Created attachment 161129 [details]
Upstream version of changes

Untested yet.

Comment 28 Wendy Cheng 2007-08-12 01:05:42 UTC
BTW, this didn't include Steve's original patch, nor does it address the possible
stack size problem mentioned in comment #5. The patch only fixes the data
corruption issue described in the original problem statement.

I'm not sure that, during the RHEL beta cycle, we should do more than the current
patch does. However, I may do more for the upstream version if time permits.

Comment 29 Barry Marson 2007-08-13 14:50:58 UTC
I kicked off a run this morning with the 2.6.18-37.el5.spec.2 kernel.  During
the 12000 Ops measurement time, /sfs14 got corrupted.  That's /dev/sdn.

I have left the machine with the corrupt FS.  I won't be online again until this
evening.  Wendy, BIGI is yours ...

Barry

Comment 30 Wendy Cheng 2007-08-13 15:08:26 UTC
I'm very sorry .. I had been testing all weekend using gfs2.ko. Before turning
the machine back over to Barry, I rebuilt the RPMs and only did very light
testing on 37.el5.spec.2. Having just checked the source, I apparently made a
mistake in the RPM spec file such that the fix *was not* included :( ...

Rebuilding the RPMs now.

Comment 31 Steve Whitehouse 2007-08-13 15:27:10 UTC
Regarding comment #28, I agree that we don't need to fix the stuff from comment
#5 right now. We should probably make that into a Fedora bug though so that we
don't forget about it later on.


Comment 32 Wendy Cheng 2007-08-13 19:34:31 UTC
The build (37.el5.spec.2) didn't screw up. So the "missing bh" is not the
whole story of this corruption.

I did a "diff" between my old build tree and the RPM build tree - the bmap.c
in the old build tree is certainly messy. Part of it I thought was
debugging code, and it got removed on Sunday. The leftovers show that at one
time I suspected the dinode was not being updated correctly between three
functions: build_height(), gfs2_block_map(), and do_grow(). Unfortunately, I
forgot what I did.

I'm trying to restore the old changes from memory.

Comment 33 Wendy Cheng 2007-08-14 12:15:07 UTC
Created attachment 161264 [details]
patch used last night

Added my debugging patch. SPECsfs ran through 30 runs (15 each, with IOPS ranging
from 2000 to 30000 per second) without corruption.

The bad news is that none of the traps (added in the new code) were hit. So
I'm not convinced the new code actually did anything - it must be dumb luck.

Comment 34 Wendy Cheng 2007-08-14 19:39:15 UTC
Created attachment 161304 [details]
today's corruption 

Note that I've moved the (!ip->i_di.di_blocks) assertion up in do_strip's
"for" loop. It is not possible for this loop to change the on-disk block
count to "-1". However, if we check the "fsck" output, the on-disk block
count is "-1". So some other logic *was* updating this inode at the
same time this do_strip() was executing. In theory, this inode
should be protected by an exclusive glock. In addition, since this is the
single-node case, the i_mutex should also provide the necessary protection.
We have to trust vfs's i_mutex. This implies the thread executing this
inode-updating logic must be internal GFS2 logic.

	 for (p = top; p < bottom; p++) {
		if (!*p)
			continue;

		if (!ip->i_di.di_blocks)
			gfs2_consist_inode(ip);

		bn = be64_to_cpu(*p);
		if (bstart + blen == bn)
			blen++;
		else {
			if (bstart) {
				if (metadata)
					gfs2_free_meta(ip, bstart, blen);
				else
					gfs2_free_data(ip, bstart, blen);
			}

			bstart = bn;
			blen = 1;
		}

		*p = 0;
		ip->i_di.di_blocks--;
		gfs2_set_inode_blocks(&ip->i_inode);
	}

Comment 35 Wendy Cheng 2007-08-14 20:54:58 UTC
In short, I'm inclined to conclude that the "truncate" system call and
gfs2_delete_inode (kicked off by gfs2_inplace_reserve_i()) were
executed at the same time. The only way to prevent this from happening
is the exclusive inode glock. Since this is lock_nolock ... so ...

/me to check glock logic under lock_nolock ...

Comment 36 Wendy Cheng 2007-08-15 00:50:03 UTC
A side note: while checking the code, I decided to keep the machine busy and
retried the fix from bz 248176. Built a new spec.3 RPM with the patch from
comment #33, plus the bz 248176 fix downloaded from rhkernel-list. It died even
sooner, without tweaking scand. Not sure whether this is good news or bad news.




Comment 37 Wendy Cheng 2007-08-15 12:48:54 UTC
Re-read bugzilla 248176 and accidentally found bugzilla 245009 (somehow I
was not on the cc list). Looks like 245009 is a dup of this one.

So it could be worse - i.e. the inode delete had already happened and the inode
was re-used.

Comment 38 Steve Whitehouse 2007-08-15 12:58:30 UTC
So that means that the original patch might still be the right one to fix this
issue? I got the impression from earlier comments that you have proved that,
with the patch, the problem is at least greatly reduced even if not
entirely solved.


Comment 39 Wendy Cheng 2007-08-15 13:12:21 UTC
I think so - at least it will take *much* longer to reproduce now. 

Comment 40 Wendy Cheng 2007-08-15 18:08:40 UTC
I should correct comment #39. It would take a very long time to reproduce on
37.el5.spec.2 (37.el5 + patch from comment #26 + the single-scand-per-node fix).
But if we add the fixes from bz 248176, it occurs relatively quickly. So there
is no point in rushing the patch from comment #26 into RHEL (or even upstream)
since the fix from bz 248176 is in both upstream and RHEL now.

On the spec.2 kernel, when this di_blocks count mismatch happened, if we
kept going (and left di_blocks untouched), the benchmark most likely
would succeed (so this makes sense for the patch in comment #26).

On the spec.3 kernel (which adds one more fix from bz 248176), the issue
surfaces easily (again, not sure whether this is a good thing or a bad thing).
If we keep going, nfsd will panic on write where the glock pointers stored in
the ip disappear. So clear_inode must have been executed on this inode sometime
between the create and the write.

The clear_inode() can be invoked either by the VM's dispose_list (under
memory pressure) or by gfs2_inplace_reserve (via gfs2_delete_inode).
If run under gfs2_delete_inode, the global inode_lock is taken (this
is good). If under the VM's dispose_list, it is run without any locking
that can prevent this inode from being re-used.

It looks to me like we have had this inode re-use problem for a while (judging
by several prematurely-closed bugzillas). Still trying to piece these
puzzles together to get a full picture of the issues.

Comment 41 Wendy Cheng 2007-08-15 20:41:49 UTC
Other than clear_inode, do_strip (in the gfs2_setattr context) seems to be
reading stale inode data that has been updated by gfs2_delete_inode. In
theory, the conflict is protected by the exclusive glock. Just added an
unconditional flush to gfs2_delete_inode. Waiting to see the result.

Remember that 37.el5 hits this problem more easily than 36.el5, and we
have glock deadlock changes in 37.el5.

Comment 42 Wendy Cheng 2007-08-16 03:37:44 UTC
Created attachment 161428 [details]
dinode.panic

After a manual inode flush at the end of gfs2_delete_inode, the job died
at inode refresh. I don't understand why this gets so screwed up ...

I'm removing the bz 248176 fix and retrying the above experiment.

Comment 43 Wendy Cheng 2007-08-16 20:43:41 UTC
Created attachment 161691 [details]
Tentative patch that seems to fix the issue

As soon as the fix from bug 248176 was added, different issues (symptoms)
showed up. That patch was removed last night to simplify the test.

The current test kernel is 37.el5 + single gfs2_scand + the missing bh fix. It
consistently shows mismatched inode metadata that can only be altered
by either do_strip (part of gfs2_setattr) or gfs2_dir_exhash_dealloc
(part of gfs2_delete_inode).

One possibility is that the file has been unlinked and added to the
GFS2_BLKST_UNLINKED list. It then gets picked up by the log reservation
code for deletion. With i_count zero, gfs2_delete_inode is invoked
and the blocks are removed. As this deletion is part of internal
gfs2 logic, the inode still sits inside vfs's inode cache waiting
for re-use. At the same time (or shortly after the deletion), a CREATE
nfs request comes in with the TRUNCATE flag. So gfs2_get_dentry finds
the deleted inode with a matching ino. It *refreshes* the inode and passes
it down to the gfs2 truncate code. As I recall from the last round of gfs2-nfs
work, gfs2_get_dentry always does a disk read. This implies the changes
made by gfs2_delete_inode have not hit the disk yet.

As an experiment, I forced a flush inside gfs2_delete_inode by adding this
patch. The benchmark ran through 3 times (15 runs each) with the scand interval
set to 300 without issues. Previously it always corrupted before the first loop
completed (the best previous record was the 12th of the first 15 runs).

Comment 44 Wendy Cheng 2007-08-16 21:30:47 UTC
On second thought, if we change i_size to zero in gfs2_delete_inode,
we won't have to bother with the flush sequence at all. If the inode uses
the stale (in-memory) version, we have di_blocks and i_size both
zero, so the truncate code can be avoided. If the inode re-reads from disk,
i_size and everything else will be updated anyway. Good idea?
Let me give it a try.

Comment 45 Wendy Cheng 2007-08-17 04:01:50 UTC
Created attachment 161715 [details]
Update patch for this issue

This change has passed two rounds (30 runs total). Will let it loop overnight. 

If good, this will be "it".

Comment 46 Steve Whitehouse 2007-08-17 10:45:14 UTC
Now that I've got my brain around comment #43 I think you are right, but for the
wrong reasons. Here is a slightly different suggestion as to what is going on,
but very much along the same lines, and with the same result:

1. Inode has i_count == 0 and i_link == 0 so it gets deleted. At this point the
inode state machine in the VFS prevents us using this inode any more while
deletion takes place.

2. In comes an NFS fh lookup request for the inode (create & truncate) as per
comment #43. At this point though, what happens is that we try to verify it by:
  a) Locking the inode (note this reads the inode in, hence the problem you are
seeing, as we've not yet checked in the rgrp that it _is_ an inode!)
  b) Checking the rgrp
  c) Reading in the inode

So I think the problem is that we ought to grab the r/o lock on the rgrp
_before_ looking up the inode itself. What we have to be careful of here is that
the lock ordering on allocation goes: inode -> rgrp, so we can't do rgrp ->
inode here without risk of deadlock. We thus need to follow the scheme which is
used in lookup (substituting rgrp for directory) thus:

 1. Get rgrp lock & check state is "dinode"
 2. Do "lookup" to create VFS inode & get iopen lock
 3. Release rgrp lock
 4. Lock inode to refresh it & check inode type

The only "interesting" part of this is to ensure that the inode has its type set
before we clear I_NEW. We might end up having to pass the rgrp lock to
gfs2_inode_lookup() and do:

if (rgrp_gh)
        gfs2_glock_dq(rgrp_gh);

just before the if (type == DT_UNKNOWN) { clause.

Let me know whether you think this makes sense, or if I'm missing something obvious.


Comment 47 Wendy Cheng 2007-08-17 12:54:19 UTC
Let's keep things simple. Intuitively, it is not a good idea to make
changes as described in comment #45 unless we have strong proof.

The first thing to do is to comb through the path between the nfsd create
and the gfs2 create to see where i_size gets reset. The bottom of this
issue seems to be a truncate call falsely made after gfs2_delete_inode.

I'm heading to the office now. Will discuss this matter shortly.

Comment 48 Wendy Cheng 2007-08-17 12:55:53 UTC
BTW, the benchmark still runs good with patch from comment #45.

Comment 49 Wendy Cheng 2007-08-17 12:57:44 UTC
grr.... s/comment #45/comment #46/ in comment #47 :) ... 

Comment 50 Wendy Cheng 2007-08-17 15:19:51 UTC
The corruption is gone with the patch on the 37.el5 kernel. However, as soon as I
pull in newer kernels, it dies again. So this doesn't solve anything.

Trying to work through Steve's logic in comment #46 now.

Comment 51 Wendy Cheng 2007-08-17 21:49:19 UTC
Found a good way to recreate this rather quickly (about 30 to 40 minutes,
instead of half a day) by not doing try_rgrp_unlink... Looks like we were
chasing the wrong tail. Need to break for food.. will come back soon..

Comment 52 Wendy Cheng 2007-08-18 04:25:03 UTC
The problem disappears. I haven't figured out why. I'll upgrade the base
system to the equivalent of 2.6.18-41.gfs2abhi.002 and let it loop until
tomorrow afternoon.

Since there was enough disk space, I commented out try_rgrp_unlink(),
wanting to see how the system behaved without the troublesome (I thought)
unlinked inodes. To my surprise, the problem popped up very quickly. This
implies each inode I got came from a freshly assigned disk block. So these
unlinked inodes are not relevant to this issue.

To make a long story short, I changed the gfs2 truncate code so that it
becomes a no-op if the requested size and existing size are identical.
The current code unconditionally goes to do_shrink. With that change, the
problem disappears.

Still trying to figure out why. Maybe because I significantly cut down
log traffic by doing so (the benchmark creates a large number of files
and keeps re-using them) - this may imply we have a journaling issue. Or
maybe some logic inside the gfs2 truncate code doesn't handle "equal size"
well (let's hope for this)? Or maybe it's simply more dumb luck? Don't
know for sure at this moment.
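
As a rough sketch of what such an equal-size short-circuit could look like (the
helper names and signatures here - gfs2_trans_begin, gfs2_meta_inode_buffer,
gfs2_trans_add_bh, gfs2_dinode_out - are assumed from this era's GFS2 code, so
treat this as an illustration rather than the actual patch):

/* Sketch only: let an "equal size" setattr update the timestamps inside a
 * small transaction instead of walking the whole do_shrink/do_strip path. */
static int do_touch(struct gfs2_inode *ip, u64 size)
{
	struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
	struct buffer_head *dibh;
	int error;

	error = gfs2_trans_begin(sdp, RES_DINODE, 0);
	if (error)
		return error;

	error = gfs2_meta_inode_buffer(ip, &dibh);
	if (error)
		goto out;

	/* only the timestamps change; no block allocation is touched */
	ip->i_inode.i_mtime = ip->i_inode.i_ctime = CURRENT_TIME;
	gfs2_trans_add_bh(ip->i_gl, dibh, 1);
	gfs2_dinode_out(ip, dibh->b_data);
	brelse(dibh);
out:
	gfs2_trans_end(sdp);
	return error;
}

/* Dispatch in the truncate path (sketch): grow, shrink, or just touch. */
int gfs2_truncatei(struct gfs2_inode *ip, u64 size)
{
	if (size > ip->i_di.di_size)		/* growing */
		return do_grow(ip, size);
	if (size < ip->i_di.di_size)		/* shrinking */
		return do_shrink(ip, size);
	return do_touch(ip, size);		/* same size */
}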

Comment 53 Wendy Cheng 2007-08-18 05:02:49 UTC
Created attachment 161787 [details]
the test patch used

Comment 54 Wendy Cheng 2007-08-18 05:55:31 UTC
Created attachment 161788 [details]
Updated patch

Comment 55 Wendy Cheng 2007-08-18 18:06:46 UTC
Found a flag called "GFS2_DIF_TRUNC_IN_PROG" that is handled as part
of inode_go_lock. Whenever inode_go_lock is called, the bottom half of the
truncate code is invoked. I think this may explain the corruption. Will look
into this more. It should simply not be possible for the truncate code to be
interrupted and need resuming. This looks very wrong.

Also realized GFS1 has a function called "do_same" that is identical to my
"do_touch", which only updates the timestamps when the requested size and actual
size are identical. I think we need this too, to cut down all of these extra
"do nothing" transactions.

Comment 56 Wendy Cheng 2007-08-18 19:20:04 UTC
Changed that truncate resume call into a BUG and went back to 37.el5 (since it
is easier to recreate there). It is absolutely wrong: if a truncate is in
progress and another process gets in to execute the bottom-half part of the
code, gfs2 will corrupt for sure.

Let's see whether we can capture anything by doing this.


Comment 57 Wendy Cheng 2007-08-19 01:27:38 UTC
Need to correct myself about GFS2_DIF_TRUNC_IN_PROG ... 

 * First, the BUG didn't catch anything. 
 * The flag is handled in inode_go_lock - only the first glock owner on 
   the node would execute it. As part of the glock flow, it is not possible
   for two processes to run this code at the same time. As long as the glock 
   works, it shouldn't cause corruption.
 * Really thinking about this, we do need this flag for log recovery - say
   we crash between trunc_start and trunc_end. Whoever picks up this glock
   first after the crash should continue the task.

So no fault in this flag!

Comment 58 Wendy Cheng 2007-08-19 10:33:10 UTC
"2.6.18-41.gfs2abhi.002 + patch from comment #54" still loops - no sign of
corruption (yet).


Comment 59 Steve Whitehouse 2007-08-19 14:37:23 UTC
Regarding comment #57 and the previous comment on truncate... I thought that this
feature had been taken from GFS1, but I might be wrong since it's a long time
since I looked at the GFS1 code in great detail. It's certainly something that
was written by Ken though, and I had the same thoughts as you when I first saw it
- that it was the wrong thing, but in fact it's not quite what it seems.

The truncate itself is as simple as changing the i_size to a different number.
The flag in question relates not to the truncate, but to the deallocation (if
required) after the truncate. This may be deferred, allowing other nodes to get a
glock (for read purposes, for example). The flag, which is set on the inode, just
lets any future acquirer of the exclusive glock know that it ought to finish the
deallocation before it does anything else, so as you say, it should be harmless,
even if slightly odd looking.
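
For reference, a simplified sketch of how that flag check sits in the inode
glock operations (field and helper names are assumed from this era's code; this
is not a verbatim copy):

/* Sketch only: when a holder acquires the inode glock exclusively and the
 * dinode still carries GFS2_DIF_TRUNC_IN_PROG, finish the deallocation an
 * earlier, interrupted truncate left behind before doing anything else. */
static int inode_go_lock_sketch(struct gfs2_holder *gh)
{
	struct gfs2_glock *gl = gh->gh_gl;
	struct gfs2_inode *ip = gl->gl_object;

	if (!ip)
		return 0;

	if ((ip->i_di.di_flags & GFS2_DIF_TRUNC_IN_PROG) &&
	    gl->gl_state == LM_ST_EXCLUSIVE)
		return gfs2_truncatei_resume(ip);	/* finish deferred dealloc */

	return 0;
}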

Regarding comment #58 & comment #54, the patch looks ok to me, but I'd like to
understand why this does something different from the original - in other words,
to be sure that it doesn't work just because of a timing change rather than
something more fundamental.


Comment 60 Wendy Cheng 2007-08-20 01:30:10 UTC
Each benchmark run takes about 6 hours to complete. Before tomorrow's
9:30am meeting, we have a chance to do one more try. I doubt we're
going to find anything with abhi's kernel + the patch from comment #54 (which
is good news though).

Just removed do_touch and will run 2.6.18-41.gfs2abhi.002 + the missing bh
patch. This is to prove the corruption has gone away with the changes
made since last week's kernel build (could be gfs2, or even other parts
of the kernel). If corruption occurs, then we know we have more problems
(a bad thing). What "do_touch" really does is reduce the burst of
truncate calls (and unnecessary IOs) during the benchmark's file creation
time.

The smoke cluster is also looping with the same code. However, I doubt we
are going to find anything there, since the traffic is bounded by a
single NIC interface - vs. Barry's 4-ethernet-cards-per-machine setup.
Also, "kool" seems to have died for good (I don't have console access), so
it is down to 3 NFS clients, vs. Barry's 4 NFS clients.

Comment 61 Wendy Cheng 2007-08-20 02:45:58 UTC
A side note: the smoke cluster achieves at most 2557 Ops/second. Barry's bigi can
do 24425 Ops/second (10x).

Comment 62 Wendy Cheng 2007-08-20 03:57:39 UTC
Created attachment 161838 [details]
panic after do_touch patch removed with identical stack trace

After the patch from comment #54 is removed (but the missing-bh change stays),
the job dies during the 14000 Ops/sec run.

Comment 63 Wendy Cheng 2007-08-20 04:14:35 UTC
So it looks like we need the patch from comment #54, at least as a workaround.
Need to discuss this tomorrow...

Comment 64 Wendy Cheng 2007-08-20 04:39:52 UTC
Created attachment 161840 [details]
Upstream patch for comment #54

Comment 65 Steve Whitehouse 2007-08-20 11:16:42 UTC
Do you want me to apply the patch in comment #64? If so then please post it to
cluster-devel. It looks good to me, and it's worth having just for the reduction
in I/O.



Comment 66 Wendy Cheng 2007-08-20 14:12:54 UTC
Yes, we need this patch anyway (regardless of the problem) - it is more of a
performance fix than a bug fix. The RHEL patch needs to be pushed too.

Consider it a workaround while the root cause of this issue is being
worked on.



Comment 67 Wendy Cheng 2007-08-20 18:43:16 UTC
Bugzilla 2532590 is opened to continue this work. 

Comment 68 Wendy Cheng 2007-08-20 18:44:47 UTC
s/2532590/253590/ in comment #67

Comment 69 Don Zickus 2007-08-21 18:35:48 UTC
in 2.6.18-42.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 70 Wendy Cheng 2007-08-22 20:52:10 UTC
The root cause of the original issue has been identified and fixed in:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253590

The patch will be posted to rhkernel-list shortly.

Comment 71 Wendy Cheng 2007-08-22 20:55:10 UTC
To make it clear .. the changes made via this bugzilla are still valid and
required. Comment #70 (and comment #67) is meant to bring this issue to
full closure.

Comment 74 errata-xmlrpc 2007-11-07 19:58:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html