Bug 243136 - GFS2 - NFS SPECsfs test runs panic and/or stalled in lookup code
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Platform: All Linux
Priority: medium   Severity: high
Assigned To: Don Zickus
QA Contact: GFS Bugs
Blocks: 204760
Reported: 2007-06-07 10:22 EDT by Wendy Cheng
Modified: 2007-11-30 17:07 EST (History)
CC List: 6 users
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 14:52:06 EST

Attachments
- Draft version of GFS2 NFS patch (4.80 KB, patch), 2007-06-19 22:45 EDT, Wendy Cheng
- combined draft patch for i_mode and no_formal_ino issue (10.03 KB, patch), 2007-06-21 18:11 EDT, Wendy Cheng
- This fixes the gfs_block_truncate_page EIO error. (370 bytes, patch), 2007-06-22 15:21 EDT, Wendy Cheng
- [patch 1/4] Fix gfs2_block_truncate_page err return (370 bytes, patch), 2007-06-27 17:25 EDT, Wendy Cheng
- [patch 2/4] obtain no_formal_ino from directory entry (3.89 KB, patch), 2007-06-27 17:26 EDT, Wendy Cheng
- [patch 3/4] remove imode from file handle (6.47 KB, patch), 2007-06-27 17:27 EDT, Wendy Cheng
- [patch 4/4] gfs2 truncate inode size inconsistency (396 bytes, patch), 2007-06-27 17:28 EDT, Wendy Cheng

Description Wendy Cheng 2007-06-07 10:22:01 EDT
Description of problem:

Did a test run with SPECsfs (NFS) on GFS2 with a 2.6.22-rc kernel (taken
from the nmw git tree on Jun. 5). It doesn't go very far (compared to what
was seen about 2 months ago). Has something changed in the lookup code?
It either panics (in nfsd lookup) or gets stale file handles on directory
lookup. The benchmark is not in the read-write stage yet - it stops in the
middle of test directory setup.

Version-Release number of selected component (if applicable):
Since GFS2 upstream code is virtually identical to RHEL5 at this
moment, I think RHEL5 has the same issue too.
Comment 2 Steve Whitehouse 2007-06-11 06:40:39 EDT
I wonder if this might have some relation to this:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=ca66d775276e70f69756b4a92727f498aba75e32

if so, it might just be chance as to whether you see it or not, so that might
be why it's not been seen before.
Comment 3 Wendy Cheng 2007-06-11 16:25:11 EDT
Didn't have a chance to look into this until this morning. The first crash
seemed easy on the surface - somehow NFS passes around a shorter file handle
(with GFS2_SMALL_FH_SIZE), which is perfectly legal. This implies we can't
always stuff "imode" into the file handle. However, all the "down-stream" code
(mostly gfs2_inode_lookup) requires imode to be known. This is alarming.

For this particular crash, we assert in gfs2_inode_lookup(): the DT_UNKNOWN
added in gfs2_decode_fh made gfs2_inode_lookup mistakenly believe it was in
the unlinked inode recovery code path. It tried to ask for an exclusive
lock while already holding a shared lock, which subsequently generated the
assertion.
Comment 4 Wendy Cheng 2007-06-12 01:06:10 EDT
Actually not as bad as I expected - if the mode is DT_UNKNOWN, we will just go
ahead and read in the inode. So the only issue here (the panic that occurred
this morning) is the deadlock. There are several ways to fix the deadlock. Will
tentatively unlock the shared lock in gfs2_get_dentry before calling
gfs2_inode_lookup.

Let's see whether we have other problems - no need to rush into a quick
solution at this moment. Stay tuned.
Comment 5 Wendy Cheng 2007-06-12 01:23:42 EDT
inode->i_op could get screwed up too ... hopefully this is the root cause
of these mysterious panic(s) and hang(s).. Will continue tomorrow. 
Comment 6 Steve Whitehouse 2007-06-12 03:51:07 EDT
It's not valid to call lookup with DT_UNKNOWN from anywhere except the unlinked
recovery code. I added the special case for the unlink code only because there
was no other solution to that problem, and there is no reason the mode
cannot be added to the NFS filehandle (I thought it already had been some time
ago). If there is not enough space, then we can just leave out some bits of the
no_formal_ino field; it will not make a great deal of difference whether this
field is 64 bits or 60 bits (only 4 bits are really needed to encode the type).
Comment 7 RHEL Product and Program Management 2007-06-12 11:54:21 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 8 Wendy Cheng 2007-06-13 00:10:15 EDT
Looks like the draft patch is working - at least the benchmark has been running
for more than 5 minutes (before this patch, it died within one minute).

I don't see why we need to stuff imode into the file handle. If the inode is
already in memory, imode is there. If the inode is not in memory, we need to do
a disk read anyway (and imode is part of the on-disk structure).

Will summarize and work on a real patch if the freshly dispatched test run
can go through test directory setup (and start doing read-write).
Comment 9 Steve Whitehouse 2007-06-13 04:54:49 EDT
There are various reasons, but the most important is related to the ordering of
I/O vs. VFS locking. Ideally we'd like to decouple the two as far as we can,
since glocks have to be ordered, or alternatively they have to be sent off as a
bunch of "try locks" in order to get the most efficient I/O patterns when
multiple threads are in use. Currently acquiring a glock implies bringing the
inode uptodate, and we don't really want to do that in ->lookup since lookup
is (effectively) single threaded due to i_mutex. So it's much better to do the
I/O outside of the ->lookup function if at all possible, leaving the way clear
for other threads to look things up in parallel.

Another reason is that it means we are using the same system for both NFS
and "normal" lookups, since the type is always available from the directory entry.


I'd like to have been able to use the same system for the unlinked files, and if
we'd adopted an ext3-like list of unlinked inodes, then we might have been able
to do that (one list per type, per rgrp), but the current system has the
advantage of fitting more easily into the existing GFS2 metadata. The unlinked
stuff (since in this case it's only recovery) is not in a performance critical
path really, so it's not such a big deal.
Comment 10 Wendy Cheng 2007-06-16 21:45:31 EDT
sorry, forgot to do assign.
Comment 11 Wendy Cheng 2007-06-19 22:42:31 EDT
Well, a historical moment :) .. Using RHEL3 as NFS client, I got the first 
GFS2 SPECsfs run completed without panic. Patch will be uploaded shortly.   

SFS Aggregate Results for 1 Client(s), Tue Jun 19 21:33:01 2007
NFS Protocol Version 3
------------------------------------------------------------------------------
NFS         Target Actual     NFS    NFS    Mean    Std Dev  Std Error   Pcnt
Op           NFS    NFS       Op     Op    Response Response of Mean,95%  of
Type         Mix    Mix     Success Error   Time     Time    Confidence  Total
             Pcnt   Pcnt     Count  Count  Msec/Op  Msec/Op  +- Msec/Op  Time
------------------------------------------------------------------------------
getattr      11%   11.2%     27131     0     0.49     4.19      0.02      7.0%
setattr       1%    1.0%      2470     0     0.99     7.87      0.11      1.3%
lookup       27%   27.4%     66343     0     0.46     4.07      0.02     16.1%
readlink      7%    7.3%     17640     0     0.22     0.14      0.01      2.1%
read         18%   17.9%     43440  1339     1.77    10.68      0.03     40.8%
write         9%    8.7%     21022   455     1.25     8.15      0.04     13.9%
create        1%    1.0%      2416     0     1.71    14.77      0.15      2.2%
remove        1%    1.0%      2455     0     0.95     8.88      0.12      1.2%
readdir       2%    2.1%      4980     0     0.46     0.47      0.02      1.2%
fsstat        1%    1.0%      2421     0     0.25     1.71      0.05      0.3%
access        7%    7.1%     17161     0     0.23     0.60      0.01      2.1%
commit        5%    4.3%     10549     0     0.20     0.10      0.01      1.1%
fsinfo        1%    1.0%      2520     0     0.20     0.13      0.01      0.3%
readdirplus   9%    9.1%     22019     0     0.88     0.61      0.01     10.3%
------------------------------------------------------------------------------

        --------------------------------------------------------
        | SPEC SFS VERSION    3.0 AGGREGATE RESULTS SUMMARY    |
        --------------------------------------------------------
NFS V3 THROUGHPUT:     811 Ops/Sec   AVG. RESPONSE TIME:     0.8 Msec/Op
TCP PROTOCOL
NFS MIXFILE: [ SFS default ]
AGGREGATE REQUESTED LOAD: 800 Ops/Sec
TOTAL NFS OPERATIONS:   242567      TEST TIME: 299 Sec
NUMBER OF SFS CLIENTS: 1
TOTAL FILE SET SIZE CREATED: 7922.1 MB
TOTAL FILE SET SIZE ACCESSED:  792.4 -  874.9 MB  (100% to 110% of Base)

------------------------------------------------------------------------
Comment 12 Wendy Cheng 2007-06-19 22:45:17 EDT
Created attachment 157438 [details]
Draft version of GFS2 NFS patch

Don't get too excited though - there are a large number of RPC errors. However,
no panic, no hang - that's quite an improvement!
Comment 13 Wendy Cheng 2007-06-19 22:50:50 EDT
Also don't take the patch too seriously. It is just a POC patch - to prove
the issue is indeed with stuffing imode into the file handle. I would like
to run this on RHEL4 and RHEL5 clients before finalizing the real solution. 
Comment 14 Wendy Cheng 2007-06-19 23:00:10 EDT
A side note: during the run, the client passed 25855 short filehandles (that
can't accommodate imode).
Comment 18 Wendy Cheng 2007-06-21 01:59:41 EDT
Recap of the issues found so far:

1. The file handle doesn't have enough room for imode - this will cause a hang
   or panic. The draft patch in comment #12 works fine. Will polish it a little
   and send it out for review tomorrow.
2. The GFS2 lookup code path doesn't grab the inode shared glock. Is this
   intentional or an oversight? The end result is that the lookup code path
   does not do a disk read to get the inode contents. This implies the info
   passed to the NFS filehandle encoding routine is only partially right -
   i.e. no_addr is correct (since it is obtained from the directory entry) but
   no_formal_ino is garbage (since GFS2 never reads in the inode and the
   no_formal_ino content is stored in the on-disk inode). I tentatively added
   a shared glock into gfs2_lookupi() and it seems to solve this issue. Need
   to test this out further though.
3. Bumped into a 3rd issue in gfs2_setattr() after the 2nd patch is applied.
   The benchmark was in the middle of cleaning up its test directories (to
   prepare for another run) and tries to do a truncate. The do_shrink()
   call fails in gfs2_block_truncate_page with the error set to EIO. Not sure
   what the cause is though. Seen this problem before?
Comment 19 Steve Whitehouse 2007-06-21 05:21:56 EDT
I really don't like the idea of having two slightly different lookup methods
when one will do. At the very least it should be possible to use the same code
as the unlinked path uses, since it has the same problem to solve. Surely all
that needs doing is to change to a shared lock for the NFS case? In that case we
could use that as an extra argument to the lookup function.

If you are going to add this code to avoid needing to know the imode, then you
might as well take that information out of the larger file handles. Both of them
should be using the same lookup method I think to avoid extra complications.

The reason for having the imode in the file handles is so that we don't have to
read the inodes in individually during the lookup function. The idea was that it
should be possible to read them in, in a batch (assuming that we need to read a
number of them) later on.

It is intentional that we don't grab the lock in gfs2_lookupi(). We only take
the directory lock and when the inode is returned it can then be locked, if
required, to bring all the fields uptodate just like every other place in the code.
Comment 20 Wendy Cheng 2007-06-21 11:09:33 EDT
From comment #19:
> It is intentional that we don't grab the lock in gfs2_lookupi().

I finally understand where GFS2's superior performance comes from ... but
please approach this performance improvement carefully. After all, a large
number of users will access GFS2 through NFS.

We all agree NFS needs something other than no_addr (the real ino) to shield
itself from wrongly accessing a file that has been deleted and whose ino has
been re-used. In ext3, it uses generation numbers. In gfs2, it uses the
no_formal_ino generated during file creation.

For newly created files (e.g. the files under the NFS connectathon test suite),
the issue will not surface since all the information stays within the
in-memory inode left over from creation time. For files existing on disk that
haven't been read in yet, and/or whose inodes get temporarily purged by the VM
due to their unused status (mostly found in the NFS SPECsfs environment),
the removal of the inode shared lock makes the lookup code bypass the disk read.
So we have an un-initialized no_formal_ino that gets encoded into the filehandle.
The NFS client would end up with ESTALE when it tries to access a legitimate and
still alive file.

I can't find a way to fix this unless we can:

1. put the inode shared glock back into the gfs lookup code (as the old code
   did); or
2. add no_formal_ino into the directory entry (so we don't need to read the
   inode in from disk) - this is an on-disk data structure change though.

Be aware that this issue differs from the i_mode issue. With the i_mode issue,
the code path comes down from gfs2_get_dentry, so we know it is from NFS.
For this second issue, nfsd simply does a generic lookup that goes to the vfs
lookup, then comes down to the gfs2 generic lookup code. There is really no way
to bypass this.

In short, NFS over GFS2 is very broken at this moment due to this change.
Comment 21 Wendy Cheng 2007-06-21 12:03:00 EDT
Discussed this with Steve .. the no_formal_ino is already stored inside the
directory entry. So will code the fix accordingly - need to find a good
place to stuff that info into the in-memory inode though. This is very doable
(I hope).
Comment 23 Wendy Cheng 2007-06-21 18:11:46 EDT
Created attachment 157580 [details]
combined draft patch for i_mode and no_formal_ino issue

Turned out to be a trivial fix (for the no_formal_ino issue) so I didn't bother
to separate the two patches. Still have lots of debugging code inside that
needs to get cleaned up and polished. In general, it seems to work fine - but
I won't know for sure until the test run completes.

The benchmark is running now.
Comment 24 Wendy Cheng 2007-06-21 18:31:38 EDT
Benchmark completed, but the 2nd run hit the gfs2_setattr problem as expected.
Hopefully the setattr bug can be dug out tonight ...
Comment 25 Wendy Cheng 2007-06-21 18:35:53 EDT
Wow... forgot to check the RPC errors - they are all *gone*. Now we're
back in good shape ... :)

SPEC SFS Benchmark Version 3.0, Creation - 11 July 2001
SFS Single Client (dhcp139.perf.redhat.com) Results, Thu Jun 21 17:28:44 2007
----------------------------------------------------------------------------

NFS V3     Target Actual   NFS    NFS    Mean     Std Dev  Std Error   Pcnt
Op           NFS    NFS     Op     Op    Response Response of Mean,95%  of
Type         Mix    Mix   Success Error   Time     Time    Confidence  Total
             Pcnt   Pcnt   Count  Count  Msec/Op  Msec/Op  +- Msec/Op  Time
----------------------------------------------------------------------------
null          0%    0.0%       0     0  0.00     0.00      0.00    0.0%
getattr      11%   11.1%   26925     0  0.49     3.68      0.02    7.0%
setattr       1%    1.0%    2404     0  1.20     8.95      0.12    1.5%
root          0%    0.0%       0     0  0.00     0.00      0.00    0.0%
lookup       27%   27.0%   65725     0  0.48     5.44      0.02    16.9%
readlink      7%    7.2%   17382     0  0.22     0.13      0.01    2.1%
read         18%   18.4%   44745     0  1.69    10.14      0.03    40.3%
wrcache       0%    0.0%       0     0  0.00     0.00      0.00    0.0%
write         9%    8.9%   21507     0  1.30     8.55      0.04    14.9%
create        1%    1.0%    2417     0  1.03     8.57      0.12    1.3%
remove        1%    1.0%    2430     0  0.75     6.39      0.10    1.0%
rename        0%    0.0%       0     0  0.00     0.00      0.00    0.0%
link          0%    0.0%       0     0  0.00     0.00      0.00    0.0%
symlink       0%    0.0%       0     0  0.00     0.00      0.00    0.0%
mkdir         0%    0.0%       0     0  0.00     0.00      0.00    0.0%
rmdir         0%    0.0%       0     0  0.00     0.00      0.00    0.0%
readdir       2%    2.0%    4915     0  0.45     0.41      0.02    1.2%
fsstat        1%    1.0%    2377     0  0.21     0.31      0.02    0.3%
access        7%    7.0%   17095     0  0.23     0.87      0.01    2.1%
commit        5%    4.4%   10707     0  0.18     0.10      0.01    1.0%
fsinfo        1%    1.0%    2469     0  0.19     0.11      0.01    0.3%
mknod         0%    0.0%       0     0  0.00     0.00      0.00    0.0%
pathconf      0%    0.0%       0     0  0.00     0.00      0.00    0.0%
readdirplus   9%    9.0%   21892     0  0.88     2.32      0.02    10.2%
----------------------------------------------------------------------------

      ------------------------------------------------------------
      | SPEC SFS VERSION    3.0 SINGLE CLIENT RESULTS SUMMARY    |
      ------------------------------------------------------------
NFS V3 THROUGHPUT:  812.67 Ops/Sec   AVG. RESPONSE TIME:    0.77 Msec/Op
TCP PROTOCOL
FAST CALLS: 0
NFS MIXFILE: [ SFS Default ]
CLIENT REQUESTED LOAD: 800 Ops/Sec
TOTAL NFS OPERATIONS: 242990      TEST TIME: 299 Sec
FILE SET SIZE CREATED: 8112208 KB
FILE SET SIZE ACCESSED: 811408 - 894145 KB  (100% to 110% of Base)

Comment 26 Wendy Cheng 2007-06-22 10:31:54 EDT
ok, we passed the gfs2_setattr (actually it is a truncate call) issue ...
Now the benchmark is doing multiple runs (before the fix, we could only
do one run - it then died when trying to clean up the test directories
from the first run). Will let it loop for a while .. hopefully there will
be no new show stopper.

The bug is really silly - GFS2 tries to read in the data page but pre-sets
the err to -EIO before calling the block layer's ll_rw_block. It never resets
the err even when the disk read succeeds.
Comment 27 Wendy Cheng 2007-06-22 15:21:16 EDT
Created attachment 157640 [details]
This fixes the gfs_block_truncate_page EIO error. 

Fixes a silly bug that would fail gfs2_setattr() with an EIO error when
truncating a file. Will submit this to cluster-devel when Steve is on-line again.
Comment 29 Wendy Cheng 2007-06-24 13:32:06 EDT
All of the above work was done using the upstream kernel (2.6.22-rc4 taken from
the nmw git tree on Jun. 5). SPECsfs is able to run in a loop without issues.
Will submit the patches to cluster-devel on Monday.

Starting to do the RHEL 5 port now.
Comment 30 Wendy Cheng 2007-06-24 16:32:08 EDT
Never ending saga .. Freshly upgraded the machine to the newest RHEL 5.1 build
(31.el5) and rebuilt it with the above patches. The machine immediately hung -
I happened to have a "tail -f /var/log/messages" screen in another window
that showed a soft lockup somewhere in the journaling write. Not sure whether
this is an old issue or a new regression though. The test machine doesn't have
a power switch. Have to wait until Monday when I get to the office to check
this out:

Jun 24 16:07:18 dhcp143 mountd[3668]: authenticated mount request from
dhcp139.perf.redhat.com:740 for /sfs1 (/sfs1)
Jun 24 16:07:51 dhcp143 kernel: crash memory driver: version 1.0
Jun 24 16:08:17 dhcp143 kernel: BUG: soft lockup detected on CPU#1!
Jun 24 16:08:17 dhcp143 kernel:
Jun 24 16:08:17 dhcp143 kernel: Call Trace:
Jun 24 16:08:17 dhcp143 kernel:  <IRQ>  [<ffffffff800b4f74>]
softlockup_tick+0xdb/0xed
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff800930f6>] update_process_times+0x42/0x68
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff800746f5>]
smp_local_timer_interrupt+0x23/0x47
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff80074db7>]
smp_apic_timer_interrupt+0x41/0x47
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8005bc8e>] apic_timer_interrupt+0x66/0x6c
Jun 24 16:08:17 dhcp143 kernel:  <EOI>  [<ffffffff80062a18>] _spin_lock+0x3/0xa
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8850a586>]
:gfs2:databuf_lo_before_commit+0x337/0x594
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff88508c98>]
:gfs2:gfs2_log_flush+0x12b/0x2da
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff885092c6>]
:gfs2:gfs2_log_reserve+0xd0/0x131
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8851925e>]
:gfs2:gfs2_trans_begin+0xf6/0x135
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8850cc45>]
:gfs2:gfs2_prepare_write+0x1d5/0x29b
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8000f96e>]
generic_file_buffered_write+0x25a/0x6d8
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8003ef74>] memcpy_toiovec+0x36/0x66
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff80030f3d>]
skb_copy_datagram_iovec+0x4f/0x237
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff8000ddb7>] current_fs_time+0x3b/0x40
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff80015d5d>]
__generic_file_aio_write_nolock+0x36c/0x3b8
Jun 24 16:08:17 dhcp143 kernel:  [<ffffffff800bdd92>]
__generic_file_write_nolock+0x8f/0xa8

(Lower part of the screen got chopped off). 
Comment 31 Wendy Cheng 2007-06-25 10:15:21 EDT
Re-dispatched the job and took a quick back trace - surprised to see heavy
lock contention in the journaling code. Never saw this issue before... Looks
serious to me.
Comment 32 Wendy Cheng 2007-06-25 10:18:58 EDT
Well, while I was typing the above comment, the system passed the previous
lock-up point. Will let it run.
Comment 33 Wendy Cheng 2007-06-25 12:04:54 EDT
hmm, not good. I ran the debug version of RHEL5 32.el5 and the benchmark loops
without hangs. This probably means that when GFS2 is stressed enough and/or run
at full speed, this journal write issue will surface.

Not sure how long it would take to dig this out at this moment.
Comment 34 Wendy Cheng 2007-06-25 12:12:38 EDT
s/32.el5/31.el5/ in comment #33
Comment 35 Wendy Cheng 2007-06-25 13:35:16 EDT
Absolutely no idea why it *was* choking. Now the problem has simply
disappeared, even without the debug option on.
Comment 36 Wendy Cheng 2007-06-25 23:52:59 EDT
Found out Ben and Bob have been working on a journal writes issue for a while.
The problem from comment #30 seems to be right in that arena, so I'll not dig
more unless it shows up again. Patches were sent to cluster-devel tonight -
waiting for Steve to get on-line to review. They can be applied directly
to the upstream kernel and the RHEL 5.1 31.el5 build as well. Summary of
changes:

[Patch 3-1] EIO error from gfs2_block_truncate_page
A code segment inside gfs2_block_truncate_page() doesn't set the return code
correctly. This causes NFSD to erroneously return EIO back to the client on
the setattr procedure call (truncate error).

[Patch 3-2] Obtaining no_formal_ino from directory entry
The GFS2 lookup code doesn't ask for the inode shared glock. This implies that
during in-memory inode creation for an existing file, GFS2 will not read the
inode contents from disk. This leaves no_formal_ino un-initialized at lookup
time. The un-initialized no_formal_ino is subsequently encoded into the file
handle. Clients will get an ESTALE error whenever they try to access these files.

[Patch 3-3] Remove i_mode passing from NFS File Handle
GFS2 passes i_mode within the NFS File Handle. Aside from the wrong assumption
that there is always room for this extra 16-bit value, the current
gfs2_get_dentry doesn't really need the i_mode to work correctly. Note that
the GFS2 NFS code does go through the same lookup code path as the direct file
access route (where the mode is obtained from name lookup), but
gfs2_get_dentry() is coded for a different purpose. It is not used at lookup
time; it is part of the file access procedure call. When the call is invoked,
if the on-disk inode is not in memory, it has to be read in. This makes i_mode
inside the File Handle an obsolete feature. The patch removes it accordingly.
Without the removal, current NFS-GFS2 would panic in the locking assert.
Comment 37 Wendy Cheng 2007-06-25 23:59:11 EDT
I'll move on to work on bugzilla 244343. 

This bugzilla will stay in assigned state until Steve finishes reviewing the
patches. 
Comment 38 Wendy Cheng 2007-06-27 17:25:09 EDT
Created attachment 158063 [details]
[patch 1/4] Fix gfs2_block_truncate_page err return
Comment 39 Wendy Cheng 2007-06-27 17:26:24 EDT
Created attachment 158064 [details]
[patch 2/4] obtain no_formal_ino from directory entry
Comment 40 Wendy Cheng 2007-06-27 17:27:26 EDT
Created attachment 158065 [details]
[patch 3/4] remove imode from file handle
Comment 41 Wendy Cheng 2007-06-27 17:28:49 EDT
Created attachment 158066 [details]
[patch 4/4] gfs2 truncate inode size inconsistency
Comment 42 Wendy Cheng 2007-06-27 17:38:24 EDT
patch 1/4 and 4/4 are in Steve's git tree now. The other two are waiting for his
review. 
Comment 43 Robert Peterson 2007-06-29 16:50:11 EDT
As far as I can tell, 4/4 is not in swhiteho's git tree yet.  I've got
a duplicate problem in gfs, known as bug #241096, and nearly identical
code was suggested as a fix.  And yes, it would make the code work.

But before I ship that in gfs, I want to understand why the inode size
gets changed if it fails quota_check or any other check.  Does vfs change
the value before calling the setattr function?  Seems bad.  It seems to
me that the inode size shouldn't be changed until after the grow/shrink
is successful and we haven't gone through the error path.  That would
make this fix unnecessary.  I need to do some research.
Comment 44 Wendy Cheng 2007-06-29 17:22:31 EDT
Remember we are talking about two inodes here - one VFS inode and one
gfs(1/2) inode. When the size changes, both structures need to be changed.
The current code seems cleaner - it does the VFS inode change first (since
it is lightweight - undoing it is a no-brainer), followed by the gfs inode
changes (which involve disk operations).

If you do gfs(1/2) first and the vfs inode later, and the vfs inode change
fails, undoing gfs(1/2) would be much messier.

I really don't see why we need to fuss about this simple fix.

Steve has been at OLS and he might be making some mistakes (in pulling
in the fix). I'll check with him Monday morning.

The following is the description in cluster-devel:

https://www.redhat.com/archives/cluster-devel/2007-June/msg00226.html 
Comment 45 Robert Peterson 2007-06-29 17:40:58 EDT
After studying the code and seeing how vmtruncate works, I concur.
I agree this is the right fix, so I'll commit the fix for gfs(1) as
well.  No "fuss" really; I'm just trying to make you aware of the gfs1
bz.
Comment 47 Steve Whitehouse 2007-06-29 18:18:57 EDT
4/4 is in the git tree as are the others.
Comment 48 Don Zickus 2007-07-10 12:19:08 EDT
in 2.6.18-33.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 51 errata-xmlrpc 2007-11-07 14:52:06 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
