Bug 243136
| Summary: | GFS2 - NFS SPECsfs test runs panic and/or stalled in lookup code | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Wendy Cheng <nobody+wcheng> |
| Component: | kernel | Assignee: | Don Zickus <dzickus> |
| Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | high | Priority: | medium |
| Version: | 5.0 | CC: | bmarson, dshaks, jturner, kanderso, rkenna, swhiteho |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | RHBA-2007-0959 | Doc Type: | Bug Fix |
| Last Closed: | 2007-11-07 19:52:06 UTC | Bug Blocks: | 204760 |
Description (Wendy Cheng, 2007-06-07 14:22:01 UTC)
I wonder if this might have some relation to this: http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=ca66d775276e70f69756b4a92727f498aba75e32 - if so, it might just be chance as to whether you see it or not, which might be why it has not been seen before.

Didn't have a chance to look into this until this morning. The first crash seemed easy on the surface: somehow NFS passes around a shorter file handle (with GFS2_SMALL_FH_SIZE), which is perfectly legal. This implies we can't always stuff "imode" into the file handle. However, all the "down-stream" code (mostly gfs2_inode_lookup) requires imode to be known. This is alarming.

For this particular crash, we assert in gfs2_inode_lookup(), where the DT_UNKNOWN added in gfs2_decode_fh made gfs2_inode_lookup mistakenly believe it was in the unlinked-inode recovery code path. It tried to ask for an exclusive lock while already holding a shared lock, which subsequently generated the assertion.

Actually not as bad as I expected - if mode is DT_UNKNOWN, we will just go ahead and read in the inode. So the only issue here (the panic that occurred this morning) is the deadlock. There are several ways to fix the deadlock. Will tentatively unlock the shared lock in gfs2_get_dentry before calling gfs2_inode_lookup. Let's see whether we have other problems - no need to rush into a quick solution at this moment. Stay tuned.

inode->i_op could get screwed up too ... hopefully this is the root cause of these mysterious panic(s) and hang(s). Will continue tomorrow.

It's not valid to call lookup with DT_UNKNOWN from anywhere except the unlinked recovery code. I added the special case for the unlink code only because there was no other solution to that problem, and there is no reason the mode cannot be added to the NFS filehandle (I thought it already had been, some time ago).
If there is not enough space, then we can just leave out some bits of the no_formal_ino field; it will not make a great deal of difference whether this field is 64 bits or 60 bits (only 4 bits are really needed to encode the type).

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Looks like the draft patch is working - at least the benchmark has been running for more than 5 minutes (before this patch, it died within one minute). I don't see why we need to stuff imode into the file handle. If the inode is already in memory, imode is there. If the inode is not in memory, we need to do a disk read anyway (and imode is part of the on-disk structure). Will summarize and work on a real patch if the freshly dispatched test run can get through test directory setup (and start doing read-write).

There are various reasons, but the most important is related to the ordering of I/O vs. VFS locking. Ideally we'd like to decouple the two as far as we can, since glocks have to be ordered, or alternatively they have to be sent off as a bunch of "try locks" in order to get the most efficient I/O patterns when multiple threads are in use. Currently acquiring a glock implies bringing the inode uptodate, and we don't really want to do that in ->lookup, since lookup is (effectively) single threaded due to i_mutex. So it's much better to do the I/O outside of the ->lookup function if at all possible, leaving the way clear for other threads to look things up in parallel. Another reason is that it means we are using the same system for both NFS and "normal" lookups, since the type is always available from the directory entry.
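The suggestion above - dropping 4 bits of the no_formal_ino field to make room for the type - can be sketched in a few lines of C. This is a hypothetical illustration, not the real GFS2 filehandle layout: the names fh_pack, fh_type, and fh_ino are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packing: keep the low 60 bits of no_formal_ino and use
 * the top 4 bits of the same 64-bit word for the directory-entry type
 * (DT_*). Only 4 bits are needed to encode the type, as noted above. */
#define FH_TYPE_BITS 4
#define FH_INO_BITS  (64 - FH_TYPE_BITS)
#define FH_INO_MASK  ((UINT64_C(1) << FH_INO_BITS) - 1)

static uint64_t fh_pack(uint64_t no_formal_ino, unsigned type)
{
    /* Truncating no_formal_ino to 60 bits makes little difference. */
    return ((uint64_t)(type & 0xFu) << FH_INO_BITS) |
           (no_formal_ino & FH_INO_MASK);
}

static unsigned fh_type(uint64_t packed)
{
    return (unsigned)(packed >> FH_INO_BITS) & 0xFu;
}

static uint64_t fh_ino(uint64_t packed)
{
    return packed & FH_INO_MASK;
}
```

A round trip through fh_pack/fh_type/fh_ino preserves both fields, which is all the filehandle needs.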
I'd like to have been able to use the same system for the unlinked files, and if we'd adopted an ext3-like list of unlinked inodes, we might have been able to do that (one list per type, per rgrp), but the current system has the advantage of fitting more easily into the existing GFS2 metadata. The unlinked stuff (since in this case it's only recovery) is not in a performance-critical path really, so it's not such a big deal.

Sorry, forgot to do the assign.

Well, a historical moment :) .. Using RHEL3 as the NFS client, I got the first GFS2 SPECsfs run completed without panic. Patch will be uploaded shortly.

SFS Aggregate Results for 1 Client(s), Tue Jun 19 21:33:01 2007
NFS Protocol Version 3

Op Type       Target  Actual  Success  Error  Mean Resp  Std Dev   Std Err     Pcnt of
              NFS Mix NFS Mix Count    Count  Msec/Op    Msec/Op   +- Msec/Op  Total Time
------------------------------------------------------------------------------------------
getattr        11%    11.2%    27131      0      0.49      4.19      0.02        7.0%
setattr         1%     1.0%     2470      0      0.99      7.87      0.11        1.3%
lookup         27%    27.4%    66343      0      0.46      4.07      0.02       16.1%
readlink        7%     7.3%    17640      0      0.22      0.14      0.01        2.1%
read           18%    17.9%    43440   1339      1.77     10.68      0.03       40.8%
write           9%     8.7%    21022    455      1.25      8.15      0.04       13.9%
create          1%     1.0%     2416      0      1.71     14.77      0.15        2.2%
remove          1%     1.0%     2455      0      0.95      8.88      0.12        1.2%
readdir         2%     2.1%     4980      0      0.46      0.47      0.02        1.2%
fsstat          1%     1.0%     2421      0      0.25      1.71      0.05        0.3%
access          7%     7.1%    17161      0      0.23      0.60      0.01        2.1%
commit          5%     4.3%    10549      0      0.20      0.10      0.01        1.1%
fsinfo          1%     1.0%     2520      0      0.20      0.13      0.01        0.3%
readdirplus     9%     9.1%    22019      0      0.88      0.61      0.01       10.3%

--------------------------------------------------------
| SPEC SFS VERSION 3.0 AGGREGATE RESULTS SUMMARY       |
--------------------------------------------------------
NFS V3 THROUGHPUT: 811 Ops/Sec   AVG. RESPONSE TIME: 0.8 Msec/Op
TCP PROTOCOL
NFS MIXFILE: [ SFS default ]
AGGREGATE REQUESTED LOAD: 800 Ops/Sec
TOTAL NFS OPERATIONS: 242567   TEST TIME: 299 Sec
NUMBER OF SFS CLIENTS: 1
TOTAL FILE SET SIZE CREATED: 7922.1 MB
TOTAL FILE SET SIZE ACCESSED: 792.4 - 874.9 MB (100% to 110% of Base)

Created attachment 157438 [details]
Draft version of GFS2 NFS patch
Don't get too excited though - there are a large number of RPC errors. However,
no panic, no hang - that's quite an improvement!
Also don't take the patch too seriously. It is just a POC patch - to prove the issue is indeed with stuffing imode into the file handle. I would like to run this on RHEL4 and RHEL5 clients before finalizing the real solution. A side note: during the run, the client passed 25855 short filehandles (that can't accommodate imode).

Recap of the issues found so far:

1. The file handle doesn't have enough room for imode - this will cause a hang or panic. The draft patch in comment #12 works fine. Will polish it a little and send it out for review tomorrow.

2. The GFS lookup code path doesn't grab the inode shared glock. Is this intentional or an oversight? The end result is that the lookup code path does not do a disk read to get the inode contents. This implies the info passed to the NFS filehandle encoding routine is only partially right - i.e. no_addr is correct (since it is obtained from the directory entry) but no_formal_ino is garbage (since GFS2 never reads in the inode, and the no_formal_ino content is stored in the on-disk inode). I tentatively added a shared glock into gfs2_lookupi() and it seems to solve this issue. Need to test this further though.

3. Bumped into a 3rd issue in gfs2_setattr() after the 2nd patch was applied. The benchmark was in the middle of cleaning up its test directories (to prepare for another run) and tried to do a truncate. The do_shrink() call fails in gfs2_block_truncate_page with the error set to EIO. Not sure what the cause is though. Seen this problem before?

I really don't like the idea of having two slightly different lookup methods when one will do. At the very least it should be possible to use the same code as the unlinked path uses, since it has the same problem to solve. Surely all that needs doing is to change to a shared lock for the NFS case? In that case we could use that as an extra argument to the lookup function. If you are going to add this code to avoid needing to know the imode, then you might as well take that information out of the larger file handles.
Both of them should be using the same lookup method, I think, to avoid extra complications. The reason for having the imode in the file handles is so that we don't have to read the inodes in individually during the lookup function. The idea was that it should be possible to read them in, in a batch (assuming we need to read a number of them), later on.

It is intentional that we don't grab the lock in gfs2_lookupi(). We only take the directory lock, and when the inode is returned it can then be locked, if required, to bring all the fields uptodate, just like every other place in the code.

From comment #19:
> It is intentional that we don't grab the lock in gfs2_lookupi().
I finally understand where GFS2's superior performance comes from ... but
please approach this performance improvement carefully. After all, a large
number of users will access GFS2 through NFS.
We all agree NFS needs something other than no_addr (the real ino) to shield
itself from wrongly accessing a file that has been deleted and its ino re-used.
ext3 uses generation numbers; gfs2 uses the no_formal_ino generated
during file creation.
For newly created files (e.g. the files under the NFS connectathon test suite),
the issue will not surface, since all the information stays within the
in-memory inode left over from creation time. For files existing on disk
that haven't been read in yet, and/or whose inodes get temporarily purged
by the VM due to being unused (mostly seen in the NFS SPECsfs environment),
the removal of the inode shared lock makes the lookup code bypass the disk read.
So we have an uninitialized no_formal_ino that gets encoded into the filehandle.
The NFS client ends up with ESTALE when it tries to access a legitimate and
still-alive file.
I can't find a way to fix this unless we can:
1. put the inode shared glock back into the gfs lookup code (as the old code
did); or
2. add no_formal_ino into the directory entry (so we don't need to read in
the inode from disk) - this is an on-disk data structure change though.
Note that this issue differs from the i_mode issue. With the i_mode issue,
the code path comes down from gfs2_get_dentry, so we know it is from NFS.
For this second issue, nfsd simply does a generic lookup that goes to the VFS
lookup, then comes down to the gfs2 generic lookup code. There is really no way
to bypass this.
In short, NFS over GFS2 is very broken at this moment due to this change.
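The failure mode described above can be modeled in a few lines of userspace C. This is an illustrative sketch, not the real GFS2 code: struct inum and check_handle are hypothetical names standing in for GFS2's inode-number pair and the filehandle validation, showing why an uninitialized no_formal_ino makes a perfectly valid file look stale.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative stand-in for GFS2's inode-number pair. */
struct inum {
    uint64_t no_addr;        /* disk block number, re-usable after delete */
    uint64_t no_formal_ino;  /* generation-like id assigned at creation  */
};

/* no_addr alone cannot detect a deleted-and-reused inode, so the value
 * decoded from the NFS filehandle must also match the one stored in the
 * on-disk inode. If lookup never read the inode from disk, the encoded
 * value is garbage and this check wrongly reports ESTALE. */
static int check_handle(const struct inum *fh, const struct inum *ondisk)
{
    if (fh->no_addr != ondisk->no_addr)
        return -ESTALE;
    if (fh->no_formal_ino != ondisk->no_formal_ino)
        return -ESTALE;  /* ino re-used, or never initialized in memory */
    return 0;
}
```

With a garbage no_formal_ino in the handle, the second comparison fails even though no_addr still points at the right, live inode.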
Discussed this with Steve .. the no_formal_ino is already stored inside the directory entry, so I will code the fix accordingly - need to find a good place to stuff that info into the in-memory inode though. This is very doable (I hope).

Created attachment 157580 [details]
combined draft patch for i_mode and no_formal_ino issue
Turned out to be a trivial fix (for the no_formal_ino issue), so I didn't bother
to separate the two patches. There is still a lot of debugging code inside that
needs to be cleaned up and polished. In general, it seems to work fine - but
I won't know for sure until the test run completes.
The benchmark is running now.
Benchmark completed, but the 2nd run hit the gfs2_setattr problem as expected. Hopefully the setattr bug can be dug out tonight ...

Wow... forgot to check the RPC errors - they are all *gone*. Now we're back in good shape ... :)

SPEC SFS Benchmark Version 3.0, Creation - 11 July 2001
SFS Single Client (dhcp139.perf.redhat.com) Results, Thu Jun 21 17:28:44 2007

NFS V3
Op Type       Target  Actual  Success  Error  Mean Resp  Std Dev   Std Err     Pcnt of
              NFS Mix NFS Mix Count    Count  Msec/Op    Msec/Op   +- Msec/Op  Total Time
------------------------------------------------------------------------------------------
null            0%     0.0%        0      0      0.00      0.00      0.00        0.0%
getattr        11%    11.1%    26925      0      0.49      3.68      0.02        7.0%
setattr         1%     1.0%     2404      0      1.20      8.95      0.12        1.5%
root            0%     0.0%        0      0      0.00      0.00      0.00        0.0%
lookup         27%    27.0%    65725      0      0.48      5.44      0.02       16.9%
readlink        7%     7.2%    17382      0      0.22      0.13      0.01        2.1%
read           18%    18.4%    44745      0      1.69     10.14      0.03       40.3%
wrcache         0%     0.0%        0      0      0.00      0.00      0.00        0.0%
write           9%     8.9%    21507      0      1.30      8.55      0.04       14.9%
create          1%     1.0%     2417      0      1.03      8.57      0.12        1.3%
remove          1%     1.0%     2430      0      0.75      6.39      0.10        1.0%
rename          0%     0.0%        0      0      0.00      0.00      0.00        0.0%
link            0%     0.0%        0      0      0.00      0.00      0.00        0.0%
symlink         0%     0.0%        0      0      0.00      0.00      0.00        0.0%
mkdir           0%     0.0%        0      0      0.00      0.00      0.00        0.0%
rmdir           0%     0.0%        0      0      0.00      0.00      0.00        0.0%
readdir         2%     2.0%     4915      0      0.45      0.41      0.02        1.2%
fsstat          1%     1.0%     2377      0      0.21      0.31      0.02        0.3%
access          7%     7.0%    17095      0      0.23      0.87      0.01        2.1%
commit          5%     4.4%    10707      0      0.18      0.10      0.01        1.0%
fsinfo          1%     1.0%     2469      0      0.19      0.11      0.01        0.3%
mknod           0%     0.0%        0      0      0.00      0.00      0.00        0.0%
pathconf        0%     0.0%        0      0      0.00      0.00      0.00        0.0%
readdirplus     9%     9.0%    21892      0      0.88      2.32      0.02       10.2%

------------------------------------------------------------
| SPEC SFS VERSION 3.0 SINGLE CLIENT RESULTS SUMMARY       |
------------------------------------------------------------
NFS V3 THROUGHPUT: 812.67 Ops/Sec   AVG. RESPONSE TIME: 0.77 Msec/Op
TCP PROTOCOL
FAST CALLS: 0
NFS MIXFILE: [ SFS Default ]
CLIENT REQUESTED LOAD: 800 Ops/Sec
TOTAL NFS OPERATIONS: 242990   TEST TIME: 299 Sec
FILE SET SIZE CREATED: 8112208 KB
FILE SET SIZE ACCESSED: 811408 - 894145 KB (100% to 110% of Base)

OK, we passed the gfs2_setattr (actually a truncate call) issue ... Now the benchmark is doing multiple runs (before the fix, we could only do one run - it then died when trying to clean up the test directories from the first run). Will let it loop for a while .. hopefully there will be no new show-stopper.

The bug is really silly - GFS2 tries to read in the data page but pre-sets the err to -EIO before calling the block layer's ll_rw_block. It never resets the err even when the disk read succeeds.

Created attachment 157640 [details]
This fixes the gfs_block_truncate_page EIO error.
Fixes a silly bug that would fail gfs2_setattr() with an EIO error when truncating
a file. Will submit this to cluster-devel when Steve is online again.
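The preset-EIO bug lends itself to a tiny userspace reproduction. This is a hedged sketch of the error-handling pattern described in the comments, not the actual gfs2_block_truncate_page() code: read_block_uptodate() is a stub standing in for ll_rw_block() plus waiting on the buffer, and here it always succeeds.

```c
#include <assert.h>
#include <errno.h>

static int read_block_uptodate(void)
{
    return 1; /* stub: the buffer always comes back uptodate */
}

static int truncate_page_buggy(void)
{
    int err = -EIO;              /* pre-set before issuing the read ... */

    if (!read_block_uptodate())
        goto unlock;
    /* success path forgets to clear err */
unlock:
    return err;                  /* ... so -EIO is returned even on success */
}

static int truncate_page_fixed(void)
{
    int err = -EIO;

    if (!read_block_uptodate())
        goto unlock;
    err = 0;                     /* the fix: clear err once the read is good */
unlock:
    return err;
}
```

The buggy variant returns -EIO on a successful read, which is exactly what NFSD then relayed to the client as a setattr failure; the fixed variant returns 0.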
All of the above work was done using an upstream kernel (2.6.22-rc4, taken from the nmw git tree on Jun. 5). SPECsfs is able to run in a loop without issues. Will submit the patches to cluster-devel on Monday. Starting the RHEL 5 port now.

Never-ending saga .. Freshly upgraded the machine to the newest RHEL 5.1 build (31.el5) and rebuilt it with the above patches. The machine immediately hung - happened to have a "tail -f /var/log/messages" screen in another window that showed a soft lockup somewhere in the journaling write. Not sure whether this is an old issue or a new regression though. The test machine doesn't have a power switch; have to wait until Monday when I get to the office to check this out:

Jun 24 16:07:18 dhcp143 mountd[3668]: authenticated mount request from dhcp139.perf.redhat.com:740 for /sfs1 (/sfs1)
Jun 24 16:07:51 dhcp143 kernel: crash memory driver: version 1.0
Jun 24 16:08:17 dhcp143 kernel: BUG: soft lockup detected on CPU#1!
Jun 24 16:08:17 dhcp143 kernel:
Jun 24 16:08:17 dhcp143 kernel: Call Trace:
Jun 24 16:08:17 dhcp143 kernel: <IRQ> [<ffffffff800b4f74>] softlockup_tick+0xdb/0xed
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff800930f6>] update_process_times+0x42/0x68
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff800746f5>] smp_local_timer_interrupt+0x23/0x47
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff80074db7>] smp_apic_timer_interrupt+0x41/0x47
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8005bc8e>] apic_timer_interrupt+0x66/0x6c
Jun 24 16:08:17 dhcp143 kernel: <EOI> [<ffffffff80062a18>] _spin_lock+0x3/0xa
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8850a586>] :gfs2:databuf_lo_before_commit+0x337/0x594
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff88508c98>] :gfs2:gfs2_log_flush+0x12b/0x2da
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff885092c6>] :gfs2:gfs2_log_reserve+0xd0/0x131
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8851925e>] :gfs2:gfs2_trans_begin+0xf6/0x135
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8850cc45>] :gfs2:gfs2_prepare_write+0x1d5/0x29b
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8000f96e>] generic_file_buffered_write+0x25a/0x6d8
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8003ef74>] memcpy_toiovec+0x36/0x66
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff80030f3d>] skb_copy_datagram_iovec+0x4f/0x237
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff8000ddb7>] current_fs_time+0x3b/0x40
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff80015d5d>] __generic_file_aio_write_nolock+0x36c/0x3b8
Jun 24 16:08:17 dhcp143 kernel: [<ffffffff800bdd92>] __generic_file_write_nolock+0x8f/0xa8

(Lower part of the screen got chopped off.)

Re-dispatched the job and took a quick back trace - surprised to see heavy lock contention in the journaling code. Never saw this issue before... Looks serious to me.

Well, while I was typing the above comment, the system passed the previous lock-up point. Will let it run. Hmm, not good.

I ran a debug version of RHEL5 32.el5 and the benchmark loops without hangs. This probably means that when GFS2 is stressed enough and/or runs at full speed, this journal write issue will surface. Not sure how long it will take to dig this out at this moment.

s/32.el5/31.el5/ in comment #33

Absolutely no idea why it *was* choked. Now the problem simply disappears, even without the debug option on. Found that Ben and Bob have been working on a journal writes issue for a while. The problem from comment #30 seems to be right in that arena, so I'll not dig further unless it shows up again.

Patches were sent to cluster-devel tonight - waiting for Steve to get online to review. They can be applied directly to the upstream kernel and to the RHEL 5.1 31.el5 build as well. Summary of changes:

[Patch 3-1] EIO error from gfs2_block_truncate_page
A code segment inside gfs2_block_truncate_page() doesn't set the return code correctly. This causes NFSD to erroneously return EIO back to the client on a setattr procedure call (truncate error).

[Patch 3-2] Obtaining no_formal_ino from the directory entry
The GFS2 lookup code doesn't ask for the inode shared glock. This implies that during in-memory inode creation for an existing file, GFS2 will not read the inode contents from disk. This leaves no_formal_ino uninitialized at lookup time. The uninitialized no_formal_ino is subsequently encoded into the file handle. Clients will get an ESTALE error whenever they try to access these files.

[Patch 3-3] Remove i_mode passing from the NFS File Handle
GFS2 passes i_mode within the NFS File Handle. Other than the wrong assumption that there is always room for this extra 16-bit value, the current gfs2_get_dentry doesn't really need i_mode to work correctly. Note that the GFS2 NFS code does go through the same lookup code path as the direct file access route (where the mode is obtained from the name lookup), but gfs2_get_dentry() is coded for a different purpose. It is not used at lookup time; it is part of the file access procedure call. When the call is invoked, if the on-disk inode is not in memory, it has to be read in. This makes i_mode inside the File Handle an obsolete feature. The patch removes it accordingly. Without the removal, current NFS-GFS2 would panic in the locking assert.

I'll move on to work on bugzilla 244343. This bugzilla will stay in the assigned state until Steve finishes reviewing the patches.
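The idea behind patch 3-2 can be sketched briefly. This is an illustrative model under stated assumptions, not the actual GFS2 code: the struct and function names (dirent_ondisk, inode_inmem, lookup_fill_from_dirent) are invented for this sketch. Since the on-disk directory entry already carries no_formal_ino, lookup can copy it into the in-memory inode without taking the inode glock or reading the inode block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shapes of a directory entry and an in-memory inode. */
struct dirent_ondisk {
    uint64_t no_addr;
    uint64_t no_formal_ino;  /* stored alongside the name on disk */
};

struct inode_inmem {
    uint64_t no_addr;
    uint64_t no_formal_ino;
};

static void lookup_fill_from_dirent(struct inode_inmem *ip,
                                    const struct dirent_ondisk *de)
{
    ip->no_addr = de->no_addr;
    /* Previously left uninitialized until the inode block was read;
     * now valid immediately, so NFS filehandle encoding is safe. */
    ip->no_formal_ino = de->no_formal_ino;
}
```

This keeps the lockless lookup fast path intact while still giving the NFS encoding routine a real no_formal_ino.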
[patch 1/4] Fix gfs2_block_truncate_page err return
Created attachment 158064 [details]
[patch 2/4] obtain no_formal_ino from directory entry
Created attachment 158065 [details]
[patch 3/4] remove imode from file handle
Created attachment 158066 [details]
[patch 4/4] gfs2 truncate inode size inconsistency
patch 1/4 and 4/4 are in Steve's git tree now. The other two are waiting for his review.

As far as I can tell, 4/4 is not in swhiteho's git tree yet. I've got a duplicate problem in gfs, known as bug #241096, and nearly identical code was suggested as a fix. And yes, it would make the code work. But before I ship that in gfs, I want to understand why the inode size gets changed if it fails quota_check or any other check. Does the VFS change the value before calling the setattr function? Seems bad. It seems to me that the inode size shouldn't be changed until after the grow/shrink is successful and we haven't gone through the error path. That would make this fix unnecessary. I need to do some research.

Remember we are talking about two inodes here - one VFS inode and one gfs(1/2) inode. When the size changes, both structures need to be changed. The current code seems cleaner - it does the VFS inode change first (since it is lightweight - undoing it is a no-brainer), followed by the gfs inode changes (which involve disk operations). If you did gfs(1/2) first and the VFS inode later, and the VFS inode change failed, undoing gfs(1/2) would be much messier. I really don't see why we need to fuss about this simple fix. Steve has been at OLS and he might be making some mistakes (in pulling in the fix). I'll check with him Monday morning. The following is the description in cluster-devel: https://www.redhat.com/archives/cluster-devel/2007-June/msg00226.html

After studying the code and seeing how vmtruncate works, I concur. I agree this is the right fix, so I'll commit the fix for gfs(1) as well. No "fuss" really; I'm just trying to make you aware of the gfs1 bz.

4/4 is in the git tree, as are the others.

in 2.6.18-33.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html