Bug 672981

Summary: lseek() over NFS is returning an incorrect file length under some circumstances
Product: Red Hat Enterprise Linux 5
Reporter: Trond Myklebust <trond.myklebust>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA
QA Contact: yanfu,wang <yanwang>
Severity: high
Priority: high
Version: 5.6
CC: bfields, cward, jiali, jlayton, jwest, qcai, rpacheco, rwheeler, steved, tgl
Target Milestone: beta
Keywords: OtherQA
Target Release: 5.7
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-07-21 09:37:24 UTC
Attachments:
  Tighten up the attribute update code (flags: none)

Description Trond Myklebust 2011-01-26 21:45:13 UTC
Description of problem:
While running a scan on a large postgresql database, we appear to have hit
an NFS attribute revalidation problem that was occasionally causing
lseek(fd, 0, SEEK_END) to return a stale file size.
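
For illustration only, here is a minimal C sketch of the invariant that was violated: with no truncate in between, successive lseek(fd, 0, SEEK_END) results should never shrink. The helper names length_via_seek_end() and probe_observe() are hypothetical and are not taken from the PostgreSQL instrumentation mentioned in this report.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* File length as seen through lseek(fd, 0, SEEK_END), or (off_t)-1 on error.
 * Side effect: the file offset is left at end of file. */
static off_t length_via_seek_end(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* Hypothetical probe state: remembers the last length observed. */
struct length_probe {
    off_t last;
    int have_last;
};

/* Record one observed length. Returns 1 if it is smaller than the
 * previous observation (a regression), 0 otherwise. */
static int probe_observe(struct length_probe *p, off_t len)
{
    int regressed;

    assert(p != NULL);
    regressed = p->have_last && len < p->last;
    p->last = len;
    p->have_last = 1;
    return regressed;
}
```

Feeding the probe the result of length_via_seek_end() after each seek would flag any length that is smaller than the one seen just before it. This only detects the symptom; it is not a reproducer.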

Version-Release number of selected component (if applicable):
Reproduced with both kernel 2.6.18-128 and 2.6.18-238

How reproducible:

It can be reproduced reliably on our setup by doing a full scan of our Black Duck
database. At some point, the postgresql database will log an error, an example of which follows:

2011-01-20 12:53:51 PST  32485  ERROR:  unexpected data beyond EOF in block 219888 of relation base/18602/35518063


The Black Duck team managed to instrument the postgresql database to log all
lseek() and write() calls. When we did so, we logged the following events:

        FileSeek(SEEK_END)      base/18602/35518063.1   727646208
        FileSeek(SEEK_END)      base/18602/35518063.1   727580672
error reported here for "data beyond EOF in block 219888 of relation base/18602/35518063"
        FileSeek(SEEK_END)      base/18602/35518063.1   727580672
        FileSeek(SEEK_END)      base/18602/35518063.1   727646208

All this occurred with no intervening write calls (and no truncates, obviously).

The file length of 727580672 did indeed correspond to a previous length
of the file. The correct file length was 727646208.
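
The numbers above line up with the block number in the error message if one assumes PostgreSQL's default build settings, which this report does not state: 8192-byte blocks and 1 GB segment files (131072 blocks each). Under that assumption, block 219888 falls in segment file ".1" at byte offset 727580672, which is exactly the stale length, and the correct length is 8 blocks larger. The following C sketch works through that arithmetic; the macro and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PostgreSQL build defaults; neither value is stated in the report. */
#define ASSUMED_BLCKSZ        8192u    /* bytes per block */
#define ASSUMED_RELSEG_BLOCKS 131072u  /* blocks per 1 GB segment file */

/* Segment file number (the ".N" suffix) that holds a relation block. */
static uint32_t segment_of_block(uint32_t blkno)
{
    return blkno / ASSUMED_RELSEG_BLOCKS;
}

/* Byte offset of a relation block within its segment file. */
static int64_t offset_in_segment(uint32_t blkno)
{
    return (int64_t)(blkno % ASSUMED_RELSEG_BLOCKS) * ASSUMED_BLCKSZ;
}

/* Number of whole blocks in a segment file of the given length. */
static int64_t blocks_in_length(int64_t length)
{
    assert(length >= 0);
    return length / ASSUMED_BLCKSZ;
}
```

With these assumptions, segment_of_block(219888) is 1 and offset_in_segment(219888) is 727580672. A stale SEEK_END result of 727580672 would therefore make block 219888 look like the first block past EOF even though the file already held data there, which would be consistent with the "unexpected data beyond EOF" error.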

Comment 1 Trond Myklebust 2011-01-26 21:46:55 UTC
Created attachment 475487 [details]
Tighten up the attribute update code

Comment 2 Trond Myklebust 2011-01-26 21:58:25 UTC
The above patch fixes three bugs in the RHEL-5.6 kernel:

1) nfs_wcc_update_inode() should not be called from nfs_check_inode_attributes(). nfs_refresh_inode_locked() has already determined that these attributes are likely to be stale, so it is a bug to then apply them anyway.

2) nfs_revalidate_file_size() shouldn't test for nfsi->npages != 0. If NFS_INO_REVAL_PAGECACHE is set, then that means we want to revalidate the page cache irrespective of whether we have dirty data or not.

3) If nfs_wcc_update_inode() updates the mtime/ctime/size, then we need to ensure that nfsi->attr_gencount gets updated too. Do so by having it set the NFS_INO_INVALID_ATTR flag, so that nfs_update_inode() performs the attr_gencount update.


Points 1) and 2) above are already changed in the upstream kernel.

Point 3) is not yet fixed in upstream, but will be soon...
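
To illustrate why point 3) matters, here is a deliberately simplified userspace model in C. It is not kernel code: the structures, the run_race() scenario and the comparison rule are assumptions made for illustration, and only the names attr_gencount and "WCC" are borrowed from the comment above. It shows one possible race of the kind a generation counter guards against, not a trace of the actual failure.

```c
#include <assert.h>

/* Toy userspace model. The field names echo the comment above, but these
 * are NOT the kernel's structures and the logic is heavily simplified. */
struct toy_inode {
    long long size;               /* cached file size */
    unsigned long attr_gencount;  /* generation of the last applied update */
};

struct toy_fattr {
    long long size;               /* size reported by the server */
    unsigned long gencount;       /* generation sampled when the request was issued */
};

static unsigned long generation_counter;

static unsigned long next_generation(void)
{
    return ++generation_counter;
}

/* Apply a server reply only if it was issued after the last applied update.
 * Returns 1 if applied, 0 if rejected as possibly stale. */
static int apply_reply(struct toy_inode *inode, const struct toy_fattr *fattr)
{
    if ((long)(fattr->gencount - inode->attr_gencount) <= 0)
        return 0;
    inode->size = fattr->size;
    inode->attr_gencount = next_generation();
    return 1;
}

/* Size update derived from our own write (WCC-style data in this model).
 * bump_gen = 0 models the behaviour before the fix, 1 the behaviour after. */
static void wcc_update(struct toy_inode *inode, long long new_size, int bump_gen)
{
    assert(new_size >= 0);
    inode->size = new_size;
    if (bump_gen)
        inode->attr_gencount = next_generation();
}

/* Race: an attribute request is issued, a write then extends the file,
 * and the older reply arrives last. Returns the final cached size. */
static long long run_race(int bump_gen)
{
    struct toy_inode inode;
    struct toy_fattr delayed;

    inode.size = 727580672LL;
    inode.attr_gencount = next_generation();
    delayed.size = 727580672LL;           /* server answered before the write */
    delayed.gencount = next_generation();

    wcc_update(&inode, 727646208LL, bump_gen);
    apply_reply(&inode, &delayed);        /* late arrival of the old reply */
    return inode.size;
}
```

In this model, run_race(0) ends with the stale size 727580672 because the late reply still looks newer than the inode's generation, while run_race(1) keeps 727646208 because the write-driven update advanced the generation first.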

Comment 3 Jeff Layton 2011-01-28 21:43:54 UTC
Thanks Trond,

I added this to my test kernels here:

    http://people.redhat.com/jlayton/

...the patch looks sane as best I can tell, and I suspect it may also fix bug 663068. Nate is going to run that test against my test kernels over the weekend so hopefully we'll have some results next week sometime.

Comment 4 RHEL Program Management 2011-02-01 17:06:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jarod Wilson 2011-05-13 22:19:16 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 11 yanfu,wang 2011-05-30 10:35:44 UTC
hi,
Could the customer help test this and give feedback on the test results? Thanks.

Comment 12 Trond Myklebust 2011-05-30 14:52:36 UTC
We have not seen the Postgresql problem recur since we applied the patch to our
kernel in January. It used to occur several times a week.

Note also that all the patches have now been merged into the upstream kernels.

Comment 13 yanfu,wang 2011-05-31 10:30:48 UTC
(In reply to comment #12)
> We have not seen the Postgresql problem recur since we applied the patch to our
> kernel in January. It used to occur several times a week.
> 
> Note also that all the patches have now been merged into the upstream kernels.

Thank you. I did a code review and verified that the patch is applied in kernel-2.6.18-264.el5.

Comment 14 Jeff Layton 2011-05-31 10:42:43 UTC
I think that's the best that can be done for this. I'm not aware of a reliable reproducer for the problems that this patch fixes.

Comment 15 errata-xmlrpc 2011-07-21 09:37:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html