Bug 672981 - lseek() over NFS is returning an incorrect file length under some circumstances
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 5.7
Assignee: Jeff Layton
QA Contact: yanfu,wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-01-26 21:45 UTC by Trond Myklebust
Modified: 2012-01-06 03:26 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 09:37:24 UTC
Target Upstream Version:
Embargoed:


Attachments
Tighten up the attribute update code (2.42 KB, patch)
2011-01-26 21:46 UTC, Trond Myklebust


Links
System ID: Red Hat Product Errata RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Description Trond Myklebust 2011-01-26 21:45:13 UTC
Description of problem:
While running a scan on a large postgresql database, we appear to have hit
an NFS attribute revalidation problem that was occasionally causing
lseek(fd, 0, SEEK_END) to return a stale file size.
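For reference, a minimal sketch of the call in question. The helper names size_via_seek_end() and size_of_path() are hypothetical and are not taken from PostgreSQL or from this report. On a correctly behaving client the value returned should never be smaller than a length already observed, as long as nothing truncates the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: current file length as seen through an open
 * descriptor. Note that this leaves the file offset at end of file.
 * Returns (off_t)-1 on error, as lseek() does. */
off_t size_via_seek_end(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_END);
}

/* Hypothetical convenience wrapper: open a path read-only, query its
 * length with SEEK_END, and close it again. Returns (off_t)-1 on error. */
off_t size_of_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return (off_t)-1;
    off_t end = size_via_seek_end(fd);
    close(fd);
    return end;
}
```

The report concerns a descriptor that stays open across many calls, so size_via_seek_end() is the closer match; size_of_path() is only there for quick checks.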

Version-Release number of selected component (if applicable):
Reproduced with both kernel 2.6.18-128 and 2.6.18-238

How reproducible:

It can be reproduced reliably on our setup by doing a full scan of our Black Duck
database. At some point, the postgresql database will log an error, an example of which follows:

2011-01-20 12:53:51 PST  32485  ERROR:  unexpected data beyond EOF in block 219888 of relation base/18602/35518063


The Black Duck team managed to instrument the postgresql database to log all
lseek() and write() calls. When we did so, we logged the following events:

        FileSeek(SEEK_END)      base/18602/35518063.1   727646208
        FileSeek(SEEK_END)      base/18602/35518063.1   727580672
error reported here for "data beyond EOF in block 219888 of relation base/18602/35518063"
        FileSeek(SEEK_END)      base/18602/35518063.1   727580672
        FileSeek(SEEK_END)      base/18602/35518063.1   727646208

All this occurred with no intervening write calls (and no truncates, obviously).

The file length of 727580672 did indeed correspond to a previous length
of the file. The correct file length was 727646208.
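The numbers above line up with the block named in the error message. The sketch below assumes PostgreSQL's default build settings of 8192-byte blocks and 131072 blocks (1 GB) per segment file; neither value is stated in this report, and the helper names are hypothetical. Under those assumptions, block 219888 starts at byte 727580672 of segment file ".1", which is exactly the stale length, and the two lengths differ by 65536 bytes, or 8 blocks. That would be consistent with the stale length being taken as EOF and data then being found in the block at that position.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PostgreSQL build defaults; neither value appears in the report. */
#define ASSUMED_BLOCK_SIZE      8192u     /* bytes per block */
#define ASSUMED_BLOCKS_PER_SEG  131072u   /* blocks per 1 GB segment file */

struct seg_pos {
    uint32_t segment;   /* numeric suffix of the segment file (0 = no suffix) */
    uint64_t offset;    /* byte offset of the block within that segment */
};

/* Map a relation-wide block number to a segment file and byte offset. */
struct seg_pos locate_block(uint32_t block)
{
    struct seg_pos p;
    p.segment = block / ASSUMED_BLOCKS_PER_SEG;
    p.offset  = (uint64_t)(block % ASSUMED_BLOCKS_PER_SEG) * ASSUMED_BLOCK_SIZE;
    return p;
}

/* Number of whole blocks between two file lengths. */
uint64_t blocks_between(uint64_t smaller, uint64_t larger)
{
    assert(larger >= smaller);
    return (larger - smaller) / ASSUMED_BLOCK_SIZE;
}
```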

Comment 1 Trond Myklebust 2011-01-26 21:46:55 UTC
Created attachment 475487
Tighten up the attribute update code

Comment 2 Trond Myklebust 2011-01-26 21:58:25 UTC
The above patch fixes three bugs in the RHEL-5.6 kernel:

1) nfs_wcc_update_inode() should not be called from nfs_check_inode_attributes(). nfs_refresh_inode_locked() has already determined that these attributes are likely to be stale, so it is a bug to then apply them anyway.

2) nfs_revalidate_file_size() shouldn't test for nfsi->npages != 0. If NFS_INO_REVAL_PAGECACHE is set, then that means we want to revalidate the page cache irrespective of whether we have dirty data or not.

3) If nfs_wcc_update_inode() updates the mtime/ctime/size, then we need to ensure that nfsi->attr_gencount gets updated too. Do so by having it set the NFS_INO_INVALID_ATTR flag, so that nfs_update_inode() performs the attr_gencount update.


Points 1) and 2) above are already changed in the upstream kernel.

Point 3) is not yet fixed in upstream, but will be soon...
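As an illustration of why point 3) matters, here is a toy user-space model of a generation counter. It is not the kernel code: struct toy_inode and every toy_* function are hypothetical, and the race it models (an attribute reply stamped before a local size update arriving after it) is an assumption about the general purpose of such a counter rather than something stated in this report. If the local update changes the cached size without advancing the generation, the older reply is still accepted and the stale size comes back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for cached inode attributes. */
struct toy_inode {
    uint64_t size;       /* cached file size */
    uint64_t attr_gen;   /* generation of the last accepted update */
};

static uint64_t toy_gen_counter;   /* global, monotonically increasing */

static uint64_t toy_next_gen(void)
{
    return ++toy_gen_counter;
}

/* Stamp taken when a request that will return attributes is started. */
uint64_t toy_request_start(void)
{
    return toy_next_gen();
}

/* Local update of the cached size. Passing bump_gen = false models the
 * bug: the size changes but the generation does not advance. */
void toy_local_update(struct toy_inode *ino, uint64_t size, bool bump_gen)
{
    assert(ino != NULL);
    ino->size = size;
    if (bump_gen)
        ino->attr_gen = toy_next_gen();
}

/* Apply a reply carrying the stamp of its request. A reply that is not
 * newer than the cache is discarded. Returns true if it was applied. */
bool toy_apply_reply(struct toy_inode *ino, uint64_t stamp, uint64_t size)
{
    assert(ino != NULL);
    if (stamp <= ino->attr_gen)
        return false;
    ino->size = size;
    ino->attr_gen = toy_next_gen();
    return true;
}
```

With bump_gen = false a reply stamped before the local update overwrites the newer size; with bump_gen = true the same reply is rejected and the newer size is kept.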

Comment 3 Jeff Layton 2011-01-28 21:43:54 UTC
Thanks Trond,

I added this to my test kernels here:

    http://people.redhat.com/jlayton/

...the patch looks sane as best I can tell, and I suspect it may also fix bug 663068. Nate is going to run that test against my test kernels over the weekend so hopefully we'll have some results next week sometime.

Comment 4 RHEL Program Management 2011-02-01 17:06:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jarod Wilson 2011-05-13 22:19:16 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 11 yanfu,wang 2011-05-30 10:35:44 UTC
hi,
Could the customer help test this and give feedback on the results? Thanks.

Comment 12 Trond Myklebust 2011-05-30 14:52:36 UTC
We have not seen the Postgresql problem recur since we applied the patch to our
kernel in January. It used to occur several times a week.

Note also that all the patches have now been merged into the upstream kernels.

Comment 13 yanfu,wang 2011-05-31 10:30:48 UTC
(In reply to comment #12)
> We have not seen the Postgresql problem recur since we applied the patch to our
> kernel in January. It used to occur several times a week.
> 
> Note also that all the patches have now been merged into the upstream kernels.

Thank you. I did a code review and verified that the patch is applied in kernel-2.6.18-264.el5.

Comment 14 Jeff Layton 2011-05-31 10:42:43 UTC
I think that's the best that can be done for this. I'm not aware of a reliable reproducer for the problems that this patch fixes.

Comment 15 errata-xmlrpc 2011-07-21 09:37:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

