672981 – lseek() over NFS is returning an incorrect file length under some circumstances

Summary: lseek() over NFS is returning an incorrect file length under some circumstances

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.6
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	beta
Target Release:	5.7
Assignee:	Jeff Layton
QA Contact:	yanfu,wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-01-26 21:45 UTC by Trond Myklebust
Modified:	2012-01-06 03:26 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-07-21 09:37:24 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Tighten up the attribute update code (2.42 KB, patch) 2011-01-26 21:46 UTC, Trond Myklebust

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1065	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update	2011-07-21 09:21:37 UTC

Description Trond Myklebust 2011-01-26 21:45:13 UTC

Description of problem:
While running a scan on a large postgresql database, we appear to have hit
an NFS attribute revalidation problem that was occasionally causing
lseek(fd, 0, SEEK_END) to return a stale file size.

Version-Release number of selected component (if applicable):
Reproduced with both kernel 2.6.18-128 and 2.6.18-238

How reproducible:

It can be reproduced reliably on our setup by doing a full scan of our Black Duck
database. At some point, the postgresql database will log an error, an example of which follows:

2011-01-20 12:53:51 PST  32485  ERROR:  unexpected data beyond EOF in block 219888 of relation base/18602/35518063


The Black Duck team managed to instrument the postgresql database to log all
lseek() and write() calls. When we did so, we logged the following events:

        FileSeek(SEEK_END)      base/18602/35518063.1   727646208
        FileSeek(SEEK_END)      base/18602/35518063.1   727580672
error reported here for "data beyond EOF in block 219888 of relation
base/18602/35518063"
        FileSeek(SEEK_END)      base/18602/35518063.1   727580672
        FileSeek(SEEK_END)      base/18602/35518063.1   727646208

All this occurred with no intervening write calls (and no truncates, obviously).
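
As a rough illustration of how such a log can be checked, the sketch below scans a sequence of SEEK_END results for any drop in reported size. The helper names (end_offset, find_regressions) are made up for this example and are not part of the postgresql instrumentation; the check is only meaningful when the file is known not to be truncated between samples.

```python
import os
import tempfile


def end_offset(fd):
    """Return the offset that lseek() reports for end-of-file on fd."""
    return os.lseek(fd, 0, os.SEEK_END)


def find_regressions(sizes):
    """Return (index, previous, current) for every drop in reported size.

    A drop is only suspicious if nothing truncated the file in between,
    which is the situation described in the report.
    """
    drops = []
    for i in range(1, len(sizes)):
        if sizes[i] < sizes[i - 1]:
            drops.append((i, sizes[i - 1], sizes[i]))
    return drops


# The four SEEK_END results from the log above, in order.
logged = [727646208, 727580672, 727580672, 727646208]
print(find_regressions(logged))

# Local sanity check of end_offset() on a temporary file (not on NFS).
with tempfile.TemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()  # push buffered data to the fd before asking for its size
    print(end_offset(f.fileno()))
```

Against a live file, end_offset() could be sampled in a loop and the results fed to find_regressions(); on a local filesystem no drop should be reported.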

The file length of 727580672 did indeed correspond to a previous length
of the file. The correct file length was 727646208.
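
The numbers can be cross-checked with a little arithmetic. The sketch below assumes the stock PostgreSQL build settings of 8192-byte blocks and 1 GiB segment files (131072 blocks per segment); neither value is stated in this report, so treat both as assumptions.

```python
BLOCK_SIZE = 8192            # assumed PostgreSQL default BLCKSZ
BLOCKS_PER_SEGMENT = 131072  # assumed default segment size (1 GiB / 8192)


def locate_block(block_number):
    """Map a relation block number to (segment number, byte offset in segment)."""
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    return segment, block_in_segment * BLOCK_SIZE


segment, offset = locate_block(219888)
print(segment, offset)

stale_size = 727580672
correct_size = 727646208
print((correct_size - stale_size) // BLOCK_SIZE)
```

Under those assumptions, block 219888 starts at byte 727580672 of segment file .1, which is exactly the stale length, and the correct length is 8 blocks (65536 bytes) larger. That is consistent with the stale SEEK_END result making block 219888 look like the first block past EOF when the file already held data there.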

Comment 1 Trond Myklebust 2011-01-26 21:46:55 UTC

Created attachment 475487
Tighten up the attribute update code

Comment 2 Trond Myklebust 2011-01-26 21:58:25 UTC

The above patch fixes three bugs in the RHEL-5.6 kernel:

1) nfs_wcc_update_inode() should not be called from nfs_check_inode_attributes(). nfs_refresh_inode_locked() has already determined that these attributes are likely to be stale, so it is a bug to then apply them anyway.

2) nfs_revalidate_file_size() shouldn't test for nfsi->npages != 0. If NFS_INO_REVAL_PAGECACHE is set, then that means we want to revalidate the page cache irrespective of whether we have dirty data or not.

3) If nfs_wcc_update_inode() updates the mtime/ctime/size, then we need to ensure that nfsi->attr_gencount gets updated too. Do so by having it set the NFS_INO_INVALID_ATTR flag, so that nfs_update_inode() performs the attr_gencount update.


Points 1) and 2) above are already changed in the upstream kernel.

Point 3) is not yet fixed in upstream, but will be soon...
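
To illustrate why point 3) matters, here is a deliberately simplified toy model of a generation-counted attribute cache. It is not the kernel code and not the attached patch: the class and method names are invented, and the real client applies further checks (ctime and size comparisons, among others) that are left out. It only shows how an older reply can overwrite a newer size when a size update does not also advance the generation stamp.

```python
OLD_SIZE = 727580672
NEW_SIZE = 727646208


class ToyAttrCache:
    """Toy stand-in for a cached inode with a generation stamp."""

    def __init__(self, size):
        self.size = size
        self.attr_gencount = 0  # generation of the last accepted update
        self._counter = 0       # toy global generation counter

    def next_generation(self):
        self._counter += 1
        return self._counter

    def apply_getattr_reply(self, issued_gen, size):
        """Accept a reply only if its request was issued after the last update."""
        if issued_gen <= self.attr_gencount:
            return False  # reply predates the cached attributes: drop it
        self.size = size
        self.attr_gencount = self.next_generation()
        return True

    def apply_wcc_update(self, size, bump_gencount):
        """Size update from a write reply, with or without a generation bump."""
        self.size = size
        if bump_gencount:
            self.attr_gencount = self.next_generation()


def race(bump_gencount):
    cache = ToyAttrCache(OLD_SIZE)
    issued = cache.next_generation()             # attribute request sent; server still at OLD_SIZE
    cache.apply_wcc_update(NEW_SIZE, bump_gencount)  # write reply lands first
    cache.apply_getattr_reply(issued, OLD_SIZE)  # delayed, now stale reply
    return cache.size


print(race(bump_gencount=False))  # stale reply is accepted
print(race(bump_gencount=True))   # stale reply is rejected
```

In this toy, skipping the generation bump lets the delayed reply win and the cached size falls back to the old value; with the bump, the delayed reply is discarded.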

Comment 3 Jeff Layton 2011-01-28 21:43:54 UTC

Thanks Trond,

I added this to my test kernels here:

    http://people.redhat.com/jlayton/

...the patch looks sane as best I can tell, and I suspect it may also fix bug 663068. Nate is going to run that test against my test kernels over the weekend so hopefully we'll have some results next week sometime.

Comment 4 RHEL Program Management 2011-02-01 17:06:59 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jarod Wilson 2011-05-13 22:19:16 UTC

Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
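
A quick way to check whether a given RHEL 5 kernel is at or past that build is sketched below. The parsing is an assumption: it only understands release strings of the form 2.6.18-N[.x.y].el5 and compares N against 261, so it cannot tell whether an earlier build carries a backport of the patch, and a False result for any non-EL5 kernel says nothing about whether that kernel has the fix.

```python
import platform
import re

FIXED_BUILD = 261  # from the comment above: kernel-2.6.18-261.el5


def el5_build_number(release):
    """Return N from a '2.6.18-N[.x.y].el5' release string, else None."""
    match = re.match(r"2\.6\.18-(\d+)[\d.]*\.el5", release)
    return int(match.group(1)) if match else None


def at_or_after_fixed_build(release):
    """True if release is an EL5 kernel whose build number is >= 261."""
    build = el5_build_number(release)
    return build is not None and build >= FIXED_BUILD


# platform.release() returns the same string as `uname -r`.
print(platform.release(), at_or_after_fixed_build(platform.release()))
```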

Comment 11 yanfu,wang 2011-05-30 10:35:44 UTC

hi,
Could the customer help test this and give feedback on the test results? Thanks.

Comment 12 Trond Myklebust 2011-05-30 14:52:36 UTC

We have not seen the Postgresql problem recur since we applied the patch to our
kernel in January. It used to occur several times a week.

Note also that all the patches have now been merged into the upstream kernels.

Comment 13 yanfu,wang 2011-05-31 10:30:48 UTC

(In reply to comment #12)
> We have not seen the Postgresql problem recur since we applied the patch to our
> kernel in January. It used to occur several times a week.
> 
> Note also that all the patches have now been merged into the upstream kernels.

Thank you. I did a code review and verified that the patch is applied in kernel-2.6.18-264.el5.

Comment 14 Jeff Layton 2011-05-31 10:42:43 UTC

I think that's the best that can be done for this. I'm not aware of a reliable reproducer for the problems that this patch fixes.

Comment 15 errata-xmlrpc 2011-07-21 09:37:24 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
