Bug 869441

Summary: NFS client sometimes gets wrong value back from stat()
Product: Red Hat Enterprise Linux 6
Reporter: Tom McDonald <tom>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Filesystem QE <fs-qe>
Severity: high
Priority: unspecified
Version: 6.0
CC: jlayton, kzhang, steved
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-02-08 14:59:30 UTC
Attachments:
  strace output of a test run reproducing problem
  Repro utility
  Packet capture taken while the problem occurred
  patch -- never apply attributes if the ctime is older than the ones we already have

Description Tom McDonald 2012-10-23 21:59:34 UTC
Created attachment 632423 [details]
strace output of a test run reproducing problem

Description of problem:
After calling truncate() on a file on an NFS mount, we sometimes see pre-truncate file sizes.


Version-Release number of selected component (if applicable):
2.6.32-71.el6.x86_64 (CentOS 6.0)
2.6.32-279.11.1.el6.x86_64 (CentOS 6.3)


How reproducible:
The problem seems to be timing related; we see it to a varying degree, up to 1% of the time, and we had to build test utilities that specifically target it.  We basically run the tests in a loop until the problem happens.

It seems to happen more often on Isilon than on NetApp file servers, if that matters.  We also have reports of problems on a Linux file server, but don't have test results to corroborate.

I'll upload one of our repro utilities along with some debug info (strace and tcpdump) gathered when the problem happened.

We have not tested on mainline kernels.  


Steps to Reproduce:
Invoke the following system calls, in order, on a file (a minimal sketch of the sequence follows the list).  Many passes will likely be required; we actually find that we need to truncate dozens of files in a row, over multiple passes, before the problem can be observed.
1. open() file
2. mmap() file
3. memmove() a region of the file
4. msync()
5. munmap()
6. close()
7. truncate()
8. stat()
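
For reference, here is a minimal C sketch of the sequence above.  It is illustrative only; the file name, offsets, and the 1000-byte shrink are assumptions based on the strace excerpt below, and the attached repro utility may differ.

    /*
     * Minimal sketch of the repro sequence -- illustrative only; the
     * attached repro utility may differ.  Point it at a file on the
     * NFS mount and run it in a loop from a wrapper script.
     */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "trunk_file";  /* hypothetical test file */
        struct stat st;

        int fd = open(path, O_RDWR);                            /* 1. open() */
        if (fd < 0 || fstat(fd, &st) != 0) { perror(path); return 1; }

        off_t len = st.st_size;
        if (len < 8192) { fprintf(stderr, "%s: file too small\n", path); return 1; }

        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);                    /* 2. mmap() */
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        memmove(map, map + 4096, len - 4096);                   /* 3. memmove() a region */
        msync(map, len, MS_SYNC);                               /* 4. msync() */
        munmap(map, len);                                       /* 5. munmap() */
        close(fd);                                              /* 6. close() */

        off_t new_len = len - 1000;                             /* shrink by 1000 bytes, as in the strace */
        if (truncate(path, new_len) != 0) { perror("truncate"); return 1; }  /* 7. truncate() */
        if (stat(path, &st) != 0) { perror("stat"); return 1; }              /* 8. stat() */

        if (st.st_size != new_len)
            printf("MISMATCH: truncated to %lld but stat() says %lld\n",
                   (long long)new_len, (long long)st.st_size);
        return 0;
    }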


Actual results:
Run the reproduction logic in a loop and you'll eventually see the stat() call return the size the file had before the truncate.


Expected results:
For a single-threaded application modifying files on an NFS mount, with no other process on another NFS client accessing the file, the expected behaviour is for stat() on the same host to reliably return the truncated file size.


Additional info:
We've observed this happening on both Isilon and NetApp NFS servers in the environment I'm working in.  The problem is most reproducible on Isilon which is an older, slower system.

The environment in question has run for years with CentOS 4 without this issue.

The problem _may_ only happen when the opened file is memory mapped.  We're doing an msync() and munmap() before closing and truncating.  Limited testing without mmap calls hasn't reproduced the issue.

The environment I'm working in is mainly CentOS 6.0.  The company who owns the environment upgraded one of the nodes to 6.3 for testing, and we were still able to repro the issue.

Here is a small segment from strace showing stat() returning 36528128 instead of 36527128:

10:09:02.933851 truncate("trunk_file33", 36527128) = 0 <0.000700>
10:09:02.934597 stat("trunk_file33", {st_mode=S_IFREG|0600, st_size=36528128, ...}) = 0 <0.000025>

Anecdotally, I can report that we have seen stat() calls on post-truncate files return the truncated size after waiting several seconds.  We also sometimes see a stat() immediately after the truncate() return the correct size, with other stat() calls 0.25 seconds later showing the wrong size.

I'll try to upload repro utility, strace output, and tcpdump capture for more information.

Comment 2 Tom McDonald 2012-10-23 22:05:11 UTC
Created attachment 632425 [details]
Repro utility

Here's the repro utility.  Note that I'm not a programmer by trade.  :)

Comment 3 Tom McDonald 2012-10-23 22:06:49 UTC
Created attachment 632426 [details]
Packet capture taken while the problem occurred

Packet capture attached, made with tcpdump.

Comment 4 Tom McDonald 2012-12-22 00:10:41 UTC
A little more information that might be helpful—

We haven't had time to build a mainline kernel to test whether this behavior exists there as well, but we did have time to try our repro on a Debian (6.0.6) system running the 2.6.32-5-amd64 kernel, and we reproduced the problem there too.

Also, the problem happened for us on both NetApp and Isilon file servers.  The problem happens more often with Isilon for some reason.

In extended testing, CentOS 4 did show a similar issue, reporting the wrong file size very occasionally (approximately 0.00001% of truncates) and only immediately after new files were copied over old ones before being truncated.

Comment 5 Jeff Layton 2013-01-07 19:24:53 UTC
This sounds somewhat similar to the problem that Chen Gang was trying to describe on the linux-nfs ml back in October:

    https://lkml.org/lkml/2012/10/14/303

In fact, looking at the capture, I see a bunch of READ calls and replies interleaved with SETATTR calls and replies, which is exactly the situation he was describing.

Most likely, this is a case where we have attribute updates racing, and the wrong one is occasionally winning. There are a couple of possibilities:

1) we got back the correct attributes from the SETATTR, but we mistakenly considered them stale and didn't update the inode with them.

2) a READ reply was processed after a SETATTR reply, and we mistakenly applied its older, stale attributes to the inode even though the SETATTR ones were newer.

The hard part is figuring out which is the situation here...

Comment 6 Jeff Layton 2013-01-08 12:03:18 UTC
The interesting part here is that the server seems to be returning sub-second timestamps and those are changing on the SETATTR call (at least, assuming that the last frames captured represent when the race is occurring).

This is going to take some debugging effort to track down, so it would be ideal if you could open a support case so we could help you do that. Since you're using CentOS, though, you'll need to do some more of the legwork on your own to track this down.

What I'd probably suggest first is to see if this is reproducible on a more current mainline kernel. Maybe something 3.7-ish? If it is, then you can report it to the linux-nfs mailing list and we can work on the problem there. Once a fix is identified, then we can see about backporting it into RHEL.

Comment 7 Jeff Layton 2013-01-08 12:25:14 UTC
Now that I look though, I suspect that I know what the problem is...

nfs_inode_attrs_need_update does this:

        return ((long)fattr->gencount - (long)nfsi->attr_gencount) > 0 ||
                nfs_ctime_need_update(inode, fattr) ||
                nfs_size_need_update(inode, fattr) ||
                ((long)nfsi->attr_gencount - (long)nfs_read_attr_generation_counter() > 0);

...I think the problem is that nfs_ctime_need_update() doesn't properly distinguish between the case where the ctime in the fattr is older than the one in the inode and the case where they're the same. If the ctime in the inode is newer than the one in the fattr we're looking at, then we should toss out the attribute update.

The code above will still allow the attribute update, since nfs_ctime_need_update() just returns a boolean value and nfs_size_need_update() still fires because the size is larger. I think this is what Chen was trying to get across in his upstream discussion, but he never proposed a patch.
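
To illustrate the distinction, here's a rough sketch of a three-way ctime comparison gating the update. This is not the attached patch: the helper names are invented, it assumes the fs/nfs/inode.c context of that era (struct timespec ctime fields, timespec_compare(), the static nfs_size_need_update() helper), and the validity-flag and gencount checks are omitted for brevity.

        /* Sketch only -- not the attached patch. */
        static int nfs_ctime_cmp(const struct inode *inode, const struct nfs_fattr *fattr)
        {
                /* <0: fattr ctime is older, 0: equal, >0: fattr ctime is newer */
                return timespec_compare(&fattr->ctime, &inode->i_ctime);
        }

        static bool nfs_should_apply_attrs(struct inode *inode, struct nfs_fattr *fattr)
        {
                int cmp = nfs_ctime_cmp(inode, fattr);

                if (cmp < 0)
                        return false;   /* attrs predate what we already have: never apply */
                if (cmp > 0)
                        return true;    /* strictly newer ctime: apply */

                /* same ctime: fall back on the remaining heuristics (e.g. size) */
                return nfs_size_need_update(inode, fattr);
        }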

Comment 8 Jeff Layton 2013-01-08 12:43:05 UTC
Created attachment 674779 [details]
patch -- never apply attributes if the ctime is older than the ones we already have

Here's an untested patch that may do the right thing here. This one is against mainline, but will probably apply to rhel6 with a bit of wiggling. It'll probably need a bit of cleanup before I can submit it upstream, so consider this a proof-of-concept patch.

If you can test that and let me know whether it helps, then that would be helpful. Ideally, it would be best to test a stock mainline kernel too to make sure this is a problem there, and then test this patch on that too.

That'll make it easier to make the case to Trond for changing the logic here.

Comment 9 Jeff Layton 2013-01-15 19:22:02 UTC
Tom, would you be able to test this patch and let me know if it helps at all?

Comment 10 Jeff Layton 2013-01-30 14:03:21 UTC
Tom, any word on this patch?

Comment 11 Jeff Layton 2013-02-08 14:59:30 UTC
No response in a month, closing bug with a resolution of INSUFFICIENT_DATA. Please reopen if you're able to test the patch in comment #8 with your reproducer.

Comment 12 Tom McDonald 2013-02-08 17:02:51 UTC
Sorry for the delay, Jeff.  My contract with the company that had this problem ended in December.  I'm not sure whether I'll be back with access to a test environment at this point, but the company is aware of this issue and its status.