Bug 488780 - NFS read cache inconsistency
NFS read cache inconsistency
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
x86_64 Linux
low Severity high
: ---
: ---
Assigned To: Jeff Layton
Red Hat Kernel QE team
Depends On:
  Show dependency treegraph
Reported: 2009-03-05 13:08 EST by Jason Priebe
Modified: 2014-06-18 03:38 EDT (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2010-10-15 09:26:04 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
command-line PHP scripts to help reproduce the problem (739 bytes, application/x-tar)
2009-03-05 13:08 EST, Jason Priebe
no flags Details

  None (edit)
Description Jason Priebe 2009-03-05 13:08:49 EST
Created attachment 334169 [details]
command-line PHP scripts to help reproduce the problem

echo bar > /nfs/test/new/f; mv /nfs/test/new/f /nfs/test/f; echo foo > /nfs/test/new/f; sleep 5; mv /nfs/test/new/f /nfs/test/f

until grep -q foo /nfs/test/f; do echo -n .; sleep 1; done

Description of problem:

Under certain conditions, if one client writes to a file in directory 1 and then moves it into directory 2, a second client will not see changes to the file in directory 2.  It will continue to use cached filedata for a long period of time (upwards of 30 minutes, possibly indefinitely).

Running 'ls' on the file in directory 2 does not cause the cache to refresh itself.  However, running 'ls' on directory 2 itself *usually* will cause the cache to refresh itself.

Note that this seems to be due to a race condition of some sort.  It is hard to replicate unless you have a large number of processes accessing the file in directory 2.  It is not impossible to replicate it with just once process reading the file once per second, but it may take a lot of trials to get it to happen.

Further, it seems that some machines are more prone to this than others; it's almost as if there's' something stateful going on.  Once a file has been "stuck" in the cache on a machine, it seems that machine is faster to reproduce the problem even after you've deleted the file and recreated it.

Version-Release number of selected component (if applicable):

RHEL 5.1 clients:
  kernel 2.6.18-53.el5

RHEL 5.2 server:

How reproducible:

Fairly reproducible with attached scripts.

Steps to Reproduce:

1. client A creates a file /nfs/test/new/foo.txt
2. client B reads existing file /nfs/test/foo.txt
3. client A moves /nfs/test/new/foo.txt to /nfs/test/foo.txt
4. client B reads file /nfs/test/foo.txt (and now gets stale data)

Again, it is not quite this simple; I've only once been able to get this
to happen using only these simple steps.  Normally, I need to run a script
that constantly updates the file contents 
Actual results:

In step 4, client B does not see the changes to foo.txt made by client A.

Expected results:

In step 4, client B should see the changes to foo.txt (at least within a few
seconds, based on acregmin/acregmax settings

Additional info:

Client A mount options: (rw,noatime,nodiratime,bg,nfsvers=3,tcp,timeo=600,rsize=32768,wsize=32768,hard,intr)
Client B mount options: (ro,noatime,nodiratime,bg,nfsvers=3,tcp,timeo=600,rsize=32768,wsize=32768,hard,intr

How to run my scripts:

Sorry for the PHP scripts -- we do all our dev work (CLI and web) in PHP; it's just faster for me to use it than anything else.  I hope you have easy access to a php 5 interpreter.

Anyway, on client A, with the read/write mount, run writer.php.  On client B, with the readonly mount, run reader.php.  It will print out the current timestamp, followed by the timestamp it found inside the test file.  The writer updates the file about every 12 seconds, so the age found by the reader shouldn't ever be more than about 15 or 20 seconds old.  It may take a few cycles of writing before you see the problem.  If you have trouble reproducing, you might want to try running "reader.php -r", which spawns off 10 processes to read the file at intervals of 100 to 500 ms.  It seems to increase the likelihood of the cache getting stuck.
Comment 1 Jason Priebe 2009-03-05 13:12:25 EST
Apologies for the commands that precede the problem description.  Cut-and-paste error.  In theory, running the first command on client A and the second on client B will reproduce the problem, but I've only succeeded at that once.  But the scripts I've attached are very reliable at reproducing the problem.
Comment 2 Jeff Layton 2009-03-05 18:33:29 EST
You may just be running into the limitations of time granularity on the underlying filesystem. ext3 only has 1s timestamp resolution so if you make several changes to an inode within the same second the client may end up relying on cached data.

That said, there are some patches that went into 5.3 that may help this (the one for bug 364351 comes to mind). Do 5.3 kernels on the client fare any better here?

Another thing that may help is having the server serve out a filesystem with more granular timestamps. ext4 is a tech preview in 5.3 (I think) so you might want to test that out on a server as well.
Comment 3 Jason Priebe 2009-03-06 08:24:54 EST
Bug 364351 could be related.  I will try a 5.3 kernel.

But I'm not convinced on the time granularity.

My test scripts only update the file approximately once every 12 seconds.

And even if I missed a update for reasons of time resolution, the system should clear its attribute cache after acregmax seconds.  But what I'm seeing is that the stale data stays in the cache *indefinitely*.  Yesterday, in continued testing with the scripts I provided, I saw data being cached for over 5 hours old, while the file had been updated once every 12 seconds during those 5 hours.
Comment 4 Jeff Layton 2009-03-06 15:58:09 EST
Sorry, I don't think I was quite clear...

The time granularity that I'm talking about is the granularity on the parent directory of the original file. If the parent directory's mtime hasn't changed (in this case, the one holding /nfs/test), client B might not realize that its lookup info is out of date.

If that's the case then you might see this behavior. Forcing a change of mtime on  /mnt/test might be enough to unwedge things if this is the problem.
Comment 5 Jason Priebe 2009-03-06 16:18:14 EST
It is true that if I move /nfs/test/new/foo.txt to /nfs/test/foo.txt, the mtime on /nfs/test does not change.

However, the same can be said if I write directly to /nfs/test/foo.txt, and the problem does not manifest itself under these circumstances.

Regardless, would the client really check the modtime on the directory before opening the file for reading?  Sorry if I'm misunderstanding how the caching works.

I would think that the client would check its file attribute cache for foo.txt; if it's not expired, use the cached mtime to determine whether it can use the cached file data.  

If the file attribute cache is expired, it gets the mtime for foo.txt from the nfs server; if the file is changed, the client re-reads the data from the server.
Comment 6 Jeff Layton 2009-07-06 10:23:42 EDT
Sorry for the delay on this.

Every unique inode has a filehandle. The lookup operation is what translates the pathname into a filehandle. When the client has some cached lookup info for the items in a directory, it'll do a check to see whether that directory has changed before trusting it (subject to the attrcache timeouts for the directory). So when you say:

"I would think that the client would check its file attribute cache for foo.txt;
if it's not expired, use the cached mtime to determine whether it can use the
cached file data."

...you're correct. It does that -- the problem is that there is disagreement about what inode "foo.txt" actually is since the lookup information is potentially out of date.

A lot of this is based on speculation however. What you'll really need to do is get the client into this situation and then do some analysis to see what the problem actually is.

A simple way to confirm this might be to build a server with a filesystem with finer grained timestamps (e.g. ext4 or xfs). Then see if you can reproduce it on that filesystem. If you can't, then timestamp granularity is almost certainly the problem.

If it turns out that it's not timestamp granularity, then I'd suggest opening a support case so that we have someone to help work through the problem with you. Tracking down these sorts of cache consistency problems is unfortunately pretty complex business.
Comment 7 Sami Kerola 2009-08-11 07:00:35 EDT
I have configuration where NFS server is NetApp and two clients with.


The client behavior is similar with original report, with exception that the
file timestamps are quite accurate. File which is no longer there seems to
be still around. If I get directory entries file status always gets right. I
cannot reproduce similar behavior with mkdir/rmdir and ls on other node.

Both servers use same ntp servers, so time stamps are pretty reliable, if
you doubt this is the status of NTP.

     remote           refid      st t when poll reach   delay   offset  jitter
ntpserver1       foooooooooooo    2 u  102  128  377    0.534    0.221   0.053
ntpserver2       baaaaaaaaaaar    2 u   99  128  377    0.592    0.343   0.079

-- snip
client-1# date >  diu && mkdir foo && cat diu
Tue Aug 11 10:57:35 CEST 2009

client-2# stat diu && cat diu && date
  File: `diu'
  Size: 30              Blocks: 0          IO Block: 4096   regular file
Device: 1ch/28d Inode: 7929602     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1029/    sami)   Gid: ( 1029/    sami)
Access: 2009-08-11 10:57:29.769624000 +0200
Modify: 2009-08-11 10:57:29.774628000 +0200
Change: 2009-08-11 10:57:29.774628000 +0200
Tue Aug 11 10:57:29 CEST 2009
Tue Aug 11 10:57:32 CEST 2009

client-1# mv diu foo && date
Tue Aug 11 10:57:35 CEST 2009

client-2# stat diu && cat diu && date
  File: `diu'
  Size: 30              Blocks: 0          IO Block: 4096   regular file
Device: 1ch/28d Inode: 7929602     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1029/    sami)   Gid: ( 1029/    sami)
Access: 2009-08-11 10:57:29.769624000 +0200
Modify: 2009-08-11 10:57:29.774628000 +0200
Change: 2009-08-11 10:57:35.802984000 +0200
Tue Aug 11 10:57:29 CEST 2009
Tue Aug 11 10:57:37 CEST 2009
-- snip
Comment 8 Jeff Layton 2010-10-15 09:26:04 EDT
Again, sorry for the delay here. The behavior outlined in comment #7 is expected. Lookup results are generally cached for a period of time. For performance reasons, the client only periodically checks to see whether the parent directory might have changed before trusting the lookup cache.

You can tune the amount of time between these checks by setting the acdirmin/acdirmax options. Another thing you can do is play with the lookupcache= option in recent kernels/nfs-utils packages for RHEL5. In fact, I was easily able to reproduce the results outlined in comment #7, but when I set lookupcache=none I wasn't. If you require tight lookup cache consistency you may want to use that option, but be forewarned that it will mean an increase in on the wire traffic.

At this point, I'm not seeing anything that indicates a bug here, so I'm going to close this as NOTABUG. Please reopen if you feel that this is incorrect.
Comment 9 Tim Connors 2012-11-02 00:11:05 EDT
As stated above, the condition persists for *30 minutes, possibly indefinitely*.   We also see this condition in our production webserver/nfs client environment (https://access.redhat.com/support/cases/00559692), and see the old file persisting for hours at a time.  LONG PAST the 60 second acdirmax timeout.  This is *clearly a bug*.  If the file merely persisted for 60 seconds, we would be happy with that.  If the old file is returned for an indefinite period, or if -ESTALE is returned, that is clearly not suitable.

We had to use the lookupcache=none workaround, but that blew up in our faces a month ago when the website overloaded when we had 300 hits per second per server (prior to the workaround being applied, we were able to cope with 14000 hits a second without raising a sweat).  During the overload, we didn't appear to come near any limits (about 4 read IOPs per second on the nfs server!) until the moment it all blew up.  So the workaround is not a suitable workaround, and the real bug must be fixed instead.

I strongly suspect the bug is related to other caching bogosity in the nfs kernel client implementation: https://bugzilla.kernel.org/show_bug.cgi?id=12557
Comment 10 Stephen P. Schaefer 2013-08-23 11:18:36 EDT
I've just encountered these symptoms in Red Hat Enterprise Linux 5.7.  For us, an attempt to touch a file in the directory, denied by permissions, appears to have been enough to trigger a cache refresh.  This is hard to replicate, but if that latter is sufficient, doing that periodically is probably better than lookupcache=none (until you've got many directories to refresh).

Note You need to log in before you can comment on or make changes to this bug.