Bug 488780
Summary:        NFS read cache inconsistency
Product:        Red Hat Enterprise Linux 5
Component:      kernel
Version:        5.1
Hardware:       x86_64
OS:             Linux
Status:         CLOSED NOTABUG
Severity:       high
Priority:       low
Reporter:       Jason Priebe <jpriebe>
Assignee:       Jeff Layton <jlayton>
QA Contact:     Red Hat Kernel QE team <kernel-qe>
CC:             igeorgex, jlayton, kerolasa, rwheeler, spanjikk, sschaefer, steved, tim.w.connors
Doc Type:       Bug Fix
Last Closed:    2010-10-15 13:26:04 UTC
Apologies for the commands that precede the problem description; that was a cut-and-paste error. In theory, running the first command on client A and the second on client B will reproduce the problem, but I've only succeeded at that once. The scripts I've attached, however, are very reliable at reproducing it.

You may just be running into the limitations of timestamp granularity on the underlying filesystem. ext3 only has 1s timestamp resolution, so if you make several changes to an inode within the same second, the client may end up relying on cached data. That said, there are some patches that went into 5.3 that may help this (the one for bug 364351 comes to mind). Do 5.3 kernels on the client fare any better here? Another thing that may help is having the server serve out a filesystem with more granular timestamps; ext4 is a tech preview in 5.3 (I think), so you might want to test that out on a server as well.

Bug 364351 could be related. I will try a 5.3 kernel, but I'm not convinced by the time-granularity explanation. My test scripts only update the file approximately once every 12 seconds. And even if I missed an update because of timestamp resolution, the client should clear its attribute cache after acregmax seconds. What I'm seeing is that the stale data stays in the cache *indefinitely*. Yesterday, in continued testing with the scripts I provided, I saw cached data that was over 5 hours old, even though the file had been updated every 12 seconds during those 5 hours.

Sorry, I don't think I was quite clear. The time granularity that I'm talking about is the granularity on the parent directory of the original file. If the parent directory's mtime hasn't changed (in this case, the one holding /nfs/test), client B might not realize that its lookup info is out of date. If that's the case, then you might see this behavior. Forcing a change of mtime on /mnt/test might be enough to unwedge things if this is the problem.
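To check whether timestamp granularity could be in play, here is a small sketch (GNU stat assumed; run it against a file on the filesystem the server exports). On ext3 the fractional part of the mtime is always .000000000, because only whole seconds are stored; ext4 and xfs report real sub-second values.

```shell
# Create a scratch file and show its full-resolution mtime.
# A trailing .000000000 everywhere means 1s granularity (ext3).
f=$(mktemp)
stat -c 'mtime: %y' "$f"
rm -f "$f"
```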
It is true that if I move /nfs/test/new/foo.txt to /nfs/test/foo.txt, the mtime on /nfs/test does not change. However, the same can be said if I write directly to /nfs/test/foo.txt, and the problem does not manifest itself under those circumstances. Regardless, would the client really check the mtime on the directory before opening the file for reading? Sorry if I'm misunderstanding how the caching works. I would think that the client would check its file attribute cache for foo.txt; if it's not expired, use the cached mtime to determine whether it can use the cached file data. If the file attribute cache is expired, it gets the mtime for foo.txt from the NFS server; if the file has changed, the client re-reads the data from the server.

Sorry for the delay on this. Every unique inode has a filehandle, and the lookup operation is what translates a pathname into a filehandle. When the client has cached lookup info for the items in a directory, it checks whether that directory has changed before trusting it (subject to the attribute-cache timeouts for the directory). So when you say, "I would think that the client would check its file attribute cache for foo.txt; if it's not expired, use the cached mtime to determine whether it can use the cached file data," you're correct. It does that; the problem is that there is disagreement about which inode "foo.txt" actually is, since the lookup information is potentially out of date.

A lot of this is based on speculation, however. What you'll really need to do is get the client into this situation and then do some analysis to see what the problem actually is. A simple way to confirm this might be to build a server with a filesystem with finer-grained timestamps (e.g. ext4 or xfs) and see whether you can reproduce it there. If you can't, then timestamp granularity is almost certainly the problem.
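The distinction between the two write patterns can be seen on any local filesystem. This is a sketch (temporary paths stand in for the NFS mount): an in-place write keeps the same inode, so a cached filehandle still points at live data, while a rename from another directory substitutes a different inode behind the same pathname.

```shell
d=$(mktemp -d); mkdir "$d/new"
echo v1 > "$d/f"
ino_before=$(stat -c %i "$d/f")
echo v2 > "$d/f"                     # in-place write: inode unchanged
ino_inplace=$(stat -c %i "$d/f")
echo v3 > "$d/new/f"
mv "$d/new/f" "$d/f"                 # rename over the file: different inode
ino_renamed=$(stat -c %i "$d/f")
echo "$ino_before $ino_inplace $ino_renamed"
rm -rf "$d"
```

The first two inode numbers match; the third differs. An NFS client still holding the old filehandle after the rename is reading the unlinked inode, not the file now at that pathname.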
If it turns out that it's not timestamp granularity, then I'd suggest opening a support case so that we have someone to help work through the problem with you. Tracking down these sorts of cache consistency problems is unfortunately pretty complex business.

I have a configuration where the NFS server is a NetApp, and two clients with:

kernel-2.6.18-53.el5
nfs-utils-1.0.9-24.el5
nfs-utils-lib-1.0.8-7.2.z2

The client behavior is similar to the original report, with the exception that the file timestamps are quite accurate. A file which is no longer there seems to still be around; if I get the directory entries, the file status always comes right. I cannot reproduce similar behavior with mkdir/rmdir and ls on the other node. Both servers use the same NTP servers, so the timestamps are pretty reliable; in case you doubt this, here is the status of NTP:

```
     remote      refid          st t  when poll reach  delay  offset jitter
==============================================================================
ntpserver1       foooooooooooo   2 u   102  128   377  0.534   0.221  0.053
ntpserver2       baaaaaaaaaaar   2 u    99  128   377  0.592   0.343  0.079
```

```
client-1# date > diu && mkdir foo && cat diu
Tue Aug 11 10:57:35 CEST 2009

client-2# stat diu && cat diu && date
  File: `diu'
  Size: 30        Blocks: 0    IO Block: 4096   regular file
Device: 1ch/28d   Inode: 7929602   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1029/ sami)   Gid: ( 1029/ sami)
Access: 2009-08-11 10:57:29.769624000 +0200
Modify: 2009-08-11 10:57:29.774628000 +0200
Change: 2009-08-11 10:57:29.774628000 +0200
Tue Aug 11 10:57:29 CEST 2009
Tue Aug 11 10:57:32 CEST 2009

client-1# mv diu foo && date
Tue Aug 11 10:57:35 CEST 2009

client-2# stat diu && cat diu && date
  File: `diu'
  Size: 30        Blocks: 0    IO Block: 4096   regular file
Device: 1ch/28d   Inode: 7929602   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1029/ sami)   Gid: ( 1029/ sami)
Access: 2009-08-11 10:57:29.769624000 +0200
Modify: 2009-08-11 10:57:29.774628000 +0200
Change: 2009-08-11 10:57:35.802984000 +0200
Tue Aug 11 10:57:29 CEST 2009
Tue Aug 11 10:57:37 CEST 2009
```
Again, sorry for the delay here. The behavior outlined in comment #7 is expected. Lookup results are generally cached for a period of time; for performance reasons, the client only periodically checks whether the parent directory might have changed before trusting the lookup cache. You can tune the amount of time between these checks by setting the acdirmin/acdirmax options. Another thing you can do is play with the lookupcache= option in recent kernel/nfs-utils packages for RHEL5. In fact, I was easily able to reproduce the results outlined in comment #7, but with lookupcache=none set I wasn't. If you require tight lookup cache consistency you may want to use that option, but be forewarned that it will mean an increase in on-the-wire traffic. At this point, I'm not seeing anything that indicates a bug here, so I'm going to close this as NOTABUG. Please reopen if you feel that this is incorrect.

As stated above, the condition persists for *30 minutes, possibly indefinitely*. We also see this condition in our production webserver/NFS-client environment (https://access.redhat.com/support/cases/00559692), with the old file persisting for hours at a time, *long past* the 60-second acdirmax timeout. This is *clearly a bug*. If the file merely persisted for 60 seconds, we would be happy with that; if the old file is returned for an indefinite period, or if -ESTALE is returned, that is clearly not suitable. We had to use the lookupcache=none workaround, but that blew up in our faces a month ago when the website overloaded at 300 hits per second per server (prior to the workaround being applied, we were able to cope with 14000 hits a second without raising a sweat). During the overload, we didn't appear to come near any limits (about 4 read IOPS per second on the NFS server!) until the moment it all blew up. So lookupcache=none is not a suitable workaround, and the real bug must be fixed instead.
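For reference, the two tuning approaches discussed above could look like this as fstab entries. This is a hypothetical sketch: the server name, export path, and mount point are placeholders, and the acdirmin/acdirmax values are illustrative, not recommendations.

```
# Keep the lookup cache, but revalidate directory attributes more often:
server:/export  /nfs  nfs  nfsvers=3,tcp,acdirmin=1,acdirmax=5  0 0

# Or disable lookup caching entirely (tight consistency, more wire traffic):
server:/export  /nfs  nfs  nfsvers=3,tcp,lookupcache=none  0 0
```

The trade-off is exactly the one described in the comments: lookupcache=none gives correctness at the cost of an on-the-wire LOOKUP per open, which is what overwhelmed the production environment above.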
I strongly suspect the bug is related to other caching bogosity in the NFS kernel client implementation: https://bugzilla.kernel.org/show_bug.cgi?id=12557

I've just encountered these symptoms in Red Hat Enterprise Linux 5.7. For us, an attempt to touch a file in the directory, denied by permissions, appears to have been enough to trigger a cache refresh. This is hard to replicate, but if that really is sufficient, doing it periodically is probably better than lookupcache=none (unless you've got many directories to refresh).
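A sketch of that periodic-refresh idea, for comparison with lookupcache=none: listing each directory prompts the client to revalidate its cached lookup data for it. The helper name and the directory list are hypothetical; this is not a fix, just a cheaper nudge than disabling the lookup cache.

```shell
# Nudge the client's lookup cache by listing each directory.
# Intended to be run periodically (e.g. from cron every 30-60s).
refresh_dirs() {
    for dir in "$@"; do
        ls "$dir" > /dev/null 2>&1
    done
    return 0
}

refresh_dirs /nfs/test /nfs/other   # placeholder mount points
```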
Created attachment 334169 [details]: command-line PHP scripts to help reproduce the problem

```
echo bar > /nfs/test/new/f; mv /nfs/test/new/f /nfs/test/f; echo foo > /nfs/test/new/f; sleep 5; mv /nfs/test/new/f /nfs/test/f

until grep -q foo /nfs/test/f; do echo -n .; sleep 1; done
```

Description of problem:
Under certain conditions, if one client writes to a file in directory 1 and then moves it into directory 2, a second client will not see changes to the file in directory 2. It will continue to use cached file data for a long period of time (upwards of 30 minutes, possibly indefinitely). Running 'ls' on the file in directory 2 does not cause the cache to refresh itself; however, running 'ls' on directory 2 itself *usually* will.

Note that this seems to be due to a race condition of some sort. It is hard to replicate unless you have a large number of processes accessing the file in directory 2. It is not impossible to replicate with just one process reading the file once per second, but it may take a lot of trials to get it to happen. Further, some machines seem more prone to this than others; it's almost as if something stateful is going on. Once a file has been "stuck" in the cache on a machine, that machine seems to reproduce the problem faster, even after you've deleted the file and recreated it.

Version-Release number of selected component (if applicable):
RHEL 5.1 clients:
  kernel 2.6.18-53.el5
  nfs-utils-1.0.9-24.el5
  nfs-utils-lib-1.0.8-7.2.z2
RHEL 5.2 server:
  2.6.18-92.el5
  nfs-utils-1.0.9-40.el5
  nfs-utils-lib-1.0.8-7.2.z2

How reproducible:
Fairly reproducible with attached scripts.

Steps to Reproduce:
1. client A creates a file /nfs/test/new/foo.txt
2. client B reads existing file /nfs/test/foo.txt
3. client A moves /nfs/test/new/foo.txt to /nfs/test/foo.txt
4. client B reads file /nfs/test/foo.txt (and now gets stale data)

Again, it is not quite this simple; I've only once been able to get this to happen using only these simple steps. Normally, I need to run a script that constantly updates the file contents.

Actual results:
In step 4, client B does not see the changes to foo.txt made by client A.

Expected results:
In step 4, client B should see the changes to foo.txt (at least within a few seconds, based on acregmin/acregmax settings).

Additional info:
Client A mount options: (rw,noatime,nodiratime,bg,nfsvers=3,tcp,timeo=600,rsize=32768,wsize=32768,hard,intr)
Client B mount options: (ro,noatime,nodiratime,bg,nfsvers=3,tcp,timeo=600,rsize=32768,wsize=32768,hard,intr)

How to run my scripts:
Sorry for the PHP scripts; we do all our dev work (CLI and web) in PHP, so it's just faster for me to use than anything else. I hope you have easy access to a PHP 5 interpreter. On client A, with the read/write mount, run writer.php. On client B, with the read-only mount, run reader.php. It will print out the current timestamp, followed by the timestamp it found inside the test file. The writer updates the file about every 12 seconds, so the age found by the reader should never be more than about 15 or 20 seconds. It may take a few cycles of writing before you see the problem. If you have trouble reproducing, you might want to try running "reader.php -r", which spawns 10 processes that read the file at intervals of 100 to 500 ms. It seems to increase the likelihood of the cache getting stuck.
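The write-then-rename sequence from the steps above can be sketched in shell against a local filesystem (temporary paths stand in for the NFS mount; the attached PHP scripts do the same thing in a loop). Locally the reader always sees the new contents; over NFS, client B's stale lookup cache is what can keep returning the old data.

```shell
d=$(mktemp -d)
mkdir "$d/new"
echo "old contents" > "$d/f"       # the version client B has cached
echo "new contents" > "$d/new/f"   # client A writes the update...
mv "$d/new/f" "$d/f"               # ...then renames it into place
cat "$d/f"                         # prints "new contents" locally
rm -rf "$d"
```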