Bug 170423
| Summary: | Cache invalidation bug in nfs v3 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Peter K <cap> |
| Component: | kernel | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Priority: | medium |
| Version: | 4.0 | CC: | aaron, herrold, jay.hilliard, jbaron, nixon, staubach |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | RHSA-2006-0132 | Doc Type: | Bug Fix |
| Last Closed: | 2006-03-07 20:20:01 UTC | Bug Blocks: | 168429 |

Description
Peter K
2005-10-11 16:00:53 UTC
There is now a fix from Trond: http://lkml.org/lkml/2005/10/13/142
I have just verified that it fixes the problem for us.

Created attachment 120056 [details]
Proposed Upstream Patch
I spent most of today trying to reproduce this problem and was unable to... at least with the scenario described in the first Bug Comment. Note the caching in the 2.6.14 kernel is much different than in the RHEL4 kernel... so this might have been something that was introduced in a later kernel.... So unless I'm able to find a reproducer, I'll have to mark this bug as NOTABUG.

I'm really surprised; you are the first person I've heard of who is unable to reproduce this. I've (as I wrote initially) reproduced it on 2.6.9-5.0.5smp, 2.6.9-11smp (both x86_64) and kernel.org 2.6.13.2. With the fix (that Trond found) at least 2.6.13.2 works OK. If I remember correctly, people have also reproduced it on a bunch of FC and Debian machines.

This bug is not just a minor annoyance to us; it makes it impossible for the climate model CCSM to be run correctly (without modifications) on our cluster. We also have a Red Hat support issue depending on this bugzilla.

To make sure that the bug is still there, I did a new test (on two machines in a cluster). This is the copy-paste-exact output (including timestamps) for both machines:

[root@n9 test]# uname -r
2.6.9-5.0.5.ELsmp
[root@n9 test]# pwd
/home/test
[root@n9 test]# mount | grep home
h1:/tornado_home on /home type nfs (rw,nosuid,nodev,hard,tcp,addr=192.168.11.221)
[root@n9 test]# grep home /etc/fstab
h1:/tornado_home /home nfs defaults,nosuid,nodev,hard,tcp 0 0
[root@n9 test]# date ; echo foo > file
Thu Nov 10 09:44:04 CET 2005
[root@n9 test]# date ; echo fxx > file
Thu Nov 10 09:44:25 CET 2005
[root@n9 test]# date ; touch .
Thu Nov 10 09:44:34 CET 2005
[root@n9 test]# date ; cat file
Thu Nov 10 09:45:08 CET 2005
fxx
[root@n9 test]#

[root@n10 test]# uname -r
2.6.9-5.0.5.ELsmp
[root@n10 test]# pwd
/home/test
[root@n10 test]# mount | grep home
h1:/tornado_home on /home type nfs (rw,nosuid,nodev,hard,tcp,addr=192.168.11.221)
[root@n10 test]# grep home /etc/fstab
h1:/tornado_home /home nfs defaults,nosuid,nodev,hard,tcp 0 0
[root@n10 test]# date ; cat file
Thu Nov 10 09:44:10 CET 2005
foo
[root@n10 test]# date ; touch file
Thu Nov 10 09:44:43 CET 2005
[root@n10 test]# date ; cat file
Thu Nov 10 09:44:49 CET 2005
foo
[root@n10 test]# date ; cat file
Thu Nov 10 09:45:12 CET 2005
foo
[root@n10 test]# date ; cat file
Thu Nov 10 09:46:52 CET 2005
foo
[root@n10 test]#

-== Summary as of 17.00 CET 20051110 ==-
* I have noticed that the bug only bites sometimes
* I can still reproduce it on all machines though (including 2.6.9-22.0.1)
* how idle the hosts are may make a difference
* shouldn't you look at the upstream patch regardless, since Trond considers it a bug and the fix seems to handle a forgotten case

Somewhat chaotic information follows, including a fully automatic way to reproduce:

After increasing the number of machines I've tested and trying different kernels, I've now noticed that it's not 100% reproducible... it seems to be a lot easier to reproduce if the machine is idle and you have two xterms (one on each host). If I script the process (using ssh) it's harder to reproduce, and on two 2.6.9-22.0.1 machines it's a lot harder (but always possible). It might be that the -22.0.1 kernel is better, or only the fact that those two aren't 100% idle.

Here's the script I use; it has so far never failed to reproduce the problem (but it's usually only 1 in 3 that goes wrong when automated like this). It prints out foo followed by the new value fxx if everything was OK, and foo followed by foo if the bug hits.

[cap@tornado cap]$ uname -r
2.6.9-22.0.1.ELsmp
[cap@tornado cap]$ mount | grep rossby3
d2:/nobackup/rossby3 on /nobackup/rossby3 type nfs (rw,nosuid,nodev,hard,tcp,addr=192.168.11.232)
[cap@tornado cap]$ grep rossby3 /etc/fstab
d2:/nobackup/rossby3 /nobackup/rossby3 nfs defaults,nosuid,nodev,hard,tcp 0 0
[cap@tornado cap]$ for i in 1 2 3 1 2 3 1 2 3; do echo foo > file ; sleep $i ;echo -n "$i "; ssh dunder cat /nobackup/rossby3/cap/file ; sleep $i ; touch . ; sleep $i ; echo fxx > file ; sleep $i ;echo -n "$i "; ssh dunder "touch /nobackup/rossby3/cap/file ; sleep $i ; cat /nobackup/rossby3/cap/file"; sleep $i; done
1 foo
1 fxx
2 foo
2 fxx
3 foo
3 fxx
1 foo
1 fxx
2 foo
2 fxx
3 foo
3 foo
1 foo
1 fxx
2 foo
2 fxx
3 foo
3 foo
[cap@tornado cap]$ cat file
fxx
[cap@tornado cap]$ ssh dunder cat /nobackup/rossby3/cap/file
foo
[cap@tornado cap]$ ssh dunder cat /nobackup/rossby3/cap/file
foo
[cap@tornado cap]$ ssh dunder cat /nobackup/rossby3/cap/file
foo
[cap@tornado cap]$ touch file
[cap@tornado cap]$ ssh dunder cat /nobackup/rossby3/cap/file
fxx
[cap@tornado cap]$

Note how "file" stays foo after a loop like this until it's touched (on the writing client). I know that this isn't very nice and clean, but it's at least fully automated (you'll have to change the hostname and filename in the loop, though).

Unfortunately I'll go on vacation now (5 weeks in Australia =) so I can't follow up much more on this. I'll see if a colleague of mine can "take over".

Created attachment 120884 [details]
ethereal trace of a working test run.
I was using later kernels (2.6.9-22) on both my clients so
I backed off to 2.6.9-5.0.5.ELsmp kernels and I'm still not
able to reproduce this issue... Here is what I was doing:
pro1$ uname -r
2.6.9-5.0.5.ELsmp
pro1$ cd /mnt/xeon5/home/tmp
pro1$ date ; echo foo > file
Thu Nov 10 11:36:19 EST 2005
pro1$ date ; echo fxx > file ; touch .
Thu Nov 10 11:36:34 EST 2005
pro5$ uname -r
2.6.9-5.0.5.ELsmp
pro5$ cd /mnt/xeon5/home/tmp
pro5$ date ; cat file
Thu Nov 10 11:36:24 EST 2005
foo
pro5$ date ; cat file
Thu Nov 10 11:36:39 EST 2005
fxx
I also made sure the clocks on both clients were sync-ed via ntpdate.
Now I'm not ready to give up on this yet... so I've attached a
bzip2-ed ethereal trace (captured on the server so both clients
could be traced) of a working run. If possible, I would like you
to do the same so they can be compared....
btw, what server are you using?
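
For anyone attempting a side-by-side comparison, the two-client scenario from the earlier comments can be scripted so that every run performs the same steps with the same delays. The following is only a sketch: the function names (`read_remote`, `run_once`) and the `READER_HOST` and `TEST_DIR` variables are assumptions made for illustration and are not part of any tool mentioned in this report. It assumes passwordless ssh from the writing client to the reading client, and that both clients mount the same export at the same path.

```shell
#!/bin/sh
# Sketch of the two-client reproducer. Run it on the writing client.
# READER_HOST and TEST_DIR are placeholders; set them for your own setup.
READER_HOST="${READER_HOST:-reader.example}"
TEST_DIR="${TEST_DIR:-/path/to/nfs/testdir}"

# Read the test file as the second client sees it.
# With "touch <delay>" as arguments, the reader touches the file first,
# waits, and then reads it, as in the original loop.
read_remote() {
    if [ "$1" = "touch" ]; then
        ssh "$READER_HOST" "touch '$TEST_DIR/file'; sleep $2; cat '$TEST_DIR/file'"
    else
        ssh "$READER_HOST" "cat '$TEST_DIR/file'"
    fi
}

# One iteration with delay $1 (seconds).
# Prints "ok" or "stale ..." and returns 1 when the reader saw old data.
run_once() {
    delay="$1"
    echo foo > "$TEST_DIR/file"
    sleep "$delay"
    first=$(read_remote)                  # reader now caches "foo"
    sleep "$delay"
    touch "$TEST_DIR"                     # "touch ." on the writing client
    sleep "$delay"
    echo fxx > "$TEST_DIR/file"           # same length, new content
    sleep "$delay"
    second=$(read_remote touch "$delay")  # should be "fxx"
    if [ "$first" = "foo" ] && [ "$second" = "fxx" ]; then
        echo "ok"
    else
        echo "stale (first=$first second=$second)"
        return 1
    fi
}
```

A run might then look like `for i in 1 2 3 1 2 3; do run_once "$i"; done`. Because the problem reportedly shows up only in some iterations, a single "ok" does not show that a given kernel is unaffected.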
I'm also able to reproduce this on 2.6.9-22. A patch for this kernel is welcome.

Unfortunately the upstream patch in Comment #2 needs 8 other upstream patches for it to apply cleanly... which I'm not against doing since, one, I've already done the work and, two, it would move the RHEL4 cache code closer to upstream (for better or worse ;-) ). But since I can't reproduce the problem, I have no way of verifying whether these patches actually fix the bug... So would anybody be willing to download a pre-U3 test kernel from my people page to see, one, whether there are any regressions and, two, whether it actually fixes the caching bug? If so, I could probably have something ready by later tonight or early tomorrow (depending on our build system).

Regarding the server we use: I have tried 2.6.9-5.0.5smp and 2.6.9-11smp, and now we are running 2.6.13.2. If you have a test kernel for me to try, I'll test it (if there's time before I leave); otherwise I'll try to get someone else to try it out.

RHEL4 U3 test kernels that have the patch in Comment #2, as well as a number of other patches that are needed for that one patch, are available in:
http://people.redhat.com/steved/bz170423/
Please let me know asap if these fix the caching issue you're seeing... tia...

This certainly fixes the problem. I was able to duplicate the original problem on Network Appliance, Solaris 8, Solaris 9, and Mac OS X NFS mounts. The patched kernel works! Thanks!

Cool.... Thank you very much for your effort... It's definitely appreciated!!

Hi, just wanted to mention that we definitely see this in a real workload here on RHEL4 ES (2.6.9-22.0.1.ELsmp). I'll try a patched kernel ASAP. Thanks!

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2006-0132.html

Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'RHEL 4 U4'

This event sent from IssueTracker by uthomas
issue 81774
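
When re-running the reproducer loop against a patched kernel, it may help to tally the results rather than eyeball them, since the failure was reported to hit only some iterations. A minimal sketch, assuming the loop's output has been captured one read per line as `<delay> <value>` (two lines per iteration: the first read, then the second read); `tally_runs` is a hypothetical helper name, not part of any existing tool.

```shell
#!/bin/sh
# Tally reproducer output read from stdin.
# Input: "<delay> <value>" lines, two per iteration (first read, second read).
# An iteration counts as stale when its second read is not "fxx".
tally_runs() {
    awk '
        NR % 2 == 1 { next }                       # first read of the pair
        { total++; if ($2 != "fxx") stale++ }      # second read of the pair
        END { printf "%d of %d iterations stale\n", stale, total }
    '
}
```

For example, `printf '1 foo\n1 fxx\n3 foo\n3 foo\n' | tally_runs` prints `1 of 2 iterations stale`. Fed the nine-iteration output quoted earlier in this report, it would count two stale iterations.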