Bug 144556
Summary: | NFS random "No such file or directory" | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Aaron C. de Bruyn <bugzilla>
Component: | nfs-utils | Assignee: | Steve Dickson <steved>
Status: | CLOSED DUPLICATE | Last Closed: | 2005-05-07 22:38:15 UTC
Severity: | medium | Priority: | medium
Version: | 3 | Doc Type: | Bug Fix
Hardware: | i386 | OS: | Linux
CC: | alext, ash, astrand, bugzilla, dad, grubba, maurizio.antillon, menscher, pcoene1, rdieter, xtat | |

Description
Aaron C. de Bruyn
2005-01-08 08:38:41 UTC
Would it be possible to get an ethereal trace of the problem? Also, is there anything of interest in /var/log/messages?

Sorry--I should have included messages in my original entry. During boot, the following gets logged related to nfs:

    Jan 5 21:08:32 elorg nfslock: rpc.statd startup succeeded
    Jan 5 21:08:57 elorg nfs: Starting NFS services: succeeded
    Jan 5 21:08:57 elorg nfs: rpc.rquotad startup succeeded
    Jan 5 21:08:58 elorg kernel: Installing knfsd (copyright (C) 1996 okir.de).
    Jan 5 21:08:58 elorg nfs: rpc.nfsd startup succeeded
    Jan 5 21:08:58 elorg nfs: rpc.mountd startup succeeded
    Jan 5 21:08:58 elorg rpc.idmapd: nfsdreopen: Opening '/proc/net/rpc/nfs4.idtoname/channel' failed: errno 2 (No such file or directory)
    Jan 5 21:08:58 elorg rpc.idmapd: nfsdreopen: Opening '/proc/net/rpc/nfs4.nametoid/channel' failed: errno 2 (No such file or directory)

The only nfs4* files I have in /proc/net/rpc are:

    nfsd
    nfsd.export/
    nfsd.fh/

Other than the errors above, RPC isn't throwing any errors. Unless anyone has another suggestion, I'm going to try running 'rpc.idmapd -F -vvv' and watch whether it spits out any interesting debug messages. -A

Sorry--I meant 'rpc.idmapd -f -vvv'

We are also experiencing this problem between an F2 server and an F2 client.

Hmm... it appears rpc.mountd is not running... Could you do the following and post the results?

    service rpcidmapd stop
    service nfs start

We installed Enterprise 3 WS (2.4.21-20.ELsmp) and began seeing "No such file" randomly when a POP3 process tries to open a message on an NFS mount:

    fd = open(m->filename, O_RDONLY);

Error in the log:

    Feb 10 14:23:53 mail4 tpop3d[26708]: maildir_new: scanned maildir /var/mail/userhidden/Maildir (1 messages) in 0.001s
    Feb 10 14:23:53 mail4 tpop3d[26708]: maildir_send_message: open(new/1108074103.H973837P26271.mail4.hctc.com,S=2239): No such file or directory

However, the file does exist:

    /var/mail/userhidden/Maildir/new:
    -rw------- 1 userhidden 400 2239 Feb 10 12:18 1108074103.H973837P26271.mail4.hctc.com,S=2239

A few moments later, the file is opened fine:

    Feb 10 14:26:09 mail4 tpop3d[27068]: maildir_send_message: sending message 1 (new/1108074103.H973837P26271.mail4.hctc.com,S=2239) size 2239 bytes

The NFS server is 2.4.21-9.ELsmp and has 3 other mail servers connected to it, also running 2.4.21-9.ELsmp with an identical tpop3d, which do not exhibit this problem. NFS is version 3, over TCP. Clients mount using /etc/fstab with:

    nfs:/var/mail /var/mail nfs rw,async,hard,intr 0 0

Out of 2392 message files opened in 1 hour, 90 resulted in "No such file or directory" when trying to open. The errors are not consecutive and can affect any user. We haven't been able to reproduce this yet using "cp" (we can copy large groups of files from /var/mail over NFS without error).

I am having similar problems with NFS failures. This command is a good test:

    cd /net/mach2; find . -mount 2> /dev/null

Instead of silence, I see various parts of the root filesystem listed as "No such file or directory", with ./usr/share/pixmaps, ./usr/share/info, ./usr/include occurring frequently. This raises merry hell with my backup scheme, which depends on first listing all the files on a remote machine.
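The `find . -mount` test above can also be scripted so that the failing paths and their error codes are collected rather than just printed. The following is a minimal sketch, not something posted in this thread: the function name `find_missing` is made up for illustration, and it assumes the NFS tree is already mounted at the path passed in. Like `find -mount`, it does not descend into other filesystems.

```python
import os
import stat


def find_missing(root):
    """Walk `root` on a single filesystem and collect paths that fail.

    Returns a list of (path, errno) tuples for every directory that could
    not be listed and every entry that was listed but could not be stat'ed
    (for example ENOENT or ESTALE on an NFS mount).
    """
    failures = []
    root_dev = os.lstat(root).st_dev  # used to stay on one filesystem
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            names = os.listdir(current)
        except OSError as exc:
            failures.append((current, exc.errno))
            continue
        for name in names:
            path = os.path.join(current, name)
            try:
                st = os.lstat(path)
            except OSError as exc:
                # The name was returned by the directory listing,
                # but stat on it failed.
                failures.append((path, exc.errno))
                continue
            if stat.S_ISDIR(st.st_mode) and st.st_dev == root_dev:
                stack.append(path)
    return failures
```

Calling `find_missing("/net/mach2")` on a healthy mount should return an empty list. Running it twice and comparing the results would show whether the set of failing directories changes from run to run.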
I have four virtually identically provisioned machines, running Fedora Core 3, with all current updates. Here are some of the packages that are probably relevant:

    # rpmarch kernel nfs-utils portmap am-utils
    kernel-2.6.10-1.760_FC3 i686
    kernel-2.6.10-1.766_FC3 i686
    kernel-2.6.10-1.770_FC3 i686
    nfs-utils-1.0.6-52 i386
    portmap-4.0-63 i386
    am-utils-6.0.9-10 i386

I use NFS with amd and not autofs, so remote filesystems are automounted whenever /net/mach2 is referenced. All four machines can mount all the others' filesystems. My son ran the above test on his system and experienced NO such errors. Since he is running all the same rpms listed above except the kernel, I reverted my machines to kernels 2.6.10-1.766_FC3 and 2.6.10-1.760_FC3, but still had the same errors.

NFS filesystems are mounted with the options in /etc/amd.net:

    # cat /etc/amd.net
    /defaults fs:=${autodir}/${rhost}/root/${rfs};opts:=nosuid,nodev\
    ,rw,hard,intr,rsize=8192,wsize=8192
    * rhost:=${key};type:=host;rfs:=/

I've tried changing the block sizes to 4096 and 32768. This changes some of the errors during the 'find', but doesn't eliminate them. I am reasonably sure these errors did not occur in the past, when I created my backup scheme, but I cannot know accurately when things went bad. Pending resolution of this serious NFS bug, I am using a workaround that does the find on the remote system:

    ssh mach2 "cd /; find . -mount"

Appears fixed in RHEL 4. Cannot reproduce.

I am delighted to hear that RHEL 4 seems fixed. But this bugzilla report is against FC3, which is supposed to feed good stuff into RHEL, not the reverse. Can anyone confirm this misbehaviour in FC3, and propose a fix for FC3 (or FC4)? Is there a mechanism to feed corrections from RHEL into Fedora?

I too have this problem. I run tar to back up a remote FC3 machine to tape. While backing up, I get random "cannot stat" errors on subdirectories.
This makes backing up to a single tape drive rather difficult ;) I ran strace, and all I found was that stat() was coming back with ENOENT randomly on some directories during the tar. gtar behaves the same way. The directories change from run to run...

We are also experiencing this NFS filesystem problem on FC1/2/3 and Solaris 8 clients using an FC3 server. I can reproduce the random "No such file or directory" messages when running "find . -mount > /dev/null". But we are also observing randomly disappearing directories under normal usage, i.e. you "cd" out of a directory you've been working in, and when you "cd" back, it's vanished! This includes users' home directories. Missing directories can be seen using "ls", but not by "ls -l". Our workaround has been to "remove" the missing directory using "rmdir". The command correctly returns "Directory not empty", and after that the directory is visible again. The following demonstrates this:

    [root@aftermath delly]# find . -mount > /dev/null
    find: ./usr/share/pixmaps/gaim/smileys/default: No such file or directory
    find: ./usr/share/info: No such file or directory
    find: ./usr/share/apps/ksgmltools2/docbook/xsl/params: No such file or directory
    [root@aftermath delly]# cd ./usr/share/pixmaps/gaim/smileys/
    [root@aftermath smileys]# ls
    default none
    [root@aftermath smileys]# ls -l
    ls: default: No such file or directory
    total 8
    drwxr-xr-x 2 root root 4096 Mar 10 14:05 none
    [root@aftermath smileys]# cd default
    [root@aftermath default]# ls -l
    ls: .: Stale NFS file handle
    [root@aftermath default]# cd ..
    [root@aftermath smileys]# ls -l
    ls: default: No such file or directory
    total 8
    drwxr-xr-x 2 root root 4096 Mar 10 14:05 none
    [root@aftermath smileys]# rmdir default
    rmdir: `default': Directory not empty
    [root@aftermath smileys]# ls -l
    total 24
    drwxr-xr-x 2 root root 12288 Mar 10 14:05 default
    drwxr-xr-x 2 root root 4096 Mar 10 14:05 none
    [root@aftermath smileys]# cd default
    [root@aftermath default]# ls -l
    total 772
    -rw-r--r-- 1 root root 305 Mar 7 02:39 angel.png
    [...]

I have observed this behaviour on two separate FC3 servers, but a third FC1 server appears unaffected. Also, the FC3 server replaced a Solaris 8 server which worked fine. The FC3 servers are fully up2date, but I have tried experimenting on one system with different kernels (original FC3 kernel and rawhide kernel) and also installed the rawhide "nfs-utils". This made no change to the behaviour.

Export options used: rw,sync,no_root_squash,insecure
Mount options used: rw,sync,tcp

One thing I should add is that we're using yp to manage user accounts, but not to automatically mount users' home directories (they should always be mounted on all clients).

Cheers
Robert

Please add the "no_subtree_check" export option to see if that helps.

Hi, I tried the "no_subtree_check" export option and the behaviour remains the same.

Cheers
Robert

I also tried the "no_subtree_check" export option, and 'find' on a remote nfs-mounted filesystem still shows errors. I suspect this may be a timing issue. Of my four machines, one is a pretty fast 1200 MHz Athlon; the slowest is a Pentium II 200 MHz. When the remote machine is the fast one, no errors are reported (twice):

    [root@datant /net/datium] # time find . -mount > /dev/null
    real 3m36.491s
    user 0m4.632s
    sys 0m54.016s
    [root@datant /net/datium] # time find . -mount > /dev/null
    real 1m42.360s
    user 0m4.947s
    sys 0m49.730s

With those same machines in reverse (the remote machine is the slow one) there are errors, and different ones in the two tests:

    [root@datium /net/datant] # time find . -mount > /dev/null
    find: ./usr/share/info: No such file or directory
    find: ./usr/share/doc: No such file or directory
    real 6m34.467s
    user 0m0.719s
    sys 0m7.126s
    [root@datium /net/datant] # time find . -mount > /dev/null
    find: ./usr/share/pixmaps/nautilus/Bluecurve: No such file or directory
    find: ./usr/share/pixmaps/gaim/smileys/default: No such file or directory
    real 4m46.177s
    user 0m0.628s
    sys 0m5.438s

I don't know if this will help, but I did a find twice in a row from the same client. The first one failed on about 8 directories, the second one on zero. Could this be related to caching? I also have fast clients and a slower server. I can duplicate at will if you need me to try some tests or some patch.

Any progress on this, or any way I can help? My new network uses this for a backup strategy, so I'm running without a net. Is there a short-term outlook, or work I can do to help? If not, perhaps I need to buy more backup hardware.

I've just noticed that all the "missing" directories have a size greater than or equal to 12288 bytes. On my system, only 159 of the total 29,387 directories are greater than or equal to 12K, and the last time I ran "find . -mount > /dev/null", 115 of them disappeared. I've tried mounting with:

    rsize=4096,wsize=4096
    rsize=8192,wsize=8192
    rsize=32768,wsize=32768

None of the above changed the behaviour.

I see this is still in "need info" state. What info is needed? I'll do whatever I can to augment the data people have provided.

A yum update on 4/16 provided many new package versions, including kernel.i686 2.6.11-1.14_FC3. Hoping that this might have improved the NFS problem, I ran tests with three machines: a fast one, a slow one, and a medium-speed laptop. The 'find . -mount > /dev/null' test can be run 6 ways with three machines: A -> B, B -> C, C -> A, and the reverse of each. I wrote a script to loop over these 6 tests 20 times. Of these 120 tests, which ran for 14 hr 18 min, all were error-free except 9.
The 9 tests with errors were all A -> B, where A is the fast Athlon and B is the slow Pentium. The errors occurred on the 12th through 20th cycles, and the "missing directories" differed in number: 1, 9, 4, 5, etc. Some of the missing directories recurred in consecutive tests; others appeared sporadically. Since the A -> B tests were error-free for the first 11 tries, and then contained errors consistently thereafter, it is conceivable that some buffer is retaining bad data. Since the errors only occurred when the fast machine was interrogating the slow machine, there must be a timing issue. Maybe I should just retire my slow machine and declare victory. But I'd much prefer to find the source of this NFS defect.

I've installed kernel 2.6.11-1.14_FC3 on a fast and a slow machine. Both still exhibit the missing directory problem. Every missing directory has size >= 12288 bytes, which I believe must be significant. Has anybody else observed this? I suspect the problem is with the server, as the client machine has no problem with nfs mounts from FC1 and Solaris 8 servers.

I have what looks to be the same problem with a CentOS4 server. In my case, if the directory in question is listed on the nfs server (ls -la), it is possible to ls -la it on the client shortly after. I export filesystems read-only, and set actimeo to 0. Switching between tcp and udp, sync and async, or v2 and v3, and applying the latest kernel upgrade, do not seem to have any effect on the problem. I am convinced that this is a stock nfs server problem (or, better said, the problem resides on the nfs *server* side), because several different clients experience the same problem, and also because the above magic ls of the local directory makes a difference on the clients.

Can we get the priority of this bumped to high? This seems to be a server-side problem and is well duplicated. For me, it is interfering with backups unless I make a hardware change.

Having this same problem with RHEL4/latest updates with _many_ Solaris 8/9 clients.
Running tcpdump indicates that the linux server is returning ERR_NOENT to my GETATTR RPC calls.

A few more things: I set up monitoring on this to ls -l the directories in question every 2 seconds. For the duration of this, the system was entirely stable. I think that my monitoring was refreshing the filehandles on the solaris machines such that they would never expire and create the ERR_NOENT condition.

FYI: I've provided a tentative patch that probably fixes the problem at https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150759

Robert: The reason for the break point being at 12288 bytes is that this is the minimum size for dx directories (3 blocks @ 4096 bytes), and it is for dx directories that the unpatched ext3_get_parent breaks.

Thanks Henrik, that's brilliant! I've applied the patch and it seems to have fixed the problem.

*** This bug has been marked as a duplicate of 150759 ***

I've never patched a linux kernel before. Should I just wait for this to be incorporated? Anyone want to send me a copy of the kernel after it has been built, or should I learn how to apply it? How long, typically, before this shows up as an update? My backups have been on hold for this.

Paul: There's a description of how to build an fc3 kernel rpm at http://voidmain.is-a-geek.net/redhat/fedora_3_kernel_build.html . In this specific case:

Copy the patch to /usr/src/redhat/SOURCES/linux-2.6.10-ext3-grubba-nfsd.patch

Create a new spec file /usr/src/redhat/SPECS/kernel-2.6-grubba-nfsd.spec by patching kernel-2.6.spec with the patch I'll attach in a moment.

Build the new kernel rpms by running the following in the /usr/src/redhat/SPECS directory:

    rpmbuild -vv -ba --target=`arch` kernel-2.6-grubba-nfsd.spec

Your new kernel rpms will appear in /usr/src/redhat/RPMS/`arch`/.

Created attachment 114214 [details]
Spec file patch.
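Henrik's explanation earlier in the thread puts the threshold at 3 blocks of 4096 bytes, i.e. 12288 bytes, the minimum size of a dx directory. As a rough aid for estimating how many directories on a server might be exposed before the patched kernel is installed, the sketch below lists directories at or above that size. It is an illustration only: the names `large_directories` and `DX_MIN_SIZE` are invented here, and a directory's size is used as a proxy on the assumption that the 4096-byte block size quoted in the thread applies to the filesystem being scanned.

```python
import os
import stat

BLOCK_SIZE = 4096             # block size quoted in the thread
DX_MIN_SIZE = 3 * BLOCK_SIZE  # 12288 bytes, per the explanation above


def large_directories(root, threshold=DX_MIN_SIZE):
    """Return (path, size) for each directory under `root` whose own
    size is at least `threshold` bytes. Symlinks are not followed."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # unreadable or vanished entry: skip it
            if stat.S_ISDIR(st.st_mode) and st.st_size >= threshold:
                hits.append((path, st.st_size))
    return hits
```

On the system described earlier, a scan like this would be expected to report the 159 directories of 12K or more out of 29,387, which is the set from which the "missing" directories were drawn.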
After rebuilding the kernel with Henrik Grubbstrom's patch to namei.c on four PCs, I beat NFS to death for 17 hours running the test script described in Note #19 above. This runs 'find . -mount > /dev/null' on the NFS-mounted root filesystems of four machines; in all, 120 complete scans were run. Zero errors occurred with the Grubbstrom patch, in contrast to 9 random errors with unaltered kernel-2.6.11-1.14_FC3, so I am happy to add my observation to others saying that this bug is finally squashed. Thank you Henrik Grubbstrom for your persistent and perceptive work to fix this longstanding and insidious NFS bug. I hope this fix will be incorporated into a new kernel release, and surely into the forthcoming FC4 release.

I've installed Henrik's patch also, and my NFS works fine now. I once again have backups! Thank you.
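The test script itself was never posted, so the following is only a guess at its shape: every ordered client -> server pair is scanned once per cycle, and errors are tallied per direction. The `scan` callable is a placeholder for whatever actually runs 'find . -mount' on the client against the server's mounted root and returns the number of errors seen. It and the name `run_cycles` are assumptions, not part of the original script.

```python
from collections import Counter
from itertools import permutations


def run_cycles(hosts, cycles, scan):
    """Run `scan(client, server)` for every ordered pair of distinct
    hosts, `cycles` times over. Returns (total_runs, errors), where
    errors maps (client, server) to the summed error count."""
    runs = 0
    errors = Counter()
    for _ in range(cycles):
        for client, server in permutations(hosts, 2):
            runs += 1
            errors[(client, server)] += scan(client, server)
    return runs, errors
```

With three hosts there are 6 ordered pairs, so 20 cycles give the 120 scans mentioned earlier in the thread. Keeping the tally per direction is what makes a pattern such as "only A -> B fails" visible.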