Bug 144556

Summary: NFS random "No such file or directory"
Product: [Fedora] Fedora
Component: nfs-utils
Version: 3
Hardware: i386
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Reporter: Aaron C. de Bruyn <bugzilla>
Assignee: Steve Dickson <steved>
CC: alext, ash, astrand, bugzilla, dad, grubba, maurizio.antillon, menscher, pcoene1, rdieter, xtat
Doc Type: Bug Fix
Last Closed: 2005-05-07 22:38:15 UTC
Attachments: Spec file patch. (flags: none)

Description Aaron C. de Bruyn 2005-01-08 08:38:41 UTC
Description of problem:
I have two FC3 boxes.  One a fresh install of FC3 (let's call it
'server') and another box that has been upgraded from FC2 (let's call
it 'client').
On my client I mount an NFS drive (mount -t nfs server.tld:/mydir /mnt)

When I try to access files (such as cp /mnt/* /wherever --recursive)
on the drive I seem to get random "No such file or directory" messages.

If I delete the files from /wherever and run the cp command again it
will give me the same error for different files/directories.

How reproducible:
I can reproduce this on my box every time--however it seems to be
random for the files it picks to 'not find'.

I have already checked that all the files have read permission for
everyone.

If I do a copy locally on the server from one directory to another, it
works fine.  This appears to only happen over NFS.

Steps to Reproduce:
1. mount -t nfs some.server.tld:/somedir /mnt
2. cp --recursive /mnt/* /somewhere
  
Actual results:

(I have images for Slackware 10, Fedora Core 3, and FreeBSD on one box
and I was moving them to a larger drive on my new server)

cp /mnt/slackware10/* /pxe/slackware10 --recursive
cp: cannot stat `/mnt/slackware10/slackware/d': No such file or directory

If I run the copy a second time I may get the same error for
'.....slackware/a' or '....slackware/ap'

Expected results:
Should copy without fail.

Additional info:
I get this error from various other utilities too.
I tried a diff on the source and destination directories and it showed
directories existing in the destination but not the source--when they
actually existed in both places.

I have not tried this the other way around.  If anyone thinks it would
be helpful I can mount an NFS drive from the client computer on my
server and test that out...

Comment 1 Steve Dickson 2005-01-12 16:24:33 UTC
Would it be possible to get an ethereal trace of the problem?
Also is there anything of interest in /var/log/messages?

Comment 2 Aaron C. de Bruyn 2005-01-12 23:16:58 UTC
Sorry--I should have included messages in my original entry.
During boot, the following gets logged related to nfs:

Jan  5 21:08:32 elorg nfslock: rpc.statd startup succeeded
Jan  5 21:08:57 elorg nfs: Starting NFS services:  succeeded
Jan  5 21:08:57 elorg nfs: rpc.rquotad startup succeeded
Jan  5 21:08:58 elorg kernel: Installing knfsd (copyright (C) 1996 okir.de).
Jan  5 21:08:58 elorg nfs: rpc.nfsd startup succeeded
Jan  5 21:08:58 elorg nfs: rpc.mountd startup succeeded
Jan  5 21:08:58 elorg rpc.idmapd: nfsdreopen: Opening '/proc/net/rpc/nfs4.idtoname/channel' failed: errno 2 (No such file or directory)
Jan  5 21:08:58 elorg rpc.idmapd: nfsdreopen: Opening '/proc/net/rpc/nfs4.nametoid/channel' failed: errno 2 (No such file or directory)

I have no nfs4* files in /proc/net/rpc; the only entries there are:
nfsd
nfsd.export/
nfsd.fh/

Other than the errors above, RPC isn't throwing any errors.

Unless anyone has another suggestion, I'm going to try running
'rpc.idmapd -F -vvv' and watch if it spits out any interesting debug
messages.

-A

Comment 3 Aaron C. de Bruyn 2005-01-12 23:19:19 UTC
Sorry--I meant 'rpc.idmapd -f -vvv'

Comment 4 Alan Harder 2005-01-21 18:43:58 UTC
We are also experiencing this problem between an F2 server and an F2
client.

Comment 5 Steve Dickson 2005-02-01 15:35:22 UTC
Hmm... it appears rpc.mountd is not running. Could you do the
following and post the results?

service rpcidmapd stop
service nfs start



Comment 6 WhidbeyNet 2005-02-10 23:39:16 UTC
We installed Enterprise 3 WS (2.4.21-20.ELsmp) and began seeing "No such file" randomly, when a POP3 
process tries to open a message on an NFS mount:

fd = open(m->filename, O_RDONLY);

Error in the log:

Feb 10 14:23:53 mail4 tpop3d[26708]: maildir_new: scanned maildir /var/mail/userhidden/Maildir (1 messages) in 0.001s

Feb 10 14:23:53 mail4 tpop3d[26708]: maildir_send_message: open(new/1108074103.H973837P26271.mail4.hctc.com,S=2239): No such file or directory

However, the file does exist:

/var/mail/userhidden/Maildir/new:
-rw-------    1 userhidden  400          2239 Feb 10 12:18 1108074103.H973837P26271.mail4.hctc.com,S=2239

A few moments later, the file is opened fine:

Feb 10 14:26:09 mail4 tpop3d[27068]: maildir_send_message: sending message 1 (new/1108074103.H973837P26271.mail4.hctc.com,S=2239) size 2239 bytes

The NFS server is 2.4.21-9.ELsmp and has 3 other mail servers connected to it, also running
2.4.21-9.ELsmp with an identical tpop3d, which do not exhibit this problem. NFS is version 3,
over TCP. Clients mount using /etc/fstab with:

nfs:/var/mail  /var/mail  nfs  rw,async,hard,intr 0 0

Out of 2392 message files opened in 1 hour, 90 of them resulted in "No such file or directory" when 
trying to open.  The errors don't appear consecutively and affect any user.

We haven't been able to reproduce this yet using "cp" (we can copy large groups of files from /var/mail 
over NFS without error).
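
Since the same file opened successfully a few moments later, one stopgap an application could apply is to retry the open when it fails with ENOENT. The sketch below is written in Python rather than tpop3d's C, and the name `open_with_retry` and its `attempts`, `delay`, and `sleep` parameters are hypothetical. It only masks the symptom and says nothing about the underlying cause.

```python
import errno
import os
import time


def open_with_retry(path, attempts=3, delay=0.5, sleep=time.sleep):
    """Open path read-only, retrying only when the failure is ENOENT.

    Hypothetical helper: returns a file descriptor, or re-raises the last error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return os.open(path, os.O_RDONLY)
        except OSError as exc:
            # Give up on any other error, or once the final attempt has failed.
            if exc.errno != errno.ENOENT or attempt == attempts - 1:
                raise
            sleep(delay)
```

A file that is truly absent still raises after the last attempt, so a genuine "No such file or directory" is delayed but not hidden.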

Comment 7 David A. De Graaf 2005-03-24 16:59:03 UTC
I am having similar problems with NFS failures.  This command is a good test:
  cd /net/mach2;  find . -mount  2> /dev/null

Instead of silence, I see various parts of the root filesystem listed as "No
such file or directory", with ./usr/share/pixmaps, ./usr/share/info,
./usr/include occurring frequently.
This raises merry hell with my backup scheme, which depends on first listing all
the files on a remote machine.

I have four virtually identically provisioned machines, running Fedora Core 3,
with all current updates.  Here are some of the packages that are probably relevant:
# rpmarch kernel nfs-utils portmap am-utils
kernel-2.6.10-1.760_FC3 i686
kernel-2.6.10-1.766_FC3 i686
kernel-2.6.10-1.770_FC3 i686
nfs-utils-1.0.6-52 i386
portmap-4.0-63 i386
am-utils-6.0.9-10 i386

I use NFS with amd and not autofs, so remote filesystems are automounted
whenever /net/mach2 is referenced.  All four machines can mount all the others'
filesystems.

My son ran the above test on his system and experienced NO such errors.  Since
he is running all the same rpms listed above except the kernel, I reverted my
machines to kernels 2.6.10-1.766_FC3 and 2.6.10-1.760_FC3, but still had the
same errors.

NFS filesystems are mounted with the options in /etc/amd.net:
# cat /etc/amd.net
/defaults fs:=${autodir}/${rhost}/root/${rfs};opts:=nosuid,nodev\
,rw,hard,intr,rsize=8192,wsize=8192
*       rhost:=${key};type:=host;rfs:=/

I've tried changing the block sizes to 4096 and 32768.  This changes some of the
errors during the 'find', but doesn't eliminate them.

I am reasonably sure these errors did not occur in the past, when I
created my backup scheme, but I cannot say exactly when things went bad.

I am using a workaround that does the find on the remote system:
  ssh mach2 "cd /; find . -mount"
pending resolution of this serious NFS bug.
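
For a backup script that wants the file list programmatically, the same workaround can be sketched in Python. The function names `remote_find_command` and `list_remote_files` are hypothetical, and the sketch assumes key-based ssh access to the remote host. It uses `&&` instead of `;` so that find does not run if the cd fails, which is a small deviation from the command above.

```python
import shlex
import subprocess


def remote_find_command(host, root="/"):
    """Build the ssh command line that runs find on the remote host itself."""
    remote = "cd {} && find . -mount".format(shlex.quote(root))
    return ["ssh", host, remote]


def list_remote_files(host, root="/"):
    """Run find on the remote host and return the paths it prints, one per line."""
    result = subprocess.run(
        remote_find_command(host, root),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if ssh or find exits non-zero
    )
    return result.stdout.splitlines()
```

Splitting on newlines assumes no file name contains a newline character; a stricter version would need a NUL-separated listing.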

Comment 8 WhidbeyNet 2005-03-24 18:50:51 UTC
Appears fixed in RHEL 4.  Cannot reproduce.

Comment 9 David A. De Graaf 2005-03-31 15:52:36 UTC
I am delighted to hear that RHEL 4 seems fixed.  But this bugzilla report is
against FC3, which is supposed to feed good stuff into RHEL, not the reverse.
Can anyone confirm this misbehaviour in FC3, and propose a fix for FC3 (or FC4)?
Is there a mechanism to feed corrections from RHEL into Fedora?

Comment 10 Paul Coene 2005-03-31 17:26:32 UTC
I too have this problem.  I run tar to backup a remote FC3 machine to tape. 
While backing up, I get random "cannot stat" errors on subdirectories.  This
makes backing up to a single tape drive rather difficult ;)

I ran strace and all I found was that stat() was coming back with ENOENT
randomly on some directories during the tar.  gtar behaves the same way.  The
directories change from run to run...
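
The symptom seen in strace, stat() returning ENOENT for a name the directory listing has just returned, can be checked for directly. What follows is a minimal Python sketch, not a tool used in this report; the function name `find_vanishing_entries` is hypothetical. It lists each directory under a root, calls lstat() on every name, and collects the ones reported as missing, staying on one filesystem in the spirit of `find -mount`.

```python
import errno
import os
import stat


def find_vanishing_entries(root):
    """Return paths that a directory listing returned but lstat() reports as ENOENT."""
    root_dev = os.lstat(root).st_dev
    missing = []
    pending = [root]
    while pending:
        directory = pending.pop()
        try:
            names = os.listdir(directory)
        except OSError:
            # Unreadable or stale directory: this sketch only tracks ENOENT on entries.
            continue
        for name in names:
            path = os.path.join(directory, name)
            try:
                info = os.lstat(path)
            except OSError as exc:
                if exc.errno == errno.ENOENT:
                    missing.append(path)
                continue
            # Descend into real directories on the same filesystem only.
            if stat.S_ISDIR(info.st_mode) and info.st_dev == root_dev:
                pending.append(path)
    return missing
```

On a quiet, healthy mount the result should be an empty list. A file deleted between the listing and the lstat() call would also show up, so the check is only meaningful on a tree that is not being modified.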

Comment 11 Robert Jackson 2005-04-02 22:16:58 UTC
We are also experiencing this NFS filesystem problem on FC1/2/3 & Solaris 8
clients using a FC3 server.

I can reproduce the random "No such file or directory" messages when running
"find . -mount > /dev/null"

But we are also observing randomly disappearing directories under normal usage,
i.e. you "cd" out of directory you've been working in, and when you "cd" back,
it's vanished!  This includes users home directories.

Missing directories can be seen using "ls", but not by "ls -l".  Our workaround
has been to "remove" the missing directory using "rmdir".  The command correctly
returns "Directory not empty", and after that the directory is visible again.

The following demonstrates this:

[root@aftermath delly]# find . -mount > /dev/null
find: ./usr/share/pixmaps/gaim/smileys/default: No such file or directory
find: ./usr/share/info: No such file or directory
find: ./usr/share/apps/ksgmltools2/docbook/xsl/params: No such file or directory

[root@aftermath delly]# cd ./usr/share/pixmaps/gaim/smileys/

[root@aftermath smileys]# ls
default none

[root@aftermath smileys]# ls -l
ls: default: No such file or directory
total 8
drwxr-xr-x 2 root root 4096 Mar 10 14:05 none

[root@aftermath smileys]# cd default

[root@aftermath default]# ls -l
ls: .: Stale NFS file handle

[root@aftermath default]# cd ..

[root@aftermath smileys]# ls -l
ls: default: No such file or directory
total 8
drwxr-xr-x 2 root root 4096 Mar 10 14:05 none

[root@aftermath smileys]# rmdir default
rmdir: `default': Directory not empty

[root@aftermath smileys]# ls -l
total 24
drwxr-xr-x 2 root root 12288 Mar 10 14:05 default
drwxr-xr-x 2 root root 4096 Mar 10 14:05 none

[root@aftermath smileys]# cd default

[root@aftermath default]# ls -l
total 772
-rw-r--r--  1 root root  305 Mar  7 02:39 angel.png
[...]


I have observed this behaviour on two separate FC3 servers, but a third FC1
server appears unaffected.  Also the FC3 server replaced a Solaris 8 server
which worked fine.

The FC3 servers are fully up2date, but I have tried experimenting on one system
with different kernels (original FC3 kernel & rawhide kernel) and also installed
the rawhide "nfs-utils".  This made no change to the behaviour.

Export options used:  rw,sync,no_root_squash,insecure
Mount options used:   rw,sync,tcp

One thing I should add is we're using yp to manage user accounts, but not to
automatically mount users home directories (they should always be mounted on all
clients).

Cheers

Robert

Comment 12 Steve Dickson 2005-04-04 17:01:04 UTC
Please add the "no_subtree_check" export option to see if that helps

Comment 13 Robert Jackson 2005-04-04 19:13:14 UTC
Hi

I tried the "no_subtree_check" export option and the behaviour remains the same.

Cheers

Robert

Comment 14 David A. De Graaf 2005-04-04 21:20:11 UTC
I also tried the "no_subtree_check" export option, and 'find' on a remote
nfs-mounted filesystem still shows errors.  I suspect this may be a timing issue.
Of my four machines, one is a pretty fast 1200 MHz Athlon; the slowest is a
Pentium II 200 MHz.

When the remote machine is the fast one, no errors are reported (twice):

[root@datant /net/datium]
# time find . -mount > /dev/null

real    3m36.491s
user    0m4.632s
sys     0m54.016s
[root@datant /net/datium]
# time find . -mount > /dev/null

real    1m42.360s
user    0m4.947s
sys     0m49.730s


With those same machines, in reverse - the remote machine is slow - there are
errors (different ones in two tests):

[root@datium /net/datant]
# time find . -mount > /dev/null
find: ./usr/share/info: No such file or directory
find: ./usr/share/doc: No such file or directory

real    6m34.467s
user    0m0.719s
sys     0m7.126s
[root@datium /net/datant]
# time find . -mount > /dev/null
find: ./usr/share/pixmaps/nautilus/Bluecurve: No such file or directory
find: ./usr/share/pixmaps/gaim/smileys/default: No such file or directory

real    4m46.177s
user    0m0.628s
sys     0m5.438s



Comment 15 Paul Coene 2005-04-07 22:46:20 UTC
I don't know if this will help... But I did a find twice in a row, from the same
client.  The first one failed on like 8 directories, the second one zero... 
Could this be related to caching?

I also have fast clients and a slower server.

I can duplicate at will if you need me to try some tests or some patch.

Comment 16 Paul Coene 2005-04-11 01:34:46 UTC
Any progress on this, or any way I can help?  My new network uses this for a
backup strategy so I'm running without a net.  Is there a short term outlook, or
work I can do to help?  If not, perhaps I need to buy more backup hardware.

Comment 17 Robert Jackson 2005-04-11 09:10:48 UTC
I've just noticed that all the "missing" directories have a size greater than or
equal to 12288 bytes.

On my system, only 159 of the total 29,387 directories are greater than or equal
to 12K, and the last time I ran "find . -mount > /dev/null", 115 of them
disappeared.

I've tried mounting with:

rsize=4096,wsize=4096
rsize=8192,wsize=8192
rsize=32768,wsize=32768

None of the above changed the behaviour.
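
Others who want to check the size observation on their own server could list the candidate directories with something like the following Python sketch. The name `large_directories` and the `SIZE_THRESHOLD` constant are hypothetical; 12288 is simply the figure observed above. It is meant to be run on the server against the local filesystem, since stat calls over the affected NFS mount may themselves fail.

```python
import os
import stat

SIZE_THRESHOLD = 12288  # bytes; the directory size observed in this report


def large_directories(root, threshold=SIZE_THRESHOLD):
    """Return (path, size) for each directory below root whose size is >= threshold."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                # Entry could not be examined; skip it rather than guess.
                continue
            # os.walk also lists symlinks to directories; keep real directories only.
            if stat.S_ISDIR(info.st_mode) and info.st_size >= threshold:
                found.append((path, info.st_size))
    return found
```

The root directory itself is not included in the result, and the walk does not stop at filesystem boundaries, so it is not an exact match for `find . -mount`.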

Comment 18 Paul Coene 2005-04-18 13:58:52 UTC
I see this is still in "need info" state.  What info is needed?  I'll do
whatever I can to augment the data people have provided.

Comment 19 David A. De Graaf 2005-04-18 17:18:54 UTC
A yum update on 4/16 provided many new package versions, including 
    kernel.i686 2.6.11-1.14_FC3

Hoping that this might have improved the NFS problem, I ran tests with
three machines:  a fast one, a slow one, and a medium speed laptop.
The  'find . -mount > /dev/null'  test can be run 6 ways with three
machines:  A -> B   B -> C   C -> A  and the reverse.  I wrote a
script to loop over these 6 tests 20 times.  Of these 120 tests, which
ran for 14 hr 18 min, all were error-free except 9.  The 9 tests with
errors were all  A -> B, where A is the fast Athlon, and B is the slow
Pentium.  The errors occurred on the 12th thru 20th cycles, and the
"missing directories" differed in number - 1, 9, 4, 5, etc.  Some of
the missing directories recurred in consecutive tests; others appeared
sporadically.

Since the A -> B tests were error-free for the first 11 tries, and
then contained errors consistently thereafter, it is conceivable that
some buffer is retaining bad data.  Since the errors only occurred
when the fast machine was interrogating the slow machine, there
must be a timing issue.

Maybe I should just retire my slow machine and declare victory.
But I'd much prefer to find the source of this NFS defect.
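
The test matrix described above (every ordered pair of three machines, repeated 20 times) can be enumerated as in the Python sketch below. This is not the script that was actually run; `build_test_plan` is a hypothetical name, and the sketch only produces the schedule, leaving the execution of `find . -mount` on each client to the caller.

```python
import itertools


def build_test_plan(machines, cycles=20):
    """Return (cycle, client, server) tuples covering every ordered pair of machines."""
    # Ordered pairs: A -> B and B -> A count as different tests.
    pairs = list(itertools.permutations(machines, 2))
    return [
        (cycle, client, server)
        for cycle in range(1, cycles + 1)
        for client, server in pairs
    ]
```

With three machines this gives 6 ordered pairs per cycle and 120 entries over 20 cycles, matching the count reported above.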


Comment 20 Robert Jackson 2005-04-18 20:17:17 UTC
I've installed kernel 2.6.11-1.14_FC3 on a fast and a slow machine.  Both still
exhibit the missing directory problem.

Every missing directory has size >= 12288 bytes which I believe must be
significant.  Has anybody else observed this?

I suspect the problem is with the server, as the client machine has no problem
with nfs mounts from FC1 and Solaris 8 servers.


Comment 21 Alex Tkachenko 2005-05-01 00:44:05 UTC
I have what looks to be the same problem with a CentOS 4 server. In my case, if
the directory in question is listed on the NFS server (ls -la), it is possible to
ls -la it on the client shortly after. I export filesystems read-only and set
actimeo to 0. Switching between tcp/udp, sync/async, and v2/v3, and applying the
latest kernel upgrade, do not seem to have any effect on the problem. I am
convinced that this is a stock NFS server problem (or, better said, that the
problem resides on the NFS *server* side), because several different clients
experience the same problem, and because the above magic ls of the local
directory makes a difference on the clients.

Comment 22 Paul Coene 2005-05-01 01:33:16 UTC
Can we get the priority of this bumped to high?  This seems to be a server-side
problem and is well duplicated.  For me, it is interfering with backups unless
I make a hardware change.

Comment 23 Todd Troxell 2005-05-02 19:45:45 UTC
Having this same problem with RHEL4/latest updates with _many_ Solaris 8/9 clients.

Running tcpdump indicates that the linux server is reporting Error:ERR_NOENT to
my RPC calls of GETATTR.

Comment 24 Todd Troxell 2005-05-03 18:46:58 UTC
A few more things:

I set up monitoring on this to ls -l the directories in question every 2
seconds.  For the duration of this, the system was entirely stable.

I think that my monitoring was refreshing the filehandles on the solaris
machines such that they would never expire and create the ERR_NOENT condition.
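
A periodic check of this kind can be sketched in Python as follows. The name `keep_directories_warm` and its parameters are hypothetical, and whether such polling really keeps filehandles from expiring is only the hypothesis stated above, not something the sketch demonstrates. Each round lists every directory and stats its entries, roughly what `ls -l` does, and records any directory for which that fails.

```python
import os
import time


def keep_directories_warm(directories, interval=2.0, rounds=None, sleep=time.sleep):
    """List and stat each directory's entries every `interval` seconds.

    Returns the (directory, errno) failures seen. rounds=None loops forever.
    """
    failures = []
    completed = 0
    while rounds is None or completed < rounds:
        for directory in directories:
            try:
                for name in os.listdir(directory):
                    os.lstat(os.path.join(directory, name))
            except OSError as exc:
                # Record the first failure in this directory and move on.
                failures.append((directory, exc.errno))
        completed += 1
        sleep(interval)
    return failures
```

Passing a finite `rounds` value makes it usable as a one-off probe; leaving it at None turns it into a long-running monitor.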

Comment 25 Henrik Grubbstrom 2005-05-07 18:31:16 UTC
FYI: I've provided a tentative patch that probably fixes the problem at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150759

Robert: The reason for the break point being at 12288 bytes is that that is the
minimum size for dx directories (3 blocks @ 4096 bytes), and it is for dx
directories that the unpatched ext3_get_parent breaks.

Comment 26 Robert Jackson 2005-05-07 21:59:23 UTC
Thanks Henrik, that's brilliant!

I've applied the patch and it seems to have fixed the problem.

Comment 27 Steve Dickson 2005-05-07 22:38:15 UTC

*** This bug has been marked as a duplicate of 150759 ***

Comment 28 Paul Coene 2005-05-10 17:24:47 UTC
I've never patched a Linux kernel before.  Should I just wait for this to be
incorporated?  Anyone want to send me a copy of the kernel after it has been
built, or should I learn how to apply it?  How long, typically, before
this shows up as an update?  My backups have been on hold for this.

Comment 29 Henrik Grubbstrom 2005-05-10 17:53:47 UTC
Paul: There's a description about how to build an fc3 kernel rpm on
http://voidmain.is-a-geek.net/redhat/fedora_3_kernel_build.html .

In this specific case:

Copy the patch to /usr/src/redhat/SOURCES/linux-2.6.10-ext3-grubba-nfsd.patch

Create a new spec file /usr/src/redhat/SPECS/kernel-2.6-grubba-nfsd.spec by
patching kernel-2.6.spec with the patch I'll attach in a moment.

Build the new kernel rpms by running
  rpmbuild -vv -ba --target=`arch` kernel-2.6-grubba-nfsd.spec
in the /usr/src/redhat/SPECS directory.

Your new kernel rpms will appear in /usr/src/redhat/RPMS/`arch`/.


Comment 30 Henrik Grubbstrom 2005-05-10 17:57:18 UTC
Created attachment 114214 [details]
Spec file patch.

Comment 32 David A. De Graaf 2005-05-12 14:13:07 UTC
After rebuilding the kernel with Henrik Grubbstrom's patch to namei.c
on four PCs, I beat NFS to death for 17 hours running the test script
described in Comment 19 above.  This runs   'find . -mount > /dev/null'
on the NFS-mounted root filesystems of four machines; in all, 120 complete
scans were run.

Zero errors occurred with the Grubbstrom patch, in contrast to 9 random
errors with unaltered kernel-2.6.11-1.14_FC3, so I am happy to add my
observation to others saying that this bug is finally squashed.

Thank you Henrik Grubbstrom for your persistent and perceptive work to
fix this longstanding and insidious NFS bug.  I hope this fix will be
incorporated into a new kernel release, and surely into the forthcoming
FC4 release.


Comment 33 Paul Coene 2005-05-12 14:30:20 UTC
I've installed Henrik's patch also, and my NFS works fine now.  I once again
have backups!  Thank you.