Bug 185056 - rename(2) over nfs fails with ENOENT
rename(2) over nfs fails with ENOENT
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel (Show other bugs)
4.0
x86_64 Linux
medium Severity medium
: ---
: ---
Assigned To: Peter Staubach
Brian Brock
:
Depends On:
Blocks: 170416
  Show dependency treegraph
 
Reported: 2006-03-10 05:17 EST by Olle Liljenzin
Modified: 2007-11-30 17:07 EST (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-06-21 08:25:08 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Program to trigger the bug (1.09 KB, text/plain)
2006-03-10 05:17 EST, Olle Liljenzin
no flags Details
tcpdump (5.05 KB, text/plain)
2006-03-12 04:38 EST, Olle Liljenzin
no flags Details
program running when the tcpdump was done (901 bytes, text/plain)
2006-03-12 04:41 EST, Olle Liljenzin
no flags Details

  None (edit)
Description Olle Liljenzin 2006-03-10 05:17:25 EST
Description of problem:
rename(2) on an nfs mounted file system fails with ENOENT, but the file was
renamed on the server. Sometimes the nfs caches become inconsistent after the
failure.

Version-Release number of selected component (if applicable):
2.4.21-37.ELsmp (client machine)

How reproducible:
Sometimes

Steps to Reproduce:
1. Compile the attached program and run it with current directory nfs mounted
2. Run it several times if necessary, try with different loads on the server
  
Actual results:
% a.out
rename failed: No such file or directory
errno: 2
i: 357
j: 238
arg1: "foo"
arg2: "bar238"

Expected results:
The program should terminate with no output.

Additional info:
It is not easy to reproduce, but we see it on regular basis. It happens a few
times a day on a network with >100 nfs clients. We are only seeing it on RHEL3
clients with an RHEL4 server.

Sometimes the client caches look inconsistent after the failed rename. Note how
the inode numbers first are the same but then differs:

% stat foo bar238
  File: `foo'
  Size: 0               Blocks: 8          IO Block: 32768  Regular File
Device: 3ch/60d Inode: 411189      Links: 1    
Access: (0400/-r--------)  Uid: (  238/    olle)   Gid: (   13/    carm)
Access: 2006-03-10 09:28:17.000000000 +0100
Modify: 2006-03-10 09:28:17.000000000 +0100
Change: 2006-03-10 09:28:17.000000000 +0100

  File: `bar238'
  Size: 0               Blocks: 8          IO Block: 32768  Regular File
Device: 3ch/60d Inode: 411189      Links: 1    
Access: (0400/-r--------)  Uid: (  238/    olle)   Gid: (   13/    carm)
Access: 2006-03-10 09:28:17.000000000 +0100
Modify: 2006-03-10 09:28:17.000000000 +0100
Change: 2006-03-10 09:28:17.000000000 +0100

% echo "apa" >> foo
% echo "bepa" >> bar238
bar238: Permission denied.
% stat foo bar238
  File: `foo'
  Size: 4               Blocks: 16         IO Block: 32768  Regular File
Device: 3ch/60d Inode: 411462      Links: 1    
Access: (0664/-rw-rw-r--)  Uid: (  238/    olle)   Gid: (   13/    carm)
Access: 2006-03-10 10:23:13.000000000 +0100
Modify: 2006-03-10 10:20:52.000000000 +0100
Change: 2006-03-10 10:20:52.000000000 +0100

  File: `bar238'
  Size: 0               Blocks: 8          IO Block: 32768  Regular File
Device: 3ch/60d Inode: 411189      Links: 1    
Access: (0400/-r--------)  Uid: (  238/    olle)   Gid: (   13/    carm)
Access: 2006-03-10 09:28:17.000000000 +0100
Modify: 2006-03-10 09:28:17.000000000 +0100
Change: 2006-03-10 09:28:17.000000000 +0100
Comment 1 Olle Liljenzin 2006-03-10 05:17:25 EST
Created attachment 125939 [details]
Program to trigger the bug
Comment 2 Olle Liljenzin 2006-03-10 07:02:25 EST
Additional info:

The filesystem is exported with the following options:
rw,async,no_root_squash,insecure,fsid=1415926548

It is automounted on the client with default options
Comment 3 Olle Liljenzin 2006-03-10 08:28:35 EST
I have been doing some more testing.

The async option is probably needed to reproduce the bug; it would take forever
to run the test program without it. The other export options are not necessary.

The client can mount with fstab and defaults. automount is not necessary.

The server should be RHEL4 and the client RHEL3. I can't reproduce it with
RHEL3/RHEL3 or RHEL4/RHEL4.
Comment 4 Olle Liljenzin 2006-03-12 04:35:04 EST
I have made some tcpdumps of failing rename commands, and they all have the same
pattern.

I don't have a detailed knowledge of the nfs protocol, maybe someone that does
could have a look at the attached dump.

To my understanding the following happens:
1. The client sends a package to the server with 'rename(foo135,foo136)'
2. There is other noise on the network and the client does not receive the reply
from the server
3. The client sends a new 'rename(foo135,foo136)'
4. The server replies with error since 'foo135' does not exist
Comment 5 Olle Liljenzin 2006-03-12 04:38:44 EST
Created attachment 126004 [details]
tcpdump
Comment 6 Olle Liljenzin 2006-03-12 04:41:26 EST
Created attachment 126006 [details]
program running when the tcpdump was done
Comment 7 Olle Liljenzin 2006-03-14 04:53:08 EST
The tcpdump on the server shows that the server is replaying ok to the first
rename request, and then gets the repeated rename request again about 0.7-0.8
seconds later.

The first reply is lost on the network, but the second one reaches the client.
The server will then have renamed the file, but the client thinks the rename failed.

The server will not consider the second request to be a duplicate since it was
sent more than 0.2 seconds after the original request. (RC_DELAY=HZ/5)

Is there a mismatch between the timeout before the client resends a package and
the max delay for the server to recognize a duplicate?
Comment 9 Olle Liljenzin 2006-03-17 10:02:19 EST
> The server will not consider the second request to be a duplicate since it was
> sent more than 0.2 seconds after the original request. (RC_DELAY=HZ/5)

Ok. I misunderstood the code here. RC_DELAY is used for deciding if a reply
package should be sent or not, instead there is a contant of 120*HZ for when a
package is considered to not be a duplicate. Further analysis of the tcpdump
also shows that ducplicate requests are cached in most cases but not always.

But the duplicate request case is order of magnitudes too small for a modern
enterprise server!!!

   #define CACHESIZE		1024

Output for nfsstat on the server shows more than 2000 cache misses per second:

   % nfsstat -o rc ; sleep 1 ; nfsstat -o rc
   Server reply cache:
   hits       misses     nocache
   3389577    131351522   998711417
   Server reply cache:
   hits       misses     nocache
   3389615    131353641   998718497

The intention in the code looks like being to cache packages for the last 120
seconds. In this time frame we see 269153 cache misses compared to the 1024
entries that fits into the current cache.

   % nfsstat -o rc ; sleep 120 ; nfsstat -o rc
   Server reply cache:
   hits       misses     nocache
   3389738    131360829   998743060
   Server reply cache:
   hits       misses     nocache
   3394541    131629982   999669134
Comment 10 Olle Liljenzin 2006-03-21 05:35:09 EST
To get an idea on what an adequate cache size could be, I have been running the
following script on our nfs server during 24 hours of normal production:

   #! /bin/sh
   cache_misses()
   {
        nfsstat -o rc | awk '/[0-9]/ {print $2}'
   }
   i0=`cache_misses`
   while true; do
        sleep 1
        i=`cache_misses`
        date|tr '\n' '\t'
        expr $i - $i0
        i0=$i
   done

Top ten list:
   Mon Mar 20 23:16:37 CET 2006    18643
   Mon Mar 20 23:16:38 CET 2006    17733
   Mon Mar 20 22:38:44 CET 2006    14228
   Mon Mar 20 22:38:06 CET 2006    12953
   Mon Mar 20 23:16:28 CET 2006    11984
   Mon Mar 20 23:16:36 CET 2006    11771
   Mon Mar 20 21:18:48 CET 2006    11658
   Mon Mar 20 22:38:45 CET 2006    11607
   Mon Mar 20 22:38:36 CET 2006    11150
   Mon Mar 20 23:16:29 CET 2006    10866

Consider the peak of 18643 cache misses in one second. The current cache size is
1024, which gives a shortest lifetime of about 0.055 seconds for cached
requests, compared to the 0.7 seconds it takes for a client to repeat a request
if the reply package isn't received.
Comment 18 Peter Staubach 2006-04-06 15:32:39 EDT
The "right" solution, technically, to this problem is to switch to using
TCP as the transport.  The semantics of TCP, combined with the semantics
of RPC over TCP, would effectively eliminate this issue.

The other obvious solutions are not tenable.  One would be to modify the
CACHESIZE and HASHSIZE defines to be "large" enough.  This would involve
creating a much larger cache, allocated at boot time.  For most systems,
such a large cache would not be required and would be considered to be a
waste of space.  The other solution is to implement a more dynamic cache
which would grow and shrink as required.  This would be a large amount
of work and would be very risky, especially at this time in the RHEL-3
lifecycle.

Therefore, I am going to close this as WONTFIX.  I would suggest using
TCP as the transport and addressing any issues which might prevent this
from being implemented.

Note You need to log in before you can comment on or make changes to this bug.