Bug 158293

Summary: nfs server intermitently claims ENOENT on existing files or directories
Product: Red Hat Enterprise Linux 4 Reporter: Stephen Tweedie <sct>
Component: kernelAssignee: Stephen Tweedie <sct>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0CC: elliot, glen, jjneely, jlayton, nhorman, poelstra, scott, steved
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2005-514 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-10-05 13:10:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 156322    

Description Stephen Tweedie 2005-05-20 12:04:19 UTC
+++ This bug was initially created as a clone of Bug #150759 +++

Description of problem:

We just migrated one of our nfs servers from Solaris/sun4u to
FC3/x86_64, and the clients started intermittently complianing
about missing files.

Example:

Solaris 8/x86 client (arkanoid):

 $ ls -l /usr/local/bin/wget
/usr/local/bin/wget: No such file or directory

Snooping the net during the above ls gives:

 # snoop -v homer
ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 1 arrived at 20:58:52.51
ETHER:  Packet size = 206 bytes
ETHER:  Destination = 0:11:2f:d5:63:ee, 
ETHER:  Source      = 0:60:8:27:90:b8, 
ETHER:  Ethertype = 0800 (IP)
ETHER:  
IP:   ----- IP Header -----
IP:   
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:   Total length = 192 bytes
IP:   Identification = 13320
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 64 seconds/hops
IP:   Protocol = 6 (TCP)
IP:   Header checksum = 232c
IP:   Source address = 212.247.28.227, arkanoid
IP:   Destination address = 212.247.28.50, homer.roxen.com
IP:   No options
IP:   
TCP:  ----- TCP Header -----
TCP:  
TCP:  Source port = 549
TCP:  Destination port = 2049 (Sun RPC)
TCP:  Sequence number = 3143675431
TCP:  Acknowledgement number = 3654694930
TCP:  Data offset = 32 bytes
TCP:  Flags = 0x18
TCP:        ..0. .... = No urgent pointer
TCP:        ...1 .... = Acknowledgement
TCP:        .... 1... = Push
TCP:        .... .0.. = No reset
TCP:        .... ..0. = No Syn
TCP:        .... ...0 = No Fin
TCP:  Window = 33304
TCP:  Checksum = 0xb767
TCP:  Urgent pointer = 0
TCP:  Options: (12 bytes)
TCP:    - No operation
TCP:    - No operation
TCP:    - TS Val = 1359745451, TS Echo = 784908738
TCP:  
RPC:  ----- SUN RPC Header -----
RPC:  
RPC:  Record Mark: last fragment, length = 136
RPC:  Transaction id = 2581780379
RPC:  Type = 0 (Call)
RPC:  RPC version = 2
RPC:  Program = 100003 (NFS), version = 3, procedure = 1
RPC:  Credentials: Flavor = 1 (Unix), len = 72 bytes
RPC:     Time = 09-Mar-05 19:58:52
RPC:     Hostname = arkanoid
RPC:     Uid = 0, Gid = 1
RPC:     Groups = 1 0 2 3 4 5 6 7 8 9 12 
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  File handle = [5319]
NFS:   0100000100FD00010200000013819201787BA91F
NFS:  

ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 2 arrived at 20:58:52.51
ETHER:  Packet size = 98 bytes
ETHER:  Destination = 0:60:8:27:90:b8, 
ETHER:  Source      = 0:11:2f:d5:63:ee, 
ETHER:  Ethertype = 0800 (IP)
ETHER:  
IP:   ----- IP Header -----
IP:   
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:   Total length = 84 bytes
IP:   Identification = 63374
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 64 seconds/hops
IP:   Protocol = 6 (TCP)
IP:   Header checksum = 6011
IP:   Source address = 212.247.28.50, homer.roxen.com
IP:   Destination address = 212.247.28.227, arkanoid
IP:   No options
IP:   
TCP:  ----- TCP Header -----
TCP:  
TCP:  Source port = 2049
TCP:  Destination port = 549 (Sun RPC)
TCP:  Sequence number = 3654694930
TCP:  Acknowledgement number = 3143675571
TCP:  Data offset = 32 bytes
TCP:  Flags = 0x18
TCP:        ..0. .... = No urgent pointer
TCP:        ...1 .... = Acknowledgement
TCP:        .... 1... = Push
TCP:        .... .0.. = No reset
TCP:        .... ..0. = No Syn
TCP:        .... ...0 = No Fin
TCP:  Window = 2520
TCP:  Checksum = 0x6583
TCP:  Urgent pointer = 0
TCP:  Options: (12 bytes)
TCP:    - No operation
TCP:    - No operation
TCP:    - TS Val = 784931572, TS Echo = 1359745451
TCP:  
RPC:  ----- SUN RPC Header -----
RPC:  
RPC:  Record Mark: last fragment, length = 28
RPC:  Transaction id = 2581780379
RPC:  Type = 1 (Reply)
RPC:  This is a reply to frame 1
RPC:  Status = 0 (Accepted)
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  Accept status = 0 (Success)
RPC:  
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 2 (No such file or directory)
NFS:  

ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 3 arrived at 20:58:52.60
ETHER:  Packet size = 66 bytes
ETHER:  Destination = 0:11:2f:d5:63:ee, 
ETHER:  Source      = 0:60:8:27:90:b8, 
ETHER:  Ethertype = 0800 (IP)
ETHER:  
IP:   ----- IP Header -----
IP:   
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:   Total length = 52 bytes
IP:   Identification = 13321
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 64 seconds/hops
IP:   Protocol = 6 (TCP)
IP:   Header checksum = 23b7
IP:   Source address = 212.247.28.227, arkanoid
IP:   Destination address = 212.247.28.50, homer.roxen.com
IP:   No options
IP:   
TCP:  ----- TCP Header -----
TCP:  
TCP:  Source port = 549
TCP:  Destination port = 2049 
TCP:  Sequence number = 3143675571
TCP:  Acknowledgement number = 3654694962
TCP:  Data offset = 32 bytes
TCP:  Flags = 0x10
TCP:        ..0. .... = No urgent pointer
TCP:        ...1 .... = Acknowledgement
TCP:        .... 0... = No push
TCP:        .... .0.. = No reset
TCP:        .... ..0. = No Syn
TCP:        .... ...0 = No Fin
TCP:  Window = 33304
TCP:  Checksum = 0xdede
TCP:  Urgent pointer = 0
TCP:  Options: (12 bytes)
TCP:    - No operation
TCP:    - No operation
TCP:    - TS Val = 1359745461, TS Echo = 784931572
TCP:  

A little while later it works fine, probably because the automounter
has remounted /usr/local:

 $ ls -l /usr/local/bin/wget
lrwxrwxrwx   1 erik     local         24 Mar  8 13:50
/usr/local/bin/wget -> /i/wget/default/bin/wget
 $ df -k /usr/local/bin
Filesystem            kbytes    used   avail capacity  Mounted on
homer:/export/h2/local/solaris/8/x86
                     257416612 181730076 62610468    75%    /usr/local

On the FC3/x86_64 server (homer):

/export/h2/local/solaris/8/x86/bin exists:

[root@homer ~]# ls -Fdi /export/h2/local/solaris/8/x86/bin
26378515 /export/h2/local/solaris/8/x86/bin/
[root@homer ~]# df -kT /export/h2/
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroupHomer-LogVolExportH2
              ext3   257416612 181730076  62610468  75% /export/h2
[root@homer ~]# ls -l /dev/mapper/VolGroupHomer-LogVolExportH2
brw-------  1 root root 253, 1 Feb 28 18:53
/dev/mapper/VolGroupHomer-LogVolExportH2

The filehandle from the snoop above looks correct:

fb_version:     0x01        1
fb_auth_type:   0x00        0
fb_fsid_type:   0x00        0
fb_fileid_type: 0x01        1
fs_major:       0x00FD      253
fs_minor:       0x0001      1
fs_inode:       0x00000002  2
fi_inode:       0x01928113  26378515
fi_generation:  0x1FA97B78  531200888



Version-Release number of selected component (if applicable):

nfs-utils-1.0.6-52
kernel-2.6.10-1.766_FC3

How reproducible:

The problem is intermittent.

Steps to Reproduce:
1. Set up an nfs server with a few million files.
2. Set up a few tens of clients accessing the server via the
automounter; preferrably with lots of small mounts via a glob rule like:

* homer:/export/h2/i/solaris/8/x86/&

where /export/h2/i/solaris/8/x86 contains several hundred sub directories.

3. Watch the clients have intermittent failures.

  
Actual results:


Expected results:


Additional info:

Comment 5 Jack Neely 2005-05-26 19:20:35 UTC
I'm experiancing similar issues with a RHEL 4 NFS server serving an 80 node RHEL
3 Beowulf cluster.

With or without using the automounter I have a problem with home directories
turning into stale NFS handles, back normal, then stale again.  Altough they
seem to be mostly always stale.

The only pattern I've been able to find is that this affects directories where
the directory structure has grown beyond the initial 4K inode.

Example:

drwx--x--x   36 jjneely  jjneely      4096 May 26 14:48 jjneely
drwx------    4 foo      foo         36864 Mar 11 16:11 foo

My home directory (jjneely) has no problems.  However, foo's home directory 9
times out of 10 is a stale NFS handle.

With the automounter but not static mounts, I get the following error that I've
been unable to track down:

   nfs_statfs64: statfs error = 2

If I can provide other information please let me know what to do.

Comment 6 Jack Neely 2005-05-26 19:41:51 UTC
This LKML thread seems to sum up the problem nicely:

http://www.ussg.iu.edu/hypermail/linux/kernel/0411.0/0364.html

Comment 7 Stephen Tweedie 2005-05-26 20:20:31 UTC
The patch we are testing for this is attached to bug 150759.  For reference, you
can find it directly at the url

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=114644



Comment 8 Stephen Tweedie 2005-05-26 20:37:06 UTC
This patch is now in the upstream -mm kernel branch for testing.

Comment 12 Stephen Tweedie 2005-07-13 16:16:23 UTC
This fix was commited to 2.6 mainline on June 23.

Comment 20 Stephen Tweedie 2005-09-21 21:21:05 UTC
*** Bug 168986 has been marked as a duplicate of this bug. ***

Comment 22 Red Hat Bugzilla 2005-10-05 13:10:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html