Bug 158293 - nfs server intermitently claims ENOENT on existing files or directories
Summary: nfs server intermitently claims ENOENT on existing files or directories
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Stephen Tweedie
QA Contact: Brian Brock
URL:
Whiteboard:
: 168986 (view as bug list)
Depends On:
Blocks: 156322
TreeView+ depends on / blocked
 
Reported: 2005-05-20 12:04 UTC by Stephen Tweedie
Modified: 2007-11-30 22:07 UTC (History)
8 users (show)

Fixed In Version: RHSA-2005-514
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-05 13:10:33 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:514 0 qe-ready SHIPPED_LIVE Important: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 2 2005-10-05 04:00:00 UTC

Description Stephen Tweedie 2005-05-20 12:04:19 UTC
+++ This bug was initially created as a clone of Bug #150759 +++

Description of problem:

We just migrated one of our nfs servers from Solaris/sun4u to
FC3/x86_64, and the clients started intermittently complianing
about missing files.

Example:

Solaris 8/x86 client (arkanoid):

 $ ls -l /usr/local/bin/wget
/usr/local/bin/wget: No such file or directory

Snooping the net during the above ls gives:

 # snoop -v homer
ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 1 arrived at 20:58:52.51
ETHER:  Packet size = 206 bytes
ETHER:  Destination = 0:11:2f:d5:63:ee, 
ETHER:  Source      = 0:60:8:27:90:b8, 
ETHER:  Ethertype = 0800 (IP)
ETHER:  
IP:   ----- IP Header -----
IP:   
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:   Total length = 192 bytes
IP:   Identification = 13320
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 64 seconds/hops
IP:   Protocol = 6 (TCP)
IP:   Header checksum = 232c
IP:   Source address = 212.247.28.227, arkanoid
IP:   Destination address = 212.247.28.50, homer.roxen.com
IP:   No options
IP:   
TCP:  ----- TCP Header -----
TCP:  
TCP:  Source port = 549
TCP:  Destination port = 2049 (Sun RPC)
TCP:  Sequence number = 3143675431
TCP:  Acknowledgement number = 3654694930
TCP:  Data offset = 32 bytes
TCP:  Flags = 0x18
TCP:        ..0. .... = No urgent pointer
TCP:        ...1 .... = Acknowledgement
TCP:        .... 1... = Push
TCP:        .... .0.. = No reset
TCP:        .... ..0. = No Syn
TCP:        .... ...0 = No Fin
TCP:  Window = 33304
TCP:  Checksum = 0xb767
TCP:  Urgent pointer = 0
TCP:  Options: (12 bytes)
TCP:    - No operation
TCP:    - No operation
TCP:    - TS Val = 1359745451, TS Echo = 784908738
TCP:  
RPC:  ----- SUN RPC Header -----
RPC:  
RPC:  Record Mark: last fragment, length = 136
RPC:  Transaction id = 2581780379
RPC:  Type = 0 (Call)
RPC:  RPC version = 2
RPC:  Program = 100003 (NFS), version = 3, procedure = 1
RPC:  Credentials: Flavor = 1 (Unix), len = 72 bytes
RPC:     Time = 09-Mar-05 19:58:52
RPC:     Hostname = arkanoid
RPC:     Uid = 0, Gid = 1
RPC:     Groups = 1 0 2 3 4 5 6 7 8 9 12 
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  File handle = [5319]
NFS:   0100000100FD00010200000013819201787BA91F
NFS:  

ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 2 arrived at 20:58:52.51
ETHER:  Packet size = 98 bytes
ETHER:  Destination = 0:60:8:27:90:b8, 
ETHER:  Source      = 0:11:2f:d5:63:ee, 
ETHER:  Ethertype = 0800 (IP)
ETHER:  
IP:   ----- IP Header -----
IP:   
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:   Total length = 84 bytes
IP:   Identification = 63374
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 64 seconds/hops
IP:   Protocol = 6 (TCP)
IP:   Header checksum = 6011
IP:   Source address = 212.247.28.50, homer.roxen.com
IP:   Destination address = 212.247.28.227, arkanoid
IP:   No options
IP:   
TCP:  ----- TCP Header -----
TCP:  
TCP:  Source port = 2049
TCP:  Destination port = 549 (Sun RPC)
TCP:  Sequence number = 3654694930
TCP:  Acknowledgement number = 3143675571
TCP:  Data offset = 32 bytes
TCP:  Flags = 0x18
TCP:        ..0. .... = No urgent pointer
TCP:        ...1 .... = Acknowledgement
TCP:        .... 1... = Push
TCP:        .... .0.. = No reset
TCP:        .... ..0. = No Syn
TCP:        .... ...0 = No Fin
TCP:  Window = 2520
TCP:  Checksum = 0x6583
TCP:  Urgent pointer = 0
TCP:  Options: (12 bytes)
TCP:    - No operation
TCP:    - No operation
TCP:    - TS Val = 784931572, TS Echo = 1359745451
TCP:  
RPC:  ----- SUN RPC Header -----
RPC:  
RPC:  Record Mark: last fragment, length = 28
RPC:  Transaction id = 2581780379
RPC:  Type = 1 (Reply)
RPC:  This is a reply to frame 1
RPC:  Status = 0 (Accepted)
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  Accept status = 0 (Success)
RPC:  
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 2 (No such file or directory)
NFS:  

ETHER:  ----- Ether Header -----
ETHER:  
ETHER:  Packet 3 arrived at 20:58:52.60
ETHER:  Packet size = 66 bytes
ETHER:  Destination = 0:11:2f:d5:63:ee, 
ETHER:  Source      = 0:60:8:27:90:b8, 
ETHER:  Ethertype = 0800 (IP)
ETHER:  
IP:   ----- IP Header -----
IP:   
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:   Total length = 52 bytes
IP:   Identification = 13321
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 64 seconds/hops
IP:   Protocol = 6 (TCP)
IP:   Header checksum = 23b7
IP:   Source address = 212.247.28.227, arkanoid
IP:   Destination address = 212.247.28.50, homer.roxen.com
IP:   No options
IP:   
TCP:  ----- TCP Header -----
TCP:  
TCP:  Source port = 549
TCP:  Destination port = 2049 
TCP:  Sequence number = 3143675571
TCP:  Acknowledgement number = 3654694962
TCP:  Data offset = 32 bytes
TCP:  Flags = 0x10
TCP:        ..0. .... = No urgent pointer
TCP:        ...1 .... = Acknowledgement
TCP:        .... 0... = No push
TCP:        .... .0.. = No reset
TCP:        .... ..0. = No Syn
TCP:        .... ...0 = No Fin
TCP:  Window = 33304
TCP:  Checksum = 0xdede
TCP:  Urgent pointer = 0
TCP:  Options: (12 bytes)
TCP:    - No operation
TCP:    - No operation
TCP:    - TS Val = 1359745461, TS Echo = 784931572
TCP:  

A little while later it works fine, probably because the automounter
has remounted /usr/local:

 $ ls -l /usr/local/bin/wget
lrwxrwxrwx   1 erik     local         24 Mar  8 13:50
/usr/local/bin/wget -> /i/wget/default/bin/wget
 $ df -k /usr/local/bin
Filesystem            kbytes    used   avail capacity  Mounted on
homer:/export/h2/local/solaris/8/x86
                     257416612 181730076 62610468    75%    /usr/local

On the FC3/x86_64 server (homer):

/export/h2/local/solaris/8/x86/bin exists:

[root@homer ~]# ls -Fdi /export/h2/local/solaris/8/x86/bin
26378515 /export/h2/local/solaris/8/x86/bin/
[root@homer ~]# df -kT /export/h2/
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroupHomer-LogVolExportH2
              ext3   257416612 181730076  62610468  75% /export/h2
[root@homer ~]# ls -l /dev/mapper/VolGroupHomer-LogVolExportH2
brw-------  1 root root 253, 1 Feb 28 18:53
/dev/mapper/VolGroupHomer-LogVolExportH2

The filehandle from the snoop above looks correct:

fb_version:     0x01        1
fb_auth_type:   0x00        0
fb_fsid_type:   0x00        0
fb_fileid_type: 0x01        1
fs_major:       0x00FD      253
fs_minor:       0x0001      1
fs_inode:       0x00000002  2
fi_inode:       0x01928113  26378515
fi_generation:  0x1FA97B78  531200888



Version-Release number of selected component (if applicable):

nfs-utils-1.0.6-52
kernel-2.6.10-1.766_FC3

How reproducible:

The problem is intermittent.

Steps to Reproduce:
1. Set up an nfs server with a few million files.
2. Set up a few tens of clients accessing the server via the
automounter; preferrably with lots of small mounts via a glob rule like:

* homer:/export/h2/i/solaris/8/x86/&

where /export/h2/i/solaris/8/x86 contains several hundred sub directories.

3. Watch the clients have intermittent failures.

  
Actual results:


Expected results:


Additional info:

Comment 5 Jack Neely 2005-05-26 19:20:35 UTC
I'm experiancing similar issues with a RHEL 4 NFS server serving an 80 node RHEL
3 Beowulf cluster.

With or without using the automounter I have a problem with home directories
turning into stale NFS handles, back normal, then stale again.  Altough they
seem to be mostly always stale.

The only pattern I've been able to find is that this affects directories where
the directory structure has grown beyond the initial 4K inode.

Example:

drwx--x--x   36 jjneely  jjneely      4096 May 26 14:48 jjneely
drwx------    4 foo      foo         36864 Mar 11 16:11 foo

My home directory (jjneely) has no problems.  However, foo's home directory 9
times out of 10 is a stale NFS handle.

With the automounter but not static mounts, I get the following error that I've
been unable to track down:

   nfs_statfs64: statfs error = 2

If I can provide other information please let me know what to do.

Comment 6 Jack Neely 2005-05-26 19:41:51 UTC
This LKML thread seems to sum up the problem nicely:

http://www.ussg.iu.edu/hypermail/linux/kernel/0411.0/0364.html

Comment 7 Stephen Tweedie 2005-05-26 20:20:31 UTC
The patch we are testing for this is attached to bug 150759.  For reference, you
can find it directly at the url

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=114644



Comment 8 Stephen Tweedie 2005-05-26 20:37:06 UTC
This patch is now in the upstream -mm kernel branch for testing.

Comment 12 Stephen Tweedie 2005-07-13 16:16:23 UTC
This fix was commited to 2.6 mainline on June 23.

Comment 20 Stephen Tweedie 2005-09-21 21:21:05 UTC
*** Bug 168986 has been marked as a duplicate of this bug. ***

Comment 22 Red Hat Bugzilla 2005-10-05 13:10:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html



Note You need to log in before you can comment on or make changes to this bug.