+++ This bug was initially created as a clone of Bug #150759 +++ Description of problem: We just migrated one of our nfs servers from Solaris/sun4u to FC3/x86_64, and the clients started intermittently complianing about missing files. Example: Solaris 8/x86 client (arkanoid): $ ls -l /usr/local/bin/wget /usr/local/bin/wget: No such file or directory Snooping the net during the above ls gives: # snoop -v homer ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 1 arrived at 20:58:52.51 ETHER: Packet size = 206 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 192 bytes IP: Identification = 13320 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 232c IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 549 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3143675431 TCP: Acknowledgement number = 3654694930 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xb767 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1359745451, TS Echo = 784908738 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 136 RPC: Transaction id = 2581780379 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes RPC: Time = 09-Mar-05 19:58:52 RPC: Hostname = arkanoid RPC: Uid = 0, Gid = 1 RPC: Groups = 1 0 2 3 4 5 6 7 8 9 12 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5319] NFS: 0100000100FD00010200000013819201787BA91F NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 2 arrived at 20:58:52.51 ETHER: Packet size = 98 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 84 bytes IP: Identification = 63374 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 6011 IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 549 (Sun RPC) TCP: Sequence number = 3654694930 TCP: Acknowledgement number = 3143675571 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x6583 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 784931572, TS Echo = 1359745451 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 28 RPC: Transaction id = 2581780379 RPC: Type = 1 (Reply) RPC: This is a reply to frame 1 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 2 (No such file or directory) NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 3 arrived at 20:58:52.60 ETHER: Packet size = 66 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 52 bytes IP: Identification = 13321 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 23b7 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 549 TCP: Destination port = 2049 TCP: Sequence number = 3143675571 TCP: Acknowledgement number = 3654694962 TCP: Data offset = 32 bytes TCP: Flags = 0x10 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 0... = No push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xdede TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1359745461, TS Echo = 784931572 TCP: A little while later it works fine, probably because the automounter has remounted /usr/local: $ ls -l /usr/local/bin/wget lrwxrwxrwx 1 erik local 24 Mar 8 13:50 /usr/local/bin/wget -> /i/wget/default/bin/wget $ df -k /usr/local/bin Filesystem kbytes used avail capacity Mounted on homer:/export/h2/local/solaris/8/x86 257416612 181730076 62610468 75% /usr/local On the FC3/x86_64 server (homer): /export/h2/local/solaris/8/x86/bin exists: [root@homer ~]# ls -Fdi /export/h2/local/solaris/8/x86/bin 26378515 /export/h2/local/solaris/8/x86/bin/ [root@homer ~]# df -kT /export/h2/ Filesystem Type 1K-blocks Used Available Use% Mounted on /dev/mapper/VolGroupHomer-LogVolExportH2 ext3 257416612 181730076 62610468 75% /export/h2 [root@homer ~]# ls -l /dev/mapper/VolGroupHomer-LogVolExportH2 brw------- 1 root root 253, 1 Feb 28 18:53 /dev/mapper/VolGroupHomer-LogVolExportH2 The filehandle from the snoop above looks correct: fb_version: 0x01 1 fb_auth_type: 0x00 0 fb_fsid_type: 0x00 0 fb_fileid_type: 0x01 1 fs_major: 0x00FD 253 fs_minor: 0x0001 1 fs_inode: 0x00000002 2 fi_inode: 0x01928113 26378515 fi_generation: 0x1FA97B78 531200888 Version-Release number of selected component (if applicable): nfs-utils-1.0.6-52 kernel-2.6.10-1.766_FC3 How reproducible: The problem is intermittent. Steps to Reproduce: 1. Set up an nfs server with a few million files. 2. Set up a few tens of clients accessing the server via the automounter; preferrably with lots of small mounts via a glob rule like: * homer:/export/h2/i/solaris/8/x86/& where /export/h2/i/solaris/8/x86 contains several hundred sub directories. 3. Watch the clients have intermittent failures. Actual results: Expected results: Additional info:
I'm experiancing similar issues with a RHEL 4 NFS server serving an 80 node RHEL 3 Beowulf cluster. With or without using the automounter I have a problem with home directories turning into stale NFS handles, back normal, then stale again. Altough they seem to be mostly always stale. The only pattern I've been able to find is that this affects directories where the directory structure has grown beyond the initial 4K inode. Example: drwx--x--x 36 jjneely jjneely 4096 May 26 14:48 jjneely drwx------ 4 foo foo 36864 Mar 11 16:11 foo My home directory (jjneely) has no problems. However, foo's home directory 9 times out of 10 is a stale NFS handle. With the automounter but not static mounts, I get the following error that I've been unable to track down: nfs_statfs64: statfs error = 2 If I can provide other information please let me know what to do.
This LKML thread seems to sum up the problem nicely: http://www.ussg.iu.edu/hypermail/linux/kernel/0411.0/0364.html
The patch we are testing for this is attached to bug 150759. For reference, you can find it directly at the url https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=114644
This patch is now in the upstream -mm kernel branch for testing.
This fix was commited to 2.6 mainline on June 23.
*** Bug 168986 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-514.html