Description of problem: We just migrated one of our nfs servers from Solaris/sun4u to FC3/x86_64, and the clients started intermittently complianing about missing files. Example: Solaris 8/x86 client (arkanoid): $ ls -l /usr/local/bin/wget /usr/local/bin/wget: No such file or directory Snooping the net during the above ls gives: # snoop -v homer ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 1 arrived at 20:58:52.51 ETHER: Packet size = 206 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 192 bytes IP: Identification = 13320 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 232c IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 549 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3143675431 TCP: Acknowledgement number = 3654694930 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xb767 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1359745451, TS Echo = 784908738 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 136 RPC: Transaction id = 2581780379 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes RPC: Time = 09-Mar-05 19:58:52 RPC: Hostname = arkanoid RPC: Uid = 0, Gid = 1 RPC: Groups = 1 0 2 3 4 5 6 7 8 9 12 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5319] NFS: 0100000100FD00010200000013819201787BA91F NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 2 arrived at 20:58:52.51 ETHER: Packet size = 98 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 84 bytes IP: Identification = 63374 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 6011 IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 549 (Sun RPC) TCP: Sequence number = 3654694930 TCP: Acknowledgement number = 3143675571 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x6583 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 784931572, TS Echo = 1359745451 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 28 RPC: Transaction id = 2581780379 RPC: Type = 1 (Reply) RPC: This is a reply to frame 1 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 2 (No such file or directory) NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 3 arrived at 20:58:52.60 ETHER: Packet size = 66 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 52 bytes IP: Identification = 13321 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 23b7 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 549 TCP: Destination port = 2049 TCP: Sequence number = 3143675571 TCP: Acknowledgement number = 3654694962 TCP: Data offset = 32 bytes TCP: Flags = 0x10 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 0... = No push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xdede TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1359745461, TS Echo = 784931572 TCP: A little while later it works fine, probably because the automounter has remounted /usr/local: $ ls -l /usr/local/bin/wget lrwxrwxrwx 1 erik local 24 Mar 8 13:50 /usr/local/bin/wget -> /i/wget/default/bin/wget $ df -k /usr/local/bin Filesystem kbytes used avail capacity Mounted on homer:/export/h2/local/solaris/8/x86 257416612 181730076 62610468 75% /usr/local On the FC3/x86_64 server (homer): /export/h2/local/solaris/8/x86/bin exists: [root@homer ~]# ls -Fdi /export/h2/local/solaris/8/x86/bin 26378515 /export/h2/local/solaris/8/x86/bin/ [root@homer ~]# df -kT /export/h2/ Filesystem Type 1K-blocks Used Available Use% Mounted on /dev/mapper/VolGroupHomer-LogVolExportH2 ext3 257416612 181730076 62610468 75% /export/h2 [root@homer ~]# ls -l /dev/mapper/VolGroupHomer-LogVolExportH2 brw------- 1 root root 253, 1 Feb 28 18:53 /dev/mapper/VolGroupHomer-LogVolExportH2 The filehandle from the snoop above looks correct: fb_version: 0x01 1 fb_auth_type: 0x00 0 fb_fsid_type: 0x00 0 fb_fileid_type: 0x01 1 fs_major: 0x00FD 253 fs_minor: 0x0001 1 fs_inode: 0x00000002 2 fi_inode: 0x01928113 26378515 fi_generation: 0x1FA97B78 531200888 Version-Release number of selected component (if applicable): nfs-utils-1.0.6-52 kernel-2.6.10-1.766_FC3 How reproducible: The problem is intermittent. Steps to Reproduce: 1. Set up an nfs server with a few million files. 2. Set up a few tens of clients accessing the server via the automounter; preferrably with lots of small mounts via a glob rule like: * homer:/export/h2/i/solaris/8/x86/& where /export/h2/i/solaris/8/x86 contains several hundred sub directories. 3. Watch the clients have intermittent failures. Actual results: Expected results: Additional info:
After enabling nfsd_debug, I got the following for a failed request (nfs filehandle 0100000100FD00010200000028038E0167B9A81F): nfsd_dispatch: vers 3 proc 1 nfsd: GETATTR(3) 20: 01000001 0100fd00 00000002 018e0328 1fa8b967 00000000 nfsd: fh_verify(20: 01000001 0100fd00 00000002 018e0328 1fa8b967 00000000) nfsd_dispatch: vers 3 proc 1 nfsd: GETATTR(3) 28: 02000001 0100fd00 00000002 00964669 1fad0d79 00964668 nfsd: fh_verify(28: 02000001 0100fd00 00000002 00964669 1fad0d79 00964668) nfsd_acceptable failed at 000001003466d598 nfsd_dispatch: vers 3 proc 3
I noticed that the checks performed by nfsd_acceptable can be turned off by exporting with no_subtree_check, so I've done that for now.
That got rid of "nfsd_acceptable failed", but not of the ENOENT's.
After adding some extra debug to the kernel, I get the following: Mar 14 14:29:59 homer kernel: nfsd_dispatch: vers 3 proc 3 Mar 14 14:30:00 homer kernel: nfsd: LOOKUP(3) 20: 01000001 0100fd00 00000002 00a74425 1faef8ca 00000000 select.pyc Mar 14 14:30:00 homer kernel: nfsd: nfsd_lookup(fh 20: 01000001 0100fd00 00000002 00a74425 1faef8ca 00000000, select.pyc) Mar 14 14:30:00 homer kernel: nfsd: fh_verify(20: 01000001 0100fd00 00000002 00a74425 1faef8ca 00000000) Mar 14 14:30:00 homer kernel: nfsd: fh_lock(20: 01000001 0100fd00 00000002 00a74425 1faef8ca 00000000) locked = 0 Mar 14 14:30:00 homer kernel: nfsd: fh_compose(exp fd:01/2 plat-sunos5/select.pyc, ino=0) Mar 14 14:30:00 homer kernel: nfsd_lookup: no inode. Which indicates that dentry->d_inode is NULL at the end of fs/nfsd/vfs.c:nfsd_lookup().
Oops, that looks like a false hit.
Succeded in snooping the failure again (it occurrs more seldom after the no_subtree_check change). The snoop log is interleaved with the corresponding entries from the nfsd log (nfsd_debug:32767). It looks to me like fh_verify fails. ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10686 arrived at 21:05:1.01 ETHER: Packet size = 174 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 160 bytes IP: Identification = 22753 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = fe72 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 868 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3243914689 TCP: Acknowledgement number = 1156470685 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0x1377 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1411620767, TS Echo = 110005392 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 104 RPC: Transaction id = 1540906396 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 40 bytes RPC: Time = 15-Mar-05 20:05:01 RPC: Hostname = arkanoid RPC: Uid = 997, Gid = 997 RPC: Groups = 997 5008 2000 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5D19] NFS: 0100000100FD00010200000012819201777BA91F NFS: Mar 15 21:05:00 homer kernel: nfsd_dispatch: vers 3 proc 1 Mar 15 21:05:00 homer kernel: nfsd: GETATTR(3) 20: 01000001 0100fd00 00000002 01928112 1fa97b77 00000000 Mar 15 21:05:00 homer kernel: nfsd: fh_verify(20: 01000001 0100fd00 00000002 01928112 1fa97b77 00000000) $ ls -lid /export/h2/local/solaris/8/x86 26378514 drwxrwsr-x 13 root local 4096 May 24 2001 /export/h2/local/solaris/8/x86 ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10687 arrived at 21:05:1.01 ETHER: Packet size = 182 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 168 bytes IP: Identification = 31534 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = dc1d IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 868 (Sun RPC) TCP: Sequence number = 1156470685 TCP: Acknowledgement number = 3243914797 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x3d5d TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 110301705, TS Echo = 1411620767 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 112 RPC: Transaction id = 1540906396 RPC: Type = 1 (Reply) RPC: This is a reply to frame 10686 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 0 (OK) NFS: File type = 2 (Directory) NFS: Mode = 042775 NFS: Setuid = 0, Setgid = 1, Sticky = 0 NFS: Owner's permissions = rwx NFS: Group's permissions = rwx NFS: Other's permissions = r-x NFS: Link count = 13, User ID = 0, Group ID = 2000 NFS: File size = 4096, Used = 4096 NFS: Special: Major = 0, Minor = 0 NFS: File system id = 64769, File id = 26378514 NFS: Last access time = 15-Mar-05 20:01:46.000000000 GMT NFS: Modification time = 24-May-01 09:50:19.000000000 GMT NFS: Attribute change time = 08-Mar-05 12:55:44.000000000 GMT NFS: NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10688 arrived at 21:05:1.01 ETHER: Packet size = 174 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 160 bytes IP: Identification = 22754 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = fe71 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 868 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3243914797 TCP: Acknowledgement number = 1156470801 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0x8a19 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1411620767, TS Echo = 110301705 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 104 RPC: Transaction id = 1557683612 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 40 bytes RPC: Time = 15-Mar-05 20:05:01 RPC: Hostname = arkanoid RPC: Uid = 997, Gid = 997 RPC: Groups = 997 5008 2000 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5319] NFS: 0100000100FD00010200000013819201787BA91F NFS: Mar 15 21:05:00 homer kernel: nfsd_dispatch: vers 3 proc 1 Mar 15 21:05:00 homer kernel: nfsd: GETATTR(3) 20: 01000001 0100fd00 00000002 01928113 1fa97b78 00000000 Mar 15 21:05:00 homer kernel: nfsd: fh_verify(20: 01000001 0100fd00 00000002 01928113 1fa97b78 00000000) $ ls -lid /export/h2/local/solaris/8/x86/bin 26378515 drwxrwsr-x 2 root local 36864 Sep 20 14:33 /export/h2/local/solaris/8/x86/bin ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10689 arrived at 21:05:1.01 ETHER: Packet size = 98 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 84 bytes IP: Identification = 31536 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = dc6f IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 868 (Sun RPC) TCP: Sequence number = 1156470801 TCP: Acknowledgement number = 3243914905 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0xa26c TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 110301706, TS Echo = 1411620767 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 28 RPC: Transaction id = 1557683612 RPC: Type = 1 (Reply) RPC: This is a reply to frame 10688 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 2 (No such file or directory) NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10690 arrived at 21:05:1.02 ETHER: Packet size = 174 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 160 bytes IP: Identification = 22755 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = fe70 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 868 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3243914905 TCP: Acknowledgement number = 1156470833 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0x878c TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1411620767, TS Echo = 110301706 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 104 RPC: Transaction id = 1591238044 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 40 bytes RPC: Time = 15-Mar-05 20:05:01 RPC: Hostname = arkanoid RPC: Uid = 997, Gid = 997 RPC: Groups = 997 5008 2000 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5319] NFS: 0100000100FD00010200000013819201787BA91F NFS: Mar 15 21:05:00 homer kernel: nfsd_dispatch: vers 3 proc 1 Mar 15 21:05:00 homer kernel: nfsd: GETATTR(3) 20: 01000001 0100fd00 00000002 01928113 1fa97b78 00000000 Mar 15 21:05:00 homer kernel: nfsd: fh_verify(20: 01000001 0100fd00 00000002 01928113 1fa97b78 00000000) $ ls -lid /export/h2/local/solaris/8/x86/bin 26378515 drwxrwsr-x 2 root local 36864 Sep 20 14:33 /export/h2/local/solaris/8/x86/bin ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10691 arrived at 21:05:1.02 ETHER: Packet size = 98 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 84 bytes IP: Identification = 31538 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = dc6d IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 868 (Sun RPC) TCP: Sequence number = 1156470833 TCP: Acknowledgement number = 3243915013 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x9fe0 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 110301706, TS Echo = 1411620767 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 28 RPC: Transaction id = 1591238044 RPC: Type = 1 (Reply) RPC: This is a reply to frame 10690 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 2 (No such file or directory) NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 10692 arrived at 21:05:1.11 ETHER: Packet size = 66 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 52 bytes IP: Identification = 22756 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = fedb IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 868 TCP: Destination port = 2049 TCP: Sequence number = 3243915013 TCP: Acknowledgement number = 1156470865 TCP: Data offset = 32 bytes TCP: Flags = 0x10 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 0... = No push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0x6432 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1411620777, TS Echo = 110301706 TCP:
Adding some more debug to fh_verify gives the following in the log (inode 0x0928113 is as before a valid directory): > Mar 26 04:15:00 homer kernel: nfsd_dispatch: vers 3 proc 1 > Mar 26 04:15:00 homer kernel: nfsd: GETATTR(3) 20: 01000001 0100fd00 000000020 01928113 1fa97b78 00000000 > Mar 26 04:15:00 homer kernel: nfsd: fh_verify(20: 01000001 0100fd00 00000002 01928113 1fa97b78 00000000) > Mar 26 04:15:00 homer kernel: nfsd: fh_verify(20: 01000001 0100fd00 00000002 01928113 1fa97b78 00000000): bad dentry: -2 > Mar 26 04:15:00 homer kernel: nfsd_dispatch: vers 3 proc 1 Which hints that the following code (fs/nfsd/nfsfh.c:~218) fails with ENOENT: if (fileid_type == 0) dentry = dget(exp->ex_dentry); else { struct export_operations *nop = exp->ex_mnt->mnt_sb->s_e xport_op; dentry = CALL(nop,decode_fh)(exp->ex_mnt->mnt_sb, datap, data_left, fileid_type, nfsd_acceptable, exp); }
Turning on no_subtree_check definitely seems like the right thing to do, but I am surprise your still getting ENOENTS.... Maybe I missed it, but has the client has been rebooted to ensure its not bad cached file handles?
also what kernel version is this happening with?
Hi Steve. The clients have not been rebooted, but since they mount via autofs the directory has been remounted any number of times. Note also that the failure is intermittent; it's detected by a cronjob that attempts to run /usr/local/bin/wget every ten minutes, and only fails ~10 times in a 24 hour period. I mentioned the kernel version in my first message: kernel-2.6.10-1.766_FC3 Below is a snoop where it succeeded; note that packet 5 is identical to a request that failed above, but this time succeeds. $ ls -lid /usr/local/bin 26378515 drwxrwsr-x 2 root local 36864 Sep 20 2004 /usr/local/bin ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 1 arrived at 18:32:50.01 ETHER: Packet size = 206 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 192 bytes IP: Identification = 62115 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 6490 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 601 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3970125624 TCP: Acknowledgement number = 2490468361 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0x5bce TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1539943878, TS Echo = 439894291 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 136 RPC: Transaction id = 2188762781 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes RPC: Time = 30-Mar-05 16:32:50 RPC: Hostname = arkanoid RPC: Uid = 0, Gid = 1 RPC: Groups = 1 0 2 3 4 5 6 7 8 9 12 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5D19] NFS: 0100000100FD00010200000012819201777BA91F NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 2 arrived at 18:32:50.01 ETHER: Packet size = 182 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 168 bytes IP: Identification = 61606 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 66a5 IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 601 (Sun RPC) TCP: Sequence number = 2490468361 TCP: Acknowledgement number = 3970125764 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x2106 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 440061288, TS Echo = 1539943878 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 112 RPC: Transaction id = 2188762781 RPC: Type = 1 (Reply) RPC: This is a reply to frame 1 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 0 (OK) NFS: File type = 2 (Directory) NFS: Mode = 042775 NFS: Setuid = 0, Setgid = 1, Sticky = 0 NFS: Owner's permissions = rwx NFS: Group's permissions = rwx NFS: Other's permissions = r-x NFS: Link count = 13, User ID = 0, Group ID = 2000 NFS: File size = 4096, Used = 4096 NFS: Special: Major = 0, Minor = 0 NFS: File system id = 64769, File id = 26378514 NFS: Last access time = 30-Mar-05 02:13:11.000000000 GMT NFS: Modification time = 24-May-01 09:50:19.000000000 GMT NFS: Attribute change time = 08-Mar-05 12:55:44.000000000 GMT NFS: NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 3 arrived at 18:32:50.01 ETHER: Packet size = 210 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 196 bytes IP: Identification = 62116 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 648b IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 601 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3970125764 TCP: Acknowledgement number = 2490468477 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xcd69 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1539943878, TS Echo = 440061288 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 140 RPC: Transaction id = 2205539997 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 4 RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes RPC: Time = 30-Mar-05 16:32:50 RPC: Hostname = arkanoid RPC: Uid = 0, Gid = 1 RPC: Groups = 1 0 2 3 4 5 6 7 8 9 12 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 4 (Check access permission) NFS: File handle = [5D19] NFS: 0100000100FD00010200000012819201777BA91F NFS: Access bits = 0x00000002 NFS: .... ...0 = (no read) NFS: .... ..1. = Lookup NFS: .... .0.. = (no modify) NFS: .... 0... = (no extend) NFS: ...0 .... = (no delete) NFS: ..0. .... = (no execute) NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 4 arrived at 18:32:50.01 ETHER: Packet size = 190 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 176 bytes IP: Identification = 61608 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 669b IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 601 (Sun RPC) TCP: Sequence number = 2490468477 TCP: Acknowledgement number = 3970125908 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x1eef TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 440061288, TS Echo = 1539943878 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 120 RPC: Transaction id = 2205539997 RPC: Type = 1 (Reply) RPC: This is a reply to frame 3 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 4 (Check access permission) NFS: Status = 0 (OK) NFS: Post-operation attributes: NFS: File type = 2 (Directory) NFS: Mode = 042775 NFS: Setuid = 0, Setgid = 1, Sticky = 0 NFS: Owner's permissions = rwx NFS: Group's permissions = rwx NFS: Other's permissions = r-x NFS: Link count = 13, User ID = 0, Group ID = 2000 NFS: File size = 4096, Used = 4096 NFS: Special: Major = 0, Minor = 0 NFS: File system id = 64769, File id = 26378514 NFS: Last access time = 30-Mar-05 02:13:11.000000000 GMT NFS: Modification time = 24-May-01 09:50:19.000000000 GMT NFS: Attribute change time = 08-Mar-05 12:55:44.000000000 GMT NFS: NFS: Access = lookup NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 5 arrived at 18:32:50.01 ETHER: Packet size = 206 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 192 bytes IP: Identification = 62117 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 648e IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 601 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3970125908 TCP: Acknowledgement number = 2490468601 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xc96a TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1539943878, TS Echo = 440061288 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 136 RPC: Transaction id = 2222317213 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100003 (NFS), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes RPC: Time = 30-Mar-05 16:32:50 RPC: Hostname = arkanoid RPC: Uid = 0, Gid = 1 RPC: Groups = 1 0 2 3 4 5 6 7 8 9 12 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: File handle = [5319] NFS: 0100000100FD00010200000013819201787BA91F NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 6 arrived at 18:32:50.01 ETHER: Packet size = 182 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 168 bytes IP: Identification = 61610 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 66a1 IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 601 (Sun RPC) TCP: Sequence number = 2490468601 TCP: Acknowledgement number = 3970126048 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x2341 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 440061289, TS Echo = 1539943878 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 112 RPC: Transaction id = 2222317213 RPC: Type = 1 (Reply) RPC: This is a reply to frame 5 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 0 (Success) RPC: NFS: ----- Sun NFS ----- NFS: NFS: Proc = 1 (Get file attributes) NFS: Status = 0 (OK) NFS: File type = 2 (Directory) NFS: Mode = 042775 NFS: Setuid = 0, Setgid = 1, Sticky = 0 NFS: Owner's permissions = rwx NFS: Group's permissions = rwx NFS: Other's permissions = r-x NFS: Link count = 2, User ID = 0, Group ID = 2000 NFS: File size = 36864, Used = 36864 NFS: Special: Major = 0, Minor = 0 NFS: File system id = 64769, File id = 26378515 NFS: Last access time = 30-Mar-05 02:13:12.000000000 GMT NFS: Modification time = 20-Sep-04 12:33:41.000000000 GMT NFS: Attribute change time = 08-Mar-05 12:50:12.000000000 GMT NFS: NFS: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 7 arrived at 18:32:50.01 ETHER: Packet size = 210 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 196 bytes IP: Identification = 62118 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 6489 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 601 TCP: Destination port = 2049 (Sun RPC) TCP: Sequence number = 3970126048 TCP: Acknowledgement number = 2490468717 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xc677 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1539943878, TS Echo = 440061289 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 140 RPC: Transaction id = 2239094429 RPC: Type = 0 (Call) RPC: RPC version = 2 RPC: Program = 100227 (NFS_ACL), version = 3, procedure = 1 RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes RPC: Time = 30-Mar-05 16:32:50 RPC: Hostname = arkanoid RPC: Uid = 0, Gid = 1 RPC: Groups = 1 0 2 3 4 5 6 7 8 9 12 RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: NFS_ACL: ----- Sun NFS_ACL ----- NFS_ACL: NFS_ACL: Proc = 1 (Get file access control list) NFS_ACL: File handle = [5319] NFS_ACL: 0100000100FD00010200000013819201787BA91F NFS_ACL: Mask = 0xa NFS_ACL: .... ...0 = (no aclent) NFS_ACL: .... ..1. = aclcnt NFS_ACL: .... .0.. = (no dfaclent) NFS_ACL: .... 1... = dfaclcnt NFS_ACL: ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 8 arrived at 18:32:50.01 ETHER: Packet size = 94 bytes ETHER: Destination = 0:60:8:27:90:b8, ETHER: Source = 0:11:2f:d5:63:ee, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 80 bytes IP: Identification = 61612 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 66f7 IP: Source address = 212.247.28.50, homer.roxen.com IP: Destination address = 212.247.28.227, arkanoid IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 2049 TCP: Destination port = 601 (Sun RPC) TCP: Sequence number = 2490468717 TCP: Acknowledgement number = 3970126192 TCP: Data offset = 32 bytes TCP: Flags = 0x18 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 1... = Push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 2520 TCP: Checksum = 0x4e0e TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 440061289, TS Echo = 1539943878 TCP: RPC: ----- SUN RPC Header ----- RPC: RPC: Record Mark: last fragment, length = 24 RPC: Transaction id = 2239094429 RPC: Type = 1 (Reply) RPC: This is a reply to frame 7 RPC: Status = 0 (Accepted) RPC: Verifier : Flavor = 0 (None), len = 0 bytes RPC: Accept status = 1 (Program unavailable) ETHER: ----- Ether Header ----- ETHER: ETHER: Packet 9 arrived at 18:32:50.11 ETHER: Packet size = 66 bytes ETHER: Destination = 0:11:2f:d5:63:ee, ETHER: Source = 0:60:8:27:90:b8, ETHER: Ethertype = 0800 (IP) ETHER: IP: ----- IP Header ----- IP: IP: Version = 4 IP: Header length = 20 bytes IP: Type of service = 0x00 IP: xxx. .... = 0 (precedence) IP: ...0 .... = normal delay IP: .... 0... = normal throughput IP: .... .0.. = normal reliability IP: Total length = 52 bytes IP: Identification = 62119 IP: Flags = 0x4 IP: .1.. .... = do not fragment IP: ..0. .... = last fragment IP: Fragment offset = 0 bytes IP: Time to live = 64 seconds/hops IP: Protocol = 6 (TCP) IP: Header checksum = 6518 IP: Source address = 212.247.28.227, arkanoid IP: Destination address = 212.247.28.50, homer.roxen.com IP: No options IP: TCP: ----- TCP Header ----- TCP: TCP: Source port = 601 TCP: Destination port = 2049 TCP: Sequence number = 3970126192 TCP: Acknowledgement number = 2490468745 TCP: Data offset = 32 bytes TCP: Flags = 0x10 TCP: ..0. .... = No urgent pointer TCP: ...1 .... = Acknowledgement TCP: .... 0... = No push TCP: .... .0.. = No reset TCP: .... ..0. = No Syn TCP: .... ...0 = No Fin TCP: Window = 33304 TCP: Checksum = 0xb9f9 TCP: Urgent pointer = 0 TCP: Options: (12 bytes) TCP: - No operation TCP: - No operation TCP: - TS Val = 1539943888, TS Echo = 440061289 TCP:
Hmm... It seems the problem correlates to when the filesystem load is high. My cron jobs complain: at 21:00 - 21:30 when the incremental backup runs. at 04:00 - 04:30 when /etc/cron.daily/slocate.cron runs. I succeeded in forcing the problem by running fsck -vn /export/h2 -- -n -f (no errors reported).
hmm... I wonder if somehow your losing your exports.... does "exportfs -arv" bring back the lost file or directory?
Nope, doesn't seem to work: (# on server, $ on client) $ ls -lid /usr/local/bin 26378515 drwxrwsr-x 2 root local 36864 Sep 20 2004 /usr/local/bin/ # fsck -vn /export/h2 -- -n -f & [1] 14296 fsck 1.35 (28-Feb-2004) e2fsck 1.35 (28-Feb-2004) Warning! /dev/VolGroupHomer/LogVolExportH2 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check. Pass 1: Checking inodes, blocks, and sizes $ ls -lid /usr/local/bin /usr/local/bin: No such file or directory # exportfs -arv exporting 194.52.182.0/255.255.255.0:/export/h2 exporting 212.247.28.0/255.255.255.0:/export/h2 exporting 194.52.182.0/255.255.255.0:/export/h1 exporting 212.247.28.0/255.255.255.0:/export/h1 exporting 194.52.182.0/255.255.255.0:/export/d1 exporting 212.247.28.0/255.255.255.0:/export/d1 exporting *.roxen.com:/export/h2 exporting *.roxen.com:/export/h1 exporting *.roxen.com:/export/d1 $ ls -lid /usr/local/bin /usr/local/bin: No such file or directory
Enabling and adding some further debug to fs/exportfs/expfs.c:find_exported_dentry yields the following: Apr 12 16:12:06 homer kernel: find_exported_dentry: DISCONNECTED for dm-1/26378515 Apr 12 16:12:06 homer kernel: find_exported_dentry: need to look harder for dm-1/26378515 Apr 12 16:12:06 homer kernel: find_exported_dentry: get_parent of 26378515 failed, err -2 Apr 12 16:12:06 homer kernel: find_exported_dentry: DISCONNECTED for target dm-1/26378515
For fs/ext3/namei.c:ext3_get_parent to fail with ENOENT fs/ext3/namei.c:ext3_find_entry must return NULL for "..". Is ext3_find_entry prepared for getting called with a DCACHE_DISCONNECTED dentry?
Remounting the filesystems with ext2 instead of ext3 seems to yield working nfsd. Unfortunately this is only a stop-gap workaround, since I need journaling too.
I think I have the same issue, with a slightly different setup. I posted the following to the sourceforge nfs list, and was pointed to your bug entry: I'm having a problem with directories "disappearing" using a Fedora 3 NFS server and a solaris 5.10 client (NFS v3). When I first cd into the directory, an ls shows what I expect. If I then wait an hour and redo the ls, I get: .: No such file or directory If I then do a pwd, followed by another ls, the diretory contents reappear. For example: solara$ ls .: No such file or directory solara$ pwd /proj/videorunner/hw/vr_enc/cae/PLD/lbenc/src solara$ ls AUDX_NEW/ mt48lc8m16a2_.vhd FLYWHEEL_DEC/ outdata0.txt I captured the NFS traffic with ethereal. When I do the initial ls, the sequence is a series of LOOKUP calls (walking the pathname) followed by a READDIRPLUS call to obtain the file names. When I later do the ls, an ACCESS call is issued, and that call returns ERR_NOENT. After the pwd, when the ls works, I see a series of LOOKUPs that again traverses the pathname, followed by LOOKUPs for all the filenames in the directory (so obviously the directory contents were cached in the solaris machine). All the permissions in the directories of the pathname include world read/execute, so I don't understand why the ACCESS call would return ERR_NOENT. I can attach all or part of the ethereal traces if that would help. The traces are about 70k bytes in a gzipped tar file.
Sounds like my problem, yes. Is your exported filesystem ext3?
Yes - we are using ext3. We will try using ext2 tonight to verify your workaround.
Many thanks to all that uncovered this! We have been having this identical problem. FYI - the temporary solution of remounting the servers filesystem as ext2 seems to have solved the problem for us. Thanks again, Alex.
I confirm that ext2 is a workaround.
Its nice to see other people with this issue. For the longest time we thought we were the only ones. We having dealing with this for four months now. We have now moved to Red Hat ES 4.0 and the problem is there too. Im not sure if EXT2 is an option for us due to the lack of journaling. Does anyone know of a 2.6 based kernal distro that DOESNT have this issue?
I found an old thread that discussed this bug, but there was no resolution at that time. http://www.ussg.iu.edu/hypermail/linux/kernel/0502.1/0545.html
FYI, I have been trying multiple versions of the kernel trying to find which one introduced this bug. These are my findings: Kernels with the problem: 2.6.9-1.3 2.6.9-1.11 2.6.10-1.14 2.6.10-1.771 2.6.10-1.9 Kernels with out this problem: 2.6.6-1.427 2.6.8-1.521
See also <a href="https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=144556">#144556</a>
It looks like get_parent() probably fails due to ".." (and ".") not being in the directory hashtable. This doesn't matter for normal filesystem operation, since they are usually handled directly by the dcache layer. get_parent exists to be able to reconnect detached dentrys (which is a case the dcache layer doesn't handle). May 6 19:51:03 homer kernel: find_exported_dentry: DISCONNECTED for dm-1/26378515 May 6 19:51:03 homer kernel: find_exported_dentry: need to look harder for dm-1/26378515 May 6 19:51:03 homer kernel: ext3_get_parent May 6 19:51:03 homer kernel: ext3_find_entry: bh: 0, err:-2 May 6 19:51:03 homer kernel: ext3_get_parent() failed with ENOENT A potential workaround would be to disable the ext3 dir_index feature (eg using tune2fs).
After looking at the definition of struct dx_root, it seems like the entries for "." and ".." are intentionally not in the hashtable. I have a preliminary fix for get_parent that I'm compiling right now.
Created attachment 114121 [details] ext3_get_parent now works for directories in dx-mode. Tentative fix. Seems to work fine for me.
*** Bug 144556 has been marked as a duplicate of this bug. ***
Patch submitted to the ext3 & nfsd maintainers.
After rebuilding the kernel with Henrik Grubbstrom's patch to namei.c on four PCs, I beat NFS to death for 17 hours running the test script described in Note #19 above. This runs 'find . -mount > /dev/null' on the NFS-mounted root filesystems of four machines; in all, 120 complete scans were run. Zero errors occurred with the Grubbstrom patch, in contrast to 9 random errors with unaltered kernel-2.6.11-1.14_FC3, so I am happy to add my observation to others saying that this bug is finally squashed. Thank you Henrik Grubbstrom for your persistent and perceptive work to fix this longstanding and insidious NFS bug. I hope this fix will be incorporated into a new kernel release, and surely into the forthcoming FC4 release.
I just want to echo the above sentiments... we're working around the bug at the moment, but it would be nice to see a fixed kernel released soon through official channels! Thank you very, very much to all of the above people, Henrik Grubbstrom especially, who have helped figure this out.
I have verified the behavior on CentOS4 system - thanks to Henrik's patch, the problem is gone. Hope this fix will make it to the RHEL4 U1 too. Many thanks to all the people who helped to squish this bug, especially to Henrik.
Created attachment 114644 [details] Fix lookup of ".." by ext3_get_parent This is the patch that has been proposed upstream to fix this problem; I'm testing it now.
I too am testing Andreas's patch. So far, it appears to fix the problem. In my test setup, I would ordinarily have seen a few errors by now, but with the patch I've seen zero errors. I'll post back on Monday, after a weekend of testing.
After running over the weekend, I had no issues with Andreas's patch. Looks good to me.
This patch is now in the upstream -mm kernel branch for testing.
Does anyone know if this has been fixed in Fedora Core 4 or in Enterprise yet?
I took a quick look through the fc4 kernel source, and didn't see the patch. It may be in testing though.
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.
I looked at the source for kernel-2.6.12-1.1372_FC3 and didn't see the patch. I also looked at the source for kernel-2.6.12-1.1398_FC4, and it wasn't there either. However, the patch is in patch-2.6.13-rc3.bz2 at kernel.org. I am reluctant to try 2.6.12-1.1372 on our production server, since the patch doesn't appear to be present. Did I just miss it somehow?
merged to cvs, will be in next fc3/fc4 updates coming very soon.
Is this patch going to make it into the RedHat EL4 kernel sometime soon?
OK, I see it has been filed as bug 158293