Hide Forgot
Description of problem: We automount a number of nfs directories and I'm starting to see the mount.nfs (or mount.nfs4) command hang sometimes. # ps -fe | grep mount root 823 1 0 Feb13 ? 00:00:33 automount --pid-file /var/run/autofs.pid root 21094 21020 0 10:02 pts/11 00:00:00 grep --color=auto mount root 25887 823 0 Feb15 ? 00:00:00 /bin/mount -n -t nfs -s -o intr,rsize=32768,wsize=32768,noatime alexandria:/export/cora /data/cora root 25888 25887 14 Feb15 ? 01:54:00 /sbin/mount.nfs alexandria:/export/cora /data/cora -s -n -o rw,noatime,intr,rsize=32768,wsize=32768 # gdb -p 25888 GNU gdb (GDB) Fedora (7.3.50.20110722-10.fc16) Copyright (C) 2011 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Attaching to process 25888 <hangs> I don't seem to be able to find anything relevant in the logs. Not sure where to look next. Suggestions welcome. Server is running CentOS 5. I was able to kill the /bin/mount process but not the /sbin/mount.nfs process. Eventually the /sbin/mount.nfs process was cleaned up. Version-Release number of selected component (if applicable): nfs-utils-1.2.5-4.fc16.x86_64 How reproducible: Infrequently
Here's another, kerberized nfsv4 this time. Interestingly the same server and mount. [root@amakihi ~]# ps -fe | grep mount root 1375 1 0 Feb10 ? 00:00:25 automount --pid-file /var/run/autofs.pid root 5862 1375 0 11:04 ? 00:00:00 /bin/mount -n -t nfs4 -s -o sec=krb5,intr,rsize=32768,wsize=32768,noatime alexandria:/cora /data/cora root 5863 5862 7 11:04 ? 00:03:32 /sbin/mount.nfs4 alexandria:/cora /data/cora -s -n -o rw,noatime,sec=krb5,intr,rsize=32768,wsize=32768 [root@amakihi ~]# strace -fp 5863 Process 5863 attached - interrupt to quit ^C^C^C^C^C^Z [1]+ Stopped strace -fp 5863 [root@amakihi ~]# kill -9 5863 [root@amakihi ~]# !ps ps -fe | grep mount root 1375 1 0 Feb10 ? 00:00:25 automount --pid-file /var/run/autofs.pid root 5862 1375 0 11:04 ? 00:00:00 /bin/mount -n -t nfs4 -s -o sec=krb5,intr,rsize=32768,wsize=32768,noatime alexandria:/cora /data/cora root 5863 5862 7 11:04 ? 00:03:33 [mount.nfs4] <defunct> root 6954 6852 0 11:54 pts/0 00:00:00 grep --color=auto mount [root@amakihi ~]# !ps ps -fe | grep mount root 1375 1 0 Feb10 ? 00:00:25 automount --pid-file /var/run/autofs.pid root 5862 1375 0 11:04 ? 00:00:00 /bin/mount -n -t nfs4 -s -o sec=krb5,intr,rsize=32768,wsize=32768,noatime alexandria:/cora /data/cora root 5863 5862 7 11:04 ? 00:03:33 [mount.nfs4] <defunct> root 6957 6852 0 11:55 pts/0 00:00:00 grep --color=auto mount [root@amakihi ~]# kill -9 5863 [root@amakihi ~]# kill -9 5863 [root@amakihi ~]# kill 5862 [root@amakihi ~]# kill 5862 -bash: kill: (5862) - No such process [root@amakihi ~]# !ps ps -fe | grep mount root 1375 1 0 Feb10 ? 00:00:25 automount --pid-file /var/run/autofs.pid root 5863 1 7 11:04 ? 00:03:33 [mount.nfs4] <defunct> root 6963 6852 0 11:55 pts/0 00:00:00 grep --color=auto mount
Different mount this time, same server though: root 8667 8666 21 10:59 ? 00:13:51 /sbin/mount.nfs alexandria:/export/cora5 /data/cora5 -s -n -o rw,noatime,intr,rsize=32768,wsize=32768
My best guess is that this coincided with the 3.2 kernels. Moving back to 3.1.9 for testing.
Hmm, seen on F15 with 2.6.42.3-2.fc15.x86_64 and our other server: root 9385 9384 8 Feb22 ? 01:06:04 /sbin/mount.nfs alexandria2:/export/cora6 /data/cora6 -s -n -o rw,noatime,intr,rsize=32768,wsize=32768 So, maybe not a kernel issue unless there are some common back-ports or if it's on the server end.
Created attachment 568339 [details] port nfs packet log I've attached a log from tshark for port nfs. At this point there were two clients with hung mount processes: 10.10.11.2 and 10.10.41.5. The server is 10.10.10.1. There is a *LOT* of traffic like: 1 0.000000 10.10.11.2 -> 10.10.10.1 NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR 2 0.000029 10.10.41.5 -> 10.10.10.1 NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR 3 0.000072 10.10.10.1 -> 10.10.11.2 NFS V4 COMP Reply (Call In 1) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR 4 0.000087 10.10.10.1 -> 10.10.41.5 NFS V4 COMP Reply (Call In 2) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR Now, I'm not 100% this is the same issue as the server was having some disk space issues in /var and other problems, but it looks similar.
Seems to be the same as another mount hang trace: 119362 423831.170067 10.10.20.12 -> 10.10.10.3 NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR 119363 423831.170125 10.10.10.3 -> 10.10.20.12 NFS V4 COMP Reply (Call In 119362) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR[Unreassembled Packet [incorrect TCP checksum]]
Is the server returning an error? (If so, which one?)
I can't find any error messages anywhere.
What I meant was: looking at the trace in wireshark (it'll be easier to see the details than with tshark) do you see the server returning errors?
Also, if you have made any modifications to /etc/nfsmount.conf, could you please share the contents with us.
Results appear to all be NFS4_OK. Call: Network File System, Ops(3): PUTROOTFH GETFH GETATTR [Program Version: 4] [V4 Procedure: COMPOUND (1)] Tag: <EMPTY> length: 0 contents: <EMPTY> minorversion: 0 Operations (count: 3) Opcode: PUTROOTFH (24) Opcode: GETFH (10) Opcode: GETATTR (9) GETATTR4args attr_request bitmap[0] = 0x0010011a [5 attributes requested] mand_attr: FATTR4_TYPE (1) mand_attr: FATTR4_CHANGE (3) mand_attr: FATTR4_SIZE (4) mand_attr: FATTR4_FSID (8) recc_attr: FATTR4_FILEID (20) bitmap[1] = 0x0030a23a [9 attributes requested] recc_attr: FATTR4_MODE (33) recc_attr: FATTR4_NUMLINKS (35) recc_attr: FATTR4_OWNER (36) recc_attr: FATTR4_OWNER_GROUP (37) recc_attr: FATTR4_RAWDEV (41) recc_attr: FATTR4_SPACE_USED (45) recc_attr: FATTR4_TIME_ACCESS (47) recc_attr: FATTR4_TIME_METADATA (52) recc_attr: FATTR4_TIME_MODIFY (53) [Main Opcode: PUTROOTFH (24)] [Main Opcode: GETATTR (9)] Reply: Network File System, Ops(3): PUTROOTFH GETFH GETATTR [Program Version: 4] [V4 Procedure: COMPOUND (1)] Status: NFS4_OK (0) Tag: <EMPTY> length: 0 contents: <EMPTY> Operations (count: 3) Opcode: PUTROOTFH (24) Status: NFS4_OK (0) Opcode: GETFH (10) Status: NFS4_OK (0) Filehandle length: 8 [hash (CRC-32): 0x62d40c52] decode type as: unknown filehandle: 0100010000000000 Opcode: GETATTR (9) Status: NFS4_OK (0) GETATTR4res resok4 obj_attributes attrmask bitmap[0] = 0x0010011a [5 attributes requested] mand_attr: FATTR4_TYPE (1) mand_attr: FATTR4_CHANGE (3) mand_attr: FATTR4_SIZE (4) mand_attr: FATTR4_FSID (8) recc_attr: FATTR4_FILEID (20) bitmap[1] = 0x0030a23a [9 attributes requested] recc_attr: FATTR4_MODE (33) recc_attr: FATTR4_NUMLINKS (35) recc_attr: FATTR4_OWNER (36) recc_attr: FATTR4_OWNER_GROUP (37) recc_attr: FATTR4_RAWDEV (41) recc_attr: FATTR4_SPACE_USED (45) recc_attr: FATTR4_TIME_ACCESS (47) recc_attr: FATTR4_TIME_METADATA (52) recc_attr: FATTR4_TIME_MODIFY (53) attr_vals mand_attr: FATTR4_TYPE (1) nfs_ftype4: NF4DIR (2) mand_attr: FATTR4_CHANGE (3) changeid: 5704186859416977408 mand_attr: FATTR4_SIZE (4) size: 4096 mand_attr: FATTR4_FSID (8) fattr4_fsid fsid4.major: 0 fsid4.minor: 0 recc_attr: FATTR4_FILEID (20) fileid: 2 recc_attr: FATTR4_MODE (33) fattr4_mode: 0755 000. .... .... .... = Unknown .... 0... .... .... = not SUID .... .0.. .... .... = not SGID .... ..0. .... .... = not save swapped text .... ...1 .... .... = Read permission for owner .... .... 1... .... = Write permission for owner .... .... .1.. .... = Execute permission for owner .... .... ..1. .... = Read permission for group .... .... ...0 .... = no Write permission for group .... .... .... 1... = Execute permission for group .... .... .... .1.. = Read permission for others .... .... .... ..0. = no Write permission for others .... .... .... ...1 = Execute permission for others recc_attr: FATTR4_NUMLINKS (35) numlinks: 28 recc_attr: FATTR4_OWNER (36) fattr4_owner: root.com length: 18 contents: root.com fill bytes: opaque data recc_attr: FATTR4_OWNER_GROUP (37) fattr4_owner_group: root.com length: 18 contents: root.com fill bytes: opaque data recc_attr: FATTR4_RAWDEV (41) specdata1: 0 specdata2: 0 recc_attr: FATTR4_SPACE_USED (45) space_used: 8192 recc_attr: FATTR4_TIME_ACCESS (47) seconds: 1331134896 nseconds: 0 recc_attr: FATTR4_TIME_METADATA (52) seconds: 1328109498 nseconds: 0 recc_attr: FATTR4_TIME_MODIFY (53) seconds: 1328109498 nseconds: 0 [Main Opcode: PUTROOTFH (24)] [Main Opcode: GETATTR (9)] But the same request and reply repeated over and over
(In reply to comment #10) > Also, if you have made any modifications to /etc/nfsmount.conf, could you > please share the contents with us. No modifications. It contains: [ NFSMount_Global_Options ] but everything else commented out.
Can you run rpcdebug -m nfs -s vfs -s mount -s client -s VFS and possibly also rpcdebug -m rpc -s call then attempt a mount and report the results from running dmesg -s 90000
Are these to be run on the client or server? The mount commands fail fairly infrequently, would you want output from a normal mount or just a failed mount? If just a failed I guess I can start the debug logging and send output from around a failure. Thanks.
On the client only. I know what a normal mount should give, so I'm only interested in the ones that fail.
Alternatively, if you see that a mount has failed, and is looping, then turn on rpcdebug might also give clues. If you can kill off any other NFS-related activity on that client, before turning on the debugging, then that would be even better...
(In reply to comment #0) > Description of problem: > > We automount a number of nfs directories and I'm starting to see the mount.nfs > (or mount.nfs4) command hang sometimes. > > # ps -fe | grep mount > root 823 1 0 Feb13 ? 00:00:33 automount --pid-file > /var/run/autofs.pid > root 21094 21020 0 10:02 pts/11 00:00:00 grep --color=auto mount > root 25887 823 0 Feb15 ? 00:00:00 /bin/mount -n -t nfs -s -o > intr,rsize=32768,wsize=32768,noatime alexandria:/export/cora /data/cora > root 25888 25887 14 Feb15 ? 01:54:00 /sbin/mount.nfs > alexandria:/export/cora /data/cora -s -n -o > rw,noatime,intr,rsize=32768,wsize=32768 > > > I don't seem to be able to find anything relevant in the logs. Not sure where > to look next. Suggestions welcome. > > Server is running CentOS 5. Just curious, making those mounts v3 mount either with a /etc/nfsmount.conf entry or a v3 on the command line eliminate the hangs? I'm just thinking your server is a bit crusty and not even the real deal... Meaning there is now way of telling what bug fixes are or are not there.
This appears to have gone away.