Bug 791311

Summary: nfs mount command hanging sometimes
Product: [Fedora] Fedora Reporter: Orion Poplawski <orion>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED WORKSFORME QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 16CC: bfields, jlayton, steved, trond.myklebust
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-13 00:14:08 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Attachments:
Description Flags
port nfs packet log none

Description Orion Poplawski 2012-02-16 12:20:17 EST
Description of problem:

We automount a number of nfs directories and I'm starting to see the mount.nfs (or mount.nfs4) command hang sometimes.

# ps -fe | grep mount
root       823     1  0 Feb13 ?        00:00:33 automount --pid-file /var/run/autofs.pid
root     21094 21020  0 10:02 pts/11   00:00:00 grep --color=auto mount
root     25887   823  0 Feb15 ?        00:00:00 /bin/mount -n -t nfs -s -o intr,rsize=32768,wsize=32768,noatime alexandria:/export/cora /data/cora
root     25888 25887 14 Feb15 ?        01:54:00 /sbin/mount.nfs alexandria:/export/cora /data/cora -s -n -o rw,noatime,intr,rsize=32768,wsize=32768

# gdb -p 25888
GNU gdb (GDB) Fedora (7.3.50.20110722-10.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 25888
<hangs>

I don't seem to be able to find anything relevant in the logs.  Not sure where to look next.  Suggestions welcome.

Server is running CentOS 5.

I was able to kill the /bin/mount process but not the /sbin/mount.nfs process.  Eventually the /sbin/mount.nfs process was cleaned up.

Version-Release number of selected component (if applicable):
nfs-utils-1.2.5-4.fc16.x86_64

How reproducible:
Infrequently
Comment 1 Orion Poplawski 2012-02-16 13:58:27 EST
Here's another, kerberized nfsv4 this time.  Interestingly the same server and mount.

[root@amakihi ~]# ps -fe | grep mount
root      1375     1  0 Feb10 ?        00:00:25 automount --pid-file /var/run/autofs.pid
root      5862  1375  0 11:04 ?        00:00:00 /bin/mount -n -t nfs4 -s -o sec=krb5,intr,rsize=32768,wsize=32768,noatime alexandria:/cora /data/cora
root      5863  5862  7 11:04 ?        00:03:32 /sbin/mount.nfs4 alexandria:/cora /data/cora -s -n -o rw,noatime,sec=krb5,intr,rsize=32768,wsize=32768
[root@amakihi ~]# strace -fp 5863
Process 5863 attached - interrupt to quit
^C^C^C^C^C^Z
[1]+  Stopped                 strace -fp 5863

[root@amakihi ~]# kill -9 5863
[root@amakihi ~]# !ps
ps -fe | grep mount
root      1375     1  0 Feb10 ?        00:00:25 automount --pid-file /var/run/autofs.pid
root      5862  1375  0 11:04 ?        00:00:00 /bin/mount -n -t nfs4 -s -o sec=krb5,intr,rsize=32768,wsize=32768,noatime alexandria:/cora /data/cora
root      5863  5862  7 11:04 ?        00:03:33 [mount.nfs4] <defunct>
root      6954  6852  0 11:54 pts/0    00:00:00 grep --color=auto mount
[root@amakihi ~]# !ps
ps -fe | grep mount
root      1375     1  0 Feb10 ?        00:00:25 automount --pid-file /var/run/autofs.pid
root      5862  1375  0 11:04 ?        00:00:00 /bin/mount -n -t nfs4 -s -o sec=krb5,intr,rsize=32768,wsize=32768,noatime alexandria:/cora /data/cora
root      5863  5862  7 11:04 ?        00:03:33 [mount.nfs4] <defunct>
root      6957  6852  0 11:55 pts/0    00:00:00 grep --color=auto mount
[root@amakihi ~]# kill -9 5863
[root@amakihi ~]# kill -9 5863
[root@amakihi ~]# kill 5862
[root@amakihi ~]# kill 5862
-bash: kill: (5862) - No such process
[root@amakihi ~]# !ps
ps -fe | grep mount
root      1375     1  0 Feb10 ?        00:00:25 automount --pid-file /var/run/autofs.pid
root      5863     1  7 11:04 ?        00:03:33 [mount.nfs4] <defunct>
root      6963  6852  0 11:55 pts/0    00:00:00 grep --color=auto mount
Comment 2 Orion Poplawski 2012-02-20 14:16:43 EST
Different mount this time, same server though:

root      8667  8666 21 10:59 ?        00:13:51 /sbin/mount.nfs alexandria:/export/cora5 /data/cora5 -s -n -o rw,noatime,intr,rsize=32768,wsize=32768
Comment 3 Orion Poplawski 2012-02-22 11:04:33 EST
My best guess is that this coincided with the 3.2 kernels.  Moving back to 3.1.9 for testing.
Comment 4 Orion Poplawski 2012-02-23 10:17:15 EST
Hmm, seen on F15 with 2.6.42.3-2.fc15.x86_64 and our other server:

root      9385  9384  8 Feb22 ?        01:06:04 /sbin/mount.nfs alexandria2:/export/cora6 /data/cora6 -s -n -o rw,noatime,intr,rsize=32768,wsize=32768

So, maybe not a kernel issue unless there are some common back-ports or if it's on the server end.
Comment 5 Orion Poplawski 2012-03-07 11:09:23 EST
Created attachment 568339 [details]
port nfs packet log

I've attached a log from tshark for port nfs.  At this point there were two clients with hung mount processes: 10.10.11.2 and 10.10.41.5.  The server is 10.10.10.1.  There is a *LOT* of traffic like:

  1   0.000000   10.10.11.2 -> 10.10.10.1   NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR
  2   0.000029   10.10.41.5 -> 10.10.10.1   NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR
  3   0.000072   10.10.10.1 -> 10.10.11.2   NFS V4 COMP Reply (Call In 1) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR
  4   0.000087   10.10.10.1 -> 10.10.41.5   NFS V4 COMP Reply (Call In 2) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR

Now, I'm not 100% this is the same issue as the server was having some disk space issues in /var and other problems, but it looks similar.
Comment 6 Orion Poplawski 2012-03-07 17:08:39 EST
Seems to be the same as another mount hang trace:

119362 423831.170067  10.10.20.12 -> 10.10.10.3   NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR
119363 423831.170125   10.10.10.3 -> 10.10.20.12  NFS V4 COMP Reply (Call In 119362) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR[Unreassembled Packet [incorrect TCP checksum]]
Comment 7 J. Bruce Fields 2012-03-07 17:21:08 EST
Is the server returning an error?  (If so, which one?)
Comment 8 Orion Poplawski 2012-03-07 17:23:21 EST
I can't find any error messages anywhere.
Comment 9 J. Bruce Fields 2012-03-07 17:39:02 EST
What I meant was: looking at the trace in wireshark (it'll be easier to see the details than with tshark) do you see the server returning errors?
Comment 10 Trond Myklebust 2012-03-07 17:56:14 EST
Also, if you have made any modifications to /etc/nfsmount.conf, could you
please share the contents with us.
Comment 11 Orion Poplawski 2012-03-07 17:57:23 EST
Results appear to all be NFS4_OK.

Call:
Network File System, Ops(3): PUTROOTFH GETFH GETATTR
    [Program Version: 4]
    [V4 Procedure: COMPOUND (1)]
    Tag: <EMPTY>
        length: 0
        contents: <EMPTY>
    minorversion: 0
    Operations (count: 3)
        Opcode: PUTROOTFH (24)
        Opcode: GETFH (10)
        Opcode: GETATTR (9)
            GETATTR4args
                attr_request
                    bitmap[0] = 0x0010011a
                        [5 attributes requested]
                        mand_attr: FATTR4_TYPE (1)
                        mand_attr: FATTR4_CHANGE (3)
                        mand_attr: FATTR4_SIZE (4)
                        mand_attr: FATTR4_FSID (8)
                        recc_attr: FATTR4_FILEID (20)
                    bitmap[1] = 0x0030a23a
                        [9 attributes requested]
                        recc_attr: FATTR4_MODE (33)
                        recc_attr: FATTR4_NUMLINKS (35)
                        recc_attr: FATTR4_OWNER (36)
                        recc_attr: FATTR4_OWNER_GROUP (37)
                        recc_attr: FATTR4_RAWDEV (41)
                        recc_attr: FATTR4_SPACE_USED (45)
                        recc_attr: FATTR4_TIME_ACCESS (47)
                        recc_attr: FATTR4_TIME_METADATA (52)
                        recc_attr: FATTR4_TIME_MODIFY (53)
        [Main Opcode: PUTROOTFH (24)]
        [Main Opcode: GETATTR (9)]


Reply:
Network File System, Ops(3): PUTROOTFH GETFH GETATTR
    [Program Version: 4]
    [V4 Procedure: COMPOUND (1)]
    Status: NFS4_OK (0)
    Tag: <EMPTY>
        length: 0
        contents: <EMPTY>
    Operations (count: 3)
        Opcode: PUTROOTFH (24)
            Status: NFS4_OK (0)
        Opcode: GETFH (10)
            Status: NFS4_OK (0)
            Filehandle
                length: 8
                [hash (CRC-32): 0x62d40c52]
                decode type as: unknown
                filehandle: 0100010000000000
        Opcode: GETATTR (9)
            Status: NFS4_OK (0)
            GETATTR4res
                resok4
                    obj_attributes
                        attrmask
                            bitmap[0] = 0x0010011a
                                [5 attributes requested]
                                mand_attr: FATTR4_TYPE (1)
                                mand_attr: FATTR4_CHANGE (3)
                                mand_attr: FATTR4_SIZE (4)
                                mand_attr: FATTR4_FSID (8)
                                recc_attr: FATTR4_FILEID (20)
                            bitmap[1] = 0x0030a23a
                                [9 attributes requested]
                                recc_attr: FATTR4_MODE (33)
                                recc_attr: FATTR4_NUMLINKS (35)
                                recc_attr: FATTR4_OWNER (36)
                                recc_attr: FATTR4_OWNER_GROUP (37)
                                recc_attr: FATTR4_RAWDEV (41)
                                recc_attr: FATTR4_SPACE_USED (45)
                                recc_attr: FATTR4_TIME_ACCESS (47)
                                recc_attr: FATTR4_TIME_METADATA (52)
                                recc_attr: FATTR4_TIME_MODIFY (53)
                        attr_vals
                            mand_attr: FATTR4_TYPE (1)
                                nfs_ftype4: NF4DIR (2)
                            mand_attr: FATTR4_CHANGE (3)
                                changeid: 5704186859416977408
                            mand_attr: FATTR4_SIZE (4)
                                size: 4096
                            mand_attr: FATTR4_FSID (8)
                                fattr4_fsid
                                    fsid4.major: 0
                                    fsid4.minor: 0
                            recc_attr: FATTR4_FILEID (20)
                                fileid: 2
                            recc_attr: FATTR4_MODE (33)
                                fattr4_mode: 0755
                                    000. .... .... .... = Unknown
                                    .... 0... .... .... = not SUID
                                    .... .0.. .... .... = not SGID
                                    .... ..0. .... .... = not save swapped text
                                    .... ...1 .... .... = Read permission for owner
                                    .... .... 1... .... = Write permission for owner
                                    .... .... .1.. .... = Execute permission for owner
                                    .... .... ..1. .... = Read permission for group
                                    .... .... ...0 .... = no Write permission for group
                                    .... .... .... 1... = Execute permission for group
                                    .... .... .... .1.. = Read permission for others
                                    .... .... .... ..0. = no Write permission for others
                                    .... .... .... ...1 = Execute permission for others
                            recc_attr: FATTR4_NUMLINKS (35)
                                numlinks: 28
                            recc_attr: FATTR4_OWNER (36)
                                fattr4_owner: root@cora.nwra.com
                                    length: 18
                                    contents: root@cora.nwra.com
                                    fill bytes: opaque data
                            recc_attr: FATTR4_OWNER_GROUP (37)
                                fattr4_owner_group: root@cora.nwra.com
                                    length: 18
                                    contents: root@cora.nwra.com
                                    fill bytes: opaque data
                            recc_attr: FATTR4_RAWDEV (41)
                                specdata1: 0
                                specdata2: 0
                            recc_attr: FATTR4_SPACE_USED (45)
                                space_used: 8192
                            recc_attr: FATTR4_TIME_ACCESS (47)
                                seconds: 1331134896
                                nseconds: 0
                            recc_attr: FATTR4_TIME_METADATA (52)
                                seconds: 1328109498
                                nseconds: 0
                            recc_attr: FATTR4_TIME_MODIFY (53)
                                seconds: 1328109498
                                nseconds: 0
        [Main Opcode: PUTROOTFH (24)]
        [Main Opcode: GETATTR (9)]


But the same request and reply repeated over and over
Comment 12 Orion Poplawski 2012-03-07 17:59:38 EST
(In reply to comment #10)
> Also, if you have made any modifications to /etc/nfsmount.conf, could you
> please share the contents with us.

No modifications.  It contains:

[ NFSMount_Global_Options ]

but everything else commented out.
Comment 13 Trond Myklebust 2012-03-07 18:24:03 EST
Can you run

rpcdebug -m nfs -s vfs -s mount -s client -s VFS

and possibly also

rpcdebug -m rpc -s call

then attempt a mount and report the results from running

dmesg -s 90000
Comment 14 Orion Poplawski 2012-03-07 18:32:26 EST
Are these to be run on the client or server?  The mount commands fail fairly infrequently, would you want output from a normal mount or just a failed mount?  If just a failed I guess I can start the debug logging and send output from around a failure.  Thanks.
Comment 15 Trond Myklebust 2012-03-07 18:39:40 EST
On the client only. I know what a normal mount should give, so I'm only
interested in the ones that fail.
Comment 16 Trond Myklebust 2012-03-07 18:48:18 EST
Alternatively, if you see that a mount has failed, and is looping, then
turn on rpcdebug might also give clues. If you can kill off any other
NFS-related activity on that client, before turning on the debugging,
then that would be even better...
Comment 17 Steve Dickson 2012-03-23 10:31:11 EDT
(In reply to comment #0)
> Description of problem:
> 
> We automount a number of nfs directories and I'm starting to see the mount.nfs
> (or mount.nfs4) command hang sometimes.
> 
> # ps -fe | grep mount
> root       823     1  0 Feb13 ?        00:00:33 automount --pid-file
> /var/run/autofs.pid
> root     21094 21020  0 10:02 pts/11   00:00:00 grep --color=auto mount
> root     25887   823  0 Feb15 ?        00:00:00 /bin/mount -n -t nfs -s -o
> intr,rsize=32768,wsize=32768,noatime alexandria:/export/cora /data/cora
> root     25888 25887 14 Feb15 ?        01:54:00 /sbin/mount.nfs
> alexandria:/export/cora /data/cora -s -n -o
> rw,noatime,intr,rsize=32768,wsize=32768
> 
> 
> I don't seem to be able to find anything relevant in the logs.  Not sure where
> to look next.  Suggestions welcome.
> 
> Server is running CentOS 5.
Just curious, making those mounts v3 mount either with a /etc/nfsmount.conf
entry or a v3 on the command line eliminate the hangs? 

I'm just thinking your server is a bit crusty and not even the real deal...
Meaning there is now way of telling what bug fixes are or are not there.
Comment 18 Orion Poplawski 2012-06-13 00:14:08 EDT
This appears to have gone away.