Bug 228727 - NFS mount failing from clustered resource
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: nfs-utils
Version: 5.0
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Steve Dickson
QA Contact: Ben Levenson
Reported: 2007-02-14 12:19 EST by Andy Elmer
Modified: 2009-12-14 19:28 EST (History)
Doc Type: Bug Fix
Last Closed: 2009-12-14 15:49:55 EST

Attachments
bz228728.pcap.bz2 --> "mount -v asdfssghome01:/export/Home01/ir/elmerar /mnt" (1.46 KB, application/x-bzip)
2007-03-05 16:46 EST, Andy Elmer
bz228729.pcap.bz2 --> "mount -v -o noacl asdfssghome01:/export/Home01/ir/elmerar /mnt" (1.48 KB, application/x-bzip)
2007-03-05 16:48 EST, Andy Elmer
bz228730.pcap.bz2 --> "mount -v -o tcp asdfssghome01:/export/Home01/ir/elmerar /mnt" (2.31 KB, application/x-bzip)
2007-03-05 16:49 EST, Andy Elmer
Description Andy Elmer 2007-02-14 12:19:08 EST
Description of problem:
Unable to mount NFS filesystems from a VCS clustered host.  "mount" just
hangs/retries without success.  Mounting NFS filesystems from a non-clustered
system or using the *real* server name (not the cluster alias) works fine. 

How reproducible:
See below

Steps to Reproduce:
1.  See below
Actual results:
Ex.
server1 = file server (192.168.0.10)
homedir1 = hostname for clustered resource *currently* residing on server1
(homedir1 is basically an alias for server1) (192.168.0.11)

*Note: although server1 and homedir1 have different names and IP addresses, they
refer to the same system.

mount -v server1:/nfs/point /mount/point
mount: no type was given - I'll assume nfs because of the colon
mount: trying 10.66.61.101 prog 100003 vers 3 prot tcp port 2049
mount: trying 10.66.61.101 prog 100005 vers 3 prot udp port 33966
(successful)

mount -v homedir1:/nfs/point /mount/point  
mount: no type was given - I'll assume nfs because of the colon
mount: trying 10.66.61.93 prog 100003 vers 3 prot tcp port 2049
mount: mount to NFS server 'asdfssgHome01' failed: timed out (retrying).
mount: trying 10.66.61.93 prog 100003 vers 3 prot tcp port 2049
mount: mount to NFS server 'asdfssgHome01' failed: timed out (retrying).
(unsuccessful)

Both commands above are mounting the same NFS point from the same server.


Expected results:
Filesystem should mount without problems.  Our RHEL3 & RHEL4 workstations have
no problems with this.

Additional info:
- NFS file server is a Solaris 8 system running Veritas Cluster Server.
- If I explicitly pass TCP to the mount command (-o tcp), the mount works every
time, which doesn't make sense, since mounting over TCP should be the default
behavior.

TCPDUMP output:

From unsuccessful attempt:
<snip>
client1 -> homedir1       PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
homedir1 -> client1       TCP D=34844 S=2049 Fin Ack=1160248412 Seq=935776432
Len=0 Win=33304 Options=<nop,nop,tstamp 17450967 60794813>
client1 -> homedir1       TCP D=2049 S=34844     Ack=935776433 Seq=1160248412
Len=0 Win=46 Options=<nop,nop,tstamp 60794814 17450967>
server1 -> client1       PORTMAP R GETPORT port=33966
client1 -> server1      ICMP Destination unreachable (UDP port 32771 unreachable)


From successful attempt:
<snip>
client1 -> server1      PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
server1 -> client1       TCP D=56024 S=2049     Ack=1235741369 Seq=931535860
Len=0 Win=33304 Options=<nop,nop,tstamp 90305704 60868844>
server1 -> client1       TCP D=56024 S=2049 Fin Ack=1235741369 Seq=931535860
Len=0 Win=33304 Options=<nop,nop,tstamp 90305704 60868844>
server1 -> client1       PORTMAP R GETPORT port=32807
client1 -> server1      TCP D=2049 S=56024     Ack=931535861 Seq=1235741369
Len=0 Win=46 Options=<nop,nop,tstamp 60868844 90305704>
client1 -> server1      MOUNT3 C Null
server1 -> client1       MOUNT3 R Null 
client1 -> server1      MOUNT3 C Mount /export/JS/cfengine/local
</snip>
Comment 1 Steve Dickson 2007-02-16 06:22:20 EST
Make sure the server is listening for UDP mounts by running
the 'rpcinfo -p <server>' command. You should see at least
one of the following lines in rpcinfo's output:

    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
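
A minimal sketch of automating that check, assuming the text of 'rpcinfo -p' has already been captured into a string. The helper name udp_registrations and the SAMPLE text are illustrative only and are not part of nfs-utils:

```python
def udp_registrations(rpcinfo_output, program):
    """Return (version, port) pairs that `program` registers over UDP."""
    found = []
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows look like: program vers proto port [service].
        # Skip the header row and anything else that is not a data row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        prog, vers, proto, port = fields[:4]
        if int(prog) == program and proto == "udp":
            found.append((int(vers), int(port)))
    return found


# Hypothetical sample in the same column layout as the lines quoted above.
SAMPLE = """\
   program vers proto   port
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    3   tcp   2049  nfs
"""

NFS_PROGRAM = 100003  # RPC program number for nfs, as in the rpcinfo lines
print(udp_registrations(SAMPLE, NFS_PROGRAM))
```

An empty list for program 100003 would mean the server advertises no NFS service over UDP.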
Comment 2 Andy Elmer 2007-02-19 10:42:48 EST
The server is listening:
/usr/sbin/rpcinfo -p server1 | grep nfs
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100227    2   udp   2049  nfs_acl
    100227    3   udp   2049  nfs_acl
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100227    2   tcp   2049  nfs_acl
    100227    3   tcp   2049  nfs_acl

To me, it seems to be a client side problem.

Our current RHEL3, RHEL4, and the few FC6 machines are able to mount these NFS
locations just fine.  I'll try installing RHEL5-beta2 on a different piece of
hardware to see if I get the same or different result.
Comment 3 Steve Dickson 2007-02-22 12:03:59 EST
Ok... try mounting with the '-o noacl' mount option
Comment 4 Andy Elmer 2007-02-23 16:02:39 EST
Mounting with "-o noacl" still fails

# mount -v -o noacl server1:/mount/point /mnt/
mount: no type was given - I'll assume nfs because of the colon
mount: trying 10.66.61.75 prog 100003 vers 3 prot tcp port 2049
mount: mount to NFS server 'asdfssgcorp01.mpls.udlp.com' failed: timed out
(retrying).

Mounting with no options fails
# mount -v server1:/mount/point /mnt/
mount: no type was given - I'll assume nfs because of the colon
mount: trying 10.66.61.75 prog 100003 vers 3 prot tcp port 2049
mount: mount to NFS server 'asdfssgcorp01.mpls.udlp.com' failed: timed out
(retrying).

Mounting with -o tcp succeeds
# mount -v -o tcp server1:/mount/point /mnt/
mount: no type was given - I'll assume nfs because of the colon
mount: trying 10.66.61.75 prog 100003 vers 3 prot tcp port 2049
mount: trying 10.66.61.75 prog 100005 vers 3 prot tcp port 34206
# 
(success)

Now, I know I could just mount everything with "-o tcp" to get past this
problem. However, we use the automounter to mount these locations, and as a
result the automount just hangs.

As another note, I spoke with a friend who had the same problem with a
particular version of Ubuntu.  Perhaps this is isolated to a specific version of
"mount".  In any case, RHEL3, RHEL4, and FC6 all mount the above location just
fine without additional options.
Comment 5 Steve Dickson 2007-02-27 17:15:15 EST
Could you please post a bzip2 binary tethereal trace of both 
failures... something similar to 
    tethereal -w /tmp/bz228727.pcap host <server> ; bzip2 /tmp/bz228727.pcap

tia..

Comment 6 Andy Elmer 2007-03-05 16:46:57 EST
Created attachment 149295 [details]
bz228728.pcap.bz2  -->  "mount -v asdfssghome01:/export/Home01/ir/elmerar /mnt"

I attached 3 tethereal traces. These are the commands I ran with each trace:

bz228728.pcap.bz2  -->  "mount -v asdfssghome01:/export/Home01/ir/elmerar /mnt"

bz228729.pcap.bz2  -->  "mount -v -o noacl asdfssghome01:/export/Home01/ir/elmerar /mnt"

bz228730.pcap.bz2  -->  "mount -v -o tcp asdfssghome01:/export/Home01/ir/elmerar /mnt"
Comment 7 Andy Elmer 2007-03-05 16:48:42 EST
Created attachment 149296 [details]
bz228729.pcap.bz2  -->  "mount -v -o noacl asdfssghome01:/export/Home01/ir/elmerar /mnt"
Comment 8 Andy Elmer 2007-03-05 16:49:32 EST
Created attachment 149297 [details]
bz228730.pcap.bz2  -->  "mount -v -o tcp asdfssghome01:/export/Home01/ir/elmerar /mnt"
Comment 9 Matthew Thyer 2009-04-14 22:41:47 EDT
This is due to the way VCS on Solaris runs its NFS server as a single server per node instead of one per "service group".
I have confirmed that this is still a problem on Solaris 10 10/08 (u6), but it is expected to be fixed in the next release of Solaris 10 (u7).  It's Sun bug ID 2159403.

The scenario is this:

The VCS cluster node has several IP addresses on the same network.  The first is its actual IP address, and the rest are the addresses of each "service group", which are implemented as aliases on the network interface (Sun would call these "floating logical addresses").

The problem is that rpcbind on the Solaris VCS server replies to the Fedora client using the IP address of the node and not the IP address of the "service group" that the client sent its NFS mount request to.

The Fedora 9 and above NFS client suspects a man-in-the-middle type of attack and ignores the rpcbind server responses.

Fedora 8 as a client did not mind, but Fedora 9 and above do.

On some Linux hosts (possibly Mandriva 2007), a workaround is to use TCP mounts instead of UDP.
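
To illustrate the scenario above, here is a simplified model of a source-address check. It is a sketch of the general idea only, not the actual client code: the Datagram class, the accept_reply helper, and the client address 192.168.0.50 are made up for the example, while the node and service group addresses are the ones from the bug description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datagram:
    src: str
    dst: str
    payload: str


def accept_reply(request, reply):
    """Accept a reply only if it comes from the address the request went to."""
    return reply.src == request.dst and reply.dst == request.src


CLIENT = "192.168.0.50"         # hypothetical client address
NODE = "192.168.0.10"           # server1, the node's own address
SERVICE_GROUP = "192.168.0.11"  # homedir1, the service group alias

# The client asks rpcbind on the service group address for the mountd port.
request = Datagram(CLIENT, SERVICE_GROUP, "GETPORT prog=100005 proto=UDP")

# The reply is sourced from the node address, as described above.
reply_from_node = Datagram(NODE, CLIENT, "GETPORT port=33966")

# What the client would need: a reply sourced from the alias it contacted.
reply_from_alias = Datagram(SERVICE_GROUP, CLIENT, "GETPORT port=33966")

print(accept_reply(request, reply_from_node))   # False: reply is ignored
print(accept_reply(request, reply_from_alias))  # True: reply is accepted
```

With TCP, the reply comes back on the connection the client opened to the service group address, so no such mismatch arises; that is consistent with "-o tcp" working as a workaround.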
Comment 13 Steve Dickson 2009-12-14 15:49:55 EST
This is a Sun bug: Sun bug ID 2159403.
Comment 14 Matthew Thyer 2009-12-14 19:28:22 EST
Sun produced a 'T' patch for me (for SPARC) that fixed this problem.
It has been released as a full patch and is: 140917.
