Bug 212471
Summary: | Fedora 6 is having trouble mounting NFS shares on Alpha servers | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Kevin Neuhaus <kevin.neuhaus>
Component: | nfs-utils | Assignee: | Steve Dickson <steved>
Status: | CLOSED WONTFIX | QA Contact: | Ben Levenson <benl>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 6 | CC: | benjamin.buetikofer, davem, deknuydt, jlayton, jmbastia, jonathan.w.miner, kevin.russell, kucharsk, ra, redhat.com, tommi, triage, xdl-redhat-bugzilla, xian
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i386 | |
OS: | Linux | |
Whiteboard: | bzcl34nup | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2008-05-06 16:34:08 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: |
|
Description
Kevin Neuhaus
2006-10-26 21:23:57 UTC
I'm having a very similar problem trying to mount filesystems from a SUN Cluster 2.1 system (Solaris 8). FC5 worked fine. Mounts from other non-clustered Solaris 8 servers work OK. Mounts from RHEL 4 servers are OK too.

My server is a Tru Cluster Alpha server.

Would it be possible to post a bzip2 binary tethereal network trace? Something similar to:
tethereal -w /tmp/data.pcap host <server> ; bzip2 /tmp/data.pcap

Created attachment 139970 [details]
Failing packet capture
This is the resulting traffic from the command "mount superfly:/superfly/vol01
/mnt", where "superfly" is the NFS service name of the Sun Cluster 2.1. The
individual nodes are running Solaris 8.
Created attachment 139971 [details]
Successful packet capture
For comparison, this is the packet capture from "mount fly_a:/superfly/vol01
/mnt", where "fly_a" is the real name of the currently active node in the Sun
Cluster. This mount command works.
Looking at both traces, it appears the reply to the portmap GETPORT query (packet 6) is missing in the hung mounts (i.e. in the data.pcap trace). Note that packets 20 and 23 in the data_ok.pcap trace show what should happen... So for some reason, when the client (129.86.231.37) sends a UDP request to the 129.86.21.203 server, that server *appears* to drop it.

Does the server have multiple network interfaces? That could possibly mean the UDP response is going out another interface (which is why it's not seen in the trace). Also, what happens when the protocol is explicitly set on the mount command line (i.e. mount -o tcp or mount -o udp)? Do the mounts still hang?

Created attachment 139980 [details]
Packet capture
This time, I captured packets from both "fly_a" and "superfly".
Thanks for the analysis. I performed another packet capture, and there is data going out from both fly_a and superfly. Two separate IPs on the same physical interface. If I specifically use "-o tcp" then the mount works; using "-o udp" fails. Is this a bug introduced into FC6 or a new feature? I would like to be able to resolve it without having to modify my existing production environment.

TCP mounting works for me also:
mount -otcp -tnfs nbnac64:/export/disk1 /mnt
(works fine)

UDP does not work:
mount -oudp -tnfs nbnac64:/export/disk1 /mnt
mount: mount to NFS server 'nbnac64' failed: timed out (retrying).

Created attachment 140055 [details]
two trace files one failure one success
failure:
mount -tnfs nbnac64:/export/disk1 /mnt
success:
mount -tnfs -otcp nbnac64:/export/disk1 /mnt
Looking at the data_both.pcap trace from Comment #7 and the nbnac64_fail trace from Comment #10, it appears that when the client calls rpc.mountd, it gets an ICMP Port Unreachable (see packets 27, 52, 77, etc.). So I'm wondering if the rpc.mountd on the server is even listening for UDP connections. To find out, use:
rpcinfo -p <server> | grep mountd
to make sure there is something similar to
100005 1 udp 922 mountd
If there is not, then the problem is solved, but if there is a UDP entry for mountd, then ping it to see if it is accepting connections. To do this, run:
rpcinfo -u <server> 100005

[root@nbn1309 /]# rpcinfo -p nbnac64 | grep mountd
100005 1 udp 621 mountd
100005 3 udp 621 mountd
100005 1 tcp 625 mountd
100005 3 tcp 625 mountd
[root@nbn1309 /]# rpcinfo -u nbnac64 100005
program 100005 version 1 ready and waiting
rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 1
program 100005 version 2 is not available
program 100005 version 3 ready and waiting

[root@ac523421 ~]# rpcinfo -p superfly | grep mountd
100005 1 udp 33837 mountd
100005 2 udp 33837 mountd
100005 3 udp 33837 mountd
100005 1 tcp 33127 mountd
100005 2 tcp 33127 mountd
100005 3 tcp 33127 mountd
[root@ac523421 ~]# rpcinfo -u superfly 100005
program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting
program 100005 version 3 ready and waiting

At this point I think we might be looking at two different issues...

Kevin, could you add 'MOUNTD_NFS_V2=no' to /etc/sysconfig/nfs and then restart nfs (via service nfs restart) and then post the 'rpcinfo -p nbnac64 | grep mountd' and the 'rpcinfo -u nbnac64 100005' output again...

Jonathan, your issue seems to be a bit more bizarre... Here is why... Looking at packets 20, 23 and 27, you'll see the client (129.86.231.37) asking the server for mountd's port using UDP. The server (129.86.21.203) returns the port number (33837), which is normal... but then the client sends an ICMP error as if the portmapper (the daemon that sent the message) has gone down... which it clearly has not... Generally this is a firewall or an SELinux problem... Just to be sure... try 'iptables -F' (which will flush any and all firewall rules) and if that does not work, try 'setenforce 0', which will turn off SELinux.

The server in my setup is a Tru64 Alpha server so /etc/sysconfig/nfs does not exist. From the Ethereal log it looks like it's already using NFS V3. Also NFS V2 is already not running on the server:
rpcinfo -p nbnac64 | grep mountd
100005 1 udp 621 mountd
100005 3 udp 621 mountd
100005 1 tcp 625 mountd
100005 3 tcp 625 mountd
I believe the problem is more with the NFS UDP packets that the Fedora 6 client is sending out. All versions of Fedora < 6 work with both UDP & TCP.
Questions:
If a client can't connect via UDP, shouldn't it fail over to TCP?
Is there a way to force a client to always use TCP?

I double checked: neither iptables nor SELinux is enabled on this box. Perhaps FC6 does not like the fact that "fly_a" is responding to the UDP request, instead of "superfly"? Is this a security enhancement? If I read the packets correctly:
20: client asks "superfly" for portmapper info
23: fly_a replies with portmapper info
27: client replies to fly_a with "I didn't ask you"
Like Kevin said, this worked prior to FC6.

> Questions: If a client can't connect via UDP shouldn't it fail over to TCP?
No...
> Is there a way to force a client to always use TCP?
mount -o tcp should make the mounts always use tcp.

>> Is there a way to force a client to always use TCP?
> mount -o tcp should make the mounts always use tcp.
What about from autofs? All of the NFS mounts I need are distributed via YP
services. I don't manually mount NFS shares.
See man 5 auto.master... Putting 'tcp' in the options field of the map entry will make all the mounts use tcp. Example:
/home /etc/auto.home tcp

I use LDAP for automount information. Adding "proto=tcp" to the mount options works with HP-UX ("tcp" alone does not).

My little analysis on the problem:
* 10.1.1.26 - nfs client, FC5 which actually mounts the filesystem but looks weird
* 10.1.69.1 - service guard cluster IP for nfs, HP-UX 11.11
* 10.1.1.12 - primary interface on the same machine as above, HP-UX 11.11

# mount 10.1.69.1:/export/sam /tmp/tmp
154.817424 10.1.1.26 -> 10.1.69.1 TCP 48367 > sunrpc [SYN] Seq=0 Len=0 MSS=1460 TSV=88588926 TSER=0 WS=0
154.817578 10.1.69.1 -> 10.1.1.26 TCP sunrpc > 48367 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=1460 WS=0 TSV=15112927 TSER=88588926
154.817613 10.1.1.26 -> 10.1.69.1 TCP 48367 > sunrpc [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=88588926 TSER=15112927
154.818826 10.1.1.26 -> 10.1.69.1 Portmap V2 DUMP Call
154.819218 10.1.69.1 -> 10.1.1.26 Portmap V2 DUMP Reply (Call In 275)
154.819249 10.1.1.26 -> 10.1.69.1 TCP 48367 > sunrpc [ACK] Seq=45 Ack=913 Win=7296 Len=0 TSV=88588926 TSER=15112927
154.819599 10.1.1.26 -> 10.1.69.1 TCP 48367 > sunrpc [FIN, ACK] Seq=45 Ack=913 Win=7296 Len=0 TSV=88588926 TSER=15112927
154.819692 10.1.69.1 -> 10.1.1.26 TCP sunrpc > 48367 [ACK] Seq=913 Ack=46 Win=32768 Len=0 TSV=15112927 TSER=88588926
154.819758 10.1.1.26 -> 10.1.69.1 MOUNT V3 MNT Call /export/sam
154.819821 10.1.69.1 -> 10.1.1.26 TCP sunrpc > 48367 [FIN, ACK] Seq=913 Ack=46 Win=0 Len=0 TSV=15112927 TSER=88588926
154.819834 10.1.1.26 -> 10.1.69.1 TCP 48367 > sunrpc [ACK] Seq=46 Ack=914 Win=7296 Len=0 TSV=88588926 TSER=15112927

Here I start getting replies out of the blue from 10.1.1.12, the primary address of my nfs server.

154.824019 10.1.1.12 -> 10.1.1.26 MOUNT V3 MNT Reply (Call In 280)
154.824518 10.1.1.26 -> 10.1.69.1 NFS V3 GETATTR Call, FH:0x7fec0000
155.095042 10.1.1.12 -> 10.1.1.26 NFS V3 GETATTR Reply (Call In 284) Directory mode:0755 uid:0 gid:3
155.095588 10.1.1.26 -> 10.1.69.1 NFS V3 FSINFO Call, FH:0x7fec0000
155.786692 10.1.1.26 -> 10.1.69.1 NFS [RPC retransmission of #286]V3 FSINFO Call, FH:0x7fec0000
156.071551 10.1.1.12 -> 10.1.1.26 NFS V3 FSINFO Reply (Call In 286)
156.071595 10.1.1.12 -> 10.1.1.26 NFS [RPC duplicate of #288]V3 FSINFO Reply (Call In 286)

The filesystem is actually mounted because this is FC5, but isn't this just a new security measure in the kernel FC6 uses, and the bug in this instance is in the HP-UX NFS server?

Oh, the output after the mount command is the output of tethereal:
# tethereal -i any '( host 10.1.1.12 or host 10.1.69.1 ) and port not ldap and not arp and port not domain and port not ldaps'

I am having similar problems mounting NFS shares to a Linux FC6 client. The server is an IRIX cluster, with multiple network interfaces, using IP aliases. I see the same ICMP error message, and I did not have this problem prior to FC6. I am willing to provide any traces or output requested. We mount UDP, but I will try some TCP mounts if I can ascertain SGI is currently supporting that well with CXFS/Failsafe.

Created attachment 141099 [details]
packet capture between NetApp filer and FC6 client
NetApp filers with multiple network interfaces are also affected by this
problem.
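The pattern reported in the traces so far (a request goes to one server address, the reply comes back from another) can be spotted mechanically. The following is only an illustrative sketch: the `Packet` type and the `replies_from_unexpected_source` helper are hypothetical names, not part of tethereal or nfs-utils, and the packet list is a hand-copied subset of the HP-UX trace quoted earlier in this thread.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str
    summary: str


def replies_from_unexpected_source(packets, client, server):
    """Return packets addressed to `client` that come from a host other than `server`."""
    return [p for p in packets if p.dst == client and p.src != server]


# Hand-copied subset of the HP-UX trace (10.1.69.1 is the cluster IP that was
# mounted, 10.1.1.12 is the primary interface of the same machine).
TRACE = [
    Packet("10.1.1.26", "10.1.69.1", "MOUNT V3 MNT Call /export/sam"),
    Packet("10.1.1.12", "10.1.1.26", "MOUNT V3 MNT Reply"),
    Packet("10.1.1.26", "10.1.69.1", "NFS V3 GETATTR Call"),
    Packet("10.1.1.12", "10.1.1.26", "NFS V3 GETATTR Reply"),
]

if __name__ == "__main__":
    for p in replies_from_unexpected_source(TRACE, "10.1.1.26", "10.1.69.1"):
        print(f"{p.src} answered a request addressed to 10.1.69.1: {p.summary}")
```

Any packet this flags is a candidate for the mismatch discussed in this bug; it does not by itself show whether the client accepted or rejected the reply.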
(In reply to comment #22)
> I am having similar problems mounting NFS shares to a linux FC6 client. The
> server is an IRIX cluster, with multiple network interfaces, using ip aliases.
> I see the same ICMP error message, and I did not have this problem prior to FC6.
> I am willing to provide any traces or output requested. We mount UDP, but I
> will try some TCP mounts if I can ascertain SGI is currently supporting that
> well with CXFS/Failsafe.
>
>
verified TCP mounts nominally functional (i.e., have not tested behavior in failover situation with TCP mounts)

I can verify this is happening when trying to mount from various Solaris servers. The main factor seems to be that the NFS server needs to have multiple interfaces on the same subnet. The workaround is the same as listed above: specify -o tcp when performing a manual mount. As with other posters, this functionality worked fine in FC5, but broke upon updating to FC6.

I confirm this problem also happens to me after an upgrade from FC5 to FC6. No changes have been made to the NFS servers, and mounting using TCP instead of UDP works. Mounting from the individual servers instead of the cluster address works fine.

I think I figured out what the problem is... The rpms in http://people.redhat.com/steved/bz212471 should hopefully solve this problem... Please let me know...

Still behaving the same for me. UDP not mounting, TCP works. Anything else I should be doing other than freshening with the new rpms?

Still behaving the same here also. UDP not mounting, TCP does work.

No change for me either. I installed the RPMs, then rebooted.

Understood... Those rpms do seem to fix bz215476 so I was hoping this issue was similar... This must be some issue with us moving the mount from util-linux into nfs-utils, since FC5 mounts worked and none of the FC6 versions have... I'll keep plugging away at this... I just wish I could reproduce this...
What if you put an extra IP on an NFS server and try mounting a share from that server on an FC6 client? Basically this only seems to happen when the NFS client is mounting an exported filesystem via an extra IP (IP alias) on the NFS server.

The error _seems_ to occur when an NFS server is set up with multiple network interfaces on the same subnet. Many NFS servers will further split traffic across interfaces, leading to the problem. Using tcpdump and attempting to mount an NFS file system from such a system, the interaction seems to proceed as follows in response to a:

mount bigserver-home1:/home/mydir /mnt

21:15:58.063392 IP (tos 0x0, ttl 64, id 47607, offset 0, flags [DF], proto: UDP (17), length: 84) fc6box.localdomain.32901 > bigserver-home1.localdomain.sunrpc: UDP, length 56
21:15:58.064276 IP (tos 0x0, ttl 254, id 48257, offset 0, flags [DF], proto: UDP (17), length: 56) phys-bigserver-1.localdomain.sunrpc > fc6box.localdomain.32901: [udp sum ok] UDP, length 28
21:15:58.064285 IP (tos 0xc0, ttl 64, id 42094, offset 0, flags [none], proto: ICMP (1), length: 84) fc6box.localdomain > phys-bigserver-1.localdomain: ICMP fc6box.localdomain udp port 32901 unreachable, length 64
    IP (tos 0x0, ttl 254, id 48257, offset 0, flags [DF], proto: UDP (17), length: 56) phys-bigserver-1.localdomain.sunrpc > fc6box.localdomain.32901: UDP, length 28
<mount hangs here>

Upon issuing the mount command, a UDP packet is sent to "bigserver-home1", but the server is configured such that the response is sent from a different interface on the same machine, "phys-bigserver-1." The problem is that after receiving that packet, the fc6 box sends further UDP traffic to "phys-bigserver-1," which is NOT set up to respond to incoming packets, rather than the mount-specified "bigserver-home1."

Looking at similar output from an FC5 box shows two main differences:

1) It appears the default behavior for NFS in FC5 is to use TCP rather than UDP for NFS if no option was otherwise specified.
2) If "-o udp" is passed to mount, UDP packets are always sent to the host _specified in the mount command_, not whatever host replied to the request. Note how all outgoing packets continue to be sent to "bigserver-home1" despite the replies from "phys-bigserver-1":

21:24:23.884238 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: UDP (17), length: 84) fc5box.localdomain.filenet-pch > bigserver-home1.localdomain.sunrpc: UDP, length 56
21:24:23.884784 IP (tos 0x0, ttl 254, id 37737, offset 0, flags [DF], proto: UDP (17), length: 56) phys-bigserver-1.localdomain.sunrpc > fc5box.localdomain.filenet-pch: [udp sum ok] UDP, length 28
21:24:23.884811 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: UDP (17), length: 68) fc5box.localdomain.1325290645 > bigserver-home1.localdomain.nfs: 40 null
21:24:23.885024 IP (tos 0x0, ttl 254, id 37738, offset 0, flags [DF], proto: UDP (17), length: 52) phys-bigserver-1.localdomain.nfs > fc5box.localdomain.1325290645: reply ok 24

Yes... I do see the problem... I added a second NIC to our Solaris 10 server and yes indeed, UDP packets are being sent to one IP address and returning from a different IP address, which is causing the ICMP error... Sorry for not picking up on this sooner... I have a sneaking hunch this may be more of a network stack issue than a mounting or RPC issue... but only time will tell...

Steve, please test if the FC5 mount binary (and FC5 NFS/RPC specific shared libraries needed, if any) works with the FC6 kernel. I'm asking for this specific test because I have a hunch that the FC6 mount is binding the UDP socket differently, causing the problem. Meanwhile I'll study the mount sources in FC5 and FC6 to look for clues.

Created attachment 142182 [details]
patch: don't call connect on UDP sockets

It appears that get_socket calls connect on all sockets, not just TCP ones, and I think that is causing the kernel to reject the packets from the other addresses.
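The effect of calling connect() on a UDP socket can be reproduced outside of mount. The sketch below is an illustration only, written in Python rather than the C of nfs-utils, and it is not the patch itself. It assumes a loopback interface is available and simulates "a reply from a different address" with a second socket on a different port, since a connected UDP socket filters on the peer's address and port. The helper names `make_udp` and `reply_is_delivered` are hypothetical.

```python
import socket


def make_udp():
    """Create a UDP socket bound to an ephemeral loopback port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))
    s.settimeout(0.5)
    return s


def reply_is_delivered(connect_first):
    """Send a request to `service`, let `other` answer, and report whether
    the client socket actually receives that answer."""
    service = make_udp()  # stands in for the address named in the mount command
    other = make_udp()    # stands in for the interface that really replies
    client = make_udp()
    try:
        if connect_first:
            # Analogous to what the thread says the FC6 get_socket did for UDP.
            client.connect(service.getsockname())
            client.send(b"GETPORT")
        else:
            client.sendto(b"GETPORT", service.getsockname())
        _request, client_addr = service.recvfrom(64)
        other.sendto(b"port 33837", client_addr)  # reply from the "wrong" peer
        try:
            client.recvfrom(64)
            return True
        except (socket.timeout, ConnectionRefusedError):
            return False
    finally:
        for s in (service, other, client):
            s.close()


if __name__ == "__main__":
    print("unconnected socket sees reply:", reply_is_delivered(False))
    print("connected socket sees reply:  ", reply_is_delivered(True))
```

On a typical Linux host the unconnected socket should receive the reply while the connected one times out, which matches the hang described in this bug: the kernel finds no socket willing to take a datagram from the unexpected peer and answers with ICMP port unreachable.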
I have some RHEL5 beta packages on my people page with this patch: http://people.redhat.com/jlayton/bz208244/ I've not done any extensive testing with this, but that seems likely to be the issue.

That looks exactly like what the problem would be. Good catch Jeff. It definitely only did the connect() for SOCK_DGRAM in the util-linux get_socket(). I wonder what other regressions are present in this nfs-utils mount code? :-(

I installed the RPM that Jeff referenced in comment #36. Appears to work for me!

Yes... Nice work Jeff!!!

Fixed in nfs-utils-1.0.10-4.fc6

Today I had a problem with NFS after updating my Fedora Core 6 with the latest nfs-utils package. When I mount an NFS share, the server crashes and the /var/adm/messages file shows:
Nov 29 09:39:36 server.name nfssrv: [ID 694464 kern.warning] WARNING: nfsauth upcall failed: RPC: Operation in progress

Additional information for the client:
[pti-seb@mr129156 ~]$ uname -a
Linux mr129156 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:45:28 EST 2006 i686 i686 i386 GNU/Linux
[pti-seb@mr129156 ~]$ rpm -qa | grep nfs
nfs-utils-lib-1.0.8-7.2
nfs-utils-1.0.10-4.fc6

Additional information for the remote server:
bash-2.05$ uname -a
SunOS eclipse 5.9 Generic_112233-11 sun4u sparc SUNW,Ultra-250
bash-2.05$ pkginfo -l SUNWnfssr
PKGINST: SUNWnfssr
NAME: Network File System (NFS) server support (Root)
CATEGORY: system
ARCH: sparc
VERSION: 11.10.0,REV=2005.01.21.15.53
BASEDIR: /
VENDOR: Sun Microsystems, Inc.
DESC: Network File System (NFS) server support (Root)
PSTAMP: gaget20050121155937
INSTDATE: Jun 21 2005 17:36
HOTLINE: Please contact your local service provider
STATUS: completely installed
FILES: 18 installed pathnames
11 shared pathnames
12 directories
2 executables
21 blocks used (approx)

I think it is strange that one NFS client can crash a server.

If you're getting a server crash, then that's clearly a bug in the server regardless of whether the client is causing it or not. If you think this is due to the client doing something it shouldn't, then please open a new BZ and give a technical explanation of why you think so.

I built a fresh FC6 system, and applied all updates, including nfs-utils-1.0.10-4.fc6. Everything is working. Thanks

WRT Comment #40: in that *new* bz, please supply a bzip2 binary tethereal network trace of the crash. Something like:
tethereal -w /tmp/sol11.pcap host <server> ; bzip2 /tmp/sol11.pcap
Being that it's a Solaris 11 server, I'm very interested in finding the root cause of your issue...

Sorry, but since this crash, the administrator revoked my access to the NFS server, because only Suse and Ubuntu distributions are official in my company ... :-(

Note that the uname information of:
SunOS eclipse 5.9 Generic_112233-11 sun4u sparc SUNW,Ultra-250
means the machine is a Solaris _9_ box, not a Solaris 11 box.

The nfs-utils-1.0.10-4.fc6 version of nfs-utils fixed my problem.

Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change it to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again. And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers

This bug is open for a Fedora version that is no longer maintained and will not be fixed by Fedora. Therefore we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.

This problem was fixed in FC6 and does not exist in FC7, so it should be marked fixed rather than WONTFIX.