| Summary: | problem with nodes rebooting and mounting NFS through a NAT | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Travis Gummels <tgummels> |
| Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.2 | CC: | bfields, dhowells, jlayton, rwheeler, sprabhu, steved, woodard |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-03-01 20:11:20 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | rpc packet dump from NAT gw; All RPC packets from capture; packet dump with mountproto=tcp (see description) | | |
Description (Travis Gummels, 2011-12-05 22:49:21 UTC)
In the initial comment, "The result is the gateway is dropping the privileged port mount requests due to privileged port tcp connections in time wait" should read "port udp connections", not tcp.

Created attachment 541119 [details]
rpc packet dump from NAT gw
Attaching raw packet capture of 7 packets that illustrate the issue.
Keep in mind this is 7 packets selected out of a capture of several
hundred thousand.
The packet capture is from the NAT gateway node, using -i any. On this
node 'ib0' is the internal interface, and eth0 is the external interface.
I found an internal host that hit the mount timeout and restricted the
packets to just that system for now. What I think we see is this:
V2 GETPORT Call
V2 GETPORT Reply
V3 MOUNT NULL Call (src port = 58544)
V3 MOUNT NULL Reply
V3 MOUNT NULL Call (src port = 836)
retransmit
retransmit
It is the second MOUNT NULL Call that we do _not_ see on RHEL5,
and there is a very high probability that this NULL Call request
gets dropped (many, many nodes show this same signature).
Please let me know if you want more information from this packet capture.
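For reference, a capture like the one attached could be produced on the NAT gateway roughly as follows. This is only a sketch: the interface choice matches the '-i any' mentioned above, but the capture file names and the 10.1.2.3 address are placeholders for the real internal host.

  # capture on all interfaces of the NAT gateway ('ib0' internal, 'eth0' external)
  tcpdump -i any -s 0 -w nat-gw-full.pcap

  # afterwards, narrow the full capture down to the one compute node
  # that hit the mount timeout
  tcpdump -r nat-gw-full.pcap -w one-node.pcap host 10.1.2.3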
Created attachment 541481 [details]
All RPC packets from capture
Now attaching all RPC packets from the capture during the reproducer
bfields, a way to look at this that shows the problem is to filter on udp.port==836:

2.728770 is the original request from the compute node
2.728775 is the NAT'd packet
2.728965 is the reply to the NAT box
2.728969 is the NAT'd reply back to the compute node

Then 2.895488 is the next time we try to use port 836, 166.519 ms later, which is within the 2*MSL (segment lifetime) window that, if this were TCP, would cause the packet to be discarded. I assume that there is something similar for UDP in the RPC stack or something on the filer, and so it ignores the packet.

On RHEL5 the fact that there isn't the 2nd RPC NULL call from a privileged port means that the NAT doesn't recycle through the ports quite as fast, and consequently the likelihood of hitting this same problem is low enough that it didn't happen on RHEL5.

LLNL would like to know whether that 2nd MOUNT NULL call is really necessary. Is there some hardware that doesn't work without it, or is there some set of mount options, as mrchuck suggested, which could eliminate that call, so that they don't hit mount failures due to reserved port reuse on the NAT box?

Thanks, yes the first packet dump agrees with your description. A quick examination of the mount source in nfs-utils doesn't show any MOUNT pings done using a client that asked for a reserved port--so I'm not sure where those are coming from.

Created attachment 542640 [details]
packet dump with mountproto=tcp
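As an aside, the udp.port==836 filtering described a couple of comments up can be reproduced with a Wireshark/tshark display filter along these lines. This is a sketch only: the capture file name is a placeholder, and -Y assumes a reasonably recent tshark (older builds use -R for display filters).

  # show only the traffic using reserved source port 836, with relative
  # timestamps comparable to the ones quoted above
  tshark -r all-rpc.pcap -t r -Y 'udp.port == 836'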
After looking at the current TCP packet dump: this particular dump is not useful; it has been filtered to the degree where the important parts are missing. There are concerns about potential customer data, and so tomorrow I will examine the data from the TCP mountproto file without attaching it to the case. The hope is that conntrack can use the TCP semantics to reuse connections more quickly.

Unless I come up with some better idea after looking at the TCP mountproto dump, my plan is to keep looking at this and, if necessary, generate a prototype patch which tries to reuse the port so that each compute node only uses 1 resv port to mount all the file systems rather than 1 resv port for each fs for each compute node.

My current theories regarding the root cause of the difference between rhel5 and rhel6 in this case are:
- fewer file systems,
- more actual file servers,
- fewer compute nodes going through one NAT

In other words the 2nd NULL call is probably a red herring and we have a threshold thing. For some reason the RHEL5 clusters are sub-threshold but the new RHEL6 clusters are above the threshold.

I've been looking at the mountproto=tcp dump and I have so far been unable to find a problem with it. However, I still have some uncertainty. I'm beginning to suspect that the sysadmin's reports that this takes as long as with UDP have a different explanation: I think it is simply the accumulated time it takes to bring up and tear down that many TCP connections.

Conntrack appears to be working right too. It is just that it is running out of resv ports: since there is nothing in the UDP protocol to tell conntrack that the connection is done, the way there is with TCP, it has to rely on a timeout, and with all the nodes, all the mounts, and the few servers it can't come up with a mapping that works (the relevant conntrack timeouts are shown below).

Some possible solutions that I have suggested:
1) IPv6 - no NAT. LLNL can't do this yet.
2) Don't require resv ports on the filer. There is some concern that this won't be acceptable to the security people (a sketch of what this would look like follows below).
3) pdsh -g gw pdsh -f4 mount_all_nfs_filesystems - this is their workaround for the moment.

Other things which might be possible:
1) A modification to the NFS client code that tries to reuse ports if it is called to mount within a few seconds of a previous mount.
2) A modification to conntrack that understands the NFS mount protocol better, allowing it to reuse the resv ports more quickly because it can recognize that the resv port communication is done.
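For what it's worth, the UDP entry lifetime that conntrack has to fall back on (in the absence of a TCP-style close) can be inspected on the gateway roughly like this. This is a sketch and assumes the nf_conntrack module and conntrack-tools are installed; exact sysctl names can differ between kernel versions.

  # how long a UDP mapping (and the reserved source port behind it)
  # stays pinned after the last packet
  sysctl net.netfilter.nf_conntrack_udp_timeout
  sysctl net.netfilter.nf_conntrack_udp_timeout_stream

  # how many UDP mappings are live on the NAT box right now
  conntrack -L -p udp | wc -l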
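On suggestion 2) above (not requiring reserved ports on the filer): on a Linux NFS server the equivalent knob would be the 'insecure' export option, and the clients could then use the 'noresvport' mount option so that at least the NFS transport stops consuming reserved source ports (whether that also covers the MOUNT calls seen here would need to be verified). This is only an illustration of the idea, not a tested recommendation for the filer in question; the export path, server name, and mount point are placeholders.

  # server side (Linux nfsd): allow requests from non-reserved ports
  #   /export/scratch  *(rw,insecure)

  # client side: do not bind the mount to a reserved source port
  mount -t nfs -o noresvport server:/export/scratch /mnt/scratch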