Description of problem:

LLNL is having trouble with NFS in the new cluster running the RHEL 6.2 Beta. When the cluster boots it brings up a large number of NFS clients all at the same time, and they funnel all their traffic through a gateway (NAT). They are seeing a MOUNTv3 NULL on a high port, a MOUNTv3 NULL on a low port, and a MOUNTv3 mount of the filesystem on a low port, with the last coming from the kernel. The result is that the gateway is dropping the privileged-port mount requests due to privileged-port tcp connections in time wait, and it seems to be happening more on RHEL 6 (as opposed to RHEL 5, where the issue does not exist) because of the extra privileged-port MOUNT NULL procedure.

LLNL feels that the second NULL request seems unnecessary, since it follows an NFSv3 MOUNT NULL probe on a high-numbered port. If you are already doing a MOUNT probe on a low-numbered port, why is the high-numbered port needed? If it is absolutely needed, could the second NULL ping be done from a non-reserved port instead? If it isn't absolutely needed, could it be turned off?

Ben Woodard: I believe it is TCP, and the problem appears to be that all the IP tuples from the reserved ports are used and then reused. The server then sees new packets coming in on the same connection tuple and its IP stack discards the packets because it has those tuples in TIME_WAIT. Theoretically it isn't a new problem, but because the RHEL 5 stack consumed one less connection, the problem didn't appear on RHEL 5. LLNL would like to know whether this is the intended current behavior or a change that just crept incidentally into the NFS code between RHEL 5 and 6. I think they would be happier living with it if we could point to some reason the change was made. They would love to have us remove one of these NULL requests and argue that, since these are NULL requests, they shouldn't do anything. However, I would be reluctant to change something like this because it might be the thing that makes thus-and-such connectathon test work with one particular other stack.

Version-Release number of selected component (if applicable):
RHEL 6 Update 2, latest kernel

How reproducible:
100%

Steps to Reproduce:

Actual results:
Failure in mounts

Expected results:
No failure in mounts

Additional info:
In the initial comment, "The result is that the gateway is dropping the privileged-port mount requests due to privileged-port tcp connections in time wait" should read "udp connections", not tcp.
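For reference on where the reserved ports come from: each of the privileged-port MOUNT calls described above first has to bind a reserved (below 1024) source port on the client, and it is that pool of ports the NAT eventually chews through. Below is a minimal Python sketch of the bindresvport-style pattern; it is an illustration only, not the actual kernel or libtirpc code, and the 512-1023 range and highest-first search order are assumptions.

import errno
import socket

def bind_reserved_udp_port(low=512, high=1023):
    """Bind a UDP socket to the first free port in [low, high], trying highest first."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for port in range(high, low - 1, -1):
        try:
            sock.bind(("", port))   # binding below 1024 needs root / CAP_NET_BIND_SERVICE
            return sock, port
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise
    sock.close()
    raise RuntimeError("no reserved ports free")

if __name__ == "__main__":
    s, p = bind_reserved_udp_port()
    print("bound reserved source port", p)

Each mount that insists on a privileged source port ties up one entry out of only a few hundred such ports per client/server pair, which is the resource the NAT runs out of.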
Created attachment 541119 [details]
rpc packet dump from NAT gw

Attaching a raw packet capture of 7 packets that illustrate the issue. Keep in mind these are 7 packets selected out of a capture of several hundred thousand. The capture is from the NAT gateway node, using -i any. On this node 'ib0' is the internal interface and eth0 is the external interface. I found an internal host that hit the mount timeout and restricted the packets to just that system for now. What I think we see is this:

V2 GETPORT Call
V2 GETPORT Reply
V3 MOUNT NULL Call (src port = 58544)
V3 MOUNT NULL Reply
V3 MOUNT NULL Call (src port = 836)
retransmit
retransmit

It is the second MOUNT NULL Call that we do _not_ see on RHEL 5, and there is a very high probability that this NULL Call request gets dropped (many, many nodes show this same signature). Please let me know if you want more information from this packet capture.
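If it helps to poke at this outside the kernel, here is a rough Python sketch that hand-rolls a MOUNTv3 NULL call over UDP, so the two-ping signature above (one call from an unprivileged source port, one from a privileged one) can be reproduced and watched on the gateway. The server name and the mountd port passed in are placeholders; in the real trace the port comes from the GETPORT reply.

import os
import socket
import struct

MOUNT_PROG = 100005   # ONC RPC program number for MOUNT
MOUNT_VERS = 3
NULLPROC = 0

def mount_null_call(server, mountd_port, src_port=0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", src_port))            # a src_port below 1024 needs root
    xid = int.from_bytes(os.urandom(4), "big")
    # RPC CALL header: xid, CALL(0), rpcvers=2, prog, vers, proc,
    # then AUTH_NONE credential (flavor 0, length 0) and verifier (flavor 0, length 0)
    msg = struct.pack(">10I", xid, 0, 2, MOUNT_PROG, MOUNT_VERS, NULLPROC,
                      0, 0, 0, 0)
    sock.settimeout(2.0)
    sock.sendto(msg, (server, mountd_port))
    reply = sock.recv(1024)
    return struct.unpack(">I", reply[:4])[0] == xid

if __name__ == "__main__":
    # placeholders: substitute a real filer name and its mountd port
    # mount_null_call("filer.example.com", 635)                  # unprivileged source port
    # mount_null_call("filer.example.com", 635, src_port=836)    # privileged source port
    pass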
Created attachment 541481 [details]
All RPC packets from capture

Now attaching all RPC packets from the capture taken during the reproducer.
bfields, a way to look at this that shows the problem is to filter on udp.port==836:

2.728770 is the original request from the compute node
2.728775 is the NAT'd packet
2.728965 is the reply to the NAT box
2.728969 is the NAT'd reply back to the compute node

Then 2.895488 is the next time we try to use port 836, 166.519 ms later, which is within the 2*MSL window that would cause the packet to be discarded if this were TCP. I assume there is something similar for UDP in the RPC stack, or something on the filer, and so it ignores the packet.

On RHEL 5 the fact that there isn't the second RPC NULL call from a privileged port means the NAT doesn't cycle through the ports quite as fast, and consequently the likelihood of hitting this problem is lower, to the point where it didn't happen on RHEL 5.

LLNL would like to know if that second MOUNT NULL call is really necessary. Is there some hardware that doesn't work without it, or is there some set of mount options, as mrchuck suggested, which could eliminate that call, so that they don't hit mount failures due to reserved-port reuse on the NAT box?
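One knob that feeds into how quickly a NAT'd reserved port can come back around is the gateway's conntrack UDP timeout; since UDP has no FIN or TIME_WAIT of its own, conntrack just holds the mapping until the timer expires. Below is a quick Python check of those sysctls, assuming nf_conntrack is loaded on the gateway; the values to expect there are not known to me.

from pathlib import Path

SYSCTLS = [
    "/proc/sys/net/netfilter/nf_conntrack_udp_timeout",
    "/proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream",
]

for path in SYSCTLS:
    p = Path(path)
    if p.exists():
        print(path, "=", p.read_text().strip(), "seconds")
    else:
        print(path, "not present (module not loaded or different kernel)")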
Thanks, yes, the first packet dump agrees with your description. A quick examination of the mount source in nfs-utils doesn't show any MOUNT pings done using a client that asked for a reserved port, so I'm not sure where those are coming from.
Created attachment 542640 [details]
packet dump with mountproto=tcp
After looking at the current TCP packet dump: this particular dump is not useful. It has been filtered to the degree that the important parts are missing. There are concerns about potential customer data, so tomorrow I will examine the data from the TCP mountproto file without attaching it to the case. The hope is that conntrack can use the TCP semantics to reuse connections more quickly.

Unless I come up with a better idea after looking at the TCP mountproto dump, my plan is to keep looking at this and, if necessary, generate a prototype patch that tries to reuse the port, so that each compute node only uses one reserved port to mount all the file systems rather than one reserved port for each filesystem on each compute node.

My current theories regarding the root cause of the difference between RHEL 5 and RHEL 6 in this case are:
- fewer file systems,
- more actual file servers,
- fewer compute nodes going through one NAT

In other words, the second NULL call is probably a red herring and we have a threshold effect. For some reason the RHEL 5 clusters are below the threshold but the new RHEL 6 clusters are above it.
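To make the threshold idea concrete, here is a back-of-envelope Python sketch. The node and filesystem counts and the assumed pool of 512 privileged ports on the NAT are illustrative assumptions, not measurements from the site; the only grounded input is that the RHEL 6 client generates one extra reserved-port flow per mount.

RESERVED_POOL = 1024 - 512   # privileged source ports the NAT can hand out (assumed)

def mappings_needed(nodes, filesystems, resv_flows_per_mount):
    # every mount started inside one conntrack UDP timeout window holds its
    # reserved-port mapping on the NAT until the timer expires it
    return nodes * filesystems * resv_flows_per_mount

rhel5_like = mappings_needed(nodes=100, filesystems=4, resv_flows_per_mount=1)
rhel6_like = mappings_needed(nodes=100, filesystems=4, resv_flows_per_mount=2)

# with these made-up numbers only the RHEL6-like case overflows the pool
for label, need in (("RHEL5-like", rhel5_like), ("RHEL6-like", rhel6_like)):
    print(f"{label}: {need} reserved-port mappings vs a pool of {RESERVED_POOL}")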
I've been looking at the mountproto=tcp dump and I have so far been unable to find a problem with it, though I still have some uncertainty. I'm beginning to suspect that the sysadmin's report that this takes as long as UDP has a different explanation: I think it is simply the accumulated time it takes to bring up and tear down that many TCP connections.

Conntrack appears to be working correctly too. It is just that it is running out of reserved ports: since there is nothing in the UDP protocol to tell conntrack that the connection is done, the way there is with TCP, it has to rely on a timeout, and with all the nodes, all the mounts, and the few servers, it can't come up with a mapping that works.

Some possible solutions that I have suggested:
1) IPv6 - no NAT. LLNL can't do this yet.
2) Don't require reserved ports on the filer. There is some concern that this won't be acceptable to the security people.
3) pdsh -g gw pdsh -f4 mount_all_nfs_filesystems. This is their workaround for the moment (a sketch of this throttling approach is below).

Other things which might be possible:
1) A modification to the NFS client code that tries to reuse ports if called to mount within a few seconds of the previous mount.
2) A modification to conntrack that understands the NFS mount protocol better, allowing it to reuse reserved ports more quickly because it can recognize when the reserved-port exchange is done.
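Here is a rough Python sketch of the throttling in workaround 3 above: cap how many mounts are in flight at once so the NAT never has to hold too many reserved-port mappings at the same time. The filesystem list is a placeholder, the mountpoints are assumed to have fstab entries, and the fan-out of 4 mirrors the pdsh -f4 in the workaround.

import subprocess
from concurrent.futures import ThreadPoolExecutor

FILESYSTEMS = ["/nfs/scratch1", "/nfs/scratch2", "/nfs/home"]   # placeholder mountpoints

def do_mount(mountpoint):
    # relies on an fstab entry for each mountpoint; needs root
    return subprocess.call(["mount", mountpoint])

# fan-out of 4, analogous to pdsh -f4: at most 4 mounts in flight at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(do_mount, FILESYSTEMS))

print("failed mounts:", sum(1 for rc in results if rc != 0))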