Description of problem:

LLNL is having trouble with NFS in the new cluster running the RHEL 6.2 Beta. When the cluster boots it brings up a large number of NFS clients all at the same time, and they funnel all their traffic through a gateway (NAT). They are seeing a MOUNTv3 NULL on a high port, a MOUNTv3 NULL on a low port, and a MOUNTv3 mount of the filesystem on a low port, with the last coming from the kernel. The result is that the gateway is dropping the privileged-port mount requests due to privileged-port tcp connections in time wait, and it seems to be happening more on RHEL 6 (as opposed to RHEL 5, where the issue does not exist) because of the extra privileged-port MOUNT NULL procedure.

LLNL feels that the second NULL request seems unnecessary, since it follows an NFSv3 MOUNT NULL probe on a high-numbered port. If you are already doing a MOUNT probe on a low-numbered port, why is the high-numbered port needed? If it is absolutely needed, could the second NULL ping be done from a non-reserved port instead? If it isn't absolutely needed, could it be turned off?

Ben Woodard: I believe it is TCP, and the problem appears to be that all the IP tuples from the reserved ports are used and then reused. The server then sees new packets coming in on the same connection tuple and its IP stack discards the packets because it has those tuples in TIME_WAIT. Theoretically it isn't a new problem, but because the RHEL 5 stack consumed one less connection, the problem didn't appear on RHEL 5. LLNL would like to know whether this is the intended current behavior or a change that just crept incidentally into the NFS code between RHEL 5 and 6. I think they would be happier living with it if we could point to some reason the change was made. They would love to have us remove one of these NULL requests and argue that, since these are NULL requests, they shouldn't do anything. However, I would be reluctant to change something like this because it might be the thing that makes thus-and-such connectathon test work with one particular other stack.

Version-Release number of selected component (if applicable):
RHEL 6 Update 2, latest kernel

How reproducible:
100%

Steps to Reproduce:

Actual results:
Failure in mounts

Expected results:
No failure in mounts

Additional info:
In the initial comment, "The result is that the gateway is dropping the privileged-port mount requests due to privileged-port tcp connections in time wait" should read "udp connections", not tcp.
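For reference on where the reserved ports come from: each of the privileged-port MOUNT calls described above first has to bind a reserved (below 1024) source port on the client, and it is that pool of ports the NAT eventually chews through. Below is a minimal Python sketch of the bindresvport-style pattern; it is an illustration only, not the actual kernel or libtirpc code, and the 512-1023 range and highest-first search order are assumptions.

import errno
import socket

def bind_reserved_udp_port(low=512, high=1023):
    """Bind a UDP socket to the first free port in [low, high], trying highest first."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for port in range(high, low - 1, -1):
        try:
            sock.bind(("", port))   # binding below 1024 needs root / CAP_NET_BIND_SERVICE
            return sock, port
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise
    sock.close()
    raise RuntimeError("no reserved ports free")

if __name__ == "__main__":
    s, p = bind_reserved_udp_port()
    print("bound reserved source port", p)

Each mount that insists on a privileged source port ties up one entry out of only a few hundred such ports per client/server pair, which is the resource the NAT runs out of.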
Created attachment 541119 [details]
rpc packet dump from NAT gw

Attaching a raw packet capture of 7 packets that illustrate the issue. Keep in mind these are 7 packets selected out of a capture of several hundred thousand. The capture is from the NAT gateway node, using -i any. On this node 'ib0' is the internal interface and eth0 is the external interface. I found an internal host that hit the mount timeout and restricted the packets to just that system for now. What I think we see is this:

V2 GETPORT Call
V2 GETPORT Reply
V3 MOUNT NULL Call (src port = 58544)
V3 MOUNT NULL Reply
V3 MOUNT NULL Call (src port = 836)
retransmit
retransmit

It is the second MOUNT NULL Call that we do _not_ see on RHEL 5, and there is a very high probability that this NULL Call request gets dropped (many, many nodes show this same signature). Please let me know if you want more information from this packet capture.
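If it helps to poke at this outside the kernel, here is a rough Python sketch that hand-rolls a MOUNTv3 NULL call over UDP, so the two-ping signature above (one call from an unprivileged source port, one from a privileged one) can be reproduced and watched on the gateway. The server name and the mountd port passed in are placeholders; in the real trace the port comes from the GETPORT reply.

import os
import socket
import struct

MOUNT_PROG = 100005   # ONC RPC program number for MOUNT
MOUNT_VERS = 3
NULLPROC = 0

def mount_null_call(server, mountd_port, src_port=0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", src_port))            # a src_port below 1024 needs root
    xid = int.from_bytes(os.urandom(4), "big")
    # RPC CALL header: xid, CALL(0), rpcvers=2, prog, vers, proc,
    # then AUTH_NONE credential (flavor 0, length 0) and verifier (flavor 0, length 0)
    msg = struct.pack(">10I", xid, 0, 2, MOUNT_PROG, MOUNT_VERS, NULLPROC,
                      0, 0, 0, 0)
    sock.settimeout(2.0)
    sock.sendto(msg, (server, mountd_port))
    reply = sock.recv(1024)
    return struct.unpack(">I", reply[:4])[0] == xid

if __name__ == "__main__":
    # placeholders: substitute a real filer name and its mountd port
    # mount_null_call("filer.example.com", 635)                  # unprivileged source port
    # mount_null_call("filer.example.com", 635, src_port=836)    # privileged source port
    pass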
Created attachment 541481 [details]
All RPC packets from capture

Now attaching all RPC packets from the capture taken during the reproducer.
bfields, a way to look at this that shows the problem is to filter on udp.port==836:

2.728770 is the original request from the compute node
2.728775 is the NAT'd packet
2.728965 is the reply to the NAT box
2.728969 is the NAT'd reply back to the compute node

Then 2.895488 is the next time we try to use port 836, 166.519 ms later, which is within the 2*MSL window that would cause the packet to be discarded if this were TCP. I assume there is something similar for UDP in the RPC stack, or something on the filer, and so it ignores the packet.

On RHEL 5 the fact that there isn't the second RPC NULL call from a privileged port means the NAT doesn't cycle through the ports quite as fast, and consequently the likelihood of hitting this problem is lower, to the point where it didn't happen on RHEL 5.

LLNL would like to know if that second MOUNT NULL call is really necessary. Is there some hardware that doesn't work without it, or is there some set of mount options, as mrchuck suggested, which could eliminate that call, so that they don't hit mount failures due to reserved-port reuse on the NAT box?
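One knob that feeds into how quickly a NAT'd reserved port can come back around is the gateway's conntrack UDP timeout; since UDP has no FIN or TIME_WAIT of its own, conntrack just holds the mapping until the timer expires. Below is a quick Python check of those sysctls, assuming nf_conntrack is loaded on the gateway; the values to expect there are not known to me.

from pathlib import Path

SYSCTLS = [
    "/proc/sys/net/netfilter/nf_conntrack_udp_timeout",
    "/proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream",
]

for path in SYSCTLS:
    p = Path(path)
    if p.exists():
        print(path, "=", p.read_text().strip(), "seconds")
    else:
        print(path, "not present (module not loaded or different kernel)")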
Thanks, yes, the first packet dump agrees with your description. A quick examination of the mount source in nfs-utils doesn't show any MOUNT pings done using a client that asked for a reserved port, so I'm not sure where those are coming from.
Created attachment 542640 [details]
packet dump with mountproto=tcp
After looking at the current TCP packet dump: this particular dump is not useful. It has been filtered to the degree that the important parts are missing. There are concerns about potential customer data, so tomorrow I will examine the data from the TCP mountproto file without attaching it to the case. The hope is that conntrack can use the TCP semantics to reuse connections more quickly.

Unless I come up with a better idea after looking at the TCP mountproto dump, my plan is to keep looking at this and, if necessary, generate a prototype patch that tries to reuse the port, so that each compute node only uses one reserved port to mount all the file systems rather than one reserved port for each filesystem on each compute node.

My current theories regarding the root cause of the difference between RHEL 5 and RHEL 6 in this case are:
- fewer file systems,
- more actual file servers,
- fewer compute nodes going through one NAT

In other words, the second NULL call is probably a red herring and we have a threshold effect. For some reason the RHEL 5 clusters are below the threshold but the new RHEL 6 clusters are above it.
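To make the threshold idea concrete, here is a back-of-envelope Python sketch. The node and filesystem counts and the assumed pool of 512 privileged ports on the NAT are illustrative assumptions, not measurements from the site; the only grounded input is that the RHEL 6 client generates one extra reserved-port flow per mount.

RESERVED_POOL = 1024 - 512   # privileged source ports the NAT can hand out (assumed)

def mappings_needed(nodes, filesystems, resv_flows_per_mount):
    # every mount started inside one conntrack UDP timeout window holds its
    # reserved-port mapping on the NAT until the timer expires it
    return nodes * filesystems * resv_flows_per_mount

rhel5_like = mappings_needed(nodes=100, filesystems=4, resv_flows_per_mount=1)
rhel6_like = mappings_needed(nodes=100, filesystems=4, resv_flows_per_mount=2)

# with these made-up numbers only the RHEL6-like case overflows the pool
for label, need in (("RHEL5-like", rhel5_like), ("RHEL6-like", rhel6_like)):
    print(f"{label}: {need} reserved-port mappings vs a pool of {RESERVED_POOL}")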
I've been looking at the mountproto=tcp dump and I have so far been unable to find a problem with it, though I still have some uncertainty. I'm beginning to suspect that the sysadmin's report that this takes as long as UDP has a different explanation: I think it is simply the accumulated time it takes to bring up and tear down that many TCP connections.

Conntrack appears to be working correctly too. It is just that it is running out of reserved ports: since there is nothing in the UDP protocol to tell conntrack that the connection is done, the way there is with TCP, it has to rely on a timeout, and with all the nodes, all the mounts, and the few servers, it can't come up with a mapping that works.

Some possible solutions that I have suggested:
1) IPv6 - no NAT. LLNL can't do this yet.
2) Don't require reserved ports on the filer. There is some concern that this won't be acceptable to the security people.
3) pdsh -g gw pdsh -f4 mount_all_nfs_filesystems. This is their workaround for the moment (a sketch of this throttling approach is below).

Other things which might be possible:
1) A modification to the NFS client code that tries to reuse ports if called to mount within a few seconds of the previous mount.
2) A modification to conntrack that understands the NFS mount protocol better, allowing it to reuse reserved ports more quickly because it can recognize when the reserved-port exchange is done.
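Here is a rough Python sketch of the throttling in workaround 3 above: cap how many mounts are in flight at once so the NAT never has to hold too many reserved-port mappings at the same time. The filesystem list is a placeholder, the mountpoints are assumed to have fstab entries, and the fan-out of 4 mirrors the pdsh -f4 in the workaround.

import subprocess
from concurrent.futures import ThreadPoolExecutor

FILESYSTEMS = ["/nfs/scratch1", "/nfs/scratch2", "/nfs/home"]   # placeholder mountpoints

def do_mount(mountpoint):
    # relies on an fstab entry for each mountpoint; needs root
    return subprocess.call(["mount", mountpoint])

# fan-out of 4, analogous to pdsh -f4: at most 4 mounts in flight at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(do_mount, FILESYSTEMS))

print("failed mounts:", sum(1 for rc in results if rc != 0))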