Bug 624254

Summary: NFS mount/umount fails if run in rapid succession
Product: Red Hat Enterprise Linux 6
Reporter: Igor Lvovsky <ilvovsky>
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED WONTFIX
QA Contact: yanfu,wang <yanwang>
Severity: high
Priority: high
Version: 6.0
CC: abaron, bazulay, cpelland, cplisko, iheim, ikent, ilvovsky, jkurik, lpeer, rwheeler
Target Milestone: rc
Keywords: RHELNAK, TestBlocker
Target Release: ---
Hardware: All
OS: Linux
Last Closed: 2010-11-10 20:34:56 UTC
Bug Blocks: 624265    
Attachments: NFS mount/umount script

Description Igor Lvovsky 2010-08-15 08:49:02 UTC
Created attachment 438811 [details]
NFS mount/umount script

Description of problem:

NFS mount/umount fails when run in rapid succession.
Rechecked on different servers and with different clients.

This issue is a showstopper for getting vdsm 2.3 out the door.


Version-Release number of selected component (if applicable):


How reproducible:

Script attached
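
For reference, the attachment is essentially a tight mount/umount loop of this shape (a reconstructed sketch, not the actual attachment; the export path and mount point are placeholders):

#!/bin/bash
# Hypothetical reconstruction of the stress test: mount and
# immediately unmount an NFS export in a loop, stopping at the
# first failure. SERVER and MNT are illustrative placeholders.
SERVER=server:/export
MNT=/mnt/test
i=0
while true; do
    i=$((i + 1))
    mount -t nfs "$SERVER" "$MNT" || { echo "ERROR $?"; echo "Iteration $i"; exit 1; }
    umount "$MNT" || { echo "ERROR $?"; echo "Iteration $i"; exit 1; }
done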

Steps to Reproduce:
1. Run the attached script (see the sketch above).
Actual results:

> ~/stressmount.sh
mount.nfs: mount system call failed
ERROR 32
Iteration 195


Expected results:

All mount/umount iterations succeed.

Additional info:

Comment 2 RHEL Program Management 2010-08-15 09:18:23 UTC
This issue was proposed at a time when only blocker
issues are being considered for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 RHEL Program Management 2010-08-18 21:32:57 UTC
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 6 Steve Dickson 2010-09-20 17:10:01 UTC
Please remove the 'soft' mount option and rerun the test.

Comment 7 Igor Lvovsky 2010-09-21 10:27:45 UTC
I removed the 'soft' mount option and got the same failure:

[root@white-vdsd x86_64]# ~/stressmount.sh
mount.nfs: mount system call failed
ERROR 32
Iteration 196

Comment 8 Steve Dickson 2010-09-21 12:12:04 UTC
Thank you for trying that....

Comment 9 Steve Dickson 2010-09-21 19:43:01 UTC
The problem is that the client is exhausting all of the reserved
ports used during the mounts. Reserved ports are network ports,
used with TCP connections, that are below 1024 and that only root
is allowed to bind to. They are used as a somewhat outdated
security "feature"...

The problem arises when the TCP connection is closed. The connection
goes into a state called TIME_WAIT for a minute or so, which
in turn ties up that port for a minute or so... Since there
is only a finite number of these ports, it does not take long to run
out of them when mounts are done this quickly... This is a
very common problem across all OSes...

With RHEL 6, there are a couple of workarounds (example
invocations follow the list):

1) Use the -o noresvport mount option. This assumes the
   server will allow mounts from non-reserved ports,
   which is generally not the default. With a Linux server,
   the 'insecure' export option needs to be specified
   to allow non-secure mounts.

2) Use UDP as the network transport, since it does not
   tie up the port after the connection is closed... BUT...
   BUYER BEWARE.... UDP is a far inferior network protocol
   to TCP, especially on a busy network that contains routers.

   The reason is that TCP knows how to smartly retransmit lost
   packets and do flow control on busy networks. UDP does
   not... With UDP, packets are simply blasted out
   on the network until an acknowledgement is received,
   basically making a busy network even worse. So be
   very, very careful if you decide to go
   this route... Actually, I would advise against it...
   The only reason I mention it is that I wanted you to know
   (and hopefully understand) all the options...
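
For illustration, the two workarounds would look something like this (server name and export path are placeholders, not taken from this bug):

# On the client: mount using a non-reserved source port
mount -t nfs -o noresvport server:/export /mnt/test

# On the server, the matching /etc/exports entry must allow
# non-reserved source ports via the 'insecure' option:
#   /export  *(rw,insecure)

# Or, on the client: use UDP as the transport instead
mount -t nfs -o udp server:/export /mnt/test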


So at the end of the day this is not a bug, it's just a
known limitation of NFS mounts.

Comment 10 Ric Wheeler 2010-09-21 19:46:12 UTC
Sounds like a release note candidate and something to close as "NOTABUG"?

Comment 11 Ayal Baron 2010-09-22 09:08:08 UTC
How can we verify that this is indeed the problem we hit?  The attached script is just something we wrote thinking it reproduces the problem, but in our tests we do not perform that many connects and disconnects before we hit the failure state.
Also, on RHEL 5.5, our tests worked fine.

Comment 12 Ayal Baron 2010-09-22 09:24:37 UTC
Also, this means that working in the "Standard way" limits us to less than 200 concurrent mounts, right? So if we need more, we will have to enable the 'insecure' option on the NFS server (assuming Linux) and update our mount options appropriately.
Would this default to try and use a port under 1024 if available?

Also, assuming we did not cross this limit, we should probably check to see if we have NFS ports in this state and if so sleep and retry.  Steve, what do you think? Any suggestions?

Comment 13 Steve Dickson 2010-09-22 13:22:03 UTC
> How can we verify that this is indeed the problem we hit? 
I ran the tests like this
# sh /tmp/stressmount.sh  ; netstat -na | grep TIME_WAIT | wc -l
mount.nfs: mount system call failed
ERROR 32
Iteration 358
1435
^^^^ is the number of connections that are in TIME_WAIT

I also wrote a systemtap probe that monitored one of the
kernel routines (xs_bind4()) that does the socket binding and
was failing with EADDRINUSE which means we ran out
of sockets... 

> Also, on RHEL 5.5, our tests worked fine
hmm... I'm a bit surprised at this, since your
test script failed in the same way when I ran a
quick test.

> Also, this means that working in the "Standard way" limits us to less than 200
> concurrent mounts, right?
No. The limitation comes into play with simultaneous mounts, a bunch
of mounts all at the same time. I'm not sure what the maximum number
of concurrent mounts is, but I'm pretty sure it's in the thousands... I would
assume it has something to do with the amount of memory that's available.

> Would this default to try and use a port under 1024 if available?
If I'm understanding the question, no. If you use the -o noresvport
option, the default behaviour will be to use a non-reserved port.

> Also, assuming we did not cross this limit, we should probably check to see if
> we have NFS ports in this state and if so sleep and retry.  Steve, what do you
> think? any suggestions?
I need more context as to what you are trying to do...

Comment 14 Ayal Baron 2010-09-22 13:36:05 UTC
(In reply to comment #13)
> > How can we verify that this is indeed the problem we hit? 
> I ran the tests like this
> # sh /tmp/stressmount.sh  ; netstat -na | grep TIME_WAIT | wc -l
> mount.nfs: mount system call failed
> ERROR 32
> Iteration 358
> 1435
> ^^^^ is the number of connections that are in TIME_WAIT
> 
> I also wrote a systemtap probe that monitored one of the
> kernel routines (xs_bind4()) that does the socket binding and
> was failing with EADDRINUSE which means we ran out
> of sockets... 
Please attach the probe so we can test whether this is really what is hitting us.  Judging by your comment about RHEL 5.5, I'm not sure it is.

> 
> > Also, on RHEL 5.5, our tests worked fine
> hmm... I'm a bit surprised at this, since your
> test script failed in the same way when I ran a
> quick test.
As I said, the attached script is not what we run; we have a Python test suite with many NFS tests which at some point simply starts to fail with the above error.  We wanted to test whether this is due to the mount/umount actions and wrote the above script.  It may be that we are looking at two different problems here.

> 
> > Also, this means that working in the "Standard way" limits us to less than 200
> > concurrent mounts, right?
> No. The limitation comes into play with simultaneous mounts, a bunch
> of mounts all at the same time. I'm not sure what the maximum number
> of concurrent mounts is, but I'm pretty sure it's in the thousands... I would
> assume it has something to do with the amount of memory that's available.
Ok, good to know.

> 
> > Would this default to try and use a port under 1024 if available?
> If I'm understanding the question, no. If you use the -o noresvport
> option, the default behaviour will be to use a non-reserved port.
Ok, so we would have to set this param per connection, thanks.

> 
> > Also, assuming we did not cross this limit, we should probably check to see if
> > we have NFS ports in this state and if so sleep and retry.  Steve, what do you think? any suggestions?
> I need more context as to what you are trying to do...
We are running in a dynamic environment where, according to which VMs are running on the system, we need to mount different mounts.  The number of mounts required depends on the number of VMs running and their placement on the storage (obviously we can have many VMs on the same mount).  Also, we use NFS mounts to store ISO images (which are exposed to VMs instead of CDs).  In a cloud environment the numbers can grow rapidly, and VMs can migrate between hosts.  If one host running 100 VMs is moved into "maintenance" mode, all its VMs will be migrated to different hosts (1 or more), which means we might face a "mount storm".

IIUC, in this case we would need to take into account the possibility that there are no available ports (but there will be shortly), in which case we would want to wait and then retry.  Before waiting, though, we should make sure the mount failed for this reason and not for something else (where retrying is pointless).
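
A hypothetical shape for such a retry wrapper (the server path, mount point, and TIME_WAIT threshold are illustrative assumptions, not from this bug):

#!/bin/bash
# Sketch: retry an NFS mount when the failure looks like reserved
# port exhaustion (mount exits 32 and many sockets sit in TIME_WAIT).
SERVER=server:/export   # placeholder
MNT=/mnt/target         # placeholder
for attempt in 1 2 3 4 5; do
    mount -t nfs "$SERVER" "$MNT" && exit 0
    rc=$?
    waiting=$(netstat -na | grep -c TIME_WAIT)
    if [ "$rc" -eq 32 ] && [ "$waiting" -gt 500 ]; then
        sleep 15    # ports free up as TIME_WAIT expires (~1 minute)
    else
        exit "$rc"  # some other failure; retrying is pointless
    fi
done
exit 1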

Comment 15 Steve Dickson 2010-09-22 14:04:44 UTC
> Please attach the probe so we can test whether this is really what is hitting
> us
It's pretty simple...

probe module("sunrpc").function("xs_bind4").return
{
	/* a nonzero return means the bind failed; -EADDRINUSE (-98)
	   indicates the reserved ports are exhausted */
	if ($return)
		printf("xs_bind4: %d %s\n", $return, errno_str($return));
}
probe begin { log("starting xprt probe") }
probe end { log("ending xprt probe") }
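
To run it (assuming the systemtap package and matching kernel debuginfo are installed), save it to a file, e.g. xs_bind4.stp, and start it before kicking off the test:

# stap xs_bind4.stp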

> As I said, the attached script is not what we run, we have a python test suite
> with many NFS scripts which at some point simply starts to fail on the above
> error.  We wanted to test whether this is due to the mount/umount actions
> running and wrote above script.  It may be that we are looking at two different
> problems here.
We might be... run the above probe and see if it fails with a
   'xs_bind4: -98 EADDRINUSE'

I'm CC-ing Ian Kent, our autofs maintainer... Autofs has
a similar need to do a large number of mounts during
system start... We call them 'mount storms'. Maybe Ian
can shed some light on how autofs deals with running out
of ports...

Comment 18 Ian Kent 2010-09-22 15:03:58 UTC
(In reply to comment #15)
> 
> > As I said, the attached script is not what we run, we have a python test suite
> > with many NFS scripts which at some point simply starts to fail on the above
> > error.  We wanted to test whether this is due to the mount/umount actions
> > running and wrote above script.  It may be that we are looking at two different
> > problems here.
> We might be... run the above probe and see if it fails with a
>    'xs_bind4: -98 EADDRINUSE'
> 
> I'm CC-ing Ian Kent, our autofs maintainer... Autofs has
> a similar need to do a large number of mounts during
> system start... We call them 'mount storms'. Maybe Ian
> can shed some light on how autofs deals with running out
> of ports...

Yeah, we have the same problem with rapid mounting in autofs.
I did a lot of work to minimize the number of ports I use by
re-using them in the RPC code that I wrote for autofs. At one time
I also used a code pattern that eliminates the TIME_WAIT state
when closing down a connection, but only for connections that
had a high likelihood of not having so-called lost duplicates.
Quite illegal from a TCP protocol standpoint, and potentially
dangerous if there are lost duplicates still in transit (avoiding
these interfering with a subsequent connection is what the
TIME_WAIT state is for).

I also seem to remember that the number of reserved ports
available had been reduced at some point, so checking that
configuration and maximising the number available for use
might help a little.
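
The range the kernel RPC client draws reserved ports from is tunable; a sketch of checking and widening it (the value 200 is an illustrative assumption, and both ends of the range must stay below 1024):

# Show the reserved-port range used by the kernel RPC client
cat /proc/sys/sunrpc/min_resvport /proc/sys/sunrpc/max_resvport
# Widen it by lowering the bottom of the range
echo 200 > /proc/sys/sunrpc/min_resvport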

Even so, the bottom line, unfortunately, is that the only
viable solution at the moment is to use ports above the
reserved port threshold.

Sorry I couldn't bring better news.

Comment 22 RHEL Program Management 2010-11-10 20:34:56 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.