Bug 486266 - Unexpected behaviour of "port=" option when mounting NFS location
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: nfs-utils
Version: 10
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
Whiteboard: Triaged
Depends On:
Blocks: 516998
Reported: 2009-02-19 01:04 EST by Ian Kent
Modified: 2009-12-18 02:58 EST (History)
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2009-12-18 02:58:08 EST


Attachments
Invalid mount option values should fail even with "sloppy" (7.04 KB, patch) - 2009-06-11 15:16 EDT, Chuck Lever
SUNRPC: RPC pings should fail quickly (1.79 KB, patch) - 2009-06-12 15:54 EDT, Chuck Lever
Invalid mount option values should always fail, even with "sloppy" (8.90 KB, patch) - 2009-06-15 13:36 EDT, Chuck Lever
More "sloppy" parsing problems (7.59 KB, patch) - 2009-06-15 13:38 EDT, Chuck Lever

Description Ian Kent 2009-02-19 01:04:29 EST
Description of problem:

The behaviour of the "port=" option appears to have changed at
some point.

Using "mount -t nfs -o port=10000 <server>:/path /mnt/tmp"
when the NFS server isn't listening on port 10000 takes quite
a while to timeout (about 2 minutes) and return a mount fail.
In times past (perhaps F-8 or earlier; the problem also occurs in
F-9) this would return a mount failure much more quickly.

Also, using "mount -t nfs -o port=-5 <server>:/path /mnt/tmp"
fails and returns an "incorrect mount options" message straight
away. But if the "-s" option is also used it performs a mount
using the default NFS service port and returns success (if the
server is listening on the NFS port) instead of returning an
error.

How reproducible:
always.
Comment 1 Ian Kent 2009-02-23 02:50:05 EST
I've noticed a couple of other regressions, with the vers and proto
options of mount.nfs(8).

The commands:

mount -t nfs -o vers=<invalid version> <server>:/<path> /<mountpoint>
mount -t nfs -o proto=<invalid proto> <server>:/<path> /<mountpoint>

both immediately fail.

But if the "-s" option is also used they both succeed with the
mount falling back to defaults (by the look of it).

In the past these failed even when the sloppy option was given, as
I think they should. I believe the sloppy option is meant to allow
the mount command to still function for mount options (for example
in shared autofs maps) that exist on other Unix implementations but
aren't present in the Linux mount.nfs(8). So, an invalid value
specified for a known mount option is different from an unknown mount
option and should fail appropriately.

Thoughts?
Ian
Comment 2 Steve Dickson 2009-06-11 13:25:44 EDT
Things have changed... When we moved to text-based mounts, the pinging
of the given port no longer happens... The ping was how the mount command
figured out that the server was not listening on that port, plus we could
control the timeout of that ping...

Now the kernel gets an untested port and blindly makes the RPC call,
which is the cause of the hang... So in the end, this is not an
nfs-utils bug; it's a kernel bug, if it's a bug at all...

Chuck, your thoughts??
Comment 3 Chuck Lever 2009-06-11 14:02:53 EDT
Indeed, the kernel is handling all of this now.  The mount command simply passes these values to the kernel, which ignores invalid mount options instead of failing the mount.  The mount command does not try its own RPCs unless the kernel's efforts failed.

If it's not pinging at all in this case, that's probably a kernel bug -- it is supposed to send a NULL procedure, I thought.  The error codes returned by the RPC layer are collapsed into a single error code (EIO in most cases) before being returned to user space.  I had planned to discuss this issue with Trond soon to understand what constraints we have on returned errno values from mount(2).  Part of the problem is communicating an appropriate error code to user space so it can recover or display an error message.

Fixing the behavior of "sloppy" may be somewhat more straightforward, and I can look at that right now for 2.6.31.
Comment 4 Chuck Lever 2009-06-11 14:07:48 EDT
I need to look at the kernel's RPC ping logic.  It may be that the network layer is immediately reporting that the port is unreachable, but the RPC client doesn't have a socket method defined to pick that up.  So, it is simply retrying until the timer expires, instead of failing immediately.
Comment 5 Chuck Lever 2009-06-11 15:16:09 EDT
Created attachment 347461 [details]
Invalid mount option values should fail even with "sloppy"

This patch should provide the correct failure mode for "sloppy" mount option parsing when invalid option values are encountered.  Please try it out.
Comment 6 Chuck Lever 2009-06-11 15:59:31 EDT
The port= problem appears to be more subtle.

The kernel _is_ attempting to do a NULLPROC call, but since this is TCP, it is trying to connect first.  The connection attempt eventually times out.  But basically, the kernel doesn't decide to give up the mount attempt until it has given the TCP connect a long chance to succeed.

The legacy mount command appears to double-check the port= value, and finds that the remote NFS service is not on 10000, so it fails immediately.

My impression was that mount should not do an rpcbind if port= was specified, which is why the kernel works this way.  I had thought that the legacy case wouldn't check with rpcbind, so that users could mount an alternate NFS service that might not be registered with the server's rpcbind.

What do we think the correct behavior should be here?
Comment 7 Jeff Layton 2009-06-11 16:19:41 EDT
I think if you specify port= explicitly then you're tacitly saying "I know what I'm doing". One of the main reasons we have the port= option is because people firewall off the portmapper. The legacy behavior sounds like it would do poorly in that situation, though maybe there is some workaround for that.

The existing kernel behavior sounds more correct to me. I don't see any reason for the kernel to second-guess what the user specified in the mount options. If the consequence of getting that wrong is a long timeout, then so be it.

That said, it would be nice to consider maybe shortening the connect timeout somehow for RPC pings. Something closer to the order of the RPC ping timeout when UDP is used would be nice (that's what, like 5s or so?). I'm not sure if that's even possible though (it may violate some TCP-related RFC)...
Comment 8 Chuck Lever 2009-06-11 16:39:55 EDT
(In reply to comment #7)
> I think if you specify port= explicitly then you're tacitly saying "I know what
> I'm doing". One of the main reasons we have the port= option is because people
> firewall off the portmapper.

Certainly true for NFSv4, which requires skipping the rpcbind unless port=0 is specified.  Not sure about the v2/v3 behavior, though.
Comment 9 Jeff Layton 2009-06-11 19:30:17 EDT
The nfs(5) manpage description of the port= option certainly implies that it skips doing a rpcbind query if you specify it. I guess I don't quite understand why you'd want to query rpcbind for the nfs port if you've already told the kernel where it is.

The legacy behavior just sounds wrong. It seems like the whole reason for the port= option is to work around cases where the remote rpcbind isn't available or is known to have wrong information. Having the mount command second guess it sort of defeats the purpose...
Comment 10 Chuck Lever 2009-06-12 00:36:11 EDT
(In reply to comment #9)
> The nfs(5) manpage description of the port= option certainly implies that it
> skips doing a rpcbind query if you specify it.

I rewrote nfs(5) last year to reflect the current behavior of text-based mounts.  I probably made an assumption about how the legacy mount command behaved in this case, and nobody corrected me ;-)

> I guess I don't quite understand
> why you'd want to query rpcbind for the nfs port if you've already told the
> kernel where it is.

My guess is that the double portmap query to set up the parameters for both NFS and MNT is always done, no matter what options are specified on the command line.  The pmap parameters for MNT seem to depend on what version of NFS is supported by the server.

> The legacy behavior just sounds wrong. It seems like the whole reason for the
> port= option is to work around cases where the remote rpcbind isn't available
> or is known to have wrong information. Having the mount command second guess it
> sort of defeats the purpose...

Yes, it sounds odd.  I had assumed it wouldn't send a query in this case, since so much else in the legacy mount command is designed to circumvent firewalls that block PMAP.

I've been burned by blithely changing historical behavior in this area, however, so I'd like to solicit opinions from folks who might have more history.

As an aside, text-based mounts renegotiate any unspecified mount options if the mount system call fails the first time.  In this case, text-based mounts use the same double query that the legacy mount command uses, in order that we get nearly the same behavior (ie bug-for-bug compatibility).  It might be time for a change in this area.
Comment 11 Steve Dickson 2009-06-12 09:30:12 EDT
> The kernel _is_ attempting to do a NULLPROC call, but since this is TCP, it is
> trying to connect first.  The connection attempt eventually times out.  But
> basically, the kernel doesn't decide to give up the mount attempt until it has
> given the TCP connect a long chance to succeed.
Yes... that's exactly what I saw as well... 

> The legacy mount command appears to double-check the port= value, and finds
> that the remote NFS service is not on 10000, so it fails immediately.
Yeah... I wrote that code... to stop mounts from hanging... 

> My impression was that mount should not do an rpcbind if port= was specified,
> which is why the kernel works this way.  I had thought that the legacy case
> wouldn't check with rpcbind, so that users could mount an alternate NFS 
> service that might not be registered with the server's rpcbind.

If I remember correctly, I wrote the code with very short timeouts,
so if the service didn't exist, that was known in a very short time.

> The legacy behavior just sounds wrong. It seems like the whole reason for the
> port= option is to work around cases where the remote rpcbind isn't available
> or is known to have wrong information. Having the mount command second guess 
> it sort of defeats the purpose... 

People don't like mounts that hang. Period! Add to that the fact that
if the remote portmapper is not up and running, it's very likely that
NFS is not up and running either... So taking a quick 'peek' (via a
ping) to avoid a long timeout is, I think, exactly the right thing to do...

I always thought the purpose of the 'port=' option was to talk to a
server not listening on the default port, and had nothing to do with
avoiding portmapper calls...


> What do we think the correct behavior should be here? 

Either have the kernel or mount.nfs do some type of
ping that will stop the mount from hanging... since
it appears people have gotten used to this type of
behaviour...
Comment 12 Jeff Layton 2009-06-12 10:52:20 EDT
(In reply to comment #11)

> 
> People don't like mounts hang. Period! Add that with the fact if the 
> remote portmapper is not up and running, its very likely that NFS 
> is not up and running... So taking a quick 'peek' (via a ping)
> to avoid a long time out, I think is exactly the right thing to do.. 
> 
> I always thought the purpose of the 'port=' option was to talk to 
> server not listening on the default port hand nothing to do with
> avoiding portmapper calls... 
> 

If the server's not listening on the default port and the portmapper is reachable then it would have the listing for the default port, right? The only reason I can see for the port= option to exist is to skip the portmap query since it basically supplies the info that the portmapper would return.

> 
> > What do we think the correct behavior should be here? 
> 
> Either have the kernel or mount.nfs do a some type of
> a ping that will stop the mount from hanging... since
> it appears people have gotten use to this type of
> behaviour...  
>

I think the ideal thing would be to shorten the TCP connect timeout when we're connecting a socket to do an RPC ping. That said, I'm not sure if that's really possible. I vaguely recall that those timeouts are governed by RFCs, so we might not be able to just arbitrarily reduce them for this purpose.
Comment 13 Jeff Layton 2009-06-12 10:53:56 EDT
Sorry, that should read:

"should have the listing for the port that it's actually listening on, right?"
Comment 14 Chuck Lever 2009-06-12 10:54:52 EDT
(In reply to comment #11)
> > My impression was that mount should not do an rpcbind if port= was specified,
> > which is why the kernel works this way.  I had thought that the legacy case
> > wouldn't check with rpcbind, so that users could mount an alternate NFS 
> > service that might not be registered with the server's rpcbind.
> 
> If remember correctly I wrote the code with very short time outs
> so if the service didn't exist, it was known in a very short time.

That may work for UDP, but a TCP connect usually introduces a long timeout, even for an RPC NULLPROC ping or when attempting to connect to a portmapper that may not be accessible.

> > The legacy behavior just sounds wrong. It seems like the whole reason for the
> > port= option is to work around cases where the remote rpcbind isn't available
> > or is known to have wrong information. Having the mount command second guess 
> > it sort of defeats the purpose... 
> 
> People don't like mounts hang. Period! Add that with the fact if the 
> remote portmapper is not up and running, its very likely that NFS 
> is not up and running...

The remote portmapper may be inaccessible in some common cases, so that's really not a perfect test as to whether the remote's NFS service is running, especially if the portmapper is behind a blocking firewall.

> So taking a quick 'peek' (via a ping)
> to avoid a long time out, I think is exactly the right thing to do..

For legacy mounts, we are not pinging first, we are doing a portmap query.  This is not guaranteed to work in every case, and it definitely does not work if the portmapper is blocked by a firewall.  Plus, this bug is proof that pinging by itself can hang just as easily as anything else.

Perhaps there is a more flexible way to avoid hangs.  If the kernel knows to give up immediately when it gets ECONNREFUSED, we can probably avoid hanging here, without the rpcbind query.  Connected UDP might help as well.

> I always thought the purpose of the 'port=' option was to talk to 
> server not listening on the default port hand nothing to do with
> avoiding portmapper calls... 

We don't have much evidence one way or another here.  However,  I recall seeing some wording in the old man page that suggested that changing the NFS program number or port number was a feature added to allow Linux clients to access multiple NFS services running on the same host concurrently.  Only one of those services can register with the server's portmapper, so the logic you added breaks that feature, in fact.
Comment 15 Chuck Lever 2009-06-12 10:57:08 EDT
(In reply to comment #12)
> (In reply to comment #11)
> 
> > 
> > People don't like mounts hang. Period! Add that with the fact if the 
> > remote portmapper is not up and running, its very likely that NFS 
> > is not up and running... So taking a quick 'peek' (via a ping)
> > to avoid a long time out, I think is exactly the right thing to do.. 
> > 
> > I always thought the purpose of the 'port=' option was to talk to 
> > server not listening on the default port hand nothing to do with
> > avoiding portmapper calls... 
> > 
> 
> If the server's not listening on the default port and the portmapper is
> reachable then it would have the listing for the default port, right? The only
> reason I can see for the port= option to exist is to skip the portmap query
> since it basically supplies the info that the portmapper would return.

To be precise, the mount command may also want to query when it needs to determine the version or transport to use.

In the legacy case, this is a common operation.  For text-based mounts, we start with a set of defaults, and if that doesn't work, the mount command renegotiates (via rpcbind queries).
Comment 16 Steve Dickson 2009-06-12 11:15:04 EDT
>> If remember correctly I wrote the code with very short time outs
>> so if the service didn't exist, it was known in a very short time.
> That may work for UDP, but a TCP connect usually introduces a long timeout,
> even for an RPC NULLPROC ping or when attempting to connect to a portmapper
> that may not be accessible.

I believe I solved this problem by *always* doing a UDP ping first
and then the TCP connect() when 'proto=tcp' was not specified...


In the end, I would think all this type of ping should be done 
from the mount command... so it can be (more) easily changed...
Comment 17 Chuck Lever 2009-06-12 11:27:54 EDT
(In reply to comment #16)
> >> If remember correctly I wrote the code with very short time outs
> >> so if the service didn't exist, it was known in a very short time.
> > That may work for UDP, but a TCP connect usually introduces a long timeout,
> > even for an RPC NULLPROC ping or when attempting to connect to a portmapper
> > that may not be accessible.
> 
> I believe I solved this problem *always* doing a UDP ping first
> and then the TCP connect() when 'proto=tcp' was not specified...

How does a UDP ping tell you if the TCP port is active?
Comment 18 Chuck Lever 2009-06-12 14:00:42 EDT
So, here's something interesting.

In my test case, the server isn't listening on TCP port 10000, so the client immediately gets RST,ACK from the server.  The kernel's RPC client was simply retrying the connect over and over to get the NULLPROC through.  Basically it was ignoring the RST, since the RPC client was hard, not soft.

So, I hacked my client to fail NULLPROC immediately if the server refuses the connection.  This causes the mount(2) system call to fail immediately with ECONNREFUSED, which is just what we want.  The upper layer protocol should usually decide policy; ie. how to recover if a transport connection can't be established.

However... a foreground mount is retried for 2 minutes by default (this should be true for both legacy and text-based mounts).  Even though the mount system call is failing immediately now, the mount command is still retrying for 2 minutes!  The mount command retries because it assumes that the server might be temporarily offline (for example, after a whole site loses power), so it keeps trying for a bit.

Now we have this question: how can the mount command tell the difference between a valid but incorrect port number and an NFS server that simply hasn't come up yet?  What if the server's portmapper is up, but its NFS service hasn't started yet?  What if the server itself is misconfigured?

Again, I have to conclude that retrying here (hanging, if you will) is probably the right answer in terms of overall reliability.  Matching up the port number to the rpcbind query, as the legacy mount command does, has some limitations, I think [*].

I agree that hanging mount requests should be avoided whenever possible, especially in automounter environments.  The challenge, though, is distinguishing between faulty mount options, and temporary server and network problems.

[*] The dependence on the client's ability to access the server's portmapper is problematic.  I think we should still be able to mount an NFS server where the server's portmapper is not accessible by the client.  If there is disagreement here, we probably should take this issue to the community.
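The foreground retry behaviour described above can be sketched as a loop with a deadline; `retry_mount` and `flaky_server` here are hypothetical names for illustration, not mount.nfs internals:

```c
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch of a foreground mount retry loop of the kind
 * described above: each attempt may fail instantly (ECONNREFUSED),
 * but the command keeps retrying until the "retry=" deadline passes,
 * because the server may simply not be up yet.  Not mount.nfs code. */
static int retry_mount(int (*try_mount)(void *), void *arg,
                       int retry_minutes, int sleep_secs)
{
    time_t deadline = time(NULL) + (time_t)retry_minutes * 60;

    for (;;) {
        int err = try_mount(arg);       /* 0 on success, else an errno */
        if (err == 0)
            return 0;
        /* ECONNREFUSED may just mean "server still booting": keep
         * trying until the deadline; retry=0 fails on the first error. */
        if (retry_minutes == 0 || time(NULL) >= deadline)
            return err;
        sleep(sleep_secs);
    }
}

/* Stand-in for mount(2): refuses twice, then succeeds. */
static int g_calls;
static int flaky_server(void *arg)
{
    (void)arg;
    return ++g_calls >= 3 ? 0 : ECONNREFUSED;
}
```

With `retry_minutes == 0` the loop degenerates to a single attempt, which is exactly the "retry=0" behaviour mentioned in comment 20.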
Comment 19 Steve Dickson 2009-06-12 15:21:12 EDT
> Now we have this question: how can the mount command tell the difference
> between a valid but incorrect port number and an NFS server that simply hasn't
> come up yet?  What if the server's portmapper is up, but its NFS service hasn't
> start yet?  What if the server itself is misconfigured?

I think we leave this decision up to the caller of the mount(2)
system call... in this case mount.nfs. The reason being that the
mount command can be configured to retry as needed... if at all...
So I would say bubble up the failure and let the user decide...
Comment 20 Chuck Lever 2009-06-12 15:54:58 EDT
Created attachment 347656 [details]
SUNRPC: RPC pings should fail quickly

This patch attempts to make kernel-level RPC pings fail quickly if the remote isn't listening.  In this way, control will return from a mount(2) system call with an error code that suggests that the mount command should retry the mount later.

For TCP, a RST from the server will generate an immediate ECONNREFUSED.  For UDP, which is unconnected, several retries are attempted before the kernel gives up with ETIMEDOUT.

Specifying the mount option "retry=0" on the mount.nfs command line will make the mount request fail immediately when the system call returns.

I have doubts about whether this patch will be acceptable upstream, but it's worth trying this patch for discussion purposes.
Comment 21 Ian Kent 2009-06-13 00:53:31 EDT
(In reply to comment #18)
> 
> Now we have this question: how can the mount command tell the difference
> between a valid but incorrect port number and an NFS server that simply hasn't
> come up yet?  What if the server's portmapper is up, but its NFS service hasn't
> start yet?  What if the server itself is misconfigured?
> 
> Again, I have to conclude that retrying here (hanging, if you will) is probably
> the right answer in terms of overall reliability.  Matching up the port number
> to the rpcbind query, as the legacy mount command does, has some limitations, I
> think [*].
> 
> I agree that hanging mount requests should be avoided whenever possible,
> especially in automounter environments.  The challenge, though, is
> distinguishing between faulty mount options, and temporary server and network
> problems.

Most everything has already been said, so I'll just rant a bit ;)

As seen from user space mount should try hard to have the mount
succeed. Even though that's not optimal for interactive uses
like autofs I've resisted pulling the mount code into autofs,
as other automount implementations have done. The fact is that
there are cases where we just have to cope with it.

Your use of the RST response is innovative and, I think, appropriate,
as it tells us the service isn't present.
But if the server isn't up we still have the dilemma of long
timeouts, particularly for TCP services.

But it gets worse because we're within the kernel and we don't
want to compromise established standard timeouts, so I'm not
sure what to do about that.

In autofs, when we have replicated mount entries (basically the
same export available from multiple servers) we construct an
ordered list from the entries based on proximity and response time.
The RPC timeouts were a problem from the beginning, so we have code
to allow us to set them. Jeff Moyer contributed a non-blocking
connect, and we wrote a bunch of RPC code to let us manage the
timeouts; more recently I wrote code to estimate proximity based
on network address (a bit broken wrt. IPv6 at this stage). We used
this initially just to make the null proc pings quicker, but now
we use the proximity information to decide the timeout to use. In
any case the timeouts are relatively short.

Some bits of this code may be useful, I don't know, but it is
available in the autofs git repository on kernel.org if you
want to have a look. If you do have a look, use the repository,
as there have been a couple of recent bug fixes not present in
the latest tar distribution.

It's at git://git.kernel.org/pub/scm/linux/storage/autofs/autofs.git

Ian
Comment 22 Chuck Lever 2009-06-13 18:34:29 EDT
(In reply to comment #16)
> In the end, I would think all this type of ping should be done 
> from the mount command... so it can be (more) easily changed...  

I'm coming to agree with this sentiment.

Text-based mounts use default settings maintained in the kernel.  Though policy is generally best managed in user space, we did this for a few reasons.

1. The code that implements the actual NFS behavior is there, so the default settings should be close by, especially if we make kernel code changes that require simultaneous changes to the default settings;

2. We designed this back when NFS mounts were driven via the util-linux mount command, and we were having a hard time getting the maintainer to respond to our requests for new default settings.  Our response was to "take ownership" by trying to move the defaults to the kernel NFS client, which we could control directly;

3. The mount(2) system call can be invoked by applications besides mount.nfs, so there needs to be a bit of safety net there.

That said, we may have a special case with the pmap parameters for the NFS and MNT programs (that is, the port, proto, and vers settings for both programs).  Though the kernel knows the default values for these, and does negotiation for other mount options like rsize, negotiation (when needed) of these parameters is still done in user space for text-based mounts.  If a text-based mount request with the default options fails, the kernel kicks it back to mount.nfs, which renegotiates the pmap parameters.

The kernel's RPC infrastructure is somewhat simpler than what we have in user space.  It works well for everything that is already in the kernel, but it's limited when it comes to specific testing for remote services, or when making special adjustments to timeouts and connect behavior.

So, we could move the pmap parameter negotiation into the kernel too, or we could change the existing "try the default settings first then renegotiate if needed" to "negotiate the pmap parameters first, then invoke mount(2)" which is basically how the legacy mount command operates.

At this point, the second option seems more expedient, and would simplify mount.nfs's retry algorithm to boot.  Maybe in the long run it makes sense to move towards the first option.  I think having the flexibility of simply replacing a user space component to get better behavior is appealing, for now.

(In reply to comment #21)
> In autofs, when we have replicated mount entries (basically the
> same export available from multiple servers) we construct an
> ordered list from the entries based on proximity and response time.

  ...

> Some bits of this code may be useful, I don't know, but it is
> available in the autofs git repository on kernel.org if you
> want to have a look.

At the moment I would like to repair the existing code in mount.nfs, and I think I can do that fairly quickly.  In the longer term, it might be a good idea to share more code and behavior between mount.nfs and autofs, so I will certainly have a look.  I'm also happy to share IPv6 implementation experiences with you, since I'm doing a lot of the IPv6 support in nfs-utils (including mount.nfs).

Btw, I would like to submit the first attachment in this bz for 2.6.31, which is basically merging as we speak.  Ian, can you please test the "sloppy" patch attached here?

  https://bugzilla.redhat.com/attachment.cgi?id=347461

Any independent testing would be appreciated.
Comment 23 Ian Kent 2009-06-13 22:58:14 EDT
(In reply to comment #22)
> 
> (In reply to comment #21)
> > In autofs, when we have replicated mount entries (basically the
> > same export available from multiple servers) we construct an
> > ordered list from the entries based on proximity and response time.
> 
>   ...
> 
> > Some bits of this code may be useful, I don't know, but it is
> > available in the autofs git repository on kernel.org if you
> > want to have a look.
> 
> At the moment I would like to repair the existing code in mount.nfs, and I
> think I can do that fairly quickly.  In the longer term, it might be a good
> idea to share more code and behavior between mount.nfs and autofs, so I will
> certainly have a look.  I'm also happy to share IPv6 implementation experiences
> with you, since I'm doing a lot of the IPv6 support in nfs-utils (including
> mount.nfs).

My main difficulty with the IPv6 code is being able to tell
the difference between an address that is within a subnet
reachable through a local interface and one on a network
reachable through that interface. I used the IPv4 network
classes for this. I'm not sure it's really such a big deal
for most cases, but I wanted to do the same sort of
calculation for IPv6 as I do for IPv4.

> 
> Btw, I would like to submit the first attachment in this bz for 2.6.31, which
> is basically merging as we speak.  Ian, can you please test the "sloppy" patch
> attached here?
> 
>   https://bugzilla.redhat.com/attachment.cgi?id=347461
> 
> Any independent testing would be appreciated.  

Will do.

Ian
Comment 24 Ian Kent 2009-06-15 02:21:10 EDT
Test results against Fedora 10 kernel 2.6.27.21 and upstream
kernel 2.6.30 with sloppy patch.

Kernel 2.6.27.21

Entry                                                                   
b2c  -port=-5     shark:/autofs/export1                                 
b3   -vers=10000  shark:/autofs/export1                                 
b3c  -vers=-5     shark:/autofs/export1                                 
b3d  /  shark:/autofs/export5/testing/test \
     /s1  -vers=10000 shark:/autofs/export5/testing/test/s1             
b3d/s1 fails
b3e /  shark:/autofs/export5/testing/test \
    /s1 shark:/autofs/export5/testing/test/s1 \
    /s1/ss1 -vers=10000 shark:/autofs/export5/testing/test/s1/ss1       
b3e/s1/ss1 fails
b4   -proto=10000  shark:/autofs/export1                                
b4a  -proto=0      shark:/autofs/export1                                
b4b  -proto=unknown shark:/autofs/export1                               
b4c  -proto=-5   shark:/autofs/export1                                  
b4d  /  shark:/autofs/export5/testing/test \
     /s1 -proto=10000  shark:/autofs/export5/testing/test/s1            
b4d/s1 fails
b4e  / shark:/autofs/export5/testing/test \
     /s1 shark:/autofs/export5/testing/test/s1 \
     /s1/ss1 -proto=10000 shark:/autofs/export5/testing/test/s1/ss1     
b4e/s1/ss1 fails

All these return success (ie. fall back to defaults) when they are
expected to fail. Note that 8 of these used to also fail before
kernel-based text option parsing, but I hadn't done anything about
it, oops!

Kernel 2.6.30 (with patch)

Entry                                                                   
b2c  -port=-5     shark:/autofs/export1                                 
b3c  -vers=-5     shark:/autofs/export1

Only these return success (ie. fallback to defaults) when they are
expected to fail with the patch, which I guess is the USHORT_MAX
comparison. Perhaps mount.nfs should check for the minus sign?

The three port=10000 entries I have take ages to complete but do
return the expected result. That is a mount failure.

Ian
Comment 25 Chuck Lever 2009-06-15 11:47:11 EDT
(In reply to comment #24)
> Test results against Fedora 10 kernel 2.6.27.21 and upstream
> kernel 2.6.30 with sloppy patch.
> 
> Kernel 2.6.27.21
> 
> Entry                                                                   
> b2c  -port=-5     shark:/autofs/export1                                 
> b3   -vers=10000  shark:/autofs/export1                                 
> b3c  -vers=-5     shark:/autofs/export1                                 
> b3d  /  shark:/autofs/export5/testing/test \
>      /s1  -vers=10000 shark:/autofs/export5/testing/test/s1             
> b3d/s1 fails
> b3e /  shark:/autofs/export5/testing/test \
>     /s1 shark:/autofs/export5/testing/test/s1 \
>     /s1/ss1 -vers=10000 shark:/autofs/export5/testing/test/s1/ss1       
> b3e/s1/ss1 fails
> b4   -proto=10000  shark:/autofs/export1                                
> b4a  -proto=0      shark:/autofs/export1                                
> b4b  -proto=unknown shark:/autofs/export1                               
> b4c  -proto=-5   shark:/autofs/export1                                  
> b4d  /  shark:/autofs/export5/testing/test \
>      /s1 -proto=10000  shark:/autofs/export5/testing/test/s1            
> b4d/s1 fails
> b4e  / shark:/autofs/export5/testing/test \
>      /s1 shark:/autofs/export5/testing/test/s1 \
>      /s1/ss1 -proto=10000 shark:/autofs/export5/testing/test/s1/ss1     
> b4e/s1/ss1 fails
> 
> All these return success (ie. fallback to defaults) when they are
> expected to fail. Note that 8 of these used to also fail before
> kernel based text option parsing but I hadn't done anything about
> it, oops!
> 
> Kernel 2.6.30 (with patch)
> 
> Entry                                                                   
> b2c  -port=-5     shark:/autofs/export1                                 
> b3c  -vers=-5     shark:/autofs/export1
> 
> Only these return success (ie. fallback to defaults) when they are
> expected to fail with the patch, which I guess is the USHORT_MAX
> comparison.

D'OH!

The USHORT_MAX comparison is a mixed-sign comparison, so it's broken for both the port= option and the mountport= option.

> The three port=10000 entries I have take ages to complete but do
> return the expected result. That is a mount failure.

Thanks for testing.
Comment 26 Chuck Lever 2009-06-15 12:30:09 EDT
(In reply to comment #25)
> D'OH!
> 
> The USHORT_MAX comparison is a mixed sign comparison, so its broken for both
> the port= option and the mountport= option.

Actually, it looks worse than that.  match_token() is punting "port=-5" because the value is not an unsigned integer.  So, the parser is treating "port=-5" as an unrecognized option rather than an invalid value.

We want "port=-5", "port=1234567", and "port=foo" all to be treated as an invalid value, not as an invalid option.  That way sloppy parsing can properly distinguish between the two.

The only way I see to fix this is to look for a _string_, not an integer, for the "port=" option, and do the integer conversion using strict_strtoul().  All other options that take a numeric value will have to have the same fix.
Comment 27 Chuck Lever 2009-06-15 13:36:39 EDT
Created attachment 347974 [details]
Invalid mount option values should always fail, even with "sloppy"

This is a replacement patch for 347461.  It has had some review upstream.
Comment 28 Chuck Lever 2009-06-15 13:38:13 EDT
Created attachment 347975 [details]
More "sloppy" parsing problems

This patch addresses how options with integer values are parsed.
Comment 29 Chuck Lever 2009-06-15 13:40:09 EDT
The last two attachments replace the original patch (347461), and should fix the test failures you reported.
Comment 30 Ian Kent 2009-06-16 05:47:27 EDT
(In reply to comment #29)
> The last two attachments replace the original patch (347461), and should fix
> the test failures you reported.  

Yep, that did the trick.

What was the decision on how the timeouts work for
port=<service not available on this port> type options?

Is it likely this will be handled in the kernel or do I
need to handle this in user space?

Ian
Comment 31 Ian Kent 2009-06-16 05:59:42 EDT
(In reply to comment #30)
> (In reply to comment #29)
> > The last two attachments replace the original patch (347461), and should fix
> > the test failures you reported.  
> 
> Yep, that did the trick.
> 
> What was the decision on the way the timeouts work for
> port=<service not available on this port> type options.
> 
> Is it likely this will be handled in the kernel or do I
> need to handle this in user space?

Or should I have also tested the "SUNRPC: RPC pings should
fail quickly" patch? You were concerned about its suitability,
I think. Dealing with this in autofs is possible, but the
problem is that we then end up probing in both autofs and
mount.nfs (do we still do that, Steve?), consuming far more
reserved ports than we really should. One idea that occurred
to me was writing a netlink interface so that user-space RPC
could take advantage of the port multiplexing added to the
NFS client; thoughts? Clearly that is tricky, and libtirpc
would need to be carefully modified for this idea to be
useful.

Ian
Comment 32 Chuck Lever 2009-06-16 11:48:55 EDT
(In reply to comment #31)
> (In reply to comment #30)
> > (In reply to comment #29)
> > > The last two attachments replace the original patch (347461), and should fix
> > > the test failures you reported.  
> > 
> > Yep, that did the trick.
> > 
> > What was the decision on the way the timeouts work for
> > port=<service not available on this port> type options.

I'm planning to fix this problem in mount.nfs.  I'm at an NFS plugfest this week, so I can't promise I can post a patch for this until next week, unless testing goes really well today.  ;-)

This is actually a good thing, since the merge window for 2.6.31 is about to close.  We won't be able to get kernel patches in again for another 3 months.

> > Is it likely this will be handled in the kernel or do I
> > need to handle this in user space?
> 
> Or should I have also tested the "SUNRPC: RPC pings should
> fail quickly" patch? You were concerned about its suitability
> I think.
> 
> Dealing with this in autofs is possible but the
> problem that arises is that we end up probing in both autofs
> and mount.nfs (do we still do that Steve?) consuming far more
> reserved ports than we really should. One idea that occurred
> to me was writing a NetLink interface so that user space RPC
> could take advantage of the port multiplexing added to the
> NFS client, thoughts? Clearly that is tricky and libtirpc
> would need to be carefully modified for this idea to be
> useful.

mount.nfs implements its own portmapper client to reduce the use of reserved ports during mounting.  This also gives us the ability to fine-tune the connect timeout behavior, or try using connected UDP, which will fail immediately if an ICMP port unreachable packet comes back to the client.
Comment 33 Chuck Lever 2009-06-17 13:58:45 EDT
(In reply to comment #32)
> (In reply to comment #31)
> > (In reply to comment #30)
> > > (In reply to comment #29)
> > > > The last two attachments replace the original patch (347461), and should fix
> > > > the test failures you reported.  
> > > 
> > > Yep, that did the trick.
> > > 
> > > What was the decision on the way the timeouts work for
> > > port=<service not available on this port> type options.
> 
> I'm planning to fix this problem in mount.nfs.  I'm at an NFS plugfest this
> week, so I can't promise I can post a patch for this until next week, unless
> testing goes really well today.  ;-)
> 
> This is actually a good thing, since the merge window for 2.6.31 is about to
> close.  We won't be able to get kernel patches in again for another 3 months.

I discussed this (and 505535) with Trond today.  He'd like a solution that moves the negotiation and retry logic into the kernel.  We discussed some of the risks to this approach.

I think that kind of kernel support is rather a longer term project, so I may still provide some regression relief for Fedora by making changes to mount.nfs.  I had a chance to hack on this a bit yesterday.

Additionally, he's accepted for 2.6.31 the two patches to fix sloppy parsing you just tested.
Comment 34 Chuck Lever 2009-06-23 14:08:54 EDT
(In reply to comment #30)
> What was the decision on the way the timeouts work for
> port=<service not available on this port> type options.
> 
> Is it likely this will be handled in the kernel or do I
> need to handle this in user space?

This is a tough one.

Failing quickly, one could argue, is a bug.  This is a case where the mount command should retry, in case the server is simply busy, or hasn't come up yet.  If you don't want a long wait, you can specify "retry=0" to have it fail immediately.

There are many similar miscommunication issues where the mount command retries. This would be an exception, IMO.

Ian, do you have an argument in favor of always failing this case immediately, other than "that's the way it used to work"? Steve, your thoughts?
Comment 35 Chuck Lever 2009-06-24 10:21:42 EDT
(In reply to comment #34)
> Ian, do you have an argument in favor of always failing this case immediately
> other than "that's the way it used to work" ?  Steve, your thoughts?  

Let me clarify this request a little.

It looks like the legacy mount.nfs command bails immediately if probe_bothports() fails.  See nfs_call_mount() -- it's called from the retry loop in nfsmount().  If probe_bothports() fails, the retry loop is broken immediately.  (Feel free to correct me if I have misunderstood what this does).

Text-based mounts can emulate this case, and that would address your concern about the new port= behavior.

There is a trade-off, of course.  This behavior makes mount.nfs responsive by eliminating an opportunity to hang.  If a server is just coming up however, there's a race condition where the server's rpcbind may start before its NFS service, causing mount requests to fail because the NFS service hasn't registered yet.

So, I'd like to understand the "requirement" to fail a mount if the NFS/MNT services aren't registered.  At least we should document it.
Comment 36 Ian Kent 2009-06-26 03:02:00 EDT
(In reply to comment #35)

Sorry for the delay, life is a bit like that though!

> (In reply to comment #34)
> > Ian, do you have an argument in favor of always failing this case immediately
> > other than "that's the way it used to work" ?  Steve, your thoughts?  

Not really.

The fact is that the mount command used by a system admin probably
should work that way.

But in automount environments that type of delay annoys people
so they log bugs. Then they compare the behaviour of Linux
mount against other systems and throw out the "but xxx works
OK, why doesn't the Linux mount work the same way" comment.

We have to admit that long delays in an interactive automount
environment aren't acceptable, but we will get the same complaints
if we fail to mount because of the problems you described below,
except possibly somewhat less often. So, we're in trouble either
way.

Also, other automount implementations pull in the NFS mount code
to take control of this type of behaviour. For Linux autofs I
can control this (and I don't want to pull in the NFS mount code),
but shortly after the version 5 release I removed the probing for
single mounts due to a customer bug request, and so ended up at
the mercy of mount. When I do this probing in autofs it leads to
it being done twice, which causes even more ports to be opened for
each mount, resulting in a real problem for TCP mounts. For sites
with security policies that require the use of privileged ports
for NFS mounts (misguided or not isn't really a judgement we
should be making for our customers), this is a sure-fire port
exhaustion problem when mounting a bunch of automount mounts, as
happens fairly frequently. On other systems RPC connections
multiplex their communications in much the same way as the Linux
NFS client does now, but we don't have this capability in user
space, so we are stuck until we resolve that, if indeed we choose
to work on providing this feature at all.

Using "retry=0" could be useful for me, but I'm not sure I really
want to "inject" mount options, as that may not be what the user
wants. We can use this as a recommendation for people reporting
this problem (though I haven't actually checked it yet).

Comment 37 Ian Kent 2009-06-26 03:11:07 EDT
(In reply to comment #36)
> Also other automount implementations pull in the NFS mount code
> to take control of this type of behaviour. For Linux autofs I
> can control this (and I don't want to pull in the NFS mount code)
> but shortly after the version 5 release I removed the probing for
> single mounts due to a customer bug request and have ended up at
> the mercy of mount. When I do this probing in autofs it leads it
> being done twice which causes even more ports to be opened for
> each mount, resulting in a real problem for TCP mounts. For sites
> with security policies (misguided or not isn't really a judgement
> we should be making for our customers) that require the use of
> privileged ports for NFS mounts this is a sure fire port exhaustion
> problem when mounting a bunch of automount mounts, as happens
> fairly frequently. For other systems RPC connections multiplex
> their communications in much the same way as the Linux NFS client
> does now but we don't have this capability in user space so we
> are stuck until we resolve that, if indeed we chose to work on
> providing this feature at all.

The network traffic generated by the multiple probing also attracts
complaints, which is why it was removed early on.
Comment 38 Chuck Lever 2009-06-26 10:08:01 EDT
(In reply to comment #37)
> (In reply to comment #36)
> > Also other automount implementations pull in the NFS mount code
> > to take control of this type of behaviour. For Linux autofs I
> > can control this (and I don't want to pull in the NFS mount code)
> > but shortly after the version 5 release I removed the probing for
> > single mounts due to a customer bug request and have ended up at
> > the mercy of mount. When I do this probing in autofs it leads it
> > being done twice which causes even more ports to be opened for
> > each mount, resulting in a real problem for TCP mounts. For sites
> > with security policies (misguided or not isn't really a judgement
> > we should be making for our customers) that require the use of
> > privileged ports for NFS mounts this is a sure fire port exhaustion
> > problem when mounting a bunch of automount mounts, as happens
> > fairly frequently. For other systems RPC connections multiplex
> > their communications in much the same way as the Linux NFS client
> > does now but we don't have this capability in user space so we
> > are stuck until we resolve that, if indeed we chose to work on
> > providing this feature at all.
> 
> The network traffic generated by the multiple probing also attracts
> complains which is why it was removed early on.  

Can you explain what exactly you mean by multiple probing?

I have a patch that makes the port=10000 case try other versions and transports, since we now quit immediately after a vers=3,tcp,port=10000 query fails.  What if vers=2,udp,port=10000 is actually working on the server?  I can see that kind of extra work would leave a couple of TCP ports in TIME_WAIT.  The question is, how often do we think this kind of configuration would come up?

I notice that Solaris does PMAP_GETPORT exclusively on UDP, even when looking for RPC services on TCP transports.  Maybe we should consider that, with a fallback to TCP if UDP rpcbind ports aren't working?  The transport we use for the MNT protocol is UDP by default, probably for this reason.

We can also consider reconnecting a TCP socket to AF_UNSPEC after we are finished with it -- that has the effect of getting rid of the TIME_WAIT ephemeral port.

Another option to consider: use PMAP_GETLIST initially.  Then, after we have a good idea of what is supported on a server, we can do PMAP_GETPORT and NULLPROC for each service to ensure that any firewall is properly informed of which ports we are looking for.  That would mean far fewer round trips in cases where the mount.nfs command is now poking at each version,transport,port combination.

Or, since the mount.nfs command has its own portmap client, it could cache its portmap client sockets so each of these PMAP/RPCBIND calls was done using the same socket (for a single given mount request).  That gets us part of the way to mount connection caching without having to build a separate daemon to do it.
Comment 39 Bug Zapper 2009-11-18 06:09:27 EST
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 40 Bug Zapper 2009-12-18 02:58:08 EST
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
