Description of problem:
mount fails intermittently for NFSv4

Version-Release number of selected component (if applicable):
util-linux-2.12a-16.EL4.29

How reproducible:
Perform multiple mounts of a volume specifying "-t nfs4".

Steps to Reproduce:
1. Run the script below (with appropriate substitutions)

Actual results:
NFSv4 mounting sometimes fails with the (spurious) error "mount: wrong fs type, bad option, bad superblock on <remote-filesystem>, or too many mounted file systems". The error comes and goes on successive mounts, but it will persist in failing or succeeding for a short period of time, e.g. 30-60 seconds (though it may actually be tied more to the number of mounts, since I've seen it continue to fail for many hours when I was only doing individual mounts by hand).

Expected results:
NFSv4 mounts should behave consistently (i.e. always fail or always succeed), as NFSv3 mounts do.

Additional info:
When mounting a volume from a Netapp filer, RHEL 4.6 sometimes produces the error "mount: wrong fs type, bad option, bad superblock on <remote-filesystem>, or too many mounted file systems". This error comes and goes on successive mounts. I've tried destroying and recreating the Netapp volume itself while the error was occurring, but it continued to happen, so it's not a problem with the volume. Also, NFSv3 mounts of the same volume always work fine even when NFSv4 mounts are producing this error, so it's clear that it's a problem with the Red Hat NFSv4 mount code.
The following script demonstrates the issue:

---- 8< -----------------------------------------------------
#!/bin/sh

MOUNTOPTIONS=${@:-"-t nfs4"}
MOUNTS=0
FAILURES=0
OUTFILE=/tmp/mountout.$$
REMOTEVOLUME=netappfiler:/vol/test
MOUNTPOINT=/var/tmp/test

trap 'rm -f $OUTFILE; umount $MOUNTPOINT 2>/dev/null; echo "Exiting after $MOUNTS mounts"; exit 0' INT

while true; do
    MOUNTS=`expr $MOUNTS + 1`
    if mount $MOUNTOPTIONS $REMOTEVOLUME $MOUNTPOINT/ 2>$OUTFILE; then
        if [ $FAILURES -gt 0 -o $MOUNTS -eq 1 ]; then
            echo ""
            echo "mount is SUCCEEDING as of `date` ($MOUNTS mounts)"
        fi
        FAILURES=0
    else
        if [ $FAILURES -eq 0 ]; then
            echo ""
            echo "mount is failing as of `date` ($MOUNTS mounts): `cat $OUTFILE`"
        fi
        FAILURES=`expr $FAILURES + 1`
    fi
    umount $MOUNTPOINT 2>/dev/null
    sleep .5
done
---- 8< -----------------------------------------------------

Here's the output of the script when I ran it over a period of a few minutes with "-t nfs4":

---- 8< -----------------------------------------------------
mount is now SUCCEEDING as of Thu Feb 14 11:25:37 PST 2008 (1 mounts)
mount is now failing as of Thu Feb 14 11:26:02 PST 2008 (31 mounts): mount: wrong fs type, bad option, bad superblock on netappfiler:/vol/test, or too many mounted file systems
mount is now SUCCEEDING as of Thu Feb 14 11:26:34 PST 2008 (81 mounts)
mount is now failing as of Thu Feb 14 11:27:40 PST 2008 (159 mounts): mount: wrong fs type, bad option, bad superblock on netappfiler:/vol/test, or too many mounted file systems
mount is now SUCCEEDING as of Thu Feb 14 11:28:12 PST 2008 (209 mounts)
mount is now failing as of Thu Feb 14 11:29:18 PST 2008 (287 mounts): mount: wrong fs type, bad option, bad superblock on netappfiler:/vol/test, or too many mounted file systems
mount is now SUCCEEDING as of Thu Feb 14 11:29:50 PST 2008 (337 mounts)
mount is now failing as of Thu Feb 14 11:30:58 PST 2008 (415 mounts): mount: wrong fs type, bad option, bad superblock on netappfiler:/vol/test, or too many mounted file systems
mount is now SUCCEEDING as of Thu Feb 14 11:31:30 PST 2008 (465 mounts)
mount is now failing as of Thu Feb 14 11:32:36 PST 2008 (543 mounts): mount: wrong fs type, bad option, bad superblock on netappfiler:/vol/test, or too many mounted file systems
mount is now SUCCEEDING as of Thu Feb 14 11:33:09 PST 2008 (593 mounts)
mount is now failing as of Thu Feb 14 11:34:13 PST 2008 (671 mounts): mount: wrong fs type, bad option, bad superblock on netappfiler:/vol/test, or too many mounted file systems
---- 8< -----------------------------------------------------

By contrast, if you run the script with "-o vers=3" the mount never fails (even while mounts with "-t nfs4" are failing). I've opened Red Hat service request 180277 for this NFSv4 bug (and one other), and that case has syslog and tcpdump output for the mount issue along with a sysreport for the system in question.
One possibly important piece of info: setting aside the initial failure count, the mount switches from failing to succeeding after 50 mounts and from succeeding back to failing after 78 mounts. This is 100% consistent on two different RHEL 4.6 servers where I've run the script. You can see it in the output above, but here's more output from another system:

mount is failing as of Thu Feb 14 12:22:32 PST 2008 (77 mounts)
mount is SUCCEEDING as of Thu Feb 14 12:23:04 PST 2008 (127 mounts)
mount is failing as of Thu Feb 14 12:24:09 PST 2008 (205 mounts)
mount is SUCCEEDING as of Thu Feb 14 12:24:42 PST 2008 (255 mounts)
mount is failing as of Thu Feb 14 12:25:47 PST 2008 (333 mounts)
mount is SUCCEEDING as of Thu Feb 14 12:26:19 PST 2008 (383 mounts)
mount is failing as of Thu Feb 14 12:27:24 PST 2008 (461 mounts)
mount is SUCCEEDING as of Thu Feb 14 12:27:56 PST 2008 (511 mounts)
mount is failing as of Thu Feb 14 12:29:02 PST 2008 (589 mounts)

Note the deltas: 50, 78, 50, 78, 50, 78, 50, 78. I'm not sure what this means, though 50+78=128, which seems suggestive.
I've determined what's triggering the failed mounts. In the NFS SETCLIENTID call, there's an r_addr field (aka the "universal address") that contains the IP address plus (I assume) the port of the client, in ASCII. The difference between mounts that fail and mounts that work is that for the working mounts, the port number portion of r_addr has three digits in its last octet. For example, an r_addr of "10.0.0.1.191.98" (where the last octet, 98, is two digits) causes the filer to respond with NFS4ERR_INVAL, but an r_addr of "10.0.0.1.191.100" (where the last octet, 100, is three digits) causes the filer to respond with NFS4_OK.

This explains why the mounts fail for 50 attempts and then work for 78 attempts: the port number is always incremented by 2, so while the second (final) octet of the port number increments from 0 to 98 the mounts fail. But once that octet hits 100, and as it continues to increment up to 254, the mounts work.

So at this point I'm no longer sure this is a bug in RHEL4; it may actually be a bug in the Netapp filer's handling of the r_addr field. Based on the filer's response, it would appear to be expecting each octet of the port number portion of r_addr to be zero-filled, but I can't find a reference that gives the correct format (RFC 1833 says the format is "defined by the addressing authority for the transport", with no pointer to another document). The IP address fields in r_addr aren't zero-filled, though, and that doesn't cause any problem on the successful mounts. I've opened a case with Netapp to see what they say.

If you want to see tcpdumps of failed and successful mounts, they're available as part of RHN support case 1802770.
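As a sanity check on the arithmetic, here's a minimal sketch (shell, like the reproducer above) that encodes the observed rule -- final port octet under 100 fails, 100 or more succeeds -- and counts how many of the 128 even octet values fall on each side. The rule itself is the observed filer behavior, not anything from a spec:

```shell
# Walk the 128 even values the final port octet (p2 = port % 256)
# takes when the client port increments by 2, and count how many have
# fewer than three digits (< 100, observed to fail) vs. three digits
# (>= 100, observed to succeed).
fails=0
works=0
for p2 in $(seq 0 2 254); do
    if [ "$p2" -lt 100 ]; then
        fails=$(( fails + 1 ))
    else
        works=$(( works + 1 ))
    fi
done
echo "$fails failing, $works working"    # prints "50 failing, 78 working"
```

This reproduces exactly the 50/78 cycle seen in the script output above (and 50+78=128, one full pass of the octet).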
John, thanks for the nice investigative work. Quoting from RFC 3530:

-------[snip]-------
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the US-ASCII string:

    h1.h2.h3.h4.p1.p2

The prefix, "h1.h2.h3.h4", is the standard textual form for representing an IPv4 address, which is always four octets long. Assuming big-endian ordering, h1, h2, h3, and h4, are respectively, the first through fourth octets each converted to ASCII-decimal. Assuming big-endian ordering, p1 and p2 are, respectively, the first and second octets each converted to ASCII-decimal. For example, if a host, in big-endian order, has an address of 0x0A010307 and there is a service listening on, in big endian order, port 0x020F (decimal 527), then the complete universal address is "10.1.3.7.2.15".
-------[snip]-------

...given that even the example has port octets that are less than 3 digits, I think this is almost surely a server-side problem.
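The encoding the RFC describes is easy to sketch. The `uaddr` helper below is hypothetical (it's not part of any tool mentioned in this report), and the two non-RFC port values are simply the ones that reproduce the r_addr strings observed earlier in this bug:

```shell
# Encode an RFC 3530 universal address ("h1.h2.h3.h4.p1.p2") from a
# dotted-quad IPv4 address and a port number. Note that no octet is
# zero-padded, which matches the RFC's own example.
uaddr() {
    echo "$1.$(( $2 / 256 )).$(( $2 % 256 ))"
}

uaddr 10.1.3.7 527        # prints "10.1.3.7.2.15" (the RFC 3530 example)
uaddr 10.0.0.1 48994      # prints "10.0.0.1.191.98"  (rejected by the filer)
uaddr 10.0.0.1 48996      # prints "10.0.0.1.191.100" (accepted)
```

Since the RFC's own example ends in the unpadded two-digit octet "15", a server that rejects unpadded octets is out of spec.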
Actually, I was mistaken. The correlation of a mount failure with a final port octet of fewer than 3 digits does hold, except in one case: immediately after a reboot. Right after a reboot, mount requests with a 3-digit final octet may fail and mount requests with a final octet of fewer than 3 digits may succeed. This continues for one cycle, at which point the pattern reverts to the familiar sequence I've already described.

I did find that after this initial anomalous cycle I could consistently make the mounts succeed or fail by manipulating the length of the port number's final octet via /proc/sys/net/ipv4/ip_local_port_range, so from that point on the pattern holds. This would seem to imply that the correlation I found is only a secondary correlation--not the core issue--which makes it less clear (again) whether this is a bug in RHEL4 or a bug on the Netapp filer.

Also, the RHN support tech and I have verified that 1) Mac OS X has no similar problems mounting the same volume from the same Netapp filer with NFS4, and 2) RHEL4 has no similar problems mounting an NFS4-based share from a RHEL5 NFS server. So this is an interoperability issue between RHEL4 and Netapp's ONTAP 7.2.3, and it's hard to say where the bug is.

You can see more details of this in RHN support case 1802770, including a tcpdump of the after-reboot behavior.
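For reference, here is a minimal sketch of how a port window can be chosen so the final r_addr octet is always three digits. The value p1=128 is purely illustrative (any p1 works), and the succeed/fail effect is the observed filer behavior, not documented semantics:

```shell
# Compute a local-port window in which the low octet (p2 = port % 256)
# is always three digits (100..254), so every client connection yields
# a 3-digit final octet in r_addr. p1=128 is an arbitrary example
# value inside the usual ephemeral range.
p1=128
low=$(( p1 * 256 + 100 ))
high=$(( p1 * 256 + 254 ))
echo "$low $high"    # prints "32868 33022"

# To apply for testing (root only; note this narrows the ephemeral
# port range for the whole system, so revert it afterwards):
#   echo "$low $high" > /proc/sys/net/ipv4/ip_local_port_range
```

The inverse window (p1*256 .. p1*256+98) should likewise force failures, matching the manipulation described above.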
I've been able to reproduce this here as well on one of our lab netapps. I've inspected some traces with the setclientid packet and don't see anything wrong with it...

> 1) Mac OS X has no similar problems mounting the same volume from the same
> Netapp filer with NFS4

Does OSX do a setclientid call at mount time?

> 2) RHEL4 has no similar problems mounting an NFS4-based share from a RHEL5 NFS
> server.

Which would seem to imply that RHEL4 is generally sending valid SETCLIENTID calls and that the filer is occasionally rejecting them.
Bikash - any comments for your side?
I *think* the customer has opened a case with Netapp already and the case number is 2855584. Please correct this if that's not right.
Created attachment 297489 [details] binary capture of failed RHEL4 mount attempt against NetApp filer Binary capture of mount attempt against NetApp filer.
From all the errors that we have, it looks like we may have fixed this problem in DATA ONTAP 7.3. But I have asked the case owner to escalate this to Engineering to get confirmation of this behavior.
Yes, Mac OS X does send a SETCLIENTID call at mount time, but it doesn't fill in r_netid or r_addr (which is apparently how Solaris operates as well, according to the Red Hat support engineer). And yes, 2855584 is the Netapp case number.
Ok. Since it sounds like this is really a Netapp bug, I'm going to set this to NEEDINFO. Please update the case after you've upgraded the filer and retested.
I think it's still Bikash who needs to provide the info, since there's nothing for me to upgrade to or test at this point (and there may not be for a while if this really is a Netapp bug and it's really fixed in ONTAP 7.3, which isn't available yet).
John: would you be opposed to us closing this bug? You can re-open it if it isn't fixed in ONTAP 7.3 GA and is indeed a confirmed RHEL issue.
Bikash had speculated that this may be a Netapp bug that may be fixed in ONTAP 7.3, but it was only speculation--there's been no confirmation yet (either here or on my Netapp support case). So until Bikash or someone else from Netapp confirms that Netapp considers this a bug in ONTAP, I'd prefer to keep the case open.
John - That's fine, but we are running out of runway for kernel fixes for 4.7 Beta. Bikash - Any testing/verification of this soon would be great to know for sure.
John - I would push or escalate your NetApp case if you can, to find out for sure, given the tight time left in RHEL 4.7. I don't know what the ONTAP Beta process is like, but ideally it would be great if you could give some sort of ONTAP 7.3 Beta a shot. We really need to know for sure whether this is fixed in ONTAP before we proceed. Hope you understand our situation.
DATA ONTAP 7.3RC1 is currently available on now.netapp.com for download. I would suggest testing with that release to see whether the behavior changes.
Andrius: I've been pushing Netapp very hard, with no luck. I'd assume Redhat has better channels than I do. If/when Netapp takes ownership of the bug, you can feel free to close this bug (as far as I'm concerned). Bikash: I don't have anywhere I can test this--the filers I've been testing on thus far are all production filers. Maybe Redhat does? Regardless, it would seem to me that Netapp should be able to say definitively at this point whether or not this is a bug in ONTAP.
John - Yikes, I feel your pain. I know we *just* received some spanking new filers courtesy of NetApp (thanks!), but they were loaded with some pre-release version of DATA ONTAP 7.3 already. I'm not sure that running the reproducer on them would accurately portray a success/failure, due to differences from your environment--I'm no expert here. On top of that, our filers unfortunately require service calls due to a bad drive, as well as "monitor.shelf.fault:CRITICAL" nastiness that requires replacing the modules that go behind the shelf. I'm not sure when we'll be back up and running. Bikash - do you have any ideas on how we can say for sure whether this may be fixed without disturbing their production environment?
I have this sneaking suspicion that Netapp has plenty of Netapp filers. So they could do the testing themselves. Bikash? I'll ping the owner of my Netapp support case as well and ask him to test it.
John - OK, keep us posted if you find out anything about this issue - our boxes are used specifically for engineering activities (testing RHEL 5.2 and 4.7 currently), rather than support activities. Leaving assigned to Bikash to see if any headway can be made on the NetApp support case on their end.
John, I would appreciate it if you could call support and ask for the case to be escalated to Engineering. I have also done so at my end to expedite progress on this case. Someone should be in touch with you later today. Either way, please call the support line and talk to someone for an update or an escalation. Bikash
John, I am Sam Li, an escalation engineer with NetApp support. I have taken ownership of the case and will get things moving. Sincere apologies on behalf of NetApp support.
No problem, Sam--the urgency is (apparently) on Red Hat's side, not mine. We're not even using NFSv4 at the moment; I just happened to discover this issue while testing Kerberos with NFSv4 a while back, and thought I should report it.
Hi - the only urgency on our part is to either identify a root cause on the Red Hat side (since this was reported in our bug reporting system) or close if it isn't. Please keep us posted, thanks!
FYI: Sam Li of Netapp informed me that based on testing he's performed, the bug still exists between RHEL4 and ONTAP 7.3RC1. So the speculation that it's fixed in ONTAP 7.3 is apparently incorrect, and the root cause still isn't clear.
A root cause still has not been determined - deferring to RHEL 4.8.
Netapp has opened a bug on their side for this (bug number 290200), and they say that the fix is targeted for ONTAP 7.3.1. That would appear to make it their issue, not Red Hat's, in which case this bug could be closed.
John, this is great to hear. Last I heard we thought it wasn't something with ONTAP. Please do not hesitate to reopen this issue if needed.
Andrius, as John mentioned, the fix is targeted for ONTAP 7.3.1. The problem only manifests on Linux clients and does not affect Solaris clients.