Bug 201211 - Stale NFS file handle errors occur on automounted directories
Summary: Stale NFS file handle errors occur on automounted directories
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 9
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard: bzcl34nup
Depends On:
Blocks:
 
Reported: 2006-08-03 16:18 UTC by Michael Young
Modified: 2008-10-30 15:54 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-10-30 15:54:28 UTC
Type: ---
Embargoed:


Attachments
Packet capture when the problem occurs (224.33 KB, application/octet-stream)
2007-01-04 11:37 UTC, Michael Young
Shell script that (mostly) triggers the problem (209 bytes, application/x-shellscript)
2007-01-05 16:53 UTC, Michael Young
Capture of test script session demonstrating bug (22.44 KB, application/octet-stream)
2007-01-09 12:11 UTC, Michael Young
test script capture run on linux server and client (no failure) (16.96 KB, application/octet-stream)
2007-01-10 17:38 UTC, Michael Young

Description Michael Young 2006-08-03 16:18:21 UTC
Description of problem:
We have an automount-managed directory at /usr/local in which we mount various
subdirectories, and after maybe a day or so we see stale NFS file handle errors on
one such mount, for example:
ls: /usr/local/uvscan: Stale NFS file handle
If the filesystem is unmounted by hand, it can be automounted again (a sketch of
this is shown after the map entries below). There are errors in /var/log/messages
such as
Aug  3 17:08:14 mailrelay5 kernel: NFS: server stevens error: fileid changed
Aug  3 17:08:14 mailrelay5 kernel: fsid 0:21: expected fileid 0x12dd7a3, got 0xa147e8
The automount setup is that we have an auto.master NIS map including the line
/usr/local /etc/auto.usr.local  -ro,intr,noquota
and an auto.usr.local map including the line
uvscan -rw,intr,noquota stevens:/vol/vol0/unix/apps/&/$ARCH
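
For reference, a minimal sketch of the by-hand recovery mentioned above (the uvscan path is the one from this report; the actual mounted path also depends on the $ARCH key):

umount /usr/local/uvscan    # clear the stale handle by unmounting the automounted filesystem
ls /usr/local/uvscan        # the next access re-triggers the automount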

Version-Release number of selected component (if applicable):
kernel-2.6.17-1.2145
autofs-4.1.4-29_FC5

How reproducible:
This has occurred on several machines.

Comment 1 Dave Jones 2006-10-16 18:51:20 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 2 Michael Young 2006-10-22 08:42:22 UTC
This bug still occurs with the 2.6.18-1.2200.fc5 kernel.

Comment 3 Steve Dickson 2006-12-22 16:29:26 UTC
Ian,

Any ideas?

Comment 5 Ian Kent 2006-12-28 03:02:54 UTC
(In reply to comment #0)
> Description of problem:
> We have an automount managed directory at /usr/local in which we mount various
> subdirectories and after maybe a day or so see NFS file errors on one such
> mount, such as
> ls: /usr/local/uvscan: Stale NFS file handle
> If the filesystem is unmounted by hand, it can be automounted again. There are
> errors in the /var/log/messages file such as
> Aug  3 17:08:14 mailrelay5 kernel: NFS: server stevens error: fileid changed
> Aug  3 17:08:14 mailrelay5 kernel: fsid 0:21: expected fileid 0x12dd7a3, got
> 0xa147e8

Are you absolutely sure the original file hasn't been replaced
when you see these errors?

Ian


Comment 6 Michael Young 2006-12-28 15:07:43 UTC
The inode doesn't correspond to a file but to an automounted directory. The
directory itself doesn't change, though some of the files in it are updated once
or twice a day. There is a second automounted directory (/usr/local/lib) which
shows this problem less often, and the content of that directory is basically
static. Also, the inode associated with the first automount has remained the same
since the original report of this bug back in August.
The bug still occurs with kernel 2.6.18-1.2257.fc5

Comment 7 Ian Kent 2006-12-28 15:58:30 UTC
(In reply to comment #6)
> The inode doesn't correspond to a file but an automounted directory. The
> directory itself doesn't change, though some of the files in it are updated once
> or twice a day. There is a second automounted directory (/usr/local/lib) which
> shows this problem less often, and the content of that directory is basically
> static. Also the inode associated to the first automount has remained the same
> since the original report of the bug back in August.
> The bug still occurs with kernel 2.6.18-1.2257.fc5

That's not normally the way it works.

The only way for the NFS VFS methods to be called is if
there is an NFS filesystem mounted atop of the autofs
directory (in this case anyway). When such a mount is
present the VFS skips over the autofs dentry when
performing path resolution. So autofs never knows
anything about it.

Also, the calls that could result in this message (at least
it appears so, but I only looked briefly) come from a VFS
revalidate operation on an entry in the NFS filesystem,
which basically means that the file or directory within the
NFS filesystem exists, or at least the client thinks it
still exists.

Finally, we don't know which file within the NFS-mounted
filesystem has triggered this message, and we can't assume
that it is the mount point directory within the NFS filesystem.
It could be any file within the filesystem that has been
replaced. At least that is usually what causes staleness.

But as you say this doesn't happen within the filesystems
in question so I'm not sure what's going on.

Sorry.

Ian

Comment 8 Ian Kent 2006-12-28 16:17:07 UTC
(In reply to comment #6)
> The inode doesn't correspond to a file but an automounted directory. The
> directory itself doesn't change, though some of the files in it are updated once
> or twice a day. There is a second automounted directory (/usr/local/lib) which
> shows this problem less often, and the content of that directory is basically
> static. Also the inode associated to the first automount has remained the same
> since the original report of the bug back in August.
> The bug still occurs with kernel 2.6.18-1.2257.fc5

I'd be interested to know how the files within the directory are
updated (possibly ones contained within the mountpoint directory
itself)?

For this to happen a client would have to have the file open during
the update (or at least before a replacement) and the file would
have to be replaced, for example by processing into a temp file and
then moving the new file over the top of the old one.

But I doubt I'm saying anything you don't already know.

Ian


Comment 9 Michael Young 2006-12-28 17:53:07 UTC
The uvscan directory contains a command-line virus scanner that our mail
machines use to check in-transit mail. We install updated virus data files using
tar (on Solaris) in the uvscan directory, and chmod the file permissions back to
a sensible setting. As this is a virus scanner, it is possible the dat files are
in use when the update happens. The problem is that I have tried to spot a
correlation between the updates and the NFS problems, and have never been
convinced they match.

Comment 10 Ian Kent 2006-12-29 00:42:58 UTC
(In reply to comment #9)
> The uvscan directory contains a command-line virus scanner which our mail
> machines check in-transit mail with. We install updated virus data files using
> tar (on solaris) in the uvscan directory, and chmod the file permissions back to
> a sensible setting. As this is a virus scanner, it is possible the dat files are
> in use when the update happens. The problem is that I have tried to spot a
> correlation between the updates and the NFS problems, and have never been
> convinced they match.

I presume that the server is Solaris-based and that you tar
directly into the uvscan directory. I don't know what
tar does with respect to extracting files.

It would be interesting to see if updating the files
through an NFS mount (not a bind or lofs mount) rather
than directly made a difference.

Ian


Comment 11 Michael Young 2007-01-02 10:02:16 UTC
Actually, the NFS server is a NetApp box, so all updates to the files are
through NFS. It just so happens that the updates are done from a Solaris box due
to the way the updates have evolved.

Comment 12 Ian Kent 2007-01-02 12:09:43 UTC
(In reply to comment #11)
> Actually, the NFS server is a NetApps box, so all updates to the files are
> through NFS. It just so happens that the updates are done from a solaris box due
> to the way the updates have evolved.

Ha .. time to look a bit harder then.

I must admit I'm stumped as well.

Ian


Comment 13 Steve Dickson 2007-01-02 14:23:14 UTC
Question: if you reboot the NetApp, does the problem go away? Also,
what ONTAP version is the filer running?

Comment 14 Michael Young 2007-01-02 14:57:20 UTC
The ONTAP version is 6.5.5. Rebooting is not very practical as a LOT of things
use that filer.

Comment 15 Ian Kent 2007-01-02 16:07:04 UTC
(In reply to comment #14)
> The ONTAP version is 6.5.5. Rebooting is not very practical as a LOT of things
> use that filer.

I don't expect that a packet dump will show anything either
but it would be good if you could post one just in case we
are missing something. If possible including the file open
till the error (I do understand this would be difficult so
lets just see what we can get).

Ian
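
A capture of the sort requested here could be taken on the client with something like the following (a sketch only; the interface name and output file are placeholders, and the filter assumes NFS traffic to the filer over port 2049):

tcpdump -i eth0 -s 0 -w /tmp/nfs-stale.pcap host stevens and port 2049
# leave this running while reproducing the error, then stop it with Ctrl-C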


Comment 16 Steve Dickson 2007-01-03 03:15:38 UTC
Yes, I understand that a reboot is not a practical solution...
I was just trying to find a scenario that would help
isolate the problem.

I do agree with Ian that having a packet trace could
help, assuming it's not too large...

Comment 17 Michael Young 2007-01-04 11:37:40 UTC
Created attachment 144787
Packet capture when the problem occurs

This is an extract from a packet capture when the problem occurred (the full
packet capture is 53 MB; the problem occurs about 3-4 seconds into the
extract). The relevant entries in /var/log/messages at the time are:
Jan  4 10:03:55 mailrelay4 automount[21925]: failed to mount /usr/local/f-prot
Jan  4 10:03:55 mailrelay4 automount[21933]: failed to mount /usr/local/fsav
Jan  4 10:03:55 mailrelay4 automount[21946]: failed to mount /usr/local/inoculan
Jan  4 10:03:55 mailrelay4 automount[21951]: failed to mount /usr/local/av
Jan  4 10:03:55 mailrelay4 kernel: NFS: server stevens error: fileid changed
Jan  4 10:03:55 mailrelay4 kernel: fsid 0:1c: expected fileid 0x12dd7a3, got 0x1343f01
Jan  4 10:03:55 mailrelay4 automount[21988]: failed to mount /usr/local/nod32
Jan  4 10:03:55 mailrelay4 automount[22005]: failed to mount /usr/local/rav8
Jan  4 10:03:55 mailrelay4 automount[22012]: failed to mount /usr/local/Sophos
Jan  4 10:03:55 mailrelay4 automount[22013]: failed to mount /usr/local/Sophos
Jan  4 10:03:56 mailrelay4 automount[22026]: failed to mount /usr/local/vexira
(What is going on is that MailScanner is looking for various possible virus scanners,
only a few of which actually exist on our system.)

Comment 18 Ian Kent 2007-01-05 02:36:24 UTC
(In reply to comment #16)
> Yes, I understand that reboot is not practical solution...

But I think that this is something that will need to be
done at some point.

Do you have a scheduled maintenance window at some time
where you could do a reboot?

Ian

Comment 19 Ian Kent 2007-01-05 03:18:03 UTC
(In reply to comment #17)
> Created an attachment (id=144787)
> Packet capture when the problem occurs

I've had a look at the packet capture and it doesn't reveal
much we don't already know but ...

Packet number 424 is an NFS MKDIR call and 425 is the reply.
Upon return from this call the NFS client checks the attributes
for the directory within which the create was requested to
see if it has changed during the operation. This weak cache
consistency check is done regardless of the return status of
the operation (as per normal NFS implementations). Unfortunately
the attributes returned don't match the directory (in this case
also the mount point) and the NFS client claims the file handle
is stale based on this post operation check. So it looks to me
like the Linux NFS client is doing what it is supposed to do.

Would it be possible to ask NetApp support if there are any known
issues like this?

Any other thoughts anyone?

Ian
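
One way to pull the MKDIR call/reply pair, and the post-op attributes described above, out of such a capture is with tshark. This is a sketch only: the capture filename is a placeholder, and the nfs.procedure_v3 display-filter field name is an assumption about the Wireshark NFS dissector.

# NFSv3 procedure 9 is MKDIR (8 is CREATE); -V prints the decoded reply,
# including the wcc_data pre/post-op attributes used for the weak cache consistency check.
tshark -r capture.pcap -Y 'nfs.procedure_v3 == 9' -V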


Comment 20 Michael Young 2007-01-05 16:53:32 UTC
Created attachment 144913
Shell script that (mostly) triggers the problem

To confirm the problem, I have written a simple shell script that (most times)
triggers the problem for this particular NFS mount, and sometimes others, and
it does indeed seem to be the mkdir attempt that trips up the filer.
(I have also run this against an FC6 machine with kernel 2.6.18-1.2869.fc6xen
and unsurprisingly the problem is reproducible there also.)
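
The 209-byte script itself is not reproduced on this page. A rough sketch of what such a trigger might look like, based on the description in this report (the uvscan path is from the report; the test directory name and exact sequence are assumptions):

#!/bin/sh
ls /usr/local/uvscan > /dev/null    # trigger the automount of the read-only NFS mount
mkdir /usr/local/uvscan/testdir     # expected to fail (read-only export); on the affected
                                    # filer the failed MKDIR reply can carry wrong attributes
ls /usr/local/uvscan                # after a bad reply: "Stale NFS file handle"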

Comment 21 Ian Kent 2007-01-05 17:15:10 UTC
(In reply to comment #20)
> Created an attachment (id=144913)
> Shell script that (mostly) triggers the problem
> 
> To confirm the problem, I have written a simple shell script that (most times)
> triggers the problem for this particular NFS mount, and sometimes others, and
> it does indeed seem to be the mkdir attempt that trips up the filer.
> (I have also run this against an FC6 machine with kernel 2.6.18-1.2869.fc6xen
> and unsurprisingly the problem is reproducible there also).

Yes. That is what the packet log shows.

As I say, I'm no expert, but as far as I can see the NFS MKDIR
procedure call from the client is formed correctly, and other
procedure calls surrounding the MKDIR, such as LOOKUP and
ACCESS, return the expected attributes. Even calls following
the MKDIR have the expected attributes, so it really looks
like the reply to the MKDIR call is incorrectly formed by
the server.

It would be interesting, and perhaps informative, if a simple
test could be carried out against different NFS servers,
such as Solaris or Linux, to confirm whether or not the client
is at fault. At this point it looks like it's not.

Ian


Comment 22 Michael Young 2007-01-05 17:29:20 UTC
I have had a quick go at getting our ONTAP 7.0.4 NetApp box to fail without
success. I will probably do some further testing after the weekend.

Comment 23 Michael Young 2007-01-08 17:46:49 UTC
I haven't managed to reproduce this problem with other hardware. I have tried
Solaris and FC3 NFS clients against the problem NetApp box, and also with a
linux NFS server. 

Comment 24 Ian Kent 2007-01-09 02:33:20 UTC
(In reply to comment #23)
> I haven't managed to reproduce this problem with other hardware. I have tried
> Solaris and FC3 NFS clients against the problem NetApp box, and also with a
> linux NFS server. 

Could we have a packet capture of a successful test please?

Comment 25 Michael Young 2007-01-09 12:11:50 UTC
Created attachment 145153
Capture of test script session demonstrating bug

Here is a capture of the shell script reproducing the problem. I have also
discovered that CREATE calls can trigger the bug as well as MKDIR calls.

Comment 26 Ian Kent 2007-01-10 03:13:44 UTC
(In reply to comment #23)
> I haven't managed to reproduce this problem with other hardware. I have tried
> Solaris and FC3 NFS clients against the problem NetApp box, and also with a
> linux NFS server. 

So there is some evidence to indicate that this may be a
bug on the FC NFS client side?

If so, then a capture of a test that doesn't fail would
be useful for comparison. The packet that is returned
from the server is incorrect, which implies that the
request may be incorrectly formed.

Ian

Comment 27 Ian Kent 2007-01-10 06:26:00 UTC
(In reply to comment #20)
> Created an attachment (id=144913)
> Shell script that (mostly) triggers the problem
> 
> To confirm the problem, I have written a simple shell script that (most times)
> triggers the problem for this particular NFS mount, and sometimes others, and
> it does indeed seem to be the mkdir attempt that trips up the filer.
> (I have also run this against an FC6 machine with kernel 2.6.18-1.2869.fc6xen
> and unsurprisingly the problem is reproducible there also).

I'm starting to lose track of what we're chasing.
Just to confirm, my impression is that once this MKDIR (or
other call) fails, future attempts to access the mounted filesystem
result in a "Stale NFS file handle" message. Is that correct?

Ian

Comment 28 Michael Young 2007-01-10 17:38:20 UTC
Created attachment 145269
test script capture run on linux server and client (no failure)

The state of the problem is:
A Linux client is mounting a read-only shared filesystem from a NetApp server.
If an attempt is made to create a file or directory in this mount (NFS calls
MKDIR or CREATE), the error packet returned contains information for an
unrelated directory (or file?), and the Linux client sees this as an error. If
this happens in the top level of the mounted filesystem, Linux marks the file
handle as stale, causing some operations on it to fail until it is unmounted and
remounted.

This problem has been demonstrated on FC5 and FC6 against NetApp ONTAP 6.5.5
and 7.0.5 (my earlier testing against 7.0.4 was bogus as it wasn't against a
read-only shared filesystem). The problem is most likely a NetApp issue.

The problem hasn't occurred in testing against a Linux server, or with earlier
Linux client versions (including FC3).

I have, however, now discovered that a Solaris 10 client can also have problems,
and can get the same replies from a MKDIR call. It doesn't mark the filesystem
as stale, but listing the filesystem directory (ls -dl) after such a reply can
give bogus values.

I am not sure what other capture you wanted, but I have attached a Linux client
and server capture.

Comment 29 Michael Young 2007-01-10 17:42:23 UTC
A slight amendment to my previous post: I meant to say that

If an attempt is made to create a file or directory in this mount (NFS calls
MKDIR or CREATE), the error packet returned CAN contain information for an
unrelated directory (or file?)

Comment 30 Ian Kent 2007-01-11 01:51:45 UTC
(In reply to comment #29)
> A slight amendment to my previous post: I meant to say that
> 
> If an attempt is made to create a file or directory in this mount (NFS calls
> MKDIR or CREATE), the error packet returned CAN contain information for an
> unrelated directory (or file?)

Thanks for the summary.
Just one thing that I might not have made clear.
The reply to the MKDIR and CREATE RPC calls should contain the
attributes of the "directory" in which the mkdir or create was
requested, for the Weak Cache Consistency checking by the NFS
client, regardless of whether the call fails (but it seems that
several servers don't do this quite right).

I'm having some trouble verifying how everything fits together;
for example, I can see where the inode cache data is marked as
invalid during the call, but I can't verify how that causes the
following accesses to fail. I also had a look at 2.6.9 and
it looks like this doesn't happen for that kernel. Even more
interesting is that after a quick look at the RHEL5 kernel this
morning, it appears that it might not happen in that kernel either,
so maybe there's a patch around I'm not aware of.

Any ideas on this Steve?

Anyway I'll keep looking. 

Ian


Comment 31 Ian Kent 2007-01-11 01:56:01 UTC
(In reply to comment #28)
> Created an attachment (id=145269)
> test script capture run on linux server and client (no failure)
...
> I am not sure what other capture you wanted, but I have attached a linux client
> and server capture.

Yep. That's what I was after, thanks.

Ian



Comment 32 Michael Young 2007-01-11 17:28:16 UTC
The situation with the FC3 2.6.11 kernel (and thus probably 2.6.9 also) is
that it doesn't ever do a MKDIR in this situation, possibly because it has
already done an ACCESS call on the parent directory and found it only has READ
and LOOKUP permissions, so it doesn't try the MKDIR call, which it expects would fail.

With regard to RHEL5, I tried running an FC6 box with the
kernel-xen-2.6.18-1.2747.el5 kernel, and that does indeed show the problem.

Comment 33 Michael Young 2007-01-11 17:37:57 UTC
> I'm having some trouble verifying how everything fits together,
> for example, I can see where the inode cache data is marked as
> invalid during the call but I can't verify how that causes the
> following accesses to fail.

Note that the stale marking doesn't cause all accesses to fail; if the system
already knows about a file it can still access it. What does fail is anything
requiring the directory to be listed, such as opening a fresh file.
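
To illustrate that distinction (a sketch; the data file name is hypothetical):

cat /usr/local/uvscan/scan.dat > /dev/null   # a file the client already knows about: still readable
ls /usr/local/uvscan                         # requires listing the stale directory: "Stale NFS file handle"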

Comment 34 Ian Kent 2007-01-12 02:58:56 UTC
(In reply to comment #32)
> With regard to RHEL5, I tried running an FC6 box with the
> kernel-xen-2.6.18-1.2747.el5 kernel, and that does indeed show the problem.

Yep.
After looking again it's the same code that breaks in our
other cases. I don't know how I thought it was OK.

Ian

Comment 35 Bug Zapper 2008-04-04 03:26:04 UTC
Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change the bug to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 36 Michael Young 2008-04-04 08:35:36 UTC
This is still an issue, and I have just reproduced it with Fedora 9 beta against
ONTAP 7.2.2. It does seem likely that the problem is at the NetApp end though.

Comment 37 Chuck Ebbert 2008-04-27 03:24:07 UTC
(In reply to comment #36)
> This is still an issue, and I have just reproduced it with Fedora 9 beta against
> ONTAP 7.2.2. It does seem likely that the problem is at the NetApp end though.

I'm pretty sure that is not the latest release of ONTAP.


Comment 38 Michael Young 2008-04-28 13:41:54 UTC
No, I believe 7.2.4 is the latest version, but as the filers are used by a lot
of people all the time, getting them updated is a big deal. I might have a
go at reproducing it on a NetApp simulator with 7.2.4, but this might not work
if the problem is related to the load or other setups that are on the servers.

Comment 39 Bug Zapper 2008-05-14 02:16:23 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 40 Michael Young 2008-06-06 14:30:07 UTC
This might now be fixed. We have just updated to ONTAP 7.2.5 and have been
unable to reproduce the problem in early tests. I will leave this bug open for
another couple of weeks to see if the problem does reoccur, but it does look
promising.

Comment 41 Michael Young 2008-10-30 15:54:28 UTC
We haven't seen this problem recently, so I am closing the call.

