Description of problem:
We have an automount-managed directory at /usr/local in which we mount various subdirectories, and after maybe a day or so we see NFS file errors on one such mount, such as

ls: /usr/local/uvscan: Stale NFS file handle

If the filesystem is unmounted by hand, it can be automounted again. There are errors in the /var/log/messages file such as

Aug  3 17:08:14 mailrelay5 kernel: NFS: server stevens error: fileid changed
Aug  3 17:08:14 mailrelay5 kernel: fsid 0:21: expected fileid 0x12dd7a3, got 0xa147e8

The automount setup is that we have an auto.master NIS map including the line

/usr/local /etc/auto.usr.local -ro,intr,noquota

and an auto.usr.local map including the line

uvscan -rw,intr,noquota stevens:/vol/vol0/unix/apps/&/$ARCH

Version-Release number of selected component (if applicable):
kernel-2.6.17-1.2145
autofs-4.1.4-29_FC5

How reproducible:
This has occurred on several machines.
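In case it is useful, the symptom and the manual recovery look like this (a minimal sketch; the paths are the ones from the description above):

  # Once the bug has hit, any access through the automount fails:
  ls /usr/local/uvscan        # => ls: /usr/local/uvscan: Stale NFS file handle
  # Unmounting by hand clears it; the next access automounts cleanly:
  umount /usr/local/uvscan
  ls /usr/local/uvscan        # works again until the problem recurs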
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
This bug still occurs with the 2.6.18-1.2200.fc5 kernel.
Ian, any ideas?
(In reply to comment #0)
> Description of problem:
> We have an automount-managed directory at /usr/local in which we mount various
> subdirectories, and after maybe a day or so we see NFS file errors on one such
> mount, such as
> ls: /usr/local/uvscan: Stale NFS file handle
> If the filesystem is unmounted by hand, it can be automounted again. There are
> errors in the /var/log/messages file such as
> Aug  3 17:08:14 mailrelay5 kernel: NFS: server stevens error: fileid changed
> Aug  3 17:08:14 mailrelay5 kernel: fsid 0:21: expected fileid 0x12dd7a3, got
> 0xa147e8

Are you absolutely sure the original file hasn't been replaced when you see these errors?

Ian
The inode doesn't correspond to a file but to an automounted directory. The directory itself doesn't change, though some of the files in it are updated once or twice a day. There is a second automounted directory (/usr/local/lib) which shows this problem less often, and the content of that directory is basically static. Also, the inode associated with the first automount has remained the same since the original report of the bug back in August.

The bug still occurs with kernel 2.6.18-1.2257.fc5.
(In reply to comment #6)
> The inode doesn't correspond to a file but to an automounted directory. The
> directory itself doesn't change, though some of the files in it are updated
> once or twice a day. There is a second automounted directory (/usr/local/lib)
> which shows this problem less often, and the content of that directory is
> basically static. Also, the inode associated with the first automount has
> remained the same since the original report of the bug back in August.
> The bug still occurs with kernel 2.6.18-1.2257.fc5.

That's not normally the way it works. The only way for the NFS VFS methods to be called is if there is an NFS filesystem mounted atop the autofs directory (in this case anyway). When such a mount is present the VFS skips over the autofs dentry when performing path resolution, so autofs never knows anything about it.

Also it appears that the calls that could result in this message (at least it appears so, but I only looked briefly) come from a VFS revalidate operation on an entry in the NFS filesystem, which basically means that the file or directory within the NFS filesystem exists, or at least the client thinks it still exists.

Finally, we don't know which file within the NFS-mounted filesystem has triggered this message, and we can't assume that it is the mount point directory within the NFS filesystem. It could be any file within the filesystem that has been replaced; at least that is usually what causes staleness. But as you say this doesn't happen within the filesystems in question, so I'm not sure what's going on. Sorry.

Ian
(In reply to comment #6)
> The inode doesn't correspond to a file but to an automounted directory. The
> directory itself doesn't change, though some of the files in it are updated
> once or twice a day. There is a second automounted directory (/usr/local/lib)
> which shows this problem less often, and the content of that directory is
> basically static. Also, the inode associated with the first automount has
> remained the same since the original report of the bug back in August.
> The bug still occurs with kernel 2.6.18-1.2257.fc5.

I'd be interested to know how the files within the directory are updated (possibly ones contained within the mountpoint directory itself)? For this to happen a client would have to have the file open during the update (or at least before a replacement) and the file would have to be replaced, such as by processing into a temp file and then moving the new file over the top of the old one. But I doubt I'm saying anything you don't already know.

Ian
The uvscan directory contains a command-line virus scanner with which our mail machines check in-transit mail. We install updated virus data files using tar (on Solaris) in the uvscan directory, and chmod the file permissions back to a sensible setting. As this is a virus scanner, it is possible the dat files are in use when the update happens. The problem is that I have tried to spot a correlation between the updates and the NFS problems, and have never been convinced they match.
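For reference, the update procedure amounts to something like this (a rough sketch; the archive and file names are illustrative, not the actual ones we use):

  # Run on the Solaris box, into the rw automount of the uvscan directory.
  cd /usr/local/uvscan
  tar xf /tmp/dat-update.tar    # extract the new virus data files over the old ones
  chmod 644 *.dat               # put the file permissions back to a sensible setting
  # Note: depending on the tar implementation, extraction may truncate and
  # rewrite an existing file in place, or unlink and recreate it; the latter
  # would change the file's inode/fileid on the server.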
(In reply to comment #9)
> The uvscan directory contains a command-line virus scanner with which our mail
> machines check in-transit mail. We install updated virus data files using
> tar (on Solaris) in the uvscan directory, and chmod the file permissions back
> to a sensible setting. As this is a virus scanner, it is possible the dat
> files are in use when the update happens. The problem is that I have tried to
> spot a correlation between the updates and the NFS problems, and have never
> been convinced they match.

I presume that the server is Solaris based and you tar directly into the uvscan directory. I don't know what tar does with respect to extracting files. It would be interesting to see if updating the files through an NFS mount (not a bind or lofs mount) rather than directly made a difference.

Ian
Actually, the NFS server is a NetApp box, so all updates to the files are through NFS. It just so happens that the updates are done from a Solaris box due to the way the updates have evolved.
(In reply to comment #11)
> Actually, the NFS server is a NetApp box, so all updates to the files are
> through NFS. It just so happens that the updates are done from a Solaris box
> due to the way the updates have evolved.

Ha... time to look a bit harder then. I must admit I'm stumped as well.

Ian
Question: if you reboot the NetApp, does the problem go away? Also, what ONTAP version is the filer running?
The ONTAP version is 6.5.5. Rebooting is not very practical as a LOT of things use that filer.
(In reply to comment #14)
> The ONTAP version is 6.5.5. Rebooting is not very practical as a LOT of things
> use that filer.

I don't expect that a packet dump will show anything either, but it would be good if you could post one just in case we are missing something, if possible including everything from the file open up to the error (I do understand this would be difficult, so let's just see what we can get).

Ian
Yes, I understand that a reboot is not a practical solution... I was just trying to find a scenario that would help isolate the problem. I do agree with Ian that having a packet trace could help, assuming it's not too large...
Created attachment 144787 [details]
Packet capture when the problem occurs

This is an extract from a packet capture when the problem occurred (the full packet capture is 53 MB; the problem occurs about 3-4 seconds into the extract). The relevant entries in /var/log/messages at the time are:

Jan  4 10:03:55 mailrelay4 automount[21925]: failed to mount /usr/local/f-prot
Jan  4 10:03:55 mailrelay4 automount[21933]: failed to mount /usr/local/fsav
Jan  4 10:03:55 mailrelay4 automount[21946]: failed to mount /usr/local/inoculan
Jan  4 10:03:55 mailrelay4 automount[21951]: failed to mount /usr/local/av
Jan  4 10:03:55 mailrelay4 kernel: NFS: server stevens error: fileid changed
Jan  4 10:03:55 mailrelay4 kernel: fsid 0:1c: expected fileid 0x12dd7a3, got 0x1343f01
Jan  4 10:03:55 mailrelay4 automount[21988]: failed to mount /usr/local/nod32
Jan  4 10:03:55 mailrelay4 automount[22005]: failed to mount /usr/local/rav8
Jan  4 10:03:55 mailrelay4 automount[22012]: failed to mount /usr/local/Sophos
Jan  4 10:03:55 mailrelay4 automount[22013]: failed to mount /usr/local/Sophos
Jan  4 10:03:56 mailrelay4 automount[22026]: failed to mount /usr/local/vexira

(What is going on is that MailScanner is looking for various possible virus scanners, only a few of which actually exist on our system.)
(In reply to comment #16)
> Yes, I understand that a reboot is not a practical solution...

But I think that this is something that will need to be done at some point. Do you have a scheduled maintenance window at some time where you could do a reboot?

Ian
(In reply to comment #17)
> Created an attachment (id=144787) [edit]
> Packet capture when the problem occurs

I've had a look at the packet capture and it doesn't reveal much we don't already know, but...

Packet number 424 is an NFS MKDIR call and 425 is the reply. Upon return from this call the NFS client checks the attributes of the directory within which the create was requested, to see if it has changed during the operation. This weak cache consistency check is done regardless of the return status of the operation (as per normal NFS implementations). Unfortunately the attributes returned don't match the directory (in this case also the mount point), and the NFS client claims the file handle is stale based on this post-operation check.

So it looks to me like the Linux NFS client is doing what it is supposed to do. Would it be possible to ask NetApp support if there are any known issues like this?

Any other thoughts anyone?

Ian
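P.S. For anyone else digging through the capture, something along these lines pulls out the MKDIR traffic and the fileids carried in the attributes. This is only a sketch: it assumes a reasonably recent tshark, and the display-filter field names are taken from Wireshark's NFS dissector, so they may differ between versions.

  # Show the NFSv3 MKDIR calls and replies (procedure 9) from the capture
  tshark -r capture.pcap -Y 'nfs.procedure_v3 == 9'
  # Dump the fileids in each matching frame, so the "expected X, got Y"
  # values from the kernel message can be checked against the wire data
  tshark -r capture.pcap -Y 'nfs.procedure_v3 == 9' \
         -T fields -e frame.number -e nfs.fattr3.fileid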
Created attachment 144913 [details]
Shell script that (mostly) triggers the problem

To confirm the problem, I have written a simple shell script that (most times) triggers the problem for this particular NFS mount, and sometimes others, and it does indeed seem to be the mkdir attempt that trips up the filer. (I have also run this against an FC6 machine with kernel 2.6.18-1.2869.fc6xen and unsurprisingly the problem is reproducible there also.)
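The core of it is no more than something along these lines (a sketch of the approach, not the attached script itself; the directory name is illustrative):

  # Attempt to create a directory in the automounted filesystem, then check
  # whether the mount point has gone stale. The mkdir is expected to fail;
  # it is the server's reply to the failed MKDIR that trips up the client.
  cd /usr/local/uvscan || exit 1
  mkdir ZZtest 2>/dev/null
  ls -dl /usr/local/uvscan    # reports "Stale NFS file handle" when the bug hits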
(In reply to comment #20)
> Created an attachment (id=144913) [edit]
> Shell script that (mostly) triggers the problem
>
> To confirm the problem, I have written a simple shell script that (most times)
> triggers the problem for this particular NFS mount, and sometimes others, and
> it does indeed seem to be the mkdir attempt that trips up the filer.
> (I have also run this against an FC6 machine with kernel 2.6.18-1.2869.fc6xen
> and unsurprisingly the problem is reproducible there also.)

Yes, that is what the packet log shows. As I say, I'm no expert, but as far as I can see the NFS MKDIR procedure call in the client is formed correctly, and other procedure calls surrounding the MKDIR, such as LOOKUP and ACCESS, return the expected attributes. Even calls following the MKDIR have the expected attributes, so it really looks like the reply to the MKDIR call is incorrectly formed by the server.

It would be interesting and perhaps informative if a simple test could be carried out against different NFS servers, such as Solaris or Linux, to confirm whether or not the client is at fault. At this point it looks like it's not.

Ian
I have had a quick go at getting our ONTAP 7.0.4 NetApp box to fail, without success. I will probably do some further testing after the weekend.
I haven't managed to reproduce this problem with other hardware. I have tried Solaris and FC3 NFS clients against the problem NetApp box, and also with a Linux NFS server.
(In reply to comment #23)
> I haven't managed to reproduce this problem with other hardware. I have tried
> Solaris and FC3 NFS clients against the problem NetApp box, and also with a
> Linux NFS server.

Could we have a packet capture of a successful test please?
Created attachment 145153 [details]
Capture of test script session demonstrating bug

Here is a capture of the shell script reproducing the problem. I have also discovered that CREATE calls can trigger the bug as well as MKDIR calls.
(In reply to comment #23)
> I haven't managed to reproduce this problem with other hardware. I have tried
> Solaris and FC3 NFS clients against the problem NetApp box, and also with a
> Linux NFS server.

So there is some evidence to indicate that this may be a bug on the FC NFS client side? If so, then a capture of a test that doesn't fail would be useful for comparison. The packet that is returned from the server is incorrect, which implies that the request may be incorrectly formed.

Ian
(In reply to comment #20)
> Created an attachment (id=144913) [edit]
> Shell script that (mostly) triggers the problem
>
> To confirm the problem, I have written a simple shell script that (most times)
> triggers the problem for this particular NFS mount, and sometimes others, and
> it does indeed seem to be the mkdir attempt that trips up the filer.
> (I have also run this against an FC6 machine with kernel 2.6.18-1.2869.fc6xen
> and unsurprisingly the problem is reproducible there also.)

I'm starting to lose track of what we're chasing. Just to confirm, my impression is that once this MKDIR (or other call) fails, future attempts to access the mounted filesystem result in a "Stale NFS file handle" message. Is that correct?

Ian
Created attachment 145269 [details]
test script capture run on linux server and client (no failure)

The state of the problem is:

A Linux client is mounting a read-only shared filesystem from a NetApp server. If an attempt is made to create a file or directory in this mount (NFS calls MKDIR or CREATE), the error packet returned contains information for an unrelated directory (or file?), and the Linux client sees this as an error. If this is in the top level of the mounted filesystem, Linux marks the file as stale, causing some operations on it to fail until it is unmounted and remounted.

This problem has been demonstrated on FC5 and FC6 against NetApp ONTAP 6.5.5 and 7.0.5 (my earlier testing against 7.0.4 was bogus as it wasn't against a read-only shared filesystem). The problem is most likely a NetApp issue. The problem hasn't occurred in testing against a Linux server, or with earlier Linux client versions (including FC3).

I have however now discovered that a Solaris 10 client can also have problems, and can get the same replies from a MKDIR call. It doesn't mark the filesystem as stale, but listing the filesystem directory (ls -dl) after such a reply can give bogus values.

I am not sure what other capture you wanted, but I have attached a Linux client and server capture.
A slight amendment to my previous post: I meant to say that

If an attempt is made to create a file or directory in this mount (NFS calls MKDIR or CREATE), the error packet returned CAN contain information for an unrelated directory (or file?).
(In reply to comment #29)
> A slight amendment to my previous post: I meant to say that
>
> If an attempt is made to create a file or directory in this mount (NFS calls
> MKDIR or CREATE), the error packet returned CAN contain information for an
> unrelated directory (or file?).

Thanks for the summary. Just one thing that I might not have made clear: the reply to the MKDIR and CREATE RPC calls should contain the attributes of the directory in which the mkdir or create is requested, for the weak cache consistency checking by the NFS client, regardless of whether the call fails (but it seems that several servers don't do this quite right).

I'm having some trouble verifying how everything fits together. For example, I can see where the inode cache data is marked as invalid during the call, but I can't verify how that causes the following accesses to fail.

I also had a look at 2.6.9 and it looks like this doesn't happen for that kernel. Even more interesting is that after a quick look at the RHEL5 kernel this morning it appears that it might not happen in that kernel either, so maybe there's a patch around I'm not aware of. Any ideas on this, Steve?

Anyway, I'll keep looking.

Ian
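P.S. If it helps to watch the client side while the script runs, the NFS debug flags can be switched on for the duration. A sketch, assuming rpcdebug from nfs-utils is available (the output goes to the kernel log and can be very verbose):

  rpcdebug -m nfs -s all      # enable all NFS client debug flags
  # ... run the reproduction script here ...
  dmesg | tail -n 100         # the revalidate/invalidate activity shows up here
  rpcdebug -m nfs -c all      # clear the flags again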
(In reply to comment #28)
> Created an attachment (id=145269) [edit]
> test script capture run on linux server and client (no failure)
...
> I am not sure what other capture you wanted, but I have attached a Linux client
> and server capture.

Yep, that's what I was after, thanks.

Ian
The situation with the FC3 2.6.11 kernel (and thus probably 2.6.9 also) is that it doesn't ever do a MKDIR in this situation, possibly because it has already done an ACCESS call on the parent directory and found it only has READ and LOOKUP permissions, so it doesn't try the MKDIR call, which it expects would fail.

With regard to RHEL5, I tried running an FC6 box with the kernel-xen-2.6.18-1.2747.el5 kernel, and that does indeed show the problem.
> I'm having some trouble verifying how everything fits together. For example,
> I can see where the inode cache data is marked as invalid during the call,
> but I can't verify how that causes the following accesses to fail.

Note that the stale marking doesn't cause all accesses to fail: if the system already knows about a file, it can still access it. What does fail is anything requiring the directory to be listed, such as opening a fresh file.
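One concrete instance of that distinction, in case it's useful (the file name is illustrative, not one of our actual dat files):

  # Open a file in the automount before the bug hits, keeping the fd around:
  exec 3</usr/local/uvscan/uvscan.dat
  # ... trigger the staleness with the test script ...
  cat <&3 >/dev/null && echo "already-open file still readable"
  ls /usr/local/uvscan    # fails: this needs a fresh lookup of the directory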
(In reply to comment #32)
> With regard to RHEL5, I tried running an FC6 box with the
> kernel-xen-2.6.18-1.2747.el5 kernel, and that does indeed show the problem.

Yep. After looking again, it's the same code that breaks as in our other cases. I don't know how I thought it was OK.

Ian
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again. And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers
This is still an issue, and I have just reproduced it with Fedora 9 beta against ONTAP 7.2.2. It does seem likely that the problem is at the NetApp end though.
(In reply to comment #36)
> This is still an issue, and I have just reproduced it with Fedora 9 beta
> against ONTAP 7.2.2. It does seem likely that the problem is at the NetApp
> end though.

I'm pretty sure that is not the latest release of ONTAP.
No, I believe 7.2.4 is the latest version, but as the filers are used by a lot of people all the time, getting them updated is a big deal. I might have a go at reproducing it on a NetApp simulator with 7.2.4, but this might not work if the problem is related to the load or other setups that are on the servers.
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
This might now be fixed. We have just updated to ONTAP 7.2.5 and have been unable to reproduce the problem in early tests. I will leave this bug open for another couple of weeks to see if the problem does reoccur, but it does look promising.
We haven't seen this problem recently, so I am closing the call.