I thought it was a fluke the first time, but this has now occurred twice on our Linux NFS server.
I export a couple of items from our server, namely two directories: /Users
and /var/spool/mail. Most of our clients use the AMD automounter
(from am-utils) to mount these directories. After a time, the NFS clients
lose their mounts and report "stale NFS handle" errors. On the server,
the /var/log/messages logfile gets bombarded with entries of the form:
Nov xx xx:xx:xx server kernel: nfsd Security: /// bad export.
Only rebooting the server or stopping and restarting the NFS service
(via /etc/rc.d/init.d/nfs stop and /etc/rc.d/init.d/nfs start) restores
normal operation.
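For reference, the export setup described above would correspond to an /etc/exports roughly like the following. This is only a sketch: the client pattern and options shown here are hypothetical, since the report does not include the actual exports file.

```
# /etc/exports -- illustrative only; hostnames and options are assumptions
/Users           *.example.edu(rw)
/var/spool/mail  *.example.edu(rw)
```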
We experienced this problem for months. It frequently occurred
whenever the NFS server was under high load (during backups,
for example). I couldn't find a solution anywhere, so I finally bit
the bullet and commented out the piece of kernel code that was
triggering the errors. We haven't had any NFS problems since.
Maybe the folks at RedHat have a less risky solution?
As per firstname.lastname@example.org's comments:
Can you give a few more details about your mention of "commenting out the
piece of kernel code triggering the error"?
We may have a very similar problem. We export our home directories from a
Red Hat 6.1 box with kernel 2.2.12-20. After a period of use, sometimes as much as a
day :-), the system becomes overrun with stale file handles. We are using
knfsd-1.4.7 and have recently tried the latest stable kernel (2.2.14). None of this
has improved the problem.
This causes us approximately one hour of downtime every two days and seems to be
related to load. We do not have an environment where people are heavily
sharing files, so we cannot understand why so many stale file handles exist.
We are having massive problems with this, and if we can't find a workaround
soon we will have to shift all our home filespace back to our slower
Solaris server. I don't really want this extra work.
Our problems have almost completely gone away since:
1. We started using far fewer non-Linux clients (in our case, NeXTSTEP).
2. We reconfigured NIS and /etc/nsswitch.conf to NOT use NIS for hostname lookups.
3. We upgraded to kernel-2.2.14-1.3.0 (it was once available at Rawhide). I
wouldn't hesitate to say that an upgrade from 2.2.12-20 is absolutely
essential. I haven't upgraded further simply because we've had problem-free
uptimes of 1-2 months. (If it ain't broke...)
4. rpc.mountd DOES still occasionally die (about once every two weeks), preventing
any new mounts. I think this is related to hostname lookup problems (our
campus DNS servers crash fairly often). I wrote a little /etc/cron.hourly script
to check for rpc.mountd's existence and relaunch it if necessary:
------ /etc/cron.hourly/rpc.mountd -------- snip ------
#!/bin/sh
. /etc/rc.d/init.d/functions   # provides the daemon() helper
prog=rpc.mountd
# Only do the check if the nfs subsystem is activated
if [ -f /var/lock/subsys/nfs ]; then
    pid=`pidof $prog`
    if [ "$pid" = "" ]; then   # rpc.mountd has died; relaunch it
        echo -n "$prog dead... restarting:"
        daemon /usr/sbin/rpc.mountd --no-nfs-version 3
    fi
fi
-------- /etc/cron.hourly/rpc.mountd ------- snip ------
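Regarding item 2 above, the relevant nsswitch.conf change amounts to dropping `nis` from the hosts line so hostname lookups go only to local files and DNS. The exact line is not given in the report, so this is a sketch of what such a configuration typically looks like; the ordering and other databases (passwd, group, etc.) are site-specific.

```
# /etc/nsswitch.conf (hosts line only) -- illustrative; keep NIS for
# passwd/group if you use it, but resolve hostnames without NIS
hosts:  files dns
```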
assigned to johnsonm
Bug 7483 is closed because the problem seems to have been fixed by the major
changes in the kernel, NFS, and other utilities between 7.0 and 7.3. Our servers seem
to be similarly set up, with 200 NFS clients and no stale handle problems.