I thought it a fluke once, but now this has occurred twice on our Linux server.
I export a couple of items from our server, namely two directories: /Users
and /var/spool/mail. Most of our clients are using the AMD automounter
(from am-utils) to mount these directories. After a time, the nfs clients
lose their mounts and report "stale NFS handle" errors. On the server,
the /var/log/messages logfile gets bombarded with entries of the form:
Nov xx xx:xx:xx server kernel: nfsd Security: /// bad export.
Only rebooting the server or stopping and restarting the nfs service
(via /etc/rc.d/init.d/nfs stop and /etc/rc.d/init.d/nfs start) restores
normal operation.
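To gauge how badly the log is being flooded before restarting, something like the following could be used. This is only a sketch: the function name count_bad_exports is made up for illustration, and the pattern assumes the message text matches the entry quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): count the
# "bad export" entries nfsd has written to a log file.
count_bad_exports() {
    # Default to the log file named in the report.
    grep -c 'nfsd Security: .*bad export' "${1:-/var/log/messages}"
}

# Example use, followed by the restart sequence from the report:
#   count_bad_exports /var/log/messages
#   /etc/rc.d/init.d/nfs stop
#   /etc/rc.d/init.d/nfs start
```

A rapidly growing count would suggest the server has reached the state described above.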
We were experiencing this problem for months. It frequently occurred
whenever the NFS server was experiencing high load (during backups,
for example). I couldn't find a solution anywhere, so I finally bit
the bullet and commented out the piece of kernel code which was
triggering the errors. We haven't had any NFS problems since.
Maybe the folks at RedHat have a less risky solution?
As per email@example.com's comments:
Can you give a few more details about your mention of "commenting the
piece of kernel code triggering the error"?
We may have a very similar problem. We export our home directories from a
Red Hat 6.1 system, kernel 2.2.12-20. After a period of usage, sometimes as
much as a day :-), the system becomes overrun with stale file handles. We are
using knfsd-1.4.7 and have recently tried the latest stable kernel (2.2.14).
None of this has improved the problem.
This causes us approximately one hour of downtime every two days and seems to
be related to load. We do not have an environment where people are heavily
sharing files, so we cannot understand why so many stale file handles exist.
We are having massive problems with this and if we can't find a work-around
soon we will have to shift all our home filespace back across to our slower
solaris server. I don't really want this extra work.
Our problems have almost completely gone away since we:
1. started using far fewer non-Linux clients (in our case, NeXTSTEP)
2. reconfigured NIS and /etc/nsswitch.conf to NOT use NIS for hostname lookups
3. upgraded to kernel-2.2.14-1.3.0 (it was once available at rawhide). I
wouldn't hesitate to say that an upgrade from 2.2.12-20 is absolutely
essential. I haven't upgraded further simply because we've had problem-free
uptimes of 1-2 months. (If it ain't broke...)
4. rpc.mountd DOES still occasionally die (once every ~2 weeks), preventing
any new mounts. I think this is related to hostname lookup problems (our
campus DNS servers crash semi-often). I wrote a little /etc/cron.hourly script
to check for rpc.mountd's existence, and to relaunch if necessary:
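------ /etc/cron.hourly/rpc.mountd -------- snip ------
#!/bin/sh
# Assumption: the pid lookup via pidof and the closing "fi" lines are filled in.
. /etc/rc.d/init.d/functions    # provides the daemon function
prog=rpc.mountd
#Only do check if nfs subsystem is activated
if [ -f /var/lock/subsys/nfs ]; then
    pid=`pidof $prog`
    if [ "$pid" = "" ]; then
        echo -n "$prog dead... restarting:"
        daemon /usr/sbin/rpc.mountd --no-nfs-version 3
    fi
fi
-------- /etc/cron.hourly/rpc.mountd ------- snip ------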
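Item 2 in the list above (keeping NIS out of hostname lookups) can be verified with a small check like the one below. This is a rough sketch: the function name hosts_uses_nis is hypothetical, and the sample hosts line in the comment is only one plausible configuration, not necessarily the one used on the poster's server.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the active "hosts:" line of an
# nsswitch.conf-style file lists nis as a lookup source.
hosts_uses_nis() {
    # Commented-out lines are skipped because they do not start with "hosts:".
    grep -E '^[[:space:]]*hosts:' "${1:-/etc/nsswitch.conf}" | grep -qw 'nis'
}

# A hosts line that skips NIS might look like:
#   hosts: files dns
```

Running `hosts_uses_nis && echo "NIS still used for hostnames"` on the server would flag a configuration that still consults NIS.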
assigned to johnsonm
Bug 7483 is closed because the problem seems to have been fixed with the major
changes in kernel, nfs and other utilities between 7.0 and 7.3. Our servers seem
to be similarly set up with 200 nfs clients and no stale handle problems.