If a (remote) NFS server (mounted with the default options) goes down,
this can cause the (local) load average to get very high, which can cause
problems such as sendmail refusing connections. This happens as follows:
- If a process is blocked waiting for a read from a file (e.g. a file on a
remote NFS server), it is put in state "D" (disc wait), corresponding to
TASK_UNINTERRUPTIBLE in include/linux/sched.h.
- When the load average is calculated in kernel/sched.c, in the function
count_active_tasks(), processes in the state TASK_UNINTERRUPTIBLE are
counted as "running", and thus contribute to the calculated load average
- If the remote NFS server is down, the process in question will remain in
the TASK_UNINTERRUPTIBLE state indefinitely, and thus the load average
stays elevated by 1 for each such hung process.
In particular, the "slocate" program run by default from "cron" in the
standard RHL setup seems to attempt to contact the NFS server and hang
(this appears at first sight to be a bug in "slocate" since it is
configured not to scan NFS mounts - I'll investigate this further later);
thus the load average rises by at least 1 every day. This does not cause
too many problems at first, because the system is in fact not loaded (it
is just the _reported_ load that is high), so nothing slows down.
Eventually, however, this causes sendmail to refuse connections, blocking
incoming mail.
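For reference, sendmail's behaviour here is governed by its load-average
cutoffs; in sendmail.cf they look like the following (the values shown are
illustrative, not taken from this report):

```
# queue (rather than deliver) mail when the load average exceeds this
O QueueLA=8
# refuse incoming SMTP connections when the load average exceeds this
O RefuseLA=12
```

With enough hung D-state processes accumulating, the reported load
eventually crosses RefuseLA and sendmail starts rejecting connections.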
I don't know too much about kernel hacking, so I'm not sure what the best
fix for this would be - the easiest way to fix the symptom would be not to
count processes in TASK_UNINTERRUPTIBLE in count_active_tasks(); however,
I don't know what other consequences this would have.
This is the way Unix load averages get defined... strange but true.
There is still a bug here: when any mounted NFS server goes
down, sendmail stops responding to connections because of
the apparently high load average. This needs to be fixed - by
changing how the load average is calculated, by making sure
"slocate" etc. don't get hung up on stale mounts, or by having
sendmail use a different method to gauge system load - but
whatever the best solution, it's still a bug :-)