Bug 11040 - Remote NFS server being down causes high load average
Summary: Remote NFS server being down causes high load average
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 6.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Lawrence
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2000-04-25 18:26 UTC by iq4s-stu
Modified: 2007-04-18 16:26 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2000-05-08 14:18:31 UTC
Embargoed:



Description iq4s-stu 2000-04-25 18:26:06 UTC
If a (remote) NFS server (mounted with the default options) goes down,
this can cause the (local) load average to get very high, which can cause
problems such as sendmail refusing connections. This happens as follows:

- If a process is blocked waiting for a read from a file (e.g. a file on a
remote NFS server), it is put in state "D" (disc wait), corresponding to
TASK_UNINTERRUPTIBLE in include/linux/sched.h.

- When the load average is calculated in kernel/sched.c, the function
count_active_tasks() counts processes in the state TASK_UNINTERRUPTIBLE
as "running", so they contribute to the calculated load average.

- If the remote NFS server is down, the process in question will remain in
the TASK_UNINTERRUPTIBLE state indefinitely, and thus the load average
will rise.
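
The three points above can be modelled in user space. The following is only
a sketch, not the kernel code: the constants FSHIFT and EXP_1 are given as I
recall them from include/linux/sched.h and should be treated as assumptions,
and the one-letter task states and the simulate() helper are invented for
illustration.

```python
# Simplified user-space model of the 1-minute load average.
# Assumption: FSHIFT and EXP_1 match the kernel's fixed-point constants.
FSHIFT = 11                # bits of fixed-point precision
FIXED_1 = 1 << FSHIFT      # 1.0 in fixed point
EXP_1 = 1884               # approx. FIXED_1 / exp(5 sec / 1 min)

RUNNING = "R"              # models TASK_RUNNING
UNINTERRUPTIBLE = "D"      # models TASK_UNINTERRUPTIBLE
SLEEPING = "S"             # models TASK_INTERRUPTIBLE


def count_active_tasks(states):
    """Count tasks the way the report describes: "D" counts like "R"."""
    return sum(1 for s in states if s in (RUNNING, UNINTERRUPTIBLE))


def calc_load(load, active):
    """One 5-second update of the exponentially decayed average."""
    load = load * EXP_1 + active * FIXED_1 * (FIXED_1 - EXP_1)
    return load >> FSHIFT


def simulate(states, samples):
    """Hypothetical helper: apply `samples` updates to a fixed task list."""
    load = 0
    for _ in range(samples):
        load = calc_load(load, count_active_tasks(states))
    return load / FIXED_1


if __name__ == "__main__":
    # Twenty idle processes plus three stuck on a dead NFS server.
    tasks = [SLEEPING] * 20 + [UNINTERRUPTIBLE] * 3
    print(simulate(tasks, 180))   # 180 samples = 15 minutes
```

In this model, three processes stuck in state "D" push the reported load
towards 3 even though nothing is using the CPU, which matches what is seen
on the affected machine.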

In particular, the "slocate" program run by default from "cron" in the
standard RHL setup seems to attempt to contact the NFS server and hang
(this appears at first sight to be a bug in "slocate" since it is
configured not to scan NFS mounts - I'll investigate this further later);
thus the load average rises by at least 1 every day. This does not cause
too many problems at first, because the system is in fact not loaded (it
is just the _reported_ load that is high), so nothing slows down.
Eventually, however, this causes sendmail to refuse connections, blocking
incoming email.
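
To check which processes are stuck like this, one option is to read the
state field from /proc/<pid>/stat. This is a rough sketch assuming the usual
"pid (comm) state ..." layout of that file; the function names are made up
for this example.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the first "(" and the last ")".
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state "D"."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Each entry it prints adds 1 to the load average for as long as it stays
blocked.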


I don't know much about kernel hacking, so I'm not sure what the best
fix for this would be. The easiest way to fix the symptom would be not to
include TASK_UNINTERRUPTIBLE in the processes counted by
count_active_tasks(); however, I don't know what other consequences this
would have.

Comment 1 Alan Cox 2000-08-08 21:04:31 UTC
This is the way unix load averages get defined... strange but true.


Comment 2 patrick 2001-04-03 16:44:15 UTC
There is still a bug here: when any mounted NFS server goes 
down, sendmail stops responding to connections because of 
the apparently high load average. This needs to be fixed - by 
changing the load average, or by making sure "slocate" etc. 
don't get hung up on stale mounts, or by using a different 
method to calculate the system load in sendmail - but whatever
the best solution, it's still a bug :-)
