Red Hat Bugzilla – Bug 54114
swapped-out process stuck in system-mode using 100% cpu
Last modified: 2007-04-18 12:37:24 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.76C-CERN UNIX gryphn 45 [en] (X11; U; Linux 2.4.2-2
Description of problem:
After running for some time under heavy nfs traffic, a process on one or
more of the nfs client nodes starts consuming 100% of one cpu without
appearing to make progress. Doing kill -9 <pid> as root has no effect.
The system otherwise continues to respond. Nothing unusual shows up in
/var/log/messages. A normal reboot proceeds as usual, and the
rebooted system is idle again.
While the process is stuck, /proc/loadavg reports 2.0 runnable
processes, and top shows 100% of one cpu devoted to the stuck job, all
of it system time. The process does not respond to signals. Top reports
it as RW (runnable and swapped), and its sizes (from top) are all zero,
as expected for a swapped-out process.
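
The symptoms above can also be checked without top by reading /proc
directly. The following is only a rough sketch, not part of the original
report: the function names are made up for illustration, the field
positions follow the /proc/<pid>/stat layout documented in proc(5), and
the "runnable with zero resident pages" test is an assumed heuristic
that may also match a kernel thread that happens to be running.

```python
import os

def parse_loadavg(text):
    """Return (load1, runnable, total) from the contents of /proc/loadavg."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return float(fields[0]), int(runnable), int(total)

def parse_stat(text):
    """Parse one /proc/<pid>/stat line into the fields relevant here."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace alone.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state) of proc(5)
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": rest[0],
        "utime": int(rest[11]),  # field 14, clock ticks in user mode
        "stime": int(rest[12]),  # field 15, clock ticks in system mode
        "vsize": int(rest[20]),  # field 23, virtual size in bytes
        "rss": int(rest[21]),    # field 24, resident pages
    }

def looks_stuck(info):
    """Assumed heuristic: runnable ('R') with no resident pages ('W' in top)."""
    return info["state"] == "R" and info["rss"] == 0

def scan(proc_root="/proc"):
    """Print the load summary and every process matching the heuristic."""
    with open(os.path.join(proc_root, "loadavg")) as f:
        print("loadavg:", parse_loadavg(f.read()))
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                info = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or line was not in the expected form
        if looks_stuck(info):
            print("suspect:", info)
```

Calling scan() on an affected node should list the stuck job together
with its accumulated stime, which can then be compared between two runs.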
Version-Release number of selected component (if applicable):
2.4.2-2smp kernel. Test node is running stock i686 kernel. Other nodes
run a diskless (IP-autoconfig + nfs-root) build and show same behavior.
I have not seen it on my Athlon-smp nodes so far.
Steps to Reproduce:
1. Run several i/o bound (nfs) jobs on the cluster, loading the net/server.
2. Log onto one of the client nodes and start something (e.g. a compiler).
3. Watch top to see the process jump to 100% system-mode usage.
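
Step 3 can be left unattended with a small watcher instead of top. This
is a hedged sketch under the same proc(5) field-layout assumption; the
function names, the 5-second interval and the 95% threshold are
illustrative choices, not values from the report.

```python
import os
import time

def cpu_ticks(pid, proc_root="/proc"):
    """Return (utime, stime) in clock ticks for one process."""
    with open(os.path.join(proc_root, str(pid), "stat")) as f:
        # Skip past the parenthesised command name before splitting.
        tail = f.read().rpartition(")")[2].split()
    return int(tail[11]), int(tail[12])

def cpu_share(before, after, interval_s, hz):
    """Convert two (utime, stime) samples into (user%, system%) of one cpu."""
    scale = 100.0 / (interval_s * hz)
    return (after[0] - before[0]) * scale, (after[1] - before[1]) * scale

def watch(pid, interval_s=5.0, threshold=95.0):
    """Poll until the process spends at least `threshold` percent in system mode."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    prev = cpu_ticks(pid)
    while True:
        time.sleep(interval_s)
        cur = cpu_ticks(pid)
        user, system = cpu_share(prev, cur, interval_s, hz)
        print("pid %d: user %.1f%% system %.1f%%" % (pid, user, system))
        if system >= threshold:
            return user, system
        prev = cur
```

Running watch(<pid>) on the job started in step 2 returns once the
system-mode share crosses the threshold; it raises an error if the
process exits first.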
Actual Results: At some point the job will get stuck. Sequential shutdown
(with /etc/rc.d/init.d/* stop) of everything (except network) fails to
unstick it. Reboot works.
Expected Results: The process should have finished normally, or at least
responded to kill -9 <pid> from the superuser.
This should have been fixed a long time ago by an errata kernel. If not,
please re-open.