Red Hat Bugzilla – Bug 139570
NFS Client Hang
Last modified: 2007-11-30 17:07:05 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3)
Description of problem:
I am seeing infrequent NFS client hangs. A process accessing a file
over an NFS mounted partition will suddenly freeze and the process
will be rendered un-killable (typically in the "D" state).
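For anyone trying to confirm the same symptom, here is a minimal sketch that lists processes currently in the "D" state by reading /proc/<pid>/stat. The function names are made up for illustration, and it assumes a Linux /proc layout where the state letter follows the parenthesised command name.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state

def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process whose state letter is 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "net"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was malformed
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling find_uninterruptible() on the client during a hang should show the frozen process; a process that stays in the list across repeated calls is a good candidate for a stack trace.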
The server in this case is running an identical version of RedHat
Linux (Enterprise Release 3). There are no messages in
/var/log/messages when this problem occurs.
The client machine is a Dell Precision WorkStation 670 with a Xeon
NIS is used to look up the mount point. The partition is mounted with
server:/some/dir on /some/dir type nfs
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Server is under a high load
2. Try to access a file
Actual Results: Process accessing the file freezes and can't be
killed (e.g. "kill -9 ..."). Any subsequent attempts to access files
from the same server also fail.
Expected Results: Process should be able to access files over NFS
without the process hanging.
We have several other Linux boxes (RedHat 7.2, SuSE) that are not
having this problem.
This bug is a show stopper for us.
I've had very similar problems to this for quite a while, going back
to RedHat 7.3. The kernel NFS guys haven't been able to come up with
I'm running a technical computing cluster and occasionally a process
will get stuck writing a file on an NFS mount. On that machine, any
attempt to do an ls on the mount point results in a hung process.
However, if I mount the NFS share to another directory, on the same
machine, I can see the NFS export properly. The original hung NFS
mount never comes back.
I do see "NFS server not responding, still trying" messages on the
client, but it never returns to "NFS server ok".
I recently switched to a 100 Mbit ethernet adapter on one client, and
this problem happened for the two jobs we attempted to run. Going
back to a 1 Gbit adapter allowed the jobs to run.
What is the best way to diagnose this problem?
Afraid I can't offer any advice on diagnostics, but I have a little
more information on the problem we've been having.
In our situation we have a server that's running RedHat Enterprise 3
with kernel-2.4.21-20.EL, and we're pretty sure it's a server side
What's unusual is that we have some 30 machines running some version
of RedHat 7.x (maybe 7.1, but I'm not sure). These machines are
older and they never have an NFS problem with this server.
We can hang NFS pretty consistently when we run with a client using
the same kernel/OS as the server or using the latest Debian testing
release (a 32bit build running with the 2.4.x kernel). We've managed
to hang NFS with Debian on 2 somewhat different architectures.
We're unsure whether the problem persists with the 2.6.8 kernel. We
have one machine running with 2.6.8 that hasn't seen the problem yet.
I unfortunately haven't been able to build 2.6.8 due to a newer
adaptec scsi driver conflict.
Probably the best way to debug this would be to
post an Alt-SysRq-t stack trace. The easiest way to
do this is to do "echo t > /proc/sysrq-trigger" when
the process is hung...
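If it helps to script this, the following is a rough sketch of the same "echo t" step with a check that the trigger file exists first. The function name is hypothetical, and it assumes the trigger file is simply absent when the kernel was built without SysRq support. It has to run as root.

```python
import os

def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log every task's stack; return False if unsupported."""
    # Assumption: a kernel built without SysRq support does not expose
    # the trigger file at all, so a missing path means "not available".
    if not os.path.exists(trigger_path):
        return False
    with open(trigger_path, "w") as fh:
        fh.write("t")  # same effect as: echo t > /proc/sysrq-trigger
    return True
```

The dump itself goes to the kernel log, so it should show up in dmesg or /var/log/messages rather than on stdout.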
Arrrgh... looks like the kernel I'm currently running wasn't
configured with SysRQ enabled (I just managed to get the thing wedged
again, but I can't get the sysrq dump).
I'll post back as soon as I can get some data.
I'm having the same thing. Here is a dump of one of the processes
stuck in D.
Dec 2 18:35:44 system kernel: processname D 00000003 3860 2756
Dec 2 18:35:44 system kernel: Call Trace: [<f8a5a922>]
__xprt_lock_write_next [sunrpc] 0x92 (0xc9af1d2c)
Dec 2 18:35:44 system kernel: [<c0123274>] schedule [kernel] 0x2f4
Dec 2 18:35:44 system kernel: [<f8a5e3d7>] __rpc_execute [sunrpc]
Dec 2 18:35:44 system kernel: [<c022481b>] memcpy_toiovec [kernel]
Dec 2 18:35:44 system kernel: [<f8a5959d>]
rpc_call_sync_Rsmp_c357b490 [sunrpc] 0xbd (0xc9af1dc4)
Dec 2 18:35:44 system kernel: [<f8a5bd90>] xprt_timer [sunrpc] 0x0
Dec 2 18:35:44 system kernel: [<f8a59e60>] call_status [sunrpc] 0x0
Dec 2 18:35:44 system kernel: [<f8a5d6e0>] rpc_run_timer [sunrpc]
Dec 2 18:35:44 system kernel: [<f8a90a74>] nfs3_rpc_wrapper [nfs]
Dec 2 18:35:44 system kernel: [<f8a90bb3>] nfs3_proc_getattr [nfs]
Dec 2 18:35:44 system kernel: [<f8a890d3>] __nfs_revalidate_inode
[nfs] 0x113 (0xc9af1ed0)
Dec 2 18:35:44 system kernel: [<c011f5ac>] do_page_fault [kernel]
Dec 2 18:35:44 system kernel: [<c010bdd4>] do_signal [kernel] 0x64
Dec 2 18:35:44 system kernel: [<f8a86b2a>] nfs_file_write [nfs] 0x5a
Dec 2 18:35:44 system kernel: [<c01608f7>] sys_write [kernel] 0x97
root]# uname -a
Linux system 2.4.21-15.0.2.ELsmp #1 SMP Fri Jun 18 23:13:20 EDT 2004
i686 i686 i386 GNU/Linux
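A trace like the one above is easier to compare across machines once it is reduced to (address, symbol, module) tuples. This is only a sketch: the regular expression is inferred from the lines in this comment, not from any documented log format, and the function names are made up.

```python
import re
from collections import Counter

# Matches entries such as "[<c0123274>] schedule [kernel]". \s+ also spans
# newlines, so entries that were wrapped across two lines still match.
FRAME_RE = re.compile(r"\[<([0-9a-fA-F]+)>\]\s+(\S+)\s+\[([^\]\s]+)\]")

def parse_call_trace(text):
    """Return a list of (address, symbol, module) tuples found in text."""
    return FRAME_RE.findall(text)

def modules_in_trace(text):
    """Count how many frames belong to each module (sunrpc, nfs, kernel...)."""
    return Counter(module for _addr, _sym, module in parse_call_trace(text))
```

Feeding it the whole syslog excerpt gives a quick view of whether a stuck task is sitting mostly in sunrpc, nfs, or elsewhere.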
Could you tar/gzip the entire backtrace and post
that? It would be good to see what the other processes
are doing as well....
With respect to this process, it appears to have timed out
waiting for an ack from the server. During these
hangs, is there any traffic going over the wire?
ethereal or tethereal can be used to verify that.
Created attachment 107865 [details]
Full stack trace
Attaching for Dan Taylor
I'm going to try to generate a stack trace later today. I'm running
2.4.x kernel on Debian, but it will give you another data point (since
this issue seems to be distribution/hardware independent for us).
Some more bits:
A coworker (more knowledgeable than I am) hasn't seen a hang since he
upgraded (his client) to 2.6.8.
I always have a .nfsXXXXXX file corresponding to the last file
accessed by the task that's hung (D state or maybe S state).
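To collect those leftover files after a hang, something like the following sketch could be used. The helper name is made up and it simply matches any file whose name starts with ".nfs". Note that walking a mount that is currently hung will most likely block as well, so this is better run from a machine (or a second mount) that still responds.

```python
import os

def find_silly_renamed(root):
    """Return sorted paths of files under root whose names start with '.nfs'."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(".nfs"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```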
The hang always happens for me during an interactive compile session
(in an emacs shell with g++). I say "interactive" because I attempted
to replicate the hang in a bash script that did "make clean" "make
install" 50 times in a row without causing the NFS hang. After that I
went into Emacs and did a "make clean" "make install" in the same
directory and NFS hung right away. From what I can tell, my coworkers
were also compiling when NFS hung for them. I'm going to see if I can
get a consistent hang by having another process do "ls -lR" during the
"make clean"/"make install" loop.
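A rough sketch of that test, in case someone else wants to try it: one thread keeps running a directory walk while the main thread repeats the build commands. The function name and parameters are made up, and the commands are passed in rather than hard-coded, so nothing here is specific to make or ls.

```python
import subprocess
import threading

def run_build_loop(build_cmds, walker_cmd, iterations, cwd="."):
    """Repeat build_cmds while walker_cmd runs over and over in the background.

    Returns (completed_iterations, walker_runs). If NFS hangs, this call
    hangs too, which is the condition being reproduced.
    """
    stop = threading.Event()
    walker_runs = [0]

    def walker():
        # Keep re-running the walk (e.g. "ls -lR") until the builds finish.
        while not stop.is_set():
            subprocess.run(walker_cmd, cwd=cwd,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
            walker_runs[0] += 1

    thread = threading.Thread(target=walker, daemon=True)
    thread.start()
    completed = 0
    try:
        for _ in range(iterations):
            for cmd in build_cmds:
                subprocess.run(cmd, cwd=cwd, check=True)
            completed += 1
    finally:
        stop.set()
        thread.join()
    return completed, walker_runs[0]
```

For the case described here the call would look like run_build_loop([["make", "clean"], ["make", "install"]], ["ls", "-lR"], 50, cwd="/some/dir"), where the directory is whatever NFS-mounted build tree is being tested.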
There is a report on the NFS mailing list about a similar issue.
Here's a link to Jason Holmes's message (10/13/2004):
Our sysadmin e-mailed Jason Holmes about our issue and got this
response. We upgraded our server's BIOS/firmware but this didn't fix
our problem. We haven't had a chance to upgrade the server to 2.6.8
(which we're going to do soon):
Date: Thu, 18 Nov 2004 11:01:01 -0500
From: Jason Holmes <email@example.com>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
X-Accept-Language: en-us, en
To: Steve Huff <firstname.lastname@example.org>
Subject: Re: NFS hangs with linux 2.4 kernels
Right now I'm using 220.127.116.11 on the server and various 2.4s and 2.6s
(including RedHat kernels) on the clients. I don't have a hanging
problem anymore (at least I haven't had a hang in weeks). While things
were better moving to 18.104.22.168 on the server, I still did have a few
hangs. I think my problems went away finally when I did a BIOS upgrade
to the RAID controllers on the servers (it addressed a possible RAID
controller firmware hang with Matrox drives, of which I had some). My
guess at this point is that it was a combination of the two - RedHat
kernels on the server and the BIOS hang, but it could have just been the
BIOS since I haven't tried a RedHat kernel on the server since I did the
Steve Huff wrote:
>i'm a system administrator at MIT Lincoln Laboratory, and i'm seeing
>NFS problems to the ones you've been having. in my case we're seeing
>between RHEL3 clients and servers. they usually occur while users are
>building code in NFS-mounted directories.
>the last message on this topic i could find from you was a few weeks
>stated that you had just experienced your first hang between a 2.4 server
>and a 2.6 client. have you found a combination that works any
>particular, have you been able to try a 2.6 kernel on the server?
I have to agree, this appears to be a server problem since
all of the NFS processes in stacktrace.txt are waiting
for a server response. Would it be possible to get a
system trace of the server when the hang occurs?
It would be good to know what the nfsd threads are doing.
Created attachment 112489 [details]
U5 server patch that could solve this hang
It appears this may be a duplicate of bz138182, which has been fixed
by the nfs-silly-del-revert.patch in RHEL3 U5. Please upgrade (via RHN)
to kernel version 2.4.21-31.EL or apply the attached patch.
What happened to the trace from Dan Taylor? He claimed to see the hang on
2.4.21-15.0.2EL. We have been haunted massively by similar hangs of NFS servers.
But the 2.4.21-15.0.2EL kernel does not have the optimization that is being removed
in your attachment id=112489. This means Dan Taylor and I will probably still
experience NFS server lockups (all nfsd go into DW state) with the U5 kernel.
Can you comment on that, Steve?
Well, the stack trace that was posted in comment #7 shows a
number of client processes hung waiting for a response from
the server, so it's really not that useful if this in fact turns
out to be a server issue.
Now, if you could post a system trace of the server that
shows the nfsd threads in DW state, it would definitely shed some
light on whether this is or is not the same issue as in bz138182.
Created attachment 112931 [details]
sysreq during nfs server lockup. All nfsd are in DW state.
Although it's a bit difficult to tell with this type of system
trace, it appears that all of the nfsd threads are waiting to do a
setattr (i.e. trying to set some file attribute) except
for two of them, which are hung in mmfs.
Looking at those two processes, it appears one process
is hung in mmfs waiting for a locked inode and the other
seems to be waiting on some cluster I/O. Similarly, all of
the mmfsds are waiting for the same type of I/O.
So I would have to conclude that this hang is being caused
by the MMFS filesystem, not NFS.
I guess the next step would be to get IBM involved....
We were having severe NFS issues on our batch queue (of some 30+ Linux boxes)
recently. The NFS usage during this time was extremely high. The major fix for
us which resolved our issue (which I believe is the same one I originally
reported) was an update to the kernel we were running:
On "Red Hat Enterprise Linux WS release 3 (Taroon Update 4)"
(I don't think our RH7.3 boxes were updated, but they are currently running
2.4.20-28.7smp and don't appear to have issues).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** This bug has been marked as a duplicate of 138182 ***