Bug 139570 - NFS Client Hang
Summary: NFS Client Hang
Keywords:
Status: CLOSED DUPLICATE of bug 138182
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-11-16 19:11 UTC by Fred Richardson
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-08 00:04:52 UTC
Target Upstream Version:
Embargoed:


Attachments
Full stack trace (6.97 KB, application/x-compressed)
2004-12-03 20:57 UTC, Fred Kern
U5 server patch that could solve this hang (2.99 KB, patch)
2005-03-30 21:39 UTC, Steve Dickson
stack trace (242.51 KB, text/plain)
2005-04-10 20:37 UTC, Jos VanWezel

Description Fred Richardson 2004-11-16 19:11:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3)
Gecko/20040924

Description of problem:
I am seeing infrequent NFS client hangs.  A process accessing a file
over an NFS-mounted partition will suddenly freeze and become
unkillable (typically in the "D" state).
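For readers who want to confirm which processes are stuck this way, here is a minimal Python sketch that lists tasks in the "D" (uninterruptible sleep) state by reading /proc/<pid>/stat. The function names and the proc_root parameter are illustrative assumptions and were not part of the original report.

```python
import os

def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST closing parenthesis.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    command = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, command, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "net"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the entry could not be parsed
        if state == "D":
            hung.append((pid, command))
    return hung

if __name__ == "__main__":
    for pid, command in uninterruptible_tasks():
        print(pid, command)
```

A process that shows up here on every run, rather than briefly, is a candidate for the hang described above.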

The server in this case is running an identical version of RedHat
Linux (Enterprise Release 3).  There are no messages in
/var/log/messages when this problem occurs.

The client machine is a Dell Precision WorkStation 670 with a Xeon
processor (64bit).

NIS is used to look up the mount point.  The partition is mounted with
these options:

server:/some/dir on /some/dir type nfs
(rw,hard,intr,timeo=100,retrans=5,rsize=8192,wsize=8192,timeo=20,tcp,addr=xxx.yyy.zzz.qqq)
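Note that the option list above contains timeo twice, with different values (100 and 20). A small Python sketch like the following can flag such repeats in an option string. The helper name is hypothetical, and the sketch only reports the repetition; it makes no claim about which value the kernel actually applies.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into (options, duplicates)."""
    options = {}
    duplicates = {}
    for item in option_string.strip("()").split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        if key in options:
            # Remember every value seen for a repeated key, in order.
            duplicates.setdefault(key, [options[key]]).append(value)
        # The dict keeps the last value seen; that is a property of this
        # sketch, not a statement about mount(8) or the NFS client.
        options[key] = value
    return options, duplicates

reported = ("(rw,hard,intr,timeo=100,retrans=5,rsize=8192,wsize=8192,"
            "timeo=20,tcp,addr=xxx.yyy.zzz.qqq)")
options, duplicates = parse_mount_options(reported)
print(duplicates)  # {'timeo': ['100', '20']}
```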


Version-Release number of selected component (if applicable):
kernel-2.4.21-20.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Server is under a high load
2. Try to access a file
    

Actual Results:  Process accessing the file freezes and can't be
killed (e.g. "kill -9 ...").  Any subsequent attempts to access files
from the same server also fail.

Expected Results:  Process should be able to access files over NFS
without the process hanging.

Additional info:

We have several other Linux boxes (RedHat 7.2, SuSE) that are not
having this problem.

This bug is a show stopper for us.

Comment 1 Joshua Weage 2004-11-24 17:13:59 UTC
I've had very similar problems to this for quite a while, going back
to RedHat 7.3.  The kernel NFS guys haven't been able to come up with
a solution.

I'm running a technical computing cluster and occasionally a process
will get stuck writing a file on an NFS mount.  On that machine, any
attempt to do an ls on the mount point results in a hung process. 
However, if I mount the NFS share to another directory, on the same
machine, I can see the NFS export properly.  The original hung NFS
mount never comes back.

I do see "NFS server not responding, still trying" messages on the
client, but it never gets back to "NFS server ok".

I recently switched to an 100 Mbit ethernet adapter on one client, and
this problem happened for the two jobs we attempted to run.  Going
back to a 1 Gbit adapter allowed the jobs to run.

What is the best way to diagnose this problem?


Comment 2 Fred Richardson 2004-11-24 23:14:00 UTC
Afraid I can't offer any advice on diagnostics, but I have a little 
more information on the problem we've been having.

In our situation we have a server that's running RedHat Enterprise 3 
with kernel-2.4.21-20.EL, and we're pretty sure it's a server side 
problem now.

What's unusual is that we have some 30 machines running some version 
of RedHat 7.x (maybe 7.1, but I'm not sure).  These machines are 
older and they never have an NFS problem with this server.

We can hang NFS pretty consistently when we run with a client using 
the same kernel/OS as the server or using the latest Debian testing 
release (a 32bit build running with the 2.4.x kernel).  We've managed 
to hang NFS with Debian on 2 somewhat different architectures.

We're unsure whether the problem persists with the 2.6.8 kernel.  We 
have one machine running with 2.6.8 that hasn't seen the problem yet.

I unfortunately haven't been able to build 2.6.8 due to a newer 
adaptec scsi driver conflict.

Comment 3 Steve Dickson 2004-11-27 02:22:41 UTC
Probably the best way to debug this would be to
post an Alt-SysRq-t stack trace. The easiest way to
do this is to do "echo t > /proc/sysrq-trigger" when
the process is hung... 
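The same request can be scripted. The sketch below is a hedged Python equivalent of the echo command; the function name and the trigger_path parameter are assumptions added for illustration. It returns False when the trigger file is absent, which is what one would expect on a kernel built without SysRq support. Writing to the real file requires root, and the resulting task dump goes to the kernel log rather than to the caller.

```python
import os

def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Write 't' to the SysRq trigger file; return False if it does not exist."""
    if not os.path.exists(trigger_path):
        return False  # no trigger file, so no dump can be requested this way
    with open(trigger_path, "w") as handle:
        handle.write("t")  # 't' asks the kernel to dump the state of all tasks
    return True

if __name__ == "__main__":
    if request_task_dump():
        print("Task dump requested; check the kernel log.")
    else:
        print("SysRq trigger file not found.")
```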

Comment 4 Fred Richardson 2004-11-30 23:51:55 UTC
Arrrgh... looks like the kernel I'm currently running wasn't
configured with SysRQ enabled (I just managed to get the thing wedged
again, but I can't get the sysrq dump).

I'll post back as soon as I can get some data.

Comment 5 Dan Taylor 2004-12-03 01:01:12 UTC
I'm having the same thing.  Here is a dump of one of the processes 
stuck in D.  

Dec  2 18:35:44 system kernel: processname   D 00000003  3860  2756   2724                     (NOTLB)
Dec  2 18:35:44 system kernel: Call Trace:   [<f8a5a922>] __xprt_lock_write_next [sunrpc] 0x92 (0xc9af1d2c)
Dec  2 18:35:44 system kernel: [<c0123274>] schedule [kernel] 0x2f4 (0xc9af1d40)
Dec  2 18:35:44 system kernel: [<f8a5e3d7>] __rpc_execute [sunrpc] 0x1f7 (0xc9af1d84)
Dec  2 18:35:44 system kernel: [<c022481b>] memcpy_toiovec [kernel] 0x5b (0xc9af1d8c)
Dec  2 18:35:44 system kernel: [<f8a5959d>] rpc_call_sync_Rsmp_c357b490 [sunrpc] 0xbd (0xc9af1dc4)
Dec  2 18:35:44 system kernel: [<f8a5bd90>] xprt_timer [sunrpc] 0x0 (0xc9af1e1c)
Dec  2 18:35:44 system kernel: [<f8a59e60>] call_status [sunrpc] 0x0 (0xc9af1e24)
Dec  2 18:35:44 system kernel: [<f8a5d6e0>] rpc_run_timer [sunrpc] 0x0 (0xc9af1e44)
Dec  2 18:35:44 system kernel: [<f8a90a74>] nfs3_rpc_wrapper [nfs] 0x44 (0xc9af1e80)
Dec  2 18:35:44 system kernel: [<f8a90bb3>] nfs3_proc_getattr [nfs] 0x63 (0xc9af1ea8)
Dec  2 18:35:44 system kernel: [<f8a890d3>] __nfs_revalidate_inode [nfs] 0x113 (0xc9af1ed0)
Dec  2 18:35:44 system kernel: [<c011f5ac>] do_page_fault [kernel] 0x14c (0xc9af1ef4)
Dec  2 18:35:44 system kernel: [<c010bdd4>] do_signal [kernel] 0x64 (0xc9af1f20)
Dec  2 18:35:44 system kernel: [<f8a86b2a>] nfs_file_write [nfs] 0x5a (0xc9af1f68)
Dec  2 18:35:44 system kernel: [<c01608f7>] sys_write [kernel] 0x97 (0xc9af1f94)
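To compare traces like this one across machines, it can help to reduce each frame to its symbol and module. The following Python sketch does that with a regular expression. The pattern is an assumption derived only from the frame layout visible in this dump ("[<address>] symbol [module] offset"), and the function names are illustrative; other kernel versions may print frames differently.

```python
import re

# One frame looks like: [<f8a5e3d7>] __rpc_execute [sunrpc] 0x1f7 (0xc9af1d84)
# \s+ also matches newlines, so frames that were wrapped across lines still parse.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\S+)\s+\[(\w+)\]\s+(0x[0-9a-f]+)")

def parse_call_trace(log_text):
    """Return a list of (address, symbol, module, offset) tuples."""
    return FRAME_RE.findall(log_text)

def first_symbol_in(frames, module):
    """Return the first symbol in the trace that belongs to the given module."""
    for _address, symbol, frame_module, _offset in frames:
        if frame_module == module:
            return symbol
    return None

sample = """\
Dec  2 18:35:44 system kernel: Call Trace:   [<f8a5a922>] 
__xprt_lock_write_next [sunrpc] 0x92 (0xc9af1d2c)
Dec  2 18:35:44 system kernel: [<c0123274>] schedule [kernel] 0x2f4 
(0xc9af1d40)
Dec  2 18:35:44 system kernel: [<f8a90a74>] nfs3_rpc_wrapper [nfs] 
0x44 (0xc9af1e80)
"""
frames = parse_call_trace(sample)
print([symbol for _a, symbol, _m, _o in frames])
```

Grouping hung tasks by the first sunrpc or nfs symbol they share is one quick way to see whether they are all blocked at the same point.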


 root]# uname -a
Linux system 2.4.21-15.0.2.ELsmp #1 SMP Fri Jun 18 23:13:20 EDT 2004 i686 i686 i386 GNU/Linux


Comment 6 Steve Dickson 2004-12-03 11:38:33 UTC
Hey Dan,

Could you tar/gzip the entire backtrace and post
that? It would be good to see what the other processes
are doing as well.... 

wrt this process, it appears it has timed out 
waiting for an ack from the server.... During these
hangs, is there any traffic going over the wire? 
ethereal or tethereal can be used to verify that... 
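A packet capture is the direct way to answer that question. As a rough complement, and explicitly a different technique, the client's RPC counters can be sampled twice during a hang to see whether calls and retransmissions are still increasing. The Python sketch below assumes the "rpc <calls> <retransmits> <authrefresh>" line layout of /proc/net/rpc/nfs; the function names are hypothetical.

```python
def parse_rpc_counters(stats_text):
    """Return (calls, retransmits) from the 'rpc' line of /proc/net/rpc/nfs."""
    for line in stats_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'rpc' line found")

def describe_change(before_text, after_text):
    """Compare two snapshots of the statistics file taken some seconds apart."""
    calls_before, retrans_before = parse_rpc_counters(before_text)
    calls_after, retrans_after = parse_rpc_counters(after_text)
    return {
        "new_calls": calls_after - calls_before,
        "new_retransmits": retrans_after - retrans_before,
    }

if __name__ == "__main__":
    import time
    with open("/proc/net/rpc/nfs") as handle:
        first = handle.read()
    time.sleep(10)  # sampling interval is an arbitrary choice for this sketch
    with open("/proc/net/rpc/nfs") as handle:
        second = handle.read()
    print(describe_change(first, second))
```

Retransmits climbing while the hang persists would suggest the client is still sending and not getting answers; it does not show what is on the wire the way ethereal does.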

Comment 7 Fred Kern 2004-12-03 20:57:43 UTC
Created attachment 107865 [details]
Full stack trace

Attaching for Dan Taylor

Comment 8 Fred Richardson 2004-12-06 17:23:20 UTC
I'm going to try to generate a stack trace later today.  I'm running
2.4.x kernel on Debian, but it will give you another data point (since
this issue seems to be distribution/hardware independent for us).

Some more bits:

A coworker (more knowledgeable than I am) hasn't seen a hang since he
upgraded (his client) to 2.6.8.

I always have a .nfsXXXXXX file corresponding to the last file
accessed by the task that's hung (D state or maybe S state).
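These .nfsXXXXXX files are what the NFS client leaves behind when a file is removed while still open (the so-called "silly rename"). A minimal Python sketch for locating them under a directory follows; the function name is illustrative and the ".nfs*" pattern is an assumption based on the naming described above.

```python
import fnmatch
import os

def find_silly_renames(top):
    """Return sorted paths of files under 'top' whose names match .nfs*."""
    matches = []
    for directory, _subdirs, filenames in os.walk(top):
        for name in filenames:
            if fnmatch.fnmatch(name, ".nfs*"):
                matches.append(os.path.join(directory, name))
    return sorted(matches)

if __name__ == "__main__":
    import sys
    for path in find_silly_renames(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

Note that walking a mount that is already hung may itself block, so this is best run from a machine or mount point that still responds.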

The hang always happens for me during an interactive compile session
(in an emacs shell with g++).  I say "interactive" because I attempted
to replicate the hang in a bash script that did "make clean" "make
install" 50 times in a row without causing the NFS hang.  After that I
went into Emacs and did a "make clean" "make install" in the same
directory and NFS hung right away.  From what I can tell, my coworkers
were also compiling when NFS hung for them.  I'm going to see if I can
get a consistent hang by having another process do "ls -lR" during the
 "make clean"/"make install" loop.

There is a report on the NFS mailing list about a similar issue. 
Here's a link to Jason Holmes's message (10/13/2004):
http://www.dragoninc.on.ca/mail-archives/nfs/2004-10/0055.html

Our sysadmin e-mailed Jason Holmes about our issue and got this
response.  We upgraded our server's BIOS/firmware but this didn't fix
our problem.  We haven't had a chance to upgrade the server to 2.6.8
(which we're going to do soon):

Date: Thu, 18 Nov 2004 11:01:01 -0500
From: Jason Holmes <jholmes>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
X-Accept-Language: en-us, en
To: Steve Huff <shuff.edu>
Subject: Re: NFS hangs with linux 2.4 kernels
In-Reply-To: <20041118154806.GB19564.edu>

Steve,

Right now I'm using 2.6.8.1 on the server and various 2.4s and 2.6s 
(including RedHat kernels) on the clients.  I don't have a hanging 
problem anymore (at least I haven't had a hang in weeks).  While things 
were better moving to 2.6.8.1 on the server, I still did have a few 
hangs.  I think my problems went away finally when I did a BIOS upgrade 
to the RAID controllers on the servers (it addressed a possible RAID 
controller firmware hang with Matrox drives, of which I had some).  My 
guess at this point is that it was a combination of the two - RedHat 
kernels on the server and the BIOS hang, but it could have just been the 
BIOS since I haven't tried a RedHat kernel on the server since I did the 
BIOS upgrade.

Thanks,

--
Jason Holmes

Steve Huff wrote:
>hello jason!
>
>i'm a system administrator at MIT Lincoln Laboratory, and i'm seeing similar
>NFS problems to the ones you've been having.  in my case we're seeing hangs
>between RHEL3 clients and servers.  they usually occur while users are
>building code in NFS-mounted directories.
>
>the last message on this topic i could find from you was a few weeks ago; it
>stated that you had just experienced your first hang between a 2.4 server
>and a 2.6 client.  have you found a combination that works any better?  in
>particular, have you been able to try a 2.6 kernel on the server?
>
>thanks,
>steve
>

Comment 9 Steve Dickson 2005-03-30 15:32:32 UTC
Fred,

I have to agree, this appears to be a server problem since
all of the NFS processes in stacktrace.txt are waiting
for a server response. Would it be possible to get a
system trace of the server when the hang occurs?
It would be good to know what the nfsd threads are doing.


Comment 10 Steve Dickson 2005-03-30 21:39:17 UTC
Created attachment 112489 [details]
U5 server patch that could solve this hang

It appears this may be a duplicate of bz138182, which has been fixed
by the nfs-silly-del-revert.patch in RHEL3 U5. Please upgrade (via RHN)
to kernel version 2.4.21-31.EL or apply the attached patch.

Comment 11 Jos VanWezel 2005-04-01 00:48:39 UTC
What happened to the trace of Dan Taylor? He claimed to see the hang on
2.4.21-15.0.2EL. We have been haunted massively by similar hangs of NFS servers.
But.... the 2.4.21-15.0.2EL does not have the optimization that is being removed
in your attachment id=112489. This means Dan Taylor and I will probably still
experience NFS server lockups (all nfsd go into DW state) with the U5 kernel.

Can you comment on that, Steve? 

Comment 12 Steve Dickson 2005-04-01 02:34:43 UTC
Well, the stack trace that was posted in comment #7 shows a
number of client processes hung waiting for a response from
the server. So it's really not that useful if in fact this turns
out to be a server issue.

Now if you could post a system trace of the server that
shows the nfsd in DW state, it would definitely shed some
light on whether this is or is not the same issue as in bz138182.

Comment 13 Jos VanWezel 2005-04-10 20:37:40 UTC
Created attachment 112931 [details]
stack trace

sysrq during NFS server lockup. All nfsd are in DW state.

Comment 14 Steve Dickson 2005-04-25 14:49:07 UTC
Although it's a bit difficult to tell with this type of system
trace, it appears that all of the nfsd are waiting to do a
setattr (i.e. trying to set some file attribute) except 
for two of them which are hung in mmfs.

Looking at those two processes, it appears one process
is hung in mmfs waiting for a locked inode and the other
seems to be waiting on some cluster I/O. Similarly all of
the mmfsds are waiting for the same type of I/O.

So I would have to conclude that this hang is being caused
by the MMFS filesystem, not NFS.

I guess the next step would be to get IBM involved....




Comment 15 Fred Richardson 2005-05-02 22:26:01 UTC
We were having severe NFS issues on our batch queue (of some 30+ linux boxes)
recently.  The NFS usage during this time was extremely high.  The major fix for
us which resolved our issue (which I believe is the same one I originally
reported) was an update to the kernel we were running:

On "Red Hat Enterprise Linux WS release 3 (Taroon Update 4)"
2.4.21-31.ELsmp

(I don't think our RH7.3 boxes were updated, but they are currently running
2.4.20-28.7smp and don't appear to have issues).

Comment 16 Ernie Petrides 2005-09-08 00:00:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html


*** This bug has been marked as a duplicate of 138182 ***

Comment 17 Ernie Petrides 2005-09-08 00:07:15 UTC

*** This bug has been marked as a duplicate of 138182 ***

