From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3) Gecko/20040924

Description of problem:
I am seeing infrequent NFS client hangs. A process accessing a file over an NFS-mounted partition will suddenly freeze and become un-killable (typically in the "D" state). The server in this case is running an identical version of RedHat Linux (Enterprise Release 3). There are no messages in /var/log/messages when this problem occurs. The client machine is a Dell Precision WorkStation 670 with a Xeon processor (64-bit). NIS is used to look up the mount point. The partition is mounted with these options:

server:/some/dir on /some/dir type nfs (rw,hard,intr,timeo=100,retrans=5,rsize=8192,wsize=8192,timeo=20,tcp,addr=xxx.yyy.zzz.qqq)

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Server is under a high load
2. Try to access a file

Actual Results:
Process accessing the file freezes and can't be killed (e.g. "kill -9 ..."). Any subsequent attempts to access files from the same server also fail.

Expected Results:
Process should be able to access files over NFS without hanging.

Additional info:
We have several other Linux boxes (RedHat 7.2, SuSE) that are not having this problem. This bug is a show stopper for us.
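One detail in the mount line above: timeo is listed twice (timeo=100 and timeo=20). The report does not say which value the client ends up using, so this is only a way to spot the repetition, not a diagnosis. A minimal Python sketch, where the helper name find_repeated_options is made up for illustration:

```python
from collections import defaultdict

def find_repeated_options(option_string):
    """Return {name: [values...]} for every mount option given more than once."""
    seen = defaultdict(list)
    for item in option_string.strip("()").split(","):
        item = item.strip()
        if not item:
            continue
        # Flags such as "rw" have no "=", so their value is the empty string.
        name, _, value = item.partition("=")
        seen[name].append(value)
    return {name: values for name, values in seen.items() if len(values) > 1}

# Option string copied from the mount output in the report.
opts = ("(rw,hard,intr,timeo=100,retrans=5,rsize=8192,wsize=8192,"
        "timeo=20,tcp,addr=xxx.yyy.zzz.qqq)")
print(find_repeated_options(opts))  # {'timeo': ['100', '20']}
```

Running the same check against the options line in /proc/mounts on an affected client would show whether the duplicate survives into the live mount.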
I've had very similar problems to this for quite a while, going back to RedHat 7.3. The kernel NFS guys haven't been able to come up with a solution. I'm running a technical computing cluster and occasionally a process will get stuck writing a file on an NFS mount. On that machine, any attempt to do an ls on the mount point results in a hung process. However, if I mount the NFS share on another directory on the same machine, I can see the NFS export properly. The original hung NFS mount never comes back. I do see "NFS server not responding, still trying" messages on the client, but it never returns to "NFS server ok". I recently switched to a 100 Mbit ethernet adapter on one client, and this problem happened for the two jobs we attempted to run. Going back to a 1 Gbit adapter allowed the jobs to run. What is the best way to diagnose this problem?
Afraid I can't offer any advice on diagnostics, but I have a little more information on the problem we've been having. In our situation we have a server that's running RedHat Enterprise 3 with kernel-2.4.21-20.EL, and we're pretty sure it's a server-side problem now. What's unusual is that we have some 30 machines running some version of RedHat 7.x (maybe 7.1, but I'm not sure). These machines are older and they never have an NFS problem with this server. We can hang NFS pretty consistently when we run with a client using the same kernel/OS as the server or using the latest Debian testing release (a 32-bit build running with the 2.4.x kernel). We've managed to hang NFS with Debian on 2 somewhat different architectures. We're unsure whether the problem persists with the 2.6.8 kernel. We have one machine running with 2.6.8 that hasn't seen the problem yet. Unfortunately I haven't been able to build 2.6.8 due to a conflict with a newer Adaptec SCSI driver.
Probably the best way to debug this would be to post an Alt-SysRq-T stack trace. The easiest way to get one is to run "echo t > /proc/sysrq-trigger" while the process is hung...
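Before triggering the SysRq dump it can help to confirm which tasks are actually stuck in the "D" state. A rough Python sketch that reads /proc/<pid>/stat for this; the function names are made up for illustration, and it assumes the usual Linux stat layout of "pid (comm) state ...":

```python
import glob

def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for tasks currently in state D."""
    hung = []
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading, or unexpected line
        if state == "D":
            hung.append((pid, comm))
    return sorted(hung)

if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A task that shows up here on every run, several seconds apart, is a reasonable candidate to look for in the SysRq output.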
Arrrgh... looks like the kernel I'm currently running wasn't configured with SysRQ enabled (I just managed to get the thing wedged again, but I can't get the sysrq dump). I'll post back as soon as I can get some data.
I'm having the same thing. Here is a dump of one of the processes stuck in D.

Dec 2 18:35:44 system kernel: processname D 00000003 3860 2756 2724 (NOTLB)
Dec 2 18:35:44 system kernel: Call Trace: [<f8a5a922>] __xprt_lock_write_next [sunrpc] 0x92 (0xc9af1d2c)
Dec 2 18:35:44 system kernel: [<c0123274>] schedule [kernel] 0x2f4 (0xc9af1d40)
Dec 2 18:35:44 system kernel: [<f8a5e3d7>] __rpc_execute [sunrpc] 0x1f7 (0xc9af1d84)
Dec 2 18:35:44 system kernel: [<c022481b>] memcpy_toiovec [kernel] 0x5b (0xc9af1d8c)
Dec 2 18:35:44 system kernel: [<f8a5959d>] rpc_call_sync_Rsmp_c357b490 [sunrpc] 0xbd (0xc9af1dc4)
Dec 2 18:35:44 system kernel: [<f8a5bd90>] xprt_timer [sunrpc] 0x0 (0xc9af1e1c)
Dec 2 18:35:44 system kernel: [<f8a59e60>] call_status [sunrpc] 0x0 (0xc9af1e24)
Dec 2 18:35:44 system kernel: [<f8a5d6e0>] rpc_run_timer [sunrpc] 0x0 (0xc9af1e44)
Dec 2 18:35:44 system kernel: [<f8a90a74>] nfs3_rpc_wrapper [nfs] 0x44 (0xc9af1e80)
Dec 2 18:35:44 system kernel: [<f8a90bb3>] nfs3_proc_getattr [nfs] 0x63 (0xc9af1ea8)
Dec 2 18:35:44 system kernel: [<f8a890d3>] __nfs_revalidate_inode [nfs] 0x113 (0xc9af1ed0)
Dec 2 18:35:44 system kernel: [<c011f5ac>] do_page_fault [kernel] 0x14c (0xc9af1ef4)
Dec 2 18:35:44 system kernel: [<c010bdd4>] do_signal [kernel] 0x64 (0xc9af1f20)
Dec 2 18:35:44 system kernel: [<f8a86b2a>] nfs_file_write [nfs] 0x5a (0xc9af1f68)
Dec 2 18:35:44 system kernel: [<c01608f7>] sys_write [kernel] 0x97 (0xc9af1f94)

root]# uname -a
Linux system 2.4.21-15.0.2.ELsmp #1 SMP Fri Jun 18 23:13:20 EDT 2004 i686 i686 i386 GNU/Linux
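When a full dump covers many processes, it can be easier to compare traces after pulling out just the function and module of each frame. A small Python sketch for that; the regular expression is an assumption based only on the frame layout visible in the trace above ("[<addr>] function [module] 0xoffset (0xstack)"), and the sample is a shortened copy of that trace:

```python
import re
from collections import Counter

# One frame looks like: [<f8a5a922>] __xprt_lock_write_next [sunrpc] 0x92 (0xc9af1d2c)
FRAME = re.compile(r"\[<([0-9a-f]+)>\]\s+(\S+)\s+\[(\w+)\]\s+(0x[0-9a-f]+)")

def parse_frames(text):
    """Return (function, module, offset) tuples in the order they appear."""
    return [(func, module, int(offset, 16))
            for _addr, func, module, offset in FRAME.findall(text)]

sample = """\
kernel: Call Trace: [<f8a5a922>] __xprt_lock_write_next [sunrpc] 0x92 (0xc9af1d2c)
kernel: [<c0123274>] schedule [kernel] 0x2f4 (0xc9af1d40)
kernel: [<f8a5e3d7>] __rpc_execute [sunrpc] 0x1f7 (0xc9af1d84)
kernel: [<f8a90bb3>] nfs3_proc_getattr [nfs] 0x63 (0xc9af1ea8)
"""

frames = parse_frames(sample)
per_module = Counter(module for _func, module, _offset in frames)

for func, module, offset in frames:
    print(f"{module:8s} {func}+{offset:#x}")
```

Counting frames per module (per_module here) gives a quick view of whether the stuck tasks are sitting in sunrpc, nfs, or somewhere else entirely.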
Hey Dan, could you tar/gzip the entire backtrace and post that? It would be good to see what the other processes are doing as well. As for this process, it appears to have timed out waiting for an ack from the server. During these hangs, is there any traffic going over the wire? ethereal or tethereal can be used to verify that...
Created attachment 107865: Full stack trace
Attaching for Dan Taylor
I'm going to try to generate a stack trace later today. I'm running a 2.4.x kernel on Debian, but it will give you another data point (since this issue seems to be distribution/hardware independent for us).

Some more bits:

A coworker (more knowledgeable than I am) hasn't seen a hang since he upgraded his client to 2.6.8.

I always have a .nfsXXXXXX file corresponding to the last file accessed by the task that's hung (D state or maybe S state).

The hang always happens for me during an interactive compile session (in an emacs shell with g++). I say "interactive" because I attempted to replicate the hang in a bash script that did "make clean" "make install" 50 times in a row without causing the NFS hang. After that I went into Emacs and did a "make clean" "make install" in the same directory and NFS hung right away. From what I can tell, my coworkers were also compiling when NFS hung for them. I'm going to see if I can get a consistent hang by having another process do "ls -lR" during the "make clean"/"make install" loop.

There is a report on the NFS mailing list about a similar issue. Here's a link to Jason Holmes's message (10/13/2004):
http://www.dragoninc.on.ca/mail-archives/nfs/2004-10/0055.html

Our sysadmin e-mailed Jason Holmes about our issue and got the response below. We upgraded our server's BIOS/firmware but this didn't fix our problem. We haven't had a chance to upgrade the server to 2.6.8 (which we're going to do soon):

Date: Thu, 18 Nov 2004 11:01:01 -0500
From: Jason Holmes <jholmes>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
X-Accept-Language: en-us, en
To: Steve Huff <shuff.edu>
Subject: Re: NFS hangs with linux 2.4 kernels
In-Reply-To: <20041118154806.GB19564.edu>

Steve,

Right now I'm using 2.6.8.1 on the server and various 2.4s and 2.6s (including RedHat kernels) on the clients. I don't have a hanging problem anymore (at least I haven't had a hang in weeks). While things were better moving to 2.6.8.1 on the server, I still did have a few hangs.

I think my problems went away finally when I did a BIOS upgrade to the RAID controllers on the servers (it addressed a possible RAID controller firmware hang with Matrox drives, of which I had some). My guess at this point is that it was a combination of the two - RedHat kernels on the server and the BIOS hang, but it could have just been the BIOS since I haven't tried a RedHat kernel on the server since I did the BIOS upgrade.

Thanks,
--
Jason Holmes

Steve Huff wrote:
>hello jason!
>
>i'm a system administrator at MIT Lincoln Laboratory, and i'm seeing similar
>NFS problems to the ones you've been having. in my case we're seeing hangs
>between RHEL3 clients and servers. they usually occur while users are
>building code in NFS-mounted directories.
>
>the last message on this topic i could find from you was a few weeks ago; it
>stated that you had just experienced your first hang between a 2.4 server
>and a 2.6 client. have you found a combination that works any better? in
>particular, have you been able to try a 2.6 kernel on the server?
>
>thanks,
>steve
>
Fred, I have to agree, this appears to be a server problem since all of the NFS processes in stacktrace.txt are waiting for a server response. Would it be possible to get a system trace of the server when the hang occurs? It would be good to know what the nfsd threads are doing.
Created attachment 112489: U5 server patch that could solve this hang
It appears this may be a duplicate of bz138182, which has been fixed by the nfs-silly-del-revert.patch in RHEL3 U5. Please upgrade (via RHN) to kernel version 2.4.21-31.EL or apply the attached patch.
What happened to Dan Taylor's trace? He reported seeing the hang on 2.4.21-15.0.2EL. We have been badly haunted by similar NFS server hangs. But the 2.4.21-15.0.2EL kernel does not have the optimization that is being removed in your attachment id=112489. This means Dan Taylor and I will probably still experience NFS server lockups (all nfsd go into DW state) with the U5 kernel. Can you comment on that, Steve?
Well, the stack trace that was posted in comment #7 shows a number of client processes hung waiting for a response from the server, so it's really not that useful if this in fact turns out to be a server issue. Now, if you could post a system trace of the server that shows the nfsd threads in DW state, it would definitely shed some light on whether or not this is the same issue as in bz138182.
Created attachment 112931: stack trace
SysRq stack trace taken during an NFS server lockup. All nfsd are in DW state.
Although it's a bit difficult to tell with this type of system trace, it appears that all of the nfsd threads are waiting to do a setattr (i.e. trying to set some file attribute) except for two of them, which are hung in mmfs. Looking at those two processes, it appears one is hung in mmfs waiting for a locked inode and the other seems to be waiting on some cluster I/O. Similarly, all of the mmfsds are waiting for the same type of I/O. So I would have to conclude that this hang is being caused by the MMFS filesystem, not NFS. I guess the next step would be to get IBM involved....
We were recently having severe NFS issues on our batch queue (some 30+ Linux boxes). NFS usage during this time was extremely high. What resolved the issue for us (which I believe is the same one I originally reported) was an update to the kernel we were running: on "Red Hat Enterprise Linux WS release 3 (Taroon Update 4)", 2.4.21-31.ELsmp. (I don't think our RH7.3 boxes were updated, but they are currently running 2.4.20-28.7smp and don't appear to have issues.)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html

*** This bug has been marked as a duplicate of 138182 ***