From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6

Description of problem:
We have a 1500-node Linux cluster. Occasionally, processes hang while accessing files on a NetApp filer. If I run 'df' on one of these machines, it hangs trying to statfs() the stuck filesystem, and the following error is printed in the messages file:

Sep 20 17:55:34 u6311 kernel: nfs_statfs: statfs error = 512

These machines are 3.4GHz Intel Xeons with 8GB of memory. I'm attaching a couple of packet traces taken while df was run on an affected machine, along with the output produced by 'echo t > /proc/sysrq-trigger'.

Version-Release number of selected component (if applicable):
kernel-2.6.8-521

How reproducible:
Sometimes

Steps to Reproduce:
1. This can't really be reproduced on purpose. Statistically, it occurs every day.
Created attachment 119055 [details]
bzipped output of 'echo t > /proc/sysrq-trigger'
Created attachment 119056 [details]
ethereal packet trace during execution of 'df'

The command I used to generate this is:
tethereal -w ~/pkts2 "host share-master and (port nfs or port rpc)"
Is that the entire ethereal trace? All I'm seeing is NFS_FSSTATs, and they are all successful... What we need to look for is NFS_FSSTATs that are *not* successful. Try using the following display filter to catch the unsuccessful ones:

nfs.procedure_v3 == 18 and nfs.status != 0

In that sysrq trace, both df processes are stuck in the exact same place. They both appear to be trying to do a buffer allocation, so I'm wondering if this is a memory issue... To find out, I would like to increase the verbosity of the kernel logging with 'sysctl -w kernel.printk=7'. Then, when this happens again, I would like you to turn on some sunrpc debugging with 'echo 2 > /proc/sys/sunrpc/rpc_debug' to see whether the df process is spinning trying to get memory... Finally, when this happens, use the sysrq-m command to dump the memory usage, and I'll have Larry take a look at it to see what he thinks...
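For convenience, here is a sketch of those steps as shell commands, assuming the earlier capture is still saved at ~/pkts2 (tethereal's -r reads a saved capture and -R applies a display filter):

# Re-check the saved capture for failed FSSTAT calls
# (NFSv3 procedure 18 is FSSTAT; a nonzero nfs.status means failure).
tethereal -r ~/pkts2 -R "nfs.procedure_v3 == 18 and nfs.status != 0"

# Raise kernel log verbosity so debug messages reach the log.
sysctl -w kernel.printk=7

# When a df hangs again, enable sunrpc call debugging...
echo 2 > /proc/sys/sunrpc/rpc_debug

# ...and dump memory usage (same as typing SysRq-m on the console).
echo m > /proc/sysrq-trigger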
I'll do a more specific trace, but that is the entire trace. On a machine where the first NFS filesystem is the problematic one, no packets are sent at all, yet strace shows that statfs() was called. This was printed in /var/log/messages after turning on sunrpc debugging and running df:

Sep 23 11:07:00 u6354 kernel: RPC: 34382 call_start nfs3 proc 18 (sync)
Sep 23 11:07:00 u6354 kernel: RPC: 34382 call_reserve

I think the "nfs_statfs: statfs error = 512" error might be a red herring. It appears after I hit Ctrl-C while df is hung, which is expected behavior when Ctrl-C interrupts an NFS operation.
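For anyone reproducing this, the hanging call is easy to see under strace. A sketch; the mount point below is a placeholder, and on these kernels df may issue statfs64 rather than statfs:

strace -e trace=statfs,statfs64 df /path/to/hung/mount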
OK... let's assume the statfs error is a red herring, since ERESTARTSYS (512) is returned for interrupts, but this does not answer the question of why the df is hanging in the first place. In the network trace, also look for RPC or TCP retransmits. It's a long shot, but maybe the server or network is dropping packets under load... Plus, that SysRq-m output could show whether there are memory issues... Finally, would it be possible to try a more recent kernel for testing purposes? As the guy from Angstrom points out, there has been a lot of work done on both the NFS and RPC code, which would be included in a later FC3 or even FC4 kernel. Please note it is very possible to run a later kernel on an older installed base. Currently I'm running an FC5 kernel on an FC3 installed base without problems...
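One way to scan the existing capture for TCP retransmits, a sketch using ethereal's standard analysis display filter against the earlier capture file:

tethereal -r ~/pkts2 -R "tcp.analysis.retransmission"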
Will do. We have a 2.6.12 kernel that's just coming out of testing here (I think it might be from FC4). We'll try it on some of the affected machines and see if it helps. Do you know of any specific changes between 2.6.8 and 2.6.12 that would address this problem?
No... Although I do monitor all the upstream patches that are added to both the -mm and mainline kernels, and scan web sites like http://client.linux-nfs.org/ for patches that have not yet made it (or never will) into an upstream kernel (to "cherry-pick" for RHEL releases), I really don't pay much attention to which patches are or are not in the FC stream, since it's always pretty close to the current upstream kernel...
It looks like 2.6.12-1.1372_FC3 resolves this in our limited test group. We're rolling this out to a larger group, but it looks promising. Can you keep this bug open until we're sure it's fixed?
Good! It's good to hear you're making progress... As far as closing this bug goes, do it whenever you feel it's appropriate.
Spoke too soon, unfortunately. After rolling 2.6.12-1.1372_FC3 out to several machines, we've now seen this behavior on those machines as well. What can we try next?
Boy, you're not making my day... :) Could you please post a new sysrq-t process trace as well as a sysrq-m memory trace?
Created attachment 119998 [details]
sysrq-m memory trace and sysrq-t process trace (bzipped) on a 2.6.12 kernel machine
Boy, it sure appears the NFS client is fat, dumb, and happy, waiting for a response from the server... Is that nfs_statfs failure still being logged to /var/log/messages? Are there any other types of errors being logged? In the previous ethereal trace, the nfs_fsstat calls were being responded to, but it is not clear whether those responses were for the dfs that worked or the dfs that hung... So once you're in this state, I want you to:

1) echo 16 > /proc/sys/sunrpc/nfs_debug, which will log NFS requests and responses to /var/log/messages;
2) start up 'tethereal -w /tmp/data.pcap host server' on the client; and
3) try another df (which should also hang).
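A sketch of those steps in order, assuming 'server' stands in for the filer's hostname and the affected mount point is a placeholder:

# Log NFS requests and responses to /var/log/messages.
echo 16 > /proc/sys/sunrpc/nfs_debug

# Capture all traffic to/from the filer in the background.
tethereal -w /tmp/data.pcap host server &

# Reproduce the hang against the affected mount point.
df /path/to/affected/mount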
I'm attaching the chunk of /var/log/messages captured with debugging on during the df, along with another packet trace. This time I only ran df on the hanging filesystem, so there shouldn't be any successful statfs() calls.
Created attachment 120101 [details]
excerpt of /var/log/messages with nfs_debug set to 16 during an NFS hang
Created attachment 120102 [details]
packet trace during NFS hang
Well, unfortunately that was a bit unfruitful... I didn't see any statfs() calls in either the network trace or the debug log... When the df hangs, do all the other processes hang as well? Also, are there any errors or warnings being logged on the server?
Yeah, I don't think statfs is getting sent. Any process that tries to access the affected mount will hang. There don't seem to be any errors logged.
We've also noticed that it usually takes about 10 days (after boot) for machines to find themselves in this state.
I think we may have solved this. The Intel Baseboard Management Controller on our machines shares the host's IP and uses port 623/tcp for some things. Apparently, that port is occasionally used by NFS as well, which can cause these problems. I've put in some xinetd config files to occupy those ports, and the hanging seems to have stopped. We had run into this problem with 623/udp and NIS earlier and solved it the same way. I wish I'd thought of it sooner... Please leave this open for a few days so we can ensure the problem is solved.
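For reference, here is a sketch of the kind of xinetd stanza that can squat on the port; the service name and the use of /bin/false as a do-nothing server are illustrative, not the exact config used here (a matching udp stanza would use socket_type = dgram and wait = yes):

service bmc-tcp-623
{
        type            = UNLISTED
        port            = 623
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = root
        server          = /bin/false
        disable         = no
}

Later kernels also grew /proc/sys/sunrpc/min_resvport and max_resvport sysctls, which, where available, can steer the client's reserved-port range away from 623 without xinetd.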
Interesting... 623 (and other reserved ports) are used by both rpc.mountd and the NFS client during mounts... But I don't understand why this would cause a hang after the mount occurs... Unless the df is being done on an autofs-controlled mount point, which means we could be hanging in autofs land...
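A quick way to check which reserved source port the client's NFS TCP connection grabbed (2049 being the standard NFS port); if the local port shows up as 623, the BMC-collision theory fits:

netstat -tn | grep ':2049'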
That is interesting. Sometimes the affected mount is an autofs mount, but many times it isn't. I'm not sure how to explain it. After five or so days, the results are undeniable.
This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project. Due to the limited resources of this project, typically only updates for new security issues are released. As this bug isn't security related, it has been migrated to a Fedora Core 4 bug. Please upgrade to this newer release, and test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

Thank you.
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.

Thank you.
Closing per previous comment.