Description of problem: When using NFS over TCP, the client very occasionally sends misframed RPC calls: the end of one RPC call contains the header for the next, so the next call appears to begin with data and to run on for a very long time because the RPC headers no longer line up. The filer responds by a) complaining about a nonsense RPC call, b) complaining about a too-long RPC call ('nfsd.record.too.long'), and c) killing the TCP connection to the client with an RST. The customer was able to reproduce it "several times in 2 hours" with 70+ clients using a tool that generates large sequential writes. I am attaching the trace from the customer. They report the problem as present in RHEL4, in Fedora Core 3/4/5, and in kernels from 2.6.9 to 2.6.15. They also report that it no longer occurs as of 2.6.16, so it seems to have been fixed upstream already. Is this a known issue in the 2.6.9 kernel (RHEL4.x)? Will the cause be identified and a fix targeted for RHEL4.x?
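For anyone reading the trace, here is a minimal sketch of why a framing slip would produce both a "nonsense" call and a "too long" record. It assumes the symptom comes from the standard RPC record marking used on TCP (RFC 5531: a 4-byte big-endian word before each fragment, top bit = last fragment, low 31 bits = fragment length). The names `rm_decode`, `rm_is_plausible` and the 1 MiB limit `TOY_MAX_RECORD` are made up for illustration; they are not taken from the filer or from the Linux sunrpc code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 5531 record marking: top bit flags the last fragment of a record,
 * the low 31 bits give the fragment length in bytes. */
#define RM_LAST_FRAGMENT 0x80000000u
#define RM_LENGTH_MASK   0x7fffffffu

/* Hypothetical sanity limit, chosen only for this example. */
#define TOY_MAX_RECORD   (1024u * 1024u)

struct rm_header {
    int      last_fragment;  /* 1 if this fragment ends the record */
    uint32_t length;         /* fragment length in bytes */
};

/* Decode the 4-byte big-endian record marker found at buf. */
static struct rm_header rm_decode(const unsigned char *buf)
{
    struct rm_header h;
    uint32_t word;

    assert(buf != NULL);
    word = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    h.last_fragment = (word & RM_LAST_FRAGMENT) != 0;
    h.length = word & RM_LENGTH_MASK;
    return h;
}

/* A receiver-side check of the kind that could reject a bogus length. */
static int rm_is_plausible(struct rm_header h)
{
    return h.length > 0 && h.length <= TOY_MAX_RECORD;
}
```

A well-formed marker such as `80 00 00 64` decodes to a final fragment of 100 bytes. If the receiver is off by some bytes and interprets write payload (say `41 41 41 41`) as the marker, it sees a fragment of about 1 GB, which would match the "runs on for a very long time" description above.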
Created attachment 342955 [details] Trace file
This is the trace that the customer has provided.
The problem of an NFS client not responding quickly to RSTs from the server has been reported at http://bugzilla.kernel.org/show_bug.cgi?id=11154. The following patch was proposed to fix this problem: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2a9e1cfa23fb62da37739af81127dab5af095d99
Created attachment 363511 [details] Handle RST - attempt 1
This is the first attempt at backporting the patch from http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2a9e1cfa23fb62da37739af81127dab5af095d99. The patch is _NOT_ KABI safe. It adds a function pointer, old_error_report, to struct rpc_xprt. rpc_xprt is not directly exported, but it is passed as a parameter to rpc_create_client(), which is exported. Hence the modification to rpc_xprt breaks KABI. The module compiles fine. However, we are not sure whether it works as intended, since testing it requires an NFS server that sends an RST packet.
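To illustrate the KABI concern, here is a toy sketch of what adding a member does to a structure's layout. The two structs are hypothetical stand-ins, not the real rpc_xprt, and the position of the new member in the middle is an assumption made only for the example; where the backport actually places old_error_report is not stated here.

```c
#include <stddef.h>

/* Hypothetical "before" layout that an already-built module was compiled against. */
struct toy_xprt_old {
    void          *sock;
    unsigned long  timeout;
};

/* Hypothetical "after" layout with the new function pointer added. */
struct toy_xprt_new {
    void          *sock;
    void         (*old_error_report)(void *sk);  /* new member */
    unsigned long  timeout;
};

/* Returns nonzero if a module built against the old layout would disagree
 * with the new one about the struct's size or where 'timeout' lives. */
static int toy_layout_changed(void)
{
    return sizeof(struct toy_xprt_new) != sizeof(struct toy_xprt_old) ||
           offsetof(struct toy_xprt_new, timeout) !=
           offsetof(struct toy_xprt_old, timeout);
}
```

A third-party module that receives such a structure through an exported function would still read `timeout` at the old offset, which is why a change like this counts as a KABI break even though the struct itself is not exported by name.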