65707 – nfs: task nnnn can't get a request slot with NFS_V3

Summary: nfs: task nnnn can't get a request slot with NFS_V3

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-05-30 16:16 UTC by Fredrik Noring
Modified:	2007-04-18 16:42 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:39:38 UTC
Embargoed:

Description Fredrik Noring 2002-05-30 16:16:37 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc2) Gecko/20020520 Debian/1.0rc2-3

Description of problem:
Using an NFS filesystem on Red Hat 7.3 with kernel 2.4.18-3smp or 2.4.18-4
hangs after a short while. The filesystem is exported from a Solaris SPARC
server.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
The fastest way to trigger the bug seems to be untarring a large file
onto the mounted NFS filesystem.


Actual Results:  All attempts to access the NFS filesystem hang completely. The
following is logged in /var/log/messages:

   kernel: nfs: task nnnn can't get a request slot

I don't get any "not responding, still trying" as far as I can see,
however.


Expected Results:  The tar file should have been untarred.

Additional info:

Compiling kernel-source-2.4.18-4 with CONFIG_NFS_V3 and CONFIG_NFSD_V3
disabled (both are enabled by default in Red Hat 7.3) solved the problem:

   CONFIG_NFS_FS=m
   # CONFIG_NFS_V3 is not set
   CONFIG_NFSD=m
   # CONFIG_NFSD_V3 is not set
   CONFIG_NCPFS_NFS_NS=y
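
The message "nfs: task nnnn can't get a request slot" refers to the RPC client's table of request slots. As we understand the Linux RPC client of this period, each transport has a fixed number of slots, every outstanding call holds one until its reply arrives, and a task that finds no free slot has to wait. The Python sketch below is a deliberately simplified, hypothetical model of that behaviour: the SlotTable class, its methods and the table size of 4 are illustrative assumptions, not the kernel's sunrpc code. It shows how calls whose replies never arrive can exhaust the table so that every later request stalls.

```python
from collections import deque


class SlotTable:
    """Toy model of an RPC transport's request-slot table (not kernel code)."""

    def __init__(self, nslots):
        self.free = list(range(nslots))  # indices of unused slots
        self.waiting = deque()           # tasks queued until a slot frees up

    def reserve(self, task):
        """Return a free slot index for task, or None if the task must wait."""
        if self.free:
            return self.free.pop()
        self.waiting.append(task)
        print(f"nfs: task {task} can't get a request slot")
        return None

    def release(self, slot):
        """Free a slot when the reply for its call arrives.

        If a task is waiting, the slot goes straight to the oldest one and
        (task, slot) is returned; otherwise the slot goes back on the free
        list and None is returned.
        """
        if self.waiting:
            return self.waiting.popleft(), slot
        self.free.append(slot)
        return None


# Four slots (an arbitrary size); the replies to the first four calls never arrive.
table = SlotTable(4)
held = [table.reserve(task) for task in range(4)]
table.reserve(4)  # prints: nfs: task 4 can't get a request slot
# When a late reply finally frees one slot, the waiting task 4 gets it.
woken = table.release(held[0])
print(woken)      # prints (4, 3)
```

In this model only a reply frees a slot (a soft mount giving up on a call would be another way out), so if replies keep getting lost, the queue of waiting tasks only grows, which matches the complete hangs reported here.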

Comment 1 akopps 2002-06-04 03:14:41 UTC

I have seen similar errors on a Red Hat 7.2 box when mounting directories from
a Solaris 2.6 server. Note that the Solaris server has "nfssrv:nfs_portmon=1"
set in /etc/system, which rejects NFS client requests that do not come from a
reserved port (below 1024). Apparently, the Red Hat 7.2 NFS client doesn't play
by the rules when using NFSv3 over TCP, and as a result many requests are
denied by the server. The problem does not occur with NFSv3 over UDP. I mention
this because you might be experiencing a similar problem. It would be nice if
this client behaviour were fixed in RH 7.2 too.

-akop
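
The nfs_portmon rule described above can be modelled in a few lines. The following Python sketch is hypothetical: the function name and signature are our own, and it does not reproduce Solaris code. It captures only the rule as stated in this comment: with port monitoring on, a request is accepted only if its source port is a reserved port, that is, below 1024.

```python
RESERVED_PORT_LIMIT = 1024  # traditionally only root may bind ports below this


def portmon_allows(client_port: int, nfs_portmon: bool) -> bool:
    """Hypothetical server-side check in the spirit of nfs_portmon.

    With port monitoring off, any source port is accepted; with it on,
    only requests from a reserved (privileged) port are accepted.
    """
    if not 0 < client_port < 65536:
        raise ValueError(f"invalid port: {client_port}")
    if not nfs_portmon:
        return True
    return client_port < RESERVED_PORT_LIMIT


print(portmon_allows(800, nfs_portmon=True))     # True: reserved port
print(portmon_allows(40000, nfs_portmon=True))   # False: unreserved port
print(portmon_allows(40000, nfs_portmon=False))  # True: check disabled
```

A client that sent some of its calls from an unreserved port, as suspected here for NFSv3 over TCP, would see exactly those calls refused while the others succeed.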

Comment 2 adler 2002-06-07 18:54:20 UTC

I would like to add to this bug report. Here at Brookhaven National Laboratory,
we are experiencing exactly the same problem. The only difference is that we are
using an NFS server running Solaris 2.8. We are currently trying to upgrade on
the order of 40 systems to Red Hat 7.3, but we will stick with 7.2 until this bug
has been resolved. We do not want to rebuild a special kernel for the 40 systems
as suggested by Noring.

Comment 3 nicholas_esborn 2002-06-19 22:59:26 UTC

I see the same problem on a RedHat 7.3 machine, mounting from EMC Celerra NFS
appliances.  This one hurts!

Comment 4 Need Real Name 2002-10-11 17:35:06 UTC

Red Hat 7.2 using NFSv3 to a Solaris 2.8 machine causes the issue for us. Using
cd and ls to move through the network file system from the command line does
NOT trigger it, but quite a few applications do. For example, as soon as we use
Nautilus to browse the file system, it hangs solid. Restarting autofs frees up
the file system, but the original application (Nautilus in this case) stays
locked solid until reboot!!

This is on multiple Dell systems with Intel and 3Com network adapters, talking
to Solaris 2.8 servers (tested; also seen with Solaris 2.6).

Comment 5 Eric Bourque 2002-11-23 17:49:24 UTC

I, too, am experiencing this problem from a RH 7.3 server (2.4.18-18.7.xsmp) and
two different RH 8.0 clients (2.4.18-18.8.0). This problem seems to have only
shown up recently when I applied one of the RHN kernel updates to my 7.3 server
(two updates ago). The problem seems to only show up after the client has been
up for a while.

Comment 6 John Cecere 2002-12-18 20:40:29 UTC

I'm experiencing the exact same conditions mentioned in this bug with a RH 8.0
NFS client to a SPARC Solaris 9 server. It seems to hang when doing a flush
operation on the NFS client side. This is always the last thing I see on the
Solaris side when it hangs (yoho=client, alyssa=server):

   2462   0.00018    yoho -> alyssa   UDP IP fragment ID=60482 Offset=0 MF=1 TOS=0x0 TTL=64
   2463   0.00155  alyssa -> yoho     RPC R XID=980343051 Success
   2464   0.00038    yoho -> alyssa   NFS C COMMIT3 FH=6FAE at 29458432 for 0
   2465   0.01849  alyssa -> yoho     NFS R COMMIT3 OK

On the client side, with sunrpc.nfs_debug set to 1 via sysctl, I see this in the
log file:

   Dec 18 14:44:25 yoho kernel: NFS: refresh_inode(b/4 ct=2 info=0x7)
   Dec 18 14:44:26 yoho last message repeated 87 times
   Dec 18 14:44:33 yoho kernel: nfs: write(//testfile(4), 8192@29368320)
   Dec 18 14:44:33 yoho kernel: nfs: flush(b/4)

And this is where it hangs. By mounting with soft,intr I can turn the hang into
a simple I/O error for the application, but that only spares me from rebooting
the client. The file operation still fails.
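
The soft option bounds how long a call is retried: after a set number of retransmissions with a growing timeout, the client gives up and the application gets an I/O error instead of hanging, whereas a hard mount keeps retrying. The Python sketch below computes that retry schedule from the timeo (tenths of a second) and retrans mount options, under our reading of the nfs(5) manual page: the timeout doubles on each retransmission up to a 60-second ceiling. The function name and the example values 7 and 3 are illustrative assumptions, not taken from this report.

```python
def soft_mount_schedule(timeo_tenths, retrans, max_timeout=60.0):
    """Successive RPC timeouts (in seconds) before a soft mount gives up.

    Model: the first timeout is timeo/10 seconds; each retransmission
    doubles it, capped at max_timeout. The call is sent once and then
    retransmitted `retrans` times; when the last timer expires, the
    application gets an I/O error (a hard mount would keep retrying).
    """
    timeouts = []
    t = timeo_tenths / 10.0
    for _ in range(retrans + 1):  # original transmission + retransmissions
        timeouts.append(min(t, max_timeout))
        t *= 2
    return timeouts


schedule = soft_mount_schedule(timeo_tenths=7, retrans=3)
print(f"I/O error after about {sum(schedule):.1f} s; timeouts: {schedule}")
```

This is why soft,intr converts an indefinite hang into a bounded wait followed by an error, but cannot make the failing operation succeed.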

I found this bug (or at least a similar incident) in SunSolve as bug ID 4764852.
It mentions Red Hat incident 38313 and Bugzilla 16232, but I am not sure how to
find those documents. The Sun bug also suggests that the problem may lie in the
NIC driver. I completely disagree. There is nothing to indicate anything wrong
with the driver for my NIC (3Com 3C905), and people with other NICs have
reported the same problem.

Lastly, the problem is most definitely not fixed in Red Hat Linux 8.0, since
that is what I am using. I currently have a custom 2.4.19 kernel loaded and am
still seeing this. I downloaded this kernel to see whether I would get different
results from the 2.4.18-17.8.0 kernel I originally had the problem with. This
bug is a real show-stopper.

Comment 7 John Cecere 2003-05-22 21:41:08 UTC

The Component of this bug should be set to kernel, not autofs. The problem is
with the NFS driver in the kernel. Is anyone ever going to look at this?

Comment 8 Nedim Celik 2003-10-20 17:37:54 UTC

I can confirm this on the latest kernels for RHAS 2.1 and RH 7.3. This used to
happen occasionally (once a month or so), but since we upgraded to the latest
kernel, it is a showstopper.

Configuration: the server I am talking to is a Sun8 box. When this happens, the
network link is filled 100% with retransmissions from the server to the client.

Right now, this is a showstopper.  Any resolution coming?

Comment 9 Nedim Celik 2003-10-21 13:55:19 UTC

WORKAROUND:

Add nfsvers=2 to the mount options.

I want to point out this is not a real resolution and someone @ RedHat should
look at this.

Comment 10 John Cecere 2004-06-15 15:42:38 UTC

I haven't looked at this in a while, and I don't think anyone has fixed it
yet. However, from what I remember, adding nfsvers=2 to the mount
options wasn't an effective workaround; I still saw this error occur
with NFSv2. The workaround I implemented was to add tcp to the
mount options, forcing the client to use TCP instead of the default
UDP. I haven't seen this problem with NFSv3 in the past year
since using the tcp mount option. This suggests that using
UDP as the transport, for both NFSv2 and NFSv3, is the issue.

Comment 11 Nedim Celik 2004-06-15 15:55:44 UTC

We had this problem for a long time and lost a lot of time and money
trying to find a fix. We finally did fix it, and the fix is quite
surprising:

We removed the HP ProCurve 4000M switches and hooked everything up to
Extreme Networks switches.

Apparently, the HP switches would lose packets from time to time, and
NFS over UDP is not equipped to deal with that. TCP has a built-in
mechanism to deal with lost packets (which also makes it slower).

NFSv2 did help a bit, but did not resolve it 100%. We also experimented
with all the other options, such as window sizes, but were not able
to get a 100% fix until we changed the switches.
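
This explanation fits the fragmentation visible in the snoop trace in Comment 6: an 8 KB NFS write sent over UDP is a single datagram that IP splits into several fragments, and if any one fragment is lost the whole datagram is discarded, so the entire RPC must be retransmitted after a timeout. TCP, by contrast, detects and resends lost segments itself without resending the whole request. The Python sketch below works through the arithmetic. The 1500-byte MTU, the 20-byte IP header without options, the 128 bytes assumed for RPC and NFS headers, and independent per-fragment losses are all illustrative assumptions.

```python
import math

IP_HEADER = 20   # bytes, assuming no IP options
UDP_HEADER = 8   # bytes


def fragment_count(udp_payload: int, mtu: int = 1500) -> int:
    """Number of IPv4 fragments needed for one UDP datagram.

    Each fragment carries at most (mtu - IP_HEADER) bytes of the IP
    payload, rounded down to a multiple of 8 as fragment offsets require.
    """
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((udp_payload + UDP_HEADER) / per_fragment)


def datagram_survival(fragments: int, loss_rate: float) -> float:
    """Probability that every fragment arrives, assuming independent losses."""
    return (1.0 - loss_rate) ** fragments


# An 8192-byte write plus an assumed 128 bytes of RPC/NFS headers.
n = fragment_count(8192 + 128)
for loss in (0.001, 0.01, 0.05):
    print(f"loss {loss:.1%}: {n} fragments, "
          f"{1 - datagram_survival(n, loss):.1%} of writes need a retransmit")
```

Under these assumptions each write needs 6 fragments, so a 1% fragment loss rate means roughly 6% of writes must be resent in full, each only after waiting out a retransmission timeout.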

Comment 12 Steve Dickson 2004-08-11 11:10:38 UTC

The TCP code has vastly improved in later kernels, so I'm going
to assume we do better in later kernels. But if the network
is dropping packets, there is only so much NFS can do.

Comment 13 John Cecere 2004-08-11 12:47:33 UTC

This problem was never with TCP. The problem is UDP. As a matter of
fact, the workaround for this problem is to force the client to use
TCP as the transport. AFAIK, the UDP transport for NFS has not been
fixed. Also, this bug has nothing to do with the network dropping
packets. I encountered this problem on a private network with 3
systems on it.

Comment 14 John Cecere 2004-08-11 12:49:02 UTC

Please reopen this bug. I doubt it is fixed.

Comment 15 Steve Dickson 2004-08-11 15:02:54 UTC

OK... I did misunderstand this... sorry about that...

Although quite a few congestion control fixes went into the
2.4.20-ish kernels (and are in the FC1 kernel), I'll reopen this
and put it into the NEEDINFO state, because I'm just not seeing
this with later kernels.
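
For context on the congestion-control remark: as we understand it, the Linux RPC client limits the number of UDP calls in flight with a congestion window that grows as replies arrive and is halved when a call times out. The Python sketch below is a generic additive-increase/multiplicative-decrease model of such a window, counted in requests. The class, the 16-request ceiling and the exact increase rule are illustrative assumptions, not the kernel's fixed-point implementation.

```python
class CongestionWindow:
    """Generic AIMD congestion window, counted in requests (toy model)."""

    def __init__(self, max_requests=16, initial=1.0):
        self.max = float(max_requests)  # upper bound, e.g. the slot count
        self.cwnd = float(initial)

    def allowed_in_flight(self) -> int:
        """How many calls may be outstanding right now."""
        return int(self.cwnd)

    def on_reply(self):
        # Additive increase: about one extra request per window of replies.
        self.cwnd = min(self.max, self.cwnd + 1.0 / self.cwnd)

    def on_timeout(self):
        # Multiplicative decrease: halve, but always allow one request.
        self.cwnd = max(1.0, self.cwnd / 2.0)


window = CongestionWindow(max_requests=16)
for _ in range(50):
    window.on_reply()
print("after 50 replies:", window.allowed_in_flight())
window.on_timeout()
print("after one timeout:", window.allowed_in_flight())
```

In this model, repeated timeouts shrink the number of calls the client is willing to have outstanding, and the window recovers only as replies come back.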

Comment 16 Bugzilla owner 2004-09-30 15:39:38 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/

Comment 17 Nedim Celik 2006-02-26 20:17:06 UTC

Well, great way to "resolve" bug reports.

Let me just say that I have confirmed this on RH AS & ES 2.1 as well as RH ES
3.0. If anybody had actually read the messages posted, he/she could have seen as
much.

You seem eager to take money for supporting your software, but I have yet to
see a valid reason why I should renew my 10 RHES servers rather than replace
them with something else.

You closing this case without ever resolving it makes me really, really pissed.
Microsoft at least resolves its issues.

Comment 19 John Cecere 2006-02-27 00:09:31 UTC

This baffles me as well. This bug was opened almost 4 years ago, and nothing was
ever done about it. Because of this bug, I would never consider using Linux as
an NFS client, even if there's a workaround. And by the way, it *is* a
workaround, not a fix. There's no reason why NFS over UDP shouldn't work in
Linux. This is pretty fundamental stuff. If this problem existed in Solaris, it
would have been fixed in a matter of days. At the risk of sounding like a
shameless plug for the company I work for, my advice to Nedim is to use
Solaris 10 x86 (the OS itself is free) on your NFS clients if you can. If not,
you're probably better off running Windows with some add-on NFS client.
