Bug 64921 - NFS version 3 hangs
Summary: NFS version 3 hangs
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ben LaHaise
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 64984 65069 (view as bug list)
Depends On:
Blocks:
Reported: 2002-05-14 15:22 UTC by David Roberts
Modified: 2007-04-18 16:42 UTC (History)
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2002-05-28 21:48:06 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2002:110 0 normal SHIPPED_LIVE Updated kernel with bugfixes available 2002-06-10 04:00:00 UTC

Description David Roberts 2002-05-14 15:22:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc1) Gecko/20020417

Description of problem:
I have found that, starting with Red Hat 7.1, I have had to modify the autofs
startup script to add nfsvers=2 to the list of default options.
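For reference, the workaround described above can be expressed as a one-line change in the autofs init script (a sketch; the localoptions variable and exact syntax come from later comments on this bug and may differ across releases):

```shell
# /etc/init.d/autofs -- force NFS version 2 for automounted filesystems
# (workaround sketch; adjust to your local configuration):
localoptions='nfsvers=2'
```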

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Mount a disk using NFS version 3
2. Do lots of reading
3. It will hang

Actual Results:  Mountpoint hangs

Expected Results:  Mountpoints should not hang unless the server is indefinitely
broken

Additional info:

Using nfsvers=2

Comment 1 William R. Fulmer 2002-05-16 15:28:26 UTC
I don't know about reads, but I can vouch for NFS v3 having problems.  I can do
large reads without problems, but any large writes over a network with any kind
of real latency and the NFS client goes berserk.  It seems to get into a state
where it does endless retries, and the delay between retries decreases as time
progresses, sending a flood of NFS packets over the net.  I was going across two
local routers and managed to take down six or seven of our production subnets.
One client was enough to flood a 100M pipe.  Strangely enough, trying to
reproduce the event across a local switch (with no latency) didn't work.

Version 2 works fine.  This problem did not exist in 7.2, so something got broken.

Here's the output from nfsstat -c for my last test (I had to stop any further
testing; the network guys threatened bodily harm):

Client rpc stats:
calls      retrans    authrefrsh
11483      55336      0
Client nfs v2:
null       getattr    setattr    root       lookup     readlink
0       0% 256    24% 0       0% 0       0% 254    24% 0       0%
read       wrcache    write      create     remove     rename
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
link       symlink    mkdir      rmdir      readdir    fsstat
0       0% 0       0% 0       0% 0       0% 539    51% 1       0%

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 1509   14% 1       0% 1842   17% 6202   59% 0       0%
read       write      create     mkdir      symlink    mknod
0       0% 32      0% 0       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
0       0% 0       0% 0       0% 0       0% 809     7% 0       0%
fsstat     fsinfo     pathconf   commit
3       0% 2       0% 0       0% 2       0%

This was after less than thirty seconds of trying to copy a 16M file.

Comment 2 Michael K. Johnson 2002-05-16 19:26:33 UTC
We recommend that you use UDP for now.  NFS/TCP was just recently enabled
at all in the upstream kernel source tree, and it is functional for some
uses but clearly not for yours.

Comment 3 William R. Fulmer 2002-05-16 21:27:40 UTC
Is TCP the default?  The tcpdump output that we were looking at showed UDP.

Comment 4 Gilbert E. Detillieux 2002-05-17 16:59:54 UTC
It does appear to be using UDP, and not TCP on my network as well, but the NFS
version did seem to be 3.

If NFS 3 support is still not mature, I'm not sure why it would have been made
the new default.

What's worse, the change was not documented, as far as I can tell - the nfs(5)
man page still says...

  nfsvers=n      Use an alternate RPC version number to contact
                 the NFS daemon on the remote host.  This option
                 is useful for hosts that can run multiple NFS
                 servers.  The default value is version 2.


Comment 5 William R. Fulmer 2002-05-17 18:16:53 UTC
The default was version 3 in 7.2 also.  The problem did not exist in 7.2, so
something has changed.  nfsvers=n is not what you think; I think it is for
running multiple instances of an NFS server on the same box, but I'm not sure.
The option that specifies the protocol version is not in the man page (or I
missed it).  It's just vers=n.

Comment 6 Gilbert E. Detillieux 2002-05-17 21:34:12 UTC
Well, nfsvers=2 and vers=2 appear to accomplish exactly the same thing, as far
as the kernel is concerned.  In either case, the option simply appears as "v2"
when you look at /proc/mounts, so I'd just as soon stick to the documented
options (even if the docs are out of date).

In any case, setting localoptions='nfsvers=2' in /etc/init.d/autofs did fix the
problem for me.  (I was able to get things to fail consistently by forcing a
large core dump on an NFS-mounted file system before, and things now work as
they should.)

I think the nfs(5) man page is misleading about the multiple-instances
suggestion.  That would apply to mountprog=n and nfsprog=n, but mountvers=n and
nfsvers=n are meant for specifying protocol versions.  The mountd and nfsd
daemons support multiple protocol versions, for backward compatibility,
regardless of how many program instances you may be running.
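As a sketch of the /proc/mounts check mentioned above: the negotiated protocol version shows up as a token like "v2" in the option field. The sample line below is hypothetical, not taken from this bug.

```shell
# Hypothetical /proc/mounts line for an NFS mount (sample data only):
line='server:/home /home nfs rw,v2,rsize=8192,wsize=8192,hard,intr 0 0'
# Field 4 holds the mount options; extract the protocol-version token:
vers=$(echo "$line" | awk '{print $4}' | tr ',' '\n' | grep '^v[0-9]$')
echo "$vers"   # prints: v2
```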

Comment 7 Eric Sandeen 2002-05-21 14:48:22 UTC
I'll chime in that I'm seeing this problem as well, on an NIS homedir.
If I don't change autofs to use NFSv2, the box gets into sorry shape very
quickly: lots of "nfs: task XXXXX can't get a request slot" errors, and the X
session trying to use the NFS/NIS homedir locks up hard.

Comment 8 Joseph Kotran 2002-05-21 17:36:56 UTC
Hello,

Please note that I confirm the behavior reported in this bug.  However, I offer
a different suggestion than working around the problem by setting nfsvers=2 in
/etc/rc.d/init.d/autofs: instead I set 'tcp,nfsvers=3'.  This fixes/works
around the problem for me.

I am a bit confused, because I thought that NFS 3 was TCP-only.  Like a previous
poster, I experienced UDP traffic from a malfunctioning Red Hat 7.3 client.

In summary, NFS 3 works well for me as long as I specify tcp in my mount entries.

Regards,

Joe
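The TCP workaround from this comment, sketched as an autofs setting (syntax follows the earlier localoptions examples in this bug; adjust to your setup):

```shell
# /etc/init.d/autofs -- keep NFSv3 but force the TCP transport
# (workaround sketch from this comment):
localoptions='tcp,nfsvers=3'
```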


Comment 9 Ben LaHaise 2002-05-28 21:46:35 UTC
*** Bug 64984 has been marked as a duplicate of this bug. ***

Comment 10 Ben LaHaise 2002-05-28 21:48:00 UTC
*** Bug 65069 has been marked as a duplicate of this bug. ***

Comment 11 Ben LaHaise 2002-05-28 21:49:53 UTC
Several fixes to the NFS client are now added to the kernel and will show up in
subsequent errata releases.  Please reopen this bug if the errata kernels newer
than 2.4.18-4 exhibit the same problem.

Comment 12 Need Real Name 2002-06-04 17:41:56 UTC
I get the same thing as above, only with the dmfe ethernet driver and a
Solaris 7 NFS server.  It really becomes evident when starting Mozilla for the
first time, if it has to convert the Netscape 4.x profile over.  Tried the
nfsvers=2 option; it didn't fix it.
Solaris 7 NFS server, fully updated, with a freshly installed and updated
RH 7.3 box with kernel-2.4.18-4 and automounted home dirs.
Email me for more info if you need it.
Thanks,
Mitchell

Comment 13 Need Real Name 2002-07-02 17:55:45 UTC
I have very slow NFS writes with 2.4.18-5 (the one with the NFS fixes).  The
client is 7.3, with a Solaris server.  No problem with 7.2 clients.

I tried putting localoptions='nfsvers=2', localoptions='rsize=8192,wsize=8192',
and localoptions='rsize=8192,wsize=8192,vers=2' in /etc/init.d/autofs, with no
success.

Comment 14 Rex Dieter 2002-07-02 18:02:14 UTC
I, too, have experienced very slow writes (~50k/sec) with the new kernel.  See 
my bug report #67199

Comment 15 Need Real Name 2002-07-02 18:35:49 UTC
After deciding that nfs was still a problem in 2.4.18-5, I went back to 2.4.18-4
and set rsize=8192,wsize=8192. This seems better.

For an 8MB write:

2.4.9-31    1.6
2.4.18-4    2.5 (with rsize=8192,wsize=8192)
2.4.18-5    25  (with rsize=8192,wsize=8192)
                
... the price of progress. Any magic options for 2.4.18-5?


Comment 16 Rex Dieter 2002-07-02 18:59:26 UTC
Try mounting 'sync' vs 'async'.  Do you see a difference?  For me, async is 
great, sync = ~50k writes.  Ick.

Comment 17 Need Real Name 2002-07-09 13:21:45 UTC
The mount options rsize=8192,wsize=8192,async do give reasonable write speeds
with the 2.4.18-5 nfs client. (I put localoptions='rsize=8192,wsize=8192,async'
in /etc/init.d/autofs).

Thanks!
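The same options can also go in /etc/fstab for a static mount (a sketch; the server name and paths below are placeholders, not from this bug):

```shell
# /etc/fstab entry using the options that worked above (placeholders):
# device               mountpoint  type  options                      dump pass
server:/export/home    /home       nfs   rsize=8192,wsize=8192,async  0    0
```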

Comment 18 Need Real Name 2002-08-14 17:25:10 UTC
Solaris 7 NFS server and stock Red Hat GNU/Linux 7.3 NFS mount: copying files
200KB or larger on the NFS mount causes the copy process to slow to a virtual
stop.  The kernel is 2.4.18-3.  Adding nfsvers=2 to /etc/fstab fixed the
problem, as suggested by dlr in the initial post.  Unless you want to use NFS
version 3, nothing more needs to be done to fix this problem.
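A minimal /etc/fstab sketch of that nfsvers=2 workaround (server name and paths are placeholders):

```shell
# /etc/fstab -- force NFS version 2 against the Solaris 7 server (sketch):
solaris7srv:/export/home  /home  nfs  nfsvers=2  0  0
```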

Comment 19 Need Real Name 2002-08-21 14:23:06 UTC
Some of us _do_ need NFSv3 to work, because we need to use files >2GB.  I have
tried a succession of kernels and options in a quest for decent performance and
still haven't arrived at a satisfactory solution.  Stability is also critical;
I'd rather they be a little slow than panic every so often.

Has anyone tried a 2.4.19 kernel?  Any relevant fixes in there?  How about
downgrading to 2.4.7 or so (7.2's release)?

Comment 20 venkat 2002-09-13 19:14:45 UTC
This problem still exists in RHL 7.3 with kernel 2.4.18-10. The NFS server is
RHL 7.3 with kernel 2.4.18-10 and the clients are also RHL 7.3 with kernel
2.4.18-10. When trying to write to the NFS-mounted directory, it hangs.  But it
works fine with RHL 7.2 clients.

I have tried the following setting on the client side:

-fstype=nfs,hard,intr,nodev,nosuid,quota,rsize=8192,wsize=8192
servername:/home/&

-fstype=nfs,hard,intr,nodev,nosuid,quota,nfsvers=2,rsize=8192,wsize=8192
servername:/home/&

but it did not solve the problem.

Any advice on how to resolve this?

Thanks,
Venkat


Comment 22 Jason Corley 2002-09-13 19:37:21 UTC
I also am seeing this problem.

Comment 23 Jason Corley 2002-09-13 19:38:35 UTC
Sorry I hit the button too fast on my last comment.  I wanted to say that I've
been working with Venkat (venkat) and just wanted to attach my email
address to this bug so I could track progress that way.

Comment 24 Need Real Name 2002-09-13 19:47:12 UTC
I updated to a stock 2.4.{18|19} kernel and things are fine now.  The problems
I was seeing from Mozilla were a combination of NFS problems and a Mozilla bug.


Comment 25 Paul Raines 2002-10-03 19:38:40 UTC
I traced my NFS writes being 10 times slower on my RH 7.3 machines, compared to
my RH 7.1 machines, down to using the option 'timeo=300'.  I use this along
with 'soft', even though I know all the docs say not to use 'soft'.  As soon as
I removed 'timeo=300' (but not 'soft'), my performance was normal again.  I
don't understand how timeo should make a difference, since according to the man
page it only applies when the server is not responding.

