64984 – Redhat 7.3: nfs writes very slow.

Bug 64984 - Redhat 7.3: nfs writes very slow.

Summary: Redhat 7.3: nfs writes very slow.

Keywords:
Status:	CLOSED DUPLICATE of bug 64921
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Ben LaHaise
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2002-05-15 17:35 UTC by phil
Modified:	2014-01-21 22:48 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2002-05-28 20:18:15 UTC
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2002:110	0	normal	SHIPPED_LIVE	Updated kernel with bugfixes available	2002-06-10 04:00:00 UTC

Description phil 2002-05-15 17:35:49 UTC

Description of Problem:  Redhat 7.3:   nfs writes very slow.

Version-Release number of selected component (if applicable):
Kernel version 2.4.18-4

How Reproducible:  Always.

Steps to Reproduce:
1.  Upgrade Dell Dimension 4100 800Mhz PIII with    
3COM 3C905-C-TXM 10/100 ethernet card from Redhat 7.1 to 7.3.
Mount user files via nfs from a Network Appliance file server.

2.   Try to run vi or mail on a 1.7 mb mail file.  Delete some
messages, and try to write the file out.

3.  After 5 or 6 minutes, the file will be written out.  During this 
time, other commands on the machine such as ps aux, dmesg, and top
will hang.    After it writes file out, wakes back up, dmesg and
/var/log/messages will show  repeated messages of form
 nfs: server sinagua not responding, still trying
 nfs: server sinagua OK

Actual Results:
5 or 6 minutes to write out 1.7 mb file to nfs file server.

Expected Results:
2 or 3 seconds.

Additional Information:

I've observed this on the first 2 machines that i've put 7.3 on -
Both are
 Dell Dimension 4100 800Mhz PIII
 3COM 3C905-C-TXM 10/100 ethernet

I tried downloading, building and using kernel version 2.5.15 from 
kernel.org, (the latest development version), and i don't seem to 
have this problem with that kernel.
However, i have some 200 machines to upgrade to 7.3, and i dont know
if i want to put 2.5.15 on all of them.

Let me know if there is additional information that i can provide.

    Phil Kaslo
    Dept. of Computer Science 520-621/2760
    University of Arizona     phil.edu
    Tucson, Ariz. 85721

Comment 1 Dan Morrill 2002-05-15 20:54:51 UTC

I think I ran into this myself, except that the result was 
much more disastrous.  Can you check on this info on your box, 
and see if your case matches? 
 
I installed (fresh install, not upgrade) 7.3 on a Dell 
Optiplex GX-150, which is similar to your system.  Notably, it 
has the same Ethernet card:  3C905C (Vortex Boomerang.) 
 
On previous RH versions (i.e. 7.1 and 7.2), NFS worked just 
fine.  However, it fails on 7.3, with the following symptoms: 
 
- works fine for short reads 
- longer reads (or writes) take a very very long time to 
complete 
 
Unfortunately, eventually my network support people shut off 
my ethernet port, because my box was basically DOS flooding 
the NFS server! 
 
After investigation, it turns out that somehow my 3c905C is 
experiencing A LOT of collisions, on a switched network.  The 
collisions are causing dropped TCP/IP packets.  With TCP, this 
just reduces throughput, but on UDP (which NFS uses,) it 
generated so many retries and resends between my box and the 
server that I was flooding the NFS server.  (The server was 
complaining of "dropped IP fragments".) 
 
I have noticed that the NFS issue is a red herring, in my 
case:  it's a symptom of this ethernet collision business (I 
think.)  The real question is what's up with ethernet. 
 
This is the same hardware, and as near as I can tell it's the 
same 3c59x.c driver version, so I'm not sure what changed.  
Also, I have used 2.4.18 on 7.2, so I don't know what it is 
about RH's 2.4.18-4 kernel in 7.3 that is causing this. 
 
I tried creating a slimmed down 2.4.18-4custom kernel with all 
multicast and other advanced TCP/IP options disabled, with no 
effect. 
 
My network analyst also had me try forcing the hardware into 
10baseT/half duplex mode, just in case the collisions were the 
result of autonegotiation failures.  No effect there, either. 
 
The odd thing is that this is a switched network, meaning that 
I should never see ethernet collisions on my interface, yet 
ifconfig reports exactly that.  Also, the network guy says 
that he is seeing FAR more collisions on his end than I see on 
my end. 
 
So, that's my brain dump.  Phil:  can you poke around and see 
if this is the same problem you are having?  (i.e. look for 
collisions via ifconfig.)  If it's not, I need to enter 
another bug. :)

Comment 2 Dan Morrill 2002-05-15 21:07:26 UTC

Oh, forgot to mention another couple details.  I did a 
"modprobe 3c59x debug=7; ifup eth0" to see what the driver had 
to say.  I got lots of messages like this: 
 
eth0: vortex_error(): status 0xe081 
eth0: vortex_error(): status 0xe481 
 
(I think -- I'm reciting from memory.  But the errors are 
0xe081 and 0xe481.) 
 
One error seemed to correspond with each collision recorded by 
the device.  Also, the ratio of 0xe081 to 0xe481 errors is 
around 5:1. 
 
- Dan

Comment 3 phil 2002-05-15 23:05:13 UTC

Yea - the networking people here have also encouraged me to not do this
(run Redhat 7.3 with the 2.4.18 kernel, as an nfs client, on our network.).  

Doing the nfs writes of a 2 - 3 mb mail file appears to kill the response on our
main nfs server (a Network Appliance)  for other users as well as myself.
(Before today i was just noticing it on the box i was running 7.3 on,
but today it appears that this is affecting other people.)

ifconfig eth0 on the client gives me this:

osp 80 ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:01:03:24:6B:91  
          inet addr:192.12.69.241  Bcast:192.12.69.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:30970 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2556692 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:13536614 (12.9 Mb)  TX bytes:3673708034 (3503.5 Mb)
          Interrupt:3 Base address:0xdc00 

The machine is on a switched network port (on a Cisco catalyst 5000).  
We ran a capture on that switched port using an Etherpeek sniffer for
several minutes.  It shows a lot of fragment errors between my machine 
and the file server:  The message
 
 "An IP datagram has been fragmented by the host application or a
 router, and one of the fragments is missing".

 10,555 errors on 11327 packets comprising 16,050,566 bytes.

(The host and the file server are on the same network (192.12.69.0),
so traffic is not going thru a router.)

So it sounds like client  NFS is broken in 7.3.

The  Cisco switch is not showing errors on the port my machine is connected to,
but there are collisions on the port that the file server is connected to.

Thanks for your help.  Let me know if there is more information that i can
provide.

Phil Kaslo

Comment 4 phil 2002-05-15 23:23:10 UTC

I did run  " modprobe 3c59x debug=7 ; ifup eth0 " on the client.

/var/log/messages does give me a lot of output of the form

May 15 16:18:16 osprey kernel: eth0: vortex_error(), status=0xe481
May 15 16:18:24 osprey kernel: eth0: vortex_error(), status=0xe081
May 15 16:18:43 osprey kernel: eth0: vortex_error(), status=0xe481
May 15 16:19:16 osprey last message repeated 2 times

Phil

Comment 5 David Alden 2002-05-16 13:29:14 UTC

Just a quick me too.  Only I wish you'd set the priority much higher (to
whatever the highest level is).  I'm 99% sure it's an NFS issue.  I have a RH7.3
client with a RH7.2 server and a Solaris 8 server.  If I untar a 1.5M file from
the Solaris 8 server onto the RH7.3 client, I get ~20,000 client rpc retransmits
(according to nfsstat).  If I untar the same file from the RH7.2 server onto the
RH7.3 client, I get an average of 0.5 client rpc retransmits.

Interestingly enough, if I untar the 1.5M file from the Solaris 8 server onto a
RH7.2 client, I get ~100 retransmits (versus an average of 0.5 if going from the
RH7.2 server to the RH7.2 client).

If there some setting we should be using?

Comment 6 Alan Cox 2002-05-16 18:43:54 UTC

This looks much more like the 3c59x driver is broken for some cards than NFS
problems

Comment 7 David Alden 2002-05-16 18:59:34 UTC

Sorry, but I'm not using 3COM cards.  The RH7.3 client is using an Intel
Ethernet Pro 100 (according to /proc/pci).  The RH7.2 server is using an Intel
82544EI Gigabit Ethernet Controller.  NFS works fine between the RH7.3 client
and RH7.2 server, but has serious problems between the RH7.3 client and the
Solaris 8 server.  All previous version of RH (from 6.2-7.2) have basically no
problems with the Solaris 8 server (I get some retransmits, but no where near
the amount that I get with RH 7.3).

...dave

Comment 8 Ben LaHaise 2002-05-16 19:16:03 UTC

What rsize/wsize are you using?  Does the problem go away if you set rsize=1024
and wsize=1024 on the client?  Is there any packet loss when pinging the solaris
server with ping -s 8300 or ping -s 4300?  This sounds like two seperate bugs
and should probably be entered seperately -- it's easier to mark bugs as
duplicate than to deal with two threads of information in a single bug.

For the person with the 3com, double check the duplex of the link and try the
above.  You may need to force full duplex on 3com driver with full_duplex=1 and
try again.

Comment 9 phil 2002-05-16 22:56:40 UTC

I tried putting '-rsize=1024,-wsize=1024' on the lines for autofs in
/etc/auto.master,
and 'options 3c59x full_duplex=1' in /etc/modules.conf, and rebooting.
The cisco switch indicates the port  that it is connected to is at auto-full
duplex, 
and auto-100 speed. 

I again tried a copy of a 2 mb mail file, and edit of it using mail, d, and q. 
It still takes
about 80 seconds to write it out,  during which commands hang, and after, dmesg 
reports
 nfs: server sinagua not responding, still trying
 nfs: server sinagua ok'

I can't  keep running these tests in a production environment here, because 
of the effect it has on our main nfs file server, and on other users.

Phil

Comment 10 Ben LaHaise 2002-05-16 23:09:13 UTC

What about the results of the pings?  How much packet loss is being seen?

Comment 11 phil 2002-05-16 23:44:50 UTC

When i start a  ping sinagua (a Network Appliance running
NetApp Release 6.1.l), then run mail, d, q, as above, i get

ping -s 8300 sinagua
...
--- sinagua.cs.arizona.edu ping statistics ---
22 packets transmitted, 19 received, 13% loss, time 21193ms
rtt min/avg/max/mdev = 2.279/2.452/3.832/0.379 ms

ping -s 4300 sinagua
...
--- sinagua.cs.arizona.edu ping statistics ---
24 packets transmitted, 21 received, 12% loss, time 23198ms
rtt min/avg/max/mdev = 1.497/1.654/4.019/0.531 ms

Phil

Comment 12 David Alden 2002-05-16 23:52:31 UTC

Ok, the [rw]size=1024 fixed my problem.  I tried various sizes.  It looks like 
RH changed the default from (I believe) 4096 to 32768???  I have no trouble up 
to 8192.  I start to see retransmits climb at 16384, but everything is still 
usable.  When I tried 32768 I see 10's of thousands of retransmits when 
untar'ing a 1.5M file.  So, did RH up the default to 32768?

BTW, I have 0 packet loss (on over a 1000 iterations) when I'm not trying to 
access the NFS server (I didn't try it when I do access the NFS server).  

...dave

Comment 13 phil 2002-05-16 23:55:50 UTC

When i try to ping , ping -s 8300, ping -s 4300 sinagua (the netapp)
when not doing that test  (mail, d, q), on the client, i get 0% packet loss.

Phil

Comment 14 Dan Morrill 2002-05-16 23:59:13 UTC

FYI, the NFS server I referred to in my case is also a Network 
Appliance, and I see the same symptoms.  Not sure what my rsize 
and wsize are offhand. 
 
Also, Alan:  as near as I can tell (according to version) the 
3c59x driver is the same as it was in 7.2, and it's the same 
hardware.  Not sure whether the collisions are actually 
relevant or not, but I would guess that ethernet has to be 
playing a role somewhere or Phil and I would not be getting the 
same errors from 3c59x.  Also, I'm on 10bT/half whereas Phil's 
on 100bT/full, which may explain why I see actual collisions 
and he doesn't. 
 
I may attempt to build a stock 2.4.18 (which worked on 7.2) on 
my box tomorrow and see if anything changes.  Like Phil, 
though, I can't test this much, since I'm in a production 
environment. 
 
Thanks, guys... 
 
- Dan

Comment 15 Brian Hill 2002-05-17 13:01:17 UTC

I'd just like to add one more "me too" and share my experiences.  I originally 
suspected the 3c59x driver included with RH 7.3, so I built the driver from 
scyld.com, as well as compared the driver that was included with RH 7.2, and 
got the same results.  I then reasoned that my card (3c905B) might have gone 
bad and replaced it with an Intel EtherExpress Pro card.  So I've experienced 
this with both NICs.  Also, I've tried tuning the NFS options (I had 
previously used the default settings, which always worked find) by setting 
rsize and wsize to 8192, and increasing the timeo value.  The increase of the 
timeo value (I've now got it set to 20, which is 2 seconds) has done the most 
to make my performance acceptable, but I can still trigger the problem by 
using NFS to write or read a large file.  Also, I am using my RH 7.3 client to 
access a Solaris 8 NFS server.  Today, I built 2.4.19-pre8 from kernel.org, 
and I'm still seeing some retransmissions, but nowhere near as many as I was 
with the RedHat-provided 2.4.18-4 kernel.  I'm also going to build 2.4.18 
stock and see what results I get.  I'll post another update if 2.4.18 behaves 
any differently than 2.4.19-pre8, but for now, I'm thinking that something is 
amiss with 2.4.18-4.

Comment 16 Gilbert E. Detillieux 2002-05-17 16:53:31 UTC

Just wondering if this problem is related to that in bug # 64921,  NFS version 3
hangs?

Comment 17 phil 2002-05-17 23:43:02 UTC

I think bug # 64921 is related.

I tried the suggestion in that thread --
 setting localoptions='nfsvers=2' in /etc/init.d/autofs.
   
That appears to fix the problem for me.

Thanks, guys.

Phil

Comment 18 Seth Vidal 2002-05-18 01:48:06 UTC

I know this is pretty much a pain in the ass but it might be worth the time and
pain to patch in a good number of neil's and trond's server and client patches
for 2.4.18 - they are all linked from nfs.sourceforge.net. Neil and Trond can
both give a good set of recommendations on which patches to use and these
problems only appear to be reported on red hat's kernels not stock ones patched
with neil's patches.

Comment 19 Dan Morrill 2002-05-20 20:27:50 UTC

As a follow-up, after working with my network people some more, it looks like all
that stuff I spouted about collisions and stuff is nonsense, and not related to
this.  (i.e. I downgraded to RH 7.2 and checked and the collisions are still there
after all.)  Sorry, my bad.

I'll try the fixes mentioned throughout this thread -- esp. Phil's since he seems
to have a similar configuration to me.  Not sure when I'll get time to do that
though.

Thanks,

- Dan

Comment 20 Need Real Name 2002-05-22 00:48:44 UTC

This problem also affects me on a Dell Inspiron 7500 with a 3Com 3CCFE575CT
PC card when mounting filesystems from Solaris 8 and Irix 6.5 servers.
My NFS read performance is OK, but NFS writes are absolutely horrible --
a 56Mb file which I could write in about 13 seconds under RedHat 7.2 would
now take well over an hour if I had the patience to let it complete.
Other network writes are OK, e.g., I can write just fine to a remote system
via ftp.

After reading some of the preceding comments on this topic I was able to devise
two independent workarounds:

(1) Change NFS_MAX_FILE_IO_BUFFER_SIZE in /usr/src/linux/include/linux/nfs_fs.h
from 32768 to 8192. Then recompile a new kernel.

(2) Edit the localoptions line in /etc/rc.d/init.d/autofs to read
	localoptions='rsize=8192,wsize=8192'
This does not require a kernel rebuild but its effectiveness is limited
to filesystems mounted via the automounter. Filesystems which are
mounted via 'mount' must have '-o rsize=8192,wsize=8192' specified as
arguments to the mount command.

Either of these two solutions will work independently for me, and I am
currently using (2) alone as it's easier to maintain. I should note that
we do have other 7.3 systems at our site which do *not* exhibit this problem.

Comment 21 phil 2002-05-22 01:12:55 UTC

I tried that suggestion (2)

(2) Edit the localoptions line in /etc/rc.d/init.d/autofs to read
         localoptions='rsize=8192,wsize=8192'

That works for me also.

Thanks.

Phil

Comment 22 Ben LaHaise 2002-05-22 01:40:41 UTC

A fix is being made to the kernel.  The default [rw]size will be set to 8KB, but
can still be configured to 32KB by the user.  We're also adding several nfs patches.

Comment 23 Need Real Name 2002-05-22 02:09:10 UTC

On my system, changing the default size (NFS_DEF_FILE_IO_BUFFER_SIZE in
nfs_fs.h, currently set at 4096) did absolutely nothing. I had to change
the *maximum* size (NFS_MAX_FILE_IO_BUFFER_SIZE) to fix the problem,
diminishing it from 32768 to 8192.

My incredibly uninformed guess is that although my system has a default size
of 4096, it somehow negotiates the larger 32768 transfer size with the
NFS server regardless of its 4096 default, and then the whole thing hangs.
I had to limit the maximum size to inhibit this behavior (or explicitly
specify 8192 at mount time as the transfer size).

Comment 24 borgia 2002-05-28 06:40:39 UTC

My "me,too" is different hw (ne2k-pci) and custom kernel (2.4.18) in both 
cases: 
-Server 7.3 and client 7.2: speed is ok 
-Both 7.3: starting and quitting pine with 5 (!) small mails takes forever. 
I will do a recheck to see what changed during the upgrade, but I guess it's 
clear the hw is not the culprit and maybe even RH's own kernels (I was not 
using them before upgrade and I'm not now, still the problem popped up after 
the upgrade).

Comment 25 borgia 2002-05-28 20:18:10 UTC

Please excuse my previous unreliable bugreport and comment: I found
out it was due to traffic-shaping the wrong interface (of all things!).
Never do critical changes late at night, right after an upgrade.
Never do...
Never do...

Comment 26 Ben LaHaise 2002-05-28 21:46:41 UTC


*** This bug has been marked as a duplicate of 64921 ***

Comment 27 Bob Swift 2004-07-27 03:17:51 UTC

In the for what it's worth dept.  We have been running 7.2 for some 
time.  A recent thunderstorm took out a box.  I replaced it with a 
higher speed box.  Put the hard drive in the new box and kudzu did 
its thing.  It always took about 25 seconds to print a ps page.  Now 
its 1.5 minutes.  Hmmm, lpr and lpd with debug show fast execution.  
Its the net(local) somehow. Local print from that box with the same 
info, 10 seconds!  What tripped my trigger, when the power wiped out 
the linux box, started printing to a winders box via samba, wow, much 
faster.  I figured it must be in the permission checking and such, 
but 1.5 minutes, hmmmm.  Have beat that to death, ready to give up.  
Created a new lpd.conf or is that perm,whatever, default allow.  
Still 1.5 minutes.  I tried the print to port from other boxes, 
without much luck, filter errors, and such, prolly just throwing too 
much at the problem at once to see the problem, obvious stupidity.
Will load a fresh copy of Suse 9.1 on it and see what happens.

Note You need to log in before you can comment on or make changes to this bug.